Uploaded image for project: 'Solution Center'
  1. Solution Center
  2. SOL-241

EXAPowerlytics tutorial featuring rpart, stratified sampling, and Redis


    Details

    • Type: How To
    • Status: Obsolete
    • Affects Version/s: EXASolution 5.0.6
    • Fix Version/s: None
    • Component/s: EXAPowerlytics
    • Labels:
      None
    • Solution:

      Note: Most of this solution is still valid for version 6 and later, except that from version 6 on, we use BucketFS to store analytical models, as described in SOL-257.

      Introduction and task description

      EXAPowerlytics makes it possible to call arbitrary functionality in programming languages like R, Python, Java, and Lua. This allows you to create complex models right inside the database using techniques from the fields of Data Mining and Machine Learning. The models can then be published via tools like Redis and used from different places inside an organization. However, scripts in those languages cannot handle arbitrarily large data sets. There are various approaches to solving this problem. A particularly straightforward one for Data Mining and Machine Learning applications is to avoid working with large datasets inside user-defined functions altogether, by using sampling.

      In these fields, many algorithms are based on statistics of the data set. They therefore often do not need all the data; a good sample is sufficient, provided it exhibits the same statistical properties as the original data.

      One popular kind of sampling is stratified sampling. Here, the data set is divided into a number of subpopulations (strata), and each subpopulation is sampled independently. The goals are:

      1. to guarantee that each subpopulation is represented in the result, and
      2. to keep the proportion of each subpopulation relative to the total number of elements the same in the sample as in the original dataset.

      A typical area of application of stratified sampling in Machine Learning is classification. Here, a certain attribute is considered the class attribute. For sampling, we want to make sure that each possible value of the class attribute is also present in the sample and that the prior probability of the class attribute values is the same in the sample as in the original data set. These properties are easy to achieve using stratified sampling where the subpopulations are defined by the different values of the class attribute.
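      The idea can be sketched outside the database as well. The following Python snippet is an illustration only (the tutorial itself implements the sampler as a Lua UDF further below, and all names here are hypothetical): it partitions rows by their class attribute and samples each stratum proportionally, keeping at least one representative per class.

```python
import random
from collections import defaultdict

def stratified_sample(rows, class_of, fraction, seed=0):
    """Sample each class (stratum) independently, so that every class is
    represented and class proportions are preserved."""
    random.seed(seed)
    strata = defaultdict(list)
    for row in rows:                        # partition the rows into strata
        strata[class_of(row)].append(row)
    sample = []
    for members in strata.values():         # sample each stratum on its own
        k = max(1, round(len(members) * fraction))
        sample.extend(random.sample(members, k))
    return sample

# 90 rows of class 0 and 10 rows of class 1, sampled down to 10%:
data = [(i, 0) for i in range(90)] + [(i, 1) for i in range(10)]
print(len(stratified_sample(data, class_of=lambda r: r[1], fraction=0.1)))  # 10 rows: 9 of class 0, 1 of class 1
```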

      In this solution, we consider the following publicly available dataset:

      http://archive.ics.uci.edu/ml/machine-learning-databases/00229/Skin_NonSkin.txt

      which you should download from the specified location and save to a convenient place.

      If you want to know more about the dataset, you can find a description here: https://archive.ics.uci.edu/ml/datasets/Skin+Segmentation

      Our goals are:

      1. Create a model of the data using R and the rpart library right inside EXASolution using stratified sampling
      2. Publish the model via Redis
      3. Apply the published model inside of User Defined Functions in EXASolution

      Note: If you are using EXASOL 6.0 or above, you can check SOL-257 to see how to store analytical models with BucketFS.

      Data preparation

      First we create a schema and import the dataset.

      create schema rpart_sampling_tutorial;
      create or replace table skin_seg_staging_1(x1 int, x2 int, x3 int, y int);
      import into skin_seg_staging_1 from local csv file '[path where you put the file]/Skin_NonSkin.txt' column separator = 'TAB' row separator = 'CRLF';
      

      The original dataset is too small to actually cause the problem we want to illustrate. Therefore, we scale the dataset by a factor of 10.

      create or replace table skin_seg_staging_2(nr int identity, x1 int, x2 int, x3 int, y int);
      insert into skin_seg_staging_2 (x1,x2,x3,y) (select a.x1,a.x2,a.x3,a.y from skin_seg_staging_1 a,(values 0,1,2,3,4,5,6,7,8,9)  order by random());
      

      Now we derive a dataset for training

      create or replace table skin_seg_training as select * from skin_seg_staging_2 where nr < 0.8*(select count(*) from skin_seg_staging_2);
      

      and for testing

      create or replace table skin_seg_test as select * from skin_seg_staging_2 where nr >= 0.8*(select count(*) from skin_seg_staging_2);
      

      Using stratified sampling in order to use rpart with large datasets

      This script creates a classifier in R using the rpart library and Redis in order to store the model in a central location.
      Please note that the script needs to load the complete dataset, which is bound to cause problems if the dataset is too large.

      create or replace R set script create_classifier(name varchar(2000), x1 int, x2 int, x3 int, y int)
      returns int
      as
      library(rpart)
      library(rredis)
      redisConnect(host="[some host where you can start redis and that is reachable from EXASolution]", port=6379)

      run <- function(ctx)
      { # fetch all records of this group at once
          ctx$next_row(NA)
          df <- data.frame(x1=ctx$x1, x2=ctx$x2, x3=ctx$x3, y=factor(ctx$y))
          fit <- rpart(y ~ ., data = df)
          # prune at the complexity parameter with minimal cross-validation error
          pfit <- prune(fit, fit$cptable[which.min(fit$cptable[,"xerror"]),"CP"])
          redisSet(ctx$name[[1]], pfit)
          return(nrow(df))
      }
      /
      

      Let's create a tree with the name "my_classifier":

      select create_classifier('my_classifier',x1,x2,x3,y) x from skin_seg_training;
      

      The dataset used for training is too large for our poor R UDF, which results in "[22002] VM error: ..."

      As already mentioned in the introduction, stratified sampling may be worth a shot in such a situation.
      Actually, it is straightforward to implement stratified sampling as a UDF in EXASolution:

      create or replace LUA set script my_stratified_sampler(num_samples number, num_total number, x1 number, x2 number, x3 number, y number)
      emits (x1 number, x2 number, x3 number, y number) as
      function run(ctx)
          local positions = {}
          local stratum_size = ctx.size()
          local num_samples_in_stratum = math.max(math.floor(0.5+(stratum_size / ctx.num_total) * ctx.num_samples),1)
          for i=1,num_samples_in_stratum do
              positions[i] = math.random(stratum_size)
          end
          table.sort(positions)
          local c = 1
          local i = 1
          repeat
              while positions[c] == i do
                  ctx.emit(ctx.x1, ctx.x2, ctx.x3, ctx.y)
                  c = c+1
              end
              i = i + 1
          until not ctx.next()
      end
      /
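      The per-stratum sample size computed by the Lua script above can be checked in isolation. Here is a Python mirror of that formula (the helper name is illustrative): each stratum contributes its proportional share of the requested sample, rounded to the nearest integer, but never less than one row.

```python
import math

def samples_in_stratum(stratum_size, num_total, num_samples):
    # Same rounding as the Lua UDF: proportional share of the requested
    # sample, rounded to the nearest integer, with a floor of one row
    # so that every stratum stays represented.
    return max(math.floor(0.5 + (stratum_size / num_total) * num_samples), 1)

# A 10 000-row dataset sampled down to 100 rows:
print(samples_in_stratum(9700, 10000, 100))  # 97: a large stratum keeps its share
print(samples_in_stratum(30, 10000, 100))    # 1: a tiny stratum is still represented
```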
      

      Now we create a classifier by sampling the large training set down to a thousandth of its original size:

      select create_classifier('my_classifier',x1,x2,x3,y)
      from (select my_stratified_sampler((select count(*) from skin_seg_training)/1000,
                                         (select count(*) from skin_seg_training),
                                         x1,
                                         x2,
                                         x3,
                                         y)
            from skin_seg_training
            group by y);
      

      After executing this query, the R object which represents the model is serialized into the Redis key-value store under the key "my_classifier". From there, it can be loaded into any R session. In particular, we can also access it from R UDFs in EXASolution.
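      The publish/load pattern itself is independent of R: an object is serialized to bytes, stored under a key, and deserialized on retrieval. A minimal Python sketch of the same idea, using pickle as the serializer and an in-memory stand-in for the Redis client (all names are illustrative; with a real server you would pass a redis-py connection object instead):

```python
import pickle

def publish_model(store, key, model):
    # Serialize the model to bytes and store it under the key,
    # analogous to what redisSet does with an R object.
    store.set(key, pickle.dumps(model))

def load_model(store, key):
    # Fetch the bytes and deserialize, analogous to redisGet.
    return pickle.loads(store.get(key))

class DictStore:
    """In-memory stand-in exposing the same set/get interface
    as a Redis client, for a local dry run without a server."""
    def __init__(self):
        self.d = {}
    def set(self, k, v):
        self.d[k] = v
    def get(self, k):
        return self.d.get(k)
```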
      The following script loads the classifier from Redis and uses it to classify data:

      create or replace R set script apply_classifier(name varchar(20000), x1 int, x2 int, x3 int, y int)
      emits(x1 int, x2 int, x3 int, y int, c int)
      as
      library(rpart)
      library(rredis)
      redisConnect(host="[the same redis host as above]", port=6379)

      run <- function(ctx)
      { # next_row(NA) fetches all records of this group at once,
        # so the loop body effectively runs once per group
          repeat {
             if (!ctx$next_row(NA)) break
             fit <- redisGet(ctx$name[[1]])
             df <- data.frame(x1=ctx$x1, x2=ctx$x2, x3=ctx$x3)  # y is not needed for prediction
             c <- predict(fit, df, type='class')
             ctx$emit(ctx$x1, ctx$x2, ctx$x3, ctx$y, c)
          }
      }
      /
      

      Let's see how well our tree performs at classifying the test data:

      select sum(correct)/count(*)
      from (select case when y=c then 1 else 0 end as correct
            from (select apply_classifier('my_classifier',x1,x2,x3,y)
                  from skin_seg_test
                  group by floor(random()*4*nproc())));
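      The outer query simply computes the fraction of correctly classified rows. For reference, the same metric expressed in Python (the helper name is illustrative):

```python
def accuracy(pairs):
    """Fraction of (actual, predicted) pairs that agree; the Python
    equivalent of sum(case when y=c then 1 else 0 end)/count(*)."""
    return sum(1 for y, c in pairs if y == c) / len(pairs)

print(accuracy([(1, 1), (0, 0), (1, 0), (0, 0)]))  # 0.75
```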
      

      In our experiments, we achieved 98% correctness.

      Hint: See also SOL-239 for an explanation of how the GROUP BY clause in the statement above guarantees parallel execution of the classification function.

      Recap

      • For many statistical data mining and machine learning algorithms, sheer data volume is not important.
      • Sampling is therefore a great option when applying standard implementations of such algorithms in a big data environment.
      • With EXAPowerlytics,
        • sampling and model training can be performed right inside the database,
        • the generated models can then be applied (in parallel!) to data in the database,
        • or in third-party tools like RStudio.
    • Category 1:
      UDFs and In-Database Analytics

          People

          • Assignee:
            CaptainEXA Captain EXASOL
            Reporter:
            CaptainEXA Captain EXASOL