  Solution Center / SOL-594

Amazon S3 to EXASOL: Load many files from bucket into table in parallel


    Details

    • Type: How To
    • Status: Published
    • Affects Version/s: EXASOL 6.0.0, Exasol 6.1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Solution:

       

      Load files from Amazon S3 into Exasol in parallel

      Remark: if you intend to load only a single file from S3, see EXASOL-1774

      Summary

      In this solution we load files from S3 into Exasol in parallel using a Python script. Before you can run it, some preparation is needed: the script requires Boto, a Python package that can list the contents of S3 buckets. To make Boto accessible to the script, it has to be available on every node of the cluster, which is why we place it in BucketFS, Exasol's replicated file system.
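
      To illustrate what the script relies on, here is a minimal sketch of listing a bucket's contents with the classic boto 2.x API, reusing the placeholder names from this solution (my_bucketname, my_project/, my_access_key, my_secret_key); treat it as a sketch under those assumptions, not as part of the solution's scripts:

      import boto

      # Connect with explicit credentials; on an instance that does not need
      # access keys, boto can pick up credentials from the environment instead.
      conn = boto.connect_s3(aws_access_key_id='my_access_key',
                             aws_secret_access_key='my_secret_key')
      bucket = conn.get_bucket('my_bucketname')

      # List every key below a prefix ("folder") with its modification timestamp;
      # this is the per-file information the import script works with.
      for key in bucket.list(prefix='my_project/'):
          print('%s %s' % (key.name, key.last_modified))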

      We will

      1. Create a bucket
      2. Upload the boto file to the cluster
      3. Execute the script

      Prerequisites

      • You need Python as "Extra Database Feature" in your license
      • You need an AWS account with a valid AWS-AccessKeyId and the corresponding AWS-SecretAccessKey, or you need to work on an instance that does not require access keys

       

      Steps

      Steps 1 and 2 are only needed for versions before 6.0.10. Starting with version 6.0.10, the Boto library is already included, so if you use 6.0.10 or later, go directly to step 3.

      1. Create a bucket

      Open EXAoperation

      • Go to EXABuckets.
      • Create a new BucketFS service by clicking Add; in the form that opens, click Add once more to create the service
      • Go to the newly created BucketFS service by clicking on its name (probably bucketfs1)
      • Add a bucket by clicking Add. Define a name, set the checkmark for "public readable" and define read/write passwords (note them down, we need them later on). Create the bucket by clicking Add

      2. Upload the Boto file to the cluster

      You can access BucketFS via curl or via the BucketFS-Explorer. BucketFS-Explorer is a Java-based program you can run on any operating system.

      Make sure the BucketFS-port 2582 is accessible.
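
      If you are not sure whether that port is reachable from your machine, a small check like the one below is one way to test it (the host name my-exasol-node is a placeholder for one of your cluster nodes, and the port must match your BucketFS service configuration):

      import socket

      # Try to open a TCP connection to the BucketFS HTTP port on one data node.
      # A timeout or "connection refused" means the port is blocked or the
      # BucketFS service is not listening there.
      sock = socket.create_connection(('my-exasol-node', 2582), timeout=5)
      sock.close()
      print('BucketFS port 2582 is reachable')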

      Via BucketFS-Explorer:

      • Go to https://github.com/EXASOL/bucketfs-explorer#getting-started
      • Download and start the BucketFS-Explorer as described there
      • Download the file boto-[some-version].whl from https://pypi.python.org/pypi/boto#downloads
      • Upload it into the bucket you created in the first step via the BucketFS-Explorer

      Via curl:

      Adapt the following command and execute it on the command line:

      curl -i -X PUT -T 'C:\Users...\boto-....whl' http://w:my_write_password@localhost:2582/python/
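
      To verify that the upload worked, a quick check like the sketch below can be run in an Exasol Python UDF (or in any Python interpreter on a data node). It assumes the BucketFS service and bucket names used above (bucketfs1 and python); adjust the path if you chose different ones:

      import glob
      import sys

      # The bucket's content is visible on every node under /buckets/<service>/<bucket>/.
      wheels = glob.glob('/buckets/bucketfs1/python/*.whl')
      print(wheels)            # should list the uploaded boto wheel

      # Same mechanism the import script uses: put the wheel on the module path
      # and try to import boto from it.
      sys.path.extend(wheels)
      import boto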
      

      3. Execute the script

      Open EXAplus and connect to your database.

      • Create a connection to S3 by modifying the connection name, URL and access keys in the following statement and executing it:
      CREATE CONNECTION S3_MY_BUCKETNAME
          TO 'http://my_bucketname.s3.my_region.amazonaws.com' -- my_region could e.g. be eu-west-1
          USER 'my_access_key'
          IDENTIFIED BY 'my_secret_key';
      
      
      • Download the script https://github.com/EXASOL/database-migration/blob/master/s3_to_exasol.sql and open it in EXAplus. The file contains three scripts that you'll need for the import:
        • S3_GET_FILENAMES, a Python script
        • GET_CONNECTION_NAME, a Lua scalar script
        • S3_PARALLEL_READ
      • Modify the following line at the top of the script if you named your bucket differently or delete the line completely if using 6.0.10 or later:
      sys.path.extend(glob.glob('/buckets/bucketfs1/python/*'))
      
      • Modify the statement at the bottom of the file:
      execute script DATABASE_MIGRATION.s3_parallel_read(
      true						-- if true, statements are executed immediately, if false only statements are generated
      , true						-- force reload: if true, table and logging table will be truncated, all files in bucket will be loaded again
      , 'S3_IMPORT_LOGGING'		-- schema you want to use for the logging tables
      ,'PRODUCT'					-- name of the schema that holds the table you want to import into
      ,'test' 					-- name of the table you want to import into
      ,'S3_MY_BUCKETNAME'			-- connection name ( see statement above)
      ,'my_project/' 				-- folder name, if you want to import everything, leave blank
      , ''					    -- filter for file-names, to include all files, put empty string, for using only files that include the word banana, put: 'banana'
      ,2 							-- number of parallel connections you want to use
      ,'ENCODING=''ASCII'' SKIP=1  ROW SEPARATOR = ''CRLF''' -- file options, see manual, section 'import' for further information
      )
      ;
      • Execute all the statements in the file. The first three generate the scripts; the last one at the bottom starts the import. The next time you want to run the import, you only need to execute the "execute script..." statement at the bottom
      • The script generates a table in the schema S3_IMPORT_LOGGING. This table has three columns: filename, last_modified and status. It is used to keep track of the import status: only files that have been added or modified since the last run of the script will be imported. If you want to re-import files that have already been imported, set the parameter 'force reload' to true (the sketch below illustrates this bookkeeping together with the parallel split)
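
      The following simplified sketch (not the actual S3_PARALLEL_READ code, only an illustration under assumed data structures) shows the two ideas at work: skip files that are unchanged according to the logging table, then spread the remaining files round-robin over the parallel IMPORT connections:

      # Illustration only: incremental selection plus round-robin split.

      def files_to_load(s3_files, logged, force_reload=False):
          """Keep only files that are new or modified since the last run."""
          if force_reload:
              return list(s3_files)
          return [(name, ts) for name, ts in s3_files if logged.get(name) != ts]

      def split_for_parallel_import(files, parallelism):
          """Distribute the files round-robin over the parallel connections."""
          groups = [[] for _ in range(parallelism)]
          for i, item in enumerate(files):
              groups[i % parallelism].append(item)
          return groups

      s3_files = [('my_project/a.csv', '2019-01-01T10:00:00'),
                  ('my_project/b.csv', '2019-01-02T10:00:00')]
      logged = {'my_project/a.csv': '2019-01-01T10:00:00'}   # already imported, unchanged

      todo = files_to_load(s3_files, logged)                 # only b.csv is left
      for group in split_for_parallel_import(todo, 2):
          print(group)                                       # each group becomes one IMPORT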
    • Category 1:
      ETL / Data Integration
    • Category 2:
      Scripting

            People

            • Assignee:
              CaptainEXA Captain EXASOL
              Reporter:
              CaptainEXA Captain EXASOL