OBSOLETED BY https://docs.exasol.com/loading_data/loading_data_from_amazon_s3_in_parallel.htm
In this solution we will load files from S3 into Exasol in parallel using a Python script. Before you can run the script, some preparation is needed: the script needs Boto, a Python package that can list the contents of folders in S3. To make Boto accessible to the script, it has to be present on every node in the cluster, which is why we place it in BucketFS, Exasol's replicated file system. The overall procedure is:
- Create a bucket
- Upload the boto file to the cluster
- Execute the script
If you intend to load only a single file from S3, this parallel approach is not necessary; a plain IMPORT statement over the S3 connection is sufficient (a minimal example is shown below in step 3)
- You need Python as "Extra Database Feature" in your license
- You need an AWS account with a valid AWS AccessKeyId and a corresponding AWS SecretAccessKey, or you need to work on an instance that does not require access keys
How to Load files from Amazon S3 into Exasol in parallel?
Steps 1 and 2 are only needed for versions before 6.0.10. Starting with version 6.0.10, the boto library is already included, so if you use version 6.0.10 or later, go directly to step 3.
Step 1: Create a bucket
- Go to EXABuckets.
- Create a new BucketFS Service by clicking Add. Click Add once more to confirm and create the service.
- Go to the newly created BucketFS Service by clicking on its name (probably bucketfs1).
- Add a bucket to the service by clicking Add. Define a name, set the checkmark for public readable and define read and write passwords (note them down somewhere, we need them later on). Create the bucket by clicking Add.
Step 2: Upload the Boto file to the cluster
You can access BucketFS via curl or via the BucketFS-Explorer. BucketFS-Explorer is a Java-based program you can run on any operating system.
Make sure the BucketFS port 2582 is accessible.
Adapt the following command and execute it on the command line. The target URL typically has the form http://w:my_write_password@my_exasol_server:2582/my_bucket/my_file, where w is the BucketFS write user and my_write_password is the write password you defined in step 1:
curl -i -X PUT -T 'C:\Users...\boto-....whl' http://w:my_write_password@my_exasol_server:2582/my_bucket/boto-....whl
Step 3: Execute the script
Open EXAplus and connect to your database.
- Create a connection to S3 by modifying the connection name, URL and access keys in the following statement and executing it:
CREATE CONNECTION S3_MY_BUCKETNAME
TO 'http://my_bucketname.s3.my_region.amazonaws.com' USER 'my_access_key'
IDENTIFIED BY 'my_secret_key';
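If you only want to load a single file (see the note above), this connection can already be used in a plain IMPORT statement, without any of the scripts below. This is only a minimal sketch: the target table PRODUCT.test matches the example further down, the file name is made up, and the CSV options have to match your files:
IMPORT INTO PRODUCT.test
FROM CSV AT S3_MY_BUCKETNAME
FILE 'my_project/my_file.csv'
ENCODING = 'ASCII' SKIP = 1 ROW SEPARATOR = 'CRLF';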
The SQL file you execute in this step contains the following scripts:
- S3_GET_FILENAMES, a Python script that uses Boto to list the files in the bucket
- GET_CONNECTION_NAME, a Lua scalar script
- S3_PARALLEL_READ, the main script that is invoked by the statement at the bottom of the file
- Modify the line at the top of the script that adds your bucket to the Python search path if you named your BucketFS service or bucket differently (the BucketFS path typically looks like /buckets/<bucketfs service>/<bucket name>/); delete the line completely if you are using 6.0.10 or later
- Modify the statement at the bottom of the file:
execute script DATABASE_MIGRATION.s3_parallel_read(
true, true, 'S3_IMPORT_LOGGING', 'PRODUCT', 'test', 'S3_MY_BUCKETNAME', 'my_project/', '',
2, 'ENCODING=''ASCII'' SKIP=1 ROW SEPARATOR = ''CRLF''');
Take a look at the comments inside the script file for further information on the individual parameters.
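For orientation: judging from the parameter values, 'PRODUCT' and 'test' are the target schema and table, 'my_project/' is the folder inside the bucket, 2 is the number of parallel imports and the last string contains the import options. Under the hood the script presumably generates ordinary IMPORT statements over the S3 connection with several files per statement, which is what lets Exasol load them in parallel. A purely illustrative sketch with made-up file names:
IMPORT INTO PRODUCT.test
FROM CSV AT S3_MY_BUCKETNAME
FILE 'my_project/file_01.csv'
FILE 'my_project/file_02.csv'
ENCODING = 'ASCII' SKIP = 1 ROW SEPARATOR = 'CRLF';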
- Execute all the statements in the file. The first three of them generate the scripts, the last one at the bottom starts the import. The next time you want to run the import, you only need to execute the "execute script ..." statement at the bottom.
- The script generates a table in the schema S3_IMPORT_LOGGING. This table has three columns: filename, last_modified and status. It is used to keep track of the import status: only files that have been modified or added since the last time the script was executed will be imported. If you want to re-import files that have already been imported, set the parameter 'force reload' to true.
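If you want to see which files have been imported and with what result, you can query the logging table directly. A small sketch; the table name JOB_LOG is only a placeholder, look up the actual name the script created in the S3_IMPORT_LOGGING schema:
-- filename, last_modified and status are the columns described above
SELECT filename, last_modified, status
FROM S3_IMPORT_LOGGING.JOB_LOG
ORDER BY last_modified DESC;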