[SOL-594] Amazon S3 to EXASOL: Load many files from bucket into table in parallel Created: 06.03.2018  Updated: 12.08.2020  Resolved: 12.08.2020

Status: Obsolete
Project: Solution Center
Component/s: None
Affects Version/s: EXASOL 6.0.0, Exasol 6.1.0
Fix Version/s: None

Type: How To
Reporter: Captain EXASOL Assignee: Captain EXASOL
Labels: None

Issue Links:
is related to EXASOL-1774 Import/Export from/to Amazon S3™ buckets Resolved

OBSOLETED BY https://docs.exasol.com/loading_data/loading_data_from_amazon_s3_in_parallel.htm


In this solution we load files from S3 into Exasol in parallel using a Python script. Before you can run it, some preparation is needed: the script uses Boto, a Python package that can list the contents of S3 buckets. To make Boto accessible to the script, it has to be present on every node in the cluster, so we distribute it via BucketFS, Exasol's replicated file system.

We will

  1. Create a bucket
  2. Upload the boto file to the cluster
  3. Execute the script

Remark: if you intend to load only a single file from S3, see EXASOL-1774


  • You need Python as "Extra Database Feature" in your license
  • You need an AWS account with a valid AWS-AccessKeyId and a corresponding AWS-SecretAccessKey, or you need to work on an instance that does not require access keys (for example, an EC2 instance with an appropriate IAM role)
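If you are not sure whether the Python feature is part of your license, a quick way to check is to create and call a trivial Python script; if the feature is missing, the CREATE or the SELECT below fails with a corresponding error. This is only a sketch: the schema DATABASE_MIGRATION is used here because the import script later lives in it, and the script name PYTHON_CHECK is a placeholder.

CREATE SCHEMA IF NOT EXISTS DATABASE_MIGRATION;

CREATE OR REPLACE PYTHON SCALAR SCRIPT DATABASE_MIGRATION.PYTHON_CHECK(dummy VARCHAR(1))
RETURNS VARCHAR(100) AS
def run(ctx):
    # if Python UDFs are not licensed, this script cannot be created or executed
    return 'Python UDFs are available'
/

SELECT DATABASE_MIGRATION.PYTHON_CHECK('x') FROM DUAL;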

How to load files from Amazon S3 into Exasol in parallel?

Steps 1 and 2 are only needed for versions before 6.0.10. Starting with version 6.0.10, the boto library is already included, so if you use version 6.0.10 or later, go directly to step 3.

Step 1: Create a bucket

Open EXAoperation

  • Go to EXABuckets.
  • Create a new BucketFS service by clicking Add; in the form that opens, click Add once more to create the service
  • Go to the newly created BucketFS service by clicking on its name (probably bucketfs1)
  • Add a bucket by clicking Add. Define a name, set the checkmark for "Public readable" and define read and write passwords (note them down; we will need them later). Create the bucket by clicking Add

Step 2: Upload the Boto file to the cluster

You can access BucketFS via curl or via the BucketFS-Explorer. BucketFS-Explorer is a Java-based program you can run on any operating system.

Make sure the BucketFS-port 2582 is accessible.

Via BucketFS-Explorer: connect to the BucketFS service with the write password and upload the boto .whl file into the bucket you created in step 1.

Via curl:

 Adapt the following command and execute it on the command line

curl -i -X PUT -T 'C:\Users...\boto-....whl' http://w:my_write_password@localhost:2582/python/
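To verify that the file really arrived on the data nodes, you can list the bucket's local mount point from a small Python UDF. The sketch below assumes the BucketFS service is called bucketfs1 and the bucket is called python (adapt the path to your own names); the schema and the script name LS_BUCKET are placeholders.

CREATE SCHEMA IF NOT EXISTS DATABASE_MIGRATION;

CREATE OR REPLACE PYTHON SCALAR SCRIPT DATABASE_MIGRATION.LS_BUCKET(bucket_path VARCHAR(2000))
RETURNS VARCHAR(20000) AS
import os
def run(ctx):
    # buckets are mounted on every node under /buckets/<service>/<bucket>/
    return ', '.join(os.listdir(ctx.bucket_path))
/

SELECT DATABASE_MIGRATION.LS_BUCKET('/buckets/bucketfs1/python/') FROM DUAL;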

Step 3: Execute the script

Open EXAplus and connect to your database.

  • Create a connection to S3 by modifying the connection name, URL and access keys in the following statement and executing it:
    CREATE OR REPLACE CONNECTION S3_MY_BUCKETNAME
    TO 'http://my_bucketname.s3.my_region.amazonaws.com' -- my_region could e.g. be eu-west-1
    USER 'my_access_key'
    IDENTIFIED BY 'my_secret_key';
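If you want to test the connection before setting up the parallel import, a single-file import (the approach described in EXASOL-1774) is a quick check. In the following sketch the target table PRODUCT.test and the folder my_project/ are taken from the example parameters further down, while the file name my_file.csv is purely hypothetical; adjust everything to your own data.

IMPORT INTO PRODUCT.test
FROM CSV AT S3_MY_BUCKETNAME
FILE 'my_project/my_file.csv'   -- single-file sanity check; file name is a placeholder
ENCODING = 'ASCII' SKIP = 1 ROW SEPARATOR = 'CRLF';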

The SQL file used by this solution contains, among others, the following scripts:

  • S3_GET_FILENAMES, a Python script
  • GET_CONNECTION_NAME, a Lua scalar script

  • Modify the line at the top of the script that references your bucket name (it makes the boto package uploaded to BucketFS available to Python) if you named your bucket differently, or delete that line completely if you are using 6.0.10 or later
  • Modify the statement at the bottom of the file:
execute script DATABASE_MIGRATION.s3_parallel_read(
true						-- if true, statements are executed immediately, if false only statements are generated
, true						-- force reload: if true, table and logging table will be truncated, all files in bucket will be loaded again
, 'S3_IMPORT_LOGGING'		-- schema you want to use for the logging tables
,'PRODUCT'					-- name of the schema that holds the table you want to import into
,'test' 					-- name of the table you want to import into
,'S3_MY_BUCKETNAME'			-- connection name ( see statement above)
,'my_project/' 				-- folder name, if you want to import everything, leave blank
, ''					    -- filter for file-names, to include all files, put empty string, for using only files that include the word banana, put: 'banana'
,2 							-- number of parallel connections you want to use
,'ENCODING=''ASCII'' SKIP=1  ROW SEPARATOR = ''CRLF''' -- file options, see manual, section 'import' for further information
);
  • Execute all the statements in the file. The first three of them generate the scripts; the last one at the bottom starts the import. The next time you want to run the import, you only need to execute the "execute script ..." statement at the bottom
  • The script generates a table in the schema S3_IMPORT_LOGGING. This table has three columns: filename, last_modified and status. It is used to keep track of the import status: only files that have been modified or added since the last time the script was executed will be imported. If you want to re-import files that have already been imported, set the parameter 'force reload' to true
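For subsequent runs you therefore only re-execute the statement from the bottom of the file. As a sketch, an incremental run with force reload set to false, so that only new or modified files are loaded (all other parameters as in the example above):

execute script DATABASE_MIGRATION.s3_parallel_read(
 true                   -- execute statements immediately
,false                  -- force reload off: only new or modified files are imported
,'S3_IMPORT_LOGGING'    -- schema for the logging tables
,'PRODUCT'              -- schema of the target table
,'test'                 -- target table
,'S3_MY_BUCKETNAME'     -- connection name
,'my_project/'          -- folder name
,''                     -- file-name filter (empty string = all files)
,2                      -- number of parallel connections
,'ENCODING=''ASCII'' SKIP=1 ROW SEPARATOR = ''CRLF''' -- file options
);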
Category 1: ETL / Data Integration
Category 2: Scripting
