Type: How To
Status: Obsolete
Affects Version/s: EXASolution 5.0
Fix Version/s: None
Component/s: EXASolution
Labels: None
Category 1: ETL / Data Integration

Solution:
With the EXASOL bulk loader (EXAloader) it is possible to import or export data from or to CSV files. WebHDFS provides a convenient way to access CSV files within HDFS. This solution was tested with the Hortonworks Sandbox 2.3.
The initial examples assume that Hadoop does not require Kerberos authentication. There is a separate section below for Kerberos.
All EXASOL nodes need access to the WebHDFS service on the namenode and on all datanodes. With the default Hadoop 2.x configuration, the WebHDFS service typically runs on the following HTTP ports:
- Namenode: 50070
- Datanodes: 50075
If you use HTTPS, the default ports are:
- Namenode: 50470
- Datanodes: 50475
The following example uses the "sample_07" table provided with the Hortonworks Sandbox.
The IMPORT below uses an error table. Refer to the EXASOL manual for further details on error tables.
Internally, EXASOL connects to the Hadoop namenode, which redirects the HTTP request to the datanode where the file is actually located.
create or replace table sample_07(code varchar(2000), description varchar(2000), total_emp int, salary int);

import into sample_07 from csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
  FILE 'sample_07?op=OPEN&user.name=root'
  column separator = 'TAB'
  row separator = 'LF'
  errors into err_tab(current_timestamp) reject limit unlimited;
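If rows are rejected during the load, they end up in the error table. A minimal sketch for inspecting them afterwards, assuming the error table err_tab exists after the IMPORT above (see the EXASOL manual for how error tables are created and which columns they contain):

-- inspect the rows that were rejected by the IMPORT above
select * from err_tab;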
The following example shows how to export a table into a CSV file in HDFS.
It is possible to specify a view or statement instead of a table, too; a sketch follows the example below.
-- the table "ten" is assumed to already exist, e.g. a single-column table such as ten(val int)
insert into ten values 1,2,3,4,5,6,7,8,9,0;

export ten into csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
  file 'ten.csv?op=CREATE&user.name=root';
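As mentioned above, a view or statement can be exported instead of a table. A minimal sketch under the same assumptions as the previous examples (the file name sample_07_top.csv and the selected columns are only illustrative):

export (select code, description, salary from sample_07 where salary > 50000)
  into csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
  file 'sample_07_top.csv?op=CREATE&user.name=root';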
It is possible to specify multiple files within the same IMPORT or EXPORT statement. If multiple files are specified, the IMPORT or EXPORT jobs are distributed evenly among the EXASOL database nodes. For each file, the responsible database node sends an HTTP request to the Hadoop namenode and is redirected to the Hadoop datanode where the file is located. Thus, the load on both clusters is distributed.
Example for parallel IMPORT:
import into sample_07 from csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
  FILE 'sample_07?op=OPEN&user.name=root'
  FILE 'sample_08?op=OPEN&user.name=root'
  column separator = 'TAB'
  row separator = 'LF'
  errors into err_tab(current_timestamp) reject limit unlimited;
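The same works for EXPORT. A hedged sketch that writes the table "ten" from above into two files in parallel (the file names ten_part1.csv and ten_part2.csv are only illustrative):

export ten into csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
  file 'ten_part1.csv?op=CREATE&user.name=root'
  file 'ten_part2.csv?op=CREATE&user.name=root';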
EXAloader is able to read and write compressed CSV files. The supported compression formats are:
- zip
- gz (gzip)
- bz2 (bzip2)
Compression is applied automatically if one of the supported file endings is specified for a file. The compression is done within the EXASOL cluster, so all communication to and from the Hadoop cluster is compressed.
import into sample_07 from csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
  FILE 'sample_07.zip?op=OPEN&user.name=root'
  column separator = 'TAB'
  row separator = 'LF'
  errors into err_tab(current_timestamp) reject limit unlimited;
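A compressed EXPORT looks the same; a minimal sketch that writes a gzip-compressed file for the table "ten" (assuming gz is among the supported endings listed above):

export ten into csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
  file 'ten.csv.gz?op=CREATE&user.name=root';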
It is possible to import from and export to Kerberos-secured Hadoop clusters by using delegation tokens. First, you have to retrieve a delegation token as follows:
kinit {principal}
curl -s --negotiate -u : http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/?op=GETDELEGATIONTOKEN | grep -Po 'urlString":"\K[^"]*'
The delegation token should look like this:
LgAMaGFkb29wdGVzdGVyDGhhZG9vcHRlc3RlcgCKAVoc9RV9igFaQQGZfY_9j7QUeh9XRVB-ODyU6YW4BX7yiX3fZ_wSV0VCSERGUyBkZWxlZ2F0aW9uETEwLjQ4LjEwNS4xMzo4MDIW
You can now run the IMPORT or EXPORT statements as described above, except that you have to specify the delegation parameter instead of the user.name parameter.
Here is the IMPORT statement example for Kerberos:
import into sample_07 from csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
  FILE 'sample_07?op=OPEN&delegation=<token>'
  column separator = 'TAB'
  row separator = 'LF'
  errors into err_tab(current_timestamp) reject limit unlimited;
And the EXPORT statement example for Kerberos:
export ten into csv at 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
  file 'ten.csv?op=CREATE&delegation=<token>';
Please remember that the delegation token expires after a certain amount of time, depending on the Hadoop configuration, so it must be refreshed regularly.