  Solution Center / SOL-322

How to IMPORT/EXPORT CSV files from/to HDFS

    Details

    • Type: How To
    • Status: Obsolete
    • Affects Version/s: EXASolution 5.0
    • Fix Version/s: None
    • Component/s: EXASolution
    • Labels:
      None
    • Solution:

      Introduction

      With the EXASOL bulk loader (EXAloader) it is possible to import data from or export data to a CSV file. WebHDFS provides a straightforward way to access CSV files within HDFS. This solution was tested with the Hortonworks Sandbox 2.3.

      Information

      The initial examples assume that Hadoop does not require Kerberos authentication. There is a separate section below for Kerberos.

      Requirements

      • An EXASOL database
      • A Hadoop cluster with the WebHDFS service started (this solution has been tested with the Hortonworks Sandbox 2.3, where the service is started by default)
      • All EXASOL database nodes require access to the HDFS namenode and to all HDFS datanodes (the WebHDFS ports must be open)

      Prerequisites

      All EXASOL nodes need access to the WebHDFS service on the namenode and on all datanodes (a quick connectivity check is sketched after the port lists below). The WebHDFS service typically runs on the following ports:

      • 50070 on each namenode (dfs.namenode.http-address property)
      • 50075 on each data node (dfs.datanode.http-address property).

      If you use HTTPS, the ports are:

      • 50470 on each namenode (dfs.namenode.https-address)
      • 50475 on each datanode (dfs.datanode.https-address).
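
      To verify before the first IMPORT or EXPORT that WebHDFS is reachable from the EXASOL nodes, a plain HTTP call is enough. The following is only a sketch with placeholder host and user name (not taken from this solution); it lists the HDFS root directory via the LISTSTATUS operation:

      # run from an EXASOL node or any host in the same network segment
      # <webhdfs_server> and the user name are placeholders
      curl -s 'http://<webhdfs_server>:50070/webhdfs/v1/?op=LISTSTATUS&user.name=root'

      A JSON directory listing in the response indicates that the WebHDFS port is open and the service is responding.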

      Simple IMPORT

      The following example uses the "sample_07" table provided with the Hortonworks Sandbox.
      The IMPORT below uses an error table. Refer to the EXASOL manual for further details on error tables.
      Internally, EXASOL connects to the Hadoop namenode, which redirects the HTTP request to the datanode where the file is actually located (a curl sketch of this redirect follows the statement below).

      create or replace table sample_07(code varchar(2000), description varchar(2000), total_emp int, salary int);
      
      import into sample_07 from csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
      FILE 'sample_07?op=OPEN&user.name=root'
      column separator = 'TAB'
      row separator = 'LF'
      errors into err_tab(current_timestamp)
      reject limit unlimited;
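
      To see the redirect that EXAloader follows internally, the same OPEN request can be issued manually. This is only an illustrative curl sketch reusing the path from the statement above (host and port are placeholders): the namenode answers with an HTTP 307 response whose Location header points to the datanode that actually serves the file.

      # -i prints the response headers; expect "307 TEMPORARY_REDIRECT" and a
      # Location header of the form http://<datanode>:50075/webhdfs/v1/...
      curl -i 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07/sample_07?op=OPEN&user.name=root'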
      

      Simple EXPORT

      The following example shows how to export a table into a CSV file in HDFS. The small helper table "ten" used here is assumed to exist already. It is also possible to export a view or the result of a query; see the sketch after the statement below.

      insert into ten values 1,2,3,4,5,6,7,8,9,0;
      
      export ten into csv at 
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
      file 'ten.csv?op=CREATE&user.name=root';
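
      As mentioned above, the source of an EXPORT does not have to be a table; a query can be exported directly by enclosing it in parentheses. The following sketch reuses the sample_07 table from the IMPORT example; the selected columns, the filter, and the target file name are illustrative assumptions only:

      export (select code, description from sample_07 where total_emp > 1000)
      into csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
      file 'sample_07_subset.csv?op=CREATE&user.name=root';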
      

      Parallel IMPORT/EXPORT

      It is possible to specify multiple files within the same IMPORT or EXPORT statement. If multiple files are specified, the IMPORT or EXPORT jobs are distributed evenly among the EXASOL database nodes. For each file, the responsible database node sends an HTTP request to the Hadoop namenode and is redirected to the Hadoop datanode where the file is located. Thus, the load on both clusters is distributed.

      Example of a parallel IMPORT (a parallel EXPORT is sketched after the statement):

      import into sample_07 from csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
      FILE 'sample_07?op=OPEN&user.name=root'
      FILE 'sample_08?op=OPEN&user.name=root'
      column separator = 'TAB'
      row separator = 'LF'
      errors into err_tab(current_timestamp)
      reject limit unlimited;
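
      A parallel EXPORT is written the same way: if several FILE clauses are specified, the rows are split across the files and the work is distributed over the EXASOL nodes. The statement below is a sketch only; the target file names are illustrative:

      export sample_07 into csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
      file 'sample_07_part1.csv?op=CREATE&user.name=root'
      file 'sample_07_part2.csv?op=CREATE&user.name=root';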
      

      Compression

      EXAloader is able to read and write compressed CSV files. The supported compression formats are:

      1. zip
      2. gz
      3. bz2

      Compression is applied automatically if one of the supported file endings is specified. Compression and decompression take place within the EXASOL cluster, so all communication to and from the Hadoop cluster is compressed.

      import into sample_07 from csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
      FILE 'sample_07.zip?op=OPEN&user.name=root'
      column separator = 'TAB'
      row separator = 'LF'
      errors into err_tab(current_timestamp)
      reject limit unlimited;
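
      The same rule applies to EXPORT: giving the target file one of the supported endings makes EXAloader compress the data before it leaves the EXASOL cluster. A sketch reusing the "ten" table from the EXPORT example above:

      export ten into csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
      file 'ten.csv.gz?op=CREATE&user.name=root';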
      

      Kerberos Imports and Exports

      It is possible to import from and export to Kerberos-secured Hadoop clusters by using delegation tokens. First, retrieve a delegation token as follows:

      kinit {principal}
      curl -s --negotiate -u : http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/?op=GETDELEGATIONTOKEN | grep -Po 'urlString":"\K[^"]*'
      

      The delegation token should look like this:

      LgAMaGFkb29wdGVzdGVyDGhhZG9vcHRlc3RlcgCKAVoc9RV9igFaQQGZfY_9j7QUeh9XRVB-ODyU6YW4BX7yiX3fZ_wSV0VCSERGUyBkZWxlZ2F0aW9uETEwLjQ4LjEwNS4xMzo4MDIW
      

      You can now run the IMPORT or EXPORT statements as described above, except that you have to specify the delegation parameter instead of the user.name parameter.

      Here is the IMPORT statement example for Kerberos:

      import into sample_07 from csv at
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/apps/hive/warehouse/sample_07'
      FILE 'sample_07?op=OPEN&delegation=<token>'
      column separator = 'TAB'
      row separator = 'LF'
      errors into err_tab(current_timestamp)
      reject limit unlimited;
      

      And the EXPORT statement example for Kerberos:

      export ten into csv at 
      'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/user/guest/showCase'
      file 'ten.csv?op=CREATE&delegation=<token>';
      

      Please remember that the token expires after a certain amount of time, depending on the Hadoop configuration, so it must be renewed regularly.
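
      WebHDFS offers matching operations for renewing and cancelling delegation tokens. The following curl calls are a hedged sketch: they must be authenticated via Kerberos just like the GETDELEGATIONTOKEN request above, and <token> stands for the delegation token string obtained earlier.

      # renew the token; the response contains the new expiration time in milliseconds
      curl -s -X PUT --negotiate -u : 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/?op=RENEWDELEGATIONTOKEN&token=<token>'

      # optionally cancel the token once all loads are finished
      curl -s -X PUT --negotiate -u : 'http://<webhdfs_server>:<webhdfs_port>/webhdfs/v1/?op=CANCELDELEGATIONTOKEN&token=<token>'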

    • Category 1:
      ETL / Data Integration

            People

            • Assignee:
              CaptainEXA Captain EXASOL
            • Reporter:
              CaptainEXA Captain EXASOL