Uploaded image for project: 'Solution Center'
  1. Solution Center
  2. SOL-367

XML-parsing using Java UDFs

    XMLWordPrintable

    Details

    • Type: Explanation
    • Status: Published
    • Affects Version/s: EXASOL 6.0.0, Exasol 6.1.0
    • Fix Version/s: None
    • Component/s: EXAPowerlytics
    • Labels:
      None
    • Solution:
      Hide

      1. Overview

      This solution describes how to use the Java DOM Parser to parse a XML-Text stored in an EXASOL database. The solution will result in a Java-UDF that can be used to structure the content of a XML-Text that contains .
      Documentation on the DOM Parser can be found here:
      http://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm

      The general approach is:

      1. Implementation and tests (pure Java, Maven)
      2. Deployment (EXABucketFS)
      3. UDF implementation (EXASOL-SQL, Java)
      4. Usage (EXASOL-SQL)

      Important notes

      The solution uses maven-assemble-plugin to create a single jar including all dependencies. Otherwise it would be necessary to deploy all dependencies (including DOM Parser, ...) manually to the EXASOL cluster.

      2. Prerequisites

      This solution was created by using the following tools and features:

      1. EXASOL database > V6, advanced edition
      2. JDK
      3. Eclipse (Mars)
      4. Maven, including the maven-assembly-plugin (http://maven.apache.org/plugins/maven-assembly-plugin/)

      3. Java implementation and tests

      Implementation

      Within the attached xmlexample.zip file you will find 8 files:

      1. City.java (Represents a "row" in a CITIES table. Members: ID and Name)
      2. DomParserExample.java (XML-parsing implemenation)
      3. DomParserExampleTest.java (XML-parsing test)
      4. ProcessXML.java (run method, according to EXASOL java UDF run method)
      5. Data.java(Emulates a table)
      6. Iterator.java (Used for JUnit tests for ProcessXML, implements ExaIterator)
      7. Metadata.java (Used for JUnit tests for ProcessXML, implements ExaMetadata)
      8. ProcessXMLTest.java (JUnit test for the run method)

      In addition the pom.xml for maven is attached as part of xmlexample.zip. The POM file includes junit, and maven-assembly-plugin (used to create a single jar including all dependencies).

      Maven project in Eclipse:

      Building the sources

      1. "cd" to the workspace of the maven project
      2. run "mvn clean package assembly:single"
        The jar with all dependencies can be found in the "target"-folder. Name: "xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar"

      You can find a precompiled jar attached to this solution. xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar

      4. Deploying jar to EXASOL cluster

      SOL-201 describes how to install jar libraries in the cluster.
      For this solution, the fat jar was deployed to an EXABucket named "jars" using the following curl© command:

      curl -X PUT -T xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar 
       http://w:<w_pwd>@<db_node>:<EXABucketFS_port>/jars/xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar
      

      5. UDF and usage

      The attached file Data.sql creates a table for storing the XML-Text and inserts two rows (including two rows for the CITIES table, each).
      Please note: This approach requires the XML texts to have less than 2 million characters.

      Data.sql
      create schema xmlparsing;
      
      create or replace table myXML (v varchar(2000000));
      
      insert into myXML values 
      '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <document>     
           <entry>
                <id>1</id>
                <name>Nuremberg</name>
           </entry>
           <entry>
                <id>2</id>
                <name>Berlin</name>
           </entry>
      </document>';
      
      insert into myXML values 
      '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
      <document>     
           <entry>
                <id>3</id>
                <name>London</name>
           </entry>
           <entry>
                <id>4</id>
                <name>Stockholm</name>
           </entry>
      </document>';
      
      commit;
      

      The attached file UDF.sql describes how to create a UDF that imports the installed jar library and uses the XML-parsing functionality.
      In addition, the UDF function is used to insert data into an EXASOL table.

      UDF.sql
      create or replace java scalar script processXML(xml varchar(2000000)) 
      emits (id int, name varchar(200000)) as
      %scriptclass com.exasol.xmlexample6.ProcessXML;
      
      %jar /buckets/bucketfs1/jars/xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar;
      }
      /
      commit;
      -- Usage: SELECT
      select processXML(v) from xmlparsing.myxml;
      
      -- Usage: INSERT INTO ... FROM SELECT
      create table cities (id int, name varchar(200000));
      insert into cities select processXML(v) from xmlparsing.myxml;
      commit;
      
      select * from cities;
      
      

      6. Use UDF for ETL

      The last example shows how to use the functionality for inserting data into a table via SELECT statement.
      For ETL matters you typically want to change two different things:

      1. Use IMPORT statement instead of INSERT ... SELECT
      2. Load XML file from a server

      For 1. we will demonstrate how to use the existing UDF for an ETL UDF. 2. will not be demonstrated to keep this solution as simple as possible, but the existing code could easily be extended to connect to a server (e.g. ftp) to retrieve the XML file.

      IMPORT FROM UDF

      The delivered source code already includes the method "generateSqlForImportSpec(...)" which is mandatory for ETL UDFs.
      Class ProcessXML.java:

      // ...
      public static String generateSqlForImportSpec(ExaMetadata meta, ExaImportSpecification importSpec)  {
          	  Map<String, String> params = importSpec.getParameters();
          	  
          	  String etludf = new String(meta.getScriptSchema() + "." + meta.getScriptName());
          	  
          	  String mySelect = new String("SELECT ");  
          	  mySelect += etludf;
          	  mySelect += " (";
          	  mySelect += params.get("COLUMN_NAME");
          	  mySelect += ") FROM ";
          	  mySelect += params.get("TABLE_NAME");
          	  return mySelect;
      }
      // ...
      

      This callback function is called by EXASOL engine if you use the UDF within an IMPORT statement and generates an appropriate SELECT statement for the IMPORT.
      This solution requires that the tablestoring the XML data is located in the currently opened schema, but could be extended with a property "SCHEMA_NAME" easily.

      IMPORT FROM UDF
      IMPORT INTO test.mycities FROM SCRIPT test.processxml
      WITH COLUMN_NAME = 'V'
           TABLE_NAME  = 'MYXML';
      
      Show
      1. Overview This solution describes how to use the Java DOM Parser to parse a XML-Text stored in an EXASOL database. The solution will result in a Java-UDF that can be used to structure the content of a XML-Text that contains . Documentation on the DOM Parser can be found here: http://www.tutorialspoint.com/java_xml/java_dom_parse_document.htm The general approach is: Implementation and tests (pure Java, Maven) Deployment (EXABucketFS) UDF implementation (EXASOL-SQL, Java) Usage (EXASOL-SQL) Important notes The solution uses maven-assemble-plugin to create a single jar including all dependencies. Otherwise it would be necessary to deploy all dependencies (including DOM Parser, ...) manually to the EXASOL cluster. 2. Prerequisites This solution was created by using the following tools and features: EXASOL database > V6, advanced edition JDK Eclipse (Mars) Maven, including the maven-assembly-plugin ( http://maven.apache.org/plugins/maven-assembly-plugin/ ) 3. Java implementation and tests Implementation Within the attached xmlexample.zip file you will find 8 files: City.java (Represents a "row" in a CITIES table. Members: ID and Name) DomParserExample.java (XML-parsing implemenation) DomParserExampleTest.java (XML-parsing test) ProcessXML.java (run method, according to EXASOL java UDF run method) Data.java(Emulates a table) Iterator.java (Used for JUnit tests for ProcessXML, implements ExaIterator) Metadata.java (Used for JUnit tests for ProcessXML, implements ExaMetadata) ProcessXMLTest.java (JUnit test for the run method) In addition the pom.xml for maven is attached as part of xmlexample.zip . The POM file includes junit, and maven-assembly-plugin (used to create a single jar including all dependencies). Maven project in Eclipse: Building the sources "cd" to the workspace of the maven project run "mvn clean package assembly:single" The jar with all dependencies can be found in the "target"-folder. Name: "xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar" You can find a precompiled jar attached to this solution. xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar 4. Deploying jar to EXASOL cluster SOL-201 describes how to install jar libraries in the cluster. For this solution, the fat jar was deployed to an EXABucket named "jars" using the following curl© command: curl -X PUT -T xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar http: //w:<w_pwd>@<db_node>:<EXABucketFS_port>/jars/xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar 5. UDF and usage The attached file Data.sql creates a table for storing the XML-Text and inserts two rows (including two rows for the CITIES table, each). Please note : This approach requires the XML texts to have less than 2 million characters. Data.sql create schema xmlparsing; create or replace table myXML (v varchar(2000000)); insert into myXML values '<?xml version= "1.0" encoding= "UTF-8" standalone= "yes" ?> <document> <entry> <id>1</id> <name>Nuremberg</name> </entry> <entry> <id>2</id> <name>Berlin</name> </entry> </document>'; insert into myXML values '<?xml version= "1.0" encoding= "UTF-8" standalone= "yes" ?> <document> <entry> <id>3</id> <name>London</name> </entry> <entry> <id>4</id> <name>Stockholm</name> </entry> </document>'; commit; The attached file UDF.sql describes how to create a UDF that imports the installed jar library and uses the XML-parsing functionality. In addition, the UDF function is used to insert data into an EXASOL table. UDF.sql create or replace java scalar script processXML(xml varchar(2000000)) emits (id int , name varchar(200000)) as %scriptclass com.exasol.xmlexample6.ProcessXML; %jar /buckets/bucketfs1/jars/xmlexample-0.0.1-SNAPSHOT-jar-with-dependencies.jar; } / commit; -- Usage: SELECT select processXML(v) from xmlparsing.myxml; -- Usage: INSERT INTO ... FROM SELECT create table cities (id int , name varchar(200000)); insert into cities select processXML(v) from xmlparsing.myxml; commit; select * from cities; 6. Use UDF for ETL The last example shows how to use the functionality for inserting data into a table via SELECT statement. For ETL matters you typically want to change two different things: Use IMPORT statement instead of INSERT ... SELECT Load XML file from a server For 1. we will demonstrate how to use the existing UDF for an ETL UDF. 2. will not be demonstrated to keep this solution as simple as possible, but the existing code could easily be extended to connect to a server (e.g. ftp) to retrieve the XML file. IMPORT FROM UDF The delivered source code already includes the method "generateSqlForImportSpec(...)" which is mandatory for ETL UDFs. Class ProcessXML.java: // ... public static String generateSqlForImportSpec(ExaMetadata meta, ExaImportSpecification importSpec) { Map< String , String > params = importSpec.getParameters(); String etludf = new String (meta.getScriptSchema() + "." + meta.getScriptName()); String mySelect = new String ( "SELECT " ); mySelect += etludf; mySelect += " (" ; mySelect += params.get( "COLUMN_NAME" ); mySelect += ") FROM " ; mySelect += params.get( "TABLE_NAME" ); return mySelect; } // ... This callback function is called by EXASOL engine if you use the UDF within an IMPORT statement and generates an appropriate SELECT statement for the IMPORT. This solution requires that the tablestoring the XML data is located in the currently opened schema, but could be extended with a property "SCHEMA_NAME" easily. IMPORT FROM UDF IMPORT INTO test.mycities FROM SCRIPT test.processxml WITH COLUMN_NAME = 'V' TABLE_NAME = 'MYXML' ;
    • Category 1:
      ETL / Data Integration
    • Category 2:
      UDFs and In-Database Analytics

      Attachments

        Issue Links

        1. Data.sql
          0.6 kB
          Mathias Brink
        2. Project.JPG
          85 kB
          Mathias Brink
        3. UDF.sql
          0.5 kB
          Mathias Brink

          Activity

            People

            • Assignee:
              CaptainEXA Captain EXASOL
              Reporter:
              CaptainEXA Captain EXASOL
            • Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

              • Created:
                Updated: