Uploaded image for project: 'Solution Center'
  1. Solution Center
  2. SOL-443

How to create and use User Defined Functions (UDF) ?

    Details

    • Type: How To
    • Status: Published
    • Affects Version/s: EXASOL 6.0.0, Exasol 6.1.0, Exasol 6.2.x
    • Fix Version/s: None
    • Component/s: EXASolution
    • Labels:
    • Solution:
      Hide

      Background

      This is a basic tutorial on User Defined Functions. For details regarding the programming languages see the EXAOL manual, section 2.11.3 More technical insight is also given in the white paper "Big Data Science - the future of analytics" (link).

      UDFs vs. control scripts

      Control scripts User Defined Functions (UDFs)
      for control jobs for operations on data sets
      written in Lua written in SQL, Lua, Java, Python, R
      executed via SQL statement: 'EXECUTE SCRIPT ...' executed within SQL 'SELECT' statement, e.g. 'SELECT udf(col1, col2) FROM tbl1;'
      supports executing of SQL statements & SQL error handling supports loading of Java / Python / R packages
      runs iteratively & single threaded on one node runs massively parallel

      Creating a UDF

      Parallelism

      Tables are stored in EXASOL row-wise distributed across a cluster's nodes. A UDF environment runs "sandboxed" on each node, computing a local result set integrated into EXASOL's SQL pipeline (see white paper (link): "A peek under the hood"). The local result sets are sent to the node handling the client connection and merged to the global result set to be sent to the client.

      Scalar UDFs

      A scalar UDF computes on one input row per UDF instance and returns one output row. It is automatically executed in massively parallel in the EXASOL cluster.

      Aggregate UDFs

      An aggregate UDF consumes multiple input tuples, i.e. the whole columns of the table or the group if a GROUP BY clause is defined in the SELECT statement. One UDF instance returns one single value.

      The number of UDF instances to be started is handled by the GROUP BY clause. Without a GROUP BY clause one instance is started which runs on one node, not in parallel. A GROUP BY clause that results in multiple groups causes multiple UDF instances, which are then computed massively parallel in EXASOL.

      Analytical UDFs

      These consume as input values multiple data tuples (rows), and they also return a result set containing multiple rows. Parallelism is handled by the GROUP BY (see above).

      In the following example, a running sum is computed. Notice that the simple algorithm iterates through the entire input set within one instance (repeat loop).

      Additional Notes

      Adding language specific libraries / packages

      EXASOL 5 or before: Libraries are made available via EXAOperation: Software -> UDF Libraries :

      Additional References

      EXASOL 6 or later: See detailed information about EXABucket FS in SOL-503 and in particular the subsection Expanding script languages using BucketFS in the EXASOL user manual.

       SOL-542 : Deploying an R scoring model in Exasol via UDF

       SOL-190 : UDF Hadoop Connection 

      https://docs.exasol.com/database_concepts/udf_scripts/debug_udf_script_output.htm

      SOL-568 : Examining network interfaces through UDF

      SOL-559 : UDF to synchronize archive volumes via FTP

      SOL-558 : UDF Signature script generator

      SOL-570 : Querying and Converting JSON Data with the JSON_TABLE UDF

       

       

      Show
      Background This is a basic tutorial on User Defined Functions. For details regarding the programming languages see the EXAOL manual, section 2.11.3 More technical insight is also given in the white paper "Big Data Science - the future of analytics" (link). UDFs vs. control scripts Control scripts User Defined Functions (UDFs) for control jobs for operations on data sets written in Lua written in SQL, Lua, Java, Python, R executed via SQL statement: 'EXECUTE SCRIPT ...' executed within SQL 'SELECT' statement, e.g. 'SELECT udf(col1, col2) FROM tbl1;' supports executing of SQL statements & SQL error handling supports loading of Java / Python / R packages runs iteratively & single threaded on one node runs massively parallel Creating a UDF Parallelism Tables are stored in EXASOL row-wise distributed across a cluster's nodes. A UDF environment runs "sandboxed" on each node, computing a local result set integrated into EXASOL's SQL pipeline (see white paper (link): "A peek under the hood" ). The local result sets are sent to the node handling the client connection and merged to the global result set to be sent to the client. Scalar UDFs A scalar UDF computes on one input row per UDF instance and returns one output row. It is automatically executed in massively parallel in the EXASOL cluster. Aggregate UDFs An aggregate UDF consumes multiple input tuples, i.e. the whole columns of the table or the group if a GROUP BY clause is defined in the SELECT statement. One UDF instance returns one single value. The number of UDF instances to be started is handled by the GROUP BY clause. Without a GROUP BY clause one instance is started which runs on one node, not in parallel. A GROUP BY clause that results in multiple groups causes multiple UDF instances, which are then computed massively parallel in EXASOL. Analytical UDFs These consume as input values multiple data tuples (rows), and they also return a result set containing multiple rows. Parallelism is handled by the GROUP BY (see above). In the following example, a running sum is computed. Notice that the simple algorithm iterates through the entire input set within one instance (repeat loop). Additional Notes Adding language specific libraries / packages EXASOL 5 or before : Libraries are made available via EXAOperation: Software -> UDF Libraries : Additional References EXASOL 6 or later : See detailed information about EXABucket FS in SOL-503 and in particular the subsection Expanding script languages using BucketFS in the EXASOL user manual.   SOL-542  : Deploying an R scoring model in Exasol via UDF  SOL-190 : UDF Hadoop Connection  https://docs.exasol.com/database_concepts/udf_scripts/debug_udf_script_output.htm SOL-568 : Examining network interfaces through UDF SOL-559  : UDF to synchronize archive volumes via FTP SOL-558 : UDF Signature script generator SOL-570  : Querying and Converting JSON Data with the JSON_TABLE UDF    
    • Category 1:
      UDFs and In-Database Analytics - Lua
    • Category 2:
      Database Design

      Attachments

        Issue Links

        1. adding_libs.png
          adding_libs.png
          70 kB
        2. first_udf.png
          first_udf.png
          33 kB
        3. SOL-443_udf.sql
          1 kB
        4. udf_aggregate.png
          udf_aggregate.png
          9 kB
        5. udf_analytic.png
          udf_analytic.png
          9 kB
        6. udf_parallelism.png
          udf_parallelism.png
          50 kB
        7. udf_scalar.png
          udf_scalar.png
          10 kB

          Activity

            People

            • Assignee:
              CaptainEXA Captain EXASOL
              Reporter:
              CaptainEXA Captain EXASOL
            • Votes:
              0 Vote for this issue
              Watchers:
              0 Start watching this issue

              Dates

              • Created:
                Updated: