Details

    • Type: How To
    • Status: Obsolete
    • Affects Version/s: EXASolution 5.0
    • Fix Version/s: None
    • Component/s: EXAPowerlytics
    • Labels:
    • Solution:

      Note: This solution is no longer maintained. For the latest information, please refer to our documentation:

      https://community.exasol.com/t5/database-features/parsing-json-data-with-python/ta-p/1247

      Background

      This solution is an example of how to load and parse JSON data with a simple SQL statement within EXASOL. In this case, the integrated Python user-defined functions (UDFs) are used in combination with the Python json library.

      Prerequisites

      First of all, we create a small script to load data from a URL:

      --small script to load data from a URL and optionally split the data on newlines
      --/
      create or replace python scalar script load_data_from_http
      (url varchar(500),split_on_newline boolean) emits (output_data varchar(2000000)) as
      def run(ctx):
      	import urllib2
      	response = urllib2.urlopen(ctx.url)
      	data = response.read()
      	if ctx.split_on_newline:
      		lines = data.split('\n')
      		for line in lines:
      			if len(line) > 1:
      				ctx.emit(line)
      	else:
      		ctx.emit(data)
      /
      

      This script will give you a varchar(2000000) field called OUTPUT_DATA with the content of the file.
      In this example we load JSON data from the Canadian Recalls and Safety Alerts dataset. The license is available at http://open.canada.ca/en/open-government-licence-canada
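
      The splitting behaviour of the loader can be sketched in plain Python, outside the UDF framework. The function name split_into_rows is a hypothetical helper introduced here for illustration; it assumes the payload has already been fetched (the UDF uses urllib2 for that part):

```python
# Standalone sketch of what load_data_from_http emits, assuming the
# payload has already been fetched. split_into_rows is a hypothetical
# helper name, not part of the UDF API.
def split_into_rows(data, split_on_newline):
    """Return the rows the UDF would emit for the given payload."""
    if split_on_newline:
        # One row per non-trivial line, mirroring the len(line) > 1 check
        # in the UDF (note this also drops single-character lines).
        return [line for line in data.split('\n') if len(line) > 1]
    # Otherwise the whole payload becomes a single row.
    return [data]
```

      For example, split_into_rows("{}\nfoo\n", True) yields the two rows "{}" and "foo", while split_into_rows("abc", False) yields the single row "abc".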

      How to parse the JSON data

      The following script is an example created to parse the JSON file of the Canadian Recalls and Safety Alerts.
      Please adjust it to your needs / your JSON file: in the EMITS section you define the output columns you want, and in the run function you define how the data is parsed.

      Step 1

      Each call to ctx.emit adds one row to the result of this function.

      --this is an example script to parse JSON (INPUT). Please adjust it to your json-format
      --try/except to handle missing values and emit null instead
      --/
      create or replace python scalar script json_parsing_recalls("INPUT" varchar(2000000)) 
      emits (recallid varchar(50), title varchar(1000), category varchar(100), date_published int, url varchar(100)) as
      import json
      def run(ctx):
      	j = json.loads(ctx.INPUT)
      	for item in j['results']['ALL']:
      		try:
      			recallId = item['recallId']
      		except KeyError:
      			recallId = None
      		try:
      			title = item['title']
      		except KeyError:
      			title = None
      		try:
      			category = ','.join(item['category'])
      		except KeyError:
      			category = None
      		try:
      			date_published = item['date_published']
      		except KeyError:
      			date_published = None
      		try:
      			url = item['url']
      		except KeyError:
      			url = None
      		ctx.emit(recallId, title, category, date_published, url)
      /
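
      Outside the database, the extraction pattern above can be exercised on a miniature document shaped like the recalls feed. The sample values below are invented, not real API output; dict.get is used here as a compact equivalent of the try/except KeyError blocks in the UDF:

```python
import json

# Miniature document shaped like the recalls feed; the values are
# invented sample data, not real API output.
sample = json.dumps({
    "results": {"ALL": [
        {"recallId": "RA-1", "title": "Sample recall",
         "category": ["Food", "Health"], "date_published": 1438786533,
         "url": "http://example.com/ra-1"},
        {"recallId": "RA-2", "title": "No category or url"},
    ]}
})

rows = []
for item in json.loads(sample)["results"]["ALL"]:
    # dict.get returns None for missing keys, matching the
    # try/except KeyError pattern in the UDF above.
    category = item.get("category")
    rows.append((
        item.get("recallId"),
        item.get("title"),
        ",".join(category) if category is not None else None,
        item.get("date_published"),
        item.get("url"),
    ))
```

      The second record is missing category, date_published and url, so its row comes out as ("RA-2", "No category or url", None, None, None), just as the UDF would emit NULLs for those columns.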
      

      Step 2

      Now you can nest both scripts to load and parse the data in a single statement. The inner select loads the data and the outer select parses the output.

      select json_parsing_recalls(OUTPUT_DATA) from (
      	--select statement that reads the data from the url (in this case from canadian open data)
      	-- data source: http://open.canada.ca/data/en/dataset/d38de914-c94c-429b-8ab1-8776c31643e3
      	-- license: http://open.canada.ca/en/open-government-licence-canada
      	select load_data_from_http('http://healthycanadians.gc.ca/recall-alert-rappel-avis/api/recent/en?_ga=1.18277497.1100922614.1438786533',false)
      );
      

      This will give you a result set with the parsed JSON data.

      Additional Notes

      The script to load data from a URL in this simple example is limited to a maximum of 2 million characters per file or line (because of the varchar(2000000) in its definition). If you have longer JSON objects, adjust it to your needs, e.g. by combining the two functions into one.
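
      Such a combination can be sketched in plain Python 3 (the UDFs above use Python 2's urllib2; load_and_parse and parse_recalls are hypothetical helper names). Fetching and parsing in one function means the full payload never has to pass through an intermediate varchar(2000000) column:

```python
import json
import urllib.request  # Python 3 equivalent of the urllib2 used in the UDF

def parse_recalls(raw):
    """Yield (recallId, title) tuples from a recalls-style JSON string.
    Missing keys become None, matching the UDF's NULL handling."""
    for item in json.loads(raw)["results"]["ALL"]:
        yield item.get("recallId"), item.get("title")

def load_and_parse(url):
    """Fetch the document and parse it in one step, avoiding the
    intermediate 2-million-character column limit."""
    with urllib.request.urlopen(url) as response:
        raw = response.read().decode("utf-8")
    return list(parse_recalls(raw))
```

      Inside EXASOL the same idea would be a single UDF whose run function fetches the URL and emits the parsed columns directly.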

    • Category 1:
      UDFs and In-Database Analytics - python
    • Category 2:
      Clients, Interfaces & Drivers

          People

          • Assignee:
            CaptainEXA Captain EXASOL
            Reporter:
            CaptainEXA Captain EXASOL
