Tech Blog

The goddess of wisdom turns her hand to data – using AWS Athena with Exasol’s Virtual Schemas

In the last article of our series about Exasol’s Virtual Schemas we took on a developer’s perspective and learned how to build our own Virtual Schema adapter.

Today we approach Virtual Schemas from a user’s angle and set up a connection between Exasol and AWS Athena in order to query data from regular files stored on S3 as if they were part of an Exasol database.

Athena is Presto-as-a-Service

Athena is based on the open-source project Presto. In their own words, the Presto team provides a “distributed SQL Query Engine for Big Data”. In short, you get a cluster of machines able to dig through large amounts of data, controlled by SQL commands.

Since Athena is provided as a service, you pay per use rather than for keeping a cluster running, which makes it the better choice for infrequent use.

Sample data set

In our example we are going to use files in comma-separated values (CSV) format to demonstrate the idea behind Athena. When I say regular files, I mean files that, in contrast to database storage, don’t necessarily contain structures like indexes that can speed up searches. CSV is the perfect example, because most people consider it a data exchange format rather than a storage format that allows querying data.

I know that data mining is often a very dry topic, but we’re going to use a fun data set for our example: a UFO sightings data set provided by the National UFO Reporting Center (NUFORC). I picked this data set mainly for my own amusement. But it also presents a few challenges that are typical for this kind of data import, as we will see later.

The sighting data are available in a raw and a sanitized version, where the latter contains only complete records. Please download the cleaned-up CSV as a first step (ca. 15 MiB at the time of writing).

Setting up the AWS side

Please follow the steps below to prepare for using Athena. I’m assuming that you already have an AWS account and the necessary permissions.

Uploading the CSV to S3

  1. Create an S3 bucket and write down the name
  2. Upload the CSV to the S3 bucket
Sample data set for Athena

Creating the entry in the AWS Glue catalog

AWS Glue is a managed ETL solution. We only need the catalog part in order to register the schema of the data present in the CSV file.

  1. Register the database structure as a table manually in AWS Glue
    1. Name it ufo_sightings
    2. Provide the S3 bucket path
      s3://<bucket name>/[path/]
    3. Pick “CSV” as data format
    4. Select “comma” as a delimiter
    5. Add columns
      1. observation_time: string
      2. location: string
      3. state: string
      4. country: string
      5. appearance: string
      6. seconds_observed: string
      7. rough_duration: string
      8. description: string
      9. entry_date: string
      10. latitude: string
      11. longitude: string
    6. Review and finish
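
If you prefer SQL over the console wizard, you can create an equivalent catalog entry by running a DDL statement in the Athena console instead. The following is only a sketch under the assumptions made above (the bucket path placeholder, a comma delimiter, and a header line in the CSV file); the wizard steps are what the rest of this article relies on.

-- Sketch only: the LOCATION placeholder and the header-line property are assumptions.
CREATE EXTERNAL TABLE IF NOT EXISTS ufo_sightings (
    observation_time string,
    `location` string,
    state string,
    country string,
    appearance string,
    seconds_observed string,
    rough_duration string,
    description string,
    entry_date string,
    latitude string,
    longitude string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://<bucket name>/[path/]'
TBLPROPERTIES ('skip.header.line.count' = '1');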

You might be wondering why only string types are used. In my experiments I found that the infrastructure under Athena isn’t very flexible when it comes to converting data formats. The date and time representations present in the CSV file aren’t recognized out of the box, so you have to convert them in your SQL code (more on that below). Unfortunately, this is also true for latitude and longitude, which would normally be decimals. To make things worse, even the column seconds_observed contains entries that aren’t numbers.

On the other hand, that is a typical example of imperfect data quality that you often have to deal with when importing someone else’s data.

Generally speaking, in any ETL scenario you want your staging database to be as forgiving as possible when it comes to data formats. That way you don’t lose records during the first step of the import. The same rule applies in our Athena example, since we’re using Athena as our staging area to get the data into Exasol’s analytics database.
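
To give you an idea of what “converting in your SQL code” looks like on the Athena side, here is a sketch that uses Presto’s TRY and conversion functions. The date format string is an assumption and depends on how the timestamps are written in the CSV file, so adjust it before use.

SELECT TRY_CAST("latitude" AS double)  AS "latitude",
       TRY_CAST("longitude" AS double) AS "longitude",
       -- The format string '%m/%d/%Y %H:%i' is an assumption about the file's timestamp layout.
       TRY(date_parse("observation_time", '%m/%d/%Y %H:%i')) AS "observed_at"
FROM "ufo_sightings"
LIMIT 10;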

After this you should have a catalog entry in Glue that looks similar to the screenshot below.

Glue catalog entry

Running a test query via the Athena console

Open the Athena console and run the following query to see whether your setup works so far.

SELECT * FROM "ufo_sightings" LIMIT 10;

This is the simplest possible query you can run, but it’s good enough for a smoke test.

Please note the quotes around the table name. In Athena everything is lower case by default. Exasol’s analytics database converts all unquoted identifiers to upper case, which is the standard behavior for an SQL database. If you later want to reuse our SQL statements, I recommend making it a habit to quote all identifiers.

Preparing the Athena Virtual Schema

Now that the AWS setup works, it’s time to configure the Virtual Schema. I am going to present a condensed variant of the necessary steps. You can find more details in our Virtual Schema deployment guide.

Installing the Athena JDBC driver

Before you can work with the Virtual Schema you first need the Athena JDBC driver. You can find download instructions and a link to the latest version in the Athena user guide.

After you’ve downloaded the JDBC driver, you have to install it in two different locations:

  1. In EXAoperation
  2. In BucketFS

The reason for this is that Virtual Schemas are split into two parts: one runs in the core database, the other in a UDF container. While the former gets its drivers from EXAoperation, the latter can only see the contents of BucketFS.

  1. Open EXAoperation
  2. Click “Software” in the box on the left side
  3. Click the tab “JDBC Drivers”
  4. Click “Add” to create a new entry for the Athena driver
    1. Driver name: Athena
    2. Main class: com.simba.athena.jdbc.Driver
    3. Prefix: jdbc:awsathena:
    4. With driver versions newer than 2.0.7 you need to disable the security manager
  5. Select the new entry
  6. Upload the JDBC driver

After that you should have an entry which looks like this:

Athena JDBC driver in EXAoperation

Next you install the same driver in a bucket in BucketFS.

  1. Open EXAoperation
  2. Click “EXABuckets”
  3. Add a bucket called “jars”
  4. Open a terminal
  5. Upload the JDBC driver to the bucket through an HTTP PUT request, for example via curl (“w” is BucketFS’s user for write access, and the part after the colon is the bucket’s write password)
curl -X PUT -T AthenaJDBC42_2.0.7.jar http://w:jars@localhost:2580/jars/AthenaJDBC42_2.0.7.jar

Installing the Virtual Schema adapter

Check our Virtual Schema releases on GitHub to download the latest version of the JAR file containing the JDBC adapter. Once you’ve obtained the latest JAR file, install it via BucketFS.

In the examples below I used version 1.15.0. You need to replace that version number with the version of the JAR file you downloaded.

curl -X PUT -T virtualschema-jdbc-adapter-dist/target/virtualschema-jdbc-adapter-dist-1.15.0.jar http://w:jars@localhost:2580/jars/virtualschema-jdbc-adapter-dist-1.15.0.jar

Now connect to your Exasol database and register the JDBC adapter UDF.

CREATE SCHEMA ADAPTER;

CREATE OR REPLACE JAVA ADAPTER SCRIPT adapter.athena_adapter AS
  %scriptclass com.exasol.adapter.jdbc.JdbcAdapter;
  %jar /buckets/bfsdefault/jars/virtualschema-jdbc-adapter-dist-1.15.0.jar;
  %jar /buckets/bfsdefault/jars/AthenaJDBC42_2.0.7.jar;
/

Next you define the connection details. Check the online documentation of the CREATE CONNECTION command. Here you provide your credentials to Athena — i.e. an AWS access key — in a secure way.

CREATE OR REPLACE CONNECTION ATHENA_CONNECTION
TO 'jdbc:awsathena://AwsRegion=<region code>;S3OutputLocation=s3://<s3 path to log output>'
USER '<access key ID>'
IDENTIFIED BY '<access key>';

Last but not least, create the actual Virtual Schema.

CREATE VIRTUAL SCHEMA UFO
USING ADAPTER.ATHENA_ADAPTER
WITH
  SQL_DIALECT = 'ATHENA'
  CONNECTION_NAME = 'ATHENA_CONNECTION'
  SCHEMA_NAME = 'default';
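
A small aside before querying: if you later change the table definition on the Glue side, the Virtual Schema does not pick up the new metadata automatically. You can trigger a refresh with Exasol’s ALTER VIRTUAL SCHEMA statement, sketched below.

-- Re-read the remote metadata after changes in the Glue catalog:
ALTER VIRTUAL SCHEMA UFO REFRESH;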

Querying the Virtual Schema

Congratulations, you’re all set and can run your first query on that Virtual Schema.

OPEN SCHEMA "UFO";

SELECT * FROM "ufo_sightings" LIMIT 10;

Admittedly, this command is a little bit boring and mainly intended as a smoke test. It also shows that you can copy statements you issued via the Athena console 1:1, provided that you use identifier quoting.
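
As a quick illustration of why the quoting matters, compare the following two statements; this is just a sketch, and the first one is expected to fail.

-- Without quotes, Exasol folds the identifier to upper case and asks Athena for
-- a table called UFO_SIGHTINGS, which does not exist there:
SELECT * FROM ufo_sightings LIMIT 10;

-- With quotes, the lower-case name is preserved exactly as Athena knows it:
SELECT * FROM "ufo_sightings" LIMIT 10;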

Before I let you play around with the data on your own I’d like to demonstrate one more statement that shows a typical example of dealing with imperfect source data quality.

SELECT COUNT(1) AS "appearance_count",
       "appearance",
       ROUND(AVG(TO_NUMBER("seconds_observed"))) AS "average_seconds"
FROM "ufo_sightings"
WHERE "country" = 'us' AND IS_NUMBER("seconds_observed", '9999999')
GROUP BY "appearance"
ORDER BY "appearance_count" DESC;

Here we group the UFO sightings by appearance. Since some of the entries in the column seconds_observed contain illegal characters, we’re filtering them out. Alternatively, you could count them and set the duration to zero (see the sketch below). In the first case you get fewer sightings, in the second you get lower average observation times. Either way, you won’t get more information out of the data than the original data set contains.
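
For completeness, here is a sketch of the alternative mentioned above: count the rows with unusable durations as well, but treat their duration as zero. The rest of the query stays the same.

SELECT COUNT(1) AS "appearance_count",
       "appearance",
       -- Non-numeric durations are counted, but contribute zero seconds to the average:
       ROUND(AVG(CASE
                   WHEN IS_NUMBER("seconds_observed", '9999999') THEN TO_NUMBER("seconds_observed")
                   ELSE 0
                 END)) AS "average_seconds"
FROM "ufo_sightings"
WHERE "country" = 'us'
GROUP BY "appearance"
ORDER BY "appearance_count" DESC;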

UFO sightings

So it’s lights that are most often spotted, and flying saucers are not even in the top five. That’s a bit anticlimactic for my taste – I was hoping for the mothership and aliens with ray guns. Oh well.

Conclusion

You can use Exasol’s Virtual Schemas to conveniently query data from CSV files located on S3 via Athena. The data appear as if they were present in a regular table in the Exasol database.

I’ll let you be the judge of whether or not chasing the rides of little green men is a worthy job for the goddess of wisdom and an in-memory analytics database.

Sebastian Bär