
The magic of ‘EXPLAIN VIRTUAL SELECT’ — a peek behind the scenes of Exasol Virtual Schemas

Why use an Exasol Virtual Schema?

Here at Exasol we have a lot of open-source projects available on GitHub. One of the larger ones is the Exasol Virtual Schema. A Virtual Schema is an abstraction layer that makes external data sources accessible through our data analytics platform. Virtual Schemas are designed to look as if they were regular schemas with tables inside our platform.

This has multiple benefits:

  • users can apply their existing SQL skills
  • it makes it really easy to correlate data from external sources with internal data (see the sketch after this list)
  • and you can base your own ETL processes on Virtual Schemas.
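
Here’s a minimal sketch of what that looks like in practice. The adapter script ADAPTER.RS_VS_ADAPTER, the connection REDSHIFT_CONNECTION, and the table and column names are placeholders I made up for illustration; the exact properties depend on the concrete Virtual Schema Adapter you use.

-- Create a Virtual Schema backed by an external database
-- (adapter script, connection and property values are placeholders)
CREATE VIRTUAL SCHEMA REDSHIFT
    USING ADAPTER.RS_VS_ADAPTER
    WITH
        CONNECTION_NAME = 'REDSHIFT_CONNECTION'
        SCHEMA_NAME     = 'public';

-- The virtual tables now behave like regular tables, so external
-- data can be joined directly with internal data
SELECT l.order_id, r.customer_name
FROM local_schema.orders l
JOIN REDSHIFT.customers r ON r.customer_id = l.customer_id;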

The two sides of Virtual Schema architecture

Exasol’s Virtual Schema consists of two main parts. The part inside the database imports data into our data analytics platform. The so-called Virtual Schema Adapter is a plug-in that abstracts the access to the external data source. Those two parts communicate with each other via a simple JSON-based protocol. While the exact details aren’t important for this article, you at least need to know that Virtual Schemas based on third-party relational databases use an IMPORT statement under the hood to speed up the import.

So the adapter creates this IMPORT statement and the database executes it.
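
To make that division of labor a bit more tangible, here is a heavily simplified, purely illustrative sketch of the exchange. The real protocol messages are more elaborate, and the field contents below are my own shorthand, not the literal API. The database sends the adapter a pushdown request describing the user’s query:

{ "type": "pushdown", "query": "SELECT * FROM all_datatypes" }

And the adapter answers with the statement the database should execute instead:

{ "type": "pushdown", "sql": "IMPORT INTO (...) FROM JDBC AT ... STATEMENT '...'" }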

Let’s assume you’re happily developing your Virtual Schema Adapter and then, during an integration test, you get an error. You realize you’re not sure where things went wrong. Was it on the adapter side or within the data analytics platform? That’s when this comes in handy:

Enter EXPLAIN VIRTUAL SELECT.

Use EXPLAIN VIRTUAL SELECT to see how deep the rabbit hole goes

This statement shows you the SQL statement that the Virtual Schema Adapter created from the user’s original query. This is the statement that the data analytics platform will execute.

Here’s an example that x-rays a very basic SELECT on a test table I created in an AWS Redshift database that serves as the data source.

EXPLAIN VIRTUAL SELECT * FROM REDSHIFT.all_datatypes;

When I execute this query in my integration test environment, it gives me the following result. I added extra whitespace and newlines for better readability, and I redacted the instance ID and password.

IMPORT INTO (
    c0 DECIMAL(19, 0),
    c1 BOOLEAN,
    c2 CHAR(1000) UTF8,
    c3 VARCHAR(1000) UTF8,
    c4 DATE,
    c5 DOUBLE,
    c6 DECIMAL(10, 0),
    c7 VARCHAR(2000000) UTF8,
    c8 DOUBLE,
    c9 DECIMAL(5, 0),
    c10 VARCHAR(256) UTF8,
    c11 TIMESTAMP,
    c12 TIMESTAMP
) FROM JDBC
AT 'jdbc:redshift://rsvs.redacted.eu-central-1.redshift.amazonaws.com:5439/rsvs'
USER 'master'
IDENTIFIED BY 'redacted'
STATEMENT 'SELECT * FROM "public"."all_datatypes"'

As you can see, our adapter converted the original SELECT statement into an IMPORT. While it’s not particularly eye-catching, our data analytics platform can read it just fine.

When you take a closer look, you’ll notice that the IMPORT command…

… starts with a precise definition of the column types
… defines which JDBC connection the data analytics platform should use to connect to the external data source
… hands over credentials or a connection
… defines a sub-select that serves as input source for the import.

Why ‘connection definitions’ are preferable to ‘defining credentials’ in the Virtual Schema

For the sake of the example I intentionally put the credentials into the Virtual Schema definition. EXPLAIN VIRTUAL SELECT demonstrates nicely why using a connection definition would be preferable. Connections don’t expose credentials, so always use them if you need to store access tokens, user names or passwords.
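
Here’s a sketch of what that looks like. The connection name REDSHIFT_CONNECTION is a placeholder I chose for illustration; the credentials are stored once, server-side, when the connection is created:

-- Store URL and credentials once in a named connection
CREATE CONNECTION REDSHIFT_CONNECTION
    TO 'jdbc:redshift://rsvs.redacted.eu-central-1.redshift.amazonaws.com:5439/rsvs'
    USER 'master'
    IDENTIFIED BY 'redacted';

With the connection in place, the generated IMPORT references it by name, and EXPLAIN VIRTUAL SELECT no longer reveals the password:

IMPORT INTO (
    -- column list exactly as in the EXPLAIN output above
) FROM JDBC
AT REDSHIFT_CONNECTION
STATEMENT 'SELECT * FROM "public"."all_datatypes"'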

Debug the IMPORT statement separately

Now that you have the generated statement in your hands, you can execute it manually on our data analytics platform. This way you can easily distinguish between problems that are related to your Virtual Schema Adapter and those that occur when the platform executes the statements created by the adapter.
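
For example, you could paste the generated IMPORT into your SQL client and narrow the sub-select step by step until the error disappears. The LIMIT below is just one hypothetical way to bisect the problem; whether it’s available depends on the source database’s dialect:

-- Run the generated IMPORT by hand, restricting the sub-select
-- to isolate which rows (or columns) trigger the error
IMPORT INTO (
    c0 DECIMAL(19, 0),
    c1 BOOLEAN
    -- …remaining column definitions as in the EXPLAIN output above
) FROM JDBC
AT REDSHIFT_CONNECTION
STATEMENT 'SELECT * FROM "public"."all_datatypes" LIMIT 100'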

Conclusion

Exasol Virtual Schemas are a nice way to hide the extra complexity of accessing an external data source from data analytics platform users. They allow users to access the data source as if it were an integral part of the platform. You can write your own Virtual Schema Adapter. And EXPLAIN VIRTUAL SELECT is a valuable tool in your debugging toolbox.