SQL-on-Hadoop: Not Very Hadoop, Not Very Appetising.

When is a burger not a burger? When it’s a pizza! “SQL-on-Hadoop” is in fact not very Hadoop and isn’t very appetising.

I think we can all agree that Hadoop is amazing.

First there was Map/Reduce and HDFS for sharing processing and data storage across a cluster of machines, which is absolutely perfect for tasks similar to indexing the entire Internet.

Unfortunately, not every task is like indexing the Internet. Take, for example, the kind of fast analytical queries over structured data that Exasol does so well.

For these types of tasks, there are now projects like Spark that provide very low-latency execution frameworks to replace Map/Reduce, and there are also in-memory, columnar file systems to replace HDFS.

The resulting solutions are marketed as being “SQL on Hadoop”, but I would say they are “SQL on SomethingNew” – because all of the “Hadoopiness” has been taken away and replaced.

To give a less technical example: Imagine I want to get “The Beatles” to play at my party. John and George are unfortunately not available and so I need to replace them with somebody else. Paul and Ringo are still alive and still playing, but imagine they are doing something else on the night, so I will have to replace them too.

My question to you is this: can I still call the band “The Beatles” if none of John, Paul, George or Ringo is playing?

In a recent presentation to a Big Data Meetup in London I used the example of a cheeseburger, but one that had no cheese, no meat and no bread, and with a pizza added. I would say that is a pizza (a new thing) rather than any sort of burger (a converted old thing).

It sounds like I am being pedantic, but by marketing these solutions as “SQL on Hadoop”, these companies are dishonestly implying that they have all the advantages of Hadoop and are on a platform that has been over a decade in the making.

The truth is that these execution frameworks and file systems have arrived within the last year or two, and so are still very immature. I personally think some are promising and will eventually perform well.

However, these projects are slowly rediscovering some of the techniques that have been a feature of Exasol for well over a decade: a fast, low-latency execution platform and columnar, in-memory storage, all spread efficiently across a cluster of machines. Sound familiar?

These new “SQL on so-called Hadoop” solutions are over a decade behind Exasol in terms of performance, reliability and ease of use.

Until they catch up (if they ever do!) you’re better off giving Exasol a try here.