Gene genie – the power to analyze vast amounts of genome data

For a long time only “structured” data could be analyzed in databases. But what about data that doesn’t come in convenient rows and columns, like the Human Genome data?

For a long time, it was only so-called “structured” data that could be analyzed in databases. This kind of data comes in rows and columns (you know, like you find in spreadsheets) and has a predictable format. Most accounting data is like this, and the science of business over the last 30 years has been revolutionized by using high-speed, high-volume SQL databases to get insight from that structured data.

However, what about other sciences? What about data that doesn’t come in convenient rows and columns?

I’ve been looking recently at Human Genome data. This definitely doesn’t fall into the category of “structured” data, although there is some element of structure to it.

For example, human DNA can be expressed as a long sequence of four different letters (A, T, C and G). In the protein-coding parts of the genome, these letters are read in groups of three, and of the 64 possible combinations of these letters, 61 specify amino acids and the other three are “stop” signals that mark the end of a protein-coding sequence within a gene.
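To make the triplet idea concrete, here is a minimal Python sketch. It enumerates every three-letter combination of A, T, C and G and sets aside the three stop triplets of the standard genetic code (TAA, TAG and TGA). The helper name `split_into_codons` and the sample sequence are made up for illustration; real genome tooling has to handle reading frames, both DNA strands and much more.

```python
from itertools import product

BASES = "ATCG"
# Stop triplets of the standard genetic code, written in the DNA alphabet.
STOP_CODONS = {"TAA", "TAG", "TGA"}

# Every ordered triplet that can be built from the four letters.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]
coding_codons = [c for c in all_codons if c not in STOP_CODONS]


def split_into_codons(sequence):
    """Split a DNA string into consecutive triplets, dropping an incomplete tail."""
    usable = len(sequence) - len(sequence) % 3
    return [sequence[i:i + 3] for i in range(0, usable, 3)]


print(len(all_codons), len(coding_codons))   # total triplets, amino-acid triplets
print(split_into_codons("ATGGCCTAAG"))       # hypothetical sample sequence
```

Running this prints 64 and 61 for the two counts, and the sample sequence splits into ATG, GCC and TAA, with the trailing G dropped.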

The human genome can be represented by around 3 billion letters, which would equate to around 800 MB of data (unzipped), although the actual size of a file containing a human genome is much larger due to the way that sequencing is done and because of data quality issues. The science of physically “sequencing” these 3 billion letters from a sample of DNA is now well established. Unfortunately, the science of analyzing that data is rather less well understood.
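The size estimate is easy to check with a little arithmetic. The sketch below assumes the most compact simple encoding, two bits per letter (enough to tell four letters apart), and compares it with storing one byte per letter as plain text. The function name `packed_size_mb` is hypothetical, and 3 billion is the rounded figure used above.

```python
GENOME_LETTERS = 3_000_000_000  # rounded figure from the text
BITS_PER_LETTER = 2             # assumption: four letters, no ambiguity codes


def packed_size_mb(letters, bits_per_letter=BITS_PER_LETTER):
    """Return the storage size in megabytes (1 MB = 1,000,000 bytes)."""
    return letters * bits_per_letter / 8 / 1_000_000


print(packed_size_mb(GENOME_LETTERS))     # two bits per letter
print(packed_size_mb(GENOME_LETTERS, 8))  # one byte per letter, as in a text file
```

Two bits per letter gives 750 MB, which is in the same ballpark as the figure quoted above. One byte per letter gives about 3,000 MB, before any of the quality and coverage information that real sequencing files carry.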

The good news is that anyone can acquire this data. The “1000 Genomes Project” data, for example, has been available in the AWS cloud for some time and consists of more than 200 terabytes for just 1,700 participants. You can imagine the data volumes associated with more contemporary projects, such as the Million Human Genomes project.

But while you can get your hands on genome data, how do you go about analyzing it? The problem is that there’s a lot of it, and it’s very difficult to interpret.

Well, you certainly wouldn’t have to start from scratch; geneticists have written a number of libraries to calculate various common metrics of interest. For example, the “pybedtools” library for Python allows you to identify genes that show a given genetic variation.
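To give a feel for the kind of question such a library answers, here is a plain-Python sketch of the underlying idea: checking which gene intervals contain a variant position. This is not the pybedtools API. The `Gene` and `Variant` types, the function `genes_with_variants` and all of the coordinates are invented for illustration, and the simple nested loop would be far too slow for real genome-scale data.

```python
from collections import namedtuple

# Hypothetical record types; intervals are half-open: start <= pos < end.
Gene = namedtuple("Gene", "chrom start end name")
Variant = namedtuple("Variant", "chrom pos")


def genes_with_variants(genes, variants):
    """Return the names of genes containing at least one variant, in input order."""
    hits = []
    for gene in genes:
        for variant in variants:
            same_chrom = variant.chrom == gene.chrom
            if same_chrom and gene.start <= variant.pos < gene.end:
                hits.append(gene.name)
                break  # one overlapping variant is enough for this gene
    return hits


genes = [
    Gene("chr1", 100, 500, "GENE_A"),
    Gene("chr1", 900, 1200, "GENE_B"),
    Gene("chr2", 100, 500, "GENE_C"),
]
variants = [Variant("chr1", 250), Variant("chr2", 700)]

print(genes_with_variants(genes, variants))
```

With this made-up data only GENE_A is reported, because the chr2 variant falls outside GENE_C and nothing lands inside GENE_B.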

You could become a Python developer and write a few million lines of code on a big server to make use of this library. Alternatively, you could use Exasol’s in-memory analytic database (in the cloud or on your own servers) and import these genomic libraries so that you could build User Defined Functions around them.
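As a rough illustration of what the body of such a function might look like, here is a small Python sketch written in the style of a scalar UDF, with a `run(ctx)` entry point that reads an input column from a context object. Everything specific here is an assumption made for the example: the column name `sequence`, the choice of GC content as the metric, and the `SimpleNamespace` stand-in that lets the function run outside a database. Inside the database the function would also need to be registered with a script definition, which is not shown.

```python
from types import SimpleNamespace


def gc_fraction(sequence):
    """Return the share of G and C letters in a DNA string, or None if it is empty."""
    seq = sequence.upper()
    if not seq:
        return None
    return sum(1 for base in seq if base in "GC") / len(seq)


def run(ctx):
    # UDF-style entry point; `ctx.sequence` is a hypothetical input column.
    return gc_fraction(ctx.sequence)


# Stand-in for the context object the database would pass in for one row.
row = SimpleNamespace(sequence="ATGCGC")
print(run(row))
```

Keeping the calculation in an ordinary function like `gc_fraction` means it can be tested on a laptop first and then called unchanged from the UDF entry point.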

The upshot of this second approach is that your database queries run in memory and in parallel, and are therefore extraordinarily fast. You also have the benefit of being able to blend this “unstructured” genetic data with, for example, structured patient data, and to use the SQL language and mainstream business intelligence tools (such as Tableau) to give you great visualizations of the data without requiring lots of computer code.

More and more, we are talking to organizations with data requirements that extend well beyond traditional accounting data. Genetics is a growing area of interest, but our system is designed, through the use of our User Defined Function framework, to support any kind of data at all.

Why not have a look for yourself? You’ll be surprised at the kinds of analytics you can do with Exasol.