Human language data has value, if you make an effort to extract it.
I’m currently (between running Proofs of Concept for clients) trying to find a way to demonstrate that our in-memory database EXASOL is far more than just SQL and is in fact a fully featured tool for Data Science.
The first area I wanted to explore is human language analysis, but unfortunately all of the standard natural language processing (NLP) examples are SO dull.
Fortunately, I came across the work of Omoju Miller who is a PhD candidate at UC Berkeley and who uses the lyrics of Hip Hop music as a fresh way of explaining computational thinking to a community that is woefully under-represented in (and under-impressed by) the IT world.
Her tool of choice is Python and her text comes from the song lyrics of Jay Z, who (WARNING!) often uses strong words that are distinctly “Not Suitable For Work”. In fact some of the words might be considered “Not Suitable For Anywhere”. So if that is an issue for you, then please use another text corpus for your data exploration – Barry Manilow lyrics or the speeches of Ronald Reagan perhaps.
What I particularly like about her work is that she uses Python, and so I can borrow some of her ideas to create Python User Defined Functions (UDFs) in EXASOL that allow me to do all the fancy language processing right inside the fastest database in the world.
Step One: Import some libraries
The “nltk” library for Python is the Natural Language Toolkit – it contains the basic libraries for analysing the use of words in documents. It isn’t installed as standard (because not every customer needs it) – but it is extremely simple to make it available within EXASOL.
I simply import it using the cluster management webpage:
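Once the package is available on the cluster, a quick sanity check is to call it from a throwaway UDF. A minimal sketch – the script name is my own invention, and I'm assuming the Python UDF syntax from the EXASOL documentation:

```sql
-- Hypothetical sanity check: confirm nltk is visible to Python UDFs
CREATE OR REPLACE PYTHON SCALAR SCRIPT check_nltk()
RETURNS VARCHAR(100) AS
import nltk
def run(ctx):
    # If the library failed to install, this script won't even compile
    return nltk.__version__
/

SELECT check_nltk();
```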
Step Two: Write User Defined Functions
I can now wrap my Python code (or some ideas I’ve “borrowed” from Omoju Miller) in user defined functions to take advantage of the features within the NLTK library. These include not only functions for loading data, but also text processing libraries for classification, tokenization, stemming, tagging, parsing, and semantic reasoning.
In other words everything I need to get some insight from a large volume of unpredictable but melodic text.
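As a sketch of what such a UDF might look like, here is a hypothetical scalar script that tokenizes one line of lyrics with nltk and emits one row per word. The names and column types are my assumptions, not a finished implementation, and `word_tokenize` needs nltk's "punkt" tokenizer data to be available on the cluster:

```sql
-- Hypothetical UDF: split a line of text into lower-cased word tokens
CREATE OR REPLACE PYTHON SCALAR SCRIPT tokenize_lyrics(line VARCHAR(2000))
EMITS (token VARCHAR(200)) AS
import nltk
def run(ctx):
    # One input row (a line of lyrics) becomes many output rows (words)
    for token in nltk.word_tokenize(ctx.line):
        ctx.emit(token.lower())
/
```

Because the script EMITS rows rather than RETURNS a single value, it can feed straight into an ordinary GROUP BY for word counts.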
Step Three: Load data, run queries
Easy – just load the lyrics from a flat file to a database table and then use these user defined functions within SQL select queries against that table.
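In outline, and assuming a one-column staging table plus a hypothetical tokenizing UDF called `tokenize_lyrics` that emits one row per word, the whole pipeline is only a few statements (file name and table layout are illustrative):

```sql
-- Load the lyrics from a local flat file into a table
CREATE TABLE lyrics (line_text VARCHAR(2000));
IMPORT INTO lyrics FROM LOCAL CSV FILE 'jayz_lyrics.csv';

-- Count the most frequent words using the hypothetical UDF
SELECT token, COUNT(*) AS uses
FROM (SELECT tokenize_lyrics(line_text) FROM lyrics)
GROUP BY token
ORDER BY uses DESC
LIMIT 20;
```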
Stop! Why would a Data Scientist spend time doing this???
The example using Jay Z song lyrics will likely soon become the subject of one of my semi-legendary YouTube videos. It’s a fascinating example and not at all trivial – Jay Z’s language is not “The Queen’s English”. He uses non-dictionary words with inconsistent spelling, and even this native English speaker sometimes finds him hard to understand.
The real prize for Data Scientists is the ability to gain insight from the human language data in and around their organisation. It is one thing to analyse the very structured data of a million toasters and fridges from the “Internet of Things”, but it’s quite another to work with human language, which has been written by different humans on different days when they are in different moods.
One massive source of human language data of interest to Data Scientists is Twitter, where NLP techniques have been used to measure the happiness of a company’s customers and the success of product launches.
Additionally, all companies hold vast amounts of internal human language data that is rarely analysed by computer. Fault logs, customer complaints, website comments – all of these are occasionally read by humans, but humans can’t read everything, and they can miss insights that only systematic computer analysis of the data would surface.
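The same word-counting idea applies to fault logs and complaints without any heavy machinery. A minimal, stdlib-only sketch of the kind of analysis I mean – the sample complaints and the tiny stop-word list are made up purely for illustration:

```python
from collections import Counter
import re

# Tiny illustrative stop-word list; a real analysis would use a fuller one
STOP_WORDS = {"the", "a", "is", "and", "to", "it", "of", "no", "for"}

def frequent_terms(texts, n=3):
    """Return the n most common non-stop-word tokens across a batch of texts."""
    words = []
    for text in texts:
        # Crude tokenizer: lower-case alphabetic runs (plus apostrophes) only
        words.extend(w for w in re.findall(r"[a-z']+", text.lower())
                     if w not in STOP_WORDS)
    return Counter(words).most_common(n)

complaints = [
    "The toaster is broken and the warranty expired",
    "Broken toaster, refund refused",
    "Warranty claim refused for broken kettle",
]
print(frequent_terms(complaints))
# → [('broken', 3), ('toaster', 2), ('warranty', 2)]
```

Swap the made-up list for a table of real complaint text and the pattern scales to millions of rows, especially once the counting happens inside the database.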
In short – human language data has value, if you make an effort to extract it.
Or as Jay Z would say:
“To turn that into something you gotta learn from Jay
You will get return in your investment if attention you pay”