Data science – you reap what you sow, that’s how you have a golden harvest

Everyone loves big data analytics. But too many people believe that analysis on big data works like alchemy where you take unstructured “debris-type” data, chant some obscure arcane incantations (to be fair, many people would be hard pressed to see a difference between neural networks and magic) and turn it into golden insights. But, using a tired but accurate phrase, data analytics works based on the “garbage-in-garbage-out” principle. Haphazardly adding more stones to your alchemy analytics project might fill your pockets, but with cobbles instead of cash.

Bad input from inferior data, improper sample sizes or incorrect model specification represent a prime reason some data analytics projects fail to meet expectations or even fail altogether. Worse yet, if new adopters fail their first project because of a flawed approach, some give up immediately, thinking “analytics just isn’t for them.” Trying to do too much too fast is frequently to blame.

As a rule of thumb, if you have enough data processing power (and if you are using EXASOL, you do) to think about diving straight into big data analytics, consider doing “big analytics” on just “data” first. “Quick wins” may be a buzzword but a single successful project can provide the practical experience needed for doing analytics on large data sets. Start small but smart, then when you’re ready, go big and win big.

The elephant in big data’s room – No, not Hadoop

Successful data analysis requires the careful consideration of several factors. Let’s start with data size. Why do people think bigger data equals better data? To answer that, we need to consider which types of error exist in data. Errors fall into two groups: random and systematic. As the name suggests, random error is the variation in data that happens by chance, and it shrinks as you collect more data. That is the real appeal of “big data”: more observations mean less random error.

But let’s be honest here: do you really need the data from ten thousand customers if five hundred would do? The answer is that it depends. Always keep in mind that given an ample sample size, which is dependent on other factors we’ll get into later, more data really isn’t that much better due to diminishing marginal returns. People notoriously misjudge how large a sample they need for an analysis. Being super-really sure that results aren’t down to chance, rather than just really sure, isn’t going to make anyone any money, but storing all that data will cost your organization more.
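To put rough numbers on those diminishing returns, here is a minimal Python sketch. It assumes the simplest possible case, which is not taken from any particular project: you estimate a share (say, the fraction of customers who prefer a feature) from a simple random sample and use the textbook normal approximation at 95% confidence. The function name `margin_of_error` and the sample sizes are purely illustrative.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate 95% margin of error for a share p estimated from n observations.

    Assumes a simple random sample and the normal approximation;
    p=0.5 is the worst case (widest margin).
    """
    return z * math.sqrt(p * (1 - p) / n)

# Illustrative sample sizes, not recommendations.
for n in (500, 2_000, 10_000, 100_000):
    print(f"n={n:>7}: +/- {margin_of_error(n) * 100:.2f} percentage points")
```

Going from 500 to 10,000 customers is twenty times the data, yet the margin of error only shrinks by a factor of about 4.5 (the square root of 20). Whether that extra precision is worth the storage and processing depends on the decision you are trying to make.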

In contrast, systematic error is independent of sample size. Such error is variation in a data set that isn’t caused by chance, but by some inherent bias of the contained variables/data. If the way you collected your data means you’ll have twice as many men as women in your data set, no amount of data is going to save you. Worse still, if it isn’t something you might identify, such as gender, but a variable you didn’t or can’t always track like income, realizing there is systematic bias at work in the dataset is difficult.
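A small simulation makes the point. This is only a sketch under an invented assumption: the true population is half women, but the collection process happens to include men twice as often. The function name `biased_sample_share` and the 2:1 weights are hypothetical.

```python
import random

def biased_sample_share(n, seed=0):
    """Share of women in a sample of size n drawn with a 2:1 bias towards men.

    The true population share is assumed to be 0.5; the weights below
    encode the hypothetical flaw in how the data was collected.
    """
    rng = random.Random(seed)
    draws = rng.choices(["m", "f"], weights=[2, 1], k=n)
    return draws.count("f") / n

# More data makes the estimate more stable, but not more correct.
for n in (100, 10_000, 1_000_000):
    print(f"n={n:>9}: estimated share of women = {biased_sample_share(n):.3f}")
```

As the sample grows, the estimate settles ever more confidently around one third rather than one half. The random error disappears; the systematic error stays exactly where it was.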

Indeed, recognizing these biases in the data becomes harder the larger your dataset gets. This is because one of the most straightforward ways to identify systematic bias in your data is to simply look at it yourself. Naturally, tens of thousands or even millions of observations make this difficult to do. Even if you end up using your database to filter your data for telltale signs of a bias, having a bigger data set means that analysis will take longer (even if you run it on the world’s fastest in-memory database).

Don’t cut yourself on Occam’s razor

By the same token, a common mistake made by novice analysts is running too many tests on the same data set. Even terabyte-stuffed datasets will show you correlations that are purely down to chance if you test often enough. This is because a naïve significance test assumes it is the only test being run on the data: at a 5% significance level, roughly one in twenty tests of a nonexistent effect will still come out “significant”. Yet many analysts will simply keep running analyses until they find a significant correlation. As the economist Ronald Coase put it: “If you torture the data long enough, it will confess.” One way to deal with this problem is to use procedures that correct for testing many hypotheses. Approaches include controlling the family-wise error rate or the false discovery rate, among others, but take care to choose the method best suited to the particular task. Applying such a correction need not be hard: for example, using R with UDFs in EXASOL is as simple as running a function in your database.
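To show what such a correction does, here is a minimal sketch in plain Python (not the R/UDF route mentioned above) of two standard procedures: the Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg step-up procedure, which controls the false discovery rate. The p-values are invented for illustration, and the function names are our own.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose sorted p-value satisfies p_(k) <= k / m * alpha.
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff = rank
    # Reject every hypothesis up to and including that rank.
    reject = [False] * m
    for idx in order[:cutoff]:
        reject[idx] = True
    return reject

# Hypothetical p-values from ten tests on the same data set.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]

print("uncorrected:       ", sum(p <= 0.05 for p in p_values), "significant")
print("Bonferroni (FWER): ", sum(bonferroni(p_values)), "significant")
print("Benjamini-Hochberg:", sum(benjamini_hochberg(p_values)), "significant")
```

With these made-up numbers, five results look significant before correction, one survives Bonferroni and two survive Benjamini-Hochberg. Bonferroni is the stricter of the two, which is why the choice depends on how costly a false positive is for the task at hand.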

While simple models are easy to explain, they aren’t always the best choice. Models that are too simple can come with a host of problems such as confounding variables. When undetected, a confounder makes it seem as if the variable you are looking at is doing the work, when it is really just caught up in the effects of another variable you didn’t even know about. Variables in your model could also be influencing each other, creating more complex relationships. Ways to fix these problems include proxy variables for confounding, as well as nested models and interaction terms, which deal with non-independent variables by specifying the relationship explicitly. And although you need to check whether any of these “problems” apply, you also need to keep your model as lean as possible. Why?
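Confounding is easier to believe once you have seen it happen. The following sketch simulates it with invented data: a hidden variable `z` drives both `x` and `y`, while `x` has no effect on `y` at all. Removing the linear effect of `z` from both (a simple partial correlation, chosen here for illustration rather than taken from any specific workflow) makes the apparent relationship vanish.

```python
import random

def pearson(a, b):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / (va * vb) ** 0.5

def residuals(y, z):
    """What is left of y after subtracting a least-squares line fitted on z."""
    n = len(y)
    my, mz = sum(y) / n, sum(z) / n
    slope = (sum((a - mz) * (b - my) for a, b in zip(z, y))
             / sum((a - mz) ** 2 for a in z))
    return [b - my - slope * (a - mz) for a, b in zip(z, y)]

rng = random.Random(42)
z = [rng.gauss(0, 1) for _ in range(5000)]   # hidden confounder
x = [zi + rng.gauss(0, 1) for zi in z]       # driven by z
y = [2 * zi + rng.gauss(0, 1) for zi in z]   # driven by z, not by x

raw = pearson(x, y)
partial = pearson(residuals(x, z), residuals(y, z))
print(f"correlation of x and y:               {raw:.2f}")
print(f"correlation after controlling for z:  {partial:.2f}")
```

The catch, of course, is that this only works if you have measured `z` or a reasonable proxy for it, which is exactly why undetected confounders are so troublesome.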

Because overfitting is a problem too. If you only focus on making a model fit your data as accurately as you can, you can end up with a complex model that fits the data very well but is needlessly complicated and won’t actually help you uncover any insight into a causal relationship. Why would you need to rank a website for search using 200 criteria (according to Google) when only a couple will have a profound effect?

When looking at potential causes, some will be more convincing than others. Simply ranking them by their effect size and then picking, say, the first five or ten will yield most (but obviously not all) of the predictive power available from the model. One way of judging where to stop is a scree plot. While a linear relationship with a single variable won’t tell you the full story, every additional variable should be scrutinized carefully. Too many variables will leave you open to multicollinearity and other things that you don’t want in your model. If you are doing machine learning, separating your data into a training and validation set is critical too.
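Here is a minimal sketch of why that split matters. It uses a k-nearest-neighbour average as a stand-in for "a model" simply because it fits in a few lines; the data is simulated (a sine curve plus noise), and all names and settings are illustrative assumptions. With `k=1` the model memorizes the training data; with `k=15` it smooths over the noise.

```python
import math
import random

def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k

def mse(train, data, k):
    """Mean squared error on `data` of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)

# Simulated data: a sine curve plus random noise.
rng = random.Random(1)
xs = [rng.uniform(0, 6) for _ in range(600)]
data = [(x, math.sin(x) + rng.gauss(0, 0.5)) for x in xs]

# Hold out half of the observations for validation.
rng.shuffle(data)
train, validation = data[:300], data[300:]

for k in (1, 15):
    print(f"k={k:>2}: train MSE={mse(train, train, k):.3f}, "
          f"validation MSE={mse(train, validation, k):.3f}")
```

The memorizing model scores a perfect zero error on the data it has already seen and does noticeably worse on the held-out half. Judged on training error alone it would look like the clear winner, which is precisely the trap.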

Now, assuming you have actually managed to find what looks like a significant causal relationship (lucky you!), hold off on that salary review a little longer. Let’s say you’ve drawn two interesting conclusions from your latest analyses. The first, larger effect shows that in surveys, customers strongly prefer round-the-clock support. But offering it is simply not possible given the size of the support team (fortunately at EXASOL, we are available 24/7, so go ahead and call us at 3am if you happen to find yourself counting bugs instead of sheep). The second, smaller effect shows that in A/B testing, website visitors slightly preferred a different shade of green on your website. Changing that should be a reasonable effort for the person responsible. A great idea that can’t be put into practice isn’t nearly as valuable as a modest improvement that can be implemented immediately. Data on its own won’t tell you how feasible a change is, but feasibility is nevertheless critical to assess if the analysis is going to pay off.

To wrap up, it is important not just to collect a lot of data, but also to know what kind of data it is and how you are going to analyze it. If you want to learn more about how to turn data into value, go ahead and download our whitepaper here.

Big Data Whitepaper: “Turning data into value”
