The data lake versus data warehouse debate has raged for over a decade. But is a resolution finally in sight with the so-called ‘data lakehouse’, or is this just another example of a new buzzword generating hype? Exasol’s Market Intelligence Lead, Helena Schwenk, investigates.
Why are we debating the data warehouse versus data lake yet again?
This question has been asked for years by data professionals, as the merits of one approach have been weighed up against the other. The debate, however, has grown ever more heated in recent years, thanks to the greater prevalence of the cloud as a location for data analytics workloads, the brittleness of Hadoop deployments and the emergence of a new concept: the data ‘lakehouse’.
If that phrase is new to you, you’re not alone – it’s a fairly new concept. But in short, the data lakehouse refers to a hybrid data architecture that aims to mix the best of a data warehouse and data lake.
Similarities, differences and overlaps
To understand how a data lake, data warehouse or data lakehouse can underpin a modern analytics infrastructure, it’s worth unpicking some of their similarities and differences. In terms of similarities, all are fundamentally used for the management of transactional and operational data that form the basis of BI and advanced analytical workloads. This means they’re used for a wide-ranging set of analytical use cases across both business and developer functions.
However, it’s also worth remembering that each serves different goals as borne out by their definitions.
Data warehouses are optimized for well-known, predefined and repeatable analytics needs that can be scaled across many users in the organization. As such, they are best suited to more structured and governed data – such as in the financial services or healthcare sectors. They also generally support a SQL processing strategy, and are suited to complex queries, high levels of concurrent access and high-performance data access requirements.
Data lakes collate raw or unrefined data captured from a diverse array of sources. They are also likely to contain data types that aren’t subject to rigorous governance. They support a range of different processing styles and approaches, including machine learning and batch-orientated workloads. Data lakes aren’t typically optimized for performance and the demands of production delivery – such as concurrency, latency and workload management.
Some overlaps do occur. For instance, a data warehouse can also be used for operationalizing data science, whereby machine learning models are run against governed data. And a data lake can introduce analysis using approaches that leverage star schemas for batch-orientated queries, for example.
On the other hand, data lakehouses aim to combine elements of data warehousing with core elements of the data lake. Put simply, they are designed to provide the lower costs of cloud storage for raw data alongside support for certain analytics concepts – such as SQL access to curated data tables, or support for large-scale processing of machine learning workloads.
While this sounds good in theory, the lakehouse is an emerging and immature concept – which means there are differing views on how best to realize it. The main reason for this is that there are proponents on either side of the architectural divide. Those on the data warehouse side of the fence build the lakehouse around relational technology concepts, while those on the data lake side have roots in machine learning and Spark processing, where support for processing Java, Python and R workloads is paramount.
Although the lakehouse is an interesting concept, it’s still very ill-defined and (unsurprisingly) subject to a lot of hype and speculation. And though a heated discussion will continue for some time, the lakehouse is unlikely to remove the need for either the data lake or the data warehouse or, more importantly, to overturn the enormous amount of innovation already in the market.
This is especially true when you consider the decades of data warehouse development seen in areas such as query and performance optimization, in-database analytics, columnar storage and compression.
The strategy for data democratization is co-existence
Choosing between a data warehouse or data lake or, dare I say, lakehouse need not be an either/or decision. You’re unlikely to find that replacing one with the other is an optimum solution. Instead, it’s about recognizing the similarities and differences between them and using each architectural design for its strengths, or even combining their uses.
There is often a lot of redundancy across data stores in an organization, so having the ability to store and process data across both a data lake and warehouse can help bring some order to the chaos. It’ll also enable businesses to scale their analytics projects more effectively and help democratise data within their organizations. This approach of co-existence draws on the strengths of each architectural design to serve a wider number of use cases than any of these architectures can support independently.
Another good example of this co-existence is when the insights generated in a data lake (or lakehouse) are propagated into a data warehouse to be consumed by a wider audience in a repeatable and scalable manner. Moreover, the need to access multiple analytical data stores in distributed locations can also be supported through data virtualization, enabling data to be federated across the data lake, lakehouse and the warehouse.
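As a rough illustration of that propagation pattern, the sketch below aggregates loosely structured records of the kind a data lake might hold and publishes the curated result to a typed table for wider SQL consumption. Everything in it is an assumption made for illustration: the sample events, the `customer_spend` table and the helper names are hypothetical, and an in-memory SQLite database merely stands in for a real data warehouse.

```python
import json
import sqlite3
from collections import defaultdict

# Hypothetical raw events as they might sit in a data lake: loosely
# structured JSON lines with inconsistent fields and no enforced schema.
RAW_LAKE_EVENTS = [
    '{"customer": "acme", "amount": 120.0, "channel": "web"}',
    '{"customer": "acme", "amount": 80.0}',
    '{"customer": "globex", "amount": 45.5, "channel": "store"}',
    '{"customer": "globex", "note": "no amount recorded"}',
]


def summarize_lake_events(lines):
    """Aggregate spend per customer, skipping records without an amount."""
    totals = defaultdict(float)
    for line in lines:
        record = json.loads(line)
        if "amount" in record:
            totals[record["customer"]] += record["amount"]
    return dict(totals)


def publish_to_warehouse(conn, totals):
    """Write the curated result into a typed table for repeatable SQL access."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer_spend ("
        "customer TEXT PRIMARY KEY, total_amount REAL NOT NULL)"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO customer_spend VALUES (?, ?)",
        sorted(totals.items()),
    )
    conn.commit()


# In-memory SQLite is only a stand-in for a real warehouse (assumption).
warehouse = sqlite3.connect(":memory:")
publish_to_warehouse(warehouse, summarize_lake_events(RAW_LAKE_EVENTS))

rows = warehouse.execute(
    "SELECT customer, total_amount FROM customer_spend ORDER BY customer"
).fetchall()
print(rows)
```

The split mirrors the division of labour described above: the exploratory, schema-tolerant work happens on the lake side, while the warehouse side receives only a small, well-defined table that many users can query in the same way.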
Delivering a modern data infrastructure requires flexibility
These examples illustrate the need for a more flexible approach, enabling analytics use cases that are well-defined and repeatable (via the data warehouse) alongside support for instances that are more experimental, machine learning-led and developer-led (via the data lake). Likewise, both can support different tasks and user roles across the spectrum, from non-technical and business-oriented to data scientist and developer-centric.
Trying to address the constantly changing data environment is a tall order, and many data professionals are consumed with trying to overcome the roadblocks to data access and availability. Making sure the data warehouse and data lake work together and not against each other is part of the process.
It’s likely too that the promise of the lakehouse offering a one-size-fits-all (but largely unproven) approach will initially be a step too far for the many who’ve invested significantly in data lakes and data warehouses. But this kind of architectural conversation is important to have. The future of data analytics and the architectures that underpin it will no doubt evolve in the coming years, so we’d love to hear your thoughts on how it’ll take shape.
Let us know what you think on our social channels – it’s going to be fascinating to see how the debate – and wider market – evolves next.
Helena Schwenk is Exasol’s Market Intelligence Lead and presents our DataXpresso podcast. Helena has more than 26 years of experience working in the data analytics field, having spent 18 years as an industry analyst specialising in big data, advanced analytics and AI.