Data Lake Scaling Issues: Is Data Mesh the Answer?

It’s good enough for Netflix and Exasol customer Zalando, but is data mesh architecture the right approach for your organization and its data democratization journey? Exasol CTO Mathias Golombek investigates:

– What data mesh architecture is
– Why you might implement a data mesh
– The pros and cons of data mesh
– The pioneers of data mesh architecture

In my recent blog series I delved into one of 2021’s hottest data topics – data democratization – exploring how it can fit into a business’ overarching data strategy along with some practical advice on how to implement data democratization in your own organization. 

For today’s follow-up, I’m introducing another contemporary data concept – the data mesh. I’ll explore the link between data democratization and data mesh as a means to connect siloed data and create a self-service data infrastructure that makes data highly available and easily discoverable for the people who need it.

To be clear, I’m not advocating data mesh as a silver bullet for all the issues people experience with data lakes. It’s a concept that works for some, not everyone. Ultimately, you’ll need to make up your own mind.

So, let’s get started.

What is data mesh architecture?

The cloud is one of the most disruptive drivers, if not the most disruptive driver, of radically new data-architecture approaches. But to fully understand what’s driving the need for data mesh, we need to appreciate the mess many organizations find themselves in when they try to scale their data.

Ananth Packkildurai’s article in Data Engineering Weekly contains a great analogy for the sad state of data infrastructure in many organizations. He likens the modern data generation process to the equivalent of writing a dictionary without any definitions, shuffling the words up randomly and then hiring expensive analysts to try and make sense of it all. While this analogy certainly doesn’t apply to every organization, it definitely resonates – and is at the core of why the data mesh principle has gained such a following over the last few years.

To write about data mesh and not acknowledge the ground-breaking work of its creator, ThoughtWorks consultant Zhamak Dehghani, would be unforgivable. Her papers: How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh and Data Mesh Principles and Logical Architecture have become required reading on the topic and I urge you to check them out, if you haven’t already.

Why implement a data mesh?

To summarize, Dehghani’s data mesh theory argues that data platforms based on traditional data warehouse or data lake models have common failure modes that mean they don’t scale well. Instead of centralized lakes or warehouses, data mesh advocates a shift to a more decentralized and distributed architecture that fuels a self-serve data infrastructure and treats data as a self-contained product.

Dehghani maintains that as your data lakes grow, so too does the complexity of the data management involved. In a traditional lake architecture you’ve typically got producers of data who generate it and send it into the data lake. However, the data consumers down the line don’t necessarily have the same domain knowledge as the data producer and therefore struggle to understand it. The consumers then have to go back to the data producer to try and understand the data. Depending on whether the producer is a person or a machine, the required level of human domain expertise may or may not be available.

By treating data as a product, data mesh pushes data ownership responsibility to the team with the domain understanding to create, catalog and store the data. The theory is that doing this at the data creation phase brings more visibility to the data and makes it easier to consume. As well as stopping any human knowledge silos forming, it helps to truly democratize the data because data consumers don’t have to worry about data discovery and can focus on experimentation, innovation and producing more value from the data.
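
To make the "data as a product" idea a little more concrete, here is a minimal Python sketch of a domain team publishing a data product with its definitions attached, and a consumer discovering it without going back to the producer. Everything in it is an assumption made for illustration: the DataProduct and Catalog classes, the field names and the example "checkout" team are hypothetical and do not come from Dehghani’s papers or from any particular product.

```python
from dataclasses import dataclass, field


@dataclass
class DataProduct:
    """Metadata a domain team publishes with its data (hypothetical schema)."""
    name: str
    owner_team: str
    description: str
    schema: dict  # column name -> human-readable definition
    tags: frozenset = field(default_factory=frozenset)


class Catalog:
    """Toy in-memory catalog; a real mesh would rely on a shared catalog service."""

    def __init__(self):
        self._products = {}

    def register(self, product):
        # Reject undocumented columns so definitions are captured at creation time.
        missing = [col for col, text in product.schema.items() if not text.strip()]
        if missing:
            raise ValueError(f"columns without definitions: {missing}")
        self._products[product.name] = product

    def search(self, tag):
        # Consumers discover products by tag instead of asking the producer.
        return sorted(p.name for p in self._products.values() if tag in p.tags)


catalog = Catalog()
catalog.register(DataProduct(
    name="orders_daily",
    owner_team="checkout",
    description="One row per order per day",
    schema={
        "order_id": "Unique order identifier",
        "total_eur": "Order total in euros, including VAT",
    },
    tags=frozenset({"sales", "orders"}),
))
print(catalog.search("sales"))  # prints ['orders_daily']
```

The design choice worth noticing is that registration fails when a column has no definition, which is one simple way to force domain knowledge to be written down by the team that has it.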

Approach the data mesh with caution 

That’s the theory, anyway. But despite data mesh architecture gaining a lot of traction, there are concerns in the industry about its application. And of course there are plenty of strong advocates for the benefits of data warehouses and lakes. Going a stage further, my colleague Helena Schwenk recently blogged on the new concept of the data ‘lakehouse’ as a means of increasing the flexibility of modern data infrastructures.

As I said at the start, data mesh isn’t a panacea. But if you do go down this route, getting your tech stack right – or as right as possible – will be crucial to your data mesh efforts. You need a very powerful central system that can handle all this diverse access, which is where the simplicity and performance of the Exasol database come in.

Learning from the pioneers 

If you’re looking to implement the data mesh architecture, let me share a few examples of companies you can learn from, who’ve been very open and transparent about their journeys. 

Netflix processes trillions of events and petabytes of data a day. As it has scaled up original productions, data integration across the streaming service and the studio has become a priority. Integrating data across hundreds of different data stores in a way that lets it holistically optimize cost, performance and operational concerns presented a significant challenge, so Netflix turned to data mesh. This great YouTube video explains more.

Europe’s biggest online fashion retailer – and Exasol customer – Zalando has also been on a journey from a centralized data lake towards embracing a distributed data mesh architecture. Here’s another great YouTube video from NDC Oslo where Max Schultze outlines Zalando’s ongoing efforts to make creation of data products simple.

I’d love to hear your thoughts on data mesh architecture as well, so get in touch on social media and let us know what you think!

Mathias Golombek is Exasol’s CTO. He joined in 2004 as a software developer, led the database optimization team and became a member of the executive board in 2013. Although Mathias is primarily responsible for Exasol’s technology, his most important role is to create a great environment in which smart people enjoy building powerful products.