Agile Data Engine - Blog

What's A Data Lakehouse? Can I spend my vacation there?

Written by Christoph Papenfuss | Apr 26, 2024 7:37:34 AM

Terminology in the Data Management Business

If you work in or come into contact with the Data Management Business, you'll notice that there are many buzzwords flying around. It can get quite messy with all the vendor- and influencer-imposed terminology. If you are planning to build some kind of Data Platform or improve your existing one, it can be quite a challenge to pick the right concepts for you in the first place (not talking about technology, tools and implementation yet). Furthermore, it can be difficult to separate a concept from the commercial offering names used by vendors. Therefore, we want to give you an overview of the general terminology and concepts that we think you will come across and that are useful to understand.

Why should I look into Data Management in the first place and spend my time learning the ideas behind the big words?

In the end, it comes down to better decision making and savings in time and costs. Companies want to make data-driven decisions and need data to be democratized. A data management platform can function as a single source of truth, meaning that departments no longer look at different versions of the same facts, which enables better collaboration. Furthermore, access to data is simplified and no longer restricted to higher management levels. Data can be prepared and provided, e.g., for analytics use cases. Storing not only up-to-date information, which obviously supports better decision making, but also historic data lays the foundation for future use cases.

However, no data management platform is automatically a sure-fire success. There is no product that you can just buy that solves all your problems. The platform has to be built. For that, you need to know what you want to achieve in order to get the basics of a good-quality data solution architecture right. The next sections take the first step and help you understand the general concepts in data management.

Data Warehouse – Older than dirt, but well established

For all the terms we describe below, including the data warehouse, it is about the concept and not about any technology. Let’s look at the definition provided by Bill Inmon in the 1990s that still holds: “A data warehouse is a subject-oriented, integrated, time-variant and non-volatile collection of data in support of management’s decision-making process”. A collection of data, i.e., a data storage. Certainly not just any storage or a simple copy, though, but one that is analytics-ready and supports data-driven decisions.

A data warehouse collects data from different sources. However, it is essential that it harmonizes that data and puts it into a consistent whole (integrated). So, it is not about the system the data comes from (there could be several), but about the matter it describes (subject-oriented). Essential for data-driven decisions are not only the availability and correctness of data, but also its timeliness. Data changes over time, so the data warehouse ties every record to a point in time and can show the up-to-date state as well as earlier ones (time-variant). Historic data is useful in analytics, too. Here the data warehouse shows its advantage: once loaded, data is kept rather than overwritten or deleted, so the warehouse can still provide historic information later on (non-volatile).
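To make the four properties a bit more tangible, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two source systems (a CRM and a webshop), their field names, and the CustomerVersion and CustomerHistory helpers are hypothetical and not taken from any product. The sketch harmonizes customer records from both sources into one shape (integrated, subject-oriented), only ever appends versions (non-volatile), and can answer what was known on a given day (time-variant).

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class CustomerVersion:
    """One harmonized version of a customer, valid from a given day."""
    customer_id: str
    city: str
    valid_from: date
    source: str


def from_crm(row: dict) -> CustomerVersion:
    # Hypothetical CRM export with its own field names and formatting.
    return CustomerVersion(row["CustNo"].strip().upper(), row["Town"].title(),
                           row["changed_on"], "crm")


def from_webshop(row: dict) -> CustomerVersion:
    # Hypothetical webshop export describing the same subject differently.
    return CustomerVersion(row["customer"].strip().upper(), row["city"].title(),
                           row["updated"], "webshop")


class CustomerHistory:
    """Append-only store: versions are added, never overwritten or deleted."""

    def __init__(self) -> None:
        self._versions: list = []

    def load(self, version: CustomerVersion) -> None:
        self._versions.append(version)

    def as_of(self, customer_id: str, day: date) -> Optional[CustomerVersion]:
        # Latest version that was already valid on the requested day.
        candidates = [v for v in self._versions
                      if v.customer_id == customer_id and v.valid_from <= day]
        return max(candidates, key=lambda v: v.valid_from, default=None)


history = CustomerHistory()
history.load(from_crm({"CustNo": " c-1 ", "Town": "HELSINKI",
                       "changed_on": date(2023, 1, 10)}))
history.load(from_webshop({"customer": "C-1", "city": "tampere",
                           "updated": date(2024, 3, 1)}))

print(history.as_of("C-1", date(2023, 6, 1)))  # the Helsinki version
print(history.as_of("C-1", date(2024, 4, 1)))  # the Tampere version
```

A real data warehouse would of course do this in database tables with a proper modelling method rather than in Python objects. The point is only that both sources end up in one consistent customer shape and that older versions remain available.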

Data Warehouse Essentials: Central Hub for Analytics and Data-Driven Decisions

To sum it up, a data warehouse is a central location that joins and harmonizes data from various sources. It always contains up-to-date information, but also stores historic data. As such, it is the central go-to place for analytics and data-driven decisions.

Some also define a data warehouse by the type of data it can contain. However, with emerging technologies this differentiation becomes more and more vague. Generally speaking, a data warehouse can store structured data, i.e., data that can be stored in tables. For easier understanding, just think of data that you could store in Excel. To a certain extent, data warehouses can also handle semi-structured data, e.g., data stored in JSON or XML files. The third category is unstructured data, e.g., PDFs or images. However, this leads us straight to the next section on data lakes.
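As a small illustration of the difference between semi-structured and structured data, the following Python sketch flattens a nested JSON document into a single table-like row. The flatten helper, the sample order, and the underscore naming of the resulting columns are our own assumptions for this example, not a standard.

```python
import json


def flatten(record: dict, prefix: str = "") -> dict:
    """Turn nested objects into flat column names such as customer_name."""
    flat = {}
    for key, value in record.items():
        column = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects and prefix their keys.
            flat.update(flatten(value, prefix=f"{column}_"))
        else:
            flat[column] = value
    return flat


# Hypothetical semi-structured input, e.g. one order from a webshop API.
raw = ('{"order_id": 17, "total": 99.5, '
       '"customer": {"name": "Ada", "address": {"city": "Helsinki"}}}')

row = flatten(json.loads(raw))
print(row)
```

The resulting row has the columns order_id, total, customer_name and customer_address_city, so it could be stored in a table (or in Excel). Lists inside the JSON are left untouched here; they would typically need their own table.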

Data Lake – The thing that I choose if I don’t want to waste my time modelling and organizing data?

Data Lake: The Smart Choice for Time-Efficient Data Management

As mentioned above, the term unstructured data leads us directly towards the term data lake. This is also a widespread and established concept in data management today. It was introduced to overcome the challenges of exploding data volumes and to serve the need for using that unstructured data in various use cases. The trend was also made possible because cloud storage became much cheaper.

A data lake has a somewhat similar purpose to a data warehouse in storing all data in a central repository. But as a concept, it is more loosely defined than a data warehouse, as you can see, for example, from this definition by Gartner: “A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores”.

Some might think that the concept of a data lake sounds desirable since it contains data in its raw format and therefore no modelling or overhead tasks are needed. At this point, we would like to mention that this is a fallacy. Data lakes are also in strong need of data modelling and governance concepts (more in our next blog post), as they tend to become data swamps if data is just “dumped”.
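The following Python sketch shows one very small way to avoid plain "dumping". It is a sketch under our own assumptions: the land_raw function, the folder layout (source, then load date) and the catalog fields are hypothetical, and a temporary local folder stands in for cloud storage. The payload is stored as an exact copy of the source format, and a catalog entry records where it came from, when it arrived and its checksum.

```python
import hashlib
import tempfile
from datetime import datetime, timezone
from pathlib import Path


def land_raw(lake_root: Path, source: str, filename: str,
             payload: bytes, catalog: list) -> Path:
    """Store the payload unchanged and describe it in the catalog."""
    loaded_at = datetime.now(timezone.utc)
    target = lake_root / source / loaded_at.strftime("%Y-%m-%d") / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(payload)  # exact copy, no transformation
    catalog.append({
        "path": str(target),
        "source": source,
        "loaded_at": loaded_at.isoformat(),
        "sha256": hashlib.sha256(payload).hexdigest(),
    })
    return target


catalog = []
payload = b'{"order_id": 17, "total": 99.5}'

# A temporary directory stands in for the lake's storage in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    path = land_raw(Path(tmp), "webshop", "orders.json", payload, catalog)
    stored = path.read_bytes()

print(catalog[0]["source"], catalog[0]["sha256"][:12])
```

Even this little bit of metadata answers the first questions people ask of a lake: what is this file, which system did it come from, and has it changed since it was loaded.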

Data Lakehouse – The thing that sounds like a desirable holiday location

Data Lakehouse: Merging the Best of Data Lake and Data Warehouse

This is probably the fanciest-sounding name. As you have most likely noticed above, with evolving technologies the boundaries between the two concepts of data warehouse and data lake have become blurrier in practice. It is no longer as easy as, e.g., differentiating by the type of data that is stored, since the two overlap here. One prevailing convention is that a modern data platform must have both data lake and data warehouse capabilities, and that one task of a data lake is to be a universal staging area for all the data. For instance, the term data lake can be used to describe the landing zone of a data warehouse (the stage to which data from different sources is replicated before it is modelled).

This shows how the terms can no longer be clearly differentiated. The best-known combined concept nowadays is the data lakehouse. It is pretty much what the name says: a concept that combines elements from both the lake and the warehouse. So it is basically a fancy name for something that is very logical to do.

Data Mart – Simply inescapable

Data Mart: An Essential Component in Data Management Projects

You are most likely to encounter a data mart in your data management projects, at least if you are doing your data modelling right (more on this in the next blog post). A data mart can be thought of as a subset of a data warehouse. Its aim is to be application-centric. Your data warehouse might contain one set of harmonized and normalized data that is relevant for several use cases. A data mart takes that data and prepares it so that it fits the needs of the business and its users and the requirements of the target system (e.g., different BI tools). In other words, there could be several data marts that are all based on the same data but prepared differently.
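Here is a minimal sketch of that idea, using Python's built-in sqlite3 module as a stand-in for a real data warehouse. The sales table, its rows and the two view names are made up for this example. Both data marts are views on the same harmonized data; they just prepare it for two different consumers.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Harmonized warehouse data (hypothetical example rows).
    CREATE TABLE sales (order_id INTEGER, region TEXT, product TEXT, amount REAL);
    INSERT INTO sales VALUES
        (1, 'North', 'Bike',   500.0),
        (2, 'North', 'Helmet',  50.0),
        (3, 'South', 'Bike',   450.0);

    -- Data mart 1: revenue per region, e.g. for a management dashboard.
    CREATE VIEW mart_sales_by_region AS
        SELECT region, SUM(amount) AS revenue
        FROM sales GROUP BY region;

    -- Data mart 2: orders and revenue per product, e.g. for product managers.
    CREATE VIEW mart_sales_by_product AS
        SELECT product, COUNT(*) AS orders, SUM(amount) AS revenue
        FROM sales GROUP BY product;
""")

by_region = con.execute(
    "SELECT region, revenue FROM mart_sales_by_region ORDER BY region"
).fetchall()
by_product = con.execute(
    "SELECT product, orders, revenue FROM mart_sales_by_product ORDER BY product"
).fetchall()
con.close()

print(by_region)
print(by_product)
```

Whether a data mart is built as a view or as a physically stored table is a design decision that depends on data volumes and the target system. The sketch uses views only because they are the shortest way to show the principle.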

Data Mesh – The new kid on the block

Exploring Data Mesh: Decentralizing Data Management in 2024

Data mesh is the newest term included in this blog post. It was introduced by Zhamak Dehghani in 2019. The first important thing to notice is that a data mesh is not a successor of a data warehouse or a data lake or any combination of them. It is also not a silver bullet that will solve all your problems. But let us discuss what it actually is. So far, the idea behind any analytics platform was almost always to have one centralized system around one central data team.

However, it turned out that this could become a bottleneck when analytical requests increase and the data team cannot handle them all. Therefore, a data mesh follows the idea of a decentralized, domain-driven architecture. The domain team takes responsibility for its data and its data management. The data is then published, as a product, to consumers beyond the domain. That means that the central data team enables the domain teams to consume and create data products. Obviously, this also means that good governance principles and standardization are needed.
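To illustrate how publishing a data product and a bit of governance can go together, here is a small Python sketch. It is not taken from any data mesh tool: the DataProduct and Catalog classes, the required metadata fields and the example domains (sales, logistics) are all hypothetical. Domain teams publish their products to a shared catalog, and the catalog only accepts products that come with an owner, a description and a schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str
    owner_domain: str   # the domain team responsible for the data
    description: str
    schema: dict        # column name -> type, the contract for consumers


class Catalog:
    """Shared catalog that enforces a minimal publishing standard."""

    def __init__(self) -> None:
        self._products = {}

    def publish(self, product: DataProduct) -> None:
        missing = [field for field in ("name", "owner_domain", "description")
                   if not getattr(product, field)]
        if not product.schema:
            missing.append("schema")
        if missing:
            raise ValueError(f"cannot publish, missing: {', '.join(missing)}")
        self._products[product.name] = product

    def discover(self, owner_domain=None) -> list:
        # Consumers from any domain can look up what has been published.
        return sorted(p.name for p in self._products.values()
                      if owner_domain is None or p.owner_domain == owner_domain)


catalog = Catalog()
catalog.publish(DataProduct("orders", "sales", "Confirmed customer orders",
                            {"order_id": "int", "amount": "float"}))
catalog.publish(DataProduct("shipments", "logistics", "Shipped parcels",
                            {"shipment_id": "int", "order_id": "int"}))

print(catalog.discover())         # all published products
print(catalog.discover("sales"))  # products owned by the sales domain
```

In this picture, the central team owns the catalog and its rules, while the domain teams own the products they publish.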

Data Powerhouse – The term to make you smile

Unlocking the Potential of Data Powerhouse: Beyond Just a Buzzword

Okay, to be honest, this one is in here more for fun and to show you that there are many terms around that more or less recombine what already existed. Similarly, I recently saw the term “Data Ocean” somewhere. That gets you thinking on many layers and actually made me smile. But back to the Data Powerhouse: the term has been used to describe the combination of Microsoft’s “Power Platform” with a data warehouse or a data lake.

Therefore, it is a good example of our claim in the introduction that it can be hard to differentiate between a concept and the names of commercial offerings. However, some people also use the term to describe what your company can become when it implements an efficient data infrastructure.

Which fancy name should I pick now?

Next steps for your data warehouse journey

In the end, you can call it what you want, as long as you get it right, it suits your needs, and you can explain it to the people you work with so that you are all on the same page. Agile Data Engine can help you with building and operating a resilient data warehouse or data lakehouse. Make sure to check out our popular white paper about this topic. Check this blog for some posts about a crucial ingredient for a resilient data warehouse: data modeling.

You want to read this blog post in German? Check out this and other blog posts on the website of our implementation partner.

Want to learn more, network with peers, and optimize your data function? Join us in Düsseldorf on April 23 for our Data Vault Experience Workshop.