How do you choose the right Data Modeling approach?

Apr 26, 2024 10:38:32 AM

Data modeling is critical.

But what are my options?

In the last blog post we argued that data modeling is very relevant today. Simply having the possibility to store large amounts does not mean you should just “dump” it in your data platform. On the contrary, we believe you should do proper data modeling if you have large amounts of data. If you have read our last blog post, you might remember that data modeling provides a plethora of benefits such as easier access to information, improved query performance and lower costs. But nothing is free in life - data modeling comes with challenges such as understanding business requirements and choosing the best modeling approach. In this blog post, we will dig deeper and shine a light on the stages of data modeling and some common approaches.

The Three Musketeers of Data Modeling

Data modeling is usually done in three stages. Each one of them is equally important and should not be skipped.

 

Conceptual Modeling: Understanding the building bocks

Conceptual data modeling is driven by business requirements and stakeholder input, aiming to capture the semantics of the business domain. It serves as a foundation for more detailed logical data modeling, which defines how data will be structured within a specific database management system (DBMS). You jointly uncover general business concepts, attributes, entities and their relations. This step is completely devoid of any technology discussions. The output is a high-level overview of the system to be built and serves as input for the logical data model.

 

Logical Modeling: Detailing Data Beyond Basics

The purpose of a logical data model is to provide a detailed blueprint of how data will be structured within a DBMS. It defines the logical relationships and constraints between data elements, entities, and attributes without being tied to a specific database technology. Attributes can be described by their data type, length and precision. Naming conventions still rely on business names for objects. The logical data models facilitate communication between stakeholders, database designers, and developers by providing a common understanding of the data structure and relationships within the system.

 

Physical Modeling: The Technological Implementation Phase

A physical data model is a representation of how data is stored and accessed within a specific DBMS. It translates the logical data model into the actual database schema, including tables, columns, indexes, and constraints. Physical data models consider implementation details such as data types, storage allocation, and performance optimizations to meet system requirements efficiently. These models serve as the basis for database implementation and provide guidance for database administrators in configuring and maintaining the underlying database system.

 

Enhancing Clarity with Entity Relationship Diagrams

You can use entity relationship diagrams for drawing your models (see also next section) with increasing details as you go along the phases. This gives you an easily readable and standardized representation.

 

Data Modeling: The Journey from Hierarchies to Graphs

Data  models have evolved over time as database technology has changed. We went from hierarchical models via network data models to relational data models. The concept of relational data models was first introduced by Edgar F. Codd in 1970. It organizes data in tables that have rows and columns. Rows represent unique records while columns describe attributes. This is a well-known concept that is still in use today. Earlier we mentioned entity relationship diagrams. They are a form of relational model and depict entities, their attributes and the relationships between those entities. Another important approach that we will look at in the next section is   dimensional modeling. It is mainly used in data warehouses and data marts (see first blog post). It utilizes a so-called star schemawhere the center of it is a fact table that contains transactions or other events. This could, e.g., be purchasing events of products. The fact table is then connected to dimensional tables around it that, e.g., contain details about the products purchased or about the customers purchasing it. For the sake of completeness, it should be mentioned here that there is also the Snowflake schema which follows the same logic, but could also have multiple layers of dimensional tables.

Needless to say, there are more modeling techniques which we will not dive into today.

 

A Tale of Three Methodologies: The Big Names: Inmon vs. Kimball vs. Data Vault – What are my Design Options?

The big & classic names that you come across when you investigate data modeling for data warehouses are Inmon and Kimball. A relative newcomer to the scene is the Data Vault 2.0 concept. In contrast to the other two, it is not named after its inventor, which in this case would be Dan Linstedt.

Inmon, who is often called “the father of data warehousing”, takes a top-down approach. He focuses on building the data warehouse in third normal form first for the entire enterprise (enterprise data warehouse). In doing so a single source of truth is created. For the different business areas data marts are created that all have the data warehouse as their source. Obviously, this causes a large upfront investment since entire business processes and requirements for a complete enterprise need to be clear and understood before modeling.

 

Kimball vs. Inmon: Navigating the Trade-offs of Bottom-Up and Top-Down Data Warehouse Design Strategies

Kimball on the other hand follows a bottom-up design. He described his approach: "...the data warehouse is nothing more than the union of all the data marts". Data marts are built based on the business requirements. This means that the data warehouse layer is already in a dimensional form (star schema or snowflake schema, see above). Meaning in Kimballs approach the data warehouse is built by one to many business specific data marts. Kimball tends to have a lower footprint but sacrifices the idea of a single source of truth to some extent. Inmon was not a big fan of this approach though: "You can catch all the minnows in the ocean and stack them together and they still do not make a whale”. Regardless of the differences, both aim to establish an enterprise-wide data warehouse to facilitate business intelligence (and more). In real life, it’s not always possible to make a clear distinction between the two and hybrid approaches often make perfect sense.

 

Data Vault 2.0: Flexibility and Efficiency in Modern Data Architecture

And then there is Data Vault 2.0 (which is also supported by Inmon). The basic idea is to provide a highly scalable, flexible, and resilient modeling approach for building data warehouses that can easily adapt to changing business requirements and accommodate large volumes of data. It is comparable to the architecture of Inmon as it also follows the idea of having a normalized data warehouse and business specific data marts that are fed through the warehouse. So called Hubs, Satellites and Links are the essential building blocks (types of tables) in a Data Vault design. Briefly described, Hubs are the entities, Satellites their attributes and Links are used to model the relationships between the entities. One of the core advantages of a Data Vault is that it is very modular and flexible. While very powerful & effective, Data Vault is also not a silver bullet for all organizations. Our next blog post will look into Data Vault 2.0 in detail.

 

The Number Three: Three Phases of Modeling and Three Main Model Designs

There you have it! We started with the three phases of data model development: conceptual model, logical model and physical model. We highly recommend going through each phase step by step to ensure that you achieve the results that you need. The entity relationship diagram is well suited to work on this. When it comes to the actual design part, our partner INFORM DataLab will be happy to work with you to find the right approach amongst the names mentioned here (Inmon, Kimball and Data Vault). Variations and mixed approaches are common.

Regardless of your desired modeling approach, we at Agile Data Engine offer a unique data warehouse automation platform that supports all different modeling approaches and ensures your data warehouse has a solid foundation to build upon along with the resilience to adapt and support your business needs in the long term.

If you want to read this blog post in German, schau Dir diesen und andere Blog Posts auf der Webseite unseres Implementierungspartners Inform Datalab an.

Check out our next post going into Data Vault 2.0 in detail as we believe it is a very good design to follow when integrating multiple sources, and if you're curious to continue diving into Data Vault while networking with peers, join us in Düsseldorf on April 23 for our Data Vault Experience Workshop!