Picture a world where every data manager and user speaks a different data dialect. Chaos reigns. Ambiguity, inefficiency, and miscommunication undermine operations, alignment, and the decisions that stand between your organization and growth.
Data modeling adds clarity and a unified language to data work and analysis by creating blueprints that help us understand data structures and their connections. In this article, I explain why data modeling is worth the effort and how you can implement it at the conceptual, logical, and physical levels.
Data modeling is the process of creating a visual representation of data structures, their relationships, attributes, and the rules within an information system. It is essentially a blueprint for how corporate data is used and mapped to the real world.
The purpose of data modeling is to facilitate understanding, organization, and management of data. It guides data storage, retrieval, and use, ensuring that business processes, applications, and software run with optimally formatted data.
Data modeling, especially in an enterprise context, helps data managers and users:
Data modeling can be done at different levels for different purposes. The most common data models are the conceptual, logical, and physical data models. Traditional data modeling is done top-down: starting from the conceptual data model, then deriving the logical data model, and finally creating the actual physical data model implementation in the target database or enterprise data warehouse.
Conceptual data modeling is a broadly defined and used concept with several interpretations. Nevertheless, the main purpose of a conceptual data model is to identify and define relevant concepts and their relationships. This fosters a shared understanding for stakeholders from business to technical.
This represents the highest level of data modeling, and shouldn’t dive into too much detail. For the same reason, a conceptual data model should be independent of technology and application.
Zooming out from the details and looking at the bigger picture helps you see the related concepts in your focus area. This gives you a better understanding of what you need to take into consideration when building your application or analytics solution on top of certain data.
Usually, conceptual modeling starts by identifying the relevant concepts and how those concepts relate and interact with each other. It is a good practice to describe the concepts (and the relationships as well) with a couple of sentences to make sure you have a mutual understanding of what you are trying to model.
Let’s say you need a conceptual data model for your library system. The data model should describe the key entities, attributes, and relations within the system. Here’s what it could look like in simple terms:
Entities
Book: A book is a physical or electronic representation of a written work, typically bound between covers and written by one or more authors.
Author: An author is the person responsible for creating the written content of a book.
Relationships
Book-Author: A book can have one or more authors, and an author can write one or more books.
When you apply conceptual data modeling for analytical purposes, you're trying to understand and identify the business terminology, the concepts, and their relationships within the domain in scope. However, unlike when designing actual business applications, data for those concepts usually already exists. As part of the conceptual modeling process, you may add further information to the model, such as synonyms, locations, source systems, and even quality traits of the available data.
It is also a good practice to link the conceptual model to any existing business glossaries or taxonomies. This metadata can serve as input for a data catalog that provides a centralized view of available data assets in your organization, including their characteristics and how they can be used for analytical purposes.
The next data model type – the logical data model – adds a layer of detail: it resolves many-to-many relationships between entities with intermediate tables, specifies the entities' attributes, and defines unique or natural keys and the references between these tables.
The purpose of the logical data model is to act as an explicit representation of the domain in scope. It should capture how to store and organize all relevant and essential domain data without losing information. The logical data model should reflect the business’ understanding of the data on a detailed level. This means that there is no place for surrogate keys, data types, or indices in a logical model. Query performance and storage optimization should be left to the physical, database-specific data model where it belongs – more on this later.
Logical data modeling is an often overlooked process in the analytical landscape. A well-defined logical model that’s based on business requirements:
A logical data model can be normalized or de-normalized, depending on the purpose it serves; a logical data model for analytics, for instance, can be done in either form. The main point, however, is to represent the tables and attributes in scope in end-user-friendly language and to describe how they interact with each other.
Continuing on the library example, your logical data model could clarify data structure and associations with the following:
Attributes for the Book entity
ISBN - Natural unique identifier for a book
Title
Genre
Description
Language
Attributes for the BookAuthor entity - All the attributes together guarantee uniqueness for individual tuples (rows in the database)
ISBN
First Name
Middle Name
Last Name
Attributes for the Author entity
First Name
Middle Name
Last Name
Birthdate
Biography
Nationality
Keys and cardinalities
The logical data model only contains natural keys. A natural key can be a combination of multiple attributes, like in the Author table shown above. In the logical data model, many-to-many relationships are resolved by deriving additional “bridge” tables based on the combination of natural keys of entities in the relationship.
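To make the resolution concrete, here is a small SQL sketch that lists each book together with its authors by joining through the BookAuthor bridge table. The table and column names are adapted from the logical model above into SQL-friendly identifiers; they are illustrative, not prescriptive.

```sql
-- List each book with its authors by joining through the BookAuthor bridge table,
-- which resolves the many-to-many relationship using natural keys only.
SELECT
    b.ISBN,
    b.Title,
    a.FirstName,
    a.MiddleName,
    a.LastName
FROM Book AS b
JOIN BookAuthor AS ba
    ON ba.ISBN = b.ISBN                  -- natural key of Book
JOIN Author AS a
    ON  a.FirstName  = ba.FirstName      -- composite natural key of Author
    AND a.MiddleName = ba.MiddleName     -- (a nullable middle name would need a NULL-safe comparison in practice)
    AND a.LastName   = ba.LastName
ORDER BY b.Title, a.LastName;
```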
This logical data model further refines the conceptual model by adding specific attributes and defining relationships with keys. It might also contain normalization or de-normalization based on the requirements for the use case. It provides a more detailed view of how the entities relate to each other and the attributes associated with each entity, aiming to guide the implementation phase while remaining independent of the specific database technology or physical implementation details.
In the final phase, the logical data model is transformed into a physical data model by defining data types and adding technical attributes like surrogate keys, constraints, and indices.
The purpose of physical data modeling is to store the data in a predefined database in a robust manner that enables high performance and takes the technical details of the chosen technology into consideration. Well-defined data types can have a significant impact on query performance and also help optimize computing and cloud storage costs.
In analytics, we mostly handle downstream data. This means the data we are dealing with already exists, and there is most likely a predefined schema for it. We should not lose that schema when integrating the data into our data platform or data warehouse; instead, we should leverage it, bringing in the correct data types for attributes along with other available metadata.
Keep in mind that physical data modeling is technology-specific, so you must understand the indexing, compression, and performance of data types and tables, as well as how the data is or should be distributed in your chosen analytics database technology. Flowing the data through the different layers of your data warehouse or data platform probably requires mastering different kinds of modeling methodologies like Data Vault and star schema.
At this point in our library example, we want to see the data flow into the right places, adhering to the structures we defined at the conceptual and logical levels. Physical data modeling must be tailored to a particular database system. Here's what it could look like for a relational database, with a DDL sketch after the list pulling the pieces together:
Constraints and indices
Primary keys and foreign keys: Defined to ensure data integrity and enforce relationships between tables.
Indices: Created on columns frequently used in queries for faster data retrieval. Usually, indices are created automatically based on the primary key. Additional indices may also be created for common query patterns to increase query performance, but remember that indices usually affect write operations and require more computing and storage.
Data Types
Use of specific data types like INTEGER, VARCHAR, DATE, etc., based on the database system's requirements and best practices for storing different types of data.
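Pulling these pieces together, here is a minimal DDL sketch for the library example in generic SQL. Treat it as an illustration rather than a finished design: exact types and syntax vary by database engine, and both the surrogate key on Author and the extra index are assumptions made for the example.

```sql
-- Physical model sketch for the library example (generic SQL;
-- adjust types and syntax to your database engine).

CREATE TABLE author (
    author_id    INTEGER       NOT NULL,   -- surrogate key added at the physical level
    first_name   VARCHAR(100)  NOT NULL,
    middle_name  VARCHAR(100),
    last_name    VARCHAR(100)  NOT NULL,
    birthdate    DATE,
    biography    VARCHAR(4000),
    nationality  VARCHAR(100),
    CONSTRAINT pk_author PRIMARY KEY (author_id)
);

CREATE TABLE book (
    isbn         CHAR(13)      NOT NULL,   -- natural key kept as the primary key
    title        VARCHAR(500)  NOT NULL,
    genre        VARCHAR(100),
    description  VARCHAR(4000),
    language     VARCHAR(50),
    CONSTRAINT pk_book PRIMARY KEY (isbn)
);

-- Bridge table resolving the many-to-many relationship between books and authors.
CREATE TABLE book_author (
    isbn         CHAR(13)      NOT NULL,
    author_id    INTEGER       NOT NULL,
    CONSTRAINT pk_book_author PRIMARY KEY (isbn, author_id),
    CONSTRAINT fk_ba_book   FOREIGN KEY (isbn)      REFERENCES book (isbn),
    CONSTRAINT fk_ba_author FOREIGN KEY (author_id) REFERENCES author (author_id)
);

-- Additional index to support a common query pattern (looking up books by author);
-- remember that every extra index adds cost to write operations and storage.
CREATE INDEX ix_book_author_author ON book_author (author_id);
```

Note how Author gets a surrogate key here, while the logical model kept only the composite natural key. This is exactly the kind of technical decision that belongs at the physical level.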
Note
Column-oriented databases designed for analytics rarely enforce constraints, even when they let you declare them. This means it is the data engineer's responsibility to ensure that duplicates or missing references are not inserted into the tables. Indexing in column-oriented databases also differs from row-oriented databases, so it's crucial to know the technology you are working with.
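For example, when the platform does not enforce primary keys, a post-load check along these lines (illustrative, reusing the book_author table from the sketch above) can catch duplicate keys before they propagate downstream:

```sql
-- Detect duplicate natural keys after a load into a warehouse that does not
-- enforce primary key constraints (table name from the example above).
SELECT
    isbn,
    author_id,
    COUNT(*) AS row_count
FROM book_author
GROUP BY isbn, author_id
HAVING COUNT(*) > 1;
```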
This physical data model provides a detailed blueprint for the database implementation. It specifies the tables, their columns, data types, constraints, and indices needed to build the database schema in a specific relational database, translating the logical data model into a technical representation ready for implementation.
That’s quite a lot to digest, so let’s sum up how conceptual, logical, and physical data models differ and relate to each other.
1. Conceptual data model: identifies the key concepts and their relationships at a high level, independent of any technology or application.
2. Logical data model: adds attributes, natural keys, and resolved many-to-many relationships, describing the domain in detail while remaining independent of the database technology.
3. Physical data model: translates the logical model into a technology-specific implementation with data types, surrogate keys, constraints, and indices.
Data modeling should be a collaborative effort that involves all relevant stakeholders. One of the most common reasons for a data model to become obsolete is that it was developed in isolation from the stakeholders who have the domain knowledge. Communication ensures that your teams are on the same page in understanding the domain and related data requirements.
The other prerequisite for successful data modeling is iteration. Data models should be built in increments and revisited from time to time, because they will require changes as people change, the business evolves, language evolves, society changes, and so on. You can try to model the whole world in one go, but even if that worked out, by the time you finished, the model would probably no longer match reality.
To put these fundamentals in place, you need tools that were purpose-built for collaborative data management, development, and operations. Other features that make a good data modeling tool include:
You can find several tools built for data modeling, but it’s also wise to consider systems that cover a bigger slice of the data pie – not least because increasing complexity is a common hurdle on the road to efficient management.
One route to consider is a DataOps platform like Agile Data Engine, which enables a lot more than just the creation of enterprise data models and data warehouse modeling.
Putting effort into data modeling helps you capture the essential, precise information about the domain you're interested in.
Valid data models create trust and ensure the delivery of correct data. They make it easier and more efficient to store, organize, process, and leverage your data to its full potential. Up-to-date data models also enable applying data governance policies and meeting different kinds of compliance requirements, such as GDPR.
Watch the video below to see Agile Data Engine’s data modeling in action.