Data quality issues are present in every data warehouse. But how can you tell whether the quality of your data is good? In this blog, I will go through data quality dimensions commonly referenced in the literature, which give you criteria for recognising and monitoring data quality issues. I will also discuss how tests for these dimensions can be applied in the different zones of a data warehouse.
Start data quality management by defining a set of dimensions appropriate for your use case. This set of data quality dimensions is the foundation for systematic data quality management: it lets you implement standardized tests and templates that cover the same data quality characteristics across different entities.
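As an illustration, here is a minimal sketch of such a reusable test template in Python. The `run_query` helper and the table and column names are hypothetical assumptions; the point is that one template covers the same characteristic across many entities.

```python
# A minimal sketch of a reusable data quality test template. The
# `run_query` helper and the table/column names are hypothetical.
from dataclasses import dataclass

@dataclass
class NotNullTest:
    table: str
    column: str

    def to_sql(self) -> str:
        # Counts rows violating the completeness rule for one column.
        return (
            f"SELECT COUNT(*) AS failures FROM {self.table} "
            f"WHERE {self.column} IS NULL"
        )

# The same template is applied to any entity sharing the characteristic:
tests = [
    NotNullTest("staging.customer", "customer_id"),
    NotNullTest("staging.orders", "order_id"),
]
for t in tests:
    print(t.to_sql())  # in practice: failures = run_query(t.to_sql())
```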
Several reference frameworks can help you get started. For example, the UK Government Data Quality Hub lists the following data quality dimensions:

- Accuracy – the data correctly describes the real-world object or event it represents
- Completeness – all required data is present
- Uniqueness – each entity is recorded only once, with no duplicates
- Consistency – the data does not contradict itself within or across datasets
- Timeliness – the data represents reality from the required point in time
- Validity – the data conforms to the expected format, type, and range
A fitting set of dimensions depends on your use case and business requirements, and it naturally evolves over time.
The next step is to define how you test these data quality dimensions in your data warehouse. Data warehouses are organised into zones (or layers) that serve different purposes. Usually, a data warehouse contains at least the following zones (a configuration sketch follows the table):
| Zone | Description |
| --- | --- |
| Staging | Landing area where source data is loaded as-is, with little or no transformation |
| Data warehouse | Integrated, historized data model that combines data from the different source systems |
| Publish | Data modeled for consumption, e.g. star schemas and flat tables used by reporting and analytics tools |
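Zone definitions like these can also drive which tests run where. Below is a minimal configuration sketch, assuming a simple in-code mapping; the zone and test names are illustrative, not from any specific tool.

```python
# Hypothetical mapping of warehouse zones to the test suites run in each;
# the test identifiers are illustrative.
ZONE_TESTS = {
    "staging": ["row_count_vs_source", "freshness", "schema_check"],
    "data_warehouse": ["not_null_keys", "unique_keys", "referential_integrity"],
    "publish": ["aggregate_reconciliation", "freshness", "accepted_values"],
}

def tests_for(zone: str) -> list[str]:
    """Return the test suite configured for a given zone."""
    return ZONE_TESTS.get(zone, [])

print(tests_for("staging"))
```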
You can run data quality tests in each zone of the data warehouse, for somewhat different purposes. You could, for example, analyse and compare the quality of incoming data from different source systems, or monitor the status of the publish models used for reporting. The following table gives examples of how you could test the different data quality dimensions in each zone (a code sketch follows the table):
| Data quality dimension | Staging | Data warehouse | Publish |
| --- | --- | --- | --- |
| Accuracy | Compare incoming values and row counts against the source system | Verify transformed values against business rules or reference data | Reconcile reported figures against known totals |
| Completeness | Check mandatory source columns for missing values | Check that all expected source entities are loaded and keys are not null | Check that published models cover all required business entities |
| Uniqueness | Detect duplicate records in source extracts | Test the uniqueness of primary and business keys | Test that fact and dimension keys are unique |
| Consistency | Compare overlapping data between source systems | Test referential integrity between entities | Verify that figures agree across publish models and reports |
| Timeliness | Monitor the arrival and freshness of source extracts | Monitor load timestamps against schedules | Monitor that published data is refreshed before business use |
| Validity | Check formats, data types, and value ranges of incoming data | Test conformance to the data model, e.g. accepted values and ranges | Test that published attributes conform to reporting conventions |
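To make one cell of the table concrete, here is a minimal sketch of a uniqueness test on a business key in the data warehouse zone. The `run_query` executor and the table and column names are hypothetical assumptions; adapt them to your own tooling.

```python
# Minimal sketch of a uniqueness test (data warehouse zone). The
# `run_query` executor and the table/column names are hypothetical.
def uniqueness_sql(table: str, key: str) -> str:
    # Returns keys that appear more than once, i.e. uniqueness violations.
    return (
        f"SELECT {key}, COUNT(*) AS n FROM {table} "
        f"GROUP BY {key} HAVING COUNT(*) > 1"
    )

def test_uniqueness(run_query, table: str, key: str) -> bool:
    violations = run_query(uniqueness_sql(table, key))
    if violations:
        print(f"FAIL {table}.{key}: {len(violations)} duplicated keys")
        return False
    print(f"PASS {table}.{key}")
    return True

# Usage with any DB-API style executor, e.g.:
#   test_uniqueness(lambda sql: cur.execute(sql).fetchall(),
#                   "dw.customer", "customer_business_key")
```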
Business requirements should guide how much data quality testing you do and in which zones the tests are applied. Testing consumes many kinds of resources, so it also has cost and performance implications. Testing the same characteristics in several data warehouse zones might seem excessive, but the redundancy shows how effectively data quality improves as data is transformed through the warehouse. Such improvement can happen, for example, when cleaned-up data from a master data management system is introduced in the data warehouse zone. Multi-zone testing also often makes it easier to pinpoint the root causes of data quality problems.
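For example, running the same completeness metric in every zone shows where quality improves or degrades as data flows through the warehouse. A minimal sketch, with hypothetical table names and a `run_query` helper that returns rows as tuples:

```python
# Minimal sketch of running the same completeness metric in every zone to
# localise where quality changes; the table names are hypothetical.
ZONE_TABLES = {
    "staging": "staging.customer",
    "data_warehouse": "dw.customer",
    "publish": "publish.dim_customer",
}

def null_ratio_sql(table: str, column: str) -> str:
    # Share of rows where the column is missing (0.0 = fully complete).
    return (
        f"SELECT AVG(CASE WHEN {column} IS NULL THEN 1.0 ELSE 0.0 END) "
        f"FROM {table}"
    )

def completeness_by_zone(run_query, column: str) -> dict[str, float]:
    return {
        zone: run_query(null_ratio_sql(table, column))[0][0]
        for zone, table in ZONE_TABLES.items()
    }

# A jump in the ratio between two zones points at the transformation step
# where the problem is introduced (or fixed).
```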
Typically, you would also build some monitoring outside of the data warehouse. For example, the solution could monitor source data integrations for the timeliness and completeness of incoming data, or monitor the data as it is consumed in BI and other end-user tools. When multiple components run partially overlapping tests, you can catch critical issues even if one system is temporarily down.
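As a simple example of monitoring outside the warehouse, the following sketch checks the freshness of source extract files in a landing directory. The path, file pattern, and one-day threshold are illustrative assumptions.

```python
# Minimal sketch of a freshness monitor running outside the warehouse,
# checking when source extract files last arrived. The directory, file
# pattern, and threshold are illustrative assumptions.
import time
from pathlib import Path

MAX_AGE_SECONDS = 24 * 60 * 60  # alert if no new file within a day

def check_freshness(landing_dir: str) -> bool:
    files = list(Path(landing_dir).glob("*.csv"))
    if not files:
        print(f"ALERT {landing_dir}: no extracts found")
        return False
    newest = max(f.stat().st_mtime for f in files)
    age = time.time() - newest
    if age > MAX_AGE_SECONDS:
        print(f"ALERT {landing_dir}: newest extract is {age / 3600:.1f} h old")
        return False
    return True

# e.g. check_freshness("/landing/source_a")  # hypothetical path
```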
We have seen and built various ways to manage data quality in our customer environments. The solutions range from custom business rules and mapping tables in the data warehouse load logic to implementing dedicated master data management systems. We compiled our best practices for data quality monitoring with Agile Data Engine into a simple guide.
Also see the next blog in the series: Data quality monitoring for short-term and long-term purposes.