Best practice for addressing Data Vault challenges in Databricks

Dec 23, 2024 9:43:23 AM

Recently, Agile Data Engine participated in the German Data Vault User Group’s “Willibald Challenge”, a unique opportunity for software vendors to tackle real-world data challenges using a sample dataset from a fictional company. This hands-on competition gave us the chance to demonstrate our platform’s capabilities for Data Vault modelling, supported by a robust DataOps framework. For this challenge, we deployed a data lakehouse with Agile Data Engine on Databricks, an ideal choice for us given its compatibility with our platform and the rich features it offers for data engineering and analytics.

About the Willibald Challenge 

The Willibald Challenge was designed to test the capabilities of a software vendor like Agile Data Engine in solving common and advanced Data Vault issues. Using a provided sample dataset, participants had to address tasks such as detecting duplicate and deleted records, setting up foundational Data Vault structures (hubs, satellites, and links), and managing advanced Data Vault features such as multi-active satellites and hierarchical links. These challenges represent typical, yet complex, data problems that companies face in modern data management.


Building the Data Vault Model on Databricks 

Our choice to use Databricks as the underlying cloud platform for this challenge proved highly effective. Databricks, with its versatile data lakehouse capabilities, provides the storage scalability of a data lake combined with the data management features of a traditional warehouse. With Agile Data Engine’s support for Databricks SQL, we were able to build and deploy the Data Vault model smoothly, ensuring data integrity and flexibility as we progressed through the challenge tasks:


Data Vault Modelling

Agile Data Engine provided a solid foundation for the Data Vault model, allowing us to easily build and manage hubs, satellites, and links on Databricks. We leveraged Databricks SQL to streamline these processes, making it efficient to set up each core component while focusing on scalability and future growth. From identifying relationships and driving-key patterns to handling duplicates, deletes, or data sources without primary keys, our platform enabled quick development and deployment, all running smoothly within Databricks.
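
To make this concrete, here is a minimal sketch in Databricks SQL of what a hub and its load could look like. The table, column names, and hashing convention are illustrative assumptions for this post, not the code Agile Data Engine actually generates:

    -- Hypothetical hub for the customer business key (illustrative names).
    CREATE TABLE IF NOT EXISTS dv.h_customer (
      customer_hk   STRING    NOT NULL,  -- hash key derived from the business key
      customer_bk   STRING    NOT NULL,  -- business key from the source system
      load_time     TIMESTAMP NOT NULL,
      record_source STRING    NOT NULL
    ) USING DELTA;

    -- Idempotent load: insert only business keys not yet present in the hub.
    INSERT INTO dv.h_customer
    SELECT DISTINCT
      sha2(upper(trim(s.customer_id)), 256) AS customer_hk,
      s.customer_id                         AS customer_bk,
      current_timestamp()                   AS load_time,
      'webshop'                             AS record_source
    FROM staging.customer AS s
    LEFT ANTI JOIN dv.h_customer AS h
      ON h.customer_hk = sha2(upper(trim(s.customer_id)), 256);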

Advanced Data Vault Components

Beyond standard modelling, the challenge called for implementing advanced features such as multi-active satellites, effectivity satellites, and transactional and hierarchical links. Multi-active satellites, which hold several concurrently valid records for a single parent key, and effectivity satellites, which track the validity of relationships over time, were straightforward to configure in Agile Data Engine. Hierarchical and transactional links were equally easy to manage, and Databricks’ SQL engine allowed us to handle complex relationships between entities while maintaining data integrity across levels.
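
As a rough illustration of the multi-active pattern, the satellite below carries a subsequence key so that one hub key can have several active rows at once; again, all names are hypothetical:

    -- Hypothetical multi-active satellite: one hub key can carry several
    -- concurrently valid rows, distinguished by a subsequence key.
    CREATE TABLE IF NOT EXISTS dv.s_customer_phone (
      customer_hk   STRING    NOT NULL,
      phone_type    STRING    NOT NULL,  -- subsequence key (e.g. 'home', 'mobile')
      phone_number  STRING,
      load_time     TIMESTAMP NOT NULL,
      record_source STRING    NOT NULL
    ) USING DELTA;

    -- Current active set: the latest load per hub key and subsequence key.
    SELECT customer_hk, phone_type, phone_number
    FROM (
      SELECT *,
             row_number() OVER (PARTITION BY customer_hk, phone_type
                                ORDER BY load_time DESC) AS rn
      FROM dv.s_customer_phone
    ) AS t
    WHERE rn = 1;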

Data Quality and Integrity Checks

Detecting duplicates and deletions is a common task in data management, yet it’s often resource-intensive. Databricks SQL, combined with Agile Data Engine’s data quality monitoring tools, allowed us to set up efficient quality checks. The SQL-native support made it easy to write validation queries, ensuring duplicate records and deletions were effectively managed without compromising model performance.
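
The checks themselves boil down to simple, set-based SQL. A sketch of the two patterns, using the same hypothetical names as above:

    -- Duplicate check: business keys appearing more than once in a delivery.
    SELECT customer_id, count(*) AS row_cnt
    FROM staging.customer
    GROUP BY customer_id
    HAVING count(*) > 1;

    -- Deletion detection for a full extract: keys known to the hub that are
    -- missing from the latest delivery.
    SELECT h.customer_bk
    FROM dv.h_customer AS h
    LEFT ANTI JOIN staging.customer AS s
      ON h.customer_bk = s.customer_id;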

ADE demonstrated how these sophisticated data modelling issues can be handled.

Leveraging Databricks Native Features 

In addition to its technical compatibility, Databricks provided several key features that enhanced our experience and extended the capabilities of Agile Data Engine. 

Data Lineage Tracking

Databricks’ Unity Catalog lineage feature was a standout benefit, offering clear visibility into data transformations and lineage across the Data Vault model at column level. This functionality not only helped us track changes through the pipeline but also provided crucial insights for debugging and optimization.
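
Lineage is browsable in Catalog Explorer, and it can also be queried from the Unity Catalog system tables, assuming system tables are enabled in the workspace; the column names below follow the current Databricks documentation, and the catalog/schema names are illustrative:

    -- Column-level lineage feeding the Data Vault schema.
    SELECT source_table_full_name,
           source_column_name,
           target_table_full_name,
           target_column_name,
           event_time
    FROM system.access.column_lineage
    WHERE target_table_full_name LIKE 'main.dv.%'
    ORDER BY event_time DESC
    LIMIT 100;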

Integrated Business Intelligence

Another highlight was Databricks’ support for BI integration, which made it easy for us to connect the Data Vault model to visualization and reporting tools. This integration delivered a seamless experience, enabling us to visualize the Data Vault challenge results directly in Databricks dashboards. It streamlined the process from data preparation to actionable insights, eliminating the need for additional tools.

ADE used Databricks-integrated business intelligence to seamlessly report the final results.

Key takeaways from Agile Data Engine's Willibald Challenge on Databricks

Our participation in the Willibald Challenge on Databricks proved to be a rewarding experience, both for testing Agile Data Engine’s capabilities in a demanding setting and for optimizing our processes. Databricks’ support for SQL-based querying, data lineage, and BI integration aligned perfectly with our platform’s capabilities and our objectives in the challenge.

Some of the valuable insights we gained include the following.

Seamless Data Vault Modelling on Databricks

The compatibility between Agile Data Engine and Databricks SQL allowed us to rapidly set up a stable, high-performance Data Vault model with all the necessary components. 

Scalability and Advanced Processing

From multi-active and effectivity satellites to transactional and hierarchical links, Databricks handled advanced Data Vault components with ease, allowing Agile Data Engine to support complex data requirements without bottlenecks. 

Enhanced Data Governance

Databricks’ Unity Catalog lineage tracking brought transparency and efficiency to our data management practices, making it easy to trace data flows and transformations across the pipeline. 

Further relevant points on data architecture 

Before we finish, let’s discuss a few other architectural aspects. We did not specifically use them as part of this challenge, but they are highly relevant for real-life implementations.

Agile Data Engine helps you design, automate, and orchestrate your Data Vault implementation, giving you the full benefits of the methodology: agility in the development and industrialization of your pipelines.

Implementing a Data Vault model at scale requires some engineering effort to optimize the data layout of growing objects. Predictive Optimization (PO) in Databricks simplifies this process by automating table management through Unity Catalog and the Data Intelligence Platform. PO intelligently optimizes data layouts for Unity Catalog-managed tables, resulting in improved query performance and reduced storage costs.
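
If PO is not already enabled at the account level, it can be controlled per catalog or schema. A sketch based on the documented Databricks SQL syntax, with illustrative catalog and schema names:

    -- Enable Predictive Optimization for a catalog or a single schema
    -- (it can also be enabled account-wide).
    ALTER CATALOG main ENABLE PREDICTIVE OPTIMIZATION;
    ALTER SCHEMA main.dv ENABLE PREDICTIVE OPTIMIZATION;

    -- Inspect the schema to verify the effective setting.
    DESCRIBE SCHEMA EXTENDED main.dv;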


We addressed two common challenges in building data platforms: ensuring high-quality data and delivering it quickly to end users. However, it’s often unclear whether users fully leverage all available data for insights and decision-making. This is where Databricks’ AI/BI Genie steps in. With access to the metadata in Unity Catalog, Genie enables business users to ask questions in natural language, without in-depth knowledge of the data model, and get answers that draw on all the relevant information.
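
Because Genie draws on Unity Catalog metadata, well-maintained table and column comments directly improve the answers it can give. A small, hypothetical example:

    -- Hypothetical comments enriching Unity Catalog metadata for Genie.
    COMMENT ON TABLE main.dv.h_customer IS
      'Data Vault hub: one row per unique customer business key.';

    ALTER TABLE main.dv.h_customer
      ALTER COLUMN customer_bk COMMENT 'Customer number as delivered by the source system.';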

Other features that make a Databricks lakehouse a great place to implement Data Vault with Agile Data Engine include Databricks’ serverless SQL warehouses, the VARIANT data type, and Auto Loader.
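
For instance, VARIANT lets you land semi-structured source payloads without locking in a schema up front. A minimal sketch, with table and field names invented for illustration (VARIANT requires a recent Databricks Runtime):

    -- Landing semi-structured data in a VARIANT column.
    CREATE TABLE IF NOT EXISTS staging.orders_raw (
      payload   VARIANT,
      load_time TIMESTAMP
    ) USING DELTA;

    INSERT INTO staging.orders_raw
    SELECT parse_json('{"order_id": 42, "lines": [{"sku": "A1", "qty": 2}]}'),
           current_timestamp();

    -- Fields are extracted with path expressions and cast as needed.
    SELECT payload:order_id::INT        AS order_id,
           payload:lines[0].sku::STRING AS first_sku
    FROM staging.orders_raw;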

Final thoughts on the Agile Data Engine Willibald Challenge 

Our team thoroughly enjoyed tackling these challenges. Working with the DDVUG and the Databricks team on refining our final solution felt like a small project with a serious fun factor. 

If you get a chance, please take some time to watch the recording of our final presentation of the Willibald Challenge. We hope it conveys the robustness of our combined solution: Databricks as the data platform plus Agile Data Engine as the all-in-one automation platform. This combination delivers agility through automated schema changes, small and frequent deployments, and effortless model adjustments.

To learn more about how Agile Data Engine helps solve these complex challenges, you can check out our whitepapers or ask for a demo.