NashTech Insights

Data Mesh Architecture with Azure and DataBricks

Kundan Kumar

In general, a data mesh architecture comprises several components that interact and work together. The journey starts with a mindset of building decentralized data products that are owned and governed by the experts of each domain team. Each domain team publishes a detailed agreement, or contract, describing its data product and how to use it, for its consumers (other domain teams). The architecture also includes a shared data infrastructure platform, called the self-serve data platform, which supports all domain data teams across the organization in building robust, secure, and standardized data products. Finally, a federated governance group is responsible for defining the global policies and security standards that every domain team must follow when building its data products. Let’s implement a data mesh on Azure to understand this more clearly.

In a data mesh architecture, you implement the principle of domain-oriented data ownership. You first identify the different logical problem spaces you want to address across the organization; these logical problem spaces are called domains. The domain team (a group of domain experts, application developers, data engineers, ML engineers, and data scientists) owns its domain applications, and stores and processes the domain data for analytical consumption by other domains. Each domain should publish its product specification:

1. Data, its meaning, the format or shape of its data, and its refresh cycle.
2. Processes for locating and gaining access to the product.
3. Quality standards to make it available for consumption.
4. Security measures for different access requirements.

With these specifications, other domains can consume the data product effectively, according to their requirements. In a data mesh, different domain teams can use different technology stacks to build their data products, but every team must follow the organization’s global policies and standards. Above all, each data product must be discoverable, understandable, addressable, interoperable, secure, and trustworthy.
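The four specification points above can be captured as machine-readable metadata that other teams query before consuming a product. A minimal sketch in Python (the field names and example values are illustrative, not a standard):

```python
from dataclasses import dataclass


@dataclass
class DataProductSpec:
    """Illustrative specification a domain team could publish for its data product."""
    name: str            # discoverable: unique product name
    domain: str          # owning domain team
    output_ports: list   # addressable: e.g. an Event Hub topic, a blob URL
    schema: dict         # understandable: attribute names and types
    refresh_cycle: str   # e.g. "hourly", "daily"
    quality_slo: dict    # quality standards for consumption
    access_process: str  # how consumers request access
    security: list       # required roles / classifications


# A hypothetical spec the Checkout team might publish:
checkout_spec = DataProductSpec(
    name="checkout-orders",
    domain="checkout",
    output_ports=["eventhub:checkout-orders", "blob:checkout-container"],
    schema={"order_id": "string", "amount": "decimal", "ts": "timestamp"},
    refresh_cycle="hourly",
    quality_slo={"completeness": 0.99, "max_latency_minutes": 60},
    access_process="request via central catalogue",
    security=["role:checkout-reader"],
)
print(checkout_spec.name)  # checkout-orders
```

Storing specifications in a structured form like this is what makes products discoverable and addressable programmatically, rather than through tribal knowledge.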

In the above architecture, there are three domains respectively Cart, Checkout, and Marketing.

The Checkout domain application stores its operational data in Azure SQL. The team ingests data from Azure SQL into Azure Blob Storage using Azure Data Factory, and the application also pushes events to the Azure Event Hub. Similarly, the Cart domain application stores its operational data in Azure Cosmos DB, and the team ingests data from Azure Cosmos DB into Azure Blob Storage. A global policy requires each domain team to provide its data both to the Event Hub and to Azure Blob Storage, in JSON format.
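On the producer side, following that global policy could look roughly like this. The Azure client calls are replaced with plain serialization so the sketch stays self-contained, and the envelope fields are assumptions, not a published standard:

```python
import json
from datetime import datetime, timezone


def to_policy_json(record: dict, domain: str) -> str:
    """Wrap an operational record in the JSON envelope a (hypothetical)
    global policy requires before publishing to the Event Hub or blob storage."""
    envelope = {
        "domain": domain,
        "published_at": datetime.now(timezone.utc).isoformat(),
        "payload": record,
    }
    return json.dumps(envelope)


# A Checkout record as it might leave Azure SQL on its way to the Event Hub:
event = to_policy_json({"order_id": "o-17", "amount": 42.5}, domain="checkout")
parsed = json.loads(event)
print(parsed["domain"])  # checkout
```

In practice the serialized string would be handed to an Event Hub producer client and written to the agreed blob container; the point is that every domain emits the same envelope shape.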

The Marketing domain makes an agreement with the Checkout and Cart domains for data consumption. The Marketing domain consumes data from the Event Hub and blob storage, and builds its analytics and ML platform on Databricks, ingesting the data into the Databricks Lakehouse, which is built on top of Spark and Delta Lake. Databricks SQL dashboards and Databricks ML models are the Marketing team’s data products.

A data product provides data to other teams under a service agreement, or data contract. Data contracts can drive a metadata-based ingestion framework; you should store them as metadata records within a centrally managed catalogue. A data contract covers:

  1. Data product provider, including the team, owner, and the output port to access.
  2. Data product consumer, including the team, a responsible contact, and the purpose of data usage.
  3. Schema and semantics of the data attributes used.
  4. Service-level objectives, such as latency, availability, uptime, and error rates.
  5. Terms, such as query intervals and data processing volumes.
  6. Start date and end date.
  7. Interoperability standards.
  8. Data lineage.
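The fields above can be stored as a single metadata record in the central catalogue. A hedged sketch of such a record, plus a minimal completeness check a governance review might run (the keys and values are illustrative, not an Azure Purview schema):

```python
data_contract = {
    "provider": {"team": "checkout", "owner": "checkout-data@org.example",
                 "output_port": "eventhub:checkout-orders"},
    "consumer": {"team": "marketing", "contact": "marketing-data@org.example",
                 "purpose": "campaign attribution"},
    "schema": {"order_id": "string", "amount": "decimal"},
    "slo": {"latency_minutes": 60, "availability": 0.999, "error_rate": 0.01},
    "terms": {"query_interval": "15m", "max_volume_gb_per_day": 50},
    "start_date": "2023-01-01",
    "end_date": None,  # open-ended until renegotiated
    "interoperability": ["JSON"],
    "lineage": ["azure-sql:checkout-db", "adf:checkout-ingest"],
}


def contract_is_complete(contract: dict) -> bool:
    """Check that the mandatory sections are present before governance review."""
    required = {"provider", "consumer", "schema", "slo", "terms", "start_date"}
    return required.issubset(contract)


print(contract_is_complete(data_contract))  # True
```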

This list can grow as global policies are created and time passes. A data contract helps domain teams ensure the quality of their data products and provides transparency about data usage and dependencies. You can use Azure Purview to fulfil the data contract requirements, and you can build a web application on top of Azure Purview to put data contracts in place. The central data governance team should review all data contracts regularly across the organization. It is recommended to make data contracts self-service and fully automated.

In the architecture above, the Checkout and Cart domain teams publish a data contract with the Marketing domain team. The contract carries the product specification, such as the schema of the data, with the metadata pushed into the Azure data catalogue. It names the output ports for consuming the data, such as the Event Hub topic and the REST address for blob access, “https://{storage}{container}“, and it specifies which permissions are required to access the data assets.

Federated governance is another important part of the data mesh architecture. It is a collaborative approach to data governance: the governance group consists of representatives of the different domains within the company. The governance group builds the global policies, which are the rules of play in the data mesh; these rules define how domain teams must build their data products across the organization. Federated governance has several components:

The interoperability policy is a set of guidelines, standards, and practices. Here, the governance group outlines the technical and data standards required for different systems and applications to work together. This policy ensures that data is consistent and can be easily integrated across the different domains in the data mesh. In the architecture above, for example, the global policy is that each team must provide its data in a specific JSON format, in a specific blob storage location and Event Hub topic.
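A self-serve platform can enforce such a format policy automatically before accepting a message. A minimal conformance check, assuming a hypothetical envelope with mandated keys (the key names are an assumption for illustration):

```python
import json

# Assumed keys the (hypothetical) global-policy JSON envelope must carry.
REQUIRED_KEYS = {"domain", "published_at", "payload"}


def conforms_to_policy(message: str) -> bool:
    """Return True if the message is valid JSON and carries the mandated keys."""
    try:
        doc = json.loads(message)
    except json.JSONDecodeError:
        return False
    return isinstance(doc, dict) and REQUIRED_KEYS.issubset(doc)


ok = conforms_to_policy(
    '{"domain": "cart", "published_at": "2023-05-01T10:00:00Z", "payload": {}}'
)
bad = conforms_to_policy('{"domain": "cart"}')
print(ok, bad)  # True False
```

Rejecting non-conforming messages at the platform boundary is what turns a written policy into an enforced one.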

The documentation policy outlines the guidelines and standards for documenting data products across the different domains in the data mesh. As a best practice, it also includes guidelines on data lineage, metadata management, and data quality management. For example, a rule could be that each data domain maintains a Confluence page for each data product, with its specification and usage. This enables collaboration and data sharing in the data mesh.

In the data security policy, the governance group defines guidelines and practices to ensure that data is secure in the data mesh environment: guidelines on access control, data protection, data classification, and data retention. For example, each domain team creates role-based access in Azure AD for its data product, and a security scan of each data product is required before it is made available for consumption.

In the data privacy policy, the governance group defines guidelines and practices that ensure the privacy of the data in each data product: guidelines on data collection, data sharing, data retention, and data anonymization. This can also cover data subject rights, for example the right to access, rectify, and delete personal data (PII). The data privacy policy ensures that data products comply with the applicable data protection regulations, such as GDPR and CCPA.
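Parts of such a privacy policy can be automated, for instance pseudonymising PII fields before a data product is published. A sketch using a salted hash; the field list and salt handling are simplified assumptions, not a compliance recipe:

```python
import hashlib

# Assumed classification from the (hypothetical) privacy policy.
PII_FIELDS = {"email", "name"}


def pseudonymise(record: dict, salt: str) -> dict:
    """Replace PII values with truncated salted SHA-256 digests;
    leave non-PII fields untouched."""
    out = {}
    for key, value in record.items():
        if key in PII_FIELDS:
            digest = hashlib.sha256((salt + str(value)).encode()).hexdigest()
            out[key] = digest[:16]
        else:
            out[key] = value
    return out


safe = pseudonymise({"email": "a@example.com", "amount": 10}, salt="s1")
print(safe["amount"], safe["email"] != "a@example.com")  # 10 True
```

Note that pseudonymisation alone does not satisfy deletion rights; a real implementation also needs key management so a subject's data can be rendered unrecoverable.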

In the compliance policy, the governance group defines guidelines and practices that ensure compliance with the legal and regulatory requirements related to data in the data mesh. The compliance policy ensures that each data product is managed in compliance with the applicable laws, regulations, and industry standards.

Each domain team has to follow the global policies and guidelines when building its data product. The self-serve data platform team provides all the necessary support to help domain teams fulfil these policies, and supplies the data platform required by each domain team across the organization.

The data platform team is responsible for building and operating the central self-serve data platform. The team’s primary goal is to provide a reliable, scalable, and secure data platform that meets the needs of the organization’s data consumers. The data platform team is usually a team of data engineers, data architects, data scientists, and other data professionals who specialize in data management, data governance, and data operations. This team builds the solutions and components that domain teams (both data producers and data consumers) can use to both create and consume data products. The data platform team enables a smooth development experience and reduces the complexity of building, deploying, and maintaining data products. The data platform should be governed according to the organization’s regulatory policies. They are also responsible for the day-to-day operations of the data platform, including monitoring, troubleshooting, and performance optimization. The self-service data platform provides:

Platform solutions: these consist of composable components for provisioning Azure resources. For example, the data platform team provides infrastructure-as-code (IaC) templates for creating Azure resources. Data domain teams onboarding onto the data mesh can use these IaC templates to create a standard set of Azure resources: Identity and Access Management (IAM) permissions, networking, security policies, big data services, the relevant Azure resource APIs, and so on, which enables faster data product development.
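The templates themselves would typically be Bicep or Terraform; conceptually, onboarding just fills domain-specific parameters into a shared template. A toy illustration of that parameterisation in Python (the resource naming scheme is invented):

```python
import json

# A simplified stand-in for a shared IaC parameter template
# that the platform team might ship to onboarding domains.
TEMPLATE = {
    "storage_account": "st{domain}data",
    "eventhub_namespace": "eh-{domain}",
    "resource_group": "rg-datamesh-{domain}",
}


def render_for_domain(domain: str) -> dict:
    """Produce the per-domain resource names a real IaC run would create."""
    return {key: value.format(domain=domain) for key, value in TEMPLATE.items()}


print(json.dumps(render_for_domain("cart")))
```

Because every domain renders from the same template, all products land in resources with consistent naming, tagging, and baseline security, which is exactly what makes the platform self-serve.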

Common services: These services provide data product discoverability, management, sharing, and observability. These services facilitate data consumers’ trust in data products, and are an effective way for data producers to alert data consumers to issues with their data products. For example:

The data catalogue is a centralized repository that maintains metadata about the different data products on the platform. It provides information about each product, such as the source, the schema, and the lineage of the data. You can use Azure Purview to logically onboard each data product into this central repository; it helps enforce data governance by registering metadata and enabling its discovery.
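In miniature, such a catalogue is a registry keyed by product name that supports registration and lookup; Azure Purview provides this at scale, but the core idea can be sketched as follows (all names are illustrative):

```python
catalog = {}  # in-memory stand-in for the central metadata repository


def register(name: str, source: str, schema: dict, lineage: list) -> None:
    """Register a data product's metadata so other teams can discover it."""
    catalog[name] = {"source": source, "schema": schema, "lineage": lineage}


def discover(term: str) -> list:
    """Return product names whose name or source mentions the search term."""
    return [n for n, meta in catalog.items() if term in n or term in meta["source"]]


register("checkout-orders", "azure-sql:checkout-db",
         {"order_id": "string"}, ["adf:checkout-ingest"])
register("cart-events", "cosmosdb:cart", {"cart_id": "string"}, [])
print(discover("cart"))  # ['cart-events']
```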

Access management is responsible for controlling access to the data stored on the platform. It ensures that only users with the necessary permissions can use the different data products. This includes identity and access management tools, authentication, and other authorization mechanisms implemented by the data platform team.
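At its core this is a mapping from identities to roles and from roles to product permissions. A toy role-based check; Azure AD would perform this in practice, and the role and user names here are invented:

```python
# Hypothetical role -> product grants, as a domain team might define in Azure AD.
ROLE_GRANTS = {
    "checkout-reader": {"checkout-orders"},
    "marketing-analyst": {"checkout-orders", "cart-events"},
}
USER_ROLES = {"alice": ["marketing-analyst"], "bob": []}


def can_read(user: str, product: str) -> bool:
    """True if any of the user's roles grants read access to the product."""
    return any(product in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, []))


print(can_read("alice", "cart-events"), can_read("bob", "cart-events"))  # True False
```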

Monitoring provides real-time insights into the platform’s performance, health, and usage. It helps you identify issues and bottlenecks such as slow queries, data inconsistencies, and other constraints, using logging, alerting, and dashboarding tools.

The policy automation module automates your data governance and compliance policies. It ensures that data is used and managed according to the organization’s regulatory policies; for example, data classification, retention, and deletion policies can all be automated.
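Retention is the easiest of these to sketch: records past their policy window are dropped (or archived) automatically. Assuming a hypothetical per-classification retention table:

```python
from datetime import date, timedelta

# Assumed policy values; a real table would come from the governance group.
RETENTION_DAYS = {"pii": 30, "transactional": 365}


def apply_retention(records: list, today: date) -> list:
    """Keep only records still inside their classification's retention window."""
    kept = []
    for rec in records:
        limit = timedelta(days=RETENTION_DAYS[rec["classification"]])
        if today - rec["created"] <= limit:
            kept.append(rec)
    return kept


records = [
    {"id": 1, "classification": "pii", "created": date(2023, 1, 1)},
    {"id": 2, "classification": "transactional", "created": date(2023, 1, 1)},
]
survivors = apply_retention(records, today=date(2023, 3, 1))
print([r["id"] for r in survivors])  # [2]
```

Running such a job on a schedule, driven by the classifications registered in the catalogue, is what "automated" means here.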

To comply with the policy that each product must publish its data on Azure Event Hub, the data platform team builds a common messaging platform (CMP). The team exposes interfaces for creating topics, requesting topic access, and so on across the organization, and domain teams use these interfaces to publish and consume data from the CMP.
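Stripped of the Azure Event Hub specifics, the interface the platform team exposes boils down to topic creation, publish, and consume. An in-memory sketch of that surface (the method names are invented, not the real platform API):

```python
class CommonMessagingPlatform:
    """Toy stand-in for the shared messaging platform built on Azure Event Hub."""

    def __init__(self):
        self.topics = {}

    def create_topic(self, name: str) -> None:
        """Provision a topic; idempotent if it already exists."""
        self.topics.setdefault(name, [])

    def publish(self, topic: str, message: dict) -> None:
        """Append a message; domains may only publish to provisioned topics."""
        if topic not in self.topics:
            raise KeyError(f"topic {topic!r} not provisioned")
        self.topics[topic].append(message)

    def consume(self, topic: str) -> list:
        """Return all messages on the topic (a real consumer would use offsets)."""
        return list(self.topics.get(topic, []))


cmp_platform = CommonMessagingPlatform()
cmp_platform.create_topic("checkout-orders")
cmp_platform.publish("checkout-orders", {"order_id": "o-1"})
print(cmp_platform.consume("checkout-orders"))  # [{'order_id': 'o-1'}]
```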

You can evolve your self-serve data platform with time and add more services and interfaces for the domain teams.

Overall, data mesh brings a paradigm shift in how organizations approach data management, promoting autonomy, collaboration, and scalability. To adopt a data mesh architecture, start by understanding the business domains. Microsoft Azure provides the cloud services to build data products for those domains, and Azure resources can be used to build both the federated governance and the domain-agnostic self-serve data platform.


Kundan Kumar

Kundan is a senior software consultant at NashTech. He enjoys learning and working on new technologies. He is a Big Data enthusiast and has worked on Snowflake, Spark, Flink, Apache Beam, BigQuery, GCP data flow, Kafka etc.
