Unlocking Data Insights: A Comprehensive Guide to DataHub

Manish Mishra

Introduction

What is DataHub?

DataHub is a modern data catalog built for end-to-end data observability, data discovery, and data governance. More generally, a data hub is a centralized system for data definition, delivery, and storage. DataHub can ingest metadata from different sources such as Databricks, Postgres, Spark, and BigQuery, and it is used for data governance.

Key Objectives of DataHub

Search all corners of the data stack
Track end-to-end lineage
Manage Entity Ownership
Govern with Tags, Glossary Terms, and Domains
Create Users, Access Policies & Groups
Data Quality Checks

Core Components of DataHub

DataHub User Interface

The DataHub User Interface serves as a user-friendly tool for managing data efficiently. It allows users to invite others, create groups, and assign specific roles and permissions for controlled data access. The interface also supports easy data ingestion and the implementation of policies to ensure organized and secure data handling. Overall, it provides a straightforward solution for organizations to manage their data effectively.

Some of the basic terms are

1. Users

DataHub lets you add users to the platform. To add a new user, click the Settings icon on the home page, then click Users & Groups in the left panel.

Now click Invite Users, copy the URL, and share it with the user you want to invite to the platform. You can also assign a role, such as Editor or Reader.

2. Group

A group is a set of two or more users who work on the same functionality or as a team. To create a group, click Create Group, name the group, describe the purpose for which it was created, and click Create.

3. Roles & Permissions

In DataHub, we define the role and permissions each user has on the data ingested from other sources.

There are three main roles:

Admin: A user who has all permissions.
Editor: A user who can manage the data, the data owner, etc.
Reader: A user who can view the data, its quality, data lineage, etc.
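To make the relationship between the three roles concrete, here is a minimal sketch in Python. It is only an illustration of the idea that each role is a superset of the one below it. The capability names (`view_metadata`, `edit_metadata`, `manage_platform`) are hypothetical and are not DataHub's actual privilege names or API.

```python
from enum import Enum


class Role(Enum):
    ADMIN = "Admin"
    EDITOR = "Editor"
    READER = "Reader"


# Hypothetical capability names, chosen only for this illustration.
VIEW = "view_metadata"
EDIT = "edit_metadata"
MANAGE_PLATFORM = "manage_platform"

# Each role maps to the set of capabilities it is assumed to include.
ROLE_CAPABILITIES = {
    Role.ADMIN: {VIEW, EDIT, MANAGE_PLATFORM},
    Role.EDITOR: {VIEW, EDIT},
    Role.READER: {VIEW},
}


def can(role: Role, capability: str) -> bool:
    """Return True if the role includes the given capability."""
    return capability in ROLE_CAPABILITIES[role]


print(can(Role.READER, VIEW))   # True
print(can(Role.READER, EDIT))   # False
print(can(Role.EDITOR, EDIT))   # True
```

In the real platform these roles are assigned from the UI as described above. The sketch only shows why a Reader can browse an asset but cannot change its owner.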

Data Ingestion

DataHub supports many different sources from which we can ingest metadata, such as Databricks, BigTable, Hive, Postgres, and Great Expectations. For the ingested assets, we can observe the data, data lineage, data quality, schema, owner of the data, etc.

Types of Data Ingestion Sources

Databricks
Apache Spark
Great Expectations
BigQuery
DynamoDB
Postgres
PowerBI and many more

Ingesting data from UI

1. Click the Ingestion tab in the DataHub UI.

2. Click Create New Source. DataHub will suggest different sources. We will use a Postgres data source as an example.

3. Search for Postgres, then specify the configuration of the Postgres source: host and port, username, password, and database name.

4. In this step, you can specify which tables, views, and schemas you want to allow or deny for ingestion.

5. In the advanced options, you can enable or disable tables and views for data lineage, and you can also turn on profiling (a sample of the data) for the ingested data.

6. Click Next, set the schedule, and finish.

7. Now click Run and check the status of the ingestion.

Similarly, you can do it for other data sources as per their configuration.
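The choices made in the UI steps above correspond to an ingestion recipe with a source and a sink. The following is a minimal sketch of such a recipe built as a plain Python dictionary, so it runs without any DataHub packages installed. The host, database name, credentials, and server URL are placeholder assumptions, and the `is_allowed` helper is a hypothetical approximation of how allow and deny patterns are evaluated (deny wins, patterns are matched from the start of the name). It is not DataHub's own code.

```python
import json
import re


def build_postgres_recipe(host_port, database, username, password, datahub_server):
    """Build an ingestion recipe as a plain dict (no DataHub packages needed)."""
    return {
        "source": {
            "type": "postgres",
            "config": {
                # Step 3: connection details (all values are placeholders).
                "host_port": host_port,
                "database": database,
                "username": username,
                "password": password,
                # Step 4: choose what to ingest with allow/deny regex patterns.
                "schema_pattern": {"allow": ["public"], "deny": []},
                "table_pattern": {"allow": [".*"], "deny": [r".*\.tmp_.*"]},
                # Step 5: toggle tables, views, and profiling.
                "include_tables": True,
                "include_views": True,
                "profiling": {"enabled": True},
            },
        },
        "sink": {"type": "datahub-rest", "config": {"server": datahub_server}},
    }


def is_allowed(name, pattern):
    """Approximate allow/deny filtering: deny wins, then any allow must match."""
    if any(re.match(p, name, re.IGNORECASE) for p in pattern.get("deny", [])):
        return False
    return any(re.match(p, name, re.IGNORECASE) for p in pattern.get("allow", [".*"]))


recipe = build_postgres_recipe(
    host_port="localhost:5432",
    database="shop",
    username="datahub_reader",
    password="change-me",
    datahub_server="http://localhost:8080",
)
table_pattern = recipe["source"]["config"]["table_pattern"]
print(is_allowed("shop.public.orders", table_pattern))      # True
print(is_allowed("shop.public.tmp_orders", table_pattern))  # False
print(json.dumps(recipe, indent=2))
```

If the `acryl-datahub` package and its Postgres plugin are installed, a dictionary of this shape can be passed to `Pipeline.create(recipe)` from `datahub.ingestion.run.pipeline` and executed with `.run()`. That step is left out here because it needs a running Postgres database and DataHub server.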

Monitoring & Reporting Tools

Many monitoring and reporting tools can be used with DataHub. Here we discuss Great Expectations, which can be used to monitor data quality while ingesting data from other sources. It also records which data is valid and which is not.

You can refer to this link to integrate Great Expectations with other data sources for quality checks.

If you want to integrate Great Expectations with the DataHub platform, specify the lines below in the Great Expectations checkpoint YAML file (refer to the link above).

- name: datahub_action
  action:
    module_name: datahub.integrations.great_expectations.action
    class_name: DataHubValidationAction
    server_url: http://35.240.***.***:8080
    platform_alias: databricks
    # Maps the Great Expectations datasource name to a DataHub platform instance.
    # Find the platform instance by visiting the data asset in the DataHub platform.
    platform_instance_map:
      cart_datasource: 558*******524.metastore.main
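If the checkpoint is assembled in code rather than edited by hand, the same entry can be appended to the checkpoint's `action_list`. The sketch below uses plain Python dictionaries only. The checkpoint name, the `localhost` server URL, and the `my_platform_instance` value are placeholder assumptions, and the two helper functions are hypothetical names introduced for this example.

```python
import copy


def datahub_action(server_url, platform_alias=None, platform_instance_map=None):
    """Build the action entry shown in the YAML above as a Python dict."""
    action = {
        "module_name": "datahub.integrations.great_expectations.action",
        "class_name": "DataHubValidationAction",
        "server_url": server_url,
    }
    if platform_alias:
        action["platform_alias"] = platform_alias
    if platform_instance_map:
        action["platform_instance_map"] = dict(platform_instance_map)
    return {"name": "datahub_action", "action": action}


def with_datahub_action(checkpoint_config, entry):
    """Return a copy of the checkpoint config with the entry appended once."""
    updated = copy.deepcopy(checkpoint_config)
    actions = updated.setdefault("action_list", [])
    if all(a.get("name") != entry["name"] for a in actions):
        actions.append(entry)
    return updated


# A stripped-down checkpoint config with one existing action (placeholder values).
checkpoint = {
    "name": "cart_checkpoint",
    "action_list": [
        {"name": "store_validation_result",
         "action": {"class_name": "StoreValidationResultAction"}},
    ],
}
entry = datahub_action(
    server_url="http://localhost:8080",
    platform_alias="databricks",
    platform_instance_map={"cart_datasource": "my_platform_instance"},
)
updated = with_datahub_action(checkpoint, entry)
print([a["name"] for a in updated["action_list"]])
# ['store_validation_result', 'datahub_action']
```

The helper returns a copy and skips the append when an action with the same name already exists, so calling it twice does not register the DataHub action twice.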

Conclusion

DataHub is a modern data catalog that offers end-to-end observability, discovery, and governance. With a focus on comprehensive data management, it provides centralized user administration, group collaboration, and precise roles and permissions. It supports diverse data sources and enables ingestion with detailed insights into data lineage, quality, and ownership. Its integration with monitoring tools, exemplified by Great Expectations, adds data quality checks on top of the catalog. Overall, DataHub helps organizations navigate the complexities of modern data ecosystems efficiently.


Manish Mishra

Manish Mishra is a Software Consultant with a focus on Scala, Apache Spark, and Databricks, and is proficient in using the Great Expectations tool to ensure robust data quality. Manish is passionate about leveraging cutting-edge technologies to solve complex challenges in the dynamic field of data engineering.
