
Tale of Three Data Houses: Data WAREHOUSE, Data Lake & Data LakeHOUSE

Raviyanshu Pratap Singh

Before diving into the technical aspects of each of these three terms, let's use an analogy of building a library to understand them better.

Analogy: Building a Library

Traditional Library

Suppose you are building a library and you start with the traditional idea that your library should have a well-organized structure, with books neatly arranged on shelves according to a predefined system. You hire a librarian who curates the collection and makes sure the books are placed on their respective shelves. Old and iconic books are carefully selected and processed to ensure consistency and reliability.

With this system, users can easily find and access specific books through the library's cataloging system. Notice that the focus here is on the retrieval and analysis of structured information.

Unsorted Book Collection

Now your library is growing, which is a good sign, but with this growth you need to bring in more reading material. So you start storing books with a stacking method, in which books are placed on the shelves without any particular order or processing. The books may differ in format, such as hardcovers, paperbacks, or loose pages, and can span different genres, languages, and topics.

With this stacked system, users can explore and analyze the books as needed, based on their specific requirements. The focus here is on flexibility and scalability.
But you soon notice that some of the old and iconic books that require special care are not getting it. They suffer wear and tear, and your most essential books never make it to the shelves.

A Hybrid Library

You soon realize that along with flexibility and scalability, you also need traditional library features, like a cataloging system for the retrieval and analysis of structured information. With this hybrid method, your library maintains a structured core while also incorporating a flexible collection of books. Think of this system as follows:

The core library follows a predefined organization system, similar to a data warehouse, where books are cataloged, categorized, and easily retrievable. The library also includes a section where unsorted books are stored, representing the data lake aspect. These unsorted books can be accessed and analyzed in their raw form or transformed as needed.

Keeping this analogy in mind, let's look at all three in detail:

Data Warehouse

A data warehouse is a centralized repository that organizes and stores structured data from various sources. It is designed for query and analysis purposes and provides a high level of data reliability, consistency, and integrity, just like our core library's focus on the retrieval and analysis of structured information.

Some of the key features that a data warehouse possesses are:

  • Structured data: Data warehouses typically handle structured data, which is organized into tables and follows a predefined schema. The data warehouse contains a staging area where the data from different sources can be brought together into one common format.
  • ETL (Extract, Transform, Load): The ETL process is commonly used in data warehousing to extract data from source systems, transform it into a consistent format, and load it into the warehouse. ETL tools can perform several other tasks on the ingested data before the final load, such as eliminating duplicate records (a minimal sketch follows this list).
  • Schema-on-write: Data must conform to a predefined schema before being loaded into the warehouse, ensuring data integrity.
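
To make the ETL step concrete, here is a minimal sketch in Python using pandas. The file names, column names, and the SQLite database standing in for a warehouse are all hypothetical:

```python
import pandas as pd
import sqlite3

# Extract: pull raw order records from two hypothetical source exports
sales = pd.read_csv("sales_export.csv")   # e.g. from the sales system
crm = pd.read_csv("crm_export.csv")       # e.g. from the CRM

# Transform: bring both sources into one common format (the staging
# area's job) and eliminate duplicate records
sales = sales.rename(columns={"cust_id": "customer_id"})
combined = pd.concat([sales, crm], ignore_index=True)
combined = combined.drop_duplicates(subset=["customer_id", "order_id"])

# Load: write the cleaned batch into the warehouse table
# (a SQLite database stands in for a real warehouse engine here)
with sqlite3.connect("warehouse.db") as conn:
    combined.to_sql("fact_orders", conn, if_exists="append", index=False)
```

A real pipeline would add validation and schema checks before the load, since schema-on-write means the data must already conform when it lands in the warehouse.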

Data Warehouse Architecture

The data warehouse architecture consists of three main components:

  • Data Sources: In this first layer of the architecture, data is extracted from various operational systems, such as databases, or from external sources. These sources can take different forms, such as sales data, customer information, and transaction records, and the data can arrive in different structures and formats.
  • Data Integration: In the second layer, the data extracted from the different sources is transformed and integrated into a consistent format for analysis. This is where the major transformations happen: cleaning the data, resolving inconsistencies, removing duplicates, and, most importantly, ensuring data quality.
  • Data Storage & Access: Once the data is integrated, it is stored in a structured manner within the data warehouse as tables, rows, and columns, similar to a traditional database. Once the data has this structure, you can easily access it with a query language like SQL, as in the sketch below.
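
Continuing the hypothetical SQLite stand-in from the earlier ETL sketch, access is then plain SQL:

```python
import sqlite3

# Query the structured warehouse table with plain SQL
# (SQLite again stands in for a real warehouse engine)
with sqlite3.connect("warehouse.db") as conn:
    rows = conn.execute(
        """
        SELECT customer_id, COUNT(*) AS order_count
        FROM fact_orders
        GROUP BY customer_id
        ORDER BY order_count DESC
        LIMIT 10
        """
    ).fetchall()

for customer_id, order_count in rows:
    print(customer_id, order_count)
```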

In addition to the above three layers, the data warehouse architecture can include further components, such as metadata management, which stores information about the data in the warehouse (e.g., data definitions, data lineage, and data transformations), and data governance processes to ensure data security, privacy, and compliance.

Taken as a whole, the data warehouse architecture helps you transform a business's diverse data from various sources into a structured and accessible format, empowering users to extract valuable information and gain a better understanding of their operations, customers, and overall business performance.

Why is the Data Warehouse losing popularity?

In the modern world, the fast rise of the internet, social media, and multimedia devices has completely upended the data landscape, and as a result a new term has emerged: big data. Big data can be defined in terms of four Vs:

Data that arrives in higher Volumes, with more Velocity, along with a greater Variety of formats and a higher Veracity

The data warehouse fails to address these four Vs. It suffers from both storage and scalability issues, since its architecture lacks in-memory and parallel processing techniques, preventing it from scaling out horizontally. It also lacks support for streaming architectures and struggles with real-time data, making it a poor candidate for addressing Velocity.

Data warehouses are also not well suited to storing and querying a variety of unstructured data, and lastly, they focus mainly on schema and less on lineage, data quality, and other Veracity concerns.

Data Lake

A data lake acts as a storage repository that holds vast amounts of raw, unprocessed data. It is designed to store both structured and unstructured data in its native format, providing flexibility and scalability just like the unsorted book collection in your library.

Some of the key features of a data lake include:

  • Raw Data Storage: Data lakes store data in its raw, original form, without requiring structuring or transformation first. At the lowest level, this is possible because the data is stored as blobs (binary large objects). Blobs are by nature unstructured, which enables the storage of semi-structured and unstructured data in the data lake.
  • Schema-on-read: Data lakes allow for a schema-on-read approach, where data is interpreted at the time of retrieval, enabling flexibility in data exploration and analysis (see the sketch after this list).
  • Hadoop-based technology: Data lakes often leverage Hadoop-based technologies, such as Apache Hadoop and Apache Spark, for distributed storage and processing capabilities.
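
Here is a minimal PySpark sketch of schema-on-read; the lake path and field names are hypothetical. The point is that structure is declared only when the data is read, never when it is written:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, DoubleType

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# The lake directory holds raw JSON files dumped as-is;
# no schema was enforced when they were written
raw_path = "/data/lake/events/"  # hypothetical location

# Schema-on-read: structure is declared only at query time
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_type", StringType()),
    StructField("amount", DoubleType()),
])

events = spark.read.schema(schema).json(raw_path)
events.groupBy("event_type").count().show()
```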

Why don’t we just use a data lake?

A fair question: if it can store raw data and handle streaming data, why abandon it? While it is easy to ingest data in its raw format, transforming it afterwards into something of business value can be a really expensive process. Traditional data lakes also suffer from poor query latency, which makes them unsuitable for interactive queries; to work around this, data teams have been running a mixed architecture of a data lake alongside a data warehouse.

The schema-on-read strategy, where data can be ingested in any format without schema enforcement, can also lead to data quality issues, allowing a valuable data lake to degenerate into a “data swamp”.

On top of all this, the data lake does not offer any transactional guarantees. Since data files can only be appended to, even a simple update requires an expensive rewrite of previously written data.
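
To make that cost concrete, here is a sketch of what a one-row “update” looks like against a plain Parquet file in a lake (the path and column names are hypothetical):

```python
import pandas as pd

# Parquet files in a plain lake are immutable: there is no in-place
# UPDATE, so changing one row means rewriting the whole file
path = "lake/orders/part-0001.parquet"  # hypothetical file

df = pd.read_parquet(path)                                # read it all back
df.loc[df["order_id"] == "A-42", "status"] = "cancelled"  # one-row change
df.to_parquet(path, index=False)                          # rewrite everything
```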

Data Lakehouse

The data lakehouse offers the best of both worlds: a unified approach to data management that combines the scalability and flexibility of a data lake with the reliability and performance of a data warehouse. It is built on low-cost, directly accessible storage and provides analytics DBMS and performance features such as ACID transactions, data versioning, auditing, caching, and query optimization.

How can we build a Data Lakehouse solution?

A possible data lakehouse solution can be built on top of Delta Lake. Delta Lake is an open-source, file-based framework that provides ACID transactions, scalable metadata handling, a unified processing model that spans batch and streaming, a full audit history, and support for SQL DML statements.
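
As an illustration, here is a minimal PySpark sketch of writing a Delta table and running SQL DML against it. It assumes the delta-spark package and its jars are available; the table path and records are hypothetical:

```python
from pyspark.sql import SparkSession

# These two settings register Delta's SQL extensions and catalog
spark = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions",
            "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# A small hypothetical batch of order records
df = spark.createDataFrame(
    [("A-42", "shipped"), ("A-43", "pending")],
    ["order_id", "status"],
)

# An ACID write: concurrent readers see either the old table version
# or the new one, never a partially written mix
df.write.format("delta").mode("overwrite").save("/tmp/delta/orders")

# SQL DML works directly against the table, unlike plain Parquet files
spark.sql(
    "UPDATE delta.`/tmp/delta/orders` "
    "SET status = 'cancelled' WHERE order_id = 'A-42'"
)
```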

Some key features that make Delta Lake stand out are:

  • ACID Transactions: Delta Lake supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, which ensure data integrity and consistency even in the face of concurrent read and write operations.
  • Schema Evolution: Delta Lake allows for schema evolution, which means that you can easily evolve the schema of your data over time without interrupting or restructuring the existing data. This flexibility enables you to accommodate changes in your data requirements without having to perform complex migrations or transformations.
  • Time Travel: The Time Travel feature in Delta Lake enables you to access historical versions of your data. It provides the ability to query data as it existed at specific points in time, facilitating data auditing, compliance, and debugging tasks. Time Travel allows you to easily track changes, roll back to previous versions, and compare data snapshots (see the sketch after this list).
  • Data Versioning and Optimized Reads: Delta Lake maintains versions of the data through its transaction log, which records only the changes between versions rather than full copies, reducing storage costs while enabling efficient, optimized reads. This improves the speed of data retrieval, especially when working with large datasets.
  • Data Quality and Data Governance: Delta Lake provides built-in data quality features, such as data validation and constraint enforcement. It allows you to define and enforce data quality rules, ensuring the integrity and reliability of your data.
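
Continuing the hypothetical table from the earlier sketch, Time Travel and schema evolution look like this (the version number and new column are illustrative):

```python
# Time Travel: read the table as it existed at an earlier version
v0 = (spark.read.format("delta")
      .option("versionAsOf", 0)
      .load("/tmp/delta/orders"))
v0.show()

# Schema evolution: append a batch that carries a new column,
# letting the table's schema evolve instead of failing the write
new_batch = spark.createDataFrame(
    [("A-44", "pending", "web")],
    ["order_id", "status", "channel"],  # 'channel' is the new column
)
(new_batch.write.format("delta")
 .mode("append")
 .option("mergeSchema", "true")
 .save("/tmp/delta/orders"))
```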

Architecture and Components

A data lakehouse architecture is made up of three layers. The first is the storage layer, built on top of cloud object stores such as ADLS and S3, providing high scalability at low cost. The second is the transactional layer, provided by Delta Lake, which adds ACID guarantees, DML support, and scalable metadata processing.

The top layer consists of high-performance query engines, such as Apache Spark and Apache Hive, which harness the power of the underlying cloud computing resources.

Use Cases of Data Lakehouse

  • Advanced Analytics: Organizations can leverage the Data Lakehouse to perform advanced analytics, including machine learning, predictive modeling, and anomaly detection. The availability of large and diverse datasets enables the discovery of valuable insights and patterns.
  • Real-time Personalization: With the ability to process and analyze data in real time, organizations can personalize customer experiences, optimize marketing campaigns, and deliver targeted recommendations.
  • Internet of Things (IoT): The Data Lakehouse architecture is well-suited for managing and analyzing IoT data generated by sensors, devices, and machines. It allows organizations to derive actionable insights, monitor performance, and optimize processes in real time.

To Summarize

Just like the hybrid library solution that helped you build a scalable library while keeping better insight into your collection, the Data Lakehouse represents a significant evolution in data management and analytics, offering organizations a unified and scalable platform to unlock the full potential of their data. By combining the strengths of data lakes and data warehouses, the Data Lakehouse empowers businesses to achieve real-time analytics, improve decision-making, and fuel innovation.
