Why Delta Lake
In this blog post, we will try to understand what Delta Lake is, why we should consider it, and finally whether it is worth integrating into our existing systems.
Delta Lake provides reliability and performance improvements over existing data lakes. It combines the strengths of the data lake and the data warehouse: the scalability of a data lake with the performance and reliability of a data warehouse/database.
First we will go through the limitations of the data warehouse and the data lake, and then see how Delta Lake helps us overcome those challenges.
Limitations of the Data Warehouse
Before Big Data, we used data warehouses to build reports for business use cases. A data warehouse works well when data volume and velocity are low and the data is structured, but it struggles with unstructured data and with high data volume and velocity. It is not well suited for the following:
- Processing semi-structured and unstructured data like voice, audio, and IoT device messages.
- Streaming data applications to provide near real-time analysis.
- Handling exponential growth in data volumes, because of storage and scalability constraints.
Limitations of the Data Lake
A data lake is a central repository that can store all types of structured, semi-structured, and unstructured data. It can handle big data volume and velocity, but it lacks some basic features that a data warehouse provides, such as ACID transactions and schema-on-write, which can lead to bad data being written into it.
Delta Lake Features
Support for ACID transactions
Delta Lake supports ACID (Atomicity, Consistency, Isolation, Durability) transactions, which ensure data integrity and consistency even in the presence of concurrent read and write operations. This is especially important when dealing with complex data processing pipelines or when multiple users are accessing and modifying the data simultaneously.
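Here is a minimal sketch, assuming spark is an existing SparkSession (as in a Databricks notebook or spark-shell) and using a hypothetical path and data: each write below either commits fully to the transaction log or not at all, and concurrent writers are coordinated through optimistic concurrency control.
// illustrative path and data, not the bike dataset used later in this post
import spark.implicits._

val acidPath = "/tmp/acid-demo-delta"

// this write either fully commits to the transaction log or not at all
Seq((1, "trip-a"), (2, "trip-b")).toDF("id", "name")
  .write.format("delta").mode("overwrite").save(acidPath)

// an append from another job running at the same time is serialized by
// optimistic concurrency control, so readers never see partial data
Seq((3, "trip-c")).toDF("id", "name")
  .write.format("delta").mode("append").save(acidPath)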
Schema Enforcement and Evolution
Delta Lake enforces the schema on write when writing data to storage. Columns and their data types are therefore maintained, and writing bad data into storage can be avoided.
Data in Delta Lake carries a schema, and as business needs change we need to evolve that schema over time. Delta Lake supports adding, deleting, or modifying columns in existing tables while maintaining backward compatibility with previous versions of the data. This flexibility is very useful when dealing with evolving business and data needs.
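A minimal sketch of both behaviours, using a small hypothetical table (not the bike dataset used later in this post):
import spark.implicits._

val schemaPath = "/tmp/schema-demo-delta"

// create a table with two columns
Seq((1, "red"), (2, "blue")).toDF("id", "color")
  .write.format("delta").mode("overwrite").save(schemaPath)

// schema enforcement: appending a frame with an extra column fails with
// an AnalysisException instead of silently writing bad data
// Seq((3, "green", 9.99)).toDF("id", "color", "price")
//   .write.format("delta").mode("append").save(schemaPath)

// schema evolution: opt in with mergeSchema to add the new column
Seq((3, "green", 9.99)).toDF("id", "color", "price")
  .write.format("delta").mode("append")
  .option("mergeSchema", "true")
  .save(schemaPath)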
Supports DML (Data Manipulation Language)
Delta Lake supports DML (insert, update, delete) operations like SQL, along with more complex merge and upsert scenarios. It ensures that data modifications are performed atomically and consistently. Delta Lake maintains a transaction log, which allows you to roll back changes, query previous versions of the data, and recover from failures.
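Here is a minimal sketch of update, delete, and merge using the io.delta.tables Scala API; the path, column names, and data are hypothetical.
import io.delta.tables.DeltaTable
import org.apache.spark.sql.functions._
import spark.implicits._

val dmlPath = "/tmp/dml-demo-delta"
Seq((1, "pending"), (2, "pending")).toDF("id", "status")
  .write.format("delta").mode("overwrite").save(dmlPath)

val orders = DeltaTable.forPath(spark, dmlPath)

// UPDATE rows in place
orders.update(col("id") === 1, Map("status" -> lit("done")))

// DELETE rows matching a condition
orders.delete(col("status") === "pending")

// MERGE (upsert) new data into the table
val updates = Seq((1, "archived"), (3, "new")).toDF("id", "status")
orders.as("t")
  .merge(updates.as("u"), "t.id = u.id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()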
Time Travel
The Delta Lake transaction log keeps information about every change made to the data, in order of execution. Using this log, Delta Lake allows users to query the data as of a specific point in time or to view historical versions of it. This feature is useful for auditing, debugging, or performing analyses based on how the data has changed over time.
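As a sketch, once the /tmp/bikes-delta table created later in this post has a few versions, it can be read as of a version or a timestamp (the values below are illustrative):
// read the table as it was at a specific version
val v0 = spark.read.format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/bikes-delta")

// or as it was at a specific point in time
val lastWeek = spark.read.format("delta")
  .option("timestampAsOf", "2024-01-01 00:00:00")
  .load("/tmp/bikes-delta")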
Integrating Batch and Streaming into One Model
Delta Lake allows us to work seamlessly with both batch and streaming data within the same data lake. It integrates with Apache Spark, which can read from and write to Delta tables, so a Delta table can serve as both a source and a sink for a Spark application. Since Spark is good at both batch and stream processing, Delta Lake supports batch and streaming together with full transactional and consistency guarantees.
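A minimal sketch of the same table format serving as both a streaming source and a streaming sink (paths and checkpoint location are hypothetical):
import org.apache.spark.sql.streaming.Trigger

// use a Delta table as a streaming source
val events = spark.readStream
  .format("delta")
  .load("/tmp/events-delta")

// use another Delta table as the streaming sink
val query = events.writeStream
  .format("delta")
  .option("checkpointLocation", "/tmp/events-agg-delta/_checkpoint")
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start("/tmp/events-agg-delta")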
Scalable Metadata Handling
For big data applications like Apache Spark, handling metadata at scale becomes important, and Delta Lake does this with the help of its transaction log.
Delta Lake uses a transaction log to track all the changes made to a table. This log is an append-only file that captures the metadata changes, data modifications, and structural updates. By using a transaction log, Delta Lake ensures durability and consistency of metadata operations while enabling efficient metadata updates and query optimizations. It creates a checkpoint file after every ten transactions with a full transactional state.
A Delta Lake reader can process this checkpoint file plus the few transaction log entries created after it, which results in fast and scalable metadata handling.
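For example, assuming Delta Lake's SQL extensions are enabled (as on Databricks), the commit history of the /tmp/bikes-delta table created later in this post can be inspected directly; the log itself lives in the _delta_log directory next to the data.
// commit history (version, timestamp, operation, ...) read from the log
spark.sql("DESCRIBE HISTORY delta.`/tmp/bikes-delta`").show(truncate = false)

// the log is stored alongside the data as JSON commits plus periodic
// Parquet checkpoints, e.g.
//   /tmp/bikes-delta/_delta_log/00000000000000000000.json
//   /tmp/bikes-delta/_delta_log/00000000000000000010.checkpoint.parquet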
High Compatibility and Rich Ecosystem Integration
It is highly compatible with big data tools such as Apache Spark, Apache Kafka, Apache NiFi, and Apache Flume.
Delta Lake supports SQL-based access, allowing you to interact with Delta Lake tables using SQL queries. You can use SQL commands to create tables, insert data, perform updates, deletes, and other data manipulation tasks.
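A minimal sketch of this SQL-based access, assuming a Spark session with Delta Lake's SQL support enabled and using a hypothetical table name:
spark.sql("CREATE TABLE IF NOT EXISTS bike_trips (id INT, duration INT) USING DELTA")

spark.sql("INSERT INTO bike_trips VALUES (1, 340), (2, 1200)")
spark.sql("UPDATE bike_trips SET duration = 360 WHERE id = 1")
spark.sql("DELETE FROM bike_trips WHERE duration > 1000")

spark.sql("SELECT * FROM bike_trips").show()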
It’s built on top of the open-source Apache Parquet columnar file format. Parquet is widely supported in the big data ecosystem and provides efficient compression and schema evolution.
Benefits of Delta Lakehouse
- With the Data Lakehouse architecture, data engineers can store a single copy of the data in the data lake, acting as a single source of truth.
- It also reduces unnecessary data movement between the data lake and data warehouse storage, thus minimizing cost.
- A Data Lakehouse can handle BI, analytics reports, data science, and machine learning together.
- Thus, the Data Lakehouse solves the shortcomings of both the data lake and the data warehouse by combining their strengths, and it is the way forward for data teams.
Working with the Delta Format
Create a Delta Table from CSV
// read data from a CSV file
val bikedata = spark.read.format("csv")
  .option("header", true)
  .option("inferSchema", true)
  .load("/databricks-datasets/bikeSharing/data-001/data.csv")
// save to delta format
bikedata.write.format("delta").mode("overwrite").save("/tmp/bikes-delta")
Read a Delta Table
// read a delta table
val deltatable = spark.read.format("delta").load("/tmp/bikes-delta")
Summary
We have seen the downsides of the data warehouse in terms of scalability and big data handling, and the limitations of the data lake in terms of ACID guarantees and other features. The lakehouse architecture emerged as a solution to most of the challenges faced by the traditional data warehouse and data lake.
Whenever we design a new system, or improve an existing one, that has to handle big data volume and velocity, provide ACID transaction guarantees, serve both ML and business analytics, and unify streaming and batch processing, we should consider a lakehouse backed by Delta Lake.
In addition to the reliability and performance benefits, switching to Delta Lake can save storage cost by avoiding multiple copies of the data, and it makes it easier for developers to manage the system and debug issues.