NashTech Blog

Unlocking Time Travel: Delta Lake in DatabrickS

Table of Contents
an artist s illustration of artificial intelligence ai this image represents the boundaries set in place to secure safe accountable biotechnology it was created by khyati trehan as pa

Delta Lake in Databricks now automatically versions the large datasets stored in your data lake, allowing seamless access to any historical data version. Time travel feature simplifies your data pipeline by providing easy auditing, the ability to roll back data in case of accidental errors or deletions, and the capability to reproduce experiments and reports.

Overview of Delta Lake

Delta Lake is an open source storage layer that brings reliability to data lakes.It introduces capabilities like ACID transactions, efficient handling of metadata at scale, and the fusion of streaming and batch data processing. It seamlessly operates on your existing data lake infrastructure and is fully aligned with the Apache Spark APIs.

Delta Lake uses transaction logging to keep track of changes in your data. This feature enables you to access historical versions of your evolving data, essentially allowing you to “time travel” within the Delta table to view previous data snapshots. Additionally, it provides benefits for auditing, logging, and efficient data tracking.

Need of Time Travel

1. Version Tracking:  Time travel becomes crucial for scenarios where one needs to identify the dataset version used in the last model training or track improvements in model performance linked to a specific dataset.

2. Reproduce experiments and reports: ML engineers aim to develop multiple models using a given dataset. However, when they attempt to reproduce these models after a certain period, they face difficulties. This is mainly due to changes in the source data over time, making it a challenge to accurately replicate their experiments.

3. Rollbacks -The capability to revert to a specific dataset version, whether for improved model performance or to address any issues that may have arisen.

Different ways to achieve Time Travel in Databricks

You can achieve Time Travel in Databricks by

1. Specifying a Timestamp: Timestamps are a data type used to store a date and time value. We can use these timestamp values to view the previous data of the table and rollback to it as well.

Command: Select * from table_name TIMESTAMP AS OF “YYYY-MM-DD HH:MM:SS”

2. Indicating Version Number: A unique version number is created for each successfully executed operation when the SQL query runs. We can use this version number to view the previous data of the table and rollback to it as well.

Command: 
Select * from table_name VERSION AS OF “version_number”
or
Select* from table_name@v2

Implementation

Let’s understand each of the ways with the help of an example.

1. Create a Delta table “person” and insert values in it.

Command:

create table person(id int, name string, salary double);
insert into table person values(01,”Sahil”,1200.00),(02,”Rishi”,16800.00),(03,”Ravi”,11000).

Table person has been created Successfully.

2. Perform some update, delete operations on the person table.

update person set salary=12000 where name=”Sahil”;

delete from person where id=3;

3. Run the below SQL query to view the data history and operations performed on the table.

Command: DESCRIBE HISTORY person
–-This command returns the version number ,timestamp, operations performed on the table.

As we can see above a total of 4 versions are created.

0 – Create Table operation at 2024-01-18T08:29:27Z
1 – WRITE operation at 2024-01-18T08:29:33Z
2 – UPDATE operation at 2024-01-18T08:41:13Z
3 – DELETE operation at 2024-01-18T08:42:00Z

Now, let’s consider the scenario where we want to switch back to an earlier version to inspect the data before the delete operation took place. This time-travel capability can be achieved through a timestamp or version number.

1. Using Timestamp

Command: Select * from person TIMESTAMP AS OF “2024-01-18T08:41:13Z”

2. Using Version Number

Command: Select * from person VERSION AS OF 2

Rollback Versions:

To revert to the previous state, utilize the RESTORE command. Restore command helps you to change the current state of the table. For more information click here.

Command:
RESTORE TABLE person TO TIMESTAMP AS OF “2024-01-18T08:29:33Z”
or
RESTORE TABLE person TO VERSION AS OF 1

Display the current data of the person table.

Conclusion

Delta’s time travel is a real game-changer for developers. It helps data scientists organize their experiments, allows data engineers to fix mistakes easily in their pipelines, lets data analysts generate reports effortlessly, and even allows everyone to roll back to a previous state when needed. It’s a versatile feature that brings big benefits to everyone working with data.

Picture of Sahil Babbar

Sahil Babbar

Leave a Comment

Your email address will not be published. Required fields are marked *

Suggested Article

Scroll to Top