Key takeaways:
These days, there is a growing demand for data streaming, particularly in distributed systems. In this article, I would like to introduce how to build a log-based CDC (Change Data Capture) solution using Debezium, Kafka, and KSQL. This solution effectively synchronizes data changes from diverse data sources to multiple data sinks, achieving near real-time updates. Moreover, it enables seamless data transformation on the fly during the streaming process.
What is Log-Based CDC?
Log-based Change Data Capture (CDC) is a technique used to capture and replicate data changes at the database level by extracting the transaction log or write-ahead log (WAL) of a database management system (DBMS). It involves reading the database’s transaction log to identify and extract individual data modifications (inserts, updates, and deletes) made to the tables.
In a log-based CDC approach, the transaction log serves as a reliable source of truth for tracking changes to the data. Rather than relying on triggers or polling mechanisms, which can add overhead and impact performance, log-based CDC directly accesses the database’s log files to capture changes. This method provides a non-intrusive, low-impact way to extract real-time data changes without placing additional load on the database server.
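On PostgreSQL, for example, log-based CDC builds on logical decoding of the WAL, which must be enabled before a CDC tool can read row-level changes. A minimal sketch, assuming superuser access and that a server restart is acceptable:
-- Check whether logical decoding is enabled (Debezium requires wal_level = logical)
SHOW wal_level;
-- Enable it if needed; the change only takes effect after a PostgreSQL restart
ALTER SYSTEM SET wal_level = logical;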
Tooling
There are many tools on the market that can help achieve real-time data streaming; however, I chose three prominent tools (Debezium, Kafka, and KSQL) to build this solution based on their specific functionalities and advantages.
Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Debezium
Debezium is an open-source tool for CDC that is built on top of Apache Kafka. It captures row-level changes by reading transaction logs, and it records events in the same order the changes were made to the database. These events are published to Apache Kafka topics.
KSQL
KSQL is an open-source (Apache 2.0 licensed) streaming SQL engine for Apache Kafka. It provides a simple, completely interactive SQL interface for stream processing on Kafka, with no need to write code in a programming language such as Java or Python. It is distributed, scalable, reliable, and real-time.
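As a quick illustration, here is a minimal KSQL sketch, assuming a Kafka topic named source1.public.users that carries flattened (unwrapped) user change events in JSON:
-- Declare a stream over an existing Kafka topic
CREATE STREAM users_stream (user_id INT, first_name VARCHAR, last_name VARCHAR)
  WITH (KAFKA_TOPIC='source1.public.users', VALUE_FORMAT='JSON');
-- Continuously query the stream; each new change event is emitted as a row
SELECT user_id, first_name, last_name FROM users_stream EMIT CHANGES;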
The combination of Debezium, Kafka, and KSQL provides a powerful ecosystem for real-time data streaming and processing. Debezium captures changes from databases, Kafka acts as a reliable and scalable messaging system for data streaming, and KSQL enables real-time data manipulation and analysis using SQL-like queries. Together, they form a robust foundation for building scalable, fault-tolerant, and real-time data streaming architectures.
Data Streaming Log-Based CDC Business Scenarios
I apply data streaming with log-based CDC to two business scenarios:
Business Scenario 1: Only CDC
The demand here is simply to synchronize data changes from the source databases to the destination database. In this business scenario, I modify two PostgreSQL source databases and expect the changes to be synchronized to one PostgreSQL sink database in near real-time.
Architectural Overview
Architectural Details
Architectural Explanation
In the first PostgreSQL source database, we have a Users table with the fields user_id, first_name, and last_name. In the second PostgreSQL source database, we have a Wages table with the fields user_id and wage.
Similarly, in the PostgreSQL sink database, we have the same Users and Wages tables with the same fields as in the PostgreSQL source databases.
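A minimal sketch of these schemas (the column types are assumptions; see the GIT link below for the full setup):
-- PostgreSQL source 1 (the sink database has an identical table)
CREATE TABLE users (
  user_id    INT PRIMARY KEY,
  first_name VARCHAR(255),
  last_name  VARCHAR(255)
);
-- PostgreSQL source 2 (the sink database has an identical table)
CREATE TABLE wages (
  user_id INT PRIMARY KEY,
  wage    INT
);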
We expect that when rows in the Users and Wages tables are inserted, updated, or deleted in the PostgreSQL source databases, the changes are captured and synchronized to the Users and Wages tables in the PostgreSQL sink database in near real-time.
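One way to wire this up is to register the Kafka Connect connectors through KSQL instead of calling the Kafka Connect REST API directly. The sketch below registers a Debezium source connector for the first source database and a JDBC sink connector for the resulting users topic; hostnames, credentials, database names, and topic names are assumptions, and Debezium's ExtractNewRecordState transform is assumed so that the sink receives flat row images. The second source database and the wages topic are handled in the same way.
-- Debezium source connector for PostgreSQL source 1 (assumed host and credentials)
CREATE SOURCE CONNECTOR users_source WITH (
  'connector.class'        = 'io.debezium.connector.postgresql.PostgresConnector',
  'database.hostname'      = 'postgres-source-1',
  'database.port'          = '5432',
  'database.user'          = 'postgres',
  'database.password'      = 'postgres',
  'database.dbname'        = 'sourcedb1',
  -- database.server.name is the topic prefix (renamed topic.prefix in Debezium 2.x)
  'database.server.name'   = 'source1',
  'table.include.list'     = 'public.users',
  'transforms'             = 'unwrap',
  'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
  'transforms.unwrap.drop.tombstones' = 'false'
);
-- JDBC sink connector writing the users topic into the PostgreSQL sink database
CREATE SINK CONNECTOR users_sink WITH (
  'connector.class'     = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'      = 'jdbc:postgresql://postgres-sink:5432/sinkdb',
  'connection.user'     = 'postgres',
  'connection.password' = 'postgres',
  'topics'              = 'source1.public.users',
  'table.name.format'   = 'users',
  'insert.mode'         = 'upsert',
  'pk.mode'             = 'record_key',
  'pk.fields'           = 'user_id',
  'delete.enabled'      = 'true'
);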
For technical details, you can refer to the GIT link below. Here, I only summarize the data input and output of this solution:
Data Input
First, we insert, update, and delete on the PostgreSQL source 1 database. For example:
INSERT INTO users VALUES(1, 'first 1', 'last 1');
UPDATE users SET first_name = 'first 1 updated' WHERE user_id = 1;
DELETE FROM users WHERE user_id = 1;
Second, we insert, update, and delete on the PostgreSQL source 2 database. For example:
INSERT INTO wages VALUES(1, '1000');
UPDATE wages SET wage = 1000 + 1 WHERE user_id = 1;
DELETE FROM wages WHERE user_id = 1;
Data Output
After each insert, update, and delete action above, we check the data in the PostgreSQL sink database:
SELECT * FROM users;
SELECT * FROM wages;
The changes in the PostgreSQL source 1 and 2 databases should be synchronized to the PostgreSQL sink database in near real-time (as shown in the screening result below).
Screening Result
GIT
https://github.com/nashtech-garage/logbasedcdc-CDC-only
Business Scenario 2: CDC + Data Transformation
Occasionally, the demand involves not just synchronizing data changes but also performing on-the-fly data transformations (such as merging two data sources to create a new one) before loading them into the destination database. Therefore, in this business scenario, I modify two PostgreSQL source databases and expect the changes to be synchronized and transformed in near real-time into a single PostgreSQL sink database.
Architectural Overview
Architectural Details
Architectural Explanation
In the PostgreSQL source databases, we have the same Users and Wages tables as in business scenario 1.
However, in the PostgreSQL sink database, we have a new User_Wages table with the fields user_id, full_name, and wage.
We expect that when rows in the Users and Wages tables are inserted, updated, or deleted in the PostgreSQL source databases, the changes are captured, transformed, and synchronized to the User_Wages table in the PostgreSQL sink database in near real-time.
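A minimal sketch of this sink table (the column types are assumptions; see the GIT link below for the full setup):
-- PostgreSQL sink database: merged table produced by the transformation
CREATE TABLE user_wages (
  user_id   INT PRIMARY KEY,
  full_name VARCHAR(511),
  wage      INT
);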
This data transformation includes the following rules (a KSQL sketch of them follows the list):
- User_Wages.user_id = Users.user_id
- User_Wages.full_name = Users.first_name + ' ' + Users.last_name
- User_Wages.wage = Wages.wage
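A minimal KSQL sketch of these rules, assuming the Debezium topics are named source1.public.users and source2.public.wages and carry flattened (unwrapped) change events in JSON:
-- Model each CDC topic as a table keyed by user_id
CREATE TABLE users_t (user_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR)
  WITH (KAFKA_TOPIC='source1.public.users', VALUE_FORMAT='JSON');
CREATE TABLE wages_t (user_id INT PRIMARY KEY, wage INT)
  WITH (KAFKA_TOPIC='source2.public.wages', VALUE_FORMAT='JSON');
-- Join the two tables and apply the transformation rules listed above;
-- the result lands in a user_wages topic that a JDBC sink connector can
-- load into the User_Wages table of the sink database
CREATE TABLE user_wages_t WITH (KAFKA_TOPIC='user_wages') AS
  SELECT u.user_id                        AS user_id,
         u.first_name + ' ' + u.last_name AS full_name,
         w.wage                           AS wage
  FROM users_t u
  JOIN wages_t w ON u.user_id = w.user_id
  EMIT CHANGES;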
Similarly, for technical details, you can refer to the GIT link below. Here, I only summarize the data input and output of this solution:
Data Input
First, we insert, update, and delete on the PostgreSQL source 1 database. For example:
INSERT INTO users VALUES(2, 'first 2', 'last 2');
UPDATE users SET first_name = 'first 2 updated' WHERE user_id = 2;
DELETE FROM users WHERE user_id = 2;
Second, we insert, update, and delete on the PostgreSQL source 2 database. For example:
INSERT INTO wages VALUES(2, '2000');
UPDATE wages SET wage = 2000 + 1 WHERE user_id = 2;
DELETE FROM wages WHERE user_id = 2;
Data Output
After each insert, update, and delete action above, we check the data in the PostgreSQL sink database:
SELECT * FROM user_wages;
The changes in the PostgreSQL source 1 and 2 databases should be transformed and synchronized to the PostgreSQL sink database in near real-time (as shown in the screening result below).
Screening Result
GIT
https://github.com/nashtech-garage/logbasedcdc-CDC-and-Transformation
Conclusion
Log-based change data capture is the modern way to turn databases into streaming data sources. The first scenario meets the need for data streaming by synchronizing data changes from different data sources (such as PostgreSQL and MySQL) to multiple data sinks in near real-time. The second scenario not only fulfills the same requirements as the first but also enables on-the-fly data transformation while streaming the data.
References
https://www.striim.com/blog/log-based-change-data-capture/
https://kafka.apache.org
https://debezium.io
https://docs.ksqldb.io
Nashtech Accelerator
Additionally, there are more data solutions that may interest you:
https://accelerator.nashtechglobal.com/data-solutions