Key takeaways:
These days, there is a growing demand for data streaming, particularly in distributed systems. In this article, I would like to introduce how to build a log-based CDC (Change Data Capture) solution using Debezium, Kafka, and KSQL. This solution effectively synchronizes data changes from diverse data sources to multiple data sinks, achieving near real-time updates. Moreover, it enables seamless data transformation on the fly during the streaming process.
What is Log-Based CDC?
Log-based Change Data Capture (CDC) is a technique used to capture and replicate data changes at the database level by extracting the transaction log or write-ahead log (WAL) of a database management system (DBMS). It involves reading the database’s transaction log to identify and extract individual data modifications (inserts, updates, and deletes) made to the tables.
In a log-based CDC approach, the transaction log serves as a reliable source of truth for tracking changes to the data. Rather than relying on triggers or polling mechanisms, which can add overhead and impact performance, log-based CDC directly accesses the database’s log files to capture changes. This method provides a non-intrusive, low-impact way to extract real-time data changes without placing additional load on the database server.
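On PostgreSQL, for example, log-based CDC builds on logical decoding of the WAL, which must be enabled before a CDC tool can read row-level changes. A minimal sketch, assuming superuser access and that a server restart is acceptable:
-- Check whether logical decoding is enabled (Debezium requires wal_level = logical)
SHOW wal_level;
-- Enable it if needed; the change only takes effect after a PostgreSQL restart
ALTER SYSTEM SET wal_level = logical;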
Tooling
There are many tools on the market that can help achieve real-time data streaming; however, I chose three prominent tools (Debezium, Kafka, and KSQL) to build this solution based on their specific functionalities and advantages.
Apache Kafka
Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
Debezium
Debezium is an open-source tool for CDC that is built on top of Apache Kafka. It captures row-level changes by reading transaction logs, and it records events in the same order the changes were made to the database. These events are published to Apache Kafka topics.
KSQL
KSQL is an open-source (Apache 2.0 licensed) streaming SQL engine for Apache Kafka. It provides a simple, completely interactive SQL interface for stream processing on Kafka, with no need to write code in a programming language such as Java or Python. It is distributed, scalable, reliable, and real-time.
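As a quick illustration, here is a minimal KSQL sketch, assuming a Kafka topic named source1.public.users that carries flattened (unwrapped) user change events in JSON:
-- Declare a stream over an existing Kafka topic
CREATE STREAM users_stream (user_id INT, first_name VARCHAR, last_name VARCHAR)
  WITH (KAFKA_TOPIC='source1.public.users', VALUE_FORMAT='JSON');
-- Continuously query the stream; each new change event is emitted as a row
SELECT user_id, first_name, last_name FROM users_stream EMIT CHANGES;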
The combination of Debezium, Kafka, and KSQL provides a powerful ecosystem for real-time data streaming and processing. Debezium captures changes from databases, Kafka acts as a reliable and scalable messaging system for data streaming, and KSQL enables real-time data manipulation and analysis using SQL-like queries. Together, they form a robust foundation for building scalable, fault-tolerant, and real-time data streaming architectures.
Data Streaming Log-Based CDC Business Scenarios
I apply data streaming with log-based CDC to two business scenarios:
Business Scenario 1: Only CDC
The demand here is simply to synchronize data changes from the source databases to the destination database. In this business scenario, I modify two PostgreSQL source databases and expect the changes to be synchronized to one PostgreSQL sink database in near real-time.
Architectural Overview
Architectural Details
Architectural Explanation
In the first PostgreSQL source database, we have a Users table with the fields user_id, first_name, and last_name. In the second PostgreSQL source database, we have a Wages table with the fields user_id and wage.
Similarly, in the PostgreSQL sink database, we have the same Users and Wages tables with the same fields as in the PostgreSQL source databases.
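A minimal sketch of these schemas (the column types are assumptions; see the GIT link below for the full setup):
-- PostgreSQL source 1 (the sink database has an identical table)
CREATE TABLE users (
  user_id    INT PRIMARY KEY,
  first_name VARCHAR(255),
  last_name  VARCHAR(255)
);
-- PostgreSQL source 2 (the sink database has an identical table)
CREATE TABLE wages (
  user_id INT PRIMARY KEY,
  wage    INT
);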
We expect that when rows in the Users and Wages tables are inserted, updated, or deleted in the PostgreSQL source databases, the changes are captured and synchronized to the Users and Wages tables in the PostgreSQL sink database in near real-time.
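One way to wire this up is to register the Kafka Connect connectors through KSQL instead of calling the Kafka Connect REST API directly. The sketch below registers a Debezium source connector for the first source database and a JDBC sink connector for the resulting users topic; hostnames, credentials, database names, and topic names are assumptions, and Debezium's ExtractNewRecordState transform is assumed so that the sink receives flat row images. The second source database and the wages topic are handled in the same way.
-- Debezium source connector for PostgreSQL source 1 (assumed host and credentials)
CREATE SOURCE CONNECTOR users_source WITH (
  'connector.class'        = 'io.debezium.connector.postgresql.PostgresConnector',
  'database.hostname'      = 'postgres-source-1',
  'database.port'          = '5432',
  'database.user'          = 'postgres',
  'database.password'      = 'postgres',
  'database.dbname'        = 'sourcedb1',
  -- database.server.name is the topic prefix (renamed topic.prefix in Debezium 2.x)
  'database.server.name'   = 'source1',
  'table.include.list'     = 'public.users',
  'transforms'             = 'unwrap',
  'transforms.unwrap.type' = 'io.debezium.transforms.ExtractNewRecordState',
  'transforms.unwrap.drop.tombstones' = 'false'
);
-- JDBC sink connector writing the users topic into the PostgreSQL sink database
CREATE SINK CONNECTOR users_sink WITH (
  'connector.class'     = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'      = 'jdbc:postgresql://postgres-sink:5432/sinkdb',
  'connection.user'     = 'postgres',
  'connection.password' = 'postgres',
  'topics'              = 'source1.public.users',
  'table.name.format'   = 'users',
  'insert.mode'         = 'upsert',
  'pk.mode'             = 'record_key',
  'pk.fields'           = 'user_id',
  'delete.enabled'      = 'true'
);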
For technical details, you can refer to the GIT link below. Here, I only summarize the data input and output of this solution:
Data Input
First, we insert, update, and delete on the PostgreSQL source 1 database. For example:
INSERT INTO users VALUES(1, 'first 1', 'last 1');
UPDATE users SET first_name = 'first 1 updated' WHERE user_id = 1;
DELETE FROM users WHERE user_id = 1;
Second, we insert, update, and delete on the PostgreSQL source 2 database. For example:
INSERT INTO wages VALUES(1, '1000');
UPDATE wages SET wage = 1000 + 1 WHERE user_id = 1;
DELETE FROM wages WHERE user_id = 1;
Data Output
After each insert, update, and delete action above, we check the data in the PostgreSQL sink database:
SELECT * FROM users;
SELECT * FROM wages;
The changes in the PostgreSQL source 1 and 2 databases should be synchronized to the PostgreSQL sink database in near real-time (as shown in the screening result below).
Screening Result
GIT
https://github.com/nashtech-garage/logbasedcdc-CDC-only
Business Scenario 2: CDC + Data Transformation
Occasionally, the demand involves not just synchronizing data changes but also performing on-the-fly data transformations (such as merging two data sources to create a new one) before loading them into the destination database. Therefore, in this business scenario, I modify two PostgreSQL source databases and expect the changes to be synchronized and transformed in near real-time into a single PostgreSQL sink database.
Architectural Overview
Architectural Details
Architectural Explanation
In the PostgreSQL source databases, we have the same Users and Wages tables as in business scenario 1.
However, in the PostgreSQL sink database, we have a new User_Wages table with the fields user_id, full_name, and wage.
We expect that when rows in the Users and Wages tables are inserted, updated, or deleted in the PostgreSQL source databases, the changes are captured, transformed, and synchronized to the User_Wages table in the PostgreSQL sink database in near real-time.
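A minimal sketch of this sink table (the column types are assumptions; see the GIT link below for the full setup):
-- PostgreSQL sink database: merged table produced by the transformation
CREATE TABLE user_wages (
  user_id   INT PRIMARY KEY,
  full_name VARCHAR(511),
  wage      INT
);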
This data transformation includes the following rules (a KSQL sketch of them follows the list):
- User_Wages.user_id = Users.user_id
- User_Wages.full_name = Users.first_name + ' ' + Users.last_name
- User_Wages.wage = Wages.wage
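A minimal KSQL sketch of these rules, assuming the Debezium topics are named source1.public.users and source2.public.wages and carry flattened (unwrapped) change events in JSON:
-- Model each CDC topic as a table keyed by user_id
CREATE TABLE users_t (user_id INT PRIMARY KEY, first_name VARCHAR, last_name VARCHAR)
  WITH (KAFKA_TOPIC='source1.public.users', VALUE_FORMAT='JSON');
CREATE TABLE wages_t (user_id INT PRIMARY KEY, wage INT)
  WITH (KAFKA_TOPIC='source2.public.wages', VALUE_FORMAT='JSON');
-- Join the two tables and apply the transformation rules listed above;
-- the result lands in a user_wages topic that a JDBC sink connector can
-- load into the User_Wages table of the sink database
CREATE TABLE user_wages_t WITH (KAFKA_TOPIC='user_wages') AS
  SELECT u.user_id                        AS user_id,
         u.first_name + ' ' + u.last_name AS full_name,
         w.wage                           AS wage
  FROM users_t u
  JOIN wages_t w ON u.user_id = w.user_id
  EMIT CHANGES;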
Similarly, for technical details, you can refer to the GIT link below. Here, I only summarize the data input and output of this solution:
Data Input
First, we insert, update, and delete on the PostgreSQL source 1 database. For example:
INSERT INTO users VALUES(2, 'first 2', 'last 2');
UPDATE users SET first_name = 'first 2 updated' WHERE user_id = 2;
DELETE FROM users WHERE user_id = 2;
Second, we insert, update, and delete on the PostgreSQL source 2 database. For example:
INSERT INTO wages VALUES(2, '2000');
UPDATE wages SET wage = 2000 + 1 WHERE user_id = 2;
DELETE FROM wages WHERE user_id = 2;
Data Output
After each insert, update, and delete action above, we check the data in the PostgreSQL sink database:
SELECT * FROM user_wages;
The changes in the PostgreSQL source 1 and 2 databases should be transformed and synchronized to the PostgreSQL sink database in near real-time (as shown in the screening result below).
Screening Result
GIT
https://github.com/nashtech-garage/logbasedcdc-CDC-and-Transformation
Conclusion
Log-based change data capture is the modern way to turn databases into streaming data sources. The first scenario meets the need for data streaming by synchronizing data changes from different data sources (such as PostgreSQL and MySQL) to multiple data sinks in near real-time. The second scenario not only fulfills the same requirements as the first but also enables on-the-fly data transformation while streaming the data.
References
https://www.striim.com/blog/log-based-change-data-capture/
https://kafka.apache.org
https://debezium.io
https://docs.ksqldb.io
Nashtech Accelerator
Additionally, there are more data solutions that may interest you:
https://accelerator.nashtechglobal.com/data-solutions