Introduction
Apache Spark Streaming and Apache Kafka are powerful tools for processing and managing real-time data streams. By integrating Spark Streaming with Kafka, you can build robust and scalable streaming data pipelines. Here’s a summary of the steps to integrate Spark with Kafka.
What is Apache Spark Streaming?
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that supports both batch and streaming workloads. It extends the core Spark API to handle real-time data from sources such as Kafka, Flume, and Amazon Kinesis. The processed data can be pushed out to other systems such as databases, Kafka topics, and live dashboards.
What is Apache Kafka?
Kafka is a publish-subscribe messaging system designed for high throughput and fault tolerance. A Kafka cluster is highly scalable, handles large volumes of data efficiently, and typically delivers much higher throughput than other message brokers such as ActiveMQ and RabbitMQ. Kafka clusters are widely used for real-time data streaming applications.
Steps to Integrate Spark with Kafka
Run Kafka Producer Shell
Use the Kafka console producer to publish JSON data to a Kafka topic. First create the topic, then start a console producer against it:
bin/kafka-topics.sh --create --topic kafka-topic --bootstrap-server localhost:9092
bin/kafka-console-producer.sh --topic kafka-topic --bootstrap-server localhost:9092
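Once the producer shell is running, you can paste JSON records directly into it. Here is an illustrative sample record (the values are made up) whose fields match the employee schema used later in this post:
{"id":1,"firstname":"James","lastname":"Smith","dob_year":1990,"dob_month":1,"gender":"M","salary":3000}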
Spark Streaming With Kafka
To stream data from a Kafka topic, add the following Kafka client SBT dependency:
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.4.2"
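For context, a minimal build.sbt sketch might look like the following; the project name and Scala version here are assumptions, and the connector version should always match your Spark version:
name := "spark-kafka-streaming"  // hypothetical project name
scalaVersion := "2.12.18"        // assumed Scala binary version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.4.2" % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.4.2"
)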
Use readStream() on the SparkSession to load a streaming Dataset from Kafka, specifying the Kafka broker details, topic name, and starting offsets. Setting the startingOffsets option to earliest reads all data already available in Kafka when the query starts; the default value is latest, which reads only data arriving after the query starts.
val df = spark.readStream
  .format("kafka")                                          // use the Kafka source
  .option("kafka.bootstrap.servers", "192.168.1.100:9092")  // Kafka broker address
  .option("subscribe", "kafka-topic")                       // topic created above
  .option("startingOffsets", "earliest")                    // read from the beginning
  .load()
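The returned DataFrame has the fixed schema of the Kafka source (key, value, topic, partition, offset, timestamp, timestampType), which you can verify with printSchema():
df.printSchema()
// root
//  |-- key: binary (nullable = true)
//  |-- value: binary (nullable = true)
//  |-- topic: string (nullable = true)
//  |-- partition: integer (nullable = true)
//  |-- offset: long (nullable = true)
//  |-- timestamp: timestamp (nullable = true)
//  |-- timestampType: integer (nullable = true)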
Spark Streaming Write to Console
- Convert Binary Data: Since Kafka messages are in binary format, convert them to String using selectExpr() and CAST(value AS STRING).
- Define Schema: Define the schema for the JSON data to be processed.
- Parse JSON: Parse the JSON data using the defined schema.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Kafka message values arrive as binary; cast them to String
val employeeDf = df.selectExpr("CAST(value AS STRING)")

// Schema of the JSON employee records
val schema = new StructType()
  .add("id", IntegerType)
  .add("firstname", StringType)
  .add("lastname", StringType)
  .add("dob_year", IntegerType)
  .add("dob_month", IntegerType)
  .add("gender", StringType)
  .add("salary", IntegerType)

// Parse the JSON string into typed columns
val finalDf = employeeDf
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")
Write the processed data to the console for visualization or further processing.
finalDf.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
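The console sink is handy for debugging, but you can just as easily write the processed stream back out, for example to another Kafka topic. A minimal sketch, assuming a hypothetical output topic employee_out and a local checkpoint directory (the Kafka sink requires one):
import org.apache.spark.sql.functions.{col, struct, to_json}

finalDf
  .select(to_json(struct(col("*"))).as("value"))  // serialize all columns back to a JSON string
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.100:9092")
  .option("topic", "employee_out")                // hypothetical output topic
  .option("checkpointLocation", "/tmp/spark-kafka-checkpoint")
  .start()
  .awaitTermination()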
Conclusions
Integrating Spark Streaming with Kafka enables real-time data processing and analysis. It allows you to build end-to-end streaming pipelines for various use cases such as real-time analytics, monitoring, and machine learning.
By following these steps, you can efficiently process and analyze streaming data from Kafka using Apache Spark, leveraging the strengths of both platforms for scalable and fault-tolerant data processing.