Introduction
Apache Spark Streaming and Apache Kafka are powerful tools for processing and managing real-time data streams. By integrating Spark Streaming with Kafka, you can build robust and scalable streaming data pipelines. Here’s a summary of the steps to integrate Spark with Kafka.
What is Apache Spark Streaming?
Apache Spark Streaming is a scalable, fault-tolerant stream processing system that supports both batch and streaming workloads. It extends the core Spark API to handle real-time data from sources such as Kafka, Flume, and Amazon Kinesis. The processed data can be pushed out to other systems such as databases, Kafka topics, and live dashboards.
What is Apache Kafka?
Kafka is a publish-subscribe messaging system designed for high throughput and fault tolerance. A Kafka cluster is highly scalable, handles large volumes of data efficiently, and typically delivers much higher throughput than other message brokers such as ActiveMQ and RabbitMQ. Kafka clusters are widely used for real-time data streaming applications.
Steps to Integrate Spark with Kafka
Run Kafka Producer Shell
Use the Kafka console producer to publish JSON data to a Kafka topic. First create the topic, then start a console producer against it:
bin/kafka-topics.sh --create --topic kafka-topic --bootstrap-server localhost:9092
bin/kafka-console-producer.sh --topic kafka-topic --bootstrap-server localhost:9092
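Once the producer shell is running, you can paste JSON records directly into it. Here is an illustrative sample record (the values are made up) whose fields match the employee schema used later in this post:
{"id":1,"firstname":"James","lastname":"Smith","dob_year":1990,"dob_month":1,"gender":"M","salary":3000}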
Spark Streaming With Kafka
To stream data from a Kafka topic, add the following Kafka client SBT dependency:
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.4.2"
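For context, a minimal build.sbt sketch might look like the following; the project name and Scala version here are assumptions, and the connector version should always match your Spark version:
name := "spark-kafka-streaming"  // hypothetical project name
scalaVersion := "2.12.18"        // assumed Scala binary version
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "3.4.2" % Provided,
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.4.2"
)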
Use readStream() on the SparkSession to load a streaming Dataset from Kafka, specifying the Kafka broker details, topic name, and starting offsets. Setting the startingOffsets option to earliest reads all data already available in Kafka when the query starts; the default value is latest, which reads only data arriving after the query starts.
val df = spark.readStream
  .format("kafka")                                          // use the Kafka source
  .option("kafka.bootstrap.servers", "192.168.1.100:9092")  // Kafka broker address
  .option("subscribe", "kafka-topic")                       // topic created above
  .option("startingOffsets", "earliest")                    // read from the beginning
  .load()
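The returned DataFrame has the fixed schema of the Kafka source (key, value, topic, partition, offset, timestamp, timestampType), which you can verify with printSchema():
df.printSchema()
// root
//  |-- key: binary (nullable = true)
//  |-- value: binary (nullable = true)
//  |-- topic: string (nullable = true)
//  |-- partition: integer (nullable = true)
//  |-- offset: long (nullable = true)
//  |-- timestamp: timestamp (nullable = true)
//  |-- timestampType: integer (nullable = true)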
Spark Streaming Write to Console
- Convert Binary Data: Since Kafka messages are in binary format, convert them to String using selectExpr() and CAST(value AS STRING).
- Define Schema: Define the schema for the JSON data to be processed.
- Parse JSON: Parse the JSON data using the defined schema.
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}

// Kafka message values arrive as binary; cast them to String
val employeeDf = df.selectExpr("CAST(value AS STRING)")

// Schema of the JSON employee records
val schema = new StructType()
  .add("id", IntegerType)
  .add("firstname", StringType)
  .add("lastname", StringType)
  .add("dob_year", IntegerType)
  .add("dob_month", IntegerType)
  .add("gender", StringType)
  .add("salary", IntegerType)

// Parse the JSON string into typed columns
val finalDf = employeeDf
  .select(from_json(col("value"), schema).as("data"))
  .select("data.*")
Write the processed data to the console for visualization or further processing.
finalDf.writeStream
.format("console")
.outputMode("append")
.start()
.awaitTermination()
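The console sink is handy for debugging, but you can just as easily write the processed stream back out, for example to another Kafka topic. A minimal sketch, assuming a hypothetical output topic employee_out and a local checkpoint directory (the Kafka sink requires one):
import org.apache.spark.sql.functions.{col, struct, to_json}

finalDf
  .select(to_json(struct(col("*"))).as("value"))  // serialize all columns back to a JSON string
  .writeStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "192.168.1.100:9092")
  .option("topic", "employee_out")                // hypothetical output topic
  .option("checkpointLocation", "/tmp/spark-kafka-checkpoint")
  .start()
  .awaitTermination()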
Conclusions
Integrating Spark Streaming with Kafka enables real-time data processing and analysis. It allows you to build end-to-end streaming pipelines for various use cases such as real-time analytics, monitoring, and machine learning.
By following these steps, you can efficiently process and analyze streaming data from Kafka using Apache Spark, leveraging the strengths of both platforms for scalable and fault-tolerant data processing.