Apache Spark RDD – Resilient Distributed Dataset

Ayush Kumar Tiwari

What is Apache Spark RDD?

Apache Spark RDD (Resilient Distributed Dataset) is a fundamental data structure in Apache Spark, an open-source distributed computing framework. It is an immutable collection of objects stored in memory or on disk across the different nodes of a cluster.
Spark divides an RDD into multiple logical partitions so that it can be processed on multiple nodes of the cluster.

R – Resilient: Resilient means fault-tolerant. An RDD keeps track of the lineage of transformations applied to it, so that if a node fails, lost partitions can be recovered by recomputing them from that lineage.

D – Distributed: Distributed means the data in an RDD resides on different nodes of the cluster.

D – Dataset: The dataset is the collection of data objects that is partitioned and distributed across multiple nodes of the cluster.

Features of Spark RDD

Here are several key features of Spark RDDs:

1- In-memory Computation
Spark RDDs store data in memory (RAM) instead of on disk. Data stored on disk takes more time to load and process, so the in-memory computation feature of Spark speeds up processing considerably.
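
As a minimal PySpark sketch of this idea (the application name and the data below are purely illustrative), an RDD can be explicitly persisted in memory so that repeated actions reuse the cached partitions:

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-caching-demo").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1000000))

    # Keep the computed partitions in memory; spill to disk only if they
    # do not fit in RAM.
    rdd.persist(StorageLevel.MEMORY_AND_DISK)

    print(rdd.count())  # first action computes and caches the RDD
    print(rdd.count())  # later actions reuse the in-memory partitions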

2- Fault Tolerance
A Spark RDD keeps track of its data lineage, so in case of a failure it can recreate lost data from that lineage information. Each RDD knows the chain of transformations that were applied to create it.
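
A quick way to see this lineage is toDebugString(); the sketch below reuses the SparkContext sc from the caching example above:

    rdd_base = sc.parallelize([1, 2, 3, 4, 5])
    rdd_derived = rdd_base.map(lambda x: x * 2).filter(lambda x: x > 4)

    # toDebugString() shows the chain of parent RDDs; Spark uses this
    # lineage graph to recompute lost partitions after a failure.
    print(rdd_derived.toDebugString().decode("utf-8"))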

3- Lazy Evaluation
All transformations on a Spark RDD are lazy in nature. Lazy evaluation delays the computation of results until they are actually needed: results are not computed immediately but only when an action is triggered.
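
A small sketch of lazy evaluation (again reusing sc): the map() call only records the transformation, and nothing runs until an action such as collect() is invoked:

    rdd = sc.parallelize([1, 2, 3, 4])

    # No computation happens here -- map() only adds a step to the lineage.
    doubled = rdd.map(lambda x: x * 2)

    # The action triggers the actual computation.
    print(doubled.collect())  # [2, 4, 6, 8]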

4- Immutability
Immutability is an essential characteristic of RDDs: once created, an RDD cannot be altered. To "modify" an RDD, you apply operations to the existing RDD and obtain a new one, so the original RDD remains available whenever you need it.
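
For example (reusing sc), applying a transformation produces a new RDD and leaves the original untouched:

    rdd_a = sc.parallelize([1, 2, 3])
    rdd_b = rdd_a.map(lambda x: x * 10)  # a new RDD; rdd_a is not modified

    print(rdd_a.collect())  # [1, 2, 3]
    print(rdd_b.collect())  # [10, 20, 30]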

5- Partitioning
The partition is the fundamental unit of parallelism in a Spark RDD. RDDs often hold large volumes of data, so to enable distributed computing, Spark partitions this data and distributes it across multiple nodes of the cluster.
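
A minimal sketch of partitioning (reusing sc; the exact split of elements per partition may differ between Spark versions):

    # Ask for an explicit number of partitions when creating the RDD.
    rdd = sc.parallelize(range(10), numSlices=4)

    print(rdd.getNumPartitions())  # 4
    print(rdd.glom().collect())    # elements grouped per partition,
                                   # e.g. [[0, 1], [2, 3, 4], [5, 6], [7, 8, 9]]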

Creation of RDD

In Apache Spark, you can create an RDD (Resilient Distributed Dataset) using different methods depending on the data source and the specific requirements of your application. Here are some common ways to create an RDD in Spark:

  • Parallelizing an existing collection
    You can create an RDD by parallelizing an existing collection in your driver program. This is done using the parallelize() method.

    # Distribute a local Python collection across the cluster as an RDD
    testData = [5, 6, 9, 8, 7]
    rddOne = spark.sparkContext.parallelize(testData)

  • Loading data from a file
    Spark can create an RDD by reading data from various file systems, such as the Hadoop Distributed File System (HDFS), local file systems, or cloud storage systems. You can use the textFile() method to read text files and create an RDD. Here’s an example:

    # Each line of the text file becomes one element of the RDD
    rdd = spark.sparkContext.textFile("src/main/resources/testFile.txt")

  • Transforming an existing RDD
    You can create an RDD by applying transformations to an existing RDD. Transformations are operations that produce a new RDD from an existing RDD. Here’s an example:

    rddOne = spark.sparkContext.parallelize([5, 6, 8, 9])
    rddTwo = rddOne.map(lambda x: x + 10)  # new RDD with 10 added to each element

Operations on RDDs

RDDs support two fundamental operations: transformations and actions.

Transformations

Transformations are operations carried out on an RDD to generate a new RDD. These functions take existing RDDs as input and produce one or more new RDDs as output. Some of the transformations we can apply on an RDD are:

  • filter(), union(), map(), flatMap(), distinct(), reduceByKey(), mapPartitions(), sortBy() (flatMap() and reduceByKey() are sketched just after the table below)
  • Examples – First, create an RDD with the parallelize() method.
    val rddOne = sc.parallelize(List(99,54,24,85,99,36,62))
Transformations                       | Result of Transformations
--------------------------------------|-----------------------------------------------------------
rddOne.distinct.collect               | res0: Array[Int] = Array(24, 99, 36, 85, 54, 62)
rddOne.filter(x => x < 40).collect    | res1: Array[Int] = Array(24, 36)
rddOne.sortBy(x => x, true).collect   | res2: Array[Int] = Array(24, 36, 54, 62, 85, 99, 99)
rddOne.sortBy(x => x, false).collect  | res3: Array[Int] = Array(99, 99, 85, 62, 54, 36, 24)
rddOne.map(x => x * 2).collect        | res4: Array[Int] = Array(198, 108, 48, 170, 198, 72, 124)
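
flatMap() and reduceByKey() from the list above are not shown in the table; here is a minimal PySpark sketch of both (reusing the SparkContext sc from the earlier sketches; the sample sentences are made up for illustration):

    lines = sc.parallelize(["spark makes rdds", "rdds are resilient"])

    # flatMap() splits every line into words and flattens the result
    # into a single RDD of words.
    words = lines.flatMap(lambda line: line.split(" "))

    # reduceByKey() sums the per-word counts to build a word count.
    word_counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

    print(word_counts.collect())  # e.g. [('rdds', 2), ('spark', 1), ...]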

Actions

Actions return results to the driver program or write them to a storage system, and in doing so trigger the execution of the computation. When an action is called, Spark evaluates the lineage graph, applies all pending transformations, and delivers the final result back to the Spark driver. Some of the actions we can apply on an RDD are:

  • count(), first(), collect(), take(), countByKey(), collectAsMap(), and reduce() (reduce() is sketched just after the table below).
  • Examples – Let’s use the same RDD we created above.
Actions          | Results of the actions
-----------------|------------------------------------------------------
rddOne.count     | res5: Long = 7
rddOne.collect   | res6: Array[Int] = Array(99, 54, 24, 85, 99, 36, 62)
rddOne.first     | res7: Int = 99
rddOne.take(2)   | res8: Array[Int] = Array(99, 54)
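
reduce() from the list above is not shown in the table; a minimal PySpark sketch (reusing sc, with the same numbers as the Scala example):

    rdd = sc.parallelize([99, 54, 24, 85, 99, 36, 62])

    # reduce() aggregates all elements with the given function and
    # returns a single value to the driver.
    total = rdd.reduce(lambda a, b: a + b)
    print(total)  # 459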

Conclusion

In conclusion, Apache Spark RDDs (Resilient Distributed Datasets) serve as a powerful foundation for distributed data processing in the Spark ecosystem. RDDs offer a resilient, fault-tolerant, and distributed data abstraction, enabling efficient data manipulation and analysis. Throughout this blog, we explored the key features and operations of RDDs, including transformations and actions.
