NashTech Blog

PySpark Basic Operations: Filtering, Sorting, and Aggregating Data


PySpark is the Python API for Apache Spark, an open-source distributed computing system that enables large-scale data processing. Whether you’re dealing with structured or unstructured data, PySpark provides a powerful set of tools to manage and analyze datasets in a distributed environment. In this blog, we’ll explore some fundamental operations in PySpark, specifically filtering, sorting, and aggregating data.

Let’s begin by creating a DataFrame to apply these basic operations to.

Code:

from pyspark.sql import SparkSession

# Initialize a Spark session
spark = SparkSession.builder.appName("BasicOperations").getOrCreate()

data = [("Alice", 34, "HR"),
        ("Bob", 45, "Engineering"),
        ("Cathy", 29, "Finance"),
        ("David", 40, "Engineering")]

columns = ["Name", "Age", "Department"]

df = spark.createDataFrame(data, schema=columns)
df.show()

1. Filtering Data

Filtering data is one of the most basic operations when dealing with large datasets. It allows you to extract specific subsets of data that match certain conditions.

In PySpark, you use the filter() or where() function to apply conditions; the two are aliases and behave identically. Let’s filter the dataset to get people who are older than 30:

Code:

# Filtering using the filter() method
df.filter(df.Age > 30).show()

# Alternatively, using where()
df.where(df.Age > 30).show()

Output:

+-----+---+-----------+
| Name|Age| Department|
+-----+---+-----------+
|Alice| 34|         HR|
|  Bob| 45|Engineering|
|David| 40|Engineering|
+-----+---+-----------+

You can also apply multiple conditions using logical operators such as & (AND) and | (OR). For example, to filter people who are either in the Engineering department or are older than 35:

df.filter((df.Age > 35) | (df.Department == "Engineering")).show()

2. Sorting Data

Sorting data helps you organize and analyze it more effectively. PySpark provides the orderBy() function to sort DataFrames by one or more columns.

Let’s sort the DataFrame by the Age column in ascending order:

df.orderBy("Age").show()

To sort in descending order, use the desc() function:

from pyspark.sql.functions import desc

df.orderBy(desc("Age")).show()

You can also sort by multiple columns. For instance, sort by Department first and then by Age:

df.orderBy("Department", "Age").show()

3. Aggregating Data

Aggregation is useful when you need to calculate summary statistics like averages, sums, counts, and more. PySpark supports aggregation through groupBy() and agg(), combined with functions such as sum(), count(), and avg().

Counting the number of records per department

The groupBy() function is commonly used for aggregation. Let’s count how many people are in each department:

df.groupBy("Department").count().show()

The result will look like this:

+-----------+-----+
| Department|count|
+-----------+-----+
|         HR|    1|
|Engineering|    2|
|    Finance|    1|
+-----------+-----+

Calculating average age per department

Similarly, you can calculate the average age of employees in each department:

from pyspark.sql.functions import avg

df.groupBy("Department").agg(avg("Age")).show()

Output:

+-----------+--------+
| Department|avg(Age)|
+-----------+--------+
|         HR|    34.0|
|Engineering|    42.5|
|    Finance|    29.0|
+-----------+--------+

Combining multiple aggregations

You can combine multiple aggregation functions in a single operation using agg(). For instance, let’s calculate both the count and the average age per department:

from pyspark.sql.functions import count, avg

df.groupBy("Department").agg(count("Name"), avg("Age")).show()

Conclusion

In this blog, we covered some essential PySpark operations: filtering, sorting, and aggregating data. These operations form the backbone of data manipulation and analysis in PySpark. Whether you are filtering large datasets, sorting them for better visualization, or aggregating for insights, PySpark provides a flexible and scalable framework for big data processing.

PySpark is especially useful for working with massive datasets in a distributed environment, making it a go-to tool for data engineers and data scientists working with big data. Once you’re comfortable with the basics, you can dive into more advanced operations and optimizations that make PySpark even more powerful.

Vipul Kumar