PySpark is the Python API for Apache Spark, an open-source distributed computing system that enables large-scale data processing. Whether you’re dealing with structured or unstructured data, PySpark provides a powerful set of tools to manage and analyze datasets in a distributed environment. In this blog, we’ll explore some fundamental operations in PySpark, specifically filtering, sorting, and aggregating data.
Let’s begin by creating a DataFrame to apply all the basic operations to.
Code:
from pyspark.sql import SparkSession
# Initialize a Spark session
spark = SparkSession.builder.appName("BasicOperations").getOrCreate()
data = [("Alice", 34, "HR"),
("Bob", 45, "Engineering"),
("Cathy", 29, "Finance"),
("David", 40, "Engineering")]
columns = ["Name", "Age", "Department"]
df = spark.createDataFrame(data, schema=columns)
df.show()
1. Filtering Data
Filtering data is one of the most basic operations when dealing with large datasets. It allows you to extract specific subsets of data that match certain conditions.
In PySpark, you use the filter() or where() function to apply conditions. Let’s filter the dataset to get people who are older than 30:
Code:
# Filtering using the filter() method
df.filter(df.Age > 30).show()
# Alternatively, using where()
df.where(df.Age > 30).show()
Output:
+-----+---+-----------+
| Name|Age| Department|
+-----+---+-----------+
|Alice| 34| HR|
| Bob| 45|Engineering|
|David| 40|Engineering|
+-----+---+-----------+
You can also combine conditions using the logical operators & (AND), | (OR), and ~ (NOT). Note that each condition must be wrapped in parentheses, because these operators bind more tightly than comparisons in Python. For example, to filter people who are either older than 35 or in the Engineering department:
df.filter((df.Age > 35) | (df.Department == "Engineering")).show()
2. Sorting Data
Sorting data helps you organize and analyze it more effectively. PySpark provides the orderBy() function to sort DataFrames by one or more columns.
Let’s sort the DataFrame by the Age column in ascending order:
df.orderBy("Age").show()
To sort in descending order, use the desc() function:
from pyspark.sql.functions import desc
df.orderBy(desc("Age")).show()
You can also sort by multiple columns. For instance, sort by Department first and then by Age:
df.orderBy("Department", "Age").show()
3. Aggregating Data
Aggregation is useful when you need to calculate summary statistics like averages, sums, and counts. In PySpark, you group rows with the DataFrame method groupBy() and then apply aggregate functions from pyspark.sql.functions, such as sum(), count(), and avg(), either directly or through agg().
Counting the number of records per department
The groupBy() function is commonly used for aggregation. Let’s count how many people are in each department:
df.groupBy("Department").count().show()
The result will look like this (row order may vary, since groupBy() does not guarantee ordering):
+-----------+-----+
| Department|count|
+-----------+-----+
| HR| 1|
|Engineering| 2|
| Finance| 1|
+-----------+-----+
Calculating average age per department
Similarly, you can calculate the average age of employees in each department:
from pyspark.sql.functions import avg
df.groupBy("Department").agg(avg("Age")).show()
Output:
+-----------+--------+
| Department|avg(Age)|
+-----------+--------+
| HR| 34.0|
|Engineering| 42.5|
| Finance| 29.0|
+-----------+--------+
Combining multiple aggregations
You can combine multiple aggregation functions in a single operation using agg(). For instance, let’s calculate both the count and the average age per department:
from pyspark.sql.functions import avg, count
df.groupBy("Department").agg(count("Name"), avg("Age")).show()
Conclusion
In this blog, we covered some essential PySpark operations: filtering, sorting, and aggregating data. These operations form the backbone of data manipulation and analysis in PySpark. Whether you are filtering large datasets, sorting them for better visualization, or aggregating for insights, PySpark provides a flexible and scalable framework for big data processing.
PySpark is especially useful for working with massive datasets in a distributed environment, making it a go-to tool for data engineers and data scientists working with big data. Once you’re comfortable with the basics, you can dive into more advanced operations and optimizations that make PySpark even more powerful.