NashTech Blog

Revelation of Use Cases of PySpark in Data Transformations


What is PySpark?

PySpark is essential for big data processing, offering a scalable, fast solution for distributed data analysis. It handles large datasets across multiple nodes and supports tasks like ETL, real-time streaming, and analytics, making it ideal for complex, high-volume data workloads.

What is PySpark used for?

Most data scientists and analysts are familiar with Python and use it to implement machine learning workflows. PySpark allows them to work with a familiar language on large-scale distributed datasets: you can write Python and SQL-like commands to manipulate and analyze data in a distributed processing environment.

We explore some use cases where PySpark transformation functions play a vital role in solving data processing challenges.

1. Union of Two DataFrames with Different Column Structures

To handle this scenario, we can use an additional parameter, ‘allowMissingColumns=True’.

  1. Firstly, import the libraries and necessary functions.
  2. Create the first DataFrame with columns (“Name”, “Age”, “City”, “Department”, “Sub-Department”).
  3. Display the resulting data in tabular form.
  4. Create the second DataFrame with columns (“Name”, “Age”, “City”, “Department”, “Sub-Department”, “Email”).
  5. Display the resulting data in tabular form.

If we perform a plain “unionByName” on these two DataFrames, it raises an exception because the schemas differ.

But if we add ‘allowMissingColumns=True’, the union succeeds.

When we perform the union with df1.unionByName(df, allowMissingColumns=True), the following happens:

  • Columns with the same name in both DataFrames are combined as expected.
  • Columns that exist in ‘df1’ but not in ‘df’ are still included in ‘df_union’ with their values from ‘df1’, and likewise the additional columns in ‘df’ are included with their values.
  • The resulting DataFrame, ‘df_union’, has all columns from both ‘df1’ and ‘df’; for rows coming from the DataFrame that lacks a column, the values are filled with null.

In essence, the difference between ‘df_union = df.unionByName(df1, allowMissingColumns=True)’ and ‘df_union = df1.unionByName(df, allowMissingColumns=True)’ lies in the order of the DataFrames. The DataFrame on which ‘unionByName’ is called determines the column order of the result, and any additional columns from the second DataFrame are appended at the end.

2. Typecasting a String Column Containing Dates in Various Formats to Date Type

  1. Firstly, import the libraries and necessary functions.
  2. Then create the Spark session to utilise its functionality.
  3. Prepare sample data containing dates in different formats.
  4. Create a DataFrame with the column “date_str”.
  5. Display the resulting data in tabular form.

Typecasting the date data with the to_date function.

  • From the “pyspark.sql.functions” library, import the “to_date” function.
  • The “to_date” function is designed to convert a column (“col”) to a date.
  • On the existing DataFrame, add a new column “New_Date” by typecasting the “date_str” column with “to_date” in the selected format, “yyyy-MM-dd”.
  • In the resulting output, only values in the specified format are parsed correctly; all other values become null.

Note: To overcome the above typecasting issue, we use “coalesce” together with “to_date”.

  • We create a “to_date_” function that converts a column (col) into a date using multiple date formats.
  • It tries each format (“yyyy-MM-dd”, “dd/MM/yyyy”, “dd-MMM-yyyy”).
  • “coalesce” selects the first non-null (valid) date conversion. If one format fails, it moves to the next until a match is found or all formats are exhausted.
  • This ensures flexibility by handling different date formats in the input data.
  • In the resulting DataFrame, all date formats are typecast successfully.

3. Finding the Start of the Week from a Date Column

  1. Firstly, import the libraries and necessary functions.
  2. Then create the Spark session to utilise its functionality.
  3. Prepare sample data with different dates.
  4. Create a DataFrame with the column “DateColumn”.
  • Use “date_trunc” to find the start of the week (Monday by default).
  • Convert the “timestamp” result to a date using “to_date”.
  • Apply “date_sub” to adjust for Sunday as the start of the week.
  • date_trunc(“WEEK”, col(“DateColumn”)): finds the start of the week (Monday) for “DateColumn”. The result is in “timestamp” format.
  • to_date(col(“StartOfWeek_whenMonday”)): converts the “timestamp” result to date type for easier handling.
  • date_sub(col(“StartOfWeek_whenMonday”), lit(1)): subtracts 1 day from the Monday start, giving Sunday as the start of the week.
  • df_with_start_of_week.show(): displays the DataFrame with both the “StartOfWeek_whenMonday” and “StartOfWeek_whenSunday” columns.
  • Similarly, if we want Tuesday as the start of the week, we can use date_add() instead of date_sub().

Conclusion

PySpark offers versatile solutions for data transformations. The allowMissingColumns=True parameter enables seamless union of DataFrames with differing structures. Custom functions effectively handle typecasting when date values vary in format, ensuring accurate conversions with the to_date function. Additionally, the date_trunc function simplifies finding the week’s start by specifying “WEEK” as the truncation level. These methods demonstrate PySpark’s adaptability and efficiency in managing complex data transformation tasks.


Anurag K
