Working with dates and timestamps is a critical part of data processing, and PySpark's distributed nature makes getting it right all the more important: correct temporal handling underpins accurate aggregation, comparison, and transformation. In this blog, we'll explore how PySpark supports date and timestamp manipulation through its functions, types, and conversion methods.
1. Introduction to Date and Timestamp Data Types
PySpark provides two primary data types for working with time-related data:
- DateType: Used for representing dates (without time).
- TimestampType: Represents both date and time with precision to the microsecond.
When dealing with date or time data in PySpark, using these types (rather than plain strings) lets Spark apply its built-in date and time functions and optimize processing across the cluster.
Code:
from pyspark.sql import SparkSession
from pyspark.sql.types import DateType

spark = SparkSession.builder.getOrCreate()

# Example: cast a string column to DateType
df = spark.createDataFrame([(1, '2023-10-01')], ['id', 'date'])
df = df.withColumn("date", df["date"].cast(DateType()))
df.printSchema()
2. Date and Timestamp Functions in PySpark
PySpark’s pyspark.sql.functions module offers a comprehensive set of functions to manipulate date and time. Let’s take a look at some commonly used ones:
a. Current Date and Timestamp
- current_date(): Returns the current date.
- current_timestamp(): Returns the current timestamp.
Code:
from pyspark.sql.functions import current_date, current_timestamp
df = df.withColumn("current_date", current_date())
df = df.withColumn("current_timestamp", current_timestamp())
df.show()
b. Date and Time Arithmetic
Performing arithmetic operations on dates and timestamps is straightforward in PySpark.
- date_add(): Adds a specified number of days to a date.
- date_sub(): Subtracts days from a date.
Code:
from pyspark.sql.functions import date_add, date_sub
df = df.withColumn("next_week", date_add(df["date"], 7))
df = df.withColumn("last_week", date_sub(df["date"], 7))
df.show()
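If you want to sanity-check what date_add() and date_sub() compute for each row, the same arithmetic can be sketched in plain Python with the standard library's timedelta (an illustration of the semantics, not PySpark itself):

```python
from datetime import date, timedelta

d = date(2023, 10, 1)
next_week = d + timedelta(days=7)   # what date_add(col, 7) computes per row
last_week = d - timedelta(days=7)   # what date_sub(col, 7) computes per row
print(next_week, last_week)         # 2023-10-08 2023-09-24
```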
- datediff(): Calculates the difference between two dates in days.
- months_between(): Returns the difference between two dates in months (a fractional value when the days of the month don't align).
Code:
from pyspark.sql.functions import datediff, months_between
# Argument order is (end, start): how far current_date is ahead of date
df = df.withColumn("days_diff", datediff(df["current_date"], df["date"]))
df = df.withColumn("months_diff", months_between(df["current_date"], df["date"]))
df.show()
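datediff(end, start) returns end minus start in whole days; the equivalent check in plain Python (illustration only) is a simple date subtraction:

```python
from datetime import date

start = date(2023, 10, 1)
end = date(2023, 11, 15)
days_diff = (end - start).days   # mirrors datediff(end, start)
print(days_diff)                 # 45
```

months_between follows Spark's own fractional-month convention, so rely on the built-in rather than reimplementing it with naive division.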
c. Date Formatting
When you need to format dates into a specific string representation or parse a date string into a date, you can use:
- date_format(): Formats a date according to a specified format.
- to_date(): Converts a string to a DateType.
Code:
from pyspark.sql.functions import date_format, to_date
df = df.withColumn("formatted_date", date_format(df["date"], "yyyy/MM/dd"))
df = df.withColumn("parsed_date", to_date(df["formatted_date"], "yyyy/MM/dd"))
df.show()
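Note that Spark uses its own datetime pattern letters ("yyyy/MM/dd"), not Python's strftime codes. As a plain-Python analogue of the format/parse round trip above (a stand-in sketch, not PySpark):

```python
from datetime import datetime

# Spark pattern "yyyy/MM/dd" corresponds to strftime "%Y/%m/%d" in plain Python
formatted = datetime(2023, 10, 1).strftime("%Y/%m/%d")    # like date_format
parsed = datetime.strptime(formatted, "%Y/%m/%d").date()  # like to_date
print(formatted, parsed)  # 2023/10/01 2023-10-01
```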
d. Extracting Components from Dates and Timestamps
To extract components such as year, month, day, hour, etc., PySpark provides dedicated functions:
- year(), month(), dayofmonth(), hour(), minute(), and second().
Code:
from pyspark.sql.functions import year, month, dayofmonth, hour, minute, second
df = df.withColumn("year", year(df["current_timestamp"]))
df = df.withColumn("month", month(df["current_timestamp"]))
df = df.withColumn("day", dayofmonth(df["current_timestamp"]))
df = df.withColumn("hour", hour(df["current_timestamp"]))
df = df.withColumn("minute", minute(df["current_timestamp"]))
df = df.withColumn("second", second(df["current_timestamp"]))
df.show()
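These extractors correspond to the familiar attributes on a plain Python datetime, as this stand-alone sketch shows:

```python
from datetime import datetime

ts = datetime(2023, 10, 1, 14, 35, 50)
# year(), month(), dayofmonth(), hour(), minute(), second() pull these fields per row
print(ts.year, ts.month, ts.day, ts.hour, ts.minute, ts.second)
# 2023 10 1 14 35 50
```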
3. Handling Time Zones
PySpark's TimestampType stores values internally as microseconds since the Unix epoch in UTC; how they are displayed is governed by the session time zone (the spark.sql.session.timeZone setting). To shift wall-clock values between zones explicitly, use:
- from_utc_timestamp(): Converts UTC timestamp to a given time zone.
- to_utc_timestamp(): Converts a timestamp from a time zone to UTC.
Code:
from pyspark.sql.functions import from_utc_timestamp, to_utc_timestamp
df = df.withColumn("local_timestamp", from_utc_timestamp(df["current_timestamp"], "Asia/Kolkata"))
df = df.withColumn("utc_timestamp", to_utc_timestamp(df["local_timestamp"], "Asia/Kolkata"))
df.show()
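To see the offset arithmetic that from_utc_timestamp applies, here is the analogous shift in plain Python using zoneinfo (Python 3.9+); Asia/Kolkata is UTC+05:30. Note that PySpark's version takes and returns time-zone-naive timestamps, so this sketch only illustrates the shift itself:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

utc_ts = datetime(2023, 10, 1, 12, 0, tzinfo=timezone.utc)
local = utc_ts.astimezone(ZoneInfo("Asia/Kolkata"))  # shift like from_utc_timestamp
print(local)  # 2023-10-01 17:30:00+05:30
```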
4. Converting Between Date, Timestamp, and String Types
It’s common to convert between different formats while working with external systems or preparing data for output.
- to_date(): Converts a string to a DateType.
- to_timestamp(): Converts a string to a TimestampType.
- date_format(): Converts a date or timestamp into a formatted string.
Code:
from pyspark.sql.functions import to_date, to_timestamp, date_format
df = df.withColumn("date_from_string", to_date(df["formatted_date"], "yyyy/MM/dd"))
# The pattern must match the input string; unspecified time fields default to 00:00:00
df = df.withColumn("timestamp_from_string", to_timestamp(df["formatted_date"], "yyyy/MM/dd"))
df = df.withColumn("string_from_timestamp", date_format(df["timestamp_from_string"], "yyyy-MM-dd HH:mm:ss"))
df.show()
Conclusion
Handling dates and timestamps in PySpark can seem daunting, but with the right tools and techniques, it becomes a powerful part of your data processing pipeline. By leveraging PySpark’s built-in functions for date and time manipulation, you can efficiently manage temporal data in a distributed environment.