Basic Data Manipulation in PySpark

Most Data Scientists will tell you that Data Cleaning forms the bulk of their work. For me, someone who is used to data cleaning with for-loops when I’m feeling lazy or or np.where(), my first steps in learning to perform Data Cleaning in PySpark posed some challenges. In this blog post I’m going to discuss some learning obstacles I was able to overcome.

Converting DataTypes

Sometimes does not infer the correct datatype and you have to convert them. For this you would import the data types and make use of the .cast() function to convert columns to their correct format.

from pyspark.sql.functions import *
from pyspark.sql.types import *
df = videos.withColumn('views',videos['views'].cast(IntegerType()))\ .withColumn('likes',videos['likes'].cast(IntegerType()))\

Dealing with Dates

Often, PySpark can misread dates as strings. The built-in to_date() function can be handy, but if you’re not careful, it could turn a single digit month or day into an incorrect value. One way to avoid this if your date column is in a manageable order (‘yyyy-mm-dd’) is to concatenate a string split on a hyphen or back-slash, with an additional 0. PySpark’s to_date function will deal with extraneous zeroes better than absent zeroes.

from pyspark.sql.functions import *df = df.withColumn('dob',  to_date(concat_ws('-0',split('dob', '-')))).withColumn('visit1', to_date(concat_ws('-0',split('visit1', '-')))).withColumn('visit2', to_date(concat_ws('-0',split('visit2', '-'))))\.withColumn('visit3', to_date(concat_ws('-0',split('visit3', '-'))))

You might be wondering how to handle dates that are not in a easily handled order. For this solution, I split the date up into separate columns, then concatenated these columns using the above method, which I then turned into a date.

from pyspark.sql.functions import regexp_replace, regexp_extractdf = df.withColumn('Month', regexp_extract(df['Date'], '(\d*)/(\d*)/(\d*)', 1))
df = df.withColumn('Day', regexp_extract(df['Date'], '(\d*)/(\d*)/(\d*)', 2))
df = df.withColumn('Year', regexp_extract(df['Date'], '(\d*)/(\d*)/(\d*)', 3))
df = df.withColumn('Date', to_date(concat_ws('-0', df['Year'], df['Month'], df['Day'])))

These two methods used separately or in conjunction can help you convert numeric strings into DateTime objects, allowing you to perform functions like timediff on Spark dataframe.


You may have noticed that in the previous code snippet, I imported two functions that allow me to use RegEx. Regexp_replace is a lot like Python’s built in replace function, only it takes in a dataframe’s column as its first argument, followed by the regex pattern to be replaced, and lastly the replacement string. Regexp_extract is used to extract an item that matches a regex pattern. The function takes three arguments: the first is the column, the second is regex pattern which uses parenthesis to represent groupings, and the third is the index position of the capture group in the pattern. For example, in the above, the first capture group was the month, the second was the date, and the third was the year.

One problem I ran into with regexp_extract() is that no matter what, it seems to only extract the first match. For something like extracting hashtags, this is no good! People often use multiple hashtags in their tweets or Instagram captions. Fortunately, I was able to use a User Defined Function in conjunction with the regex library to extract hashtags. My solution is below:

from pyspark.sql.types import *
from pyspark.sql.functions import *
import re
def extract(s):
if s:
all_matches = re.findall(r'(#\w*)', s)
return all_matches
return None
extract_udf = udf(lambda z : extract(z), ArrayType(StringType()))
tweets = tweets.withColumn('Hashtags', extract_udf(tweets.Tweet))

Next week I’ll be covering more topics that I come across while I learn PySpark. Until then!

Data Scientist and Writer, passionate about language