Getting Started with PySpark
To expand my Data Science toolkit, I’ve spent the last few weeks learning about PySpark through Layla AI’s PySpark course on Udemy. PySpark allows Python programmers to use Apache Spark. One of the reasons you would want to learn PySpark is so that you can work with Big Data, data that might make the 8GB of RAM on your 2013 Apple MacBook Pro explode. Spark has quickly become one of the most in-demand skills on the market today for processing large datasets. PySpark lets you work with this data in memory through lazy evaluation, meaning it only applies functions when it absolutely needs to.
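That laziness is easy to see in practice. Here is a minimal sketch (it assumes a SparkSession named spark and the fifa19.csv file, both of which are set up later in this post):
# Transformations like filter() and select() are lazy; Spark only records them
df = spark.read.csv('fifa19.csv', inferSchema=True, header=True)
older_players = df.filter('Age > 40').select('Name', 'Age')
# Nothing is actually computed until an action such as show(), count(), or collect() runs
older_players.show(5)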
Learning a new language can be daunting. Fortunately, for users of Pandas, SQL, and scikit-learn, there are many ways in which your prior knowledge translates well to learning PySpark. In this blog post, I will discuss how I got started with PySpark.
Installing PySpark with Homebrew was the easiest route for me, though I also had to install Java 8. This blog post provides more information: https://medium.com/@yajieli/installing-spark-pyspark-on-mac-and-fix-of-some-common-errors-355a9050f735
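Once everything is installed, a quick sanity check is to make sure PySpark imports cleanly and reports a version (a minimal check; the version number you see will depend on what Homebrew installed):
import pyspark
# If this prints a version number, the install worked
print(pyspark.__version__)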
Using PySpark involves working within a Spark session. When you first import PySpark, you’ll also need to import SparkSession, the entry point that tells your computer you’re currently working within Apache Spark.
import pyspark
from pyspark.sql import SparkSession
#Here I instantiate the spark session
spark = SparkSession.builder.appName('Analyzing').getOrCreate()
spark
You’ll quickly notice that reading in a file is fairly similar to reading in a file in Pandas.
fifa = spark.read.csv('fifa19.csv', inferSchema=True, header=True)
Learning about Spark was the first time I encountered a parquet file. Parquet files can be massive and are often partitioned into slices or by a categorical variable. Multiple parquet files can be read by using * or by pointing at the folder directory. If you only want to import specific parquet files, you can comma-separate them.
# Read every parquet file matching the users* pattern (parquet files store their own schema)
allusers = spark.read.parquet('users*')
# Read only two specific parquet files
users1_2 = spark.read.parquet('users1.parquet', 'users2.parquet')
You definitely want to check the first five rows to see if the import went correctly, much like Pandas’ df.head() function.
fifa.show(5)
For those accustomed to Pandas, you might want to see things formatted like a Pandas table, in which case you could run the following code:
fifa.limit(5).toPandas()
You might also want to check that data types imported correctly. You can use PySpark’s printSchema() function.
fifa.printSchema()
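If a column didn’t infer the way you expected (say, a numeric column came in as a string), you can cast it yourself. This is just a sketch; I’m using the Age column for illustration:
from pyspark.sql.types import IntegerType
# Cast Age to an integer in case it was read in as a string
fifa = fifa.withColumn('Age', fifa['Age'].cast(IntegerType()))
fifa.printSchema()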
Many of SQL’s functions are available in PySpark by importing pyspark.sql.functions.
from pyspark.sql.functions import *
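For instance, you could combine groupBy with aggregate functions like avg and count, much as you would in a SQL GROUP BY (a small sketch using the Nationality, Age, and Name columns that appear throughout this post):
from pyspark.sql.functions import avg, count
# Average age and player count per nationality, similar to a SQL GROUP BY
fifa.groupBy('Nationality').agg(avg('Age'), count('Name')).limit(5).toPandas()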
From here the syntax follows SQL pretty closely. To see a player’s name and nationality, you would run the following code:
fifa.select(['Name','Nationality']).limit(5).toPandas()
If you wanted to see players older than 40, you could run the filter function:
fifa.filter('Age >40').select(['Name', 'Nationality', 'Age']).limit(5).toPandas()
If you wanted to see the top ten oldest players, you can use the orderBy function:
fifa.select(['Name', 'Nationality', 'Age']).orderBy(fifa['Age'].desc()).limit(10).toPandas()
You can also use a where function in PySpark’s SQL.
fifa.select("Name", "Club").where(fifa.Club.like("%Barcelona%")).show(5, False)
You’ll see that the above line of code uses SQL’s LIKE wildcard character (%) to match clubs with Barcelona in their name. It also passes False to show() to prevent truncation when displaying the results. You can also use the startswith and endswith functions to filter further.
fifa.select("Name", "Club").where(fifa.Name.startswith("L")).where(fifa.Name.endswith("i")).limit(4).toPandas()
If you wanted to select only specific players, you could use the isin function:
fifa[fifa.Club.isin("FC Barcelona", "Juventus")].limit(4).toPandas()
One other piece of functionality I like is the substr function, which allows you to isolate certain parts of string columns:
fifa.select("Photo", fifa.Photo.substr(-4, 4)).show(5, False)
In the above example, the substring pulls out the photo file format at the end of each link in the Photo column.
The above functions can help you get started with PySpark. Check back here next week for more information on what aspects of Data Science I’m currently learning. Until then!