Feature Selection with PySpark

Anton Haugen
3 min readMar 29, 2021

This past week I was doing my first machine learning project for Layla AI’s PySpark for Data Scientist course on Udemy. While in the first lessons for classification, the accuracy scores for the models had stellar results, when using the same data cleaning, normalization, and scaling, the accuracy for these models was absolutely dismal.

Whenever I did machine learning projects with scikit-learn in Python, I would do the feature selection and polynomial transformations in a more hands-on manner, that is whenever I wasn’t doing an NLP projects. In some cases for Big Data projects, you might be working with 7,000 features. Fortunately, Spark comes with built in feature selection tools.

In most pipelines, feature selection should occur just before the modeling stage, after ETL, handling imbalance, preprocessing, and importantly, the train-test split. I will be covering the three types of feature selection found in the Spark documentation.


Perhaps the most straightforward of the three methods discussed here, VectorSlicer takes your features array and creates a subarray from selected indices. Depending on the results of your initial examination of your data, you may find that only certain columns contain meaningful information. For example, for a classification problem of medieval art and artifacts involving the entirety of the Metropolitan Museum of art’s entire collection, the artist column may be empty since most artifacts had no attribution and would probably cause unnecessary labor for your model algorithm; in this case, you may want to use a VectorSlice to not include the index position of the art column.

from pyspark.ml.feature import VectorSlicervs= VectorSlicer(inputCol= “features”, outputCol=”sliced”, indices=[1,4])output= vs.transform(df)output.select(‘userFeatures’, ‘features’).show()

If your dataframe has an AttributeGroup, You can also use the names parameter to specify feature names.


For those who like to create feature interactions to make novel features, this would be the approach for you. As the name suggests, you use an R style formula to engineer new features

Since my knowledge of R is limited to using the formula whenever I import the OLS library, I had to learn what each symbol meant:

· ~ is what separates the target from the features. Think of it as your equals sign.

· + means to concatenate terms. If you use “+0”, this will remove the intercept.

· — removes a term, “- 1” will remove the intercept

· : interaction will allow you to create interactions between columns. i.e. y~ a+ b + a:b will correspond to y= w0+w1*a+w2*b +w3*a*b, where the w’s are coefficients

Because R formulas use feature names and outputs a feature array, you would do this before you creating your feature array. Here’s what the code would look like :

from pyspark.ml.feature import RFormulaformula=RFormula(formula= “clicked ~ country+ hour”, featuresCol= “features”, labelCol= “label”)output = formula.fit(dataset).transform(dataset)output.select(“features”, “label”).show()


This is the approach that I went with in my initial problem. It uses ChiSquare to yield the features with the most predictive power. The first of the five selection methods are numTopFeatures, which tells the algorithm the number of features you want. Second is Percentile, which yields top the features in a selected percent of the features. Third, fpr which chooses all features whose p-value are below a inputted threshold. Fourth, fdr uses the Benjamini-Hochberg procedure whose false discovery rate is below a threshold. And lastly, fwe chooses all p-values below threshold using a scale according to the number features.

from pyspark.ml.feature import ChiSqSelectorselector=ChiSqSelector(percentile=0.9, featuresCol=”features”, outputCol=’selectedFeatures’, labelCol= “label”)model=selector.fit(train)
result = model.transform(train)
train =result.select('label','selectedFeatures').withColumnRenamed('selectedFeatures', 'features')
test=new_test.select('label','selectedFeatures').withColumnRenamed('selectedFeatures', 'features')

Feature selection can be nearly impossible manually when handling dataframes with thousands of features. Mastering these techniques are vital to modeling with Big Data.