Feature Selection for PySpark Tree Classifiers

If you saw my blog post last week, you’ll know that I’ve been working through LaylaAI’s PySpark Essentials for Data Scientists course on Udemy, along with PySpark’s feature selection documentation. This week, while finalizing my model for the project and reviewing my work, I needed to perform feature selection.

Unlike LaylaAI’s, my best model for classifying music genres was a RandomForestClassifier, not a OneVsRest. Surprisingly to many Spark users, the features selected by ChiSqSelector are incompatible with Decision Tree classifiers, including Random Forest classifiers, unless you transform the sparse vectors into dense vectors. While I understand that approach can work, it wasn’t what I ultimately went with.

Once you’ve found that your best baseline model is a Decision Tree or Random Forest, you’ll want to perform feature selection with the VectorSlicer to try to improve your classifier’s metrics. For this, you’ll first need a ranked list of feature importances from your best model:
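A minimal sketch of that ranking step. The names `rf_model` and `feature_cols` are my assumptions, not from the course: `rf_model` stands for a fitted RandomForestClassificationModel, and `feature_cols` for the column list you gave your VectorAssembler.

```python
# Sketch: rank features by importance from a fitted random forest.
def rank_features(feature_names, importances):
    """Pair each feature name with its importance score, sorted descending."""
    return sorted(zip(feature_names, importances),
                  key=lambda pair: pair[1], reverse=True)

# With a fitted Spark model (assumed names, shown for context):
# importances = rf_model.featureImportances.toArray()
# ranked = rank_features(feature_cols, importances)
# for name, score in ranked[:10]:
#     print(f"{name}: {score:.4f}")
```

`featureImportances` comes back as a Spark ML vector, so `toArray()` turns it into one plain weight per input feature before the sort.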

Next, you’ll want to import the VectorSlicer and loop over different numbers of features. You’ll see the feature-importance list generated in the previous step being sliced depending on the value of n. I’ve adapted this code from LaylaAI’s PySpark course.
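Here is one way the loop could look. Since VectorSlicer selects by vector position, the ranking here is assumed to be (feature_index, importance) pairs sorted descending; `train`, `test`, and the `"features"`/`"label"` column names are likewise assumptions about your pipeline, not the course’s exact code.

```python
def top_n_indices(ranked, n):
    """Take the vector indices of the n most important features."""
    return [int(idx) for idx, _ in ranked[:n]]

def evaluate_top_n(train, test, ranked, n):
    """Refit a random forest on the top-n features and return its F1 score.
    Assumes Spark DataFrames with an assembled 'features' column."""
    from pyspark.ml.feature import VectorSlicer
    from pyspark.ml.classification import RandomForestClassifier
    from pyspark.ml.evaluation import MulticlassClassificationEvaluator

    slicer = VectorSlicer(inputCol="features", outputCol="sliced",
                          indices=top_n_indices(ranked, n))
    rf = RandomForestClassifier(featuresCol="sliced", labelCol="label")
    model = rf.fit(slicer.transform(train))
    predictions = model.transform(slicer.transform(test))

    evaluator = MulticlassClassificationEvaluator(labelCol="label",
                                                  metricName="f1")
    return evaluator.evaluate(predictions)

# for n in [10, 30, 50, 70]:
#     print(n, evaluate_top_n(train, test, ranked, n))
```

Refitting inside the loop matters: slicing changes the feature vector’s length and order, so the model trained on the full vector can’t simply be reused.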

For my model, the top 30 features produced better results than the top 70, though surprisingly, neither performed better than the baseline.

Feature selection is an essential part of the machine learning process, and it’s worth building into your workflow even when, as in my case, the baseline comes out on top.
