Feature Selection for PySpark Tree Classifiers

If you saw my blog post last week, you’ll know that I’ve been working through LaylaAI’s PySpark Essentials for Data Scientists course on Udemy, along with the PySpark feature selection documentation. This week, while finalizing my model for the course project and reviewing my work, I needed to perform feature selection.

Unlike LaylaAI, my best model for classifying music genres was a RandomForestClassifier rather than a OneVsRest. Surprisingly to many Spark users, the sparse vectors produced by the ChiSqSelector are incompatible with Decision Tree classifiers, including Random Forest Classifiers, unless you first transform them into dense vectors. While I understand that conversion can work, it wasn’t the approach I ultimately took.

If your best baseline model turns out to be a Decision Tree or Random Forest, you can instead perform feature selection with the VectorSlicer to try to improve your classifier’s metrics. First, generate the list of feature importances from your best model:

from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

classifier = RandomForestClassifier()
paramGrid = (ParamGridBuilder()
             .addGrid(classifier.maxDepth, [2, 5, 10])
             .addGrid(classifier.maxBins, [5, 10, 20])
             .addGrid(classifier.numTrees, [5, 20, 50])
             .build())
crossval = CrossValidator(estimator=classifier,
                          estimatorParamMaps=paramGrid,
                          evaluator=MulticlassClassificationEvaluator(),
                          numFolds=2)
fitModel = crossval.fit(train)
bestModel = fitModel.bestModel
featureImportances = bestModel.featureImportances.toArray()

Next, import the VectorSlicer and loop over different numbers of features. You’ll see that the feature importance list generated in the previous snippet is sliced according to the value of n on each iteration. I’ve adapted this code from LaylaAI’s PySpark course.

from pyspark.ml.feature import VectorSlicer

# Point the classifier at the sliced column so each iteration
# actually trains on the top-n features.
classifier = RandomForestClassifier(featuresCol='best_features')
MC_evaluator = MulticlassClassificationEvaluator()
maximum = len(featureImportances)  # total number of features

for n in range(10, maximum, 10):
    print("Testing top n =", n, "features")

    best_n_features = featureImportances.argsort()[-n:][::-1]
    best_n_features = best_n_features.tolist()
    vs = VectorSlicer(inputCol='features', outputCol='best_features',
                      indices=best_n_features)
    bestFeaturesDf = vs.transform(final_data)
    train, test = bestFeaturesDf.randomSplit([0.7, 0.3])

    columns = ['Classifier', 'Result']
    vals = [('Place Holder', 'N/A')]
    results = spark.createDataFrame(vals, columns)

    paramGrid = (ParamGridBuilder()
                 .addGrid(classifier.maxDepth, [2, 5, 10])
                 .addGrid(classifier.maxBins, [5, 10, 20])
                 .addGrid(classifier.numTrees, [5, 20, 50])
                 .build())
    crossval = CrossValidator(estimator=classifier,
                              estimatorParamMaps=paramGrid,
                              evaluator=MulticlassClassificationEvaluator(),
                              numFolds=2)
    fitModel = crossval.fit(train)
    bestModel = fitModel.bestModel
    featureImportances = bestModel.featureImportances.toArray()
    print("Feature Importances:", featureImportances)

    predictions = fitModel.transform(test)
    accuracy = MC_evaluator.evaluate(predictions) * 100
    print(" ")
    print("Accuracy:", accuracy)

For my model, the top 30 features produced better results than the top 70, though surprisingly, neither outperformed the baseline.

Feature selection is an essential part of the machine learning process, and integrating it into your workflow is well worth it when trying to improve on your baseline model.