Introducing MLlib’s One-vs-Rest Classifier
While learning about classification algorithms in PySpark’s MLlib, I came across an algorithm I had not used in scikit-learn: the one-vs-rest classifier.
The one-vs-rest classifier, also called one-vs-all, splits a multi-class classification problem into several binary classification problems, with one model for each class. For the popular iris dataset, this would mean that each species of iris gets its own model. The classifier for class i is trained to predict whether the label is i or not, and the final prediction is the label of the most confident classifier. One popular use of this type of classification is email tagging, where an email can be from work, your social networks, friends and family, or spam.
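To make the decision rule concrete, here is a minimal plain-Python sketch of the idea, not MLlib's actual implementation. The helper names, the email tags, and the one-dimensional feature are all made up for illustration, and the lambdas stand in for trained binary models.

```python
from typing import Callable, Dict, Hashable, List, Sequence


def binarize_labels(labels: Sequence[Hashable], positive: Hashable) -> List[int]:
    """Relabel a multi-class target as 1 (is `positive`) or 0 (the rest)."""
    return [1 if label == positive else 0 for label in labels]


def one_vs_rest_predict(scorers: Dict[str, Callable[[float], float]], x: float) -> str:
    """Return the class whose binary scorer is most confident about x."""
    return max(scorers, key=lambda label: scorers[label](x))


# Hypothetical email tags; this is the target the "work" model would train on.
tags = ["work", "spam", "social", "work"]
work_targets = binarize_labels(tags, "work")

# Stand-ins for trained binary models: a higher score means more confident.
scorers = {
    "work": lambda x: -abs(x - 1.0),
    "social": lambda x: -abs(x - 5.0),
    "spam": lambda x: -abs(x - 9.0),
}

# The most confident scorer decides the final label.
predicted = one_vs_rest_predict(scorers, 4.2)
```

Each class gets its own 0/1 target and its own scorer, and the prediction is simply the label whose scorer returns the highest value.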
To use a one-vs-rest classifier in PySpark’s MLlib, you first instantiate the base classifier: the binary classification algorithm you want your one-vs-rest classifier to use.
from pyspark.ml.classification import LogisticRegression, OneVsRest
lr = LogisticRegression(maxIter=10, tol=1e-6, fitIntercept=True)
While you could use a multilayer perceptron as the base classifier, it may lead to longer run times since the model is more complex. You then instantiate the one-vs-rest classifier with your base classifier and fit it to your training data.
ovr = OneVsRest(classifier=lr)
ovrModel = ovr.fit(train)
Though I’ve mostly tuned hyperparameters through the logistic regression model, by referring to its parameters in my parameter grid (e.g. lr.regParam), you can also set a weight column and adjust parallelism on the one-vs-rest classifier itself. If you want to tune your base classifier as well, be sure to instantiate it as its own variable before training your model, so that the parameter grid can refer to it.
Because it trains one model per class, a one-vs-rest classifier does not necessarily work well for classification problems with hundreds of classes or with slow base models. In addition, each binary model is prone to class imbalance even if the data set has an even distribution of classes, since the positive class is a single label and the negative class is everything else.
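A quick back-of-the-envelope check shows both costs. The sketch below uses a hypothetical helper and made-up label counts: for a perfectly balanced data set with K classes, it counts how many binary models are needed and what share of the rows each model sees as positives.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def positive_fraction_per_model(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Share of rows that are positives in each class's binary problem."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Made-up balanced data: 4 classes with 25 rows each.
balanced_labels = [k for k in range(4) for _ in range(25)]
fractions = positive_fraction_per_model(balanced_labels)

# One binary model per class.
num_models = len(fractions)

# Each model sees only 1/K of the rows as positives.
for label, fraction in sorted(fractions.items()):
    print(f"class {label}: {fraction:.0%} positives, {1 - fraction:.0%} negatives")
```

With K balanced classes, every binary model trains on a 1-to-(K - 1) split, so the imbalance grows with the number of classes even though the original labels are evenly distributed.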