How to Go with the MLFlow: Tracking Tutorial

Anyone who has worked on a professional or personal machine learning project knows that keeping track of the performance and evaluations of each of your models can get particularly messy. In my case, I can somewhat fondly recall bleary-eyed scrolling through a finalized Jupyter notebook at the eleventh hour to make sure I had created a model with each relevant algorithm and tuned each one’s hyperparameters properly.

MLflow is an open-source framework released in 2018 by Databricks, the company behind the Apache Spark project, to help users keep track of the machine learning lifecycle. Inspired in part by platforms built in-house at tech companies like Google (TFX), Uber (Michelangelo), and Facebook (FBLearner), MLflow lets tracking, planning, and training work across libraries and algorithms through an open interface design. MLflow Tracking allows users to keep track of parameters, metrics, source code, and versions, as well as any artifacts like data and models. This is not just critical for scaling a model; it is also beneficial in cases where data governance is an issue.

In about twenty lines of code, you can track your modeling process in MLflow. Here’s what my MLflow tracking client looked like on my first attempt:

For each model for this classification problem, I have details on the parameters and evaluations as well as a handy reference to the algorithm used.

So once you run “!pip install mlflow”, you can get started with tracking the modeling process in MLflow using the short tutorial below.

Tutorial

In the code snippet below, I import MLflow and the MLflow tracking client for later use.

import mlflow
from mlflow import spark
from mlflow.tracking import MlflowClient

# instantiate the tracking client used throughout the tutorial
client = MlflowClient()

LaylaAI, in her Udemy course PySpark Essentials for Data Scientists, provided a handy function to create runs for each model:

# list the experiments registered with the tracking client
# (in MLflow 2.x and later, this call is client.search_experiments())
experiments = client.list_experiments()

The code below is what you would want to run for each model.

# test the functionality here
run = create_run('Experiment-3')

With a logistic regression model and cross-validation, this would look like the following:

from pyspark.ml.classification import LogisticRegression

# start a tracked run for this experiment, then build the model
run = create_run(experiment_name)
classifier = LogisticRegression()

You could, of course, log more parameters depending on your needs.

To display the client, I ran the command “mlflow ui” on my command line within the project folder, then accessed http://localhost:5000/ in my browser.

Further Reading

Corey Zumar, a software developer at Databricks, has an awesome demo on using MLflow with the popular MNIST digit-recognition dataset, covering more of MLflow’s cloud functionality for productionizing your models and avoiding the rat’s nest of mappings that can occur when an organization uses a variety of ML frameworks.
