MLflow: Track Every Experiment
If you can't track it, you can't reproduce it. Master MLflow experiment tracking — log metrics, parameters, and artifacts so every experiment is fully traceable.
What is MLflow?
MLflow is like a lab notebook that writes itself.
In a science lab, you record every experiment: what you mixed, how much, what temperature, and what happened. Without notes, you can’t repeat a success or understand a failure.
MLflow does this automatically for ML experiments. Every time you train a model, it records the settings you used (parameters), how well it performed (metrics), and the actual model file (artifacts). Weeks later, you can look up “which run got 94% accuracy?” and trace back to the exact code and data.
MLflow concepts
| Concept | What It Is | Example |
|---|---|---|
| Experiment | A named group of related runs | "churn-prediction-v2" |
| Run | A single execution of a training script | One training job with specific hyperparameters |
| Parameter | An input configuration value | learning_rate=0.01, n_estimators=100 |
| Metric | A measured output value | accuracy=0.94, loss=0.12 |
| Artifact | A file produced by the run | model.pkl, feature_importance.png, confusion_matrix.json |
| Tag | Metadata label | "team=nlp", "sprint=q2", "git_commit=abc123" |
Logging with MLflow in Azure ML
When you run a training script in Azure ML, MLflow tracking is automatic. Here’s how to use it:
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# MLflow auto-connects to your Azure ML workspace
# No manual server configuration needed

# Start a run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters (inputs)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2")

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics (outputs)
    predictions = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, predictions))
    mlflow.log_metric("f1_score", f1_score(y_test, predictions))

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "churn-model")

    # Log additional artifacts
    mlflow.log_artifact("feature_importance.png")
```
What’s happening:
- `mlflow.start_run(run_name=...)` opens a named run — everything logged inside the `with` block is grouped together
- The `mlflow.log_param` calls record the input configuration so you can reproduce this exact setup
- The `mlflow.log_metric` calls record how well the model performed
- `mlflow.sklearn.log_model` saves the model in MLflow’s standard format (can be deployed to any MLflow-compatible platform)
- `mlflow.log_artifact` saves additional files (charts, reports) alongside the model
Scenario: Dr. Luca's reproducibility rescue
Dr. Luca Bianchi at GenomeVault ran 47 experiments over three weeks. His colleague asks: “Which run produced the best F1 score, and can we reproduce it?”
Without MLflow: “Um, I think it was the one on Tuesday… let me check my notebooks…”
With MLflow:
```python
# Find the best run across all experiments
runs = mlflow.search_runs(
    experiment_names=["genomics-variant-calling"],
    order_by=["metrics.f1_score DESC"],
    max_results=1,
)
print(runs[["run_id", "params.model_type", "metrics.f1_score"]])
```

Result: Run abc123, model_type=gradient_boost, F1=0.967. Every parameter, the exact code commit (via Git tag), and the trained model are all traceable.
Prof. Sarah Lin: “This is exactly the kind of rigour we need for our publications.”
Autologging
MLflow can automatically log parameters and metrics for popular frameworks — no manual log_param calls needed:
```python
# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Just train the model — MLflow captures everything
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
```
Supported frameworks for autologging:
| Framework | What’s Auto-Logged |
|---|---|
| scikit-learn | All hyperparameters, metrics (accuracy, F1, etc.), model artifact |
| PyTorch / PyTorch Lightning | Loss per epoch, learning rate, model weights |
| TensorFlow / Keras | Epoch metrics, optimizer config, model architecture |
| XGBoost / LightGBM | Boosting params, feature importance, eval metrics |
| Spark ML | Pipeline stages, evaluator metrics |
Exam tip: Autologging vs manual logging
Autologging is convenient but logs EVERYTHING. For production pipelines, manual logging gives you control over exactly what’s tracked.
The exam may ask when to use each:
- Autologging: exploration, prototyping, when you want comprehensive tracking with no code changes
- Manual logging: production pipelines, when you need specific metrics or custom artifacts
Comparing runs
One of MLflow’s most powerful features is comparing runs side by side:
```python
# Search and compare runs
import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-prediction-v2"],
    filter_string="metrics.accuracy > 0.90",
    order_by=["metrics.f1_score DESC"],
)

# View top runs
print(runs[["run_id", "params.n_estimators", "params.max_depth",
            "metrics.accuracy", "metrics.f1_score"]].head(5))
```
What’s happening:
- `filter_string="metrics.accuracy > 0.90"` keeps only runs with accuracy above 90%
- `order_by=["metrics.f1_score DESC"]` sorts by F1 score (descending) — best runs first
- The final `print` shows the key parameters and metrics of the top five runs side by side for comparison
In the Azure ML Studio UI, you can also visually compare runs — select multiple runs and view metrics in parallel charts, scatter plots, or tables.
Scenario: Kai compares 200 sweep runs
Kai just ran a hyperparameter sweep with 200 trials (covered in Module 7). Now he needs to find the best model.
```python
# Find the top 5 runs from the sweep
best_runs = mlflow.search_runs(
    experiment_names=["churn-sweep-apr-2026"],
    order_by=["metrics.f1_score DESC"],
    max_results=5,
)

# Log the winner for the team
winner = best_runs.iloc[0]
print(f"Best run: {winner.run_id}")
print(f"  F1: {winner['metrics.f1_score']:.4f}")
print(f"  Learning rate: {winner['params.learning_rate']}")
print(f"  Max depth: {winner['params.max_depth']}")
```

Priya (CTO): “Which model do we ship?”
Kai: “Run 7f3a2b1 — F1 of 0.9612 with learning_rate=0.03 and max_depth=8.”
Knowledge check
Dr. Luca ran 47 experiments over three weeks. His colleague asks which run produced the best F1 score. What tool should Luca use?
Kai wants comprehensive experiment tracking with minimal code changes during early prototyping. What should he enable?
Next up: AutoML & Hyperparameter Tuning — letting Azure find the best model for you.