MLflow: Track Every Experiment
If you can't track it, you can't reproduce it. Master MLflow experiment tracking — log metrics, parameters, and artifacts so every experiment is fully traceable.
What is MLflow?
MLflow is like a lab notebook that writes itself.
In a science lab, you record every experiment: what you mixed, how much, what temperature, and what happened. Without notes, you can’t repeat a success or understand a failure.
MLflow does this automatically for ML experiments. Every time you train a model, it records the settings you used (parameters), how well it performed (metrics), and the actual model file (artifacts). Weeks later, you can look up “which run got 94% accuracy?” and trace back to the exact code and data.
MLflow concepts
| Concept | What It Is | Example |
|---|---|---|
| Experiment | A named group of related runs | "churn-prediction-v2" |
| Run | A single execution of a training script | One training job with specific hyperparameters |
| Parameter | An input configuration value | learning_rate=0.01, n_estimators=100 |
| Metric | A measured output value | accuracy=0.94, loss=0.12 |
| Artifact | A file produced by the run | model.pkl, feature_importance.png, confusion_matrix.json |
| Tag | Metadata label | "team=nlp", "sprint=q2", "git_commit=abc123" |
Logging with MLflow in Azure ML
When you run a training script in Azure ML, MLflow tracking is automatic. Here’s how to use it:
```python
import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score

# MLflow auto-connects to your Azure ML workspace
# No manual server configuration needed

# Start a run
with mlflow.start_run(run_name="rf-baseline"):
    # Log parameters (inputs)
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 10)
    mlflow.log_param("dataset_version", "v2")

    # Train the model
    model = RandomForestClassifier(n_estimators=100, max_depth=10)
    model.fit(X_train, y_train)

    # Log metrics (outputs)
    predictions = model.predict(X_test)
    mlflow.log_metric("accuracy", accuracy_score(y_test, predictions))
    mlflow.log_metric("f1_score", f1_score(y_test, predictions))

    # Log the model as an artifact
    mlflow.sklearn.log_model(model, "churn-model")

    # Log additional artifacts
    mlflow.log_artifact("feature_importance.png")
```
What’s happening:
- `mlflow.start_run(run_name=...)` opens a named run — everything logged inside the `with` block is grouped together
- The `mlflow.log_param` calls record the input configuration so you can reproduce this exact setup
- The `mlflow.log_metric` calls record how well the model performed
- `mlflow.sklearn.log_model` saves the model in MLflow’s standard format (can be deployed to any MLflow-compatible platform)
- `mlflow.log_artifact` saves additional files (charts, reports) alongside the model
Scenario: Dr. Luca's reproducibility rescue
Dr. Luca Bianchi at GenomeVault ran 47 experiments over three weeks. His colleague asks: “Which run produced the best F1 score, and can we reproduce it?”
Without MLflow: “Um, I think it was the one on Tuesday… let me check my notebooks…”
With MLflow:
```python
# Find the best run across all experiments
runs = mlflow.search_runs(
    experiment_names=["genomics-variant-calling"],
    order_by=["metrics.f1_score DESC"],
    max_results=1,
)
print(runs[["run_id", "params.model_type", "metrics.f1_score"]])
```

Result: Run abc123, model_type=gradient_boost, F1=0.967. Every parameter, the exact code commit (via Git tag), and the trained model are all traceable.
Prof. Sarah Lin: “This is exactly the kind of rigour we need for our publications.”
Autologging
MLflow can automatically log parameters and metrics for popular frameworks — no manual log_param calls needed:
```python
# Enable autologging for scikit-learn
mlflow.sklearn.autolog()

# Just train the model — MLflow captures everything
model = RandomForestClassifier(n_estimators=100, max_depth=10)
model.fit(X_train, y_train)
```
Supported frameworks for autologging:
| Framework | What’s Auto-Logged |
|---|---|
| scikit-learn | All hyperparameters, metrics (accuracy, F1, etc.), model artifact |
| PyTorch / PyTorch Lightning | Loss per epoch, learning rate, model weights |
| TensorFlow / Keras | Epoch metrics, optimizer config, model architecture |
| XGBoost / LightGBM | Boosting params, feature importance, eval metrics |
| Spark ML | Pipeline stages, evaluator metrics |
Exam tip: Autologging vs manual logging
Autologging is convenient but logs EVERYTHING. For production pipelines, manual logging gives you control over exactly what’s tracked.
The exam may ask when to use each:
- Autologging: exploration, prototyping, when you want comprehensive tracking with no code changes
- Manual logging: production pipelines, when you need specific metrics or custom artifacts
Comparing runs
One of MLflow’s most powerful features is comparing runs side by side:
```python
# Search and compare runs
import mlflow

runs = mlflow.search_runs(
    experiment_names=["churn-prediction-v2"],
    filter_string="metrics.accuracy > 0.90",
    order_by=["metrics.f1_score DESC"],
)

# View top runs
print(runs[["run_id", "params.n_estimators", "params.max_depth",
            "metrics.accuracy", "metrics.f1_score"]].head(5))
```
What’s happening:
- `filter_string="metrics.accuracy > 0.90"` keeps only runs with accuracy above 90%
- `order_by=["metrics.f1_score DESC"]` sorts by F1 score (descending) — best runs first
- The final `print` shows the key parameters and metrics of the top five runs side by side for comparison
In the Azure ML Studio UI, you can also visually compare runs — select multiple runs and view metrics in parallel charts, scatter plots, or tables.
Scenario: Kai compares 200 sweep runs
Kai just ran a hyperparameter sweep with 200 trials (covered in Module 7). Now he needs to find the best model.
```python
# Find the top 5 runs from the sweep
best_runs = mlflow.search_runs(
    experiment_names=["churn-sweep-apr-2026"],
    order_by=["metrics.f1_score DESC"],
    max_results=5,
)

# Log the winner for the team
winner = best_runs.iloc[0]
print(f"Best run: {winner.run_id}")
print(f"  F1: {winner['metrics.f1_score']:.4f}")
print(f"  Learning rate: {winner['params.learning_rate']}")
print(f"  Max depth: {winner['params.max_depth']}")
```

Priya (CTO): “Which model do we ship?”
Kai: “Run 7f3a2b1 — F1 of 0.9612 with learning_rate=0.03 and max_depth=8.”
Knowledge check
Dr. Luca ran 47 experiments over three weeks. His colleague asks which run produced the best F1 score. What tool should Luca use?
Kai wants comprehensive experiment tracking with minimal code changes during early prototyping. What should he enable?
Next up: AutoML & Hyperparameter Tuning — letting Azure find the best model for you.