# Data, Environments & Components
Reproducibility is the backbone of MLOps. Master datastores, data assets, environments, components, and registries — the building blocks that make every experiment repeatable.
## Making ML reproducible
Imagine baking a cake but never writing down the recipe.
You used “some flour” from “that bag in the pantry” and baked it in “whichever oven was free.” Next week you try again — different flour, different oven — and the cake tastes completely different. Was it the flour? The oven? Both?
MLOps has the same problem. If you don’t lock down your data (the ingredients), your environment (the oven), and your components (the recipe steps), you can’t reproduce results or debug failures. Azure ML gives you tools to version and manage all three.
## Datastores: where your data lives
A datastore is a reference to an existing Azure storage service. It doesn’t copy data — it stores connection information so your experiments can access it.
| Datastore Type | Backed By | Use Case |
|---|---|---|
| Azure Blob Storage | Blob containers | Unstructured data: images, text files, logs |
| Azure Data Lake Gen2 | ADLS Gen2 | Large-scale structured/semi-structured data, analytics |
| Azure File Share | Azure Files | Shared file systems, legacy file-based workflows |
| Azure SQL / PostgreSQL | Databases | Tabular data, feature stores |
```python
from azure.ai.ml.entities import AzureBlobDatastore

# Register a blob datastore
blob_store = AzureBlobDatastore(
    name="training_data",
    account_name="neuralsparkstorage",
    container_name="datasets",
    description="Training datasets for all projects"
)
ml_client.datastores.create_or_update(blob_store)
```
What’s happening:
- Line 1: Import the datastore entity class
- Lines 4-8: Define a reference to an existing blob container — no data is copied
- Line 10: Register it in the workspace so experiments can reference it by name
### Exam tip: Credential-less datastores
Azure ML supports credential-less datastores using the workspace’s managed identity. This means the datastore doesn’t store any keys or connection strings — it relies on RBAC.
The exam favours this approach. If asked “what is the most secure way to connect a workspace to a storage account,” the answer is: managed identity + RBAC role assignment (e.g., “Storage Blob Data Reader” role).
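As a sketch, a credential-less datastore definition simply omits any credentials section — the YAML below follows the Azure ML CLI v2 blob datastore schema, with illustrative names:

```yaml
# blob-datastore.yaml — credential-less: no account key or SAS token stored
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: training_data
type: azure_blob
account_name: neuralsparkstorage
container_name: datasets
# No `credentials:` block — access is granted to the workspace managed
# identity via an RBAC role such as "Storage Blob Data Reader".
```

You would register it with `az ml datastore create --file blob-datastore.yaml`, after assigning the RBAC role on the storage account.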
## Data assets: what your data is
A data asset is a versioned reference to specific data. Unlike a datastore (which points to a location), a data asset points to specific files or folders and tracks versions.
| Feature | Points To | Versioned | Best For |
|---|---|---|---|
| URI File | A single file | Yes | A specific CSV, parquet, or image file |
| URI Folder | A directory | Yes | A folder of images, a dataset partition |
| MLTable | Tabular data with schema | Yes | Structured data that needs column types, transforms |
```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create a versioned data asset
training_data = Data(
    name="customer-churn-v2",
    version="2",
    path="azureml://datastores/training_data/paths/churn/2026-04/",
    type=AssetTypes.URI_FOLDER,
    description="April 2026 customer churn dataset (cleaned)"
)
ml_client.data.create_or_update(training_data)
```
What’s happening:
- Line 8: Points to a specific path in a registered datastore — versioning means you can always trace which data trained which model
- Line 9: `URI_FOLDER` because it's a directory of files, not a single file
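The example above registers a URI folder. For the MLTable type, the asset instead points at a folder containing an `MLTable` file that describes how to read the tabular data. A minimal sketch (file name and read options are illustrative):

```yaml
# MLTable — stored alongside the data it describes
paths:
  - file: ./churn.csv
transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
```

Because the schema and parsing rules travel with the asset, every consumer reads the columns the same way.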
## Environments: your software stack
An environment defines the software dependencies for training and inference. It ensures that every run uses exactly the same Python packages, system libraries, and OS.
| Environment Type | Defined By | When To Use |
|---|---|---|
| Curated | Microsoft-managed | Quick start, common frameworks (PyTorch, TensorFlow, sklearn) |
| Custom (conda) | conda.yaml file | When you need specific package versions |
| Custom (Docker) | Dockerfile | When you need system-level dependencies or custom base images |
```yaml
# conda.yaml — defines the software stack
name: churn-training-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - scikit-learn=1.4
  - pandas=2.2
  - pip:
      - azure-ai-ml==1.15.0
      - mlflow==2.12.0
```
```python
from azure.ai.ml.entities import Environment

env = Environment(
    name="churn-training",
    version="3",
    conda_file="conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest",
    description="Churn prediction training environment v3"
)
ml_client.environments.create_or_update(env)
```
What’s happening:
- The conda YAML pins each dependency version (use exact `==x.y.z` pins for full reproducibility) — no surprises between runs
- The Docker image provides the OS base, and conda installs Python packages on top
- Version “3” means you can roll back to v1 or v2 if something breaks
### Scenario: Dr. Luca pins his genomics environment
Dr. Luca Bianchi at GenomeVault ran into a nightmare: a scikit-learn update changed how a model handled missing values, silently changing prediction accuracy by 2%. The model passed validation but produced different results in production.
His fix: pin every package version in conda.yaml and version the environment. Now every experiment references genomics-env:v7, and if Prof. Sarah Lin asks “can you reproduce last month’s results?” — the answer is always yes.
Lesson: Curated environments are great for prototyping, but production workloads need custom environments with pinned versions.
## Components: reusable pipeline building blocks
A component is a self-contained piece of ML code with defined inputs, outputs, and an environment. Think of it as a function that can be plugged into different pipelines.
```yaml
# component.yaml — a data preparation step
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prepare_churn_data
version: "1"
display_name: Prepare Churn Data
type: command
inputs:
  raw_data:
    type: uri_folder
outputs:
  cleaned_data:
    type: uri_folder
code: ./src
environment: azureml:churn-training:3
command: >-
  python prepare.py
  --input ${{inputs.raw_data}}
  --output ${{outputs.cleaned_data}}
```
What’s happening:
- Lines 7-12: Declares typed inputs and outputs — the pipeline knows what this component consumes and produces
- Line 13: Points to the source code directory
- Line 14: References a specific environment version
- Lines 15-18: The actual command, with input/output paths injected by Azure ML
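To see how the component plugs into a pipeline, here is a sketch of a pipeline job YAML that runs it — the compute target name and input path are assumptions for illustration:

```yaml
# pipeline.yaml — wires the component into a pipeline job
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: churn-data-prep
settings:
  default_compute: azureml:cpu-cluster
inputs:
  raw:
    type: uri_folder
    path: azureml:customer-churn-v2:2
jobs:
  prep:
    type: command
    component: ./component.yaml
    inputs:
      raw_data: ${{parent.inputs.raw}}
```

The pipeline binds its own `raw` input to the component's `raw_data` input; because both are typed `uri_folder`, Azure ML can validate the wiring before anything runs.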
## Registries: share across workspaces
An Azure ML registry is a central catalog for sharing assets (models, environments, components, data) across multiple workspaces. This is how you promote a validated component from dev to production without rebuilding it.
```shell
# Create a registry
az ml registry create \
  --name neuralspark-registry \
  --resource-group rg-ml-shared \
  --location eastus

# Share a component to the registry
az ml component create \
  --file component.yaml \
  --registry-name neuralspark-registry
```
### Exam tip: Registry vs workspace
A common exam scenario: “How do you share a trained model between the dev and production workspaces?”
Answer: Register the model in an Azure ML registry, then reference it from the production workspace. Registries are workspace-independent — they provide cross-workspace asset sharing with RBAC control.
Don’t confuse the two: the workspace model registry is local to a single workspace, while an Azure ML registry is a separate resource shared across workspaces.
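Once a component lives in a registry, a pipeline job in any workspace can reference it by registry URI instead of a local file — a sketch (the version number is illustrative):

```yaml
# Pipeline job step referencing a registry-hosted component
jobs:
  prep:
    type: command
    component: azureml://registries/neuralspark-registry/components/prepare_churn_data/versions/1
    inputs:
      raw_data: ${{parent.inputs.raw}}
```

The `azureml://registries/...` URI is what makes the reference workspace-independent: dev and production pipelines resolve to the exact same validated asset.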
## Knowledge check
1. Dr. Luca needs to ensure that his genomics training pipeline uses exactly the same Python packages every time it runs, even months later. What should he use?
2. Kai's NeuralSpark team has a data preparation component that works perfectly in the dev workspace. He needs to use the same component in the production workspace without rebuilding it. What should he use?
3. Dr. Fatima at Meridian Financial needs to connect a workspace to a storage account without storing any credentials. What is the recommended approach?
Next up: Compute Targets — choosing the right engine for training vs inference.