# Data, Environments & Components
Reproducibility is the backbone of MLOps. Master datastores, data assets, environments, components, and registries — the building blocks that make every experiment repeatable.
## Making ML reproducible
Imagine baking a cake but never writing down the recipe.
You used “some flour” from “that bag in the pantry” and baked it in “whichever oven was free.” Next week you try again — different flour, different oven — and the cake tastes completely different. Was it the flour? The oven? Both?
MLOps has the same problem. If you don’t lock down your data (the ingredients), your environment (the oven), and your components (the recipe steps), you can’t reproduce results or debug failures. Azure ML gives you tools to version and manage all three.
## Datastores: where your data lives
A datastore is a reference to an existing Azure storage service. It doesn’t copy data — it stores connection information so your experiments can access it.
| Datastore Type | Backed By | Use Case |
|---|---|---|
| Azure Blob Storage | Blob containers | Unstructured data: images, text files, logs |
| Azure Data Lake Gen2 | ADLS Gen2 | Large-scale structured/semi-structured data, analytics |
| Azure File Share | Azure Files | Shared file systems, legacy file-based workflows |
| Azure SQL / PostgreSQL | Databases | Tabular data, feature stores |
```python
from azure.ai.ml.entities import AzureBlobDatastore

# Register a blob datastore
blob_store = AzureBlobDatastore(
    name="training_data",
    account_name="neuralsparkstorage",
    container_name="datasets",
    description="Training datasets for all projects"
)
ml_client.datastores.create_or_update(blob_store)
```
What’s happening:
- Line 1: Import the datastore entity class
- Lines 4-8: Define a reference to an existing blob container — no data is copied
- Line 10: Register it in the workspace so experiments can reference it by name
### Exam tip: Credential-less datastores
Azure ML supports credential-less datastores using the workspace’s managed identity. This means the datastore doesn’t store any keys or connection strings — it relies on RBAC.
The exam favours this approach. If asked “what is the most secure way to connect a workspace to a storage account,” the answer is: managed identity + RBAC role assignment (e.g., “Storage Blob Data Reader” role).
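As a sketch, a credential-less datastore definition simply omits any credentials section — the YAML below follows the Azure ML CLI v2 blob datastore schema, with illustrative names:

```yaml
# blob-datastore.yaml — credential-less: no account key or SAS token stored
$schema: https://azuremlschemas.azureedge.net/latest/azureBlob.schema.json
name: training_data
type: azure_blob
account_name: neuralsparkstorage
container_name: datasets
# No `credentials:` block — access is granted to the workspace managed
# identity via an RBAC role such as "Storage Blob Data Reader".
```

You would register it with `az ml datastore create --file blob-datastore.yaml`, after assigning the RBAC role on the storage account.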
## Data assets: what your data is
A data asset is a versioned reference to specific data. Unlike a datastore (which points to a location), a data asset points to specific files or folders and tracks versions.
| Feature | Points To | Versioned | Best For |
|---|---|---|---|
| URI File | A single file | Yes | A specific CSV, parquet, or image file |
| URI Folder | A directory | Yes | A folder of images, a dataset partition |
| MLTable | Tabular data with schema | Yes | Structured data that needs column types, transforms |
```python
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create a versioned data asset
training_data = Data(
    name="customer-churn-v2",
    version="2",
    path="azureml://datastores/training_data/paths/churn/2026-04/",
    type=AssetTypes.URI_FOLDER,
    description="April 2026 customer churn dataset (cleaned)"
)
ml_client.data.create_or_update(training_data)
```
What’s happening:
- Line 8: Points to a specific path in a registered datastore — versioning means you can always trace which data trained which model
- Line 9: `URI_FOLDER` because it's a directory of files, not a single file
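The example above registers a URI folder. For the MLTable type, the asset instead points at a folder containing an `MLTable` file that describes how to read the tabular data. A minimal sketch (file name and read options are illustrative):

```yaml
# MLTable — stored alongside the data it describes
paths:
  - file: ./churn.csv
transformations:
  - read_delimited:
      delimiter: ","
      header: all_files_same_headers
```

Because the schema and parsing rules travel with the asset, every consumer reads the columns the same way.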
## Environments: your software stack
An environment defines the software dependencies for training and inference. It ensures that every run uses exactly the same Python packages, system libraries, and OS.
| Environment Type | Defined By | When To Use |
|---|---|---|
| Curated | Microsoft-managed | Quick start, common frameworks (PyTorch, TensorFlow, sklearn) |
| Custom (conda) | conda.yaml file | When you need specific package versions |
| Custom (Docker) | Dockerfile | When you need system-level dependencies or custom base images |
```yaml
# conda.yaml — defines the software stack
name: churn-training-env
channels:
  - conda-forge
dependencies:
  - python=3.10
  - scikit-learn=1.4
  - pandas=2.2
  - pip:
      - azure-ai-ml==1.15.0
      - mlflow==2.12.0
```
```python
from azure.ai.ml.entities import Environment

env = Environment(
    name="churn-training",
    version="3",
    conda_file="conda.yaml",
    image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu22.04:latest",
    description="Churn prediction training environment v3"
)
ml_client.environments.create_or_update(env)
```
What’s happening:
- The conda YAML pins each dependency version (use exact `==x.y.z` pins for full reproducibility) — no surprises between runs
- The Docker image provides the OS base, and conda installs Python packages on top
- Version “3” means you can roll back to v1 or v2 if something breaks
### Scenario: Dr. Luca pins his genomics environment
Dr. Luca Bianchi at GenomeVault ran into a nightmare: a scikit-learn update changed how a model handled missing values, silently changing prediction accuracy by 2%. The model passed validation but produced different results in production.
His fix: pin every package version in conda.yaml and version the environment. Now every experiment references genomics-env:v7, and if Prof. Sarah Lin asks “can you reproduce last month’s results?” — the answer is always yes.
Lesson: Curated environments are great for prototyping, but production workloads need custom environments with pinned versions.
## Components: reusable pipeline building blocks
A component is a self-contained piece of ML code with defined inputs, outputs, and an environment. Think of it as a function that can be plugged into different pipelines.
```yaml
# component.yaml — a data preparation step
$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: prepare_churn_data
version: "1"
display_name: Prepare Churn Data
type: command
inputs:
  raw_data:
    type: uri_folder
outputs:
  cleaned_data:
    type: uri_folder
code: ./src
environment: azureml:churn-training:3
command: >-
  python prepare.py
  --input ${{inputs.raw_data}}
  --output ${{outputs.cleaned_data}}
```
What’s happening:
- Lines 7-12: Declares typed inputs and outputs — the pipeline knows what this component consumes and produces
- Line 13: Points to the source code directory
- Line 14: References a specific environment version
- Lines 15-18: The actual command, with input/output paths injected by Azure ML
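To see how the component plugs into a pipeline, here is a sketch of a pipeline job YAML that runs it — the compute target name and input path are assumptions for illustration:

```yaml
# pipeline.yaml — wires the component into a pipeline job
$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json
type: pipeline
display_name: churn-data-prep
settings:
  default_compute: azureml:cpu-cluster
inputs:
  raw:
    type: uri_folder
    path: azureml:customer-churn-v2:2
jobs:
  prep:
    type: command
    component: ./component.yaml
    inputs:
      raw_data: ${{parent.inputs.raw}}
```

The pipeline binds its own `raw` input to the component's `raw_data` input; because both are typed `uri_folder`, Azure ML can validate the wiring before anything runs.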
## Registries: share across workspaces
An Azure ML registry is a central catalog for sharing assets (models, environments, components, data) across multiple workspaces. This is how you promote a validated component from dev to production without rebuilding it.
```shell
# Create a registry
az ml registry create \
  --name neuralspark-registry \
  --resource-group rg-ml-shared \
  --location eastus

# Share a component to the registry
az ml component create \
  --file component.yaml \
  --registry-name neuralspark-registry
```
### Exam tip: Registry vs workspace
A common exam scenario: “How do you share a trained model between the dev and production workspaces?”
Answer: Register the model in an Azure ML registry, then reference it from the production workspace. Registries are workspace-independent — they provide cross-workspace asset sharing with RBAC control.
Don’t confuse the two: the workspace model registry is local to a single workspace, while an Azure ML registry is a separate resource shared across workspaces.
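Once a component lives in a registry, a pipeline job in any workspace can reference it by registry URI instead of a local file — a sketch (the version number is illustrative):

```yaml
# Pipeline job step referencing a registry-hosted component
jobs:
  prep:
    type: command
    component: azureml://registries/neuralspark-registry/components/prepare_churn_data/versions/1
    inputs:
      raw_data: ${{parent.inputs.raw}}
```

The `azureml://registries/...` URI is what makes the reference workspace-independent: dev and production pipelines resolve to the exact same validated asset.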
## Knowledge check
1. Dr. Luca needs to ensure that his genomics training pipeline uses exactly the same Python packages every time it runs, even months later. What should he use?
2. Kai's NeuralSpark team has a data preparation component that works perfectly in the dev workspace. He needs to use the same component in the production workspace without rebuilding it. What should he use?
3. Dr. Fatima at Meridian Financial needs to connect a workspace to a storage account without storing any credentials. What is the recommended approach?
Next up: Compute Targets — choosing the right engine for training vs inference.