AI-300 Study Guide

Domain 1: Design and Implement an MLOps Infrastructure

  • ML Workspace: Your AI Control Room Free
  • Data, Environments & Components
  • Compute Targets: Choosing the Right Engine
  • Infrastructure as Code: Provisioning at Scale
  • Git & CI/CD for ML Projects

Domain 2: Implement Machine Learning Model Lifecycle and Operations

  • MLflow: Track Every Experiment Free
  • AutoML & Hyperparameter Tuning
  • Training Pipelines: Automate Everything
  • Distributed Training: Scale to Big Data
  • Model Registration & Versioning
  • Model Approval & Responsible AI Gates
  • Deploying Models: Endpoints in Production
  • Drift, Monitoring & Retraining

Domain 3: Design and Implement a GenAIOps Infrastructure

  • Foundry: Hubs, Projects & Platform Setup Free
  • Network Security & IaC for Foundry
  • Deploying Foundation Models
  • Model Versioning & Production Strategies
  • PromptOps: Design, Compare, Version & Ship

Domain 4: Implement Generative AI Quality Assurance and Observability

  • Evaluation: Datasets, Metrics & Quality Gates Free
  • Safety Evaluations & Custom Metrics
  • Monitoring GenAI in Production
  • Cost Tracking, Logging & Debugging

Domain 5: Optimize Generative AI Systems and Model Performance

  • RAG Optimization: Better Retrieval, Better Answers Free
  • Embeddings & Hybrid Search
  • Fine-Tuning: Methods, Data & Production


Domain 3: Design and Implement a GenAIOps Infrastructure (~15 min read)

PromptOps: Design, Compare, Version & Ship

Prompts are code — treat them like code. Learn to design effective prompts, create variants, compare performance, and manage versions with Git for production GenAI.

Why prompts need engineering discipline

☕ Simple explanation

Prompts are recipes. Treat them like recipes.

A great chef doesn’t just throw ingredients in a pot — they write down exact amounts, steps, and timing. When the dish tastes amazing, they save the recipe. When they want to try a variation (less salt, more garlic), they write a new version and taste-test both.

Version control is the recipe book with dates — so you know which version customers loved last month.

Prompt variants are recipe tweaks — same dish, slightly different approach. You compare them to find the best one.

CI/CD for prompts is like a restaurant chain ensuring every location uses the same tested recipe — not the chef’s improvisation.

In production GenAI systems, prompts are the primary interface between your application logic and the foundation model. A small change to a system prompt can dramatically alter output quality, safety, and cost. PromptOps applies software engineering practices to prompt management:

  • Version control — store prompts in Git, track every change, review through pull requests
  • Variants — create and test multiple prompt approaches systematically
  • Evaluation — compare variants using metrics (covered in depth in Domain 4)
  • CI/CD — automate testing and deployment of prompt changes through pipelines

This is PromptOps — treating prompts with the same rigour as application code.

Prompt engineering patterns

The exam tests three core patterns for designing effective prompts:

Core prompt engineering patterns:

  • System Prompt: sets the model's persona, rules, and output format. Use in every production prompt to define boundaries and behaviour. Token cost: low (sent once per conversation).
  • Few-Shot Examples: provide input-output examples that guide the model. Use when the task has a specific format or when zero-shot quality is poor. Token cost: medium (each example adds tokens).
  • Chain-of-Thought: asks the model to reason step-by-step before answering. Use for complex reasoning, math, and multi-step analysis. Token cost: high (generates more output tokens).
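In chat-style APIs, these three patterns map onto the request's messages list. A schematic sketch, assuming OpenAI-style role names and purely illustrative content:

```python
# Schematic chat request combining all three patterns (roles assumed OpenAI-style).
messages = [
    # System prompt: persona, rules, and output format.
    {"role": "system", "content": "You are a financial document analyst. Return JSON only."},
    # Few-shot example: one input-output pair showing the expected format.
    {"role": "user", "content": "Jane Doe borrows $100,000 at 5% for 10 years."},
    {"role": "assistant", "content": '{"borrower": "Jane Doe", "loan_amount": "$100,000"}'},
    # The real request, with a chain-of-thought nudge for complex documents.
    {"role": "user", "content": "Summarise this loan document. Think step-by-step first."},
]
```

Few-shot pairs sit in the history as if the model had already answered once, which is usually a stronger format signal than describing the format in prose.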

Designing a production system prompt

SYSTEM PROMPT — Loan Document Summariser v2.3

You are a financial document analyst at Meridian Financial.

TASK: Summarise the provided loan document into a structured summary.

RULES:
- Extract: borrower name, loan amount, interest rate, term, collateral
- Format output as JSON with exactly these fields
- If a field is not found in the document, use "NOT_FOUND" — never guess
- Do not include any information not present in the source document
- Respond only with the JSON — no explanations or commentary

OUTPUT FORMAT:
{
  "borrower": "...",
  "loan_amount": "...",
  "interest_rate": "...",
  "term_months": ...,
  "collateral": "..."
}

What’s happening:

  • Persona: sets the context — the model acts as a financial analyst
  • Task: a single, clear instruction
  • Rules: explicit constraints covering what to do, what not to do, and how to handle missing data
  • Output format: the exact structure expected, which reduces parsing errors in downstream code
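Downstream code should still validate the reply before trusting it. A minimal sketch — the field names come from the prompt above; everything else is illustrative:

```python
import json

# Fields the system prompt requires in every response.
REQUIRED_FIELDS = {"borrower", "loan_amount", "interest_rate", "term_months", "collateral"}

def parse_summary(raw: str) -> dict:
    """Parse the model's JSON reply and verify all expected fields are present."""
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"response missing fields: {sorted(missing)}")
    return data
```

Rejecting malformed replies at this boundary keeps bad model output from propagating into the rest of the application.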

💡 Exam tip: Prompt design principles

The exam tests prompt design best practices:

  • Be specific — “Summarise this document” is worse than “Extract borrower name, loan amount, and interest rate as JSON”
  • Set boundaries — tell the model what NOT to do (no guessing, no extra commentary)
  • Define output format — JSON, markdown table, numbered list. Specify exactly
  • Handle edge cases — what should the model do when data is missing? Say so explicitly
  • Minimize ambiguity — if two people could interpret the prompt differently, it’s too vague

Creating prompt variants

Test different approaches to find the best one:

# Define prompt variants for A/B testing
VARIANTS = {
    "v1_basic": {
        "system": "Summarise the loan document. Return JSON.",
        "description": "Minimal instruction — tests if the model infers structure"
    },
    "v2_structured": {
        "system": """You are a financial document analyst.
Extract: borrower, loan_amount, interest_rate, term_months, collateral.
Return JSON only. Use 'NOT_FOUND' for missing fields.""",
        "description": "Structured with explicit field list and missing-data handling"
    },
    "v3_few_shot": {
        "system": """You are a financial document analyst.
Extract loan details as JSON.

Example input: 'John Smith borrows $500,000 at 4.5% for 30 years, secured by 123 Main St.'
Example output: {"borrower": "John Smith", "loan_amount": "$500,000", "interest_rate": "4.5%", "term_months": 360, "collateral": "123 Main St"}

Use 'NOT_FOUND' for missing fields. Return JSON only.""",
        "description": "Few-shot example showing exact expected format"
    },
}

What’s happening:

  • v1_basic: minimal prompt — tests whether the model can figure out the task with little guidance
  • v2_structured: explicit field list and instructions — more tokens but clearer expectations
  • v3_few_shot: includes a worked example — highest token cost but strongest format guidance
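A minimal harness for running such variants over a test set might look like the following sketch, where call_model is a placeholder for whatever client you use and the only metric computed is JSON parse success:

```python
import json

def json_parse_rate(variants: dict, test_inputs: list, call_model) -> dict:
    """For each variant, return the fraction of responses that parse as valid JSON."""
    rates = {}
    for name, cfg in variants.items():
        ok = 0
        for doc in test_inputs:
            raw = call_model(system=cfg["system"], user=doc)  # hypothetical client call
            try:
                json.loads(raw)
                ok += 1
            except json.JSONDecodeError:
                pass
        rates[name] = ok / len(test_inputs)
    return rates
```

Accuracy metrics work the same way but need labelled expected outputs — evaluation datasets and metrics are covered in depth in Domain 4.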

Comparing variant performance

After running variants through an evaluation dataset, compare results:

Metric                      v1_basic   v2_structured   v3_few_shot
Field extraction accuracy   72%        91%             96%
JSON parse success rate     65%        94%             99%
Avg tokens per response     180        120             110
Avg latency                 1.2s       0.9s            0.8s
Cost per 1000 docs          $4.50      $3.80           $4.20

Analysis: v3_few_shot delivers the best quality (96% accuracy, 99% valid JSON) at a slightly higher cost than v2. v1 is unreliable despite its minimal prompt: its longer, unguided responses make it the most expensive per document, and 35% of responses fail JSON parsing.

Decision: v3_few_shot for production. The roughly 10% cost increase over v2 is worth the 5-point accuracy gain and near-perfect JSON output.

Git version control for prompts

Store prompts in a structured Git repository:

prompts/
  loan-summariser/
    system-prompt.md          # Current production prompt
    variants/
      v1-basic.md
      v2-structured.md
      v3-few-shot.md          # Winner — promoted to system-prompt.md
    evaluations/
      eval-2025-04-15.json    # Evaluation results
      eval-2025-05-01.json
    CHANGELOG.md              # History of changes and why
  email-classifier/
    system-prompt.md
    variants/
      v1-keyword.md
      v2-semantic.md
    evaluations/
      eval-2025-03-20.json

What’s happening:

  • Each prompt task gets its own directory (loan-summariser, email-classifier)
  • system-prompt.md is the current production prompt — what gets deployed
  • variants/ holds all tested alternatives — useful for future comparison
  • evaluations/ stores test results — proves why the current version was chosen
  • CHANGELOG.md tracks history — "Changed from v2 to v3 on May 1 because v3 had 96% accuracy vs 91%"
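With that layout, application code can load the production prompt straight from the repository. A sketch assuming the directory names shown above:

```python
from pathlib import Path

def load_prompt(task: str, root: Path = Path("prompts")) -> str:
    """Return the current production system prompt for a task, e.g. 'loan-summariser'."""
    return (root / task / "system-prompt.md").read_text(encoding="utf-8")
```

Because everything reads from system-prompt.md, promoting a winning variant is just a reviewed file change — no application redeploy required.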

Branching strategy for prompt changes

main (production prompts)
  |
  +-- feature/loan-summariser-v4
  |     |-- Updated system prompt with new field: "loan_type"
  |     |-- Evaluation results show 94% accuracy
  |     +-- Pull request: reviewed by team, approved
  |
  +-- feature/email-classifier-v2
        |-- Switched from keyword matching to semantic classification
        |-- Evaluation results show 15% improvement
        +-- Pull request: pending review

Pull request reviews for prompts should check:

  • Does the evaluation show improvement (or at least no regression)?
  • Are edge cases handled (missing data, unusual formats)?
  • Is the output format unchanged (or is downstream code updated)?
  • Are safety constraints maintained?

CI/CD for prompt deployment

# .github/workflows/prompt-deploy.yml
name: Prompt CI/CD
on:
  push:
    branches: [main]
    paths: ['prompts/**']
  pull_request:
    paths: ['prompts/**']

permissions:
  id-token: write   # required for the OIDC-based azure/login below
  contents: read

jobs:
  evaluate:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Run Prompt Evaluation
        run: |
          python scripts/evaluate_prompt.py \
            --prompt prompts/loan-summariser/system-prompt.md \
            --dataset evaluations/loan-test-set.jsonl \
            --output evaluations/results.json

      - name: Check Quality Gate
        run: |
          python scripts/check_quality.py \
            --results evaluations/results.json \
            --min-accuracy 0.90 \
            --min-json-parse-rate 0.95

  deploy:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Deploy Updated Prompts
        run: |
          python scripts/deploy_prompts.py \
            --environment production \
            --source prompts/

What’s happening:

  • The on: triggers fire whenever a file under prompts/ changes, on both pushes to main and pull requests
  • The evaluate job runs only on pull requests — it tests, it never deploys
  • The evaluation step runs the changed prompt against a test dataset and writes results to JSON
  • The quality-gate step fails the PR if accuracy drops below 90% or the JSON parse rate below 95%
  • The deploy job runs only on pushes to main, deploying the updated prompts to production
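The check_quality.py script itself isn't shown in the workflow. One plausible sketch of that gate, with threshold parameters matching the workflow's flags and an assumed results-file schema:

```python
def gate(results: dict, min_accuracy: float = 0.90, min_json_parse_rate: float = 0.95) -> list:
    """Return failure messages; an empty list means the quality gate passes."""
    failures = []
    if results["accuracy"] < min_accuracy:
        failures.append(f"accuracy {results['accuracy']:.2f} below {min_accuracy:.2f}")
    if results["json_parse_rate"] < min_json_parse_rate:
        failures.append(
            f"json_parse_rate {results['json_parse_rate']:.2f} below {min_json_parse_rate:.2f}"
        )
    return failures

# The real script would parse --results / --min-accuracy / --min-json-parse-rate,
# print each failure, and exit non-zero so the CI step fails the pull request.
```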

Scenario: Zara versions client-specific prompts at Atlas

Atlas Consulting has different prompts for each client engagement. Zara’s challenge: 15 clients, each with customised system prompts for their industry.

Zara’s Git structure:

  • prompts/client-alpha/ — financial services prompts
  • prompts/client-beta/ — healthcare prompts
  • prompts/client-gamma/ — retail prompts

Each client directory has the same structure (system-prompt.md, variants/, evaluations/). When a consultant proposes a prompt change for Client Alpha:

  1. Create branch: feature/alpha-prompt-v3
  2. Edit prompts/client-alpha/variants/v3-improved.md
  3. CI pipeline evaluates the variant against Client Alpha’s test dataset
  4. Quality gate passes (accuracy went from 88% to 93%)
  5. Pull request reviewed by Marcus Webb
  6. Merge to main — deployed to Client Alpha’s Foundry project automatically

No consultant can accidentally push an untested prompt to production.

Scenario: Kai A/B tests prompt variants for NeuralSpark

Kai wants to improve NeuralSpark’s customer support bot response quality. Current prompt scores 3.8/5 on helpfulness.

Kai’s approach:

  1. Creates three variants:

    • v1 (current): Basic instruction with persona
    • v2: Adds chain-of-thought — “Think step-by-step about the customer’s issue before responding”
    • v3: Adds few-shot examples of ideal support responses
  2. A/B test setup: 60% v1, 20% v2, 20% v3

  3. After 1 week (5,000 conversations):

    • v1: 3.8/5 helpfulness, $0.02/conversation
    • v2: 4.3/5 helpfulness, $0.04/conversation (more tokens from reasoning)
    • v3: 4.1/5 helpfulness, $0.03/conversation
  4. Decision: v2 wins on quality. The 2x cost increase is acceptable for a support bot where better responses reduce escalations to human agents.

  5. Progressive rollout: 80/20, then 100% to v2. Old variants archived in Git.
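Kai's cost trade-off can be sanity-checked with quick arithmetic; the per-conversation costs come from the A/B results above, and the $15 escalation cost is the assumed price of a human hand-off:

```python
# Extra model cost of v2 over v1, per conversation (from the A/B test).
extra_cost = 0.04 - 0.02
# Assumed cost of one escalation to a human agent.
escalation_cost = 15.00
# v2 breaks even if it prevents at least this fraction of conversations escalating:
breakeven_fraction = extra_cost / escalation_cost
print(f"{breakeven_fraction:.4%}")  # prints 0.1333%, i.e. ~1.3 avoided escalations per 1000 chats
```

Anything above roughly one avoided escalation per thousand conversations makes v2 the cheaper option overall, before even counting the quality gain.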

Key terms flashcards

Question

What is PromptOps?


Answer

PromptOps applies software engineering practices to prompt management — version control (Git), testing (evaluation datasets), review (pull requests), and deployment (CI/CD pipelines). Prompts are treated as code artifacts, not ad-hoc text.


Question

What are prompt variants?


Answer

Different versions of a prompt designed to accomplish the same task. Variants might differ in instruction style (basic vs structured), examples (zero-shot vs few-shot), or reasoning approach (direct vs chain-of-thought). They are compared using evaluation metrics to find the best performer.


Question

Why store prompts in Git?


Answer

Git provides: change history (who changed what, when, and why), rollback (revert to any previous version instantly), code review (pull requests for prompt changes), branching (test changes without affecting production), and CI/CD integration (automated evaluation and deployment).


Question

What is a quality gate for prompts in CI/CD?


Answer

An automated check that runs during a pull request. It evaluates the changed prompt against a test dataset and fails the PR if quality metrics drop below defined thresholds (e.g., accuracy below 90%, JSON parse rate below 95%). Prevents regressions from reaching production.


Question

What is chain-of-thought prompting?


Answer

A technique where you instruct the model to reason step-by-step before giving a final answer. Improves accuracy on complex tasks (math, multi-step reasoning) but increases output tokens and cost. Example: 'Think step-by-step about the customer issue, then provide your recommendation.'


Knowledge check

  1. Zara's team at Atlas Consulting has a prompt that scores 88% accuracy for Client Alpha. A consultant proposes a change they believe will improve it. What is the correct PromptOps workflow?

  2. Kai is comparing three prompt variants for NeuralSpark's support bot. Variant A scores 3.8/5 helpfulness at $0.02/conversation. Variant B scores 4.3/5 at $0.04/conversation. Variant C scores 4.1/5 at $0.03/conversation. Better support responses reduce escalations to human agents (which cost $15 each). Which variant should Kai choose?



Next up: Evaluation — measuring whether your GenAI solution actually works.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.