PromptOps: Design, Compare, Version & Ship
Prompts are code — treat them like code. Learn to design effective prompts, create variants, compare performance, and manage versions with Git for production GenAI.
Why prompts need engineering discipline
Prompts are recipes. Treat them like recipes.
A great chef doesn’t just throw ingredients in a pot — they write down exact amounts, steps, and timing. When the dish tastes amazing, they save the recipe. When they want to try a variation (less salt, more garlic), they write a new version and taste-test both.
Version control is the recipe book with dates — so you know which version customers loved last month.
Prompt variants are recipe tweaks — same dish, slightly different approach. You compare them to find the best one.
CI/CD for prompts is like a restaurant chain ensuring every location uses the same tested recipe — not the chef’s improvisation.
Prompt engineering patterns
The exam tests three core patterns for designing effective prompts:
| Feature | What It Does | When to Use | Token Cost |
|---|---|---|---|
| System Prompt | Sets the model's persona, rules, and output format | Every production prompt — defines boundaries and behaviour | Low — sent once per conversation |
| Few-Shot Examples | Provides input-output examples to guide the model | When the task has a specific format or when zero-shot quality is poor | Medium — each example adds tokens |
| Chain-of-Thought | Asks the model to reason step-by-step before answering | Complex reasoning, math, multi-step analysis | High — generates more output tokens |
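The three patterns compose naturally in a single chat request. A minimal sketch, assuming an OpenAI-style `messages` list (the message schema and wording here are illustrative, not prescribed by the exam):

```python
# Sketch: the three prompt patterns combined in one OpenAI-style chat request.
# The message schema is illustrative; adapt it to your SDK.

def build_messages(user_input: str) -> list[dict]:
    return [
        # System prompt: persona, rules, output format (low token cost, sent once)
        {"role": "system", "content": (
            "You are a financial document analyst. Extract loan details as JSON. "
            "Use 'NOT_FOUND' for missing fields. Respond with JSON only."
        )},
        # Few-shot example: one input/output pair that anchors the format
        {"role": "user", "content": "Jane Doe borrows $250,000 at 3.9% for 15 years."},
        {"role": "assistant", "content": (
            '{"borrower": "Jane Doe", "loan_amount": "$250,000", '
            '"interest_rate": "3.9%", "term_months": 180, "collateral": "NOT_FOUND"}'
        )},
        # Chain-of-thought: a step-by-step instruction appended to the real request
        {"role": "user", "content": user_input + "\nThink step by step, then output only the JSON."},
    ]

msgs = build_messages("Acme Corp borrows $1,000,000 at 5% for 10 years.")
print([m["role"] for m in msgs])  # ['system', 'user', 'assistant', 'user']
```

Note the token trade-off from the table: every few-shot pair and reasoning instruction you add is paid for on every request.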
Designing a production system prompt
```text
SYSTEM PROMPT — Loan Document Summariser v2.3

You are a financial document analyst at Meridian Financial.

TASK: Summarise the provided loan document into a structured summary.

RULES:
- Extract: borrower name, loan amount, interest rate, term, collateral
- Format output as JSON with exactly these fields
- If a field is not found in the document, use "NOT_FOUND" — never guess
- Do not include any information not present in the source document
- Respond only with the JSON — no explanations or commentary

OUTPUT FORMAT:
{
  "borrower": "...",
  "loan_amount": "...",
  "interest_rate": "...",
  "term_months": ...,
  "collateral": "..."
}
```
What’s happening:
- Persona (line 3): Sets the context — the model acts as a financial analyst
- Task (line 5): Clear, single instruction
- Rules (lines 7-11): Explicit constraints — what to do, what not to do, how to handle missing data
- Output format (lines 13-19): Exact structure expected — reduces parsing errors in downstream code
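Because the system prompt pins the output to exact JSON fields, downstream code can enforce the contract. A hedged sketch of such a validator (field names are taken from the prompt above; the function name is hypothetical):

```python
import json

# Hypothetical validator enforcing the contract set by the system prompt.
REQUIRED_FIELDS = {"borrower", "loan_amount", "interest_rate", "term_months", "collateral"}

def validate_summary(raw: str) -> dict:
    """Parse the model's reply and reject anything that breaks the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

reply = ('{"borrower": "NOT_FOUND", "loan_amount": "$500,000", '
         '"interest_rate": "4.5%", "term_months": 360, "collateral": "123 Main St"}')
print(validate_summary(reply)["borrower"])  # NOT_FOUND
```

Pair this with a retry: if validation fails, re-ask the model rather than passing bad JSON downstream.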
Exam tip: Prompt design principles
The exam tests prompt design best practices:
- Be specific — “Summarise this document” is worse than “Extract borrower name, loan amount, and interest rate as JSON”
- Set boundaries — tell the model what NOT to do (no guessing, no extra commentary)
- Define output format — JSON, markdown table, numbered list. Specify exactly
- Handle edge cases — what should the model do when data is missing? Say so explicitly
- Minimize ambiguity — if two people could interpret the prompt differently, it’s too vague
Creating prompt variants
Test different approaches to find the best one:
```python
# Define prompt variants for A/B testing
VARIANTS = {
    "v1_basic": {
        "system": "Summarise the loan document. Return JSON.",
        "description": "Minimal instruction — tests if the model infers structure",
    },
    "v2_structured": {
        "system": """You are a financial document analyst.
Extract: borrower, loan_amount, interest_rate, term_months, collateral.
Return JSON only. Use 'NOT_FOUND' for missing fields.""",
        "description": "Structured with explicit field list and missing-data handling",
    },
    "v3_few_shot": {
        "system": """You are a financial document analyst.
Extract loan details as JSON.

Example input: 'John Smith borrows $500,000 at 4.5% for 30 years, secured by 123 Main St.'
Example output: {"borrower": "John Smith", "loan_amount": "$500,000", "interest_rate": "4.5%", "term_months": 360, "collateral": "123 Main St"}

Use 'NOT_FOUND' for missing fields. Return JSON only.""",
        "description": "Few-shot example showing exact expected format",
    },
}
```
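Each entry in `VARIANTS` can then be scored against the same labelled dataset. A sketch, with `call_model` as a placeholder for your SDK call (e.g. an Azure OpenAI chat completion) and the dataset schema as an assumption:

```python
import json

def evaluate_variant(system_prompt: str, dataset: list, call_model) -> dict:
    """Score one prompt variant on a labelled dataset.

    call_model(system_prompt, text) is a placeholder for your model call.
    Each dataset item is assumed to have 'input' (document text) and
    'expected' (a dict of ground-truth field values).
    """
    parsed = correct = total = 0
    for item in dataset:
        raw = call_model(system_prompt, item["input"])
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # malformed JSON hurts the parse rate
        parsed += 1
        for field, expected in item["expected"].items():
            total += 1
            correct += data.get(field) == expected
    return {
        "json_parse_rate": parsed / len(dataset),
        "field_accuracy": correct / total if total else 0.0,
    }

# Toy run with a fake model so the sketch is self-contained:
fake_model = lambda _system, _text: '{"borrower": "John Smith"}'
data = [{"input": "doc text", "expected": {"borrower": "John Smith"}}]
print(evaluate_variant("any system prompt", data, fake_model))
# {'json_parse_rate': 1.0, 'field_accuracy': 1.0}
```

Running this per variant produces exactly the kind of comparison table shown in the next section.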
What’s happening:
- v1_basic (lines 3-6): Minimal prompt — tests whether the model can figure out the task with little guidance
- v2_structured (lines 7-12): Explicit field list and instructions — more tokens but clearer expectations
- v3_few_shot (lines 13-23): Includes a worked example — highest token cost but strongest format guidance
Comparing variant performance
After running variants through an evaluation dataset, compare results:
| Metric | v1_basic | v2_structured | v3_few_shot |
|---|---|---|---|
| Field extraction accuracy | 72% | 91% | 96% |
| JSON parse success rate | 65% | 94% | 99% |
| Avg tokens per response | 180 | 120 | 110 |
| Avg latency | 1.2s | 0.9s | 0.8s |
| Cost per 1000 docs | $4.50 | $3.80 | $4.20 |
Analysis: v3_few_shot has the best quality (96% accuracy, 99% valid JSON) at slightly higher cost than v2. v1 has the shortest prompt but is unreliable — 35% of responses fail JSON parsing — and its verbose responses (180 tokens on average) actually make it the most expensive per document.
Decision: v3_few_shot for production — the roughly 10% cost increase over v2 is worth the 5-point accuracy gain and near-perfect JSON output.
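The decision rule can be made mechanical: filter variants by minimum quality thresholds, then pick the best survivor. A sketch (function name and thresholds are illustrative; metrics mirror the table above):

```python
def pick_winner(results: dict, min_accuracy: float = 0.90, min_parse_rate: float = 0.95):
    """Return the variant with highest accuracy among those passing both gates."""
    eligible = {
        name: m for name, m in results.items()
        if m["accuracy"] >= min_accuracy and m["parse_rate"] >= min_parse_rate
    }
    if not eligible:
        return None  # nothing ship-worthy; keep the current production prompt
    # Highest accuracy wins; lower cost breaks ties.
    return max(eligible, key=lambda n: (eligible[n]["accuracy"], -eligible[n]["cost"]))

results = {
    "v1_basic":      {"accuracy": 0.72, "parse_rate": 0.65, "cost": 4.50},
    "v2_structured": {"accuracy": 0.91, "parse_rate": 0.94, "cost": 3.80},
    "v3_few_shot":   {"accuracy": 0.96, "parse_rate": 0.99, "cost": 4.20},
}
print(pick_winner(results))  # v3_few_shot
```

With these gates, v2_structured is filtered out by its 94% parse rate before accuracy is even compared.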
Git version control for prompts
Store prompts in a structured Git repository:
```text
prompts/
  loan-summariser/
    system-prompt.md          # Current production prompt
    variants/
      v1-basic.md
      v2-structured.md
      v3-few-shot.md          # Winner — promoted to system-prompt.md
    evaluations/
      eval-2025-04-15.json    # Evaluation results
      eval-2025-05-01.json
    CHANGELOG.md              # History of changes and why
  email-classifier/
    system-prompt.md
    variants/
      v1-keyword.md
      v2-semantic.md
    evaluations/
      eval-2025-03-20.json
```
What’s happening:
- Each prompt task gets its own directory (loan-summariser, email-classifier)
- `system-prompt.md` is the current production prompt — what gets deployed
- `variants/` holds all tested alternatives — useful for future comparison
- `evaluations/` stores test results — proves why the current version was chosen
- `CHANGELOG.md` tracks history — “Changed from v2 to v3 on May 1 because v3 had 96% accuracy vs 91%”
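An entry in `CHANGELOG.md` might look like this (illustrative, using the evaluation numbers above):

```markdown
## 2025-05-01 — promote v3-few-shot to production

- Replaced system-prompt.md with variants/v3-few-shot.md
- Evidence: evaluations/eval-2025-05-01.json — field accuracy 96% (was 91%),
  JSON parse rate 99% (was 94%)
- Reviewed and approved via pull request
```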
Branching strategy for prompt changes
```text
main (production prompts)
 |
 +-- feature/loan-summariser-v4
 |     |-- Updated system prompt with new field: "loan_type"
 |     |-- Evaluation results show 94% accuracy
 |     +-- Pull request: reviewed by team, approved
 |
 +-- feature/email-classifier-v2
       |-- Switched from keyword matching to semantic classification
       |-- Evaluation results show 15% improvement
       +-- Pull request: pending review
```
Pull request reviews for prompts should check:
- Does the evaluation show improvement (or at least no regression)?
- Are edge cases handled (missing data, unusual formats)?
- Is the output format unchanged (or is downstream code updated)?
- Are safety constraints maintained?
CI/CD for prompt deployment
```yaml
# .github/workflows/prompt-deploy.yml
name: Prompt CI/CD

on:
  push:
    branches: [main]
    paths: ['prompts/**']
  pull_request:
    paths: ['prompts/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Run Prompt Evaluation
        run: |
          python scripts/evaluate_prompt.py \
            --prompt prompts/loan-summariser/system-prompt.md \
            --dataset evaluations/loan-test-set.jsonl \
            --output evaluations/results.json

      - name: Check Quality Gate
        run: |
          python scripts/check_quality.py \
            --results evaluations/results.json \
            --min-accuracy 0.90 \
            --min-json-parse-rate 0.95

  deploy:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Deploy Updated Prompts
        run: |
          python scripts/deploy_prompts.py \
            --environment production \
            --source prompts/
```
What’s happening:
- Lines 5-8: Triggers on changes to any file in the `prompts/` directory
- Lines 11-13: On pull requests, run evaluation — don’t deploy
- Lines 24-28: Runs the prompt against a test dataset and generates evaluation results
- Lines 30-34: Quality gate — fails the PR if accuracy drops below 90% or JSON parse rate below 95%
- Lines 36-50: On merge to main, deploys the updated prompts to production
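The workflow invokes `scripts/check_quality.py` but doesn’t show it. A minimal sketch of what it might contain (the CLI flags match the workflow invocation; the results-file schema is an assumption):

```python
import argparse
import json

def check_quality(argv=None) -> int:
    """Return 0 if the evaluation passes both gates, 1 otherwise (CI-friendly)."""
    parser = argparse.ArgumentParser(description="Quality gate for prompt evaluations.")
    parser.add_argument("--results", required=True)
    parser.add_argument("--min-accuracy", type=float, default=0.90)
    parser.add_argument("--min-json-parse-rate", type=float, default=0.95)
    args = parser.parse_args(argv)

    with open(args.results) as f:
        metrics = json.load(f)  # assumed schema: {"accuracy": ..., "json_parse_rate": ...}

    failures = []
    if metrics["accuracy"] < args.min_accuracy:
        failures.append(f"accuracy {metrics['accuracy']} < {args.min_accuracy}")
    if metrics["json_parse_rate"] < args.min_json_parse_rate:
        failures.append(f"json_parse_rate {metrics['json_parse_rate']} < {args.min_json_parse_rate}")

    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0
```

Wire it up with `sys.exit(check_quality())` so a failing gate returns a non-zero exit code and fails the CI step.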
Scenario: Zara versions client-specific prompts at Atlas
Atlas Consulting has different prompts for each client engagement. Zara’s challenge: 15 clients, each with customised system prompts for their industry.
Zara’s Git structure:
- `prompts/client-alpha/` — financial services prompts
- `prompts/client-beta/` — healthcare prompts
- `prompts/client-gamma/` — retail prompts
Each client directory has the same structure (system-prompt.md, variants/, evaluations/). When a consultant proposes a prompt change for Client Alpha:
1. Create branch: `feature/alpha-prompt-v3`
2. Edit `prompts/client-alpha/variants/v3-improved.md`
3. CI pipeline evaluates the variant against Client Alpha’s test dataset
4. Quality gate passes (accuracy went from 88% to 93%)
5. Pull request reviewed by Marcus Webb
6. Merge to main — deployed to Client Alpha’s Foundry project automatically
No consultant can accidentally push an untested prompt to production.
Scenario: Kai A/B tests prompt variants for NeuralSpark
Kai wants to improve NeuralSpark’s customer support bot response quality. Current prompt scores 3.8/5 on helpfulness.
Kai’s approach:
1. Creates three variants:
   - v1 (current): Basic instruction with persona
   - v2: Adds chain-of-thought — “Think step-by-step about the customer’s issue before responding”
   - v3: Adds few-shot examples of ideal support responses
2. A/B test setup: 60% v1, 20% v2, 20% v3
3. After 1 week (5,000 conversations):
   - v1: 3.8/5 helpfulness, $0.02/conversation
   - v2: 4.3/5 helpfulness, $0.04/conversation (more tokens from reasoning)
   - v3: 4.1/5 helpfulness, $0.03/conversation
4. Decision: v2 wins on quality. The 2x cost increase is acceptable for a support bot where better responses reduce escalations to human agents.
5. Progressive rollout: 80/20, then 100% to v2. Old variants archived in Git.
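Kai’s traffic split can be sketched with weighted random assignment (variant names and weights are from the scenario; in production you would usually hash a stable user ID instead, so each user keeps the same variant across a session):

```python
import random

# Weighted assignment for the A/B test: 60% v1, 20% v2, 20% v3.
WEIGHTS = {"v1": 0.60, "v2": 0.20, "v3": 0.20}

def assign_variant(rng: random.Random) -> str:
    """Pick a variant for one conversation according to the traffic split."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]

rng = random.Random(42)  # seeded so the demo is reproducible
sample = [assign_variant(rng) for _ in range(10_000)]
print(sample.count("v1") / len(sample))  # roughly 0.60 by construction
```

Logging the assigned variant alongside each helpfulness score is what makes the per-variant comparison in step 3 possible.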
Knowledge check
Zara's team at Atlas Consulting has a prompt that scores 88% accuracy for Client Alpha. A consultant proposes a change they believe will improve it. What is the correct PromptOps workflow?
Kai is comparing three prompt variants for NeuralSpark's support bot. Variant A scores 3.8/5 helpfulness at $0.02/conversation. Variant B scores 4.3/5 at $0.04/conversation. Variant C scores 4.1/5 at $0.03/conversation. Better support responses reduce escalations to human agents (which cost $15 each). Which variant should Kai choose?
Next up: Evaluation — measuring whether your GenAI solution actually works.