PromptOps: Design, Compare, Version & Ship
Prompts are code — treat them like code. Learn to design effective prompts, create variants, compare performance, and manage versions with Git for production GenAI.
Why prompts need engineering discipline
Prompts are recipes. Treat them like recipes.
A great chef doesn’t just throw ingredients in a pot — they write down exact amounts, steps, and timing. When the dish tastes amazing, they save the recipe. When they want to try a variation (less salt, more garlic), they write a new version and taste-test both.
Version control is the recipe book with dates — so you know which version customers loved last month.
Prompt variants are recipe tweaks — same dish, slightly different approach. You compare them to find the best one.
CI/CD for prompts is like a restaurant chain ensuring every location uses the same tested recipe — not the chef’s improvisation.
Prompt engineering patterns
The exam tests three core patterns for designing effective prompts:
| Feature | What It Does | When to Use | Token Cost |
|---|---|---|---|
| System Prompt | Sets the model's persona, rules, and output format | Every production prompt — defines boundaries and behaviour | Low — sent once per conversation |
| Few-Shot Examples | Provides input-output examples to guide the model | When the task has a specific format or when zero-shot quality is poor | Medium — each example adds tokens |
| Chain-of-Thought | Asks the model to reason step-by-step before answering | Complex reasoning, math, multi-step analysis | High — generates more output tokens |
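The three patterns compose naturally in a single chat request. A minimal sketch, assuming an OpenAI-style `messages` list (the message schema and wording here are illustrative, not prescribed by the exam):

```python
# Sketch: the three prompt patterns combined in one OpenAI-style chat request.
# The message schema is illustrative; adapt it to your SDK.

def build_messages(user_input: str) -> list[dict]:
    return [
        # System prompt: persona, rules, output format (low token cost, sent once)
        {"role": "system", "content": (
            "You are a financial document analyst. Extract loan details as JSON. "
            "Use 'NOT_FOUND' for missing fields. Respond with JSON only."
        )},
        # Few-shot example: one input/output pair that anchors the format
        {"role": "user", "content": "Jane Doe borrows $250,000 at 3.9% for 15 years."},
        {"role": "assistant", "content": (
            '{"borrower": "Jane Doe", "loan_amount": "$250,000", '
            '"interest_rate": "3.9%", "term_months": 180, "collateral": "NOT_FOUND"}'
        )},
        # Chain-of-thought: a step-by-step instruction appended to the real request
        {"role": "user", "content": user_input + "\nThink step by step, then output only the JSON."},
    ]

msgs = build_messages("Acme Corp borrows $1,000,000 at 5% for 10 years.")
print([m["role"] for m in msgs])  # ['system', 'user', 'assistant', 'user']
```

Note the token trade-off from the table: every few-shot pair and reasoning instruction you add is paid for on every request.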
Designing a production system prompt
```text
SYSTEM PROMPT — Loan Document Summariser v2.3

You are a financial document analyst at Meridian Financial.

TASK: Summarise the provided loan document into a structured summary.

RULES:
- Extract: borrower name, loan amount, interest rate, term, collateral
- Format output as JSON with exactly these fields
- If a field is not found in the document, use "NOT_FOUND" — never guess
- Do not include any information not present in the source document
- Respond only with the JSON — no explanations or commentary

OUTPUT FORMAT:
{
  "borrower": "...",
  "loan_amount": "...",
  "interest_rate": "...",
  "term_months": ...,
  "collateral": "..."
}
```
What’s happening:
- Persona (line 3): Sets the context — the model acts as a financial analyst
- Task (line 5): Clear, single instruction
- Rules (lines 7-11): Explicit constraints — what to do, what not to do, how to handle missing data
- Output format (lines 13-19): Exact structure expected — reduces parsing errors in downstream code
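Because the system prompt pins the output to exact JSON fields, downstream code can enforce the contract. A hedged sketch of such a validator (field names are taken from the prompt above; the function name is hypothetical):

```python
import json

# Hypothetical validator enforcing the contract set by the system prompt.
REQUIRED_FIELDS = {"borrower", "loan_amount", "interest_rate", "term_months", "collateral"}

def validate_summary(raw: str) -> dict:
    """Parse the model's reply and reject anything that breaks the contract."""
    data = json.loads(raw)  # raises json.JSONDecodeError (a ValueError) if malformed
    missing = REQUIRED_FIELDS - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return data

reply = ('{"borrower": "NOT_FOUND", "loan_amount": "$500,000", '
         '"interest_rate": "4.5%", "term_months": 360, "collateral": "123 Main St"}')
print(validate_summary(reply)["borrower"])  # NOT_FOUND
```

Pair this with a retry: if validation fails, re-ask the model rather than passing bad JSON downstream.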
Exam tip: Prompt design principles
The exam tests prompt design best practices:
- Be specific — “Summarise this document” is worse than “Extract borrower name, loan amount, and interest rate as JSON”
- Set boundaries — tell the model what NOT to do (no guessing, no extra commentary)
- Define output format — JSON, markdown table, numbered list. Specify exactly
- Handle edge cases — what should the model do when data is missing? Say so explicitly
- Minimize ambiguity — if two people could interpret the prompt differently, it’s too vague
Creating prompt variants
Test different approaches to find the best one:
```python
# Define prompt variants for A/B testing
VARIANTS = {
    "v1_basic": {
        "system": "Summarise the loan document. Return JSON.",
        "description": "Minimal instruction — tests if the model infers structure",
    },
    "v2_structured": {
        "system": """You are a financial document analyst.
Extract: borrower, loan_amount, interest_rate, term_months, collateral.
Return JSON only. Use 'NOT_FOUND' for missing fields.""",
        "description": "Structured with explicit field list and missing-data handling",
    },
    "v3_few_shot": {
        "system": """You are a financial document analyst.
Extract loan details as JSON.

Example input: 'John Smith borrows $500,000 at 4.5% for 30 years, secured by 123 Main St.'
Example output: {"borrower": "John Smith", "loan_amount": "$500,000", "interest_rate": "4.5%", "term_months": 360, "collateral": "123 Main St"}

Use 'NOT_FOUND' for missing fields. Return JSON only.""",
        "description": "Few-shot example showing exact expected format",
    },
}
```
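Each entry in `VARIANTS` can then be scored against the same labelled dataset. A sketch, with `call_model` as a placeholder for your SDK call (e.g. an Azure OpenAI chat completion) and the dataset schema as an assumption:

```python
import json

def evaluate_variant(system_prompt: str, dataset: list, call_model) -> dict:
    """Score one prompt variant on a labelled dataset.

    call_model(system_prompt, text) is a placeholder for your model call.
    Each dataset item is assumed to have 'input' (document text) and
    'expected' (a dict of ground-truth field values).
    """
    parsed = correct = total = 0
    for item in dataset:
        raw = call_model(system_prompt, item["input"])
        try:
            data = json.loads(raw)
        except ValueError:
            continue  # malformed JSON hurts the parse rate
        parsed += 1
        for field, expected in item["expected"].items():
            total += 1
            correct += data.get(field) == expected
    return {
        "json_parse_rate": parsed / len(dataset),
        "field_accuracy": correct / total if total else 0.0,
    }

# Toy run with a fake model so the sketch is self-contained:
fake_model = lambda _system, _text: '{"borrower": "John Smith"}'
data = [{"input": "doc text", "expected": {"borrower": "John Smith"}}]
print(evaluate_variant("any system prompt", data, fake_model))
# {'json_parse_rate': 1.0, 'field_accuracy': 1.0}
```

Running this per variant produces exactly the kind of comparison table shown in the next section.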
What’s happening:
- v1_basic (lines 3-6): Minimal prompt — tests whether the model can figure out the task with little guidance
- v2_structured (lines 7-12): Explicit field list and instructions — more tokens but clearer expectations
- v3_few_shot (lines 13-23): Includes a worked example — highest token cost but strongest format guidance
Comparing variant performance
After running variants through an evaluation dataset, compare results:
| Metric | v1_basic | v2_structured | v3_few_shot |
|---|---|---|---|
| Field extraction accuracy | 72% | 91% | 96% |
| JSON parse success rate | 65% | 94% | 99% |
| Avg tokens per response | 180 | 120 | 110 |
| Avg latency | 1.2s | 0.9s | 0.8s |
| Cost per 1000 docs | $4.50 | $3.80 | $4.20 |
Analysis: v3_few_shot has the best quality (96% accuracy, 99% valid JSON) at slightly higher cost than v2. v1 has the shortest prompt but is unreliable — 35% of responses fail JSON parsing — and its verbose responses (180 tokens on average) actually make it the most expensive per document.
Decision: v3_few_shot for production — the roughly 10% cost increase over v2 is worth the 5-point accuracy gain and near-perfect JSON output.
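The decision rule can be made mechanical: filter variants by minimum quality thresholds, then pick the best survivor. A sketch (function name and thresholds are illustrative; metrics mirror the table above):

```python
def pick_winner(results: dict, min_accuracy: float = 0.90, min_parse_rate: float = 0.95):
    """Return the variant with highest accuracy among those passing both gates."""
    eligible = {
        name: m for name, m in results.items()
        if m["accuracy"] >= min_accuracy and m["parse_rate"] >= min_parse_rate
    }
    if not eligible:
        return None  # nothing ship-worthy; keep the current production prompt
    # Highest accuracy wins; lower cost breaks ties.
    return max(eligible, key=lambda n: (eligible[n]["accuracy"], -eligible[n]["cost"]))

results = {
    "v1_basic":      {"accuracy": 0.72, "parse_rate": 0.65, "cost": 4.50},
    "v2_structured": {"accuracy": 0.91, "parse_rate": 0.94, "cost": 3.80},
    "v3_few_shot":   {"accuracy": 0.96, "parse_rate": 0.99, "cost": 4.20},
}
print(pick_winner(results))  # v3_few_shot
```

With these gates, v2_structured is filtered out by its 94% parse rate before accuracy is even compared.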
Git version control for prompts
Store prompts in a structured Git repository:
```text
prompts/
  loan-summariser/
    system-prompt.md          # Current production prompt
    variants/
      v1-basic.md
      v2-structured.md
      v3-few-shot.md          # Winner — promoted to system-prompt.md
    evaluations/
      eval-2025-04-15.json    # Evaluation results
      eval-2025-05-01.json
    CHANGELOG.md              # History of changes and why
  email-classifier/
    system-prompt.md
    variants/
      v1-keyword.md
      v2-semantic.md
    evaluations/
      eval-2025-03-20.json
```
What’s happening:
- Each prompt task gets its own directory (loan-summariser, email-classifier)
- `system-prompt.md` is the current production prompt — what gets deployed
- `variants/` holds all tested alternatives — useful for future comparison
- `evaluations/` stores test results — proves why the current version was chosen
- `CHANGELOG.md` tracks history — “Changed from v2 to v3 on May 1 because v3 had 96% accuracy vs 91%”
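An entry in `CHANGELOG.md` might look like this (illustrative, using the evaluation numbers above):

```markdown
## 2025-05-01 — promote v3-few-shot to production

- Replaced system-prompt.md with variants/v3-few-shot.md
- Evidence: evaluations/eval-2025-05-01.json — field accuracy 96% (was 91%),
  JSON parse rate 99% (was 94%)
- Reviewed and approved via pull request
```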
Branching strategy for prompt changes
```text
main (production prompts)
 |
 +-- feature/loan-summariser-v4
 |     |-- Updated system prompt with new field: "loan_type"
 |     |-- Evaluation results show 94% accuracy
 |     +-- Pull request: reviewed by team, approved
 |
 +-- feature/email-classifier-v2
       |-- Switched from keyword matching to semantic classification
       |-- Evaluation results show 15% improvement
       +-- Pull request: pending review
```
Pull request reviews for prompts should check:
- Does the evaluation show improvement (or at least no regression)?
- Are edge cases handled (missing data, unusual formats)?
- Is the output format unchanged (or is downstream code updated)?
- Are safety constraints maintained?
CI/CD for prompt deployment
```yaml
# .github/workflows/prompt-deploy.yml
name: Prompt CI/CD

on:
  push:
    branches: [main]
    paths: ['prompts/**']
  pull_request:
    paths: ['prompts/**']

jobs:
  evaluate:
    runs-on: ubuntu-latest
    if: github.event_name == 'pull_request'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Run Prompt Evaluation
        run: |
          python scripts/evaluate_prompt.py \
            --prompt prompts/loan-summariser/system-prompt.md \
            --dataset evaluations/loan-test-set.jsonl \
            --output evaluations/results.json

      - name: Check Quality Gate
        run: |
          python scripts/check_quality.py \
            --results evaluations/results.json \
            --min-accuracy 0.90 \
            --min-json-parse-rate 0.95

  deploy:
    runs-on: ubuntu-latest
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4

      - name: Azure Login
        uses: azure/login@v2
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Deploy Updated Prompts
        run: |
          python scripts/deploy_prompts.py \
            --environment production \
            --source prompts/
```
What’s happening:
- Lines 5-8: Triggers on changes to any file in the `prompts/` directory
- Lines 11-13: On pull requests, run evaluation — don’t deploy
- Lines 24-28: Runs the prompt against a test dataset and generates evaluation results
- Lines 30-34: Quality gate — fails the PR if accuracy drops below 90% or JSON parse rate below 95%
- Lines 36-50: On merge to main, deploys the updated prompts to production
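The workflow invokes `scripts/check_quality.py` but doesn’t show it. A minimal sketch of what it might contain (the CLI flags match the workflow invocation; the results-file schema is an assumption):

```python
import argparse
import json

def check_quality(argv=None) -> int:
    """Return 0 if the evaluation passes both gates, 1 otherwise (CI-friendly)."""
    parser = argparse.ArgumentParser(description="Quality gate for prompt evaluations.")
    parser.add_argument("--results", required=True)
    parser.add_argument("--min-accuracy", type=float, default=0.90)
    parser.add_argument("--min-json-parse-rate", type=float, default=0.95)
    args = parser.parse_args(argv)

    with open(args.results) as f:
        metrics = json.load(f)  # assumed schema: {"accuracy": ..., "json_parse_rate": ...}

    failures = []
    if metrics["accuracy"] < args.min_accuracy:
        failures.append(f"accuracy {metrics['accuracy']} < {args.min_accuracy}")
    if metrics["json_parse_rate"] < args.min_json_parse_rate:
        failures.append(f"json_parse_rate {metrics['json_parse_rate']} < {args.min_json_parse_rate}")

    for failure in failures:
        print(f"QUALITY GATE FAILED: {failure}")
    return 1 if failures else 0
```

Wire it up with `sys.exit(check_quality())` so a failing gate returns a non-zero exit code and fails the CI step.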
Scenario: Zara versions client-specific prompts at Atlas
Atlas Consulting has different prompts for each client engagement. Zara’s challenge: 15 clients, each with customised system prompts for their industry.
Zara’s Git structure:
- `prompts/client-alpha/` — financial services prompts
- `prompts/client-beta/` — healthcare prompts
- `prompts/client-gamma/` — retail prompts
Each client directory has the same structure (system-prompt.md, variants/, evaluations/). When a consultant proposes a prompt change for Client Alpha:
1. Create branch: `feature/alpha-prompt-v3`
2. Edit `prompts/client-alpha/variants/v3-improved.md`
3. CI pipeline evaluates the variant against Client Alpha’s test dataset
4. Quality gate passes (accuracy went from 88% to 93%)
5. Pull request reviewed by Marcus Webb
6. Merge to main — deployed to Client Alpha’s Foundry project automatically
No consultant can accidentally push an untested prompt to production.
Scenario: Kai A/B tests prompt variants for NeuralSpark
Kai wants to improve NeuralSpark’s customer support bot response quality. Current prompt scores 3.8/5 on helpfulness.
Kai’s approach:
1. Creates three variants:
   - v1 (current): Basic instruction with persona
   - v2: Adds chain-of-thought — “Think step-by-step about the customer’s issue before responding”
   - v3: Adds few-shot examples of ideal support responses
2. A/B test setup: 60% v1, 20% v2, 20% v3
3. After 1 week (5,000 conversations):
   - v1: 3.8/5 helpfulness, $0.02/conversation
   - v2: 4.3/5 helpfulness, $0.04/conversation (more tokens from reasoning)
   - v3: 4.1/5 helpfulness, $0.03/conversation
4. Decision: v2 wins on quality. The 2x cost increase is acceptable for a support bot where better responses reduce escalations to human agents.
5. Progressive rollout: 80/20, then 100% to v2. Old variants archived in Git.
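Kai’s traffic split can be sketched with weighted random assignment (variant names and weights are from the scenario; in production you would usually hash a stable user ID instead, so each user keeps the same variant across a session):

```python
import random

# Weighted assignment for the A/B test: 60% v1, 20% v2, 20% v3.
WEIGHTS = {"v1": 0.60, "v2": 0.20, "v3": 0.20}

def assign_variant(rng: random.Random) -> str:
    """Pick a variant for one conversation according to the traffic split."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]

rng = random.Random(42)  # seeded so the demo is reproducible
sample = [assign_variant(rng) for _ in range(10_000)]
print(sample.count("v1") / len(sample))  # roughly 0.60 by construction
```

Logging the assigned variant alongside each helpfulness score is what makes the per-variant comparison in step 3 possible.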
Knowledge check
Zara's team at Atlas Consulting has a prompt that scores 88% accuracy for Client Alpha. A consultant proposes a change they believe will improve it. What is the correct PromptOps workflow?
Kai is comparing three prompt variants for NeuralSpark's support bot. Variant A scores 3.8/5 helpfulness at $0.02/conversation. Variant B scores 4.3/5 at $0.04/conversation. Variant C scores 4.1/5 at $0.03/conversation. Better support responses reduce escalations to human agents (which cost $15 each). Which variant should Kai choose?
Next up: Evaluation — measuring whether your GenAI solution actually works.