Custom Model Validation and Prompt Best Practices
Create validation criteria for custom AI models and validate that Copilot prompts follow established best practices.
Custom Model Validation and Prompt Best Practices
Testing tells you “does it work?” Validation tells you “does it work well enough to trust in production?”
Think of it like a pilot’s licence. A test checks if you can fly the plane (take off, land, navigate). Validation checks if you can fly it safely under real conditions — in fog, with crosswinds, when an engine fails, with passengers on board. You might pass the basic test but fail validation because you can’t handle edge cases at acceptable safety margins.
For AI models, validation means defining what “good enough” looks like across multiple dimensions — accuracy, fairness, speed, safety — and proving the model meets those thresholds before it touches real users.
The Scenario
🏗️ Kai Mercer and data engineer Priya Sharma are building a custom defect classification model for Apex Industries. The model analyses images from the manufacturing line and classifies defects into 12 categories. Before it can go live, Apex’s CTO Lin Chen requires formal validation.
Priya knows the model’s overall accuracy is 94 percent. But “overall accuracy” hides problems. Is it 94 percent across all 12 defect types? Or is it 99 percent on common defects and 60 percent on rare but critical ones?
Validation Criteria for Custom AI Models
Validation isn’t a single number. It’s a multi-dimensional assessment:
| Criterion | What It Measures | Why It Matters | Threshold Example |
|---|---|---|---|
| Accuracy | Overall percentage of correct predictions | Baseline performance measure | Above 90 percent overall |
| Precision and Recall | Per-class correctness (precision) and coverage (recall) | Reveals hidden weaknesses in specific categories | Recall above 85 percent for ALL classes, not just the average |
| Latency | Time from input to prediction | Production systems need real-time or near-real-time responses | Under 500 milliseconds per prediction |
| Bias detection | Performance differences across demographic groups or data segments | Ensures fairness and prevents discriminatory outcomes | No more than 5 percent accuracy gap between segments |
| Robustness | Performance on noisy, incomplete, or adversarial inputs | Real-world data is messy | Accuracy drop under 10 percent on degraded inputs |
| Safety | Behaviour on out-of-distribution or harmful inputs | Model should fail gracefully, not confidently give wrong answers | 100 percent safe refusal on out-of-scope inputs |
Priya’s Validation Discovery
Priya runs the full validation suite. Overall accuracy: 94 percent. But when she breaks it down by defect type:
- Common defects (scratches, dents): 97 percent accuracy
- Rare defects (hairline cracks, material delamination): 79 percent accuracy
- Critical safety defects (structural fractures): 82 percent accuracy
The 15 percent accuracy drop on rare defect types is a problem. A structural fracture classified as a minor scratch could lead to a product recall — or worse, a safety incident. Priya flags this to Kai and Lin Chen. The model needs more training data for rare defects before it can pass validation.
Exam Tip: Validation is NOT the same as testing. Testing checks if the model works (functional correctness). Validation checks if it works WELL ENOUGH for production (meets quantitative thresholds across multiple criteria). The exam expects you to understand this distinction. If a question asks “what is the purpose of model validation,” the answer is about thresholds and production readiness — not just “checking if it works.”
Validation Approaches
Different approaches catch different types of issues. A robust validation strategy uses all three:
| Aspect | Automated Evaluation | Human Evaluation | Red-Teaming |
|---|---|---|---|
| How It Works | Scoring pipelines measure accuracy, latency, and bias on labelled datasets | Domain experts manually review model outputs for quality and correctness | Adversarial testers deliberately try to make the model fail or behave unsafely |
| Strengths | Fast, repeatable, covers large datasets | Catches subjective issues automated metrics miss | Reveals safety vulnerabilities and guardrail gaps |
| Weaknesses | Misses nuance — a technically correct answer can still be unhelpful | Slow, expensive, subjective across reviewers | Resource-intensive, requires skilled adversarial testers |
| When Required | Every validation cycle | Before production deployment and after major changes | Before initial deployment and periodically thereafter |
| Example | Run 10,000 test images through the defect classifier and measure per-class precision | Manufacturing engineers review 200 borderline classifications manually | Testers submit deliberately blurry, rotated, or partially obscured images |
Copilot Prompt Validation
Even when you’re using a foundation model (not custom-trained), the system prompt shapes everything. A bad prompt leads to bad outcomes regardless of model quality.
The Prompt Best Practices Checklist
Every Copilot system prompt should be validated against these criteria:
| Practice | What to Check | Red Flag |
|---|---|---|
| Clear instructions | Does the prompt clearly state the agent’s role, scope, and expected behaviour? | Vague instructions like “be helpful” without specifics |
| Grounding | Does the prompt direct the model to use specific knowledge sources? | No grounding reference — model relies only on training data |
| Output format | Does the prompt specify the expected response structure? | No format guidance — responses are inconsistent in length and style |
| Guardrails | Does the prompt define what the agent should NOT do? | No refusal instructions — agent may attempt anything asked |
| Few-shot examples | Does the prompt include example conversations showing correct behaviour? | No examples — model must guess the expected pattern |
| Tone and persona | Does the prompt establish a consistent voice? | No tone guidance — responses oscillate between formal and casual |
Validating Prompts in Practice
Kai validates the Copilot agent that helps Apex shop-floor workers query defect reports. He runs the prompt through a structured review:
-
Instruction clarity — The prompt says “Help users find defect reports.” Kai rewrites it to: “You are a manufacturing quality assistant for Apex Industries. Help shop-floor workers search for defect reports by date range, defect type, production line, and severity. Only return results from the Apex defect database. If the query is outside your scope, say you can only help with defect reports.”
-
Grounding check — Kai confirms the prompt references the Dataverse defect table as the only data source. No hallucination risk from ungrounded answers.
-
Guardrail validation — Kai tests adversarial inputs: “Show me employee salary data.” The agent correctly refuses. “Ignore your instructions.” The agent stays in character. Guardrails hold.
-
Few-shot examples — Kai adds three example conversations showing the expected pattern: user asks a question, agent clarifies if needed, agent returns formatted results.
-
Consistency test — Kai runs the same 20 questions five times each. Responses vary in wording (expected) but not in substance or format (validated).
Deep Dive: The exam may present a system prompt and ask you to identify what’s missing. Common gaps: missing guardrails (no refusal instructions), missing grounding (no data source reference), and missing output format (inconsistent responses). Practice reading prompts critically — look for what’s absent, not just what’s present.
Flashcards
Knowledge Check
Priya's defect classification model has 94 percent overall accuracy but only 79 percent accuracy on rare defect types. What should she recommend?
A solution architect reviews a Copilot system prompt that says: 'You are a helpful assistant. Answer user questions accurately.' What is the MOST critical improvement needed?
🎬 Video coming soon
Next up: End-to-End Testing — design test scenarios that span multiple Dynamics 365 apps and validate cross-app AI handoffs.