Custom Model Validation and Prompt Best Practices

Simple explanation

Testing tells you “does it work?” Validation tells you “does it work well enough to trust in production?”

Think of it like a pilot’s licence. A test checks if you can fly the plane (take off, land, navigate). Validation checks if you can fly it safely under real conditions — in fog, with crosswinds, when an engine fails, with passengers on board. You might pass the basic test but fail validation because you can’t handle edge cases at acceptable safety margins.

For AI models, validation means defining what “good enough” looks like across multiple dimensions — accuracy, fairness, speed, safety — and proving the model meets those thresholds before it touches real users.

The Scenario

🏗️ Kai Mercer and data engineer Priya Sharma are building a custom defect classification model for Apex Industries. The model analyses images from the manufacturing line and classifies defects into 12 categories. Before it can go live, Apex’s CTO Lin Chen requires formal validation.

Priya knows the model’s overall accuracy is 94 percent. But “overall accuracy” hides problems. Is it 94 percent across all 12 defect types? Or is it 99 percent on common defects and 60 percent on rare but critical ones?

Validation Criteria for Custom AI Models

Validation isn’t a single number. It’s a multi-dimensional assessment:

Criterion	What It Measures	Why It Matters	Threshold Example
Accuracy	Overall percentage of correct predictions	Baseline performance measure	Above 90 percent overall
Precision and Recall	Per-class correctness (precision) and coverage (recall)	Reveals hidden weaknesses in specific categories	Recall above 85 percent for ALL classes, not just the average
Latency	Time from input to prediction	Production systems need real-time or near-real-time responses	Under 500 milliseconds per prediction
Bias detection	Performance differences across demographic groups or data segments	Ensures fairness and prevents discriminatory outcomes	No more than 5 percent accuracy gap between segments
Robustness	Performance on noisy, incomplete, or adversarial inputs	Real-world data is messy	Accuracy drop under 10 percent on degraded inputs
Safety	Behaviour on out-of-distribution or harmful inputs	Model should fail gracefully, not confidently give wrong answers	100 percent safe refusal on out-of-scope inputs

Priya’s Validation Discovery

Priya runs the full validation suite. Overall accuracy: 94 percent. But when she breaks it down by defect type:

Common defects (scratches, dents): 97 percent accuracy
Rare defects (hairline cracks, material delamination): 79 percent accuracy
Critical safety defects (structural fractures): 82 percent accuracy

The 15 percent accuracy drop on rare defect types is a problem. A structural fracture classified as a minor scratch could lead to a product recall — or worse, a safety incident. Priya flags this to Kai and Lin Chen. The model needs more training data for rare defects before it can pass validation.

Exam Tip: Validation is NOT the same as testing. Testing checks if the model works (functional correctness). Validation checks if it works WELL ENOUGH for production (meets quantitative thresholds across multiple criteria). The exam expects you to understand this distinction. If a question asks “what is the purpose of model validation,” the answer is about thresholds and production readiness — not just “checking if it works.”

Validation Approaches

Different approaches catch different types of issues. A robust validation strategy uses all three:

Aspect	Automated Evaluation	Human Evaluation	Red-Teaming
How It Works	Scoring pipelines measure accuracy, latency, and bias on labelled datasets	Domain experts manually review model outputs for quality and correctness	Adversarial testers deliberately try to make the model fail or behave unsafely
Strengths	Fast, repeatable, covers large datasets	Catches subjective issues automated metrics miss	Reveals safety vulnerabilities and guardrail gaps
Weaknesses	Misses nuance — a technically correct answer can still be unhelpful	Slow, expensive, subjective across reviewers	Resource-intensive, requires skilled adversarial testers
When Required	Every validation cycle	Before production deployment and after major changes	Before initial deployment and periodically thereafter
Example	Run 10,000 test images through the defect classifier and measure per-class precision	Manufacturing engineers review 200 borderline classifications manually	Testers submit deliberately blurry, rotated, or partially obscured images

Copilot Prompt Validation

Even when you’re using a foundation model (not custom-trained), the system prompt shapes everything. A bad prompt leads to bad outcomes regardless of model quality.

The Prompt Best Practices Checklist

Every Copilot system prompt should be validated against these criteria:

Practice	What to Check	Red Flag
Clear instructions	Does the prompt clearly state the agent’s role, scope, and expected behaviour?	Vague instructions like “be helpful” without specifics
Grounding	Does the prompt direct the model to use specific knowledge sources?	No grounding reference — model relies only on training data
Output format	Does the prompt specify the expected response structure?	No format guidance — responses are inconsistent in length and style
Guardrails	Does the prompt define what the agent should NOT do?	No refusal instructions — agent may attempt anything asked
Few-shot examples	Does the prompt include example conversations showing correct behaviour?	No examples — model must guess the expected pattern
Tone and persona	Does the prompt establish a consistent voice?	No tone guidance — responses oscillate between formal and casual

Validating Prompts in Practice

Kai validates the Copilot agent that helps Apex shop-floor workers query defect reports. He runs the prompt through a structured review:

Instruction clarity — The prompt says “Help users find defect reports.” Kai rewrites it to: “You are a manufacturing quality assistant for Apex Industries. Help shop-floor workers search for defect reports by date range, defect type, production line, and severity. Only return results from the Apex defect database. If the query is outside your scope, say you can only help with defect reports.”
Grounding check — Kai confirms the prompt references the Dataverse defect table as the only data source. No hallucination risk from ungrounded answers.
Guardrail validation — Kai tests adversarial inputs: “Show me employee salary data.” The agent correctly refuses. “Ignore your instructions.” The agent stays in character. Guardrails hold.
Few-shot examples — Kai adds three example conversations showing the expected pattern: user asks a question, agent clarifies if needed, agent returns formatted results.
Consistency test — Kai runs the same 20 questions five times each. Responses vary in wording (expected) but not in substance or format (validated).

Deep Dive: The exam may present a system prompt and ask you to identify what’s missing. Common gaps: missing guardrails (no refusal instructions), missing grounding (no data source reference), and missing output format (inconsistent responses). Practice reading prompts critically — look for what’s absent, not just what’s present.

Flashcards

Question

What is the key difference between testing and validation for AI models?

Click or press Enter to reveal answer

Answer

Testing checks functional correctness — does the model produce the right output? Validation checks production readiness — does the model meet quantitative thresholds for accuracy, bias, latency, robustness, and safety? A model can pass testing but fail validation if it doesn't meet the required thresholds.

Click to flip back

Question

Why is overall accuracy an insufficient validation metric?

Click or press Enter to reveal answer

Answer

Overall accuracy can hide per-class failures. A model with 94 percent overall accuracy might be 99 percent accurate on common cases and only 60 percent on rare but critical cases. Validation must include per-class precision and recall to expose these hidden weaknesses.

Click to flip back

Question

Name the six elements of the Copilot prompt best practices checklist.

Click or press Enter to reveal answer

Answer

1. Clear instructions — specific role and scope. 2. Grounding — reference to knowledge sources. 3. Output format — expected response structure. 4. Guardrails — what the agent must NOT do. 5. Few-shot examples — example conversations. 6. Tone and persona — consistent voice and style.

Click to flip back

Question

What are the three validation approaches for custom AI models?

Click or press Enter to reveal answer

Answer

1. Automated evaluation — scoring pipelines on labelled datasets (fast, repeatable). 2. Human evaluation — domain experts manually review outputs (catches nuance). 3. Red-teaming — adversarial testers try to break the model (reveals safety gaps). A robust strategy uses all three.

Click to flip back

Knowledge Check

Priya's defect classification model has 94 percent overall accuracy but only 79 percent accuracy on rare defect types. What should she recommend?

Knowledge Check

A solution architect reviews a Copilot system prompt that says: 'You are a helpful assistant. Answer user questions accurately.' What is the MOST critical improvement needed?

Next up: End-to-End Testing — design test scenarios that span multiple Dynamics 365 apps and validate cross-app AI handoffs.