Responsible AI: Filters, Auditing & Governance
Building AI is easy. Building AI responsibly is the hard part. Learn how to configure safety filters, implement evaluation, audit AI decisions, and govern agent behaviour with oversight controls.
Responsible AI is not optional
Responsible AI is like having safety features on a car: seatbelts, airbags, speed limiters, and dashcams.
Safety filters stop the AI from saying harmful things (seatbelt). Guardrails keep agents from going off-script (speed limiter). Evaluation tools check if the AI is trustworthy (MOT inspection). Audit logs record what the AI did and why (dashcam). And governance controls decide which tools the agent is allowed to use (keys to certain rooms).
The exam has four bullet points just on responsible AI, so it's heavily tested.
Safety filters and content moderation
Microsoft Foundry provides configurable content filters on every model deployment:
| Filter Category | What It Catches | Severity Levels |
|---|---|---|
| Hate and fairness | Discriminatory or prejudiced content | Safe (annotation only), Low, Medium, High |
| Sexual | Sexually explicit or suggestive content | Safe (annotation only), Low, Medium, High |
| Violence | Violent or graphic content | Safe (annotation only), Low, Medium, High |
| Self-harm | Content promoting self-harm | Safe (annotation only), Low, Medium, High |
| Prompt shields | Jailbreak attempts and prompt injection | Enabled/Disabled |
| Groundedness detection | Responses not grounded in provided data | Enabled/Disabled |
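To make the severity levels concrete, here is a minimal sketch of how a client might apply per-category thresholds to the filter annotations a deployment returns. The dict-based shape and category keys are illustrative assumptions, not the actual Foundry SDK:

```python
# Illustrative sketch only: the annotation/threshold dicts mimic the table
# above; this is not the real Foundry content-filter API.

SEVERITY_ORDER = ["safe", "low", "medium", "high"]

# A stricter, customer-facing configuration: block medium severity and above.
CUSTOMER_FACING_THRESHOLDS = {
    "hate_and_fairness": "medium",
    "sexual": "medium",
    "violence": "medium",
    "self_harm": "low",
}

def is_blocked(annotations: dict, thresholds: dict) -> bool:
    """Return True if any category's severity meets or exceeds its threshold."""
    for category, severity in annotations.items():
        threshold = thresholds.get(category)
        if threshold is None:
            continue  # category not filtered in this configuration
        if SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold):
            return True
    return False
```

With this configuration, `is_blocked({"violence": "medium"}, CUSTOMER_FACING_THRESHOLDS)` blocks the content, while a "safe" annotation only records metadata and lets it through, mirroring the "Safe (annotation only)" column.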
Exam tip: Custom vs default content filters
Every deployment has default content filters enabled. You can create custom content filter configurations to:
- Tighten filters for customer-facing apps (block medium severity, not just high)
- Relax filters for internal research tools (allow clinical/medical terminology)
- Add prompt shields to prevent injection attacks
The exam tests when to customise filters. Key rule: customer-facing = stricter, internal = can be looser, healthcare/legal = needs domain-specific tuning.
Evaluation instrumentation
Foundry's evaluation framework lets you measure AI quality systematically:
| Evaluator | What It Measures | When to Use |
|---|---|---|
| Groundedness | Is the response based on retrieved data? | RAG applications |
| Relevance | Does the response answer the question? | All generative apps |
| Coherence | Is the response well-structured and logical? | Content generation |
| Fluency | Is the language natural and readable? | Customer-facing output |
| Safety | Does the response contain harmful content? | All applications |
| F1 score | Does the response match expected output? | Extraction and classification |
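Of the evaluators above, F1 is the most mechanical, so it is worth seeing computed. A common approach for extraction tasks is token-overlap F1 (the harmonic mean of precision and recall over shared tokens); this sketch assumes that convention:

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1, a common metric for extraction evaluation."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("invoice 123 paid", "invoice 123")` gives precision 2/3 and recall 1, so F1 = 0.8: the extra token costs precision even though everything expected was found.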
Running evaluations
| Method | When to Use |
|---|---|
| Manual evaluation | One-off quality check, debugging specific issues |
| Automated in CI/CD | Every code change, gate deployments on quality scores |
| Continuous monitoring | Production, detect drift over time |
| Red teaming | Pre-launch, adversarial testing to find safety gaps |
Real-world example: NeuralMed's safety evaluation
Before deploying their patient chatbot, NeuralMed runs three evaluation passes:
- Groundedness evaluation: 500 test questions, checking every response cites source medical articles
- Safety evaluation: adversarial prompts trying to extract diagnosis advice beyond the bot's scope
- Red teaming: security team attempts prompt injection, jailbreaks, and social engineering
The chatbot must score above 0.85 groundedness and pass all safety checks before going live. These evaluations run automatically in CI/CD on every model or prompt change.
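A quality gate like NeuralMed's can be wired into CI/CD as a small script that reads the evaluation run's scores and fails the pipeline below the threshold. The score source and CLI shape here are assumptions; the 0.85 threshold comes from the example above:

```python
import sys

# Threshold from the NeuralMed example: average groundedness must be >= 0.85.
GROUNDEDNESS_THRESHOLD = 0.85

def gate(scores: list, threshold: float = GROUNDEDNESS_THRESHOLD) -> bool:
    """Return True if the deployment may proceed."""
    if not scores:
        return False  # no evaluation data: fail closed
    return sum(scores) / len(scores) >= threshold

if __name__ == "__main__":
    # In a real pipeline, scores would be parsed from the evaluation
    # run's exported results rather than passed on the command line.
    scores = [float(s) for s in sys.argv[1:]]
    sys.exit(0 if gate(scores) else 1)  # non-zero exit blocks the deployment
```

Failing closed when no scores are present is deliberate: a broken evaluation step should block the release, not silently wave it through.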
Auditing: trace logging and provenance
| Auditing Component | What It Records | Why It Matters |
|---|---|---|
| Trace logging | Every model call, input, output, latency, tokens used | Debug issues, track costs, investigate incidents |
| Provenance metadata | Source documents used for each response | Prove responses are grounded, support citations |
| Approval workflows | Human review before high-stakes agent actions | Prevent autonomous mistakes in critical workflows |
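The first two rows of the table can be combined in a single wrapper: every model call emits one trace record carrying the input, output, latency, and the source documents supplied for grounding. This is an illustrative sketch (field names and the print-based sink are assumptions, not a Foundry API):

```python
import json
import time
import uuid
from datetime import datetime, timezone

def traced_call(model_fn, prompt: str, sources: list) -> dict:
    """Wrap a model call with a trace record for auditing: input, output,
    latency, and provenance (the documents supplied for grounding)."""
    start = time.perf_counter()
    response = model_fn(prompt)
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": prompt,
        "output": response,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "provenance": sources,  # documents the response should be citing
    }
    print(json.dumps(record))  # in production: append to a durable log store
    return record
```

Because the record is written regardless of outcome, an incident investigation can replay exactly what the model saw and said, and the provenance list supports the citation checks described above.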
Agent governance
Agents that act autonomously need boundaries. Governance controls include:
| Feature | Oversight Mode | What It Means |
|---|---|---|
| Full autonomy | Autonomous | Agent acts without human approval. Use for low-risk, well-tested workflows. |
| Human-in-the-loop | Semiautonomous | Agent proposes actions, human approves before execution. Use for high-stakes decisions. |
| Report only | Advisory | Agent recommends but never acts. Use for new or untrusted agents. |
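The three oversight modes above reduce to a small dispatch decision on every proposed action. This sketch is a hypothetical illustration of that decision, not a Foundry governance API:

```python
from enum import Enum

class OversightMode(Enum):
    AUTONOMOUS = "autonomous"          # act without human approval
    SEMIAUTONOMOUS = "semiautonomous"  # propose, then wait for approval
    ADVISORY = "advisory"              # recommend only, never act

def handle_action(mode: OversightMode, action: str, approved: bool = False) -> str:
    """Decide the fate of a proposed agent action under each oversight mode."""
    if mode is OversightMode.AUTONOMOUS:
        return f"executed: {action}"
    if mode is OversightMode.SEMIAUTONOMOUS:
        if approved:
            return f"executed: {action}"
        return f"pending approval: {action}"
    return f"recommendation only: {action}"  # advisory mode never executes
```

The key property is that semiautonomous mode has two outcomes for the same action depending on the `approved` flag, which is exactly the human-in-the-loop pattern the exam expects for high-stakes decisions.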
Tool-access controls
| Control | What It Does | Example |
|---|---|---|
| Tool allowlist | Agent can only use approved tools | Compliance agent can search regulations but not modify records |
| Tool blocklist | Agent explicitly blocked from certain actions | Customer service bot can look up orders but can't issue refunds over a threshold |
| Rate constraints | Limit how often an agent can call a tool | Agent can create max 10 support tickets per minute |
| Approval gates | Require human approval before specific tool calls | Agent must get approval before sending external emails |
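The four controls in the table compose naturally into one pre-call check. The sketch below combines an allowlist, a per-minute rate constraint, and an approval gate; the class and method names are hypothetical:

```python
import time
from collections import deque

class ToolGovernor:
    """Illustrative combination of the controls above: an allowlist, a
    per-minute rate constraint, and an approval gate for sensitive tools."""

    def __init__(self, allowlist, rate_limit_per_min, needs_approval):
        self.allowlist = set(allowlist)
        self.rate_limit = rate_limit_per_min
        self.needs_approval = set(needs_approval)
        self.calls = deque()  # timestamps of recent permitted tool calls

    def check(self, tool: str, approved: bool = False, now=None) -> str:
        now = time.monotonic() if now is None else now
        if tool not in self.allowlist:
            return "denied: tool not on allowlist"
        # Drop calls that fell outside the sliding 60-second window.
        while self.calls and now - self.calls[0] > 60:
            self.calls.popleft()
        if len(self.calls) >= self.rate_limit:
            return "denied: rate limit reached"
        if tool in self.needs_approval and not approved:
            return "pending: human approval required"
        self.calls.append(now)
        return "allowed"
```

A compliance agent might be configured as `ToolGovernor(["search_regulations", "send_external_email"], 10, ["send_external_email"])`: searches flow freely up to the rate limit, external email always waits for a human, and anything else (such as modifying records) is denied outright.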
Real-world example: Atlas Financial's agent governance
Atlas Financial's compliance agent operates in semiautonomous mode:
- Autonomous: Search regulations, retrieve documents, generate compliance assessments
- Human approval required: Flag a loan application as non-compliant, escalate to regulatory team
- Blocked: Cannot modify loan applications, cannot communicate with external regulators directly
Every action is trace-logged. Provenance metadata links every compliance assessment to the specific regulations it cited. Monthly audit reports are generated automatically from trace logs.
Knowledge check
NeuralMed's patient chatbot should NEVER provide specific medical diagnoses β only direct patients to consult their doctor. A safety evaluation reveals the chatbot occasionally says 'Based on your symptoms, you likely have...' Which control should they implement?
Atlas Financial's compliance agent can autonomously search regulations and generate assessments. However, the team wants any decision to flag a loan as 'non-compliant' to require manager approval. Which governance control should they configure?
Which of the following is an example of provenance metadata in an AI system?