Responsible AI: Filters, Auditing & Governance
Building AI is easy. Building AI responsibly is the hard part. Learn how to configure safety filters, implement evaluation, audit AI decisions, and govern agent behaviour with oversight controls.
Responsible AI is not optional
Responsible AI is like having safety features on a car: seatbelts, airbags, speed limiters, and dashcams.
Safety filters stop the AI from saying harmful things (seatbelt). Guardrails keep agents from going off-script (speed limiter). Evaluation tools check if the AI is trustworthy (MOT inspection). Audit logs record what the AI did and why (dashcam). And governance controls decide which tools the agent is allowed to use (keys to certain rooms).
The exam has four bullet points just on responsible AI, so it's heavily tested.
Safety filters and content moderation
Microsoft Foundry provides configurable content filters on every model deployment:
| Filter Category | What It Catches | Severity Levels |
|---|---|---|
| Hate and fairness | Discriminatory or prejudiced content | Safe (annotation only), Low, Medium, High |
| Sexual | Sexually explicit or suggestive content | Safe (annotation only), Low, Medium, High |
| Violence | Violent or graphic content | Safe (annotation only), Low, Medium, High |
| Self-harm | Content promoting self-harm | Safe (annotation only), Low, Medium, High |
| Prompt shields | Jailbreak attempts and prompt injection | Enabled/Disabled |
| Groundedness detection | Responses not grounded in provided data | Enabled/Disabled |
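To make the severity levels concrete, here is a minimal sketch of how a client might apply per-category thresholds to the filter annotations a deployment returns. The dict-based shape and category keys are illustrative assumptions, not the actual Foundry SDK:

```python
# Illustrative sketch only: the annotation/threshold dicts mimic the table
# above; this is not the real Foundry content-filter API.

SEVERITY_ORDER = ["safe", "low", "medium", "high"]

# A stricter, customer-facing configuration: block medium severity and above.
CUSTOMER_FACING_THRESHOLDS = {
    "hate_and_fairness": "medium",
    "sexual": "medium",
    "violence": "medium",
    "self_harm": "low",
}

def is_blocked(annotations: dict, thresholds: dict) -> bool:
    """Return True if any category's severity meets or exceeds its threshold."""
    for category, severity in annotations.items():
        threshold = thresholds.get(category)
        if threshold is None:
            continue  # category not filtered in this configuration
        if SEVERITY_ORDER.index(severity) >= SEVERITY_ORDER.index(threshold):
            return True
    return False
```

With this configuration, `is_blocked({"violence": "medium"}, CUSTOMER_FACING_THRESHOLDS)` blocks the content, while a "safe" annotation only records metadata and lets it through, mirroring the "Safe (annotation only)" column.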
Exam tip: Custom vs default content filters
Every deployment has default content filters enabled. You can create custom content filter configurations to:
- Tighten filters for customer-facing apps (block medium severity, not just high)
- Relax filters for internal research tools (allow clinical/medical terminology)
- Add prompt shields to prevent injection attacks
The exam tests when to customise filters. Key rule: customer-facing = stricter, internal = can be looser, healthcare/legal = needs domain-specific tuning.
Evaluation instrumentation
Foundry's evaluation framework lets you measure AI quality systematically:
| Evaluator | What It Measures | When to Use |
|---|---|---|
| Groundedness | Is the response based on retrieved data? | RAG applications |
| Relevance | Does the response answer the question? | All generative apps |
| Coherence | Is the response well-structured and logical? | Content generation |
| Fluency | Is the language natural and readable? | Customer-facing output |
| Safety | Does the response contain harmful content? | All applications |
| F1 score | Does the response match expected output? | Extraction and classification |
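Of the evaluators above, F1 is the most mechanical, so it is worth seeing computed. A common approach for extraction tasks is token-overlap F1 (the harmonic mean of precision and recall over shared tokens); this sketch assumes that convention:

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1, a common metric for extraction evaluation."""
    pred_tokens = predicted.lower().split()
    gold_tokens = expected.lower().split()
    if not pred_tokens or not gold_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("invoice 123 paid", "invoice 123")` gives precision 2/3 and recall 1, so F1 = 0.8: the extra token costs precision even though everything expected was found.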
Running evaluations
| Method | When to Use |
|---|---|
| Manual evaluation | One-off quality check, debugging specific issues |
| Automated in CI/CD | Every code change, gate deployments on quality scores |
| Continuous monitoring | Production, detect drift over time |
| Red teaming | Pre-launch, adversarial testing to find safety gaps |
Real-world example: NeuralMed's safety evaluation
Before deploying their patient chatbot, NeuralMed runs three evaluation passes:
- Groundedness evaluation: 500 test questions, checking every response cites source medical articles
- Safety evaluation: adversarial prompts trying to extract diagnosis advice beyond the bot's scope
- Red teaming: security team attempts prompt injection, jailbreaks, and social engineering
The chatbot must score above 0.85 groundedness and pass all safety checks before going live. These evaluations run automatically in CI/CD on every model or prompt change.
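A quality gate like NeuralMed's can be wired into CI/CD as a small script that reads the evaluation run's scores and fails the pipeline below the threshold. The score source and CLI shape here are assumptions; the 0.85 threshold comes from the example above:

```python
import sys

# Threshold from the NeuralMed example: average groundedness must be >= 0.85.
GROUNDEDNESS_THRESHOLD = 0.85

def gate(scores: list, threshold: float = GROUNDEDNESS_THRESHOLD) -> bool:
    """Return True if the deployment may proceed."""
    if not scores:
        return False  # no evaluation data: fail closed
    return sum(scores) / len(scores) >= threshold

if __name__ == "__main__":
    # In a real pipeline, scores would be parsed from the evaluation
    # run's exported results rather than passed on the command line.
    scores = [float(s) for s in sys.argv[1:]]
    sys.exit(0 if gate(scores) else 1)  # non-zero exit blocks the deployment
```

Failing closed when no scores are present is deliberate: a broken evaluation step should block the release, not silently wave it through.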
Auditing: trace logging and provenance
| Auditing Component | What It Records | Why It Matters |
|---|---|---|
| Trace logging | Every model call, input, output, latency, tokens used | Debug issues, track costs, investigate incidents |
| Provenance metadata | Source documents used for each response | Prove responses are grounded, support citations |
| Approval workflows | Human review before high-stakes agent actions | Prevent autonomous mistakes in critical workflows |
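The first two rows of the table can be combined in a single wrapper: every model call emits one trace record carrying the input, output, latency, and the source documents supplied for grounding. This is an illustrative sketch (field names and the print-based sink are assumptions, not a Foundry API):

```python
import json
import time
import uuid
from datetime import datetime, timezone

def traced_call(model_fn, prompt: str, sources: list) -> dict:
    """Wrap a model call with a trace record for auditing: input, output,
    latency, and provenance (the documents supplied for grounding)."""
    start = time.perf_counter()
    response = model_fn(prompt)
    record = {
        "trace_id": str(uuid.uuid4()),
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": prompt,
        "output": response,
        "latency_ms": round((time.perf_counter() - start) * 1000, 2),
        "provenance": sources,  # documents the response should be citing
    }
    print(json.dumps(record))  # in production: append to a durable log store
    return record
```

Because the record is written regardless of outcome, an incident investigation can replay exactly what the model saw and said, and the provenance list supports the citation checks described above.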
Agent governance
Agents that act autonomously need boundaries. Governance controls include:
| Feature | Oversight Mode | What It Means |
|---|---|---|
| Full autonomy | Autonomous | Agent acts without human approval. Use for low-risk, well-tested workflows. |
| Human-in-the-loop | Semiautonomous | Agent proposes actions, human approves before execution. Use for high-stakes decisions. |
| Report only | Advisory | Agent recommends but never acts. Use for new or untrusted agents. |
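The three oversight modes above reduce to a small dispatch decision on every proposed action. This sketch is a hypothetical illustration of that decision, not a Foundry governance API:

```python
from enum import Enum

class OversightMode(Enum):
    AUTONOMOUS = "autonomous"          # act without human approval
    SEMIAUTONOMOUS = "semiautonomous"  # propose, then wait for approval
    ADVISORY = "advisory"              # recommend only, never act

def handle_action(mode: OversightMode, action: str, approved: bool = False) -> str:
    """Decide the fate of a proposed agent action under each oversight mode."""
    if mode is OversightMode.AUTONOMOUS:
        return f"executed: {action}"
    if mode is OversightMode.SEMIAUTONOMOUS:
        if approved:
            return f"executed: {action}"
        return f"pending approval: {action}"
    return f"recommendation only: {action}"  # advisory mode never executes
```

The key property is that semiautonomous mode has two outcomes for the same action depending on the `approved` flag, which is exactly the human-in-the-loop pattern the exam expects for high-stakes decisions.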
Tool-access controls
| Control | What It Does | Example |
|---|---|---|
| Tool allowlist | Agent can only use approved tools | Compliance agent can search regulations but not modify records |
| Tool blocklist | Agent explicitly blocked from certain actions | Customer service bot can look up orders but can't issue refunds over a threshold |
| Rate constraints | Limit how often an agent can call a tool | Agent can create max 10 support tickets per minute |
| Approval gates | Require human approval before specific tool calls | Agent must get approval before sending external emails |
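The four controls in the table compose naturally into one pre-call check. The sketch below combines an allowlist, a per-minute rate constraint, and an approval gate; the class and method names are hypothetical:

```python
import time
from collections import deque

class ToolGovernor:
    """Illustrative combination of the controls above: an allowlist, a
    per-minute rate constraint, and an approval gate for sensitive tools."""

    def __init__(self, allowlist, rate_limit_per_min, needs_approval):
        self.allowlist = set(allowlist)
        self.rate_limit = rate_limit_per_min
        self.needs_approval = set(needs_approval)
        self.calls = deque()  # timestamps of recent permitted tool calls

    def check(self, tool: str, approved: bool = False, now=None) -> str:
        now = time.monotonic() if now is None else now
        if tool not in self.allowlist:
            return "denied: tool not on allowlist"
        # Drop calls that fell outside the sliding 60-second window.
        while self.calls and now - self.calls[0] > 60:
            self.calls.popleft()
        if len(self.calls) >= self.rate_limit:
            return "denied: rate limit reached"
        if tool in self.needs_approval and not approved:
            return "pending: human approval required"
        self.calls.append(now)
        return "allowed"
```

A compliance agent might be configured as `ToolGovernor(["search_regulations", "send_external_email"], 10, ["send_external_email"])`: searches flow freely up to the rate limit, external email always waits for a human, and anything else (such as modifying records) is denied outright.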
Real-world example: Atlas Financial's agent governance
Atlas Financial's compliance agent operates in semiautonomous mode:
- Autonomous: Search regulations, retrieve documents, generate compliance assessments
- Human approval required: Flag a loan application as non-compliant, escalate to regulatory team
- Blocked: Cannot modify loan applications, cannot communicate with external regulators directly
Every action is trace-logged. Provenance metadata links every compliance assessment to the specific regulations it cited. Monthly audit reports are generated automatically from trace logs.
Knowledge check
NeuralMed's patient chatbot should NEVER provide specific medical diagnoses β only direct patients to consult their doctor. A safety evaluation reveals the chatbot occasionally says 'Based on your symptoms, you likely have...' Which control should they implement?
Atlas Financial's compliance agent can autonomously search regulations and generate assessments. However, the team wants any decision to flag a loan as 'non-compliant' to require manager approval. Which governance control should they configure?
Which of the following is an example of provenance metadata in an AI system?