Trainable Classifiers: AI-Powered Detection
When regex cannot describe the content and no database exists to match against, trainable classifiers learn from examples to recognise contracts, resumes, source code, and other unstructured content.
What are trainable classifiers?
Think about training a new security guard.
You cannot give them a checklist for every single threat; threats are too varied. Instead, you show them 50 examples of suspicious behaviour: "This is what tailgating looks like. This is what a stolen badge scan looks like. This is what a social engineering attempt sounds like."
After seeing enough examples, the guard learns to recognise the pattern: not by a fixed rule, but by understanding what these situations have in common.
Trainable classifiers work the same way. You feed Microsoft Purview dozens of example documents (contracts, resumes, financial statements) and the AI learns to recognise new documents that look similar. No regex needed.
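The learning-by-example idea can be illustrated with a deliberately tiny sketch. This is not how Purview's model works internally (its classifiers use far more sophisticated machine learning); it just shows the general principle of building a profile from positive examples and scoring new documents against it:

```python
from collections import Counter

def build_profile(examples):
    """Aggregate word frequencies across all positive example texts."""
    profile = Counter()
    for text in examples:
        profile.update(text.lower().split())
    return profile

def score(profile, text):
    """Fraction of the document's words that appear in the learned profile."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in profile)
    return hits / len(words)

# Toy "seed content": fragments that look like contracts
seed = [
    "this agreement is entered into by the parties",
    "the parties agree to the terms of this contract",
]
profile = build_profile(seed)

print(score(profile, "the parties agree to amend this agreement"))  # high overlap
print(score(profile, "quarterly sales figures rose sharply"))       # low overlap
```

A contract-like sentence scores high because most of its words appear in the seed profile; an unrelated sentence scores near zero. Real classifiers learn far richer signals than word overlap, but the workflow is the same: examples in, recogniser out.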
Pre-trained vs custom trainable classifiers
| Feature | Pre-trained Classifiers | Custom Trainable Classifiers |
|---|---|---|
| Created by | Microsoft (ships with your tenant) | Your admin team (trained on your examples) |
| Examples | Resumes, source code, harassment, threats, profanity, financial statements, agreements | Whatever you train: clinical trial docs, board minutes, internal memos, R&D reports |
| Training needed? | No (ready to use immediately) | Yes (you provide 50+ positive examples and test) |
| Customisable? | No (you cannot modify their training) | Yes (retrain if accuracy drops or content types evolve) |
| Accuracy | Good for common types, may vary for niche content | Depends on training quality and example diversity |
| Use case | Quick classification of common document types | Organisation-specific content that no built-in classifier covers |
Key pre-trained classifiers
| Classifier | What It Detects |
|---|---|
| Agreements/Contracts | Legal agreements, NDAs, contracts |
| Resumes/CVs | Job applications and curriculum vitae |
| Source Code | Programming code in various languages |
| Financial Statements | Balance sheets, income statements, cash flow statements |
| Harassment | Offensive or harassing language |
| Threats | Threatening language toward people or property |
| Profanity | Vulgar or offensive language |
| Discrimination | Discriminatory language |
| Targeted Harassment | Offensive content directed at specific individuals |
| Customer Complaints | Content expressing dissatisfaction with products or services |
Creating a custom trainable classifier
When no pre-trained classifier fits, you build your own. The process has four stages:
Stage 1: Seed content (positive examples)
Collect at least 50 documents (ideally 200+) that ARE the target type. These must be representative examples: diverse in content but consistent in type.
| Requirement | Detail |
|---|---|
| Minimum count | 50 positive examples (200+ recommended for better accuracy) |
| Format | Must be uploaded to a SharePoint Online site |
| Quality | Must genuinely represent the content type, not just any random documents |
| Diversity | Include variety within the type (different authors, dates, topics) |
| Language | Examples should reflect the languages used in your organisation |
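Purview does not expose an API for this check, but the requirements above are easy to sanity-check in a script before you upload. A minimal sketch, where the document list, the `author` field, and the diversity threshold are all hypothetical choices for illustration:

```python
MIN_EXAMPLES = 50    # Purview's hard minimum for seed content
RECOMMENDED = 200    # recommended count for better accuracy

def check_seed_set(docs):
    """Validate a candidate seed set against the requirements above.
    `docs` is a list of dicts with a hypothetical 'author' field."""
    problems = []
    if len(docs) < MIN_EXAMPLES:
        problems.append(f"only {len(docs)} examples; need at least {MIN_EXAMPLES}")
    elif len(docs) < RECOMMENDED:
        problems.append(f"{len(docs)} examples meets the minimum, but {RECOMMENDED}+ is recommended")
    authors = {d["author"] for d in docs}
    if len(authors) < 5:  # arbitrary diversity threshold for this sketch
        problems.append("low author diversity; include more writers")
    return problems

# 60 documents written by only 3 people: enough volume, weak diversity
docs = [{"author": f"author{i % 3}"} for i in range(60)]
print(check_seed_set(docs))
```

Running this on the sample set flags both the below-recommended count and the low author diversity, which is exactly the kind of problem that shows up later as poor accuracy.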
Stage 2: Processing
After you submit the seed content, the classifier processes the examples and builds a prediction model. This takes 24-72 hours; there is no way to speed it up.
Stage 3: Testing
Provide both positive examples (more of the target type) and negative examples (documents that are NOT the target type). The classifier evaluates each and you review the results.
| Test Result | What It Means | Action |
|---|---|---|
| True positive | Correctly identified as the target type | Good; no action needed |
| True negative | Correctly identified as NOT the target type | Good; no action needed |
| False positive | Incorrectly flagged as the target type | Mark as "Not a match" to improve the model |
| False negative | Missed a real example of the target type | Mark as "Match" to improve the model |
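The four outcomes in the table combine into the standard evaluation metrics. A minimal sketch with made-up counts for illustration (for example, 46 true positives and 48 true negatives out of 100 test items gives 94% accuracy):

```python
def classifier_metrics(tp, tn, fp, fn):
    """Standard metrics from the four test outcomes."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                      # overall correctness
    precision = tp / (tp + fp) if tp + fp else 0.0    # of flagged items, how many were real
    recall = tp / (tp + fn) if tp + fn else 0.0       # of real items, how many were caught
    return accuracy, precision, recall

# Hypothetical test run: 50 positive and 50 negative examples
acc, prec, rec = classifier_metrics(tp=46, tn=48, fp=2, fn=4)
print(f"accuracy={acc:.0%} precision={prec:.0%} recall={rec:.0%}")
```

Accuracy alone can mislead when positives are rare, so checking precision (false-positive control) and recall (false-negative control) separately tells you which kind of correction your "Match" / "Not a match" feedback should target.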
Stage 4: Publish
Once testing accuracy is acceptable, publish the classifier. It becomes available as a condition in DLP policies, auto-labeling rules, and retention labels, just like any SIT.
Scenario: Dr. Liam builds a clinical trial classifier
St. Harbour Health runs clinical trials. Trial documents vary widely (protocols, consent forms, adverse event reports, data collection forms) but they share characteristics: medical terminology, trial phase references, patient cohort language, regulatory citations.
No regex pattern can describe "a clinical trial document." Dr. Liam collects 200 examples from the clinical research team, uploads them to SharePoint, creates a custom classifier, waits 48 hours for processing, then tests with 50 positive and 50 negative examples.
Results: 94% accuracy. He publishes the classifier and uses it in a DLP policy to prevent clinical trial documents from being shared externally without approval.
Retraining classifiers
Over time, document formats evolve. A classifier trained on 2024-era contracts may not recognise 2026-era contracts with AI-generated clauses. Microsoft Purview allows retraining:
- Add new positive and negative examples
- Submit for reprocessing (another 24-72 hours)
- Re-test and validate accuracy
- Republish
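Purview does not trigger retraining for you, so deciding *when* to retrain is your job. A minimal monitoring sketch, where the sampling approach and the 90% threshold are assumptions chosen for illustration:

```python
def needs_retraining(recent_results, threshold=0.90):
    """recent_results: list of booleans, True where the classifier judged
    a manually-reviewed sample document correctly. Flag retraining when
    the rolling accuracy drops below the threshold."""
    if not recent_results:
        return False
    accuracy = sum(recent_results) / len(recent_results)
    return accuracy < threshold

# A classifier missing 20% of new-format documents: 80% rolling accuracy
sample = [True] * 80 + [False] * 20
print(needs_retraining(sample))  # True -> add examples and reprocess
```

Periodically spot-checking a sample of classified items and feeding the verdicts into a check like this turns "the classifier feels less accurate lately" into a concrete retraining trigger.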
Exam tip: the 50-document minimum
The exam frequently tests the minimum requirements for trainable classifiers. Key numbers:
- Seed content: minimum 50 positive examples (200+ recommended)
- Testing: provide both positive AND negative examples
- Processing time: 24-72 hours (not instant)
- Location: seed content must be in SharePoint Online (not OneDrive, not local files)
If a question asks "what is the minimum number of positive examples for a custom trainable classifier?" the answer is 50.
Monitor classification: Data Explorer and Content Explorer
Once your SITs and classifiers are running, you need visibility into what theyβre finding.
Content Explorer
Content Explorer lets you browse individual items that match a SIT or classifier. You can see exactly which documents contain sensitive data, where they live, and what was detected.
| Capability | What You Can Do |
|---|---|
| Browse by SIT/label | See all items matching a specific SIT or sensitivity label |
| View content | Open and inspect the actual document (with appropriate permissions) |
| Filter by location | Narrow to Exchange, SharePoint, OneDrive, or endpoints |
| Verify accuracy | Confirm that SITs are detecting the right content |
Who can access: Content Explorer Viewer role or Content Explorer List Viewer role (list viewer sees item count but cannot open content).
Data Explorer (Activity Explorer)
Data Explorer (also called Activity Explorer) shows what users are doing with classified data, as a timeline of activities:
| Activity Type | What It Shows |
|---|---|
| Label applied | A sensitivity label was added to a document |
| Label changed | A sensitivity label was changed or removed |
| DLP policy matched | Content triggered a DLP rule |
| File copied to USB | Endpoint DLP detected a file copy to removable media |
| File uploaded to cloud | Content was uploaded to a cloud service |
Scenario: Zara audits Atlas Global's classification
Zara Okonkwo at Atlas Global just rolled out new SITs for employee data and project codes. After two weeks, she opens Content Explorer to check:
- 65,000 items match the employee data SIT across SharePoint and OneDrive
- 12,000 items match the project code SIT, but 2,000 are in a public SharePoint site
- She drills into the public site items and finds project proposals that should be labelled Confidential
She then checks Activity Explorer and finds 45 instances where employees downloaded project documents to personal OneDrive, confirming she needs an Endpoint DLP policy next.
Marcus at NovaTech needs to classify internal R&D documents. These documents vary widely (some are research papers, some are experiment logs, some are patent drafts) but they all share technical language patterns specific to NovaTech's AI products. No built-in classifier covers this. What should Marcus do?
Zara at Atlas Global created a custom trainable classifier for employee performance reviews six months ago. Recently, Atlas Global switched to a new review format with AI-generated summary sections. The classifier is now missing 20% of new reviews. What should Zara do?
🎬 Video coming soon
Next up: Sensitivity Labels: Create & Protect. Now that you can find sensitive data, learn how to protect it with labels.