Trainable Classifiers: AI-Powered Detection
When regex cannot describe the content and no database exists to match against, trainable classifiers learn from examples to recognise contracts, resumes, source code, and other unstructured content.
What are trainable classifiers?
Think about training a new security guard.
You cannot give them a checklist for every single threat; threats are too varied. Instead, you show them 50 examples of suspicious behaviour: "This is what tailgating looks like. This is what a stolen badge scan looks like. This is what a social engineering attempt sounds like."
After seeing enough examples, the guard learns to recognise the pattern: not by a fixed rule, but by understanding what these situations have in common.
Trainable classifiers work the same way. You feed Microsoft Purview dozens of example documents (contracts, resumes, financial statements) and the AI learns to recognise new documents that look similar. No regex needed.
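The learning-by-example idea can be illustrated with a deliberately tiny sketch. This is not how Purview's model works internally (its classifiers use far more sophisticated machine learning); it just shows the general principle of building a profile from positive examples and scoring new documents against it:

```python
from collections import Counter

def build_profile(examples):
    """Aggregate word frequencies across all positive example texts."""
    profile = Counter()
    for text in examples:
        profile.update(text.lower().split())
    return profile

def score(profile, text):
    """Fraction of the document's words that appear in the learned profile."""
    words = text.lower().split()
    if not words:
        return 0.0
    hits = sum(1 for w in words if w in profile)
    return hits / len(words)

# Toy "seed content": fragments that look like contracts
seed = [
    "this agreement is entered into by the parties",
    "the parties agree to the terms of this contract",
]
profile = build_profile(seed)

print(score(profile, "the parties agree to amend this agreement"))  # high overlap
print(score(profile, "quarterly sales figures rose sharply"))       # low overlap
```

A contract-like sentence scores high because most of its words appear in the seed profile; an unrelated sentence scores near zero. Real classifiers learn far richer signals than word overlap, but the workflow is the same: examples in, recogniser out.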
Pre-trained vs custom trainable classifiers
| Feature | Pre-trained Classifiers | Custom Trainable Classifiers |
|---|---|---|
| Created by | Microsoft (ships with your tenant) | Your admin team (trained on your examples) |
| Examples | Resumes, source code, harassment, threats, profanity, financial statements, agreements | Whatever you train: clinical trial docs, board minutes, internal memos, R&D reports |
| Training needed? | No (ready to use immediately) | Yes (you provide 50+ positive examples and test) |
| Customisable? | No (you cannot modify their training) | Yes (retrain if accuracy drops or content types evolve) |
| Accuracy | Good for common types, may vary for niche content | Depends on training quality and example diversity |
| Use case | Quick classification of common document types | Organisation-specific content that no built-in classifier covers |
Key pre-trained classifiers
| Classifier | What It Detects |
|---|---|
| Agreements/Contracts | Legal agreements, NDAs, contracts |
| Resumes/CVs | Job applications and curriculum vitae |
| Source Code | Programming code in various languages |
| Financial Statements | Balance sheets, income statements, cash flow statements |
| Harassment | Offensive or harassing language |
| Threats | Threatening language toward people or property |
| Profanity | Vulgar or offensive language |
| Discrimination | Discriminatory language |
| Targeted Harassment | Offensive content directed at specific individuals |
| Customer Complaints | Content expressing dissatisfaction with products or services |
Creating a custom trainable classifier
When no pre-trained classifier fits, you build your own. The process has four stages:
Stage 1: Seed content (positive examples)
Collect at least 50 documents (ideally 200+) that ARE the target type. These must be representative examples: diverse in content but consistent in type.
| Requirement | Detail |
|---|---|
| Minimum count | 50 positive examples (200+ recommended for better accuracy) |
| Format | Must be uploaded to a SharePoint Online site |
| Quality | Must genuinely represent the content type, not just any random documents |
| Diversity | Include variety within the type (different authors, dates, topics) |
| Language | Examples should reflect the languages used in your organisation |
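Purview does not expose an API for this check, but the requirements above are easy to sanity-check in a script before you upload. A minimal sketch, where the document list, the `author` field, and the diversity threshold are all hypothetical choices for illustration:

```python
MIN_EXAMPLES = 50    # Purview's hard minimum for seed content
RECOMMENDED = 200    # recommended count for better accuracy

def check_seed_set(docs):
    """Validate a candidate seed set against the requirements above.
    `docs` is a list of dicts with a hypothetical 'author' field."""
    problems = []
    if len(docs) < MIN_EXAMPLES:
        problems.append(f"only {len(docs)} examples; need at least {MIN_EXAMPLES}")
    elif len(docs) < RECOMMENDED:
        problems.append(f"{len(docs)} examples meets the minimum, but {RECOMMENDED}+ is recommended")
    authors = {d["author"] for d in docs}
    if len(authors) < 5:  # arbitrary diversity threshold for this sketch
        problems.append("low author diversity; include more writers")
    return problems

# 60 documents written by only 3 people: enough volume, weak diversity
docs = [{"author": f"author{i % 3}"} for i in range(60)]
print(check_seed_set(docs))
```

Running this on the sample set flags both the below-recommended count and the low author diversity, which is exactly the kind of problem that shows up later as poor accuracy.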
Stage 2: Processing
After you submit the seed content, the classifier processes the examples and builds a prediction model. This takes 24-72 hours; there is no way to speed it up.
Stage 3: Testing
Provide both positive examples (more of the target type) and negative examples (documents that are NOT the target type). The classifier evaluates each and you review the results.
| Test Result | What It Means | Action |
|---|---|---|
| True positive | Correctly identified as the target type | Good; no action needed |
| True negative | Correctly identified as NOT the target type | Good; no action needed |
| False positive | Incorrectly flagged as the target type | Mark as "Not a match" to improve the model |
| False negative | Missed a real example of the target type | Mark as "Match" to improve the model |
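The four outcomes in the table combine into the standard evaluation metrics. A minimal sketch with made-up counts for illustration (for example, 46 true positives and 48 true negatives out of 100 test items gives 94% accuracy):

```python
def classifier_metrics(tp, tn, fp, fn):
    """Standard metrics from the four test outcomes."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total                      # overall correctness
    precision = tp / (tp + fp) if tp + fp else 0.0    # of flagged items, how many were real
    recall = tp / (tp + fn) if tp + fn else 0.0       # of real items, how many were caught
    return accuracy, precision, recall

# Hypothetical test run: 50 positive and 50 negative examples
acc, prec, rec = classifier_metrics(tp=46, tn=48, fp=2, fn=4)
print(f"accuracy={acc:.0%} precision={prec:.0%} recall={rec:.0%}")
```

Accuracy alone can mislead when positives are rare, so checking precision (false-positive control) and recall (false-negative control) separately tells you which kind of correction your "Match" / "Not a match" feedback should target.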
Stage 4: Publish
Once testing accuracy is acceptable, publish the classifier. It becomes available as a condition in DLP policies, auto-labeling rules, and retention labels, just like any SIT.
Scenario: Dr. Liam builds a clinical trial classifier
St. Harbour Health runs clinical trials. Trial documents vary widely (protocols, consent forms, adverse event reports, data collection forms) but they share characteristics: medical terminology, trial phase references, patient cohort language, regulatory citations.
No regex pattern can describe "a clinical trial document." Dr. Liam collects 200 examples from the clinical research team, uploads them to SharePoint, creates a custom classifier, waits 48 hours for processing, then tests with 50 positive and 50 negative examples.
Results: 94% accuracy. He publishes the classifier and uses it in a DLP policy to prevent clinical trial documents from being shared externally without approval.
Retraining classifiers
Over time, document formats evolve. A classifier trained on 2024-era contracts may not recognise 2026-era contracts with AI-generated clauses. Microsoft Purview allows retraining:
- Add new positive and negative examples
- Submit for reprocessing (another 24-72 hours)
- Re-test and validate accuracy
- Republish
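Purview does not trigger retraining for you, so deciding *when* to retrain is your job. A minimal monitoring sketch, where the sampling approach and the 90% threshold are assumptions chosen for illustration:

```python
def needs_retraining(recent_results, threshold=0.90):
    """recent_results: list of booleans, True where the classifier judged
    a manually-reviewed sample document correctly. Flag retraining when
    the rolling accuracy drops below the threshold."""
    if not recent_results:
        return False
    accuracy = sum(recent_results) / len(recent_results)
    return accuracy < threshold

# A classifier missing 20% of new-format documents: 80% rolling accuracy
sample = [True] * 80 + [False] * 20
print(needs_retraining(sample))  # True -> add examples and reprocess
```

Periodically spot-checking a sample of classified items and feeding the verdicts into a check like this turns "the classifier feels less accurate lately" into a concrete retraining trigger.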
Exam tip: the 50-document minimum
The exam frequently tests the minimum requirements for trainable classifiers. Key numbers:
- Seed content: minimum 50 positive examples (200+ recommended)
- Testing: provide both positive AND negative examples
- Processing time: 24-72 hours (not instant)
- Location: seed content must be in SharePoint Online (not OneDrive, not local files)
If a question asks "what is the minimum number of positive examples for a custom trainable classifier?" the answer is 50.
Monitor classification: Data Explorer and Content Explorer
Once your SITs and classifiers are running, you need visibility into what theyβre finding.
Content Explorer
Content Explorer lets you browse individual items that match a SIT or classifier. You can see exactly which documents contain sensitive data, where they live, and what was detected.
| Capability | What You Can Do |
|---|---|
| Browse by SIT/label | See all items matching a specific SIT or sensitivity label |
| View content | Open and inspect the actual document (with appropriate permissions) |
| Filter by location | Narrow to Exchange, SharePoint, OneDrive, or endpoints |
| Verify accuracy | Confirm that SITs are detecting the right content |
Who can access: Content Explorer Viewer role or Content Explorer List Viewer role (list viewer sees item count but cannot open content).
Data Explorer (Activity Explorer)
Data Explorer (also called Activity Explorer) shows what users are doing with classified data, as a timeline of activities:
| Activity Type | What It Shows |
|---|---|
| Label applied | A sensitivity label was added to a document |
| Label changed | A sensitivity label was changed or removed |
| DLP policy matched | Content triggered a DLP rule |
| File copied to USB | Endpoint DLP detected a file copy to removable media |
| File uploaded to cloud | Content was uploaded to a cloud service |
Scenario: Zara audits Atlas Global's classification
Zara Okonkwo at Atlas Global just rolled out new SITs for employee data and project codes. After two weeks, she opens Content Explorer to check:
- 65,000 items match the employee data SIT across SharePoint and OneDrive
- 12,000 items match the project code SIT, but 2,000 are in a public SharePoint site
- She drills into the public site items and finds project proposals that should be labelled Confidential
She then checks Activity Explorer and finds 45 instances where employees downloaded project documents to personal OneDrive, confirming she needs an Endpoint DLP policy next.
Marcus at NovaTech needs to classify internal R&D documents. These documents vary widely (some are research papers, some are experiment logs, some are patent drafts) but they all share technical language patterns specific to NovaTech's AI products. No built-in classifier covers this. What should Marcus do?
Zara at Atlas Global created a custom trainable classifier for employee performance reviews six months ago. Recently, Atlas Global switched to a new review format with AI-generated summary sections. The classifier is now missing 20% of new reviews. What should Zara do?
🎬 Video coming soon
Next up: Sensitivity Labels: Create & Protect. Now that you can find sensitive data, learn how to protect it with labels.