Data Classification & Sensitivity Labels
How Microsoft Purview finds sensitive data, classifies it with patterns and AI, and protects it with sensitivity labels that travel with the document.
You can’t protect what you don’t know about
Think of a hospital filing room.
Imagine thousands of patient files scattered across desks, drawers, and shared folders — some confidential, some routine, all mixed together. You can’t lock up the sensitive ones if you don’t know which files contain patient records.
Data classification is the process of finding sensitive data and labelling it, so you know what needs protection. Microsoft Purview does this automatically — scanning documents for credit card numbers, medical records, passport numbers, and more.
How Purview classifies data
Microsoft Purview relies on three building blocks. The first two find sensitive information; the third labels and protects it:
1. Sensitive information types (SITs)
SITs use pattern matching to detect specific data formats. Think of them as smart templates that recognise data patterns.
| SIT Category | Examples | How It Detects |
|---|---|---|
| Financial | Credit card numbers, bank account numbers | Number patterns + checksum validation |
| Personal ID | Social Security numbers, passport numbers | Format patterns + context keywords nearby |
| Health | Medical record numbers, drug names | Patterns + proximity to health-related terms |
| Custom | Your organisation’s patient ID format, internal codes | You define the pattern and keywords |
Key exam concept: SITs use patterns and keywords, not AI. They look for specific formats (like 16 digits starting with 4 for Visa cards) plus supporting evidence (like the word “Visa” nearby). Microsoft includes 300+ built-in SITs and you can create custom ones.
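To make the "pattern + checksum + nearby keyword" idea concrete, here is a minimal Python sketch. It is not how Purview is implemented: the keyword list, the 60-character proximity window, and the two-level confidence score are all assumptions chosen for illustration.

```python
import re

# Hypothetical supporting-evidence keywords; real SITs ship their own lists.
KEYWORDS = ("visa", "credit card", "card number", "cc#")
# 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")
PROXIMITY = 60  # assumed window (characters) searched around each match


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, char in enumerate(reversed(digits)):
        value = int(char)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[dict]:
    """Find 16-digit candidates, validate them, then look for keywords nearby."""
    results = []
    for match in CARD_PATTERN.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if not luhn_valid(digits):
            continue  # right shape, wrong checksum: probably not a card number
        start = max(0, match.start() - PROXIMITY)
        window = text[start:match.end() + PROXIMITY].lower()
        has_keyword = any(word in window for word in KEYWORDS)
        results.append({
            "digits": digits,
            "confidence": "high" if has_keyword else "low",
        })
    return results


print(find_card_numbers("Visa payment: 4111 1111 1111 1111 received"))
```

Notice that nothing here learns from examples. The pattern, the checksum, and the keywords are all fixed rules, which is the point the exam wants you to remember about SITs.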
2. Trainable classifiers
When patterns aren’t enough, trainable classifiers use machine learning to recognise types of documents based on their content — not just specific data formats.
Examples of built-in trainable classifiers:
- Legal documents — contracts, NDAs, settlement agreements
- Financial statements — balance sheets, income statements
- Resumes/CVs — candidate applications
- Source code — programming files
- Healthcare — clinical trial documentation, discharge summaries
Key exam concept: Trainable classifiers identify document types, not data patterns. A SIT finds a credit card number; a classifier recognises an entire document as a “financial statement.”
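The contrast with SITs is easier to see in code. The toy classifier below learns word frequencies from labelled sample documents and then guesses the type of a new one. It is a deliberately tiny naive Bayes sketch with made-up training text, not the model Purview uses; the class name and sample sentences are assumptions for illustration only.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


class TinyClassifier:
    """Toy multinomial naive Bayes; illustrative only."""

    def __init__(self) -> None:
        self.word_counts = defaultdict(Counter)  # label -> word frequencies
        self.doc_counts = Counter()              # label -> number of samples
        self.vocabulary = set()

    def train(self, samples: list[tuple[str, str]]) -> None:
        for text, label in samples:
            words = tokenize(text)
            self.word_counts[label].update(words)
            self.doc_counts[label] += 1
            self.vocabulary.update(words)

    def predict(self, text: str) -> str:
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # log prior plus log likelihood with add-one smoothing
            score = math.log(self.doc_counts[label] / total_docs)
            denominator = sum(counts.values()) + len(self.vocabulary)
            for word in words:
                score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training samples.
classifier = TinyClassifier()
classifier.train([
    ("This agreement binds the parties to confidentiality obligations", "legal"),
    ("The parties agree to settle all claims under this contract", "legal"),
    ("Balance sheet showing total assets liabilities and equity", "financial"),
    ("Income statement with revenue expenses and net profit", "financial"),
])

print(classifier.predict("Contract between the parties with confidentiality terms"))
```

There is no regular expression anywhere in this sketch. The classifier decides from the overall vocabulary of the document, which is why it can recognise a contract that contains no credit card number, SSN, or other fixed pattern.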
3. Sensitivity labels (classification + protection)
We’ll cover labels in detail below, but the key point: labels are how you classify AND protect data. SITs and classifiers find the data; labels tell the world what it is and enforce rules.
| Feature | Sensitive Information Types | Trainable Classifiers | Sensitivity Labels |
|---|---|---|---|
| What it does | Detects specific data patterns | Recognises document types using ML | Classifies AND protects data |
| How it works | Pattern matching + keywords | Machine learning models trained on examples | Metadata tags with enforced protection |
| Example | Finds a credit card number in a spreadsheet | Identifies a document as a legal contract | Marks a file as Confidential and encrypts it |
| Learns from examples? | No — uses fixed patterns | Yes — trained on sample documents | No — applied by users, policies, or auto-labelling |
Content Explorer and Activity Explorer
Once Purview classifies your data, you need visibility into what was found and what’s happening to it. That’s where the two Explorers come in.
Content Explorer
Content Explorer lets you browse the actual documents that contain sensitive data. Think of it as a search engine for classified content.
- See exactly which files contain credit card numbers, patient IDs, or other SIT matches
- Drill into a document to see the specific sensitive data detected
- Requires the Content Explorer List Viewer role to see the list of files and their locations, and the Content Explorer Content Viewer role to open their contents (not everyone should see the actual data)
Activity Explorer
Activity Explorer shows what actions are happening with classified and labelled content:
- A sensitivity label was applied or changed
- A file was shared externally
- A DLP policy was matched
- A labelled document was printed or copied to USB
| Feature | Content Explorer | Activity Explorer |
|---|---|---|
| What it answers | What sensitive data exists and where is it? | What are people doing with sensitive data? |
| Shows | Documents, the sensitive data found inside them, location | Actions: label applied, file shared, DLP match, download, print |
| Use case | Nadia needs to know how many files contain patient SSNs | Nadia needs to know if anyone shared labelled files externally this week |
| Permissions | Requires specific Content Explorer roles | Requires Activity Explorer role |
Scenario: Nadia investigates a data concern
MedGuard’s IT Director, Liam, reports that users have been sharing spreadsheets externally. Nadia investigates:
- Content Explorer — she searches for files containing “patient SSN” SIT matches. She finds 340 files across SharePoint and OneDrive
- Activity Explorer — she filters for “shared externally” actions on files with sensitivity labels. She spots 12 files shared with external partners in the last 30 days
Nadia now has the evidence to create a DLP policy targeting external sharing of patient data — and she knows exactly the scope of the problem.
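Nadia's two questions map to two different kinds of query. The Python sketch below runs them over made-up records to show the difference. The record shapes, field names, and action strings are hypothetical and are not Purview's export format.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class FileRecord:
    """Hypothetical shape of one Content Explorer result."""
    path: str
    sit_matches: dict  # SIT name -> number of matches in the file


@dataclass
class ActivityRecord:
    """Hypothetical shape of one Activity Explorer event."""
    path: str
    action: str
    label: str
    day: date


def files_with_sit(files, sit_name):
    """Content Explorer style question: which files contain this data?"""
    return [f.path for f in files if f.sit_matches.get(sit_name, 0) > 0]


def recent_activity(events, action, today, days=30):
    """Activity Explorer style question: what happened to labelled files lately?"""
    cutoff = today - timedelta(days=days)
    return [e for e in events
            if e.action == action and e.label and e.day >= cutoff]


# Made-up sample data.
files = [
    FileRecord("/sites/billing/patients.xlsx", {"Patient SSN": 12}),
    FileRecord("/sites/hr/menu.docx", {}),
]
events = [
    ActivityRecord("/sites/billing/patients.xlsx", "SharedExternally",
                   "Confidential", date(2024, 5, 20)),
    ActivityRecord("/sites/billing/patients.xlsx", "SharedExternally",
                   "Confidential", date(2024, 3, 1)),
    ActivityRecord("/sites/hr/menu.docx", "LabelApplied",
                   "General", date(2024, 5, 21)),
]
today = date(2024, 6, 1)

print(files_with_sit(files, "Patient SSN"))
print(len(recent_activity(events, "SharedExternally", today)))
```

The first function looks at data at rest (what exists and where), the second at events over time (who did what and when). That split is the same one the exam expects you to draw between the two Explorers.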
Sensitivity labels
Sensitivity labels are the action part of data classification. They don’t just tag data — they enforce protection.
What can a sensitivity label do?
| Protection | How It Works |
|---|---|
| Encrypt | Only authorised users can open the document, even if it’s leaked |
| Restrict access | Limit which users or groups can open the content and what they can do with it (for example, view only) |
| Visual markings | Add headers, footers, or watermarks (e.g., “CONFIDENTIAL” watermark) |
| Protect containers | Control settings on Teams, SharePoint sites, and Microsoft 365 Groups |
Key exam concept: Sensitivity labels travel with the document. If you email a labelled file to someone outside your organisation, the encryption and restrictions still apply. The label is embedded in the file’s metadata.
Label policies
Creating a label is only half the job. Label policies control how labels are published and enforced:
| Policy Setting | What It Does |
|---|---|
| Publish to users | Choose which users and groups see which labels |
| Default label | Automatically apply a label to new documents (e.g., “General” by default) |
| Mandatory labelling | Users must choose a label before saving — no unlabelled documents allowed |
| Auto-labelling | Automatically apply labels based on SIT matches (e.g., if a document contains 5+ credit card numbers, label it “Confidential”) |
| Justification for downgrade | If a user tries to lower a label from “Highly Confidential” to “General,” they must explain why |
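The default label, mandatory labelling, and downgrade justification settings can be read as checks that run when a user saves a document. Here is a minimal Python sketch under that assumption. The `LabelPolicy` class, the `label_on_save` function, and the priority numbers are hypothetical names invented for illustration, not a Purview API.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed ordering, lowest to highest sensitivity.
PRIORITY = {"Public": 0, "General": 1, "Confidential": 2, "Highly Confidential": 3}


@dataclass
class LabelPolicy:
    default_label: Optional[str] = "General"
    mandatory: bool = True
    require_downgrade_justification: bool = True


def label_on_save(policy: LabelPolicy, current: Optional[str],
                  requested: Optional[str], justification: str = "") -> Optional[str]:
    """Return the label to store, or raise ValueError if the policy blocks the save."""
    # The user's choice wins, then the existing label, then the policy default.
    chosen = requested or current or policy.default_label
    if chosen is None:
        if policy.mandatory:
            raise ValueError("A sensitivity label is required before saving")
        return None
    is_downgrade = current is not None and PRIORITY[chosen] < PRIORITY[current]
    if (is_downgrade and policy.require_downgrade_justification
            and not justification.strip()):
        raise ValueError("Justification required to lower the label")
    return chosen


policy = LabelPolicy()
print(label_on_save(policy, current=None, requested=None))
print(label_on_save(policy, "Highly Confidential", "General",
                    justification="Patient data removed"))
```

Calling `label_on_save(policy, "Highly Confidential", "General")` without a justification raises `ValueError`, which mirrors the prompt a user would see when lowering a label.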
Label priority
Labels have a priority order (set by the admin). Higher-priority labels override lower ones.
Example priority:
- Public (lowest)
- General
- Confidential
- Highly Confidential (highest)
If auto-labelling detects patient data and wants to apply “Highly Confidential” to a file that carries a lower-priority label applied automatically (for example, the default “General”), the higher-priority label replaces it. However, if a user manually applied “General,” auto-labelling will not override it by default; an admin must explicitly enable this in the auto-labelling policy settings.
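One way to read the priority rule: a higher-priority label replaces a lower one that was applied automatically, while a manually applied label stays unless the admin opts in. The Python sketch below models just that reading. The `override_manual` flag is a hypothetical stand-in for the admin setting, and the function is an illustration, not Purview's logic.

```python
from dataclasses import dataclass
from typing import Optional

# Example priority order, lowest to highest.
PRIORITY = ["Public", "General", "Confidential", "Highly Confidential"]


@dataclass
class ExistingLabel:
    name: str
    applied_manually: bool


def resolve_auto_label(existing: Optional[ExistingLabel], proposed: str,
                       override_manual: bool = False) -> str:
    """Return the label a file ends up with after auto-labelling runs."""
    if existing is None:
        return proposed
    if PRIORITY.index(proposed) <= PRIORITY.index(existing.name):
        return existing.name  # auto-labelling does not lower or re-apply a label
    if existing.applied_manually and not override_manual:
        return existing.name  # the user's choice is respected by default
    return proposed


default_general = ExistingLabel("General", applied_manually=False)
manual_general = ExistingLabel("General", applied_manually=True)

print(resolve_auto_label(default_general, "Highly Confidential"))
print(resolve_auto_label(manual_general, "Highly Confidential"))
print(resolve_auto_label(manual_general, "Highly Confidential", override_manual=True))
```

The three calls cover the cases worth memorising: an automatic label is upgraded, a manual label is kept, and a manual label is upgraded only once the override is switched on.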
Scenario: Nadia sets up labelling for MedGuard
Nadia configures MedGuard’s labelling strategy:
- Labels created: Public, General, Confidential, Highly Confidential - Patient Data
- Default label: “General” applied to all new documents
- Mandatory labelling: Enabled — staff must label every document before saving
- Auto-labelling: If a document contains 3+ patient SSN matches, automatically apply “Highly Confidential - Patient Data” (which adds encryption + “PATIENT DATA” watermark)
- Justification required: Anyone downgrading from “Highly Confidential” must provide a reason
Now every document at MedGuard is classified, and patient data gets automatic protection without relying on staff to remember.
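Nadia's auto-labelling rule boils down to a threshold check on SIT match counts. A minimal Python sketch of that rule follows. The class and function names are hypothetical, and the SIT name "Patient SSN" is a placeholder for whichever SIT MedGuard actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    name: str
    encrypt: bool = False
    watermark: str = ""


@dataclass(frozen=True)
class AutoLabelRule:
    sit_name: str
    min_matches: int
    label: Label


GENERAL = Label("General")
PATIENT = Label("Highly Confidential - Patient Data",
                encrypt=True, watermark="PATIENT DATA")
RULE = AutoLabelRule(sit_name="Patient SSN", min_matches=3, label=PATIENT)


def label_for(sit_matches: dict, rule: AutoLabelRule = RULE,
              default: Label = GENERAL) -> Label:
    """Apply the rule's label when the match count reaches the threshold."""
    if sit_matches.get(rule.sit_name, 0) >= rule.min_matches:
        return rule.label
    return default


print(label_for({"Patient SSN": 5}).name)
print(label_for({"Patient SSN": 1}).name)
```

Because the protection settings live on the label itself, the rule only has to pick the label. Encryption and the watermark follow from that choice.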
Exam tip: Labels vs SITs — the exam tests both
The exam often presents a scenario and asks whether to use a SIT, a classifier, or a label. Here’s a quick decision guide:
- “We need to detect credit card numbers in documents” → SIT (pattern detection)
- “We need to identify which documents are legal contracts” → Trainable classifier (ML-based)
- “We need to encrypt documents containing patient data” → Sensitivity label (protection)
- “We need to find credit card numbers AND encrypt the files” → SIT (to detect) + auto-labelling with a sensitivity label (to protect)
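The four mappings above can be written as a small lookup function, which is a handy way to quiz yourself. This Python sketch is purely a study aid; the function and parameter names are invented for illustration.

```python
def recommend(detect_pattern: bool = False,
              identify_document_type: bool = False,
              protect: bool = False) -> list[str]:
    """Map the requirements in an exam scenario to the Purview features to use."""
    tools = []
    if detect_pattern:
        tools.append("Sensitive information type")
    if identify_document_type:
        tools.append("Trainable classifier")
    if protect:
        # Protection triggered by a detected pattern means auto-labelling.
        if detect_pattern:
            tools.append("Auto-labelling with a sensitivity label")
        else:
            tools.append("Sensitivity label")
    return tools


print(recommend(detect_pattern=True))
print(recommend(detect_pattern=True, protect=True))
```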
Data Classification & Sensitivity Labels — SC-900 Domain 4.3
Knowledge Check
MedGuard needs to automatically detect patient Social Security numbers stored in SharePoint documents. Which classification method should Nadia configure?
Nadia wants to see which labelled documents were shared with external users in the last 30 days. Which tool should she use?
Nadia configures a sensitivity label called 'Highly Confidential - Patient Data' that encrypts the document and adds a watermark. A doctor applies this label to a patient report and emails it to an external specialist. What happens to the protection?