Sensitive Information Types and Data Classification
Create and manage sensitive information types using keywords, keyword lists, and regular expressions to automatically identify and classify sensitive data.
The foundation of data protection
Before you can protect sensitive data, you need to find it. Sensitive information types (SITs) are the search patterns that tell Microsoft Purview what to look for.
Think of SITs like the scanner at airport customs. You train it to recognise passport numbers, credit card numbers, and medical records. Once it knows what sensitive data looks like, it can flag it automatically, whether it’s in an email, a SharePoint document, or a Teams message.
Microsoft provides 300+ built-in SITs. For industry-specific data (patient IDs, internal codes), you create custom SITs.
Built-in vs custom SITs
The built-in SITs cover common data types from around the world:
| Category | Examples | Detection Method |
|---|---|---|
| Financial | Credit card numbers, bank account numbers, SWIFT codes | Pattern + checksum (e.g., Luhn for credit card numbers) + keywords |
| Health | Medical record numbers, drug names, ICD codes | Pattern + medical keyword lists |
| Identity | SSN, passport numbers, driver’s licence | Country-specific patterns + keywords |
| IT | Azure storage keys, connection strings | Pattern matching |
| Regional | NZ IRD numbers, Australian TFN, UK NINO | Country-specific formats |
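
The Luhn check mentioned in the table is a checksum, not a pattern: it confirms that a string of digits is a plausible card number after the pattern has matched. A minimal sketch of the standard algorithm follows; the function name `luhn_valid` is illustrative and this is not Purview's internal implementation.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces and dashes; keep only the digits.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for position, digit in enumerate(reversed(digits)):
        if position % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A 16-digit number that fails this check is probably not a card number, which is why pairing the checksum with the pattern cuts false positives.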
When you need custom SITs
Elena needs to detect MedGuard Health-specific data that no built-in SIT covers:
| Data Type | Format | Why Custom |
|---|---|---|
| Patient ID | MG- followed by 8 digits (e.g., MG-12345678) | Company-specific format |
| Internal drug codes | 3 letters + 4 digits (e.g., ASP1234) | Internal classification system |
| Referring doctor codes | DR- + 6 digits | Internal referral system |
Creating custom SITs
Custom SITs are created in the Microsoft Purview portal (purview.microsoft.com) under Information Protection > Classifiers > Sensitive info types.
Method 1: Keyword-based SIT
For simple text matching:
| Component | Example |
|---|---|
| Keyword list | “patient record”, “medical history”, “diagnosis report”, “treatment plan” |
| Case sensitive | No (recommended for most scenarios) |
| Word match | Whole word (prevents false positives from partial matches) |
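
To see why the whole-word setting matters, here is a rough sketch of case-insensitive, whole-word keyword matching using the keyword list from the table. The function `find_keywords` is hypothetical and only approximates the behaviour the settings describe.

```python
import re

KEYWORDS = ["patient record", "medical history", "diagnosis report", "treatment plan"]

def find_keywords(text, keywords=KEYWORDS, case_sensitive=False):
    """Return the keywords that occur in `text` as whole words."""
    flags = 0 if case_sensitive else re.IGNORECASE
    found = []
    for keyword in keywords:
        # \b on both sides enforces a whole-word match.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags):
            found.append(keyword)
    return found
```

With whole-word matching, "outpatient recording" does not count as a hit for "patient record", while "Patient Record" does when case sensitivity is off.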
Method 2: Regular expression SIT
For structured data patterns:
| Component | Example |
|---|---|
| Primary pattern | MG-\d{8} (matches MG- followed by 8 digits; add word boundaries, as in \bMG-\d{8}\b, so that a longer run of digits does not match) |
| Supporting keywords | “patient”, “record”, “MedGuard” (within 300 characters) |
| Confidence levels | High: pattern + 2 keywords. Medium: pattern + 1 keyword. Low: pattern only. |
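
The way the pattern, the supporting keywords, and the confidence levels combine can be sketched as follows. This is an illustration only: the `\b` anchors are an addition to the pattern, the 300-character window is applied on both sides of the match as an assumption, and `classify` is a hypothetical name rather than a Purview API.

```python
import re

# The table's pattern, with \b anchors added for this sketch.
PATIENT_ID = re.compile(r"\bMG-\d{8}\b")
SUPPORTING_KEYWORDS = ["patient", "record", "MedGuard"]
PROXIMITY = 300  # characters searched on either side of the match (assumed)

def count_supporting(window):
    """Count how many distinct supporting keywords appear in the window."""
    return sum(
        1
        for keyword in SUPPORTING_KEYWORDS
        if re.search(r"\b" + re.escape(keyword) + r"\b", window, re.IGNORECASE)
    )

def classify(text):
    """Return (matched ID, confidence) pairs for every patient ID in `text`."""
    results = []
    for match in PATIENT_ID.finditer(text):
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        hits = count_supporting(text[start:end])
        if hits >= 2:
            level = "high"
        elif hits == 1:
            level = "medium"
        else:
            level = "low"
        results.append((match.group(), level))
    return results
```

For example, `classify("Patient record MG-12345678 attached")` reports high confidence because two keywords sit next to the ID, while a bare `MG-12345678` with no nearby keywords is reported as low.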
Method 3: Keyword dictionary
For large keyword lists (up to 1 MB post-compression):
- Import from a file (one term per line)
- Useful for lists of drug names, medical terms, internal project codes
- More efficient than keyword lists for large volumes
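
One way to picture why a dictionary scales better than a keyword list: terms loaded into a set can be looked up per word, rather than scanning the text once per keyword. The sketch below assumes single-word terms and a one-term-per-line file; `load_dictionary` and `find_terms` are hypothetical helpers, and Purview's own dictionary format and matching are not shown here.

```python
import io
import re

def load_dictionary(lines):
    """Build a lowercase term set from an iterable of lines, skipping blanks."""
    return {line.strip().lower() for line in lines if line.strip()}

def find_terms(text, dictionary):
    """Return the dictionary terms found in `text`, sorted alphabetically."""
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    return sorted({token for token in tokens if token in dictionary})

# Usage with an in-memory stand-in for the imported file.
sample_file = io.StringIO("Aspirin\nIbuprofen\n\nMetformin\n")
drug_names = load_dictionary(sample_file)
found = find_terms("Prescribed metformin and ASPIRIN daily", drug_names)
```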
Exam tip: Confidence levels and false positives
The exam tests your understanding of confidence levels and their impact on DLP:
- High confidence: primary pattern + multiple pieces of supporting evidence. Few false positives, but may miss some real data.
- Medium confidence: primary pattern + some supporting evidence. Balanced.
- Low confidence: primary pattern alone. Catches more data, but with more false positives.
DLP policies can be configured to act on different confidence levels. For example: high confidence → block, medium confidence → warn, low confidence → log only. If the exam asks “Elena’s DLP policy is blocking too many legitimate emails,” the answer is likely to increase the required confidence level.
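
The tiered response described above can be modelled as a small lookup. This is a conceptual sketch, not DLP policy syntax: the action names and the `minimum` parameter are assumptions made for illustration.

```python
# Example mapping from the text: high -> block, medium -> warn, low -> log only.
ACTIONS = {"high": "block", "medium": "warn", "low": "log"}
ORDER = ["low", "medium", "high"]

def action_for(confidence, minimum="low"):
    """Return the action for a match, or "none" if it is below the minimum level."""
    if ORDER.index(confidence) < ORDER.index(minimum):
        return "none"
    return ACTIONS[confidence]
```

Raising `minimum` from "low" to "medium" means low-confidence matches no longer trigger anything, which is the effect of increasing the required confidence level in a policy that acts too often.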
Exact Data Match (EDM)
For the highest accuracy, EDM-based SITs match against your actual data:
- Upload a hashed version of your sensitive data (e.g., actual patient IDs from your database)
- Purview matches content against the hashed data
- Very few false positives: it flags only values that exist in your database
Elena uses EDM for patient IDs — instead of matching the pattern MG-\d{8} (which might match test data or random numbers), EDM matches only actual patient IDs from MedGuard’s patient database.
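
The idea behind EDM can be sketched as a two-step check: find candidates with the pattern, then keep only those whose hash appears in the uploaded set. This is a simplified illustration under stated assumptions. The salt value, the SHA-256 choice, and the helper names are hypothetical, and the real service involves a schema and an upload tool that are not modelled here.

```python
import hashlib
import re

SALT = b"example-salt"  # hypothetical; a real deployment manages its own salt
CANDIDATE = re.compile(r"\bMG-\d{8}\b")

def hash_value(value, salt=SALT):
    """Hash a normalised value so the raw data never needs to be stored."""
    normalised = value.strip().upper().encode("utf-8")
    return hashlib.sha256(salt + normalised).hexdigest()

def build_index(values):
    """Hash every known sensitive value into a lookup set."""
    return {hash_value(value) for value in values}

def edm_matches(text, index):
    """Return only the candidates whose hash exists in the index."""
    return [
        match.group()
        for match in CANDIDATE.finditer(text)
        if hash_value(match.group()) in index
    ]
```

With an index built from `["MG-12345678", "MG-87654321"]`, the text "IDs MG-12345678 and MG-00000000" yields only `MG-12345678`; the second ID fits the pattern but is not a real patient ID, so it is ignored.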
Deep dive: Trainable classifiers
Beyond pattern-based SITs, Microsoft Purview also offers trainable classifiers — machine learning models trained to recognise content types:
- Pre-trained classifiers — resumes, source code, financial statements, legal documents
- Custom trainable classifiers — trained with your own sample documents
Trainable classifiers work on content understanding (not just patterns) and are useful for unstructured data. The exam may ask about the difference: SITs match patterns, trainable classifiers match content types.
Knowledge check
Elena creates a custom SIT using a regex pattern to match MedGuard patient IDs (format: MG- followed by 8 digits). The DLP policy using this SIT generates many false positives from test documents containing similar patterns. What should Elena do to reduce false positives?
Dev needs to create a SIT that detects drug names for a pharmaceutical client. The client has a list of 15,000 drug names that changes quarterly. What is the most efficient approach?
Next up: Retention Labels and Data Lifecycle — keeping data for as long as you need it, and disposing of it when you don’t.