Know Your Data: Sensitive Info Types
Before you can protect sensitive data, you need to find it. Learn how Microsoft Purview uses sensitive information types to detect credit card numbers, patient IDs, tax file numbers, and anything else that matters to your organisation.
What are sensitive information types?
Think of a sniffer dog at an airport.
The dog does not read every bag tag or scan every passport. It sniffs for specific chemical signatures: explosives, drugs, currency. It knows exactly what pattern to look for, and when it detects a match, it alerts the handler.
Sensitive information types (SITs) work the same way for your data. They scan emails, documents, chats, and files looking for specific patterns: credit card numbers, tax IDs, patient records, passport numbers. When a SIT finds a match, it triggers a policy action: block it, warn the user, or log the event.
SITs are the foundation of everything in SC-401. Without them, DLP policies, sensitivity labels, and auto-labeling have nothing to detect.
Why classification comes first
Every protection feature in Microsoft Purview follows the same sequence:
Know → Detect → Protect → Monitor
| Step | What Happens | Purview Feature |
|---|---|---|
| Know | Understand what sensitive data your org handles | Risk assessment, data inventory |
| Detect | Find that data wherever it lives | Sensitive information types, classifiers |
| Protect | Apply controls: labels, encryption, DLP | Sensitivity labels, DLP policies |
| Monitor | Track what's happening to sensitive data | Activity Explorer, Content Explorer, Audit |
SITs handle step 2. Without detection, protection is guesswork.
Scenario: Priya's classification challenge
Priya Kapoor is the CISO at Meridian Financial, a 3,000-person investment bank. A recent audit found that trading floor analysts were emailing spreadsheets containing client account numbers and tax IDs to personal email addresses.
Before she can create DLP policies to block this, Priya needs to answer: what exactly counts as sensitive data at Meridian?
Her list includes: client account numbers (custom 8-digit format), tax file numbers (country-specific), credit card numbers, SWIFT codes, and internal deal codes. Some are covered by Microsoft's built-in SITs. Others need custom definitions.
Built-in vs custom sensitive info types
Microsoft ships over 300 built-in SITs that cover common patterns worldwide. But most organisations also have unique data formats.
| Feature | Built-in SITs | Custom SITs |
|---|---|---|
| Created by | Microsoft (shipped with every tenant) | Your admin team (you define the pattern) |
| Examples | Credit card number, SSN, passport number, IBAN, tax ID | Employee ID (EMP-XXXXX), internal project codes, custom account numbers |
| Detection method | Regex + keyword + checksum + proximity | Regex + keyword (you define the pattern) |
| Editable? | No. You cannot modify built-in definitions | Yes. Full control over patterns, keywords, confidence |
| Country-specific? | Yes. Many SITs are region-specific (e.g., Australia Tax File Number) | You decide. Create for any region or format |
| Confidence levels | Pre-configured (low, medium, high) | You define confidence levels based on supporting evidence |
How a SIT detects sensitive data
Every SIT uses a combination of techniques to reduce false positives:
1. Primary pattern (regex)
The main pattern that identifies the data. For a credit card number, this is typically a 16-digit number with specific spacing rules.
2. Supporting evidence (keywords)
Keywords near the pattern that increase confidence. Finding "4532 0123 4567 8901" near the word "Visa" or "card number" is stronger evidence than the number alone.
3. Checksum validation
Mathematical checks that confirm the number is structurally valid. Credit card numbers use the Luhn algorithm, so not every 16-digit number can be a valid card number.
4. Proximity rules
How close the supporting evidence must be to the primary pattern. Keywords within 300 characters of the number score higher than keywords 1,000 characters away.
5. Confidence levels
| Confidence | What It Means | Example |
|---|---|---|
| High (85-100%) | Strong match: multiple evidence elements found | 16-digit number + Luhn checksum + "Visa" keyword within 300 chars |
| Medium (75-84%) | Moderate match: some evidence present | 16-digit number + Luhn checksum, but no keywords nearby |
| Low (65-74%) | Weak match: pattern found but minimal context | 16-digit number alone, no checksum validation |
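To make the five techniques concrete, here is a minimal Python sketch of how a pattern, a checksum, nearby keywords, and a proximity window could combine into a confidence score. It is an illustration only, not how Purview implements its built-in credit card SIT: the regex, the keyword list, the function names, and the choice to return the bottom of each confidence band (65, 75, 85) are all assumptions made for this example.

```python
import re
from dataclasses import dataclass

# Assumed, simplified values for illustration; real SIT definitions are richer.
CARD_PATTERN = re.compile(r"\b\d{4}(?:[ -]?\d{4}){3}\b")  # primary pattern
KEYWORDS = ("visa", "mastercard", "card number", "credit card")  # supporting evidence
PROXIMITY = 300  # characters searched on either side of the match


def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return bool(digits) and total % 10 == 0


def has_nearby_keyword(text: str, start: int, end: int) -> bool:
    """Check whether any keyword appears within PROXIMITY characters of the match."""
    window = text[max(0, start - PROXIMITY): end + PROXIMITY].lower()
    return any(keyword in window for keyword in KEYWORDS)


@dataclass
class Match:
    value: str
    confidence: int  # 65 = low, 75 = medium, 85 = high


def detect_cards(text: str) -> list:
    """Find card-like numbers and score each one by the evidence around it."""
    matches = []
    for found in CARD_PATTERN.finditer(text):
        checksum_ok = luhn_valid(found.group())
        keyword_ok = has_nearby_keyword(text, found.start(), found.end())
        if checksum_ok and keyword_ok:
            confidence = 85  # pattern + checksum + keyword
        elif checksum_ok:
            confidence = 75  # pattern + checksum only
        else:
            confidence = 65  # pattern alone (checksum failed)
        matches.append(Match(found.group(), confidence))
    return matches
```

Calling `detect_cards("Please charge my Visa, card number 4111 1111 1111 1111, today.")` scores the number at 85, because 4111 1111 1111 1111 is a widely used test number that passes Luhn and the keywords sit inside the window. The same number with no keywords nearby scores 75, and a 16-digit number that fails the checksum scores 65.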
Exam tip: confidence levels and DLP
Confidence levels matter for DLP policy configuration. A DLP rule can trigger on high confidence only (fewer false positives, may miss some real data) or on medium and above (catches more, but more false alerts).
The exam tests whether you understand this trade-off. If a question asks how to reduce false positives in a DLP policy, increasing the required confidence level is often the answer.
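The trade-off can be seen with a toy example. The sketch below is not a Purview API: the `Detection` class, the sample data, and the `is_real_card` ground-truth flag are all hypothetical, and exist only to show how raising the minimum confidence changes the number of alerts, false positives, and missed items.

```python
from dataclasses import dataclass


@dataclass
class Detection:
    snippet: str
    confidence: int     # 65 = low, 75 = medium, 85 = high
    is_real_card: bool  # ground truth, known only in this toy example


def summarize(detections, min_confidence):
    """Count alerts, false positives, and misses for a given confidence threshold."""
    fired = [d for d in detections if d.confidence >= min_confidence]
    false_positives = sum(1 for d in fired if not d.is_real_card)
    missed = sum(
        1 for d in detections
        if d.is_real_card and d.confidence < min_confidence
    )
    return {"alerts": len(fired), "false_positives": false_positives, "missed": missed}


# Hypothetical sample of matches found in outgoing email.
SAMPLE = [
    Detection("invoice with card number and 'Visa' keyword", 85, True),
    Detection("spreadsheet cell, valid checksum, no keywords", 75, True),
    Detection("order reference that happens to pass Luhn", 75, False),
    Detection("16-digit tracking number", 65, False),
]

print(summarize(SAMPLE, 75))  # medium and above
print(summarize(SAMPLE, 85))  # high only
```

With this sample, a threshold of 75 fires three alerts, one of them a false positive, and misses nothing. Raising the threshold to 85 removes the false positive but also misses the real card number in the spreadsheet.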
Identifying sensitive information requirements
Before you touch Purview, you need to map your organisation's data landscape:
Step 1: Inventory your sensitive data
Work with legal, compliance, HR, and business units to identify:
- Regulatory requirements: GDPR personal data, HIPAA PHI, PCI-DSS cardholder data, SOX financial data
- Industry standards: banking account formats, medical record numbers, insurance claim IDs
- Internal policies: employee IDs, project codes, deal names, salary data
Step 2: Map to built-in SITs
For each data type, check if Microsoft already provides a built-in SIT:
- Go to Microsoft Purview portal → Data classification → Sensitive info types
- Search by name or country
- Review the pattern definition and test against sample data
Step 3: Identify gaps
Any data type not covered by built-in SITs needs one of:
- Custom SIT: for pattern-based data (Module 2)
- EDM classifier: for exact matches against a database (Module 3)
- Trainable classifier: for content that's hard to define by pattern, like contracts or resumes (Module 4)
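The three steps above can be sketched as a simple gap analysis. This is a planning aid written for this lesson, not a Purview feature: the `BUILT_IN_SITS` lookup is a small hand-picked subset used as an assumption (check the portal for the real list in your tenant), and the trait labels `"pattern"`, `"exact_match"`, and `"unstructured"` are hypothetical names for the three kinds of gap.

```python
# Assumed subset of built-in SIT names, keyed by a lowercase search term.
BUILT_IN_SITS = {
    "credit card number": "Credit card number",
    "swift code": "SWIFT code",
    "tax file number": "Australia tax file number",
}


def plan_classification(inventory):
    """Map each inventoried data type to a detection method.

    `inventory` maps a data type name to a trait describing the data:
    "pattern", "exact_match", or "unstructured" (hypothetical labels).
    """
    plan = {}
    for data_type, trait in inventory.items():
        built_in = BUILT_IN_SITS.get(data_type.lower())
        if built_in:
            plan[data_type] = f"Built-in SIT: {built_in}"
        elif trait == "pattern":
            plan[data_type] = "Custom SIT"
        elif trait == "exact_match":
            plan[data_type] = "EDM classifier"
        elif trait == "unstructured":
            plan[data_type] = "Trainable classifier"
        else:
            plan[data_type] = "Needs review"
    return plan


if __name__ == "__main__":
    inventory = {
        "Credit card number": "pattern",
        "Client account number": "pattern",
        "Prescription records": "exact_match",
        "Contracts": "unstructured",
    }
    for data_type, method in plan_classification(inventory).items():
        print(f"{data_type}: {method}")
```

The built-in lookup runs first on purpose: a custom definition is only worth building for data types that Microsoft does not already cover.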
Scenario: Dr. Liam's healthcare classification
Dr. Liam Chen is the IT Security Manager at St. Harbour Health, a 5,000-person healthcare network. His classification needs include:
- Protected health information (PHI) identifiers: covered by built-in SITs for many countries
- Medicare numbers: built-in SIT available (country-specific)
- Internal Medical Record Numbers (MRN-XXXXXXX): NOT covered. Needs a custom SIT.
- Clinical trial data: too varied for regex. Needs a trainable classifier.
- Prescription data: combination of drug names + patient info. Needs EDM matching.
Liam creates a classification plan that combines all four approaches: built-in SITs for standard patterns, custom SITs for internal formats, EDM for prescription data, and trainable classifiers for unstructured clinical content.
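Before building the custom SIT, Liam could prototype the MRN pattern against sample text. The sketch below assumes each X in MRN-XXXXXXX stands for a single digit, and the keyword list and 300-character window are hypothetical choices for illustration. It is a local test harness, not the Purview custom SIT definition itself.

```python
import re

# Assumption: each X in MRN-XXXXXXX is one digit, giving exactly seven digits.
MRN_PATTERN = re.compile(r"\bMRN-\d{7}\b")
# Hypothetical supporting keywords for a healthcare setting.
MRN_KEYWORDS = ("medical record", "patient", "admission")
WINDOW = 300  # characters searched on either side of the match


def find_mrns(text):
    """Return (value, confidence_label) pairs for MRN-like strings in `text`."""
    results = []
    lowered = text.lower()
    for match in MRN_PATTERN.finditer(text):
        window = lowered[max(0, match.start() - WINDOW): match.end() + WINDOW]
        # With no checksum available, keywords are the only supporting evidence.
        label = "high" if any(k in window for k in MRN_KEYWORDS) else "low"
        results.append((match.group(), label))
    return results


if __name__ == "__main__":
    print(find_mrns("Patient admitted, MRN-1234567 on file."))
    print(find_mrns("Ticket MRN-7654321 closed."))
```

Testing with both matching and near-miss samples (for example, an identifier with eight digits) helps confirm the pattern is neither too loose nor too strict before it goes into a policy.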
Where SITs are used across Purview
SITs don't work alone. They're the shared detection engine across multiple features:
| Purview Feature | How It Uses SITs |
|---|---|
| DLP policies | Conditions that trigger block/warn/audit actions |
| Sensitivity labels (auto-labeling) | Automatically apply labels when SITs are detected |
| Retention labels (auto-apply) | Automatically retain or dispose content containing SITs |
| Insider Risk Management | Detect when users interact with SIT-matching content |
| Content Explorer | Browse and inspect documents that match SITs |
| DSPM for AI | Monitor what sensitive data AI services can access |
Priya at Meridian Financial discovers that trading analysts are emailing client account numbers (a custom 8-digit format: MF-XXXXXX) to personal addresses. She wants DLP to detect these. Which approach should she take?
Dr. Liam at St. Harbour Health needs to classify three types of data: standard Medicare numbers, internal Medical Record Numbers (MRN-XXXXXXX), and unstructured clinical trial documents. Which combination of classification methods should he use?
A DLP policy at Meridian Financial is generating too many false positive alerts for credit card numbers. The policy currently triggers on medium confidence matches. What should Priya do to reduce false positives?
Next up: Custom Sensitive Info Types: Build Your Own β create your own detection patterns for organisation-specific data.