Know Your Data: Sensitive Info Types

What are sensitive information types?

Simple explanation

Think of a sniffer dog at an airport.

The dog does not read every bag tag or scan every passport. It sniffs for specific chemical signatures — explosives, drugs, currency. It knows exactly what pattern to look for, and when it detects a match, it alerts the handler.

Sensitive information types (SITs) work the same way for your data. They scan emails, documents, chats, and files looking for specific patterns — credit card numbers, tax IDs, patient records, passport numbers. When a SIT finds a match, it triggers a policy action: block it, warn the user, or log the event.

SITs are the foundation of everything in SC-401. Without them, DLP policies, sensitivity labels, and auto-labeling have nothing to detect.

Why classification comes first

Every protection feature in Microsoft Purview follows the same sequence:

Know → Detect → Protect → Monitor

Step	What Happens	Purview Feature
Know	Understand what sensitive data your org handles	Risk assessment, data inventory
Detect	Find that data wherever it lives	Sensitive information types, classifiers
Protect	Apply controls — labels, encryption, DLP	Sensitivity labels, DLP policies
Monitor	Track what’s happening to sensitive data	Activity Explorer, Content Explorer, Audit

SITs handle step 2. Without detection, protection is guesswork.

Scenario: Priya's classification challenge

Priya Kapoor is the CISO at Meridian Financial, a 3,000-person investment bank. A recent audit found that trading floor analysts were emailing spreadsheets containing client account numbers and tax IDs to personal email addresses.

Before she can create DLP policies to block this, Priya needs to answer: what exactly counts as sensitive data at Meridian?

Her list includes: client account numbers (custom 8-digit format), tax file numbers (country-specific), credit card numbers, SWIFT codes, and internal deal codes. Some are covered by Microsoft’s built-in SITs. Others need custom definitions.

Built-in vs custom sensitive info types

Microsoft ships over 300 built-in SITs that cover common patterns worldwide. But most organisations also have unique data formats.

Built-in SITs cover common patterns; custom SITs fill the gaps
Feature	Built-in SITs	Custom SITs
Created by	Microsoft — shipped with every tenant	Your admin team — you define the pattern
Examples	Credit card number, SSN, passport number, IBAN, tax ID	Employee ID (EMP-XXXXX), internal project codes, custom account numbers
Detection method	Regex + keyword + checksum + proximity	Regex + keyword (you define the pattern)
Editable?	No — you cannot modify built-in definitions	Yes — full control over patterns, keywords, confidence
Country-specific?	Yes — many SITs are region-specific (e.g., Australia Tax File Number)	You decide — create for any region or format
Confidence levels	Pre-configured (low, medium, high)	You define confidence levels based on supporting evidence
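
To make the custom column concrete, here is a minimal Python sketch of the kind of pattern a custom SIT encodes, using the EMP-XXXXX employee ID from the table. It assumes each X is a digit, which the table does not specify, and the names EMPLOYEE_ID and find_employee_ids are hypothetical. Purview evaluates patterns with its own engine, so treat this as a way to reason about a pattern before you build it, not as a Purview definition.

```python
import re

# Assumption: each X in EMP-XXXXX is a single digit (the format does not say).
# \b stops the pattern matching inside longer tokens such as "TEMP-123456".
EMPLOYEE_ID = re.compile(r"\bEMP-\d{5}\b")


def find_employee_ids(text: str) -> list[str]:
    """Return every substring that looks like an employee ID."""
    return EMPLOYEE_ID.findall(text)


sample = "Badge EMP-00421 was reissued; ticket TEMP-123456 is unrelated."
print(find_employee_ids(sample))  # ['EMP-00421']
```

The word boundaries matter: without them, the pattern would also fire on part of the unrelated ticket number, which is exactly the kind of false positive that supporting keywords and confidence levels are meant to reduce.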

How a SIT detects sensitive data

Every SIT uses a combination of techniques to reduce false positives:

1. Primary pattern (regex)

The main pattern that identifies the data. For a credit card number, this is typically a 16-digit number, written with or without spaces or dashes between the groups of digits. Some card brands use shorter numbers.

2. Supporting evidence (keywords)

Keywords near the pattern that increase confidence. Finding “4532 0123 4567 8901” near the word “Visa” or “card number” is stronger evidence than the number alone.

3. Checksum validation

Mathematical checks that confirm the number is structurally valid. Credit card numbers use the Luhn algorithm — not every 16-digit number is a real card number.
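
The Luhn check itself is short enough to sketch in Python. This is a generic implementation of the public algorithm, not Microsoft's code, and the function name luhn_valid is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Ignore spaces and dashes; keep only the digits.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as adding the two digits of the product
        total += digit
    return total % 10 == 0


print(luhn_valid("4111 1111 1111 1111"))  # True: a widely used test card number
print(luhn_valid("4111 1111 1111 1112"))  # False: last digit changed
```

Changing a single digit breaks the checksum, which is why a checksum removes most random 16-digit numbers (order IDs, tracking codes) from the match set.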

4. Proximity rules

How close the supporting evidence must be to the primary pattern. Keywords within 300 characters of the number score higher than keywords 1,000 characters away.
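
A proximity rule can be sketched as a character window around the match. The Python below is a simplification under stated assumptions: the card pattern is a loose 16-digit regex, the keyword list has only two entries, and a keyword counts only if it falls inside the window. The names CARD_PATTERN, KEYWORDS, and keyword_within are hypothetical.

```python
import re

# Assumptions: a loose 16-digit pattern and a two-item keyword list.
CARD_PATTERN = re.compile(r"\b(?:\d{4}[ -]?){3}\d{4}\b")
KEYWORDS = ("visa", "card number")


def keyword_within(text: str, match: re.Match, window: int = 300) -> bool:
    """Return True if any keyword occurs within `window` characters of the match."""
    start = max(0, match.start() - window)
    end = min(len(text), match.end() + window)
    nearby = text[start:end].lower()
    return any(keyword in nearby for keyword in KEYWORDS)


near = "Visa card on file: 4532 0123 4567 8901"
far = "Visa" + " " * 1000 + "4532 0123 4567 8901"

for text in (near, far):
    match = CARD_PATTERN.search(text)
    print(keyword_within(text, match))  # True for near, False for far
```

In the second string the keyword sits about 1,000 characters before the number, so it falls outside the 300-character window and contributes no supporting evidence.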

5. Confidence levels

Confidence	What It Means	Example
High (85-100%)	Strong match — multiple evidence elements found	16-digit number + Luhn checksum + “Visa” keyword within 300 chars
Medium (75-84%)	Moderate match — some evidence present	16-digit number + Luhn checksum, but no keywords nearby
Low (65-74%)	Weak match — pattern found but minimal context	16-digit number alone, no checksum validation
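
The table can be read as a small decision rule. The Python sketch below mirrors its three rows. It is illustrative only: real SITs define their own evidence combinations, and the function name confidence_level and its parameters are hypothetical.

```python
def confidence_level(pattern_found: bool, checksum_ok: bool, keyword_nearby: bool) -> str:
    """Map the evidence found to the levels in the confidence table."""
    if not pattern_found:
        return "none"  # no primary pattern means no match at all
    if checksum_ok and keyword_nearby:
        return "high"
    if checksum_ok:
        return "medium"
    # Assumption: a keyword without a valid checksum still counts as low,
    # because the table does not cover that combination.
    return "low"


print(confidence_level(True, True, True))    # high
print(confidence_level(True, True, False))   # medium
print(confidence_level(True, False, False))  # low
```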

Exam tip: confidence levels and DLP

Confidence levels matter for DLP policy configuration. A DLP rule can trigger on high confidence only (fewer false positives, may miss some real data) or on medium and above (catches more, but more false alerts).

The exam tests whether you understand this trade-off. If a question asks how to reduce false positives in a DLP policy, increasing the required confidence level is often the answer.
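
The trade-off is easy to see with a few hypothetical detections. In this Python sketch the Match class, the triggered function, and the sample data are all invented for illustration; the confidence values follow the percentages in the confidence table.

```python
from dataclasses import dataclass


@dataclass
class Match:
    value: str
    confidence: int  # percentage, as in the confidence table


def triggered(matches: list[Match], min_confidence: int) -> list[Match]:
    """Return the matches that meet or exceed the rule's minimum confidence."""
    return [m for m in matches if m.confidence >= min_confidence]


# Hypothetical detections: one strong, one mid-strength, one weak.
matches = [
    Match("card with keyword nearby", 85),
    Match("card without keyword", 75),
    Match("order number that resembles a card", 65),
]

print(len(triggered(matches, 75)))  # 2: medium and above
print(len(triggered(matches, 85)))  # 1: high only
```

Raising the threshold from 75 to 85 drops the keyword-less card from the alerts. That cuts noise, but if that card was real, the rule has now missed it.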

Identifying sensitive information requirements

Before you touch Purview, you need to map your organisation’s data landscape:

Step 1: Inventory your sensitive data

Work with legal, compliance, HR, and business units to identify:

Regulatory requirements — GDPR personal data, HIPAA PHI, PCI-DSS cardholder data, SOX financial data
Industry standards — banking account formats, medical record numbers, insurance claim IDs
Internal policies — employee IDs, project codes, deal names, salary data

Step 2: Map to built-in SITs

For each data type, check if Microsoft already provides a built-in SIT:

Go to Microsoft Purview portal → Data classification → Sensitive info types
Search by name or country
Review the pattern definition and test against sample data

Step 3: Identify gaps

Any data type not covered by built-in SITs needs one of:

Custom SIT — for pattern-based data (Module 2)
EDM classifier — for exact matches against a database (Module 3)
Trainable classifier — for content that’s hard to define by pattern, like contracts or resumes (Module 4)
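
The gap analysis in Steps 2 and 3 can be written as a simple decision helper. This Python sketch is one possible ordering of the checks, not an official Microsoft decision tree; the function name suggest_method and its parameters are hypothetical.

```python
def suggest_method(builtin_exists: bool, has_source_records: bool, has_fixed_pattern: bool) -> str:
    """Suggest a classification method for one data type."""
    if builtin_exists:
        return "built-in SIT"
    # Assumption: when an authoritative database of values exists,
    # exact matching is preferred over a pattern.
    if has_source_records:
        return "EDM classifier"
    if has_fixed_pattern:
        return "custom SIT"
    return "trainable classifier"


print(suggest_method(True, False, True))    # built-in SIT, e.g. credit card numbers
print(suggest_method(False, False, True))   # custom SIT, e.g. internal project codes
print(suggest_method(False, True, False))   # EDM classifier
print(suggest_method(False, False, False))  # trainable classifier, e.g. contracts
```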

Scenario: Dr. Liam's healthcare classification

Dr. Liam Chen is the IT Security Manager at St. Harbour Health, a 5,000-person healthcare network. His classification needs include:

Protected health information (PHI): covered by built-in SITs for many countries
Medicare numbers: built-in SIT available (country-specific)
Internal Medical Record Numbers (MRN-XXXXXXX): NOT covered. Needs a custom SIT.
Clinical trial data: too varied for regex. Needs a trainable classifier.
Prescription data: combination of drug names + patient info. Needs EDM matching.

Liam creates a classification plan that combines four approaches: built-in SITs for standard patterns, custom SITs for internal formats, EDM for prescription data, and trainable classifiers for unstructured clinical content.

Where SITs are used across Purview

SITs don’t work alone. They’re the shared detection engine across multiple features:

Purview Feature	How It Uses SITs
DLP policies	Conditions that trigger block/warn/audit actions
Sensitivity labels (auto-labeling)	Automatically apply labels when SITs are detected
Retention labels (auto-apply)	Automatically retain or dispose content containing SITs
Insider Risk Management	Detect when users interact with SIT-matching content
Content Explorer	Browse and inspect documents that match SITs
DSPM for AI	Monitor what sensitive data AI services can access

Question

What are the three main techniques a SIT uses to detect sensitive data?

Answer

1. Primary pattern (regex) — matches the data format. 2. Supporting evidence (keywords) — context near the pattern. 3. Checksum validation — mathematical verification the data is structurally valid. Together with proximity rules and confidence levels, these reduce false positives.

Question

What is the difference between a built-in SIT and a custom SIT?

Answer

Built-in SITs are pre-configured by Microsoft (300+ types) and cannot be modified. Custom SITs are created by your admin team with your own regex patterns, keywords, and confidence levels — used for organisation-specific data formats.

Question

If a 16-digit number is found in a document but no keywords like 'Visa' or 'card number' appear nearby, what confidence level would a credit card SIT typically assign?

Answer

Medium confidence. The pattern matches and the Luhn checksum passes, but the absence of supporting keywords reduces confidence from high to medium.

Knowledge Check

Priya at Meridian Financial discovers that trading analysts are emailing client account numbers (a custom 8-digit format: MF-XXXXXX) to personal addresses. She wants DLP to detect these. Which approach should she take?

Knowledge Check

Dr. Liam at St. Harbour Health needs to classify three types of data: standard Medicare numbers, internal Medical Record Numbers (MRN-XXXXXXX), and unstructured clinical trial documents. Which combination of classification methods should he use?

Knowledge Check

A DLP policy at Meridian Financial is generating too many false positive alerts for credit card numbers. The policy currently triggers on medium confidence matches. What should Priya do to reduce false positives?

Next up: Custom Sensitive Info Types: Build Your Own — create your own detection patterns for organisation-specific data.