Sensitive Information Types and Data Classification

The foundation of data protection

Simple explanation

Before you can protect sensitive data, you need to find it. Sensitive information types (SITs) are the search patterns that tell Microsoft Purview what to look for.

Think of SITs as the settings on a customs scanner at an airport. You train the scanner to recognise passport numbers, credit card numbers, and medical records. Once it knows what sensitive data looks like, it can flag it automatically, whether it’s in an email, a SharePoint document, or a Teams message.

Microsoft provides 300+ built-in SITs. For industry-specific data (patient IDs, internal codes), you create custom SITs.

Built-in vs custom SITs

The built-in SITs cover common data types from around the world:

Category	Examples	Detection Method
Financial	Credit card numbers, bank account numbers, SWIFT codes	Pattern + checksum (Luhn, for credit card numbers) + keywords
Health	Medical record numbers, drug names, ICD codes	Pattern + medical keyword lists
Identity	SSN, passport numbers, driver’s licence	Country-specific patterns + keywords
IT	Azure storage keys, connection strings	Pattern matching
Regional	NZ IRD numbers, Australian TFN, UK NINO	Country-specific formats
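
The Luhn check listed for credit card numbers is a checksum rather than a text pattern, which is why a random 16-digit string rarely passes it. The sketch below shows the standard Luhn algorithm in Python. It is an illustration only, not Purview's implementation, and the function name `luhn_valid` is hypothetical.

```python
def luhn_valid(number: str) -> bool:
    """Return True if the digits in `number` pass the Luhn checksum."""
    # Keep digits only, so spaces or hyphens in the input are ignored.
    digits = [int(ch) for ch in number if ch.isdigit()]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit and double every second digit.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0
```

A well-known test number such as `4111 1111 1111 1111` passes the check, while changing its last digit makes it fail.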

When you need custom SITs

Elena needs to detect MedGuard Health-specific data that no built-in SIT covers:

Data Type	Format	Why Custom
Patient ID	`MG-` followed by 8 digits (e.g., MG-12345678)	Company-specific format
Internal drug codes	3 letters + 4 digits (e.g., ASP1234)	Internal classification system
Referring doctor codes	`DR-` + 6 digits	Internal referral system

Creating custom SITs

Custom SITs are created in the Microsoft Purview portal (purview.microsoft.com) under Information Protection > Classifiers > Sensitive info types.

Method 1: Keyword-based SIT

For simple text matching:

Component	Example
Keyword list	“patient record”, “medical history”, “diagnosis report”, “treatment plan”
Case sensitive	No (recommended for most scenarios)
Word match	Whole word (prevents false positives from partial matches)
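
To make the effect of the case and whole-word settings concrete, here is a minimal Python sketch of case-insensitive, whole-word keyword matching. It only approximates the behaviour described in the table. The names `KEYWORDS` and `find_keywords` are hypothetical, and Purview's own matching engine may differ in detail.

```python
import re

KEYWORDS = ["patient record", "medical history", "diagnosis report", "treatment plan"]

def find_keywords(text, keywords=KEYWORDS):
    """Return the keywords found in `text` as whole words, ignoring case."""
    found = []
    for keyword in keywords:
        # \b anchors the match to word boundaries, so partial words are skipped.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            found.append(keyword)
    return found
```

With whole-word matching, "Treatment Plan" is detected but "treatment planning" is not, which is the false positive the Word match setting is meant to avoid.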

Method 2: Regular expression SIT

For structured data patterns:

Component	Example
Primary pattern	`MG-\d{8}` (matches MG- followed by exactly 8 digits)
Supporting keywords	“patient”, “record”, “MedGuard” (within 300 characters)
Confidence levels	High: pattern + 2 keywords. Medium: pattern + 1 keyword. Low: pattern only.
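
The table above can be sketched in Python as a primary regex plus a count of supporting keywords near each match. This is a simplified model for study purposes, not how Purview evaluates SITs internally. The `\b` word boundaries are an added assumption so that a 9-digit number is not accepted, and the names `classify`, `SUPPORTING_KEYWORDS`, and `PROXIMITY` are hypothetical.

```python
import re

# Primary pattern from the table, with word boundaries added (assumption).
PATIENT_ID = re.compile(r"\bMG-\d{8}\b")
SUPPORTING_KEYWORDS = ("patient", "record", "medguard")
PROXIMITY = 300  # characters searched on either side of the match

def classify(text):
    """Return (match, confidence) pairs for each patient ID found in `text`."""
    results = []
    for match in PATIENT_ID.finditer(text):
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        window = text[start:end].lower()
        # Count how many distinct supporting keywords appear near the match.
        hits = sum(1 for keyword in SUPPORTING_KEYWORDS if keyword in window)
        if hits >= 2:
            confidence = "high"
        elif hits == 1:
            confidence = "medium"
        else:
            confidence = "low"
        results.append((match.group(), confidence))
    return results
```

For example, "Patient record for MG-12345678" contains two supporting keywords and is rated high, while a bare "MG-12345678" with no nearby keywords is rated low.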

Method 3: Keyword dictionary

For large keyword lists (up to 1 MB post-compression):

Import from a file (one term per line)
Useful for lists of drug names, medical terms, internal project codes
More efficient than keyword lists for large volumes
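
As a rough illustration of the one-term-per-line format, the Python sketch below loads terms into a set and checks a text against it. It handles single-word terms only and says nothing about how Purview stores or compresses dictionaries. `load_terms` and `find_terms` are hypothetical helper names.

```python
import re

def load_terms(lines):
    """Build a lookup set from one-term-per-line input, skipping blank lines."""
    return {line.strip().lower() for line in lines if line.strip()}

def find_terms(text, terms):
    """Return the dictionary terms that appear as whole words in `text`."""
    words = re.findall(r"[A-Za-z0-9]+", text.lower())
    # Set membership keeps each lookup cheap even for thousands of terms.
    return sorted({word for word in words if word in terms})
```

In practice `lines` could be an open text file, for example `load_terms(open("drug_names.txt", encoding="utf-8"))`, where the file name is an assumption for illustration.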

Exam tip: Confidence levels and false positives

The exam tests your understanding of confidence levels and their impact on DLP:

High confidence: primary pattern plus multiple pieces of supporting evidence. Fewest false positives, but may miss some real data.
Medium confidence: primary pattern plus some supporting evidence. Balanced.
Low confidence: primary pattern alone. Catches more data, but with more false positives.

DLP policies can be configured to act on different confidence levels. For example: high confidence → block, medium confidence → warn, low confidence → log only. If the exam asks “Elena’s DLP policy is blocking too many legitimate emails,” the answer is likely to increase the required confidence level.
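
The trade-off in the exam tip can be modelled in a few lines of Python. This is a toy model under stated assumptions, not DLP policy syntax. The mapping mirrors the example in the text, and `CONFIDENCE_RANK`, `action_for`, and `count_blocked` are hypothetical names.

```python
CONFIDENCE_RANK = {"low": 1, "medium": 2, "high": 3}

# Example mapping from the text: high -> block, medium -> warn, low -> log only.
ACTION_BY_CONFIDENCE = {"high": "block", "medium": "warn", "low": "log"}

def action_for(confidence):
    """Return the example action for a match at the given confidence."""
    return ACTION_BY_CONFIDENCE[confidence]

def count_blocked(match_confidences, required_confidence):
    """Count matches a block rule would act on at a given minimum confidence."""
    threshold = CONFIDENCE_RANK[required_confidence]
    return sum(1 for c in match_confidences if CONFIDENCE_RANK[c] >= threshold)

matches = ["low", "low", "medium", "high"]
print(count_blocked(matches, "low"))   # 4
print(count_blocked(matches, "high"))  # 1
```

Raising the required confidence from low to high cuts the blocked items from four to one, which is the reasoning behind the exam answer.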

Exact Data Match (EDM)

For the highest accuracy, EDM-based SITs match against your actual data:

Upload a hashed version of your sensitive data (e.g., actual patient IDs from your database)
Purview matches content against the hashed data
Very few false positives: it only flags values that exist in your database

Elena uses EDM for patient IDs — instead of matching the pattern MG-\d{8} (which might match test data or random numbers), EDM matches only actual patient IDs from MedGuard’s patient database.
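
The idea behind EDM can be sketched in Python as a lookup against salted hashes instead of raw values. This is a conceptual illustration only. Purview's actual EDM uses its own schema, upload tooling, and hashing scheme, and the salt value, hash choice, and function names here are all assumptions.

```python
import hashlib
import re

SALT = b"example-salt"  # hypothetical salt, for illustration only

def hash_value(value, salt=SALT):
    """Return a salted SHA-256 hash of a normalised value."""
    normalised = value.strip().upper().encode("utf-8")
    return hashlib.sha256(salt + normalised).hexdigest()

def build_index(values):
    """Hash every real value once; only the hashes need to be stored."""
    return {hash_value(value) for value in values}

def edm_matches(text, index):
    """Return pattern candidates in `text` whose hash exists in the index."""
    candidates = re.findall(r"\bMG-\d{8}\b", text)
    return [c for c in candidates if hash_value(c) in index]
```

If the index is built from `["MG-12345678", "MG-87654321"]`, a document containing both `MG-00000000` (test data) and `MG-12345678` yields only the real ID, because the test value's hash is not in the index.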

Deep dive: Trainable classifiers

Beyond pattern-based SITs, Microsoft Purview also offers trainable classifiers — machine learning models trained to recognise content types:

Pre-trained classifiers — resumes, source code, financial statements, legal documents
Custom trainable classifiers — trained with your own sample documents

Trainable classifiers work on content understanding (not just patterns) and are useful for unstructured data. The exam may ask about the difference: SITs match patterns, trainable classifiers match content types.

Key concepts to remember

Question

What three components make up a sensitive information type?


Answer

1. Primary pattern (regex, keyword list, or function). 2. Corroborative evidence (supporting keywords within proximity). 3. Confidence level (Low/Medium/High based on how many evidence elements match). Higher confidence = fewer false positives.


Question

What is the difference between a keyword list and a keyword dictionary in Purview?


Answer

Keyword lists are small, inline collections of terms defined directly in the SIT. Keyword dictionaries are large, file-based collections (up to 1 MB post-compression) imported from a text file. Use dictionaries for drug names, medical terms, or other large reference lists.


Question

What is Exact Data Match (EDM) and when should you use it?


Answer

EDM-based SITs match content against a hashed copy of your actual sensitive data (e.g., real patient IDs from your database). This all but eliminates false positives because it only matches data that exists in your records. Use for high-value data where false positives are unacceptable.


Knowledge check


Elena creates a custom SIT using a regex pattern to match MedGuard patient IDs (format: MG- followed by 8 digits). The DLP policy using this SIT generates many false positives from test documents containing similar patterns. What should Elena do to reduce false positives?


Dev needs to create a SIT that detects drug names for a pharmaceutical client. The client has a list of 15,000 drug names that changes quarterly. What is the most efficient approach?

Next up: Retention Labels and Data Lifecycle — keeping data for as long as you need it, and disposing of it when you don’t.