Custom Sensitive Info Types: Build Your Own
When Microsoft's 300+ built-in detectors aren't enough, create custom sensitive information types (SITs) with your own regex patterns, keywords, and confidence levels to catch organisation-specific data.
Why build custom sensitive info types?
Imagine a metal detector at a museum.
The standard detector catches knives, guns, and obvious weapons. But this museum also has a priceless collection of rare coins — and visitors have been smuggling them out in their pockets. The standard detector does not know what a rare coin looks like.
So the museum builds a custom detector specifically tuned to the weight, size, and metal composition of their coins. Now both standard threats AND museum-specific threats are caught.
That’s what custom SITs do. Microsoft’s built-in types catch universal patterns (credit cards, passports). Custom SITs catch your organisation’s unique patterns — employee IDs, account numbers, project codes, internal reference numbers.
Anatomy of a custom SIT
Every custom SIT has four components that work together:
1. Primary element — the pattern
The core regex pattern that identifies the data. This is what the SIT looks for first.
| Data Example | Regex Pattern | What It Matches |
|---|---|---|
| MF-12345678 | MF-\d{8} | Meridian Financial account number |
| EMP-AB-1234 | EMP-[A-Z]{2}-\d{4} | Employee ID with department code |
| PRJ-2026-0042 | PRJ-\d{4}-\d{4} | Project reference number |
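The patterns in the table can be sanity-checked outside Purview before you paste them into a SIT definition. A minimal Python sketch (for simple patterns like these, Python's `re` syntax matches what Purview accepts; the sample values are the ones from the table):

```python
import re

# Candidate primary-element patterns from the table above
patterns = {
    "Meridian account number": r"MF-\d{8}",
    "Employee ID": r"EMP-[A-Z]{2}-\d{4}",
    "Project reference": r"PRJ-\d{4}-\d{4}",
}

samples = ["MF-12345678", "EMP-AB-1234", "PRJ-2026-0042"]

for name, pattern in patterns.items():
    # fullmatch ensures the whole sample matches, not just a prefix
    hits = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{name}: {hits}")
```

A quick loop like this catches typos in the pattern (wrong digit count, missing hyphen) in seconds, before the slower portal test cycle.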
2. Supporting elements — keywords and keyword lists
Keywords near the pattern increase detection confidence. You can use:
- Keyword lists — small sets of words defined inline (e.g., “account”, “client”, “portfolio”)
- Keyword dictionaries — large sets imported from a file (e.g., 10,000 medical terms, product names)
| Feature | Keyword List | Keyword Dictionary |
|---|---|---|
| Size | Up to ~2,000 terms | Up to 1 million terms, 100 KB file |
| Where defined | Inline in the SIT definition | Uploaded separately, referenced by SITs |
| Use case | Small supporting word sets | Large vocabularies — drug names, product codes, employee names |
| Editable | Edit the SIT directly | Update the dictionary file independently |
3. Confidence levels
Each pattern + evidence combination maps to a confidence level:
- Low confidence: Pattern match only, no supporting keywords
- Medium confidence: Pattern match + at least one supporting keyword
- High confidence: Pattern match + multiple keywords + additional evidence
You define what constitutes low, medium, and high for your custom SIT.
4. Character proximity
How close supporting keywords must be to the primary pattern. Default is 300 characters, but you can adjust from 1 to the entire document.
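How pattern, keywords, and proximity combine can be sketched as a small scoring function. This is an illustrative simplification, not Purview's actual matching algorithm; the 300-character window and the high/medium/low thresholds mirror the defaults and confidence definitions described above:

```python
import re

def confidence(text, pattern, keywords, proximity=300):
    """Classify a match as high/medium/low based on how many
    supporting keywords appear within `proximity` characters
    of the primary-element match. Returns None if no match."""
    match = re.search(pattern, text)
    if not match:
        return None  # no primary-element match at all
    start, end = match.span()
    # Window of `proximity` characters on either side of the match
    window = text[max(0, start - proximity):end + proximity].lower()
    found = sum(1 for kw in keywords if kw in window)
    if found >= 2:
        return "high"    # pattern + multiple keywords
    if found == 1:
        return "medium"  # pattern + one keyword
    return "low"         # pattern only

keywords = ["account", "client", "portfolio"]
print(confidence("Client account MF-12345678 closed.", r"MF-\d{8}", keywords))  # high
print(confidence("Ref MF-12345678 attached.", r"MF-\d{8}", keywords))           # low
```

Shrinking `proximity` makes the same keyword evidence count for less, which is why tightening proximity is one way to reduce false positives.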
Scenario: Priya builds Meridian's account SIT
Priya needs to detect Meridian Financial’s client account numbers (format: MF-XXXXXXXX) across all M365 workloads.
Her custom SIT definition:
- Primary pattern: MF-\d{8}
- Keywords: “account”, “client”, “portfolio”, “fund”, “investment”
- Proximity: 300 characters
- Confidence: High = pattern + 2 keywords, Medium = pattern + 1 keyword, Low = pattern only
She tests it against 50 sample documents: 47 true positives, 2 false negatives (account numbers in image-only PDFs — needs OCR), 1 false positive (a reference number that happened to start with MF-). She adjusts the regex to require exactly 8 digits after MF- and retests.
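Priya's tightening step can be reproduced with a quick local test. On its own, `MF-\d{8}` also matches the first eight digits of a longer number such as MF-123456789; adding a trailing boundary (here a negative lookahead, one illustrative way to do it in Python) requires exactly eight digits:

```python
import re

loose = r"MF-\d{8}"
strict = r"MF-\d{8}(?!\d)"  # no ninth digit may follow the match

text = "Ref MF-123456789 is an internal number, account MF-87654321 is real."

print(re.findall(loose, text))   # also grabs 8 digits of the 9-digit ref number
print(re.findall(strict, text))  # matches only the true 8-digit account
```

Check which boundary constructs your target regex engine supports before relying on them in a SIT definition; the point here is the testing habit, not the specific syntax.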
Creating a custom SIT — two methods
| Feature | Purview Portal (UI) | PowerShell |
|---|---|---|
| Best for | Simple SITs, visual editing, testing | Complex SITs, bulk creation, automation |
| Regex testing | Built-in test with sample data upload | Test separately, then deploy |
| Keyword dictionaries | Upload via portal | Import with New-DlpKeywordDictionary |
| Multiple patterns | Add via UI — one pattern per confidence level | Full XML control over multiple patterns |
| Versioning | No built-in version control | Export XML, store in source control |
| Cmdlet | N/A — web interface | New-DlpSensitiveInformationType |
Portal method (recommended for most admins)
1. Open the Microsoft Purview portal → Data classification → Classifiers → Sensitive info types
2. Click Create sensitive info type
3. Name and describe the SIT (this appears in policy selection dropdowns)
4. Add a pattern:
   - Define the primary element (regex)
   - Add supporting elements (keywords or keyword dictionaries)
   - Set confidence level and proximity
5. Test the SIT — upload sample content to check matches
6. Publish — the SIT is immediately available for policies
PowerShell method
```powershell
New-DlpSensitiveInformationType `
    -Name "Meridian Account Number" `
    -Description "Detects MF-XXXXXXXX format"
# ...
```
Exam tip: PowerShell cmdlets you should know
The exam expects you to recognise these PowerShell cmdlets:
- New-DlpSensitiveInformationType — create a custom SIT
- Set-DlpSensitiveInformationType — modify an existing custom SIT
- Remove-DlpSensitiveInformationType — delete a custom SIT
- Get-DlpSensitiveInformationType — list SITs (built-in and custom)
- New-DlpKeywordDictionary — create a keyword dictionary
- Set-DlpKeywordDictionary — update a keyword dictionary
You do NOT need to write PowerShell from scratch on the exam, but you should know which cmdlet does what.
Testing and tuning
Creating a custom SIT is only step one. Tuning it against real data is where the quality comes from.
Testing workflow
| Step | Action | Goal |
|---|---|---|
| 1. Upload samples | Provide documents that DO and DO NOT contain the target data | Establish baseline detection |
| 2. Check true positives | Verify matches are genuine | Confirm pattern accuracy |
| 3. Check false positives | Review flagged content that isn’t sensitive | Tighten regex or add more keywords |
| 4. Check false negatives | Find documents with sensitive data that weren’t flagged | Loosen regex or reduce confidence threshold |
| 5. Adjust and retest | Modify pattern, keywords, or confidence levels | Iterate until detection is accurate |
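For regex-based SITs, the workflow above can be partially automated: run the candidate pattern over labelled samples and count true positives, false positives, and false negatives. A minimal sketch (the sample texts and labels are illustrative, standing in for Priya's 50-document test set):

```python
import re

# Labelled samples: (text, actually_contains_sensitive_data)
samples = [
    ("Client account MF-12345678", True),
    ("Invoice for MF-00000001 attached", True),
    ("Meeting notes, nothing sensitive", False),
    ("Ticket MF-1234 raised", False),  # too few digits, must not match
]

pattern = re.compile(r"MF-\d{8}")

tp = fp = fn = 0
for text, is_sensitive in samples:
    flagged = bool(pattern.search(text))
    if flagged and is_sensitive:
        tp += 1      # correctly flagged
    elif flagged and not is_sensitive:
        fp += 1      # flagged but harmless -> tighten pattern
    elif not flagged and is_sensitive:
        fn += 1      # missed -> loosen pattern or add variants

print(f"TP={tp} FP={fp} FN={fn}")  # TP=2 FP=0 FN=0
```

Re-running this after each pattern tweak gives you the "adjust and retest" loop from step 5 without waiting on a portal upload each time.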
Common tuning moves
| Problem | Fix |
|---|---|
| Too many false positives | Add more supporting keywords (higher evidence requirement) |
| Too many false positives | Tighten regex pattern (more specific) |
| Too many false positives | Increase minimum confidence level in policy |
| Missing real data (false negatives) | Loosen regex (allow variations) |
| Missing real data (false negatives) | Add alternative keywords |
| Missing data in images | Enable OCR (Module 3) |
Scenario: Dr. Liam tunes the MRN detector
Dr. Liam created a custom SIT for Medical Record Numbers (MRN-XXXXXXX) at St. Harbour Health. Initial testing showed:
- True positives: 94% of MRNs detected correctly
- False positives: 3% — some internal IT ticket numbers (INC-XXXXXXX) were matching because the regex [A-Z]{3}-\d{7} was too broad
- False negatives: 3% — some older MRNs used lowercase formatting (mrn-1234567)
His fixes:
- Changed the regex from [A-Z]{3}-\d{7} to MRN-\d{7} — the specific prefix eliminates IT ticket matches
- Added a case-insensitive flag to catch lowercase variants
- Added keywords: “patient”, “medical record”, “admission”, “discharge”
- Retested: 99% true positive rate, 0.2% false positive rate
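Both of Dr. Liam's fixes can be verified with the same quick-test approach (Python's `re.IGNORECASE` stands in for the case-insensitive option in the SIT definition; the sample texts are illustrative):

```python
import re

broad = re.compile(r"[A-Z]{3}-\d{7}")            # original: matches INC- tickets too
fixed = re.compile(r"MRN-\d{7}", re.IGNORECASE)  # specific prefix, any case

texts = [
    "Patient MRN-1234567 admitted",     # genuine MRN
    "legacy record mrn-7654321",        # lowercase MRN (old false negative)
    "IT ticket INC-0042137 resolved",   # ticket number (old false positive)
]

for t in texts:
    print(f"broad={bool(broad.search(t))} fixed={bool(fixed.search(t))}  {t}")
```

The broad pattern misses the lowercase record and flags the IT ticket; the fixed pattern catches both MRN variants and ignores the ticket, which is exactly the shift Liam saw in his retest numbers.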
Custom SITs vs other classification methods
When should you use a custom SIT versus EDM or trainable classifiers?
| Feature | Custom SIT | EDM Classifier | Trainable Classifier |
|---|---|---|---|
| Best for | Predictable patterns (IDs, codes) | Exact values from a database | Unstructured content (contracts, resumes) |
| Detection method | Regex + keywords | Hash match against uploaded data | Machine learning from examples |
| Accuracy | High for pattern-based data | Very high — matches exact values | Good for content types, less precise for specific values |
| Maintenance | Update regex when formats change | Refresh data table regularly | Retrain when content types evolve |
| Setup effort | Low — define pattern in portal | Medium — prepare and upload data table | High — provide 50+ positive examples + seed |
| Covered in | This module | Module 3 | Module 4 |
Knowledge check
1. Zara at Atlas Global needs to detect employee IDs that follow the format ATL-XX-NNNN (where XX is a department code and NNNN is a 4-digit number). The SIT should have high confidence when the ID appears near words like 'employee', 'staff', or 'HR'. What should she create?
2. Marcus at NovaTech created a custom SIT for internal project codes (NV-YYYY-NNN). After deploying it in a DLP policy, he receives 200 alerts in the first day — mostly false positives from marketing blog URLs that contain similar patterns. What should he do FIRST?
Next up: EDM & Fingerprinting: Detect Exact Data — when patterns aren’t enough, match against your exact data.