Custom Sensitive Info Types: Build Your Own
When Microsoft's 300+ built-in detectors aren't enough, create custom sensitive information types (SITs) with your own regex patterns, keywords, and confidence levels to catch organisation-specific data.
Why build custom sensitive info types?
Imagine a metal detector at a museum.
The standard detector catches knives, guns, and obvious weapons. But this museum also has a priceless collection of rare coins — and visitors have been smuggling them out in their pockets. The standard detector does not know what a rare coin looks like.
So the museum builds a custom detector specifically tuned to the weight, size, and metal composition of their coins. Now both standard threats AND museum-specific threats are caught.
That’s what custom SITs do. Microsoft’s built-in types catch universal patterns (credit cards, passports). Custom SITs catch your organisation’s unique patterns — employee IDs, account numbers, project codes, internal reference numbers.
Anatomy of a custom SIT
Every custom SIT has four components that work together:
1. Primary element — the pattern
The core regex pattern that identifies the data. This is what the SIT looks for first.
| Data Example | Regex Pattern | What It Matches |
|---|---|---|
| MF-12345678 | MF-\d{8} | Meridian Financial account number |
| EMP-AB-1234 | EMP-[A-Z]{2}-\d{4} | Employee ID with department code |
| PRJ-2026-0042 | PRJ-\d{4}-\d{4} | Project reference number |
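The patterns in the table can be sanity-checked outside Purview before you paste them into a SIT definition. A minimal Python sketch (for simple patterns like these, Python's `re` syntax matches what Purview accepts; the sample values are the ones from the table):

```python
import re

# Candidate primary-element patterns from the table above
patterns = {
    "Meridian account number": r"MF-\d{8}",
    "Employee ID": r"EMP-[A-Z]{2}-\d{4}",
    "Project reference": r"PRJ-\d{4}-\d{4}",
}

samples = ["MF-12345678", "EMP-AB-1234", "PRJ-2026-0042"]

for name, pattern in patterns.items():
    # fullmatch ensures the whole sample matches, not just a prefix
    hits = [s for s in samples if re.fullmatch(pattern, s)]
    print(f"{name}: {hits}")
```

A quick loop like this catches typos in the pattern (wrong digit count, missing hyphen) in seconds, before the slower portal test cycle.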
2. Supporting elements — keywords and keyword lists
Keywords near the pattern increase detection confidence. You can use:
- Keyword lists — small sets of words defined inline (e.g., “account”, “client”, “portfolio”)
- Keyword dictionaries — large sets imported from a file (e.g., 10,000 medical terms, product names)
| Feature | Keyword List | Keyword Dictionary |
|---|---|---|
| Size | Up to ~2,000 terms | Up to 1 million terms, 100 KB file |
| Where defined | Inline in the SIT definition | Uploaded separately, referenced by SITs |
| Use case | Small supporting word sets | Large vocabularies — drug names, product codes, employee names |
| Editable | Edit the SIT directly | Update the dictionary file independently |
3. Confidence levels
Each pattern + evidence combination maps to a confidence level:
- Low confidence: Pattern match only, no supporting keywords
- Medium confidence: Pattern match + at least one supporting keyword
- High confidence: Pattern match + multiple keywords + additional evidence
You define what constitutes low, medium, and high for your custom SIT.
4. Character proximity
How close supporting keywords must be to the primary pattern. Default is 300 characters, but you can adjust from 1 to the entire document.
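How pattern, keywords, and proximity combine can be sketched as a small scoring function. This is an illustrative simplification, not Purview's actual matching algorithm; the 300-character window and the high/medium/low thresholds mirror the defaults and confidence definitions described above:

```python
import re

def confidence(text, pattern, keywords, proximity=300):
    """Classify a match as high/medium/low based on how many
    supporting keywords appear within `proximity` characters
    of the primary-element match. Returns None if no match."""
    match = re.search(pattern, text)
    if not match:
        return None  # no primary-element match at all
    start, end = match.span()
    # Window of `proximity` characters on either side of the match
    window = text[max(0, start - proximity):end + proximity].lower()
    found = sum(1 for kw in keywords if kw in window)
    if found >= 2:
        return "high"    # pattern + multiple keywords
    if found == 1:
        return "medium"  # pattern + one keyword
    return "low"         # pattern only

keywords = ["account", "client", "portfolio"]
print(confidence("Client account MF-12345678 closed.", r"MF-\d{8}", keywords))  # high
print(confidence("Ref MF-12345678 attached.", r"MF-\d{8}", keywords))           # low
```

Shrinking `proximity` makes the same keyword evidence count for less, which is why tightening proximity is one way to reduce false positives.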
Scenario: Priya builds Meridian's account SIT
Priya needs to detect Meridian Financial’s client account numbers (format: MF-XXXXXXXX) across all M365 workloads.
Her custom SIT definition:
- Primary pattern: MF-\d{8}
- Keywords: “account”, “client”, “portfolio”, “fund”, “investment”
- Proximity: 300 characters
- Confidence: High = pattern + 2 keywords, Medium = pattern + 1 keyword, Low = pattern only
She tests it against 50 sample documents: 47 true positives, 2 false negatives (account numbers in image-only PDFs — needs OCR), 1 false positive (a reference number that happened to start with MF-). She adjusts the regex to require exactly 8 digits after MF- and retests.
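Priya's tightening step can be reproduced with a quick local test. On its own, `MF-\d{8}` also matches the first eight digits of a longer number such as MF-123456789; adding a trailing boundary (here a negative lookahead, one illustrative way to do it in Python) requires exactly eight digits:

```python
import re

loose = r"MF-\d{8}"
strict = r"MF-\d{8}(?!\d)"  # no ninth digit may follow the match

text = "Ref MF-123456789 is an internal number, account MF-87654321 is real."

print(re.findall(loose, text))   # also grabs 8 digits of the 9-digit ref number
print(re.findall(strict, text))  # matches only the true 8-digit account
```

Check which boundary constructs your target regex engine supports before relying on them in a SIT definition; the point here is the testing habit, not the specific syntax.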
Creating a custom SIT — two methods
| Feature | Purview Portal (UI) | PowerShell |
|---|---|---|
| Best for | Simple SITs, visual editing, testing | Complex SITs, bulk creation, automation |
| Regex testing | Built-in test with sample data upload | Test separately, then deploy |
| Keyword dictionaries | Upload via portal | Import with New-DlpKeywordDictionary |
| Multiple patterns | Add via UI — one pattern per confidence level | Full XML control over multiple patterns |
| Versioning | No built-in version control | Export XML, store in source control |
| Cmdlet | N/A — web interface | New-DlpSensitiveInformationType |
Portal method (recommended for most admins)
1. Open the Microsoft Purview portal → Data classification → Classifiers → Sensitive info types
2. Click Create sensitive info type
3. Name and describe the SIT (this appears in policy selection dropdowns)
4. Add a pattern:
   - Define the primary element (regex)
   - Add supporting elements (keywords or keyword dictionaries)
   - Set confidence level and proximity
5. Test the SIT — upload sample content to check matches
6. Publish — the SIT is immediately available for policies
PowerShell method
```powershell
New-DlpSensitiveInformationType `
    -Name "Meridian Account Number" `
    -Description "Detects MF-XXXXXXXX format"
# ...
```
Exam tip: PowerShell cmdlets you should know
The exam expects you to recognise these PowerShell cmdlets:
- New-DlpSensitiveInformationType — create a custom SIT
- Set-DlpSensitiveInformationType — modify an existing custom SIT
- Remove-DlpSensitiveInformationType — delete a custom SIT
- Get-DlpSensitiveInformationType — list SITs (built-in and custom)
- New-DlpKeywordDictionary — create a keyword dictionary
- Set-DlpKeywordDictionary — update a keyword dictionary
You do NOT need to write PowerShell from scratch on the exam, but you should know which cmdlet does what.
Testing and tuning
Creating a custom SIT is only step one. Tuning it against real data is where the quality comes from.
Testing workflow
| Step | Action | Goal |
|---|---|---|
| 1. Upload samples | Provide documents that DO and DO NOT contain the target data | Establish baseline detection |
| 2. Check true positives | Verify matches are genuine | Confirm pattern accuracy |
| 3. Check false positives | Review flagged content that isn’t sensitive | Tighten regex or add more keywords |
| 4. Check false negatives | Find documents with sensitive data that weren’t flagged | Loosen regex or reduce confidence threshold |
| 5. Adjust and retest | Modify pattern, keywords, or confidence levels | Iterate until detection is accurate |
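For regex-based SITs, the workflow above can be partially automated: run the candidate pattern over labelled samples and count true positives, false positives, and false negatives. A minimal sketch (the sample texts and labels are illustrative, standing in for Priya's 50-document test set):

```python
import re

# Labelled samples: (text, actually_contains_sensitive_data)
samples = [
    ("Client account MF-12345678", True),
    ("Invoice for MF-00000001 attached", True),
    ("Meeting notes, nothing sensitive", False),
    ("Ticket MF-1234 raised", False),  # too few digits, must not match
]

pattern = re.compile(r"MF-\d{8}")

tp = fp = fn = 0
for text, is_sensitive in samples:
    flagged = bool(pattern.search(text))
    if flagged and is_sensitive:
        tp += 1      # correctly flagged
    elif flagged and not is_sensitive:
        fp += 1      # flagged but harmless -> tighten pattern
    elif not flagged and is_sensitive:
        fn += 1      # missed -> loosen pattern or add variants

print(f"TP={tp} FP={fp} FN={fn}")  # TP=2 FP=0 FN=0
```

Re-running this after each pattern tweak gives you the "adjust and retest" loop from step 5 without waiting on a portal upload each time.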
Common tuning moves
| Problem | Fix |
|---|---|
| Too many false positives | Add more supporting keywords (higher evidence requirement) |
| Too many false positives | Tighten regex pattern (more specific) |
| Too many false positives | Increase minimum confidence level in policy |
| Missing real data (false negatives) | Loosen regex (allow variations) |
| Missing real data (false negatives) | Add alternative keywords |
| Missing data in images | Enable OCR (Module 3) |
Scenario: Dr. Liam tunes the MRN detector
Dr. Liam created a custom SIT for Medical Record Numbers (MRN-XXXXXXX) at St. Harbour Health. Initial testing showed:
- True positives: 94% of MRNs detected correctly
- False positives: 3% — some internal IT ticket numbers (INC-XXXXXXX) were matching because the regex [A-Z]{3}-\d{7} was too broad
- False negatives: 3% — some older MRNs used lowercase formatting (mrn-1234567)
His fixes:
- Changed the regex from [A-Z]{3}-\d{7} to MRN-\d{7} — the specific prefix eliminates IT ticket matches
- Added a case-insensitive flag to catch lowercase variants
- Added keywords: “patient”, “medical record”, “admission”, “discharge”
- Retested: 99% true positive rate, 0.2% false positive rate
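Both of Dr. Liam's fixes can be verified with the same quick-test approach (Python's `re.IGNORECASE` stands in for the case-insensitive option in the SIT definition; the sample texts are illustrative):

```python
import re

broad = re.compile(r"[A-Z]{3}-\d{7}")            # original: matches INC- tickets too
fixed = re.compile(r"MRN-\d{7}", re.IGNORECASE)  # specific prefix, any case

texts = [
    "Patient MRN-1234567 admitted",     # genuine MRN
    "legacy record mrn-7654321",        # lowercase MRN (old false negative)
    "IT ticket INC-0042137 resolved",   # ticket number (old false positive)
]

for t in texts:
    print(f"broad={bool(broad.search(t))} fixed={bool(fixed.search(t))}  {t}")
```

The broad pattern misses the lowercase record and flags the IT ticket; the fixed pattern catches both MRN variants and ignores the ticket, which is exactly the shift Liam saw in his retest numbers.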
Custom SITs vs other classification methods
When should you use a custom SIT versus EDM or trainable classifiers?
| Feature | Custom SIT | EDM Classifier | Trainable Classifier |
|---|---|---|---|
| Best for | Predictable patterns (IDs, codes) | Exact values from a database | Unstructured content (contracts, resumes) |
| Detection method | Regex + keywords | Hash match against uploaded data | Machine learning from examples |
| Accuracy | High for pattern-based data | Very high — matches exact values | Good for content types, less precise for specific values |
| Maintenance | Update regex when formats change | Refresh data table regularly | Retrain when content types evolve |
| Setup effort | Low — define pattern in portal | Medium — prepare and upload data table | High — provide 50+ positive examples + seed |
| Covered in | This module | Module 3 | Module 4 |
Knowledge check
1. Zara at Atlas Global needs to detect employee IDs that follow the format ATL-XX-NNNN (where XX is a department code and NNNN is a 4-digit number). The SIT should have high confidence when the ID appears near words like 'employee', 'staff', or 'HR'. What should she create?
2. Marcus at NovaTech created a custom SIT for internal project codes (NV-YYYY-NNN). After deploying it in a DLP policy, he receives 200 alerts in the first day — mostly false positives from marketing blog URLs that contain similar patterns. What should he do FIRST?
Next up: EDM & Fingerprinting: Detect Exact Data — when patterns aren’t enough, match against your exact data.