SC-401 Study Guide

Domain 1: Implement Information Protection

  • Know Your Data: Sensitive Info Types Free
  • Custom Sensitive Info Types: Build Your Own Free
  • EDM & Fingerprinting: Detect Exact Data
  • Trainable Classifiers: AI-Powered Detection Free
  • Sensitivity Labels: Create & Protect Free
  • Sensitivity Labels: Publish & Auto-Apply
  • Email Encryption: Lock Down Messages
  • Purview IP Client: Classify Files at Scale

Domain 2: Implement DLP and Retention

  • DLP Foundations: Stop Data Leaks
  • DLP Policies: Build, Manage & Extend
  • DLP: Precedence & Adaptive Protection
  • Endpoint DLP: Setup & Configuration
  • Endpoint DLP: Advanced Rules & Monitoring
  • Retention: Plan Your Data Lifecycle
  • Retention Labels: Publish & Auto-Apply
  • Retention: Policies, Precedence & Recovery

Domain 3: Manage Risks, Alerts, and Activities

  • Insider Risk: Foundations & Setup
  • Insider Risk: Policies & Indicators
  • Insider Risk: Investigate & Close Cases
  • Adaptive Protection: Risk Levels Meet DLP
  • Purview Audit: Investigate & Retain
  • Activity Explorer & Content Search
  • Alert Response: Purview, XDR & Cloud Apps
  • DSPM for AI: Setup & Controls
  • DSPM for AI: Policies & Monitoring

Domain 1: Implement Information Protection (~14 min read)

Custom Sensitive Info Types: Build Your Own

When Microsoft's 300+ built-in detectors aren't enough, create custom sensitive information types with your own regex patterns, keywords, and confidence levels to catch organisation-specific data.

Why build custom sensitive info types?

☕ Simple explanation

Imagine a metal detector at a museum.

The standard detector catches knives, guns, and obvious weapons. But this museum also has a priceless collection of rare coins — and visitors have been smuggling them out in their pockets. The standard detector does not know what a rare coin looks like.

So the museum builds a custom detector specifically tuned to the weight, size, and metal composition of their coins. Now both standard threats AND museum-specific threats are caught.

That’s what custom SITs do. Microsoft’s built-in types catch universal patterns (credit cards, passports). Custom SITs catch your organisation’s unique patterns — employee IDs, account numbers, project codes, internal reference numbers.

Custom sensitive information types extend Microsoft Purview’s detection capabilities beyond the 300+ built-in types. You create them when your organisation has data formats that no built-in SIT covers — internal identifiers, proprietary codes, region-specific formats, or industry-specific patterns.

Custom SITs are defined using regular expressions (regex), keyword lists or dictionaries, confidence levels, and proximity rules. They can be created through the Microsoft Purview portal UI or via PowerShell (using New-DlpSensitiveInformationType). Once created, custom SITs appear alongside built-in types everywhere: DLP policies, sensitivity labels, auto-labeling, retention, and DSPM for AI.

Anatomy of a custom SIT

Every custom SIT has four components that work together:

1. Primary element — the pattern

The core regex pattern that identifies the data. This is what the SIT looks for first.

| Data Example | Regex Pattern | What It Matches |
| --- | --- | --- |
| MF-12345678 | MF-\d{8} | Meridian Financial account number |
| EMP-AB-1234 | EMP-[A-Z]{2}-\d{4} | Employee ID with department code |
| PRJ-2026-0042 | PRJ-\d{4}-\d{4} | Project reference number |
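A quick local check of the three patterns in the table can be sketched with Python's `re` module. This is only an approximation: Purview runs its own regex engine, and the `PATTERNS` dictionary, the `find_matches` helper, and the sample sentence are hypothetical names and data made up for illustration.

```python
import re

# Patterns copied from the table; the dictionary keys are illustrative labels.
PATTERNS = {
    "account_number": r"MF-\d{8}",
    "employee_id": r"EMP-[A-Z]{2}-\d{4}",
    "project_reference": r"PRJ-\d{4}-\d{4}",
}

def find_matches(text: str) -> dict[str, list[str]]:
    """Return every match of each pattern found in the text."""
    return {name: re.findall(pattern, text) for name, pattern in PATTERNS.items()}

# Made-up sample sentence containing one value of each format.
sample = "Client MF-12345678 was onboarded by EMP-AB-1234 under PRJ-2026-0042."
matches = find_matches(sample)
for name, found in matches.items():
    print(name, found)
```

Running patterns locally like this is a cheap way to catch obvious regex mistakes before uploading sample documents to the portal test.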

2. Supporting elements — keywords and keyword lists

Keywords near the pattern increase detection confidence. You can use:

  • Keyword lists — small sets of words defined inline (e.g., “account”, “client”, “portfolio”)
  • Keyword dictionaries — large sets imported from a file (e.g., 10,000 medical terms, product names)

| Feature | Keyword List | Keyword Dictionary |
| --- | --- | --- |
| Size | Up to ~2,000 terms | Up to 1 million terms, 100 KB file |
| Where defined | Inline in the SIT definition | Uploaded separately, referenced by SITs |
| Use case | Small supporting word sets | Large vocabularies: drug names, product codes, employee names |
| Editable | Edit the SIT directly | Update the dictionary file independently |

3. Confidence levels

Each pattern + evidence combination maps to a confidence level:

  • Low confidence: Pattern match only, no supporting keywords
  • Medium confidence: Pattern match + at least one supporting keyword
  • High confidence: Pattern match + multiple keywords + additional evidence

You define what constitutes low, medium, and high for your custom SIT.

4. Character proximity

Character proximity sets how close supporting keywords must be to the primary pattern to count as evidence. The default is 300 characters, and you can adjust it from 1 character up to the entire document.
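The effect of the proximity setting can be illustrated with a small sketch. It assumes the window extends the same number of characters on both sides of the match and uses plain substring matching for keywords; both are simplifications rather than a description of how Purview evaluates proximity. The `keywords_near` helper and the sample text are hypothetical.

```python
import re

def keywords_near(text: str, pattern: str, keywords: list[str],
                  proximity: int = 300) -> list[tuple[str, list[str]]]:
    """For each pattern match, list the keywords found within `proximity` characters."""
    lowered = text.lower()
    results = []
    for match in re.finditer(pattern, text):
        # Assumed window: `proximity` characters before and after the match.
        start = max(0, match.start() - proximity)
        end = min(len(text), match.end() + proximity)
        window = lowered[start:end]
        nearby = [kw for kw in keywords if kw.lower() in window]
        results.append((match.group(), nearby))
    return results

# "Account" sits more than 300 characters before the number; "client" sits right after it.
text = "Account summary. " + "x" * 400 + " MF-12345678 belongs to the client."
near = keywords_near(text, r"MF-\d{8}", ["account", "client"], proximity=300)
wide = keywords_near(text, r"MF-\d{8}", ["account", "client"], proximity=1000)
print(near)
print(wide)
```

With the 300-character window only "client" counts as supporting evidence; widening the window to 1,000 characters also picks up "account".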

💡 Scenario: Priya builds Meridian's account SIT

Priya needs to detect Meridian Financial’s client account numbers (format: MF-XXXXXXXX) across all M365 workloads.

Her custom SIT definition:

  • Primary pattern: MF-\d{8}
  • Keywords: “account”, “client”, “portfolio”, “fund”, “investment”
  • Proximity: 300 characters
  • Confidence: High = pattern + 2 keywords, Medium = pattern + 1 keyword, Low = pattern only

She tests it against 50 sample documents: 47 true positives, 2 false negatives (account numbers in image-only PDFs, which need OCR), and 1 false positive (a longer reference number that happened to start with MF-). She adjusts the regex so it matches only when exactly 8 digits follow MF-, for example by adding word boundaries so that longer digit runs are rejected, and retests.
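Priya's definition can be approximated locally as follows. This is a sketch under stated assumptions, not Purview's evaluation logic: "2 keywords" is read as at least two distinct keywords, the proximity window is applied on both sides of the match, and keywords are matched as plain substrings (so "fund" would also match inside "refund"). The `classify` function is a hypothetical helper.

```python
import re

# Values taken from Priya's SIT definition above.
PATTERN = re.compile(r"MF-\d{8}")
KEYWORDS = ["account", "client", "portfolio", "fund", "investment"]
PROXIMITY = 300

def classify(text: str) -> list[tuple[str, str]]:
    """Return (match, confidence) pairs following the scenario's confidence rules."""
    lowered = text.lower()
    results = []
    for match in PATTERN.finditer(text):
        start = max(0, match.start() - PROXIMITY)
        end = min(len(text), match.end() + PROXIMITY)
        window = lowered[start:end]
        # Count distinct supporting keywords inside the window.
        hits = sum(1 for kw in KEYWORDS if kw in window)
        if hits >= 2:
            confidence = "high"
        elif hits == 1:
            confidence = "medium"
        else:
            confidence = "low"
        results.append((match.group(), confidence))
    return results

print(classify("Client portfolio review for MF-12345678"))
print(classify("Please update account MF-87654321"))
print(classify("Ref MF-00000001"))
```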

Creating a custom SIT — two methods

Portal for simple SITs, PowerShell for complex or automated scenarios
| Feature | Purview Portal (UI) | PowerShell |
| --- | --- | --- |
| Best for | Simple SITs, visual editing, testing | Complex SITs, bulk creation, automation |
| Regex testing | Built-in test with sample data upload | Test separately, then deploy |
| Keyword dictionaries | Upload via portal | Import with New-DlpKeywordDictionary |
| Multiple patterns | Add via UI, one pattern per confidence level | Full XML control over multiple patterns |
| Versioning | No built-in version control | Export XML, store in source control |
| Cmdlet | N/A (web interface) | New-DlpSensitiveInformationType |

Portal method (recommended for most admins)

  1. Microsoft Purview portal → Data classification → Classifiers → Sensitive info types
  2. Click Create sensitive info type
  3. Name and describe the SIT (this appears in policy selection dropdowns)
  4. Add a pattern:
    • Define the primary element (regex)
    • Add supporting elements (keywords or keyword dictionaries)
    • Set confidence level and proximity
  5. Test the SIT — upload sample content to check matches
  6. Publish — the SIT is immediately available for policies

PowerShell method

New-DlpSensitiveInformationType `
  -Name "Meridian Account Number" `
  -Description "Detects MF-XXXXXXXX format"
  ...

💡 Exam tip: PowerShell cmdlets you should know

The exam expects you to recognise these PowerShell cmdlets:

  • New-DlpSensitiveInformationType — create a custom SIT
  • Set-DlpSensitiveInformationType — modify an existing custom SIT
  • Remove-DlpSensitiveInformationType — delete a custom SIT
  • Get-DlpSensitiveInformationType — list SITs (built-in and custom)
  • New-DlpKeywordDictionary — create a keyword dictionary
  • Set-DlpKeywordDictionary — update a keyword dictionary

You do NOT need to write PowerShell from scratch on the exam, but you should know which cmdlet does what.

Testing and tuning

Creating a custom SIT is only step one. Tuning it against real data is where the quality comes from.

Testing workflow

| Step | Action | Goal |
| --- | --- | --- |
| 1. Upload samples | Provide documents that DO and DO NOT contain the target data | Establish baseline detection |
| 2. Check true positives | Verify matches are genuine | Confirm pattern accuracy |
| 3. Check false positives | Review flagged content that isn’t sensitive | Tighten regex or add more keywords |
| 4. Check false negatives | Find documents with sensitive data that weren’t flagged | Loosen regex or reduce confidence threshold |
| 5. Adjust and retest | Modify pattern, keywords, or confidence levels | Iterate until detection is accurate |
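The bookkeeping behind steps 2 to 4 can be sketched as a simple tally over labelled samples. The `Sample` class, the file names, and the flags below are invented for illustration. Precision and recall are standard summary metrics added here for convenience; the workflow above does not name them.

```python
from dataclasses import dataclass

@dataclass
class Sample:
    name: str
    contains_target: bool  # ground truth: the document really holds the sensitive data
    flagged: bool          # outcome: the SIT flagged the document

def tally(samples: list[Sample]) -> dict[str, int]:
    """Count true/false positives and negatives across the test set."""
    counts = {"true_positive": 0, "false_positive": 0,
              "false_negative": 0, "true_negative": 0}
    for s in samples:
        if s.contains_target and s.flagged:
            counts["true_positive"] += 1
        elif s.flagged:
            counts["false_positive"] += 1
        elif s.contains_target:
            counts["false_negative"] += 1
        else:
            counts["true_negative"] += 1
    return counts

# Made-up test set: one document of each outcome.
samples = [
    Sample("statement.docx", True, True),
    Sample("scanned.pdf", True, False),
    Sample("newsletter.docx", False, True),
    Sample("memo.docx", False, False),
]
counts = tally(samples)
precision = counts["true_positive"] / (counts["true_positive"] + counts["false_positive"])
recall = counts["true_positive"] / (counts["true_positive"] + counts["false_negative"])
print(counts, precision, recall)
```

Re-running the same tally after each adjustment makes it easy to see whether a change traded false positives for false negatives.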

Common tuning moves

| Problem | Fix |
| --- | --- |
| Too many false positives | Add more supporting keywords (higher evidence requirement) |
| Too many false positives | Tighten regex pattern (more specific) |
| Too many false positives | Increase minimum confidence level in policy |
| Missing real data (false negatives) | Loosen regex (allow variations) |
| Missing real data (false negatives) | Add alternative keywords |
| Missing data in images | Enable OCR (Module 3) |

💡 Scenario: Dr. Liam tunes the MRN detector

Dr. Liam created a custom SIT for Medical Record Numbers (MRN-XXXXXXX) at St. Harbour Health. Initial testing showed:

  • True positives: 94% of MRNs detected correctly
  • False positives: 3% — some internal IT ticket numbers (INC-XXXXXXX) were matching because the regex [A-Z]{3}-\d{7} was too broad
  • False negatives: 3% — some older MRNs used lowercase formatting (mrn-1234567)

His fixes:

  1. Changed regex from [A-Z]{3}-\d{7} to MRN-\d{7} — specific prefix eliminates IT ticket matches
  2. Added case-insensitive flag to catch lowercase variants
  3. Added keywords: “patient”, “medical record”, “admission”, “discharge”
  4. Retested: 99% true positive rate, 0.2% false positive rate
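The first two fixes can be reproduced locally with a short sketch. Python's `re.IGNORECASE` stands in here for whatever case-insensitivity option the SIT definition uses, which is an assumption; the three sample strings mirror the formats mentioned in the scenario.

```python
import re

# Original, overly broad pattern and the tightened, case-insensitive replacement.
BROAD = re.compile(r"[A-Z]{3}-\d{7}")
SPECIFIC = re.compile(r"MRN-\d{7}", re.IGNORECASE)

# A current MRN, an IT ticket number, and an older lowercase MRN.
samples = ["MRN-1234567", "INC-7654321", "mrn-1234567"]

broad_hits = [s for s in samples if BROAD.search(s)]
specific_hits = [s for s in samples if SPECIFIC.search(s)]
print(broad_hits)
print(specific_hits)
```

The broad pattern flags the IT ticket and misses the lowercase MRN, while the specific pattern catches both MRN variants and ignores the ticket.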

Custom SITs vs other classification methods

When should you use a custom SIT versus EDM or trainable classifiers?

Choose your classification method based on the data type
| Feature | Custom SIT | EDM Classifier | Trainable Classifier |
| --- | --- | --- | --- |
| Best for | Predictable patterns (IDs, codes) | Exact values from a database | Unstructured content (contracts, resumes) |
| Detection method | Regex + keywords | Hash match against uploaded data | Machine learning from examples |
| Accuracy | High for pattern-based data | Very high (matches exact values) | Good for content types, less precise for specific values |
| Maintenance | Update regex when formats change | Refresh data table regularly | Retrain when content types evolve |
| Setup effort | Low: define pattern in portal | Medium: prepare and upload data table | High: provide 50+ positive examples + seed |
| Covered in | This module | Module 3 | Module 4 |

Question

What are the four components of a custom sensitive information type?

Answer

1. Primary element (regex pattern). 2. Supporting elements (keywords or keyword dictionaries). 3. Confidence levels (how much evidence is required for low/medium/high confidence). 4. Character proximity (how close keywords must be to the pattern).

Question

What is the difference between a keyword list and a keyword dictionary?

Answer

A keyword list is a small set of terms defined inline within the SIT (up to ~2,000 terms). A keyword dictionary is a large file-based vocabulary (up to 1 million terms, 100 KB) uploaded separately and referenced by multiple SITs. Use dictionaries for large vocabularies like drug names or product codes.

Question

Which PowerShell cmdlet creates a new custom sensitive information type?

Answer

New-DlpSensitiveInformationType. Use Set-DlpSensitiveInformationType to modify existing SITs and Remove-DlpSensitiveInformationType to delete them.

Question

A custom SIT is detecting too many false positives. Name two ways to reduce them.

Answer

1. Add more supporting keywords so the SIT requires more evidence (higher confidence). 2. Tighten the regex pattern to be more specific (e.g., require exact prefix instead of wildcard). Also consider increasing the required confidence level in the DLP policy that uses the SIT.

Knowledge Check

Zara at Atlas Global needs to detect employee IDs that follow the format ATL-XX-NNNN (where XX is a department code and NNNN is a 4-digit number). The SIT should have high confidence when the ID appears near words like 'employee', 'staff', or 'HR'. What should she create?

Knowledge Check

Marcus at NovaTech created a custom SIT for internal project codes (NV-YYYY-NNN). After deploying it in a DLP policy, he receives 200 alerts in the first day — mostly false positives from marketing blog URLs that contain similar patterns. What should he do FIRST?

Next up: EDM & Fingerprinting: Detect Exact Data — when patterns aren’t enough, match against your exact data.

© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.