EDM & Fingerprinting: Detect Exact Data

Beyond patterns — matching real data

Simple explanation

Think about the difference between spotting a passport and spotting YOUR passport.

A regex-based sensitive information type (SIT) is like a border guard who knows what a passport looks like: the right size, the right colour, a photo, machine-readable text. The guard can spot any passport, but cannot tell whether a specific passport belongs to a specific person on a watchlist.

Exact data match (EDM) is like a guard with a database of everyone on the watchlist. When they scan a passport, they compare it directly against the database. If the name and number match an entry, they know exactly who it is — not just “this is a passport” but “this is John Smith’s passport.”

Document fingerprinting is different again — it learns the template of a specific form (like a tax return or patent application) and detects any document that matches that form’s structure, even if the content is different.

Exact data match (EDM) — detect your actual data

EDM is typically the most precise classification method in Purview because it matches known values rather than patterns. Instead of “find anything that looks like a credit card number,” EDM says “find these specific credit card numbers from our customer database.”

How EDM works

Step	What Happens
1. Define the schema	Create an EDM schema that describes your data table columns (e.g., Name, SSN, Account Number)
2. Prepare the data	Export your sensitive data to a CSV/TSV file (the “sensitive information source table”)
3. Hash and upload	The EDM Upload Agent hashes the data locally (it never sends plaintext to the cloud) and uploads the hashes
4. Create the EDM SIT	Define which columns are primary (must match) and which are corroborative (supporting evidence)
5. Detection	When content is scanned, Purview hashes values in documents and compares against your uploaded hashes
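
The hash-and-compare idea in steps 3 and 5 can be sketched in a few lines of Python. This is only an illustration, not the EDM Upload Agent's actual behaviour: the salt value, the `normalize` rule, and the whitespace tokenisation are all assumptions made for the example, and every function name here is hypothetical.

```python
import hashlib

SALT = b"example-salt"  # hypothetical salt, used only for this illustration


def normalize(value: str) -> str:
    """Trim and lowercase so trivially different spellings hash the same."""
    return value.strip().lower()


def hash_value(value: str, salt: bytes = SALT) -> str:
    """Return the salted SHA-256 hex digest of a normalized value."""
    return hashlib.sha256(salt + normalize(value).encode("utf-8")).hexdigest()


def build_hash_table(rows: list[dict[str, str]], column: str) -> set[str]:
    """Hash one column of the source table locally; only digests are returned."""
    return {hash_value(row[column]) for row in rows}


def find_matches(text: str, hashed_values: set[str]) -> list[str]:
    """Hash each whitespace-separated token and keep those found in the table."""
    return [tok for tok in text.split() if hash_value(tok) in hashed_values]


source_rows = [
    {"AccountNumber": "MF-12345678", "Name": "John Smith"},
    {"AccountNumber": "MF-87654321", "Name": "Jane Doe"},
]
uploaded = build_hash_table(source_rows, "AccountNumber")  # what would leave the server
hits = find_matches("Please review account MF-12345678 today", uploaded)
print(hits)  # ['MF-12345678']
```

The point of the sketch is that `uploaded` contains digests only, so the comparison at detection time never needs the plaintext source table.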

Primary vs corroborative elements

Element Type	Role	Example
Primary	Must match for detection to trigger	Patient ID, Account Number, SSN
Corroborative	Supporting evidence — increases confidence	Patient Name, Date of Birth, Address

A match requires at least one primary element. Corroborative elements boost confidence and reduce false positives.
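
As a rough illustration of how primary and corroborative elements interact, the sketch below reports a detection only when the primary value is present, then raises the confidence when supporting values appear in the same text. The `Record` class, the `evaluate` function, and the "high"/"low" labels are hypothetical; the sketch also compares plaintext for readability, whereas EDM compares hashes.

```python
from dataclasses import dataclass


@dataclass
class Record:
    account_number: str  # primary element
    name: str            # corroborative element
    date_of_birth: str   # corroborative element


def evaluate(text: str, records: list[Record]) -> list[tuple[str, str]]:
    """Return (account_number, confidence) for each record whose primary element is in the text."""
    results = []
    lowered = text.lower()
    for rec in records:
        if rec.account_number.lower() not in lowered:
            continue  # no primary match, so no detection at all
        supporting = sum(
            1 for value in (rec.name, rec.date_of_birth) if value.lower() in lowered
        )
        # Assumed rule for this example: any corroborative hit means high confidence.
        confidence = "high" if supporting >= 1 else "low"
        results.append((rec.account_number, confidence))
    return results


records = [Record("MF-12345678", "John Smith", "1980-01-01")]
print(evaluate("John Smith, account MF-12345678", records))  # primary + corroborative
print(evaluate("John Smith called this morning", records))   # corroborative only
```

The second call returns an empty list: a name on its own never triggers a match, which is what keeps false positives down.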

Scenario: Priya deploys EDM for client accounts

Meridian Financial has 45,000 client accounts. Priya wants DLP to detect when any actual client’s data appears in an email or document — not just any 8-digit number, but specifically numbers that belong to real clients.

She configures EDM:

Schema: ClientAccountNumber (primary), ClientName (corroborative), TaxID (primary)
Data source: Nightly export from the client management system (45,000 rows)
Hash schedule: Daily refresh via the EDM Upload Agent
Result: DLP now catches “John Smith, account MF-12345678” with near-zero false positives — because it matches against the real client database, not just the pattern.

EDM requirements and limitations

Requirement	Detail
Maximum rows	Up to 100 million rows per data table
Maximum columns	Up to 32 columns per schema
Maximum table size	Uncompressed data up to 32 GB
Refresh frequency	Can refresh up to twice per day
Hash algorithm	SHA-256 — data is hashed locally before upload
Upload tool	EDM Upload Agent installed on a Windows server with access to the data source
Licensing	Requires Microsoft 365 E5, E5 Compliance, or E5 Information Protection
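
Before hashing, it can help to check an export against the row and column limits in the table above. The following is a minimal pre-flight sketch, not part of any Microsoft tooling; `check_export` and the sample data are hypothetical.

```python
import csv
import io

MAX_ROWS = 100_000_000  # row limit from the table above
MAX_COLUMNS = 32        # column limit from the table above


def check_export(handle, max_rows=MAX_ROWS, max_columns=MAX_COLUMNS) -> list[str]:
    """Return a list of problems found in a CSV export; an empty list means it passed."""
    reader = csv.reader(handle)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    if len(header) > max_columns:
        problems.append(f"{len(header)} columns exceeds limit of {max_columns}")
    row_count = 0
    for record_no, row in enumerate(reader, start=2):
        row_count += 1
        if len(row) != len(header):
            problems.append(
                f"record {record_no}: expected {len(header)} fields, got {len(row)}"
            )
    if row_count > max_rows:
        problems.append(f"{row_count} rows exceeds limit of {max_rows}")
    return problems


# Made-up sample export with the column names from Priya's scenario.
sample = io.StringIO(
    "ClientAccountNumber,ClientName,TaxID\n"
    "MF-12345678,John Smith,000-00-0000\n"
)
print(check_export(sample))  # []
```

In practice the function would be given an open file handle for the nightly export rather than an in-memory string.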

Exam tip: EDM hashing is done locally

A common exam question tests whether you understand where hashing happens. The EDM Upload Agent hashes your sensitive data on-premises (or on your designated server) before uploading. Plaintext sensitive data is never sent to the Microsoft cloud. Only SHA-256 hashes are uploaded.

This is a critical data residency and privacy feature. If a question asks about the security of EDM, remember: hashing is local, only hashes are stored in the cloud.

Document fingerprinting — detect forms by structure

Document fingerprinting converts a blank form or template into a SIT based on its text structure (the “word pattern”). Any document that matches the template’s structure triggers detection.

How it works

Upload a blank template — e.g., a blank patent application, tax form, or new hire form
Purview analyses the word pattern — it identifies the unique combination of text elements that define the form’s structure
Creates a SIT — the fingerprint becomes a SIT you can use in DLP policies
Detection — when any document matches the word pattern, it’s flagged
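
To make the "word pattern" idea concrete, here is a toy sketch that treats a template as its set of distinct words and measures how many of them reappear in a candidate document. This is not Purview's actual fingerprinting algorithm; the function names, the 0.8 threshold, and the sample texts are assumptions chosen for illustration.

```python
import re


def words(text: str) -> set[str]:
    """Return the distinct lowercase alphabetic words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def template_coverage(template: str, document: str) -> float:
    """Fraction of the template's distinct words that also appear in the document."""
    template_words = words(template)
    if not template_words:
        return 0.0
    return len(template_words & words(document)) / len(template_words)


def matches_template(template: str, document: str, threshold: float = 0.8) -> bool:
    """Flag the document when enough of the template's words are present."""
    return template_coverage(template, document) >= threshold


blank_form = (
    "Patient Intake Form Name: Date of Birth: "
    "Insurance Provider: Medical History:"
)
completed = (
    "Patient Intake Form Name: Ana Ruiz Date of Birth: 1990-02-03 "
    "Insurance Provider: Acme Health Medical History: none"
)
unrelated = "Notes from Tuesday's budget meeting"

print(matches_template(blank_form, completed))  # True
print(matches_template(blank_form, unrelated))  # False
```

The completed form still matches because filling in the blanks adds words without removing any of the template's labels, which is why the content can differ while the structure is still recognised.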

Good candidates for fingerprinting

Document Type	Why It Works
Tax forms	Standardised structure with consistent field labels
Patent applications	Specific sections, headers, and legal language
Insurance claim forms	Predictable layout and terminology
New hire paperwork	Standard fields across all employees
Regulatory filings	Mandated structure and section headings

Not good candidates

Document Type	Why It Doesn’t Work
Free-form emails	No consistent structure to fingerprint
Meeting notes	Highly variable format
Source code	Structure varies too much between projects
Presentations	Slides have inconsistent layouts

Scenario: Dr. Liam fingerprints patient intake forms

St. Harbour Health uses a standardised patient intake form across all clinics. Dr. Liam wants DLP to detect when completed intake forms are emailed externally — because they contain patient identifiers, medical history, and insurance details.

He uploads the blank intake form template as a document fingerprint. Now, whenever a completed version of the form is attached to an external email, DLP blocks it with a policy tip: “This appears to be a patient intake form. External sharing of patient data requires approval.”

OCR — see text in images

Optical character recognition (OCR) extracts text from images and scanned PDFs so that SITs can evaluate it.

What OCR enables

Without OCR, a credit card number photographed on a desk is invisible to DLP. With OCR enabled, Purview extracts the text from the image and runs SIT evaluation on it.
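
The two-stage flow (extract text, then evaluate SITs) can be sketched as follows. No real OCR engine is called here: `fake_ocr` is a stand-in that simply decodes bytes, and `scan_image`, `CARD_PATTERN`, and the Luhn-based check are hypothetical simplifications rather than Purview's credit card SIT.

```python
import re
from typing import Callable

# 16 digits, optionally separated by single spaces or hyphens (simplified pattern).
CARD_PATTERN = re.compile(r"\b(?:\d[ -]?){15}\d\b")


def luhn_valid(number: str) -> bool:
    """Apply the Luhn checksum to the digits in a candidate card number."""
    digits = [int(c) for c in number if c.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return len(digits) > 0 and total % 10 == 0


def scan_image(image: bytes, extract_text: Callable[[bytes], str]) -> list[str]:
    """Stage 1: extract text from the image. Stage 2: run pattern plus checksum on it."""
    text = extract_text(image)
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]


def fake_ocr(image: bytes) -> str:
    """Stand-in for an OCR engine: pretends the image bytes are already text."""
    return image.decode("utf-8")


# 4111 1111 1111 1111 is a widely used test card number, not a real account.
print(scan_image(b"card 4111 1111 1111 1111 on desk", fake_ocr))
```

If the extraction stage is skipped, the second stage has no text to work with, which is the situation described above when OCR is not enabled.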

Where OCR works

Location	OCR Available?
Exchange Online	Yes — images in email bodies and attachments
SharePoint Online	Yes — images and scanned PDFs in document libraries
OneDrive for Business	Yes — personal file storage
Teams	Yes — images shared in chats and channels
Endpoints	Yes — scanned documents on devices

How to enable OCR

OCR is configured in the Microsoft Purview portal under Settings → Optical character recognition (OCR).

Key configuration points:

OCR must be explicitly enabled — it is not on by default
Specify which locations should use OCR scanning
OCR consumes additional processing — enable only where needed
Supports 149+ languages for text extraction
Requires Azure AI Services (billed separately for high volumes)

Exam tip: OCR prerequisites

The exam may ask about OCR prerequisites. Key facts:

OCR requires an Azure subscription with Azure AI Services
There is a monthly free tier (5,000 images) before billing starts
OCR must be enabled in Purview settings — it is NOT on by default
OCR applies to DLP, auto-labeling, and SIT evaluation
If OCR is not enabled and sensitive data is only in images, DLP will NOT detect it

Comparison: three advanced detection methods that fill gaps left by regex-based SITs

Feature	Exact Data Match (EDM)	Document Fingerprinting	OCR
What it detects	Specific values from your database	Documents matching a template structure	Text within images and scanned PDFs
Best for	Known data (customer lists, employee records)	Standardised forms (tax, HR, insurance)	Sensitive data captured as images or scans
Detection method	Hash comparison against uploaded data	Word pattern matching against template	Text extraction then SIT evaluation
False positive rate	Very low — matches exact values	Low — form structure is unique	Depends on image quality and SIT accuracy
Setup effort	Medium — schema + data + upload agent	Low — upload a blank template	Low — enable in Purview settings
Maintenance	Regular data refresh (daily/weekly)	Update template if form changes	Minimal — runs automatically once enabled

Question

In exact data match, what is the difference between a primary element and a corroborative element?

Answer

A primary element must match for detection (e.g., Account Number, SSN). A corroborative element is supporting evidence that increases confidence (e.g., Name, Date of Birth). At least one primary element must match for an EDM detection to trigger.

Question

Where does EDM hashing happen — in the cloud or on-premises?

Answer

On-premises (or on your designated server). The EDM Upload Agent hashes sensitive data locally using SHA-256 before uploading. Plaintext sensitive data is never sent to the Microsoft cloud.

Question

What type of document is a good candidate for document fingerprinting?

Answer

Standardised forms with a consistent structure — tax forms, patent applications, insurance claims, new hire paperwork, regulatory filings. Documents with highly variable formats (free-form emails, meeting notes, source code) are NOT good candidates.

Question

What prerequisite must be met before OCR can scan images for sensitive information types?

Answer

OCR must be explicitly enabled in Microsoft Purview settings (it is NOT on by default). It also requires an Azure subscription with Azure AI Services for processing. A free tier covers 5,000 images per month before billing starts.

Knowledge Check

Priya at Meridian Financial wants to ensure that DLP detects when actual client names and account numbers appear in documents — not just any 8-digit number that matches a pattern. She has a client database with 45,000 records. What classification method should she use?

Knowledge Check

Dr. Liam at St. Harbour Health discovers that nurses are photographing whiteboards containing patient information and sharing the photos via Teams. Current DLP policies do not detect this. What should Dr. Liam configure?
