EDM & Fingerprinting: Detect Exact Data
Exact data match (EDM) classifiers compare content against values from your actual database. Document fingerprinting builds a detection template from a blank form. Together with optical character recognition (OCR), they catch what regex cannot.
Beyond patterns — matching real data
Think about the difference between spotting a passport and spotting YOUR passport.
A regex-based sensitive information type (SIT) is like a border guard who knows what a passport looks like: the right size, the right colour, a photo, machine-readable text. It can spot any passport, but it cannot tell whether a specific passport belongs to a specific person on a watchlist.
Exact data match (EDM) is like a guard with a database of everyone on the watchlist. When they scan a passport, they compare it directly against the database. If the name and number match an entry, they know exactly who it is — not just “this is a passport” but “this is John Smith’s passport.”
Document fingerprinting is different again — it learns the template of a specific form (like a tax return or patent application) and detects any document that matches that form’s structure, even if the content is different.
Exact data match (EDM) — detect your actual data
Because it matches exact values rather than patterns, EDM is the most precise classification method in Purview for data you already hold. Instead of “find anything that looks like a credit card number,” EDM says “find these specific credit card numbers from our customer database.”
How EDM works
| Step | What Happens |
|---|---|
| 1. Define the schema | Create an EDM schema that describes your data table columns (e.g., Name, SSN, Account Number) |
| 2. Prepare the data | Export your sensitive data to a CSV/TSV file (the “sensitive information source table”) |
| 3. Hash and upload | The EDM Upload Agent hashes the data locally (it never sends plaintext to the cloud) and uploads the hashes |
| 4. Create the EDM SIT | Define which columns are primary (must match) and which are corroborative (supporting evidence) |
| 5. Detection | When content is scanned, Purview hashes values in documents and compares against your uploaded hashes |
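The hash-and-compare idea in steps 3 and 5 can be sketched in a few lines of Python. This is a conceptual illustration only, not the EDM Upload Agent's or Purview's actual implementation: the salt value, the normalization rule, and the whitespace tokenization are all assumptions made for the example.

```python
import hashlib

def normalize(value: str) -> str:
    """Lowercase and drop delimiters so 'MF-12345678' and 'mf12345678' hash alike.
    (Assumed normalization rule, chosen for illustration.)"""
    return "".join(ch for ch in value.lower() if ch.isalnum())

def hash_value(value: str, salt: bytes) -> str:
    """Salted SHA-256 of the normalized value."""
    return hashlib.sha256(salt + normalize(value).encode("utf-8")).hexdigest()

def build_hash_table(rows: list[dict[str, str]], column: str, salt: bytes) -> set[str]:
    """Step 3: hash one column of the source table locally. Only hashes leave this function."""
    return {hash_value(row[column], salt) for row in rows}

def find_matches(text: str, hash_table: set[str], salt: bytes) -> list[str]:
    """Step 5: hash each whitespace-separated token and keep those found in the table."""
    return [tok for tok in text.split() if hash_value(tok, salt) in hash_table]

SALT = b"example-salt"  # hypothetical value for the sketch
rows = [{"Name": "John Smith", "AccountNumber": "MF-12345678"}]
table = build_hash_table(rows, "AccountNumber", SALT)

matches = find_matches("Please review account MF-12345678 today", table, SALT)
print(matches)  # ['MF-12345678']
```

The point of the design is that the lookup table holds only hex digests, so the service doing the comparison never needs the plaintext values.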
Primary vs corroborative elements
| Element Type | Role | Example |
|---|---|---|
| Primary | Must match for detection to trigger | Patient ID, Account Number, SSN |
| Corroborative | Supporting evidence — increases confidence | Patient Name, Date of Birth, Address |
A match requires at least one primary element. Corroborative elements boost confidence and reduce false positives.
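As a rough sketch of that rule, the function below returns no detection without a primary match and raises confidence when corroborative values appear alongside it. The `Record` class, the `classify` function, the confidence labels, and the `min_corroborative` threshold are hypothetical names chosen for illustration. It compares plaintext for readability, whereas real EDM compares hashes.

```python
from dataclasses import dataclass

@dataclass
class Record:
    primary: str              # e.g. an account number
    corroborative: list[str]  # e.g. name, date of birth

def classify(text: str, records: list[Record], min_corroborative: int = 1) -> str:
    """Return 'high', 'low', or 'none' for the strongest match found in text."""
    best = "none"
    lowered = text.lower()
    for rec in records:
        if rec.primary.lower() not in lowered:
            continue  # no primary match means no detection for this record
        hits = sum(1 for value in rec.corroborative if value.lower() in lowered)
        if hits >= min_corroborative:
            return "high"  # primary plus enough supporting evidence
        best = "low"       # primary alone
    return best

records = [Record("MF-12345678", ["John Smith", "1980-01-01"])]
print(classify("John Smith, account MF-12345678", records))  # high
print(classify("Ref MF-12345678", records))                  # low
print(classify("John Smith called today", records))          # none
```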
Scenario: Priya deploys EDM for client accounts
Meridian Financial has 45,000 client accounts. Priya wants DLP to detect when any actual client’s data appears in an email or document — not just any 8-digit number, but specifically numbers that belong to real clients.
She configures EDM:
- Schema: ClientAccountNumber (primary), ClientName (corroborative), TaxID (primary)
- Data source: Nightly export from the client management system (45,000 rows)
- Hash schedule: Daily refresh via the EDM Upload Agent
- Result: DLP now catches “John Smith, account MF-12345678” with near-zero false positives — because it matches against the real client database, not just the pattern.
EDM requirements and limitations
| Requirement | Detail |
|---|---|
| Maximum rows | Up to 100 million rows per data table |
| Maximum columns | Up to 32 columns per schema |
| Maximum table size | Uncompressed data up to 32 GB |
| Refresh frequency | Can refresh up to twice per day |
| Hash algorithm | SHA-256 — data is hashed locally before upload |
| Upload tool | EDM Upload Agent installed on a Windows server with access to the data source |
| Licensing | Requires Microsoft 365 E5, E5 Compliance, or E5 Information Protection |
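Before hashing, it can help to check an export against the row and column limits in the table above. The following is a minimal pre-flight sketch, assuming a comma-separated file with a header row. `check_table` is a hypothetical helper, not part of any Microsoft tooling, and the sample row is made-up data.

```python
import csv
import io

MAX_COLUMNS = 32            # limit taken from the table above
MAX_ROWS = 100_000_000      # limit taken from the table above

def check_table(csv_file, max_columns=MAX_COLUMNS, max_rows=MAX_ROWS) -> list[str]:
    """Return a list of problems found; an empty list means the file is within limits."""
    reader = csv.reader(csv_file)
    header = next(reader, None)
    if header is None:
        return ["file is empty"]
    problems = []
    if len(header) > max_columns:
        problems.append(f"{len(header)} columns exceeds limit of {max_columns}")
    row_count = 0
    for line_no, row in enumerate(reader, start=2):
        row_count += 1
        if len(row) != len(header):
            problems.append(
                f"line {line_no}: expected {len(header)} fields, found {len(row)}"
            )
    if row_count > max_rows:
        problems.append(f"{row_count} rows exceeds limit of {max_rows}")
    return problems

sample = io.StringIO(
    "ClientAccountNumber,ClientName,TaxID\n"
    "MF-12345678,John Smith,123-45-6789\n"
)
print(check_table(sample))  # []
```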
Exam tip: EDM hashing is done locally
A common exam question tests whether you understand where hashing happens. The EDM Upload Agent hashes your sensitive data on-premises (or on your designated server) before uploading. Plaintext sensitive data is never sent to the Microsoft cloud. Only SHA-256 hashes are uploaded.
This is a critical data residency and privacy feature. If a question asks about the security of EDM, remember: hashing is local, only hashes are stored in the cloud.
Document fingerprinting — detect forms by structure
Document fingerprinting converts a blank form or template into a SIT based on its text structure (the “word pattern”). Any document that matches the template’s structure triggers detection.
How it works
- Upload a blank template — e.g., a blank patent application, tax form, or new hire form
- Purview analyses the word pattern — it identifies the unique combination of text elements that define the form’s structure
- Creates a SIT — the fingerprint becomes a SIT you can use in DLP policies
- Detection — when any document matches the word pattern, it’s flagged
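To make the word-pattern idea concrete, here is a simplified stand-in: it treats the template's set of words as the fingerprint and flags a document when most of those words appear in it. Purview's actual fingerprinting algorithm is not described in this section, so the containment ratio, the `threshold` of 0.8, and all function names below are assumptions for illustration.

```python
import re

def words(text: str) -> set[str]:
    """Lowercased alphabetic words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def fingerprint(template_text: str) -> frozenset[str]:
    """The 'fingerprint' in this sketch is simply the template's word set."""
    return frozenset(words(template_text))

def coverage(document_text: str, template_fp: frozenset[str]) -> float:
    """Fraction of template words that also appear in the document."""
    if not template_fp:
        return 0.0
    return len(template_fp & words(document_text)) / len(template_fp)

def matches_template(document_text: str, template_fp: frozenset[str],
                     threshold: float = 0.8) -> bool:
    return coverage(document_text, template_fp) >= threshold

template = "Patient Intake Form Name: Date of Birth: Insurance Provider: Medical History:"
fp = fingerprint(template)

filled = ("Patient Intake Form Name: Jane Doe Date of Birth: 1 Jan 1980 "
          "Insurance Provider: Acme Medical History: none")
email = "Hi team, the meeting moved to Friday. Please bring the form."

print(matches_template(filled, fp))  # True
print(matches_template(email, fp))   # False
```

A completed form still contains every label from the blank template, which is why the filled-in version matches even though its content differs.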
Good candidates for fingerprinting
| Document Type | Why It Works |
|---|---|
| Tax forms | Standardised structure with consistent field labels |
| Patent applications | Specific sections, headers, and legal language |
| Insurance claim forms | Predictable layout and terminology |
| New hire paperwork | Standard fields across all employees |
| Regulatory filings | Mandated structure and section headings |
Not good candidates
| Document Type | Why It Doesn’t Work |
|---|---|
| Free-form emails | No consistent structure to fingerprint |
| Meeting notes | Highly variable format |
| Source code | Structure varies too much between projects |
| Presentations | Slides have inconsistent layouts |
Scenario: Dr. Liam fingerprints patient intake forms
St. Harbour Health uses a standardised patient intake form across all clinics. Dr. Liam wants DLP to detect when completed intake forms are emailed externally — because they contain patient identifiers, medical history, and insurance details.
He uploads the blank intake form template as a document fingerprint. Now, whenever a completed version of the form is attached to an external email, DLP blocks it with a policy tip: “This appears to be a patient intake form. External sharing of patient data requires approval.”
OCR — see text in images
Optical character recognition extracts text from images and scanned PDFs so SITs can evaluate them.
What OCR enables
Without OCR, a credit card number photographed on a desk is invisible to DLP. With OCR enabled, Purview extracts the text from the image and runs SIT evaluation on it.
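The two-stage flow (extract text, then evaluate it) can be sketched as follows. The OCR engine is represented by a stand-in function, `fake_ocr`, because the real extraction happens inside the service. The card pattern and the Luhn checksum are a simplified illustration of a pattern-based check, not Purview's built-in credit card SIT definition.

```python
import re
from typing import Callable, Optional

# 13 to 19 digits, optionally separated by single spaces or hyphens (simplified pattern).
CARD_PATTERN = re.compile(r"\d(?:[ -]?\d){12,18}")

def luhn_valid(number: str) -> bool:
    """Luhn checksum: double every second digit from the right, sum, check mod 10."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return bool(digits) and total % 10 == 0

def detect_cards(image: bytes,
                 extract_text: Optional[Callable[[bytes], str]]) -> list[str]:
    """With no OCR step (extract_text is None), image content is invisible to the scan."""
    if extract_text is None:
        return []
    text = extract_text(image)
    return [m.group() for m in CARD_PATTERN.finditer(text) if luhn_valid(m.group())]

def fake_ocr(image: bytes) -> str:
    """Hypothetical stand-in for an OCR engine; returns fixed text for the demo."""
    return "Card on desk: 4111 1111 1111 1111 exp 12/29"

photo = b"<image bytes>"
print(detect_cards(photo, None))      # []
print(detect_cards(photo, fake_ocr))  # ['4111 1111 1111 1111']
```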
Where OCR works
| Location | OCR Available? |
|---|---|
| Exchange Online | Yes — images in email bodies and attachments |
| SharePoint Online | Yes — images and scanned PDFs in document libraries |
| OneDrive for Business | Yes — personal file storage |
| Teams | Yes — images shared in chats and channels |
| Endpoints | Yes — scanned documents on devices |
How to enable OCR
OCR is configured in the Microsoft Purview portal under Settings → Optical character recognition (OCR).
Key configuration points:
- OCR must be explicitly enabled — it is not on by default
- Specify which locations should use OCR scanning
- OCR consumes additional processing — enable only where needed
- Supports 149+ languages for text extraction
- Requires Azure AI Services (billed separately for high volumes)
Exam tip: OCR prerequisites
The exam may ask about OCR prerequisites. Key facts:
- OCR requires an Azure subscription with Azure AI Services
- There is a monthly free tier (5,000 images) before billing starts
- OCR must be enabled in Purview settings — it is NOT on by default
- OCR applies to DLP, auto-labeling, and SIT evaluation
- If OCR is not enabled and sensitive data is only in images, DLP will NOT detect it
| Feature | Exact Data Match (EDM) | Document Fingerprinting | OCR |
|---|---|---|---|
| What it detects | Specific values from your database | Documents matching a template structure | Text within images and scanned PDFs |
| Best for | Known data (customer lists, employee records) | Standardised forms (tax, HR, insurance) | Sensitive data captured as images or scans |
| Detection method | Hash comparison against uploaded data | Word pattern matching against template | Text extraction then SIT evaluation |
| False positive rate | Very low — matches exact values | Low — form structure is unique | Depends on image quality and SIT accuracy |
| Setup effort | Medium — schema + data + upload agent | Low — upload a blank template | Low — enable in Purview settings |
| Maintenance | Regular data refresh (daily/weekly) | Update template if form changes | Minimal — runs automatically once enabled |
Priya at Meridian Financial wants to ensure that DLP detects when actual client names and account numbers appear in documents — not just any 8-digit number that matches a pattern. She has a client database with 45,000 records. What classification method should she use?
Dr. Liam at St. Harbour Health discovers that nurses are photographing whiteboards containing patient information and sharing the photos via Teams. Current DLP policies do not detect this. What should Dr. Liam configure?
Next up: Trainable Classifiers: AI-Powered Detection — teach Microsoft Purview to recognise content by example.