Data Quality: The Make-or-Break Factor for AI
Every AI system is only as good as the data behind it. Learn the data quality dimensions that determine whether AI helps or harms — and how to assess your organisation's readiness.
Why does data quality matter more with AI?
“Garbage in, garbage out” has been true in computing for decades. With AI, it’s “garbage in, confidently wrong garbage out — at scale.”
Traditional software crashes or throws errors when data is bad. AI doesn’t. It takes your messy, incomplete, outdated data and produces polished, professional-looking output that seems correct — but isn’t. And it does it fast, across your entire organisation.
That’s why data quality isn’t a technical detail for your IT team. It’s a strategic priority for every leader deploying AI.
Data types: Structured, unstructured, and semi-structured
AI systems work with three types of data, each with different quality challenges:
| Data type | What it looks like | Examples | Quality challenge |
|---|---|---|---|
| Structured data | Organised in rows and columns with defined formats | Databases, spreadsheets, CRM records, financial transactions | Missing values, duplicate records, inconsistent formats (dates, currencies) |
| Unstructured data | No predefined format — free-form content | Emails, documents, Teams chats, images, videos, meeting transcripts | Outdated content, contradictory versions, poor organisation, no metadata |
| Semi-structured data | Has some organisation but not rigid rows and columns | JSON files, XML data, tagged emails, SharePoint metadata | Inconsistent tagging, missing fields, schema variations across sources |
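The quality challenges in the table above differ by data type: structured data can be validated field by field, semi-structured data can be checked against an expected schema, and unstructured data usually has to be judged via its metadata. A minimal sketch, using made-up records and field names (not a real CRM or SharePoint schema):

```python
from datetime import date

# Structured: rows with defined fields -> check for missing values
crm_rows = [
    {"id": 1, "industry": "Retail", "updated": date(2025, 1, 10)},
    {"id": 2, "industry": None,     "updated": date(2023, 6, 2)},
]
missing_industry = [r["id"] for r in crm_rows if not r["industry"]]

# Semi-structured: JSON-like records -> check for schema variation
docs = [
    {"title": "Policy A", "owner": "HR"},
    {"title": "Policy B"},  # missing the 'owner' field
]
expected_keys = {"title", "owner"}
schema_gaps = [d["title"] for d in docs if expected_keys - d.keys()]

# Unstructured: free text has no fields to validate; the usual automated
# proxy is metadata such as last-modified dates (see the audit below)
print(missing_industry)  # [2]
print(schema_gaps)       # ['Policy B']
```

The point of the sketch is the asymmetry: the structured and semi-structured checks are a few lines each, while unstructured content needs human review or metadata-based heuristics.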
Exam tip: Why unstructured data matters most for gen AI
Most enterprise data is unstructured — documents, emails, chats, presentations. This is exactly the data that generative AI (especially Copilot) grounds on.
The exam may test whether you understand that:
- An estimated 80% of enterprise data is unstructured — and it’s the hardest to quality-check
- Copilot primarily grounds on unstructured data via Microsoft Graph (emails, documents, chats)
- Poor unstructured data quality directly leads to poor AI responses
Five dimensions of data quality
Leaders should evaluate data across five key dimensions before deploying AI:
| Dimension | What it means | AI impact if poor | Check |
|---|---|---|---|
| Accuracy | Data reflects reality correctly | AI provides factually wrong answers with high confidence | Are product specs, prices, and policies current and verified? |
| Completeness | No critical gaps or missing fields | AI can’t answer questions about missing topics — or fills in gaps with fabrications | Are all departments, products, and regions represented in the data? |
| Timeliness | Data is current and regularly updated | AI gives outdated answers — last year’s pricing, old policies, former employees | When was each document last reviewed? Is there a refresh schedule? |
| Consistency | Same information is recorded the same way across sources | AI gets contradictory inputs and produces unpredictable responses | Does the HR policy in SharePoint match the version in the employee handbook? |
| Relevance | Data is appropriate for the AI use case | AI retrieves noise instead of signal — irrelevant content dilutes good answers | Is the indexed content actually useful for the questions users will ask? |
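Two of these dimensions, timeliness and consistency, lend themselves to simple automated checks over a document inventory. A minimal sketch, where the records, sources, and the 18-month freshness threshold are all illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical document inventory with last-review dates and sources
documents = [
    {"name": "pricing.docx",  "last_reviewed": date(2022, 3, 1),  "source": "SharePoint"},
    {"name": "handbook.docx", "last_reviewed": date(2025, 9, 1),  "source": "SharePoint"},
    {"name": "handbook.docx", "last_reviewed": date(2024, 6, 5),  "source": "Intranet"},
]

today = date(2025, 10, 1)
stale_after = timedelta(days=548)  # ~18 months; pick a threshold per content type

# Timeliness: flag documents past the refresh threshold
stale = [d["name"] for d in documents if today - d["last_reviewed"] > stale_after]

# Consistency: flag names that appear in more than one source
names = [d["name"] for d in documents]
conflicting = sorted({n for n in names if names.count(n) > 1})

print(stale)        # ['pricing.docx']
print(conflicting)  # ['handbook.docx']
```

Accuracy, completeness, and relevance are harder to automate; they typically require document owners to verify content against reality, which is why the checklist later in this lesson assigns an owner to every key document.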
Representative datasets: Why they matter for fairness
A representative dataset reflects the full diversity of the population or scenarios the AI will encounter. If the training data or grounding data is skewed, the AI’s outputs will be biased.
| Problem | What happens | Real-world example |
|---|---|---|
| Underrepresentation | AI performs poorly for groups missing from the data | A hiring AI trained mostly on male resumes ranks female candidates lower |
| Historical bias | Data reflects past discrimination — AI perpetuates it | A lending model trained on historical approvals denies loans to demographics that were historically discriminated against |
| Geographic skew | Data overrepresents certain regions or cultures | A customer support AI trained on US data gives incorrect answers about EU regulations |
| Temporal bias | Training data is outdated, reflecting old patterns | A market analysis AI recommends strategies based on pre-pandemic consumer behaviour |
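A first-pass representation check can be as simple as comparing each group’s share of the dataset against its expected share of the population. This sketch uses invented groups, shares, and a 10-point tolerance; real bias testing is a far deeper exercise:

```python
from collections import Counter

# Hypothetical dataset labels and expected population shares
samples = ["US"] * 80 + ["EU"] * 15 + ["APAC"] * 5
expected = {"US": 0.40, "EU": 0.35, "APAC": 0.25}
tolerance = 0.10  # flag groups more than 10 points below their expected share

counts = Counter(samples)
total = sum(counts.values())
underrepresented = sorted(
    group for group, share in expected.items()
    if counts.get(group, 0) / total < share - tolerance
)
print(underrepresented)  # ['APAC', 'EU']
```

Even this crude check would catch the geographic skew in the table above: a dataset that is 80% US content cannot fairly serve EU or APAC users.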
Why leaders — not just data scientists — need to care about representation
Representative datasets aren’t just a technical concern. They’re a governance and reputational risk:
- Regulatory: The EU AI Act and similar regulations require AI systems to be tested for bias
- Reputational: A biased AI in customer-facing applications can generate headlines
- Legal: Discriminatory AI outputs can create liability
The board and C-suite need to ask: “Does our data represent all the people and scenarios this AI will encounter?” If the answer is no, the AI isn’t ready for deployment.
Real-world scenario: Dr. Patel audits data quality before AI deployment
📊 Dr. Anisha Patel, Board Advisor, insists that her client’s organisation complete a data quality audit before rolling out Copilot to 3,000 employees. Here’s what the audit finds:
SharePoint:
- 40% of documents haven’t been updated in over 2 years
- Three versions of the employee handbook exist — with conflicting information
- The old intranet site was migrated but never cleaned up — 10,000 outdated pages are still indexed
CRM data:
- 15% of customer records have no industry classification
- Duplicate contact records across regions mean AI pulls conflicting account information
Email and Teams:
- Teams channels created for past projects still contain outdated decisions and superseded plans
- No archival policy means Copilot surfaces 4-year-old email threads as current context
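The duplicate-contact finding above is typical of CRM data: the same person entered in two regions with slightly different spellings. A minimal sketch of the kind of check that surfaces it, using hypothetical records and normalising on email address before comparing:

```python
# Hypothetical CRM contacts from two regional systems
contacts = [
    {"region": "EMEA", "email": "A.Singh@contoso.com",  "account": "Contoso"},
    {"region": "APAC", "email": "a.singh@contoso.com ", "account": "Contoso Ltd"},
    {"region": "EMEA", "email": "b.lee@fabrikam.com",   "account": "Fabrikam"},
]

seen = {}
duplicates = []
for c in contacts:
    key = c["email"].strip().lower()  # normalise before comparing
    if key in seen:
        duplicates.append((seen[key]["region"], c["region"], key))
    else:
        seen[key] = c

print(duplicates)  # [('EMEA', 'APAC', 'a.singh@contoso.com')]
```

Note that the duplicate pair also carries conflicting account names (“Contoso” vs “Contoso Ltd”), which is exactly the contradictory input that makes an AI assistant’s answers unpredictable.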
Dr. Patel’s recommendation: Do not deploy Copilot organisation-wide until critical data hygiene is addressed. Start with a pilot in one department with clean data, and use the findings to build a data cleanup roadmap.
Dr. Patel's data preparation checklist for leaders
Before any AI deployment, ensure:
- Archive or delete outdated content — if it’s not current, it shouldn’t be in the AI’s reach
- Consolidate duplicate and conflicting documents into single sources of truth
- Review permissions — AI will surface anything users can access, so fix oversharing first
- Establish ownership — every key document should have an owner responsible for accuracy
- Create a refresh schedule — data that’s never updated becomes a liability, not an asset
- Test with real queries — ask the AI questions you know the answers to and verify it responds correctly
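The last checklist item, testing with real queries, can be turned into a small “golden question” harness: a list of questions with known answers, run through the assistant before and after each deployment change. A minimal sketch; `ask_assistant` is a hypothetical stand-in you would wire to your real AI endpoint:

```python
def ask_assistant(question: str) -> str:
    # Stub that answers from a tiny canned index so the sketch is runnable;
    # replace with a call to your actual assistant
    canned = {
        "What is the standard refund window?": "Refunds are accepted within 30 days.",
        "Who owns the travel policy?": "The travel policy owner is unknown.",
    }
    return canned.get(question, "")

# Questions whose correct answers you already know
golden_set = [
    ("What is the standard refund window?", "30 days"),
    ("Who owns the travel policy?", "Finance Operations"),
]

failures = [q for q, expected in golden_set if expected not in ask_assistant(q)]
print(failures)  # ['Who owns the travel policy?']
```

A failure here points at a data gap, not a model flaw: in this invented example the travel policy has no recorded owner, which is precisely the “establish ownership” item earlier in the checklist.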
Knowledge check
Dr. Patel's audit finds three conflicting versions of the employee handbook in SharePoint. If Copilot is deployed now, what is the most likely outcome?
Dr. Patel is reviewing a company's hiring AI as part of a governance audit. She notices it consistently ranks candidates from certain universities higher than equally qualified candidates from other institutions. What data quality issue is this most likely caused by?
Next up: When Traditional Machine Learning Adds Value — understanding when old-school ML outperforms generative AI.