Data Quality: The Make-or-Break Factor for AI
Every AI system is only as good as the data behind it. Learn the data quality dimensions that determine whether AI helps or harms — and how to assess your organisation's readiness.
Why does data quality matter more with AI?
“Garbage in, garbage out” has been true in computing for decades. With AI, it’s “garbage in, confidently wrong garbage out — at scale.”
Traditional software crashes or throws errors when data is bad. AI doesn’t. It takes your messy, incomplete, outdated data and produces polished, professional-looking output that seems correct — but isn’t. And it does it fast, across your entire organisation.
That’s why data quality isn’t a technical detail for your IT team. It’s a strategic priority for every leader deploying AI.
Data types: Structured, unstructured, and semi-structured
AI systems work with three types of data, each with different quality challenges:
| Data type | What it looks like | Examples | Quality challenge |
|---|---|---|---|
| Structured data | Organised in rows and columns with defined formats | Databases, spreadsheets, CRM records, financial transactions | Missing values, duplicate records, inconsistent formats (dates, currencies) |
| Unstructured data | No predefined format — free-form content | Emails, documents, Teams chats, images, videos, meeting transcripts | Outdated content, contradictory versions, poor organisation, no metadata |
| Semi-structured data | Has some organisation but not rigid rows and columns | JSON files, XML data, tagged emails, SharePoint metadata | Inconsistent tagging, missing fields, schema variations across sources |
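The quality challenges in the table above differ by data type: structured data can be validated field by field, semi-structured data can be checked against an expected schema, and unstructured data usually has to be judged via its metadata. A minimal sketch, using made-up records and field names (not a real CRM or SharePoint schema):

```python
from datetime import date

# Structured: rows with defined fields -> check for missing values
crm_rows = [
    {"id": 1, "industry": "Retail", "updated": date(2025, 1, 10)},
    {"id": 2, "industry": None,     "updated": date(2023, 6, 2)},
]
missing_industry = [r["id"] for r in crm_rows if not r["industry"]]

# Semi-structured: JSON-like records -> check for schema variation
docs = [
    {"title": "Policy A", "owner": "HR"},
    {"title": "Policy B"},  # missing the 'owner' field
]
expected_keys = {"title", "owner"}
schema_gaps = [d["title"] for d in docs if expected_keys - d.keys()]

# Unstructured: free text has no fields to validate; the usual automated
# proxy is metadata such as last-modified dates (see the audit below)
print(missing_industry)  # [2]
print(schema_gaps)       # ['Policy B']
```

The point of the sketch is the asymmetry: the structured and semi-structured checks are a few lines each, while unstructured content needs human review or metadata-based heuristics.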
Exam tip: Why unstructured data matters most for gen AI
Most enterprise data is unstructured — documents, emails, chats, presentations. This is exactly the data that generative AI (especially Copilot) grounds on.
The exam may test whether you understand that:
- An estimated 80% of enterprise data is unstructured — and it’s the hardest to quality-check
- Copilot primarily grounds on unstructured data via Microsoft Graph (emails, documents, chats)
- Poor unstructured data quality directly leads to poor AI responses
Five dimensions of data quality
Leaders should evaluate data across five key dimensions before deploying AI:
| Dimension | What it means | AI impact if poor | Check |
|---|---|---|---|
| Accuracy | Data reflects reality correctly | AI provides factually wrong answers with high confidence | Are product specs, prices, and policies current and verified? |
| Completeness | No critical gaps or missing fields | AI can’t answer questions about missing topics — or fills in gaps with fabrications | Are all departments, products, and regions represented in the data? |
| Timeliness | Data is current and regularly updated | AI gives outdated answers — last year’s pricing, old policies, former employees | When was each document last reviewed? Is there a refresh schedule? |
| Consistency | Same information is recorded the same way across sources | AI gets contradictory inputs and produces unpredictable responses | Does the HR policy in SharePoint match the version in the employee handbook? |
| Relevance | Data is appropriate for the AI use case | AI retrieves noise instead of signal — irrelevant content dilutes good answers | Is the indexed content actually useful for the questions users will ask? |
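Two of these dimensions, timeliness and consistency, lend themselves to simple automated checks over a document inventory. A minimal sketch, where the records, sources, and the 18-month freshness threshold are all illustrative assumptions:

```python
from datetime import date, timedelta

# Hypothetical document inventory with last-review dates and sources
documents = [
    {"name": "pricing.docx",  "last_reviewed": date(2022, 3, 1),  "source": "SharePoint"},
    {"name": "handbook.docx", "last_reviewed": date(2025, 9, 1),  "source": "SharePoint"},
    {"name": "handbook.docx", "last_reviewed": date(2024, 6, 5),  "source": "Intranet"},
]

today = date(2025, 10, 1)
stale_after = timedelta(days=548)  # ~18 months; pick a threshold per content type

# Timeliness: flag documents past the refresh threshold
stale = [d["name"] for d in documents if today - d["last_reviewed"] > stale_after]

# Consistency: flag names that appear in more than one source
names = [d["name"] for d in documents]
conflicting = sorted({n for n in names if names.count(n) > 1})

print(stale)        # ['pricing.docx']
print(conflicting)  # ['handbook.docx']
```

Accuracy, completeness, and relevance are harder to automate; they typically require document owners to verify content against reality, which is why the checklist later in this lesson assigns an owner to every key document.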
Representative datasets: Why they matter for fairness
A representative dataset reflects the full diversity of the population or scenarios the AI will encounter. If the training data or grounding data is skewed, the AI’s outputs will be biased.
| Problem | What happens | Real-world example |
|---|---|---|
| Underrepresentation | AI performs poorly for groups missing from the data | A hiring AI trained mostly on male resumes ranks female candidates lower |
| Historical bias | Data reflects past discrimination — AI perpetuates it | A lending model trained on historical approvals denies loans to demographics that were historically discriminated against |
| Geographic skew | Data overrepresents certain regions or cultures | A customer support AI trained on US data gives incorrect answers about EU regulations |
| Temporal bias | Training data is outdated, reflecting old patterns | A market analysis AI recommends strategies based on pre-pandemic consumer behaviour |
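A first-pass representation check can be as simple as comparing each group’s share of the dataset against its expected share of the population. This sketch uses invented groups, shares, and a 10-point tolerance; real bias testing is a far deeper exercise:

```python
from collections import Counter

# Hypothetical dataset labels and expected population shares
samples = ["US"] * 80 + ["EU"] * 15 + ["APAC"] * 5
expected = {"US": 0.40, "EU": 0.35, "APAC": 0.25}
tolerance = 0.10  # flag groups more than 10 points below their expected share

counts = Counter(samples)
total = sum(counts.values())
underrepresented = sorted(
    group for group, share in expected.items()
    if counts.get(group, 0) / total < share - tolerance
)
print(underrepresented)  # ['APAC', 'EU']
```

Even this crude check would catch the geographic skew in the table above: a dataset that is 80% US content cannot fairly serve EU or APAC users.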
Why leaders — not just data scientists — need to care about representation
Representative datasets aren’t just a technical concern. They’re a governance and reputational risk:
- Regulatory: The EU AI Act and similar regulations require AI systems to be tested for bias
- Reputational: A biased AI in customer-facing applications can generate headlines
- Legal: Discriminatory AI outputs can create liability
The board and C-suite need to ask: “Does our data represent all the people and scenarios this AI will encounter?” If the answer is no, the AI isn’t ready for deployment.
Real-world scenario: Dr. Patel audits data quality before AI deployment
📊 Dr. Anisha Patel, Board Advisor, insists that her client’s organisation complete a data quality audit before rolling out Copilot to 3,000 employees. Here’s what the audit finds:
SharePoint:
- 40% of documents haven’t been updated in over 2 years
- Three versions of the employee handbook exist — with conflicting information
- The old intranet site was migrated but never cleaned up — 10,000 outdated pages are still indexed
CRM data:
- 15% of customer records have no industry classification
- Duplicate contact records across regions mean AI pulls conflicting account information
Email and Teams:
- Teams channels created for past projects still contain outdated decisions and superseded plans
- No archival policy means Copilot surfaces 4-year-old email threads as current context
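The duplicate-contact finding above is typical of CRM data: the same person entered in two regions with slightly different spellings. A minimal sketch of the kind of check that surfaces it, using hypothetical records and normalising on email address before comparing:

```python
# Hypothetical CRM contacts from two regional systems
contacts = [
    {"region": "EMEA", "email": "A.Singh@contoso.com",  "account": "Contoso"},
    {"region": "APAC", "email": "a.singh@contoso.com ", "account": "Contoso Ltd"},
    {"region": "EMEA", "email": "b.lee@fabrikam.com",   "account": "Fabrikam"},
]

seen = {}
duplicates = []
for c in contacts:
    key = c["email"].strip().lower()  # normalise before comparing
    if key in seen:
        duplicates.append((seen[key]["region"], c["region"], key))
    else:
        seen[key] = c

print(duplicates)  # [('EMEA', 'APAC', 'a.singh@contoso.com')]
```

Note that the duplicate pair also carries conflicting account names (“Contoso” vs “Contoso Ltd”), which is exactly the contradictory input that makes an AI assistant’s answers unpredictable.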
Dr. Patel’s recommendation: Do not deploy Copilot organisation-wide until critical data hygiene is addressed. Start with a pilot in one department with clean data, and use the findings to build a data cleanup roadmap.
Dr. Patel's data preparation checklist for leaders
Before any AI deployment, ensure:
- Archive or delete outdated content — if it’s not current, it shouldn’t be in the AI’s reach
- Consolidate duplicate and conflicting documents into single sources of truth
- Review permissions — AI will surface anything users can access, so fix oversharing first
- Establish ownership — every key document should have an owner responsible for accuracy
- Create a refresh schedule — data that’s never updated becomes a liability, not an asset
- Test with real queries — ask the AI questions you know the answers to and verify it responds correctly
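The last checklist item, testing with real queries, can be turned into a small “golden question” harness: a list of questions with known answers, run through the assistant before and after each deployment change. A minimal sketch; `ask_assistant` is a hypothetical stand-in you would wire to your real AI endpoint:

```python
def ask_assistant(question: str) -> str:
    # Stub that answers from a tiny canned index so the sketch is runnable;
    # replace with a call to your actual assistant
    canned = {
        "What is the standard refund window?": "Refunds are accepted within 30 days.",
        "Who owns the travel policy?": "The travel policy owner is unknown.",
    }
    return canned.get(question, "")

# Questions whose correct answers you already know
golden_set = [
    ("What is the standard refund window?", "30 days"),
    ("Who owns the travel policy?", "Finance Operations"),
]

failures = [q for q, expected in golden_set if expected not in ask_assistant(q)]
print(failures)  # ['Who owns the travel policy?']
```

A failure here points at a data gap, not a model flaw: in this invented example the travel policy has no recorded owner, which is precisely the “establish ownership” item earlier in the checklist.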
Knowledge check
Dr. Patel's audit finds three conflicting versions of the employee handbook in SharePoint. If Copilot is deployed now, what is the most likely outcome?
Dr. Patel is reviewing a company's hiring AI as part of a governance audit. She notices it consistently ranks candidates from certain universities higher than equally qualified candidates from other institutions. What data quality issue is this most likely caused by?
Next up: When Traditional Machine Learning Adds Value — understanding when old-school ML outperforms generative AI.