Normalization: Why Duplicate Data is Bad
Normalization is the art of organising data to eliminate redundancy. It's fundamental to good database design — and a guaranteed exam topic.
What is normalization?
Normalization means “don’t write the same thing twice.”
Imagine Tom writes every driver’s full address inside every delivery record. If Sarah Park moves house, Tom has to update hundreds of delivery records — miss one and the data is wrong.
Normalization says: store Sarah’s address ONCE in the Drivers table, and just reference her DriverID in the Deliveries table. Now you update one place and everything stays consistent.
The problem normalization solves
Before normalization (bad design)
Jake’s early CloudPulse database stored everything in one table:
| OrderID | CustomerName | CustomerEmail | CustomerCity | Product | Price |
|---|---|---|---|---|---|
| 5001 | Meridian Health | info@meridian.co.nz | Auckland | Pro Plan | 49.00 |
| 5002 | Meridian Health | info@meridian.co.nz | Auckland | API Add-on | 19.00 |
| 5003 | Kiwi Fitness | hello@kiwi.fit | Wellington | Pro Plan | 49.00 |
Problems:
- Redundancy: “Meridian Health” and their email/city are stored in every order row
- Update anomaly: If Meridian Health changes their email, Jake must update EVERY order row. Miss one? Inconsistent data.
- Delete anomaly: If Jake deletes all orders for Kiwi Fitness, he loses the customer’s contact info entirely
- Insert anomaly: Jake can’t add a new customer until they place an order
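The update anomaly is easy to demonstrate in a few lines of Python. This is a minimal sketch, not the lesson's actual database — the rows mirror the flat table above, and the new email address is made up for illustration:

```python
# A flat "orders" table: customer details are repeated in every row.
orders = [
    {"OrderID": 5001, "CustomerName": "Meridian Health",
     "CustomerEmail": "info@meridian.co.nz", "Product": "Pro Plan"},
    {"OrderID": 5002, "CustomerName": "Meridian Health",
     "CustomerEmail": "info@meridian.co.nz", "Product": "API Add-on"},
]

# Changing one customer's email means touching EVERY matching row.
# Miss a single row and the table silently becomes inconsistent.
for row in orders:
    if row["CustomerName"] == "Meridian Health":
        row["CustomerEmail"] = "contact@meridian.co.nz"
```

The loop only works because every row was found and updated; in a real table with hundreds of rows, any filter mistake leaves stale copies behind.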
After normalization (good design)
Split into separate tables:
Customers table:
| CustomerID | Name | Email | City |
|---|---|---|---|
| C001 | Meridian Health | info@meridian.co.nz | Auckland |
| C002 | Kiwi Fitness | hello@kiwi.fit | Wellington |
Orders table:
| OrderID | CustomerID (FK) | Product | Price |
|---|---|---|---|
| 5001 | C001 | Pro Plan | 49.00 |
| 5002 | C001 | API Add-on | 19.00 |
| 5003 | C002 | Pro Plan | 49.00 |
Now each fact lives in ONE place. Update Meridian’s email once, and it’s correct everywhere.
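Here is a runnable sketch of the two-table design using SQLite (the table and column names follow the example above; the column types, in-memory database, and new email address are illustrative assumptions):

```python
import sqlite3

# Build the normalized schema: Customers holds each fact once,
# Orders references customers by CustomerID (a foreign key).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE Customers (
    CustomerID TEXT PRIMARY KEY,
    Name TEXT, Email TEXT, City TEXT)""")
conn.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID TEXT REFERENCES Customers(CustomerID),
    Product TEXT, Price REAL)""")
conn.executemany("INSERT INTO Customers VALUES (?,?,?,?)", [
    ("C001", "Meridian Health", "info@meridian.co.nz", "Auckland"),
    ("C002", "Kiwi Fitness", "hello@kiwi.fit", "Wellington")])
conn.executemany("INSERT INTO Orders VALUES (?,?,?,?)", [
    (5001, "C001", "Pro Plan", 49.00),
    (5002, "C001", "API Add-on", 19.00),
    (5003, "C002", "Pro Plan", 49.00)])

# The email lives in ONE row, so one UPDATE fixes it everywhere.
conn.execute("UPDATE Customers SET Email = ? WHERE CustomerID = ?",
             ("contact@meridian.co.nz", "C001"))

# Every order sees the corrected email through the join.
rows = conn.execute("""SELECT o.OrderID, c.Email
                       FROM Orders o
                       JOIN Customers c ON o.CustomerID = c.CustomerID
                       WHERE c.CustomerID = 'C001'""").fetchall()
```

One `UPDATE` statement, and both of Meridian's orders report the new address — exactly the consistency guarantee the flat design could not give.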
Normal forms (simplified)
Normalization is done in stages called normal forms. For DP-900, you need to understand the concept, not memorise every rule:
| Normal Form | Key Rule | Plain English |
|---|---|---|
| 1NF | No repeating groups; each cell has one value | Every column stores a single value, not a list |
| 2NF | Every non-key column depends on the WHOLE primary key | No partial dependencies in tables with composite keys |
| 3NF | Every non-key column depends ONLY on the primary key | No column depends on another non-key column |
Normal forms by example
1NF violation — a “Products” column storing “Pro Plan, API Add-on” in one cell. Fix: one row per product.
2NF violation — a table with composite key (OrderID + ProductID) where CustomerName depends only on OrderID, not on the full key. Fix: move CustomerName to a separate table.
3NF violation — a table where City depends on PostalCode, which depends on CustomerID. City doesn’t depend directly on the primary key. Fix: move City and PostalCode to a separate location table.
In practice, most well-designed databases achieve 3NF. That’s the sweet spot between data integrity and query simplicity.
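The 1NF fix described above — one row per product instead of a comma-separated list — can be sketched like this (the order ID and product names come from the earlier example; the splitting logic is my own illustration):

```python
# 1NF violation: a single cell holds a list of values.
flat = [(5001, "Pro Plan, API Add-on")]

# Fix: expand the multi-valued cell into one row per product.
normalized = [(order_id, product.strip())
              for order_id, products in flat
              for product in products.split(",")]

print(normalized)  # [(5001, 'Pro Plan'), (5001, 'API Add-on')]
```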
Normalization vs denormalization
Normalization minimises duplication — great for transactional databases. But analytical databases (data warehouses) often denormalize on purpose — combining tables to speed up read-heavy queries.
| Feature | Normalized | Denormalized |
|---|---|---|
| Duplication | Minimal — each fact stored once | Intentional — data repeated for speed |
| Updates | Fast — update one place | Slow — update many places |
| Read queries | Require joins (slower for big queries) | Pre-joined data (faster reads) |
| Best for | Transactional (OLTP) databases | Analytical (OLAP) data warehouses |
| Integrity risk | Low — constraints enforced | Higher — duplicates can get out of sync |
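Deliberate denormalization can be sketched as pre-joining the tables once at load time, so read queries never pay the join cost. This is purely illustrative — the data echoes the earlier example, and the wide-row layout is an assumption, not a prescribed warehouse schema:

```python
# Normalized source data: customer facts stored once.
customers = {"C001": ("Meridian Health", "Auckland"),
             "C002": ("Kiwi Fitness", "Wellington")}
orders = [(5001, "C001", "Pro Plan", 49.00),
          (5003, "C002", "Pro Plan", 49.00)]

# Denormalize: join once up front and store the repeated columns.
# Reads are now simple scans, but the customer data is duplicated
# and must be rebuilt whenever a customer record changes.
wide = [(oid, cid, *customers[cid], product, price)
        for oid, cid, product, price in orders]
```

This is the trade in the table above made concrete: faster reads bought with duplication, and an integrity risk if the copies drift apart.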
Exam tip: normalization questions
The exam asks about normalization at a conceptual level:
- “Reduce data redundancy” → Normalization
- “Prevent update anomalies” → Normalization
- “Split one large table into multiple related tables” → Normalization
- “Combine tables for faster reporting” → Denormalization
- “Optimise for analytical queries” → Denormalization
You don’t need to identify specific normal forms (1NF, 2NF, 3NF) on the exam — just understand WHY normalization exists and when denormalization is appropriate.
Knowledge check
Jake notices that when he updates a customer's email address, he has to change it in 47 different order records. What database design problem is this?
Priya's data warehouse combines customer, product, and sales data into a single wide table for faster reporting. This is an example of:
Next up: SQL Basics: SELECT, INSERT, UPDATE, DELETE — learn the language that relational databases speak.