Normalization: Why Duplicate Data is Bad
Normalization is the art of organising data to eliminate redundancy. It's fundamental to good database design — and a guaranteed exam topic.
What is normalization?
Normalization means “don’t write the same thing twice.”
Imagine Tom writes every driver’s full address inside every delivery record. If Sarah Park moves house, Tom has to update hundreds of delivery records — miss one and the data is wrong.
Normalization says: store Sarah’s address ONCE in the Drivers table, and just reference her DriverID in the Deliveries table. Now you update one place and everything stays consistent.
The problem normalization solves
Before normalization (bad design)
Jake’s early CloudPulse database stored everything in one table:
| OrderID | CustomerName | CustomerEmail | CustomerCity | Product | Price |
|---|---|---|---|---|---|
| 5001 | Meridian Health | info@meridian.co.nz | Auckland | Pro Plan | 49.00 |
| 5002 | Meridian Health | info@meridian.co.nz | Auckland | API Add-on | 19.00 |
| 5003 | Kiwi Fitness | hello@kiwi.fit | Wellington | Pro Plan | 49.00 |
Problems:
- Redundancy: “Meridian Health” and their email/city are stored in every order row
- Update anomaly: If Meridian Health changes their email, Jake must update EVERY order row. Miss one? Inconsistent data.
- Delete anomaly: If Jake deletes all orders for Kiwi Fitness, he loses the customer’s contact info entirely
- Insert anomaly: Jake can’t add a new customer until they place an order
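The update anomaly is easy to demonstrate in a few lines of Python. This is a minimal sketch, not the lesson's actual database — the rows mirror the flat table above, and the new email address is made up for illustration:

```python
# A flat "orders" table: customer details are repeated in every row.
orders = [
    {"OrderID": 5001, "CustomerName": "Meridian Health",
     "CustomerEmail": "info@meridian.co.nz", "Product": "Pro Plan"},
    {"OrderID": 5002, "CustomerName": "Meridian Health",
     "CustomerEmail": "info@meridian.co.nz", "Product": "API Add-on"},
]

# Changing one customer's email means touching EVERY matching row.
# Miss a single row and the table silently becomes inconsistent.
for row in orders:
    if row["CustomerName"] == "Meridian Health":
        row["CustomerEmail"] = "contact@meridian.co.nz"
```

The loop only works because every row was found and updated; in a real table with hundreds of rows, any filter mistake leaves stale copies behind.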
After normalization (good design)
Split into separate tables:
Customers table:
| CustomerID | Name | Email | City |
|---|---|---|---|
| C001 | Meridian Health | info@meridian.co.nz | Auckland |
| C002 | Kiwi Fitness | hello@kiwi.fit | Wellington |
Orders table:
| OrderID | CustomerID (FK) | Product | Price |
|---|---|---|---|
| 5001 | C001 | Pro Plan | 49.00 |
| 5002 | C001 | API Add-on | 19.00 |
| 5003 | C002 | Pro Plan | 49.00 |
Now each fact lives in ONE place. Update Meridian’s email once, and it’s correct everywhere.
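Here is a runnable sketch of the two-table design using SQLite (the table and column names follow the example above; the column types, in-memory database, and new email address are illustrative assumptions):

```python
import sqlite3

# Build the normalized schema: Customers holds each fact once,
# Orders references customers by CustomerID (a foreign key).
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("""CREATE TABLE Customers (
    CustomerID TEXT PRIMARY KEY,
    Name TEXT, Email TEXT, City TEXT)""")
conn.execute("""CREATE TABLE Orders (
    OrderID INTEGER PRIMARY KEY,
    CustomerID TEXT REFERENCES Customers(CustomerID),
    Product TEXT, Price REAL)""")
conn.executemany("INSERT INTO Customers VALUES (?,?,?,?)", [
    ("C001", "Meridian Health", "info@meridian.co.nz", "Auckland"),
    ("C002", "Kiwi Fitness", "hello@kiwi.fit", "Wellington")])
conn.executemany("INSERT INTO Orders VALUES (?,?,?,?)", [
    (5001, "C001", "Pro Plan", 49.00),
    (5002, "C001", "API Add-on", 19.00),
    (5003, "C002", "Pro Plan", 49.00)])

# The email lives in ONE row, so one UPDATE fixes it everywhere.
conn.execute("UPDATE Customers SET Email = ? WHERE CustomerID = ?",
             ("contact@meridian.co.nz", "C001"))

# Every order sees the corrected email through the join.
rows = conn.execute("""SELECT o.OrderID, c.Email
                       FROM Orders o
                       JOIN Customers c ON o.CustomerID = c.CustomerID
                       WHERE c.CustomerID = 'C001'""").fetchall()
```

One `UPDATE` statement, and both of Meridian's orders report the new address — exactly the consistency guarantee the flat design could not give.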
Normal forms (simplified)
Normalization is done in stages called normal forms. For DP-900, you need to understand the concept, not memorise every rule:
| Normal Form | Key Rule | Plain English |
|---|---|---|
| 1NF | No repeating groups; each cell has one value | Every column stores a single value, not a list |
| 2NF | Every non-key column depends on the WHOLE primary key | No partial dependencies in tables with composite keys |
| 3NF | Every non-key column depends ONLY on the primary key | No column depends on another non-key column |
Normal forms by example
1NF violation — a “Products” column storing “Pro Plan, API Add-on” in one cell. Fix: one row per product.
2NF violation — a table with composite key (OrderID + ProductID) where CustomerName depends only on OrderID, not on the full key. Fix: move CustomerName to a separate table.
3NF violation — a table where City depends on PostalCode, which depends on CustomerID. City doesn’t depend directly on the primary key. Fix: move City and PostalCode to a separate location table.
In practice, most well-designed databases achieve 3NF. That’s the sweet spot between data integrity and query simplicity.
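The 1NF fix described above — one row per product instead of a comma-separated list — can be sketched like this (the order ID and product names come from the earlier example; the splitting logic is my own illustration):

```python
# 1NF violation: a single cell holds a list of values.
flat = [(5001, "Pro Plan, API Add-on")]

# Fix: expand the multi-valued cell into one row per product.
normalized = [(order_id, product.strip())
              for order_id, products in flat
              for product in products.split(",")]

print(normalized)  # [(5001, 'Pro Plan'), (5001, 'API Add-on')]
```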
Normalization vs denormalization
Normalization minimises duplication — great for transactional databases. But analytical databases (data warehouses) often denormalize on purpose — combining tables to speed up read-heavy queries.
| Feature | Normalized | Denormalized |
|---|---|---|
| Duplication | Minimal — each fact stored once | Intentional — data repeated for speed |
| Updates | Fast — update one place | Slow — update many places |
| Read queries | Require joins (slower for big queries) | Pre-joined data (faster reads) |
| Best for | Transactional (OLTP) databases | Analytical (OLAP) data warehouses |
| Integrity risk | Low — constraints enforced | Higher — duplicates can get out of sync |
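Deliberate denormalization can be sketched as pre-joining the tables once at load time, so read queries never pay the join cost. This is purely illustrative — the data echoes the earlier example, and the wide-row layout is an assumption, not a prescribed warehouse schema:

```python
# Normalized source data: customer facts stored once.
customers = {"C001": ("Meridian Health", "Auckland"),
             "C002": ("Kiwi Fitness", "Wellington")}
orders = [(5001, "C001", "Pro Plan", 49.00),
          (5003, "C002", "Pro Plan", 49.00)]

# Denormalize: join once up front and store the repeated columns.
# Reads are now simple scans, but the customer data is duplicated
# and must be rebuilt whenever a customer record changes.
wide = [(oid, cid, *customers[cid], product, price)
        for oid, cid, product, price in orders]
```

This is the trade in the table above made concrete: faster reads bought with duplication, and an integrity risk if the copies drift apart.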
Exam tip: normalization questions
The exam asks about normalization at a conceptual level:
- “Reduce data redundancy” → Normalization
- “Prevent update anomalies” → Normalization
- “Split one large table into multiple related tables” → Normalization
- “Combine tables for faster reporting” → Denormalization
- “Optimise for analytical queries” → Denormalization
You don’t need to identify specific normal forms (1NF, 2NF, 3NF) on the exam — just understand WHY normalization exists and when denormalization is appropriate.
Knowledge check
Jake notices that when he updates a customer's email address, he has to change it in 47 different order records. What database design problem is this?
Priya's data warehouse combines customer, product, and sales data into a single wide table for faster reporting. This is an example of:
Next up: SQL Basics: SELECT, INSERT, UPDATE, DELETE — learn the language that relational databases speak.