Data Ingestion & Processing
Before data can be analysed, it needs to be collected, cleaned, and loaded. Learn about ETL, ELT, and the pipelines that make analytics possible.
Getting data from “there” to “here”
Data doesn’t magically appear in dashboards. Someone has to collect it, clean it, and deliver it.
Think of Priya’s FreshMart. Sales data lives in 50 different store systems. Customer feedback is in emails. Inventory is in a separate app. Before Priya can build a dashboard showing “sales by region,” all of that data needs to be pulled together, cleaned up, and put in one place.
Ingestion is the “pulling together” part — collecting data from all those sources. Processing is the “cleaning up” part — fixing errors, combining formats, and making it ready for analysis. Together, they form a data pipeline.
ETL vs ELT
Two approaches to moving and transforming data:
| Feature | ETL | ELT |
|---|---|---|
| Full name | Extract, Transform, Load | Extract, Load, Transform |
| When transform happens | Before loading into the destination | After loading raw data into the destination |
| Where transform runs | Separate processing engine | Inside the destination (data lake/warehouse) |
| Raw data preserved? | No — only the transformed result is stored | Yes — raw data stays in the lake |
| Best for | Traditional data warehouses | Modern data lakes and lakehouses (Fabric, Databricks) |
| Flexibility | Must re-extract if requirements change | Re-transform from preserved raw data |
The modern trend is ELT — load raw data first (cheap storage in a data lake), then transform it using powerful cloud compute. This preserves the original data and allows re-processing when business requirements change.
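The difference is easiest to see in code. Below is a minimal sketch contrasting the two patterns with toy in-memory "stores" — the data, function names, and the `lake`/`warehouse` variables are all invented for illustration, not a real pipeline API:

```python
# Toy source: messy CSV-ish rows with stray whitespace
raw_source = [" Alice,120 ", "Bob,85", " Carol,240 "]

def extract():
    return list(raw_source)

def transform(rows):
    # clean whitespace and parse each row into (name, amount)
    out = []
    for row in rows:
        name, amount = row.strip().split(",")
        out.append((name, int(amount)))
    return out

# --- ETL: transform BEFORE loading; only the cleaned result is stored ---
warehouse = transform(extract())        # raw rows are discarded after this

# --- ELT: load raw data first, transform inside the destination ---
lake = extract()                        # raw rows preserved in the "lake"
warehouse_from_lake = transform(lake)   # transform runs over stored raw data

# Requirements change? ELT can re-transform without re-extracting:
big_spenders = [name for name, amt in transform(lake) if amt > 100]
print(big_spenders)  # → ['Alice', 'Carol']
```

The last line is the key point: because the lake still holds the raw rows, a new business question ("who spent over 100?") is answered by re-running a transform, not by going back to the 50 store systems.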
Azure Data Factory
Azure Data Factory is Microsoft’s cloud data integration service. It orchestrates and automates data movement and transformation.
Key concepts:
- Pipelines: Logical groupings of activities that together perform a data integration task
- Activities: Individual steps in a pipeline (copy data, run a script, call an API)
- Datasets: References to the data you want to use (a SQL table, a blob file)
- Linked services: Connection definitions (server, credentials) for source and destination systems
- Triggers: Schedules or events that start a pipeline (run nightly, on file arrival)
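To show how the five concepts fit together, here is a hedged sketch using plain Python dicts. It mirrors the spirit of a Data Factory JSON definition, not its exact schema — names like `nightly_sales` and `Nightly2AM` are invented:

```python
# A linked service: connection info for a source system
linked_service = {
    "name": "StoreSqlServer",
    "type": "SqlServer",
}

# A dataset: a reference to data, reached via a linked service
dataset = {
    "name": "SalesTable",
    "linkedService": "StoreSqlServer",
    "table": "dbo.Sales",
}

# A pipeline: a logical grouping of activities
pipeline = {
    "name": "nightly_sales",
    "activities": [
        {"name": "CopySales", "type": "Copy",
         "input": "SalesTable", "output": "LakehouseRawSales"},
        {"name": "CleanSales", "type": "Notebook",
         "notebook": "clean_sales"},
    ],
}

# A trigger: what starts the pipeline
trigger = {
    "name": "Nightly2AM",
    "type": "Schedule",
    "recurrence": "0 2 * * *",  # cron-style: every day at 02:00
    "pipeline": "nightly_sales",
}

print([a["name"] for a in pipeline["activities"]])  # → ['CopySales', 'CleanSales']
```

Notice the layering: the trigger points at the pipeline, the pipeline's copy activity points at datasets, and the dataset points at a linked service. That separation is what lets one connection definition be reused across many pipelines.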
Priya’s pipeline example:
- Extract: Copy sales data from 50 store databases (every night at 2 AM)
- Load: Land the raw data in a Fabric lakehouse as Parquet files
- Transform: Run Spark notebooks to clean, validate, and aggregate the data
- Serve: Write the transformed data to a warehouse table for Power BI
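The four stages above can be sketched as a toy orchestration. Each stage records when it runs so the extract → load → transform → serve ordering is visible; the helper names and fake per-store sales figures are invented for illustration:

```python
log = []  # records the order in which stages run

def extract_store_sales(store_ids):
    log.append("extract")
    # pretend each store system returns one (store_id, sales) row
    return [(s, 100 + s) for s in store_ids]

def load_raw_to_lakehouse(rows):
    log.append("load")
    return {"raw_sales": rows}  # stand-in for Parquet files in the lakehouse

def transform_with_spark(lakehouse):
    log.append("transform")
    # clean/validate/aggregate: here, total sales across all stores
    return sum(sales for _, sales in lakehouse["raw_sales"])

def serve_to_warehouse(total):
    log.append("serve")
    return {"sales_summary": total}  # warehouse table feeding Power BI

rows = extract_store_sales(range(1, 51))   # 50 stores
lakehouse = load_raw_to_lakehouse(rows)
total = transform_with_spark(lakehouse)
warehouse = serve_to_warehouse(total)

print(log)  # → ['extract', 'load', 'transform', 'serve']
```

In the real pipeline, a nightly trigger would kick off this sequence, the copy activity would do the extract and load, and a Spark notebook would do the transform.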
Data Factory vs Fabric pipelines
Microsoft Fabric includes its own pipeline capability (built on Data Factory technology). The key difference:
- Azure Data Factory (standalone) — a dedicated service for data integration, works with any Azure storage or external system
- Fabric pipelines — built into the Fabric platform, tightly integrated with lakehouses, warehouses, and notebooks
For new analytics projects using Fabric, use Fabric pipelines. For complex multi-cloud or hybrid scenarios, standalone Data Factory may be more appropriate.
Data processing considerations
When designing data pipelines, consider:
| Consideration | What to Think About |
|---|---|
| Latency | How quickly must new data be available? Real-time? Daily batch? |
| Volume | How much data per day? Gigabytes or petabytes? |
| Format | What format is the source data? CSV, JSON, database tables? |
| Quality | How clean is the source data? Do you need validation, deduplication? |
| Frequency | One-time migration or ongoing scheduled pipeline? |
| Security | Does the data contain PII? Encryption requirements? |
Exam tip: ingestion patterns
The exam tests your understanding of ingestion concepts:
- “Load raw data first, then transform” → ELT
- “Transform data before loading” → ETL
- “Orchestrate data movement on a schedule” → Azure Data Factory / Fabric pipelines
- “Connect to 50 different data sources” → Data Factory with linked services
- “Preserve raw data for re-processing” → ELT with a data lake
Knowledge check
FreshMart wants to keep the original raw sales data in their data lake, so they can re-process it if reporting requirements change. Which approach should the data engineering team use?
Pacific Freight needs to copy delivery data from their on-premises SQL Server to a Fabric lakehouse every night at midnight. Which Azure service orchestrates this?
Next up: Analytical Data Stores: Data Lakes, Warehouses & Lakehouses — where does all that data go after ingestion?