Data Ingestion & Processing
Before data can be analysed, it needs to be collected, cleaned, and loaded. Learn about ETL, ELT, and the pipelines that make analytics possible.
Getting data from “there” to “here”
Data doesn’t magically appear in dashboards. Someone has to collect it, clean it, and deliver it.
Think of Priya’s FreshMart. Sales data lives in 50 different store systems. Customer feedback is in emails. Inventory is in a separate app. Before Priya can build a dashboard showing “sales by region,” all of that data needs to be pulled together, cleaned up, and put in one place.
Ingestion is the “pulling together” part — collecting data from all those sources. Processing is the “cleaning up” part — fixing errors, combining formats, and making it ready for analysis. Together, they form a data pipeline.
ETL vs ELT
Two approaches to moving and transforming data:
| Feature | ETL | ELT |
|---|---|---|
| Full name | Extract, Transform, Load | Extract, Load, Transform |
| When transform happens | Before loading into the destination | After loading raw data into the destination |
| Where transform runs | Separate processing engine | Inside the destination (data lake/warehouse) |
| Raw data preserved? | No — only the transformed result is stored | Yes — raw data stays in the lake |
| Best for | Traditional data warehouses | Modern data lakes and lakehouses (Fabric, Databricks) |
| Flexibility | Must re-extract if requirements change | Re-transform from preserved raw data |
The modern trend is ELT — load raw data first (cheap storage in a data lake), then transform it using powerful cloud compute. This preserves the original data and allows re-processing when business requirements change.
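The difference is easiest to see in code. Below is a minimal sketch contrasting the two patterns with toy in-memory "stores" — the data, function names, and the `lake`/`warehouse` variables are all invented for illustration, not a real pipeline API:

```python
# Toy source: messy CSV-ish rows with stray whitespace
raw_source = [" Alice,120 ", "Bob,85", " Carol,240 "]

def extract():
    return list(raw_source)

def transform(rows):
    # clean whitespace and parse each row into (name, amount)
    out = []
    for row in rows:
        name, amount = row.strip().split(",")
        out.append((name, int(amount)))
    return out

# --- ETL: transform BEFORE loading; only the cleaned result is stored ---
warehouse = transform(extract())        # raw rows are discarded after this

# --- ELT: load raw data first, transform inside the destination ---
lake = extract()                        # raw rows preserved in the "lake"
warehouse_from_lake = transform(lake)   # transform runs over stored raw data

# Requirements change? ELT can re-transform without re-extracting:
big_spenders = [name for name, amt in transform(lake) if amt > 100]
print(big_spenders)  # → ['Alice', 'Carol']
```

The last line is the key point: because the lake still holds the raw rows, a new business question ("who spent over 100?") is answered by re-running a transform, not by going back to the 50 store systems.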
Azure Data Factory
Azure Data Factory is Microsoft’s cloud data integration service. It orchestrates and automates data movement and transformation.
Key concepts:
- Pipelines: Logical groupings of activities that together perform a data integration task
- Activities: Individual steps in a pipeline (copy data, run a script, call an API)
- Datasets: References to the data you want to use (a SQL table, a blob file)
- Linked services: Connection definitions (server, credentials) for source and destination systems
- Triggers: Schedules or events that start a pipeline (run nightly, on file arrival)
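To show how the five concepts fit together, here is a hedged sketch using plain Python dicts. It mirrors the spirit of a Data Factory JSON definition, not its exact schema — names like `nightly_sales` and `Nightly2AM` are invented:

```python
# A linked service: connection info for a source system
linked_service = {
    "name": "StoreSqlServer",
    "type": "SqlServer",
}

# A dataset: a reference to data, reached via a linked service
dataset = {
    "name": "SalesTable",
    "linkedService": "StoreSqlServer",
    "table": "dbo.Sales",
}

# A pipeline: a logical grouping of activities
pipeline = {
    "name": "nightly_sales",
    "activities": [
        {"name": "CopySales", "type": "Copy",
         "input": "SalesTable", "output": "LakehouseRawSales"},
        {"name": "CleanSales", "type": "Notebook",
         "notebook": "clean_sales"},
    ],
}

# A trigger: what starts the pipeline
trigger = {
    "name": "Nightly2AM",
    "type": "Schedule",
    "recurrence": "0 2 * * *",  # cron-style: every day at 02:00
    "pipeline": "nightly_sales",
}

print([a["name"] for a in pipeline["activities"]])  # → ['CopySales', 'CleanSales']
```

Notice the layering: the trigger points at the pipeline, the pipeline's copy activity points at datasets, and the dataset points at a linked service. That separation is what lets one connection definition be reused across many pipelines.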
Priya’s pipeline example:
- Extract: Copy sales data from 50 store databases (every night at 2 AM)
- Load: Land the raw data in a Fabric lakehouse as Parquet files
- Transform: Run Spark notebooks to clean, validate, and aggregate the data
- Serve: Write the transformed data to a warehouse table for Power BI
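The four stages above can be sketched as a toy orchestration. Each stage records when it runs so the extract → load → transform → serve ordering is visible; the helper names and fake per-store sales figures are invented for illustration:

```python
log = []  # records the order in which stages run

def extract_store_sales(store_ids):
    log.append("extract")
    # pretend each store system returns one (store_id, sales) row
    return [(s, 100 + s) for s in store_ids]

def load_raw_to_lakehouse(rows):
    log.append("load")
    return {"raw_sales": rows}  # stand-in for Parquet files in the lakehouse

def transform_with_spark(lakehouse):
    log.append("transform")
    # clean/validate/aggregate: here, total sales across all stores
    return sum(sales for _, sales in lakehouse["raw_sales"])

def serve_to_warehouse(total):
    log.append("serve")
    return {"sales_summary": total}  # warehouse table feeding Power BI

rows = extract_store_sales(range(1, 51))   # 50 stores
lakehouse = load_raw_to_lakehouse(rows)
total = transform_with_spark(lakehouse)
warehouse = serve_to_warehouse(total)

print(log)  # → ['extract', 'load', 'transform', 'serve']
```

In the real pipeline, a nightly trigger would kick off this sequence, the copy activity would do the extract and load, and a Spark notebook would do the transform.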
Data Factory vs Fabric pipelines
Microsoft Fabric includes its own pipeline capability (built on Data Factory technology). The key difference:
- Azure Data Factory (standalone) — a dedicated service for data integration, works with any Azure storage or external system
- Fabric pipelines — built into the Fabric platform, tightly integrated with lakehouses, warehouses, and notebooks
For new analytics projects using Fabric, use Fabric pipelines. For complex multi-cloud or hybrid scenarios, standalone Data Factory may be more appropriate.
Data processing considerations
When designing data pipelines, consider:
| Consideration | What to Think About |
|---|---|
| Latency | How quickly must new data be available? Real-time? Daily batch? |
| Volume | How much data per day? Gigabytes or petabytes? |
| Format | What format is the source data? CSV, JSON, database tables? |
| Quality | How clean is the source data? Do you need validation, deduplication? |
| Frequency | One-time migration or ongoing scheduled pipeline? |
| Security | Does the data contain PII? Encryption requirements? |
Exam tip: ingestion patterns
The exam tests your understanding of ingestion concepts:
- “Load raw data first, then transform” → ELT
- “Transform data before loading” → ETL
- “Orchestrate data movement on a schedule” → Azure Data Factory / Fabric pipelines
- “Connect to 50 different data sources” → Data Factory with linked services
- “Preserve raw data for re-processing” → ELT with a data lake
Knowledge check
FreshMart wants to keep the original raw sales data in their data lake, so they can re-process it if reporting requirements change. Which approach should the data engineering team use?
Pacific Freight needs to copy delivery data from their on-premises SQL Server to a Fabric lakehouse every night at midnight. Which Azure service orchestrates this?
Next up: Analytical Data Stores: Data Lakes, Warehouses & Lakehouses — where does all that data go after ingestion?