
DP-900 Study Guide

Domain 1: Core Data Concepts

  • Your First Look at Data Free
  • Data File Formats: CSV, JSON, Parquet & More Free
  • Databases: Relational vs Non-Relational Free
  • Transactional Workloads: Keeping Data Consistent Free
  • Analytical Workloads: Finding the Insights Free
  • Data Roles: DBA, Engineer & Analyst Free
  • The Azure Data Landscape Free

Domain 2: Relational Data on Azure

  • Relational Data: Tables, Keys & Relationships
  • Normalization: Why Duplicate Data is Bad
  • SQL Basics: SELECT, INSERT, UPDATE, DELETE
  • Database Objects: Views, Indexes & More
  • Azure SQL: Your Database in the Cloud
  • Open-Source Databases on Azure
  • Choosing the Right Azure Database

Domain 3: Non-Relational Data on Azure

  • Azure Blob Storage: Files in the Cloud
  • Azure Files & Table Storage
  • Azure Cosmos DB: The Global Database
  • Cosmos DB APIs: SQL, MongoDB & More
  • Choosing Non-Relational Storage

Domain 4: Analytics on Azure

  • Data Ingestion & Processing
  • Analytical Data Stores: Data Lakes, Warehouses & Lakehouses
  • Microsoft Fabric & Azure Databricks
  • Batch vs Streaming: Two Speeds of Data
  • Real-Time Analytics on Azure
  • Power BI: See Your Data
  • Data Models in Power BI
  • Choosing the Right Visualization

Domain 4: Analytics on Azure (~12 min read)

Data Ingestion & Processing

Before data can be analysed, it needs to be collected, cleaned, and loaded. Learn about ETL, ELT, and the pipelines that make analytics possible.

Getting data from “there” to “here”

☕ Simple explanation

Data doesn’t magically appear in dashboards. Someone has to collect it, clean it, and deliver it.

Think of Priya’s FreshMart. Sales data lives in 50 different store systems. Customer feedback is in emails. Inventory is in a separate app. Before Priya can build a dashboard showing “sales by region,” all of that data needs to be pulled together, cleaned up, and put in one place.

Ingestion is the “pulling together” part — collecting data from all those sources. Processing is the “cleaning up” part — fixing errors, combining formats, and making it ready for analysis. Together, they form a data pipeline.

Data ingestion is the process of collecting data from source systems and loading it into an analytical store. Data processing (transformation) cleans, restructures, and enriches the raw data to make it suitable for analysis. Together, these form the data pipeline — the plumbing that connects operational systems to analytics platforms.
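The two halves can be sketched in a few lines of pandas. Everything below (the store CSVs, column names) is invented purely for illustration:

```python
# Minimal sketch of ingestion + processing; store data is hypothetical.
import io

import pandas as pd

# Ingestion: collect raw data from several "source systems" (here, CSV text).
store_a = io.StringIO("store,amount\nA,10.0\nA,\n")   # second row has a missing amount
store_b = io.StringIO("store,amount\nB,25.5\nB,4.5\n")
raw = pd.concat([pd.read_csv(store_a), pd.read_csv(store_b)], ignore_index=True)

# Processing: clean (drop rows with missing amounts) and aggregate.
clean = raw.dropna(subset=["amount"])
sales_by_store = clean.groupby("store")["amount"].sum()

print(sales_by_store.to_dict())  # {'A': 10.0, 'B': 30.0}
```

In a real pipeline the sources would be databases and APIs rather than strings, and the result would land in an analytical store rather than a print statement, but the collect-clean-aggregate shape is the same.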

ETL vs ELT

Two approaches to moving and transforming data:

| Feature | ETL | ELT |
| --- | --- | --- |
| Full name | Extract, Transform, Load | Extract, Load, Transform |
| When the transform happens | Before loading into the destination | After loading raw data into the destination |
| Where the transform runs | Separate processing engine | Inside the destination (data lake/warehouse) |
| Raw data preserved? | No — only the transformed result is stored | Yes — raw data stays in the lake |
| Best for | Traditional data warehouses | Modern data lakes and lakehouses (Fabric, Databricks) |
| Flexibility | Must re-extract if requirements change | Re-transform from preserved raw data |

The modern trend is ELT — load raw data first (cheap storage in a data lake), then transform it using powerful cloud compute. This preserves the original data and allows re-processing when business requirements change.
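As a toy pandas illustration (the data and "lake" dict are invented), both orderings produce the same report, but only ELT keeps the raw rows around for re-processing:

```python
# Hedged sketch contrasting ETL and ELT; all names and data are illustrative.
import pandas as pd

source = pd.DataFrame({"region": ["N", "N", "S"], "sales": [100, 200, 50]})

# ETL: transform first, then load only the result (raw rows are not kept).
etl_result = source.groupby("region", as_index=False)["sales"].sum()

# ELT: load the raw data as-is first (our stand-in "data lake"), transform later.
lake = {"raw/sales.parquet": source.copy()}  # raw data preserved
elt_result = lake["raw/sales.parquet"].groupby("region", as_index=False)["sales"].sum()

# Same report either way, but ELT can re-transform from the preserved raw data
# when requirements change (say, averages instead of totals).
new_requirement = lake["raw/sales.parquet"].groupby("region", as_index=False)["sales"].mean()
```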

Azure Data Factory

Azure Data Factory is Microsoft’s cloud data integration service. It orchestrates and automates data movement and transformation.

Key concepts:

  • Pipelines: A logical grouping of activities that perform a data integration task
  • Activities: Individual steps in a pipeline (copy data, run a script, call an API)
  • Datasets: References to the data you want to use (a SQL table, a blob file)
  • Linked services: Connection strings to source and destination systems
  • Triggers: Schedules or events that start a pipeline (run nightly, on file arrival)
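To show how these five concepts fit together, here is a conceptual sketch only: the building blocks modelled as a plain Python dict. The names ("NightlySales", "StoreSql", and so on) are invented, and this is not the real Data Factory JSON authoring format:

```python
# Conceptual model of Data Factory's building blocks; all names are hypothetical.
pipeline = {
    "name": "NightlySales",                       # pipeline: groups the activities
    "activities": [                               # activities: individual steps
        {
            "name": "CopySales",
            "type": "Copy",
            "input_dataset": "StoreSalesTable",   # dataset: reference to the data
            "output_dataset": "LakehouseParquet",
        },
    ],
    "linked_services": {                          # linked services: connections
        "StoreSql": "Server=store-db;Database=sales",
        "Lakehouse": "https://lakehouse.example/freshmart",
    },
    "trigger": {"type": "schedule", "when": "daily 02:00"},  # trigger: when to run
}
```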

Priya’s pipeline example:

  1. Extract: Copy sales data from 50 store databases (every night at 2 AM)
  2. Load: Land the raw data in a Fabric lakehouse as Parquet files
  3. Transform: Run Spark notebooks to clean, validate, and aggregate the data
  4. Serve: Write the transformed data to a warehouse table for Power BI
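The four steps above could be sketched as plain Python functions (the store data, paths, and in-memory "lakehouse" are all hypothetical; a real pipeline would use Data Factory or Fabric activities instead):

```python
# Toy ELT pipeline mirroring Priya's four steps; everything here is invented.
import pandas as pd

lakehouse: dict[str, pd.DataFrame] = {}  # stand-in for the Fabric lakehouse

def extract() -> list[pd.DataFrame]:
    # 1. Extract: copy sales data from each store system (two stores shown).
    return [
        pd.DataFrame({"store": ["A"], "amount": [10.0]}),
        pd.DataFrame({"store": ["B"], "amount": [None]}),  # bad row to clean later
    ]

def load(frames: list[pd.DataFrame]) -> None:
    # 2. Load: land the raw data untransformed (ELT style), preserving it.
    lakehouse["raw/sales"] = pd.concat(frames, ignore_index=True)

def transform() -> None:
    # 3. Transform: clean and aggregate into a serving table.
    clean = lakehouse["raw/sales"].dropna(subset=["amount"])
    lakehouse["warehouse/sales_by_store"] = (
        clean.groupby("store", as_index=False)["amount"].sum()
    )

def run_pipeline() -> None:
    # 4. Serve: the warehouse table is now ready for Power BI. In practice a
    # trigger (e.g. nightly at 2 AM) would start this run.
    load(extract())
    transform()

run_pipeline()
```

Note that the raw table survives alongside the cleaned one, which is exactly the ELT property the next lesson's data-lake discussion builds on.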
ℹ️ Data Factory vs Fabric pipelines

Microsoft Fabric includes its own pipeline capability (built on Data Factory technology). The key difference:

  • Azure Data Factory (standalone) — a dedicated service for data integration, works with any Azure storage or external system
  • Fabric pipelines — built into the Fabric platform, tightly integrated with lakehouses, warehouses, and notebooks

For new analytics projects using Fabric, use Fabric pipelines. For complex multi-cloud or hybrid scenarios, standalone Data Factory may be more appropriate.

Data processing considerations

When designing data pipelines, consider:

| Consideration | What to think about |
| --- | --- |
| Latency | How quickly must new data be available? Real-time? Daily batch? |
| Volume | How much data per day? Gigabytes or petabytes? |
| Format | What format is the source data? CSV, JSON, database tables? |
| Quality | How clean is the source data? Do you need validation, deduplication? |
| Frequency | One-time migration or ongoing scheduled pipeline? |
| Security | Does the data contain PII? Encryption requirements? |
💡 Exam tip: ingestion patterns

The exam tests your understanding of ingestion concepts:

  • “Load raw data first, then transform” → ELT
  • “Transform data before loading” → ETL
  • “Orchestrate data movement on a schedule” → Azure Data Factory / Fabric pipelines
  • “Connect to 50 different data sources” → Data Factory with linked services
  • “Preserve raw data for re-processing” → ELT with a data lake

Flashcards

Question

What is the difference between ETL and ELT?

Answer

ETL transforms data before loading (in a separate engine). ELT loads raw data first, then transforms inside the destination (data lake/warehouse). ELT is the modern approach — it preserves raw data and leverages cloud compute.

Question

What is Azure Data Factory?

Answer

A cloud data integration service that orchestrates data movement and transformation through pipelines. It connects to 90+ data sources, schedules workflows, and automates ETL/ELT processes.

Question

What is a data pipeline?

Answer

A series of automated steps that extract data from sources, transform it (clean, validate, restructure), and load it into a destination for analysis. Pipelines can run on a schedule or be triggered by events.

Knowledge check

  1. FreshMart wants to keep the original raw sales data in their data lake so they can re-process it if reporting requirements change. Which approach should the data engineering team use?

  2. Pacific Freight needs to copy delivery data from their on-premises SQL Server to a Fabric lakehouse every night at midnight. Which Azure service orchestrates this?

🎬 Video coming soon

Next up: Analytical Data Stores: Data Lakes, Warehouses & Lakehouses — where does all that data go after ingestion?



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.