
DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data (Free · ~15 min read)

Data Modeling: Ingestion Design

Choose the right ingestion tools, loading methods, table formats, and managed vs external tables — the architectural decisions that shape your entire lakehouse.

Designing your ingestion architecture

☕ Simple explanation

Ingestion design is like planning a kitchen’s supply chain before opening a restaurant.

Where do ingredients come from (sources)? How often do deliveries arrive — daily truck (batch) or continuous conveyor belt (streaming)? What containers do they come in (file formats)? Which dock do they arrive at (ingestion tool)? Do you own the storage room or rent it (managed vs external)?

Get these decisions wrong and you spend months refactoring. Get them right and your data pipeline practically builds itself.

Data ingestion design involves selecting the extraction pattern (full vs incremental), source format (CSV, JSON, Parquet), loading method (batch vs streaming), ingestion tool (Lakeflow Connect, notebooks, ADF), target format (Delta, Iceberg), and table ownership (managed vs external). These decisions cascade through your entire architecture.

Extraction types

| Extraction Type | How It Works | Best For |
| --- | --- | --- |
| Full extraction | Copy all data from the source every time | Small reference tables, initial loads |
| Incremental extraction | Copy only new or changed records since the last run | Large transactional tables, frequent updates |
| CDC (Change Data Capture) | Capture individual row-level changes (insert, update, delete) | Real-time sync, audit trail |

Mei Lin uses incremental extraction for Freshmart’s daily POS data (millions of transactions). For the product catalogue (5,000 items), she uses full extraction because it’s small and simpler.
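The incremental pattern above can be sketched in plain Python: keep a watermark (the latest change timestamp already processed) and extract only rows newer than it on each run. This is an illustrative sketch, not a Databricks API — the `updated_at` field and the POS rows are made-up examples.

```python
from datetime import datetime

def extract_incremental(rows, watermark):
    """Return rows changed after `watermark`, plus the advanced watermark."""
    new_rows = [r for r in rows if r["updated_at"] > watermark]
    # If nothing new arrived, the watermark stays where it was.
    new_watermark = max((r["updated_at"] for r in new_rows), default=watermark)
    return new_rows, new_watermark

pos_data = [
    {"txn_id": 1, "updated_at": datetime(2025, 1, 1, 9)},
    {"txn_id": 2, "updated_at": datetime(2025, 1, 1, 23)},
    {"txn_id": 3, "updated_at": datetime(2025, 1, 2, 8)},
]

# Suppose the previous run processed everything up to midnight on Jan 2:
rows, wm = extract_incremental(pos_data, datetime(2025, 1, 2))
# Only transaction 3 is newer than the watermark, so only it is extracted.
```

The key design point is that the watermark must be persisted between runs (in a control table, for example); losing it silently turns an incremental load back into a full one.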

Choosing an ingestion tool

| Feature | Lakeflow Connect | Notebooks | Azure Data Factory |
| --- | --- | --- | --- |
| Best for | SaaS/database connectors | Custom logic, complex transforms | Enterprise orchestration |
| Code required? | Low-code configuration | Python/SQL/Scala | Low-code + custom activities |
| Streaming support | Yes | Yes (Structured Streaming) | Limited |
| Built into Databricks | Yes (native) | Yes (native) | No (Azure service) |
| Exam scenario | Ingest from Salesforce, SAP | Custom ETL with business logic | Orchestrate multi-service pipeline |

Exam decision tree:

  1. Ingesting from a SaaS app or standard database? → Lakeflow Connect
  2. Need custom transformation logic during ingestion? → Notebooks
  3. Orchestrating across Azure services (not just Databricks)? → Azure Data Factory
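For revision purposes, the decision tree above can be encoded as a tiny helper function. This is purely a mnemonic sketch — the flag names are invented, and real tool choice involves more nuance than three booleans.

```python
def pick_ingestion_tool(saas_or_db_source=False, custom_logic=False,
                        cross_azure_services=False):
    """Toy encoding of the exam decision tree (checked in tree order)."""
    if saas_or_db_source and not custom_logic:
        return "Lakeflow Connect"      # standard connector, low-code
    if custom_logic:
        return "Notebooks"             # Python/SQL/Scala transformations
    if cross_azure_services:
        return "Azure Data Factory"    # orchestration beyond Databricks
    return "Lakeflow Connect"          # default for plain connector ingestion
```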
ℹ️ Azure Data Factory integration pattern

ADF doesn’t replace Databricks — it orchestrates it:

  1. ADF pipeline triggers on schedule
  2. Copy Activity moves data from source to ADLS bronze layer
  3. Databricks Notebook Activity transforms bronze to silver to gold
  4. ADF monitors the entire flow

Exam tip: If the question mentions orchestrating across multiple Azure services, ADF is the answer. If ingestion is purely within Databricks, use Lakeflow Connect or notebooks.

Batch vs streaming

| Aspect | Batch | Streaming |
| --- | --- | --- |
| Data arrives | In chunks (hourly, daily) | Continuously |
| Latency | Minutes to hours | Seconds to minutes |
| Cost | Lower (compute runs only during the batch) | Higher (continuous compute) |
| Complexity | Simpler | More complex (state, checkpoints) |
| Use case | Nightly ETL, reporting | Fraud detection, IoT, real-time dashboards |

Tomás at NovaPay uses streaming for fraud detection. Ravi at DataPulse uses batch for nightly client reporting.
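The essential difference can be shown in a few lines of plain Python (this is a conceptual sketch, not Structured Streaming): batch accumulates events and processes them once per window, while streaming updates a running result as each event arrives, keeping state between events.

```python
events = [10, 20, 30, 40]  # made-up transaction amounts

# Batch: collect events into a window, process the window as one unit.
batch_totals, window = [], []
for e in events:
    window.append(e)
    if len(window) == 2:            # pretend two events make one "hourly" window
        batch_totals.append(sum(window))
        window.clear()

# Streaming: process each event on arrival, carrying state forward.
running_total, stream_outputs = 0, []
for e in events:
    running_total += e              # in a real stream this state is checkpointed
    stream_outputs.append(running_total)
```

The streaming loop emits a result after every event (low latency, always-on state); the batch loop emits only at window boundaries (higher latency, but compute can be shut down between windows) — which is exactly the cost/latency trade-off in the table above.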

Table formats

| Format | ACID | Schema Enforcement | Time Travel | When to Choose |
| --- | --- | --- | --- | --- |
| Delta Lake | Yes | Yes | Yes | Default — always use this |
| Iceberg | Yes | Yes | Yes | Multi-engine interoperability |
| Parquet | No | No | No | Read-only analytics, archival |
| CSV | No | No | No | Data exchange |
| JSON | No | No | No | Semi-structured, API responses |
💡 Delta vs Iceberg
  • Delta Lake = native Databricks format. Best performance, all features.
  • Iceberg = multi-engine format. Choose when data must be read by Trino, Flink, or other engines outside Databricks.

Exam tip: If “interoperability with non-Databricks engines” is mentioned → Iceberg. Otherwise → Delta.
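To make "schema enforcement" concrete, here is a plain-Python analogue of what Delta does on write: rows that don't match the declared column names and types are rejected rather than silently corrupting the table. The schema and rows are invented examples; real Delta enforcement happens inside the engine, not in user code.

```python
# Declared table schema: column name -> expected Python type.
SCHEMA = {"product_id": int, "name": str, "price": float}

def validate_row(row, schema=SCHEMA):
    """Accept a row only if its columns and types match the schema exactly."""
    if set(row) != set(schema):
        return False                  # missing or unexpected columns
    return all(isinstance(row[col], typ) for col, typ in schema.items())

good = {"product_id": 1, "name": "Oat milk", "price": 2.49}
bad = {"product_id": "one", "name": "Oat milk", "price": 2.49}  # wrong type
```

Formats without schema enforcement (CSV, JSON, plain Parquet writes outside a table format) happily accept the `bad` row, and the error surfaces much later, at read time.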

Managed vs external tables (design decision)

| When to Choose | Managed | External |
| --- | --- | --- |
| Standard lakehouse tables | Yes | |
| DROP should delete data | Yes | |
| Data shared across systems | | Yes |
| Pre-existing data in ADLS | | Yes |
| DROP should keep data files | | Yes |
Question

What are the three extraction types for data ingestion?


Answer

Full extraction (all data every time — small tables), Incremental extraction (new/changed records only — large tables), CDC (individual row-level changes — real-time sync).


Question

When should you use Lakeflow Connect vs notebooks vs Azure Data Factory?


Answer

Lakeflow Connect: SaaS/database connectors. Notebooks: custom transformation logic. ADF: orchestrating pipelines across multiple Azure services beyond Databricks.


Question

Why is Delta Lake the default table format?


Answer

Delta Lake provides ACID transactions, schema enforcement, time travel, and OPTIMIZE/VACUUM. It's the native lakehouse format with the best Databricks performance.



Knowledge check

Ravi needs to ingest customer data from Salesforce into DataPulse's lakehouse. The data should arrive daily with minimal custom code. Which tool is most appropriate?

Knowledge Check

Tomás needs fraud detection within 5 seconds of a transaction. Historical reporting refreshes every 6 hours. Which loading methods should he use?


Next up: SCD, Granularity & Temporal Tables — slowly changing dimensions, granularity decisions, and history tables.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.