DP-750 Study Guide

Domain 1: Set Up and Configure an Azure Databricks Environment

  • Azure Databricks: Your Lakehouse Platform Free
  • Choosing the Right Compute Free
  • Configuring Compute for Performance Free
  • Unity Catalog: The Three-Level Namespace Free
  • Tables, Views & External Catalogs Free

Domain 2: Secure and Govern Unity Catalog Objects

  • Securing Unity Catalog: Who Gets What
  • Secrets & Authentication
  • Data Discovery & Attribute-Based Access
  • Row Filters, Column Masks & Retention
  • Lineage, Audit Logs & Delta Sharing

Domain 3: Prepare and Process Data

  • Data Modeling: Ingestion Design Free
  • SCD, Granularity & Temporal Tables
  • Partitioning, Clustering & Table Optimization
  • Ingesting Data: Lakeflow Connect & Notebooks
  • Ingesting Data: SQL Methods & CDC
  • Streaming Ingestion: Structured Streaming & Event Hubs
  • Auto Loader & Declarative Pipelines
  • Cleansing & Profiling Data Free
  • Transforming & Loading Data
  • Data Quality & Schema Enforcement

Domain 4: Deploy and Maintain Data Pipelines and Workloads

  • Building Data Pipelines Free
  • Lakeflow Jobs: Create & Configure
  • Lakeflow Jobs: Schedule, Alerts & Recovery
  • Git & Version Control
  • Testing & Databricks Asset Bundles
  • Monitoring Clusters & Troubleshooting
  • Spark Performance: DAG & Query Profile
  • Optimizing Delta Tables & Azure Monitor

Domain 3: Prepare and Process Data (Premium) · ⏱ ~13 min read

Ingesting Data: SQL Methods & CDC

CTAS, CREATE OR REPLACE TABLE, COPY INTO, and change data capture feeds — the SQL-based ingestion methods the exam tests explicitly.

SQL-based ingestion methods

☕ Simple explanation

SQL gives you three ways to move data into a table — each with different superpowers.

  • CTAS (CREATE TABLE AS SELECT) — creates a brand-new table from a query. Like photocopying a document into a new folder.
  • CREATE OR REPLACE TABLE — creates the table or completely overwrites it if it exists. Like printing a fresh copy every time.
  • COPY INTO — appends data from files into an existing table, skipping files already loaded. Like a mailroom that stamps each letter "processed."

Databricks SQL provides three primary ingestion statements: CTAS for creating tables from queries, CREATE OR REPLACE TABLE for idempotent full-refresh loads, and COPY INTO for incremental file-based loading with exactly-once semantics. Each serves different ingestion patterns.

CTAS (CREATE TABLE AS SELECT)

-- Create a new table from a query
CREATE TABLE silver.clean_orders AS
SELECT order_id, customer_id, amount, order_date
FROM bronze.raw_orders
WHERE amount > 0 AND order_date IS NOT NULL;

Key facts:

  • Creates a new Delta table
  • Fails if the table already exists (use IF NOT EXISTS or CREATE OR REPLACE)
  • Schema is inferred from the SELECT query
  • Data is copied β€” the new table is independent of the source

CREATE OR REPLACE TABLE

-- Full refresh: replace the table contents entirely
CREATE OR REPLACE TABLE gold.daily_summary AS
SELECT
  order_date,
  COUNT(*) AS order_count,
  SUM(amount) AS total_revenue
FROM silver.clean_orders
GROUP BY order_date;

Key facts:

  • Idempotent — running it twice produces the same result
  • Replaces both schema AND data
  • Preserves table history (you can time travel to previous versions)
  • Perfect for full-refresh aggregation tables
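The idempotency claim is worth internalizing: a full refresh replaces the table's contents, so repeating it changes nothing, while a blind append duplicates rows on every run. A minimal pure-Python illustration (a plain list stands in for the table; this is not the Databricks API):

```python
# Why a full refresh (CREATE OR REPLACE TABLE ... AS SELECT) is
# idempotent while a plain INSERT INTO append is not. The "table"
# here is just a list of rows.

def full_refresh(table, source_rows):
    """Replace the table contents entirely, like CREATE OR REPLACE."""
    return list(source_rows)

def blind_append(table, source_rows):
    """Append with no dedup, like a plain INSERT INTO."""
    return table + list(source_rows)

source = [("2026-04-01", 120), ("2026-04-02", 95)]

# Running the full refresh twice yields the same table both times.
t = full_refresh([], source)
t = full_refresh(t, source)
print(t)        # still two rows, no duplicates

# Running the append twice duplicates every row.
a = blind_append([], source)
a = blind_append(a, source)
print(len(a))   # 4 rows: every source row loaded twice
```

Accidentally re-running the nightly job is therefore harmless with the first pattern and silently doubles revenue figures with the second.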

COPY INTO

-- Incrementally load new CSV files into an existing table
COPY INTO bronze.raw_sales
FROM 'abfss://landing@storage.dfs.core.windows.net/sales/'
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('mergeSchema' = 'true');

Key facts:

  • Only processes files not already loaded (tracks processed files internally)
  • Idempotent — safe to run multiple times without duplicates
  • Appends to an existing table (doesn't overwrite)
  • Supports CSV, JSON, Parquet, Avro, ORC, and text files
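The file-tracking behaviour can be sketched in a few lines of pure Python: keep a per-table set of already-loaded file names and skip anything seen before. File names and row values here are hypothetical; COPY INTO's real bookkeeping lives inside the Delta transaction log, not in user code.

```python
# Sketch of COPY INTO's idempotency: a per-table record of loaded
# files means re-running the load only picks up new files.

def copy_into(table_rows, loaded_files, files):
    """Append rows from each file not seen before; skip the rest."""
    for name, rows in files.items():
        if name in loaded_files:
            continue                    # already loaded: skip, no duplicates
        table_rows.extend(rows)
        loaded_files.add(name)
    return table_rows

table, seen = [], set()
day1 = {"sales_01.csv": [1, 2], "sales_02.csv": [3]}
copy_into(table, seen, day1)
copy_into(table, seen, day1)            # re-run: nothing new, a no-op
print(table)                            # [1, 2, 3]

day2 = {**day1, "sales_03.csv": [4]}    # one new file lands overnight
copy_into(table, seen, day2)
print(table)                            # [1, 2, 3, 4]
```

This is exactly why the statement can sit in a daily scheduled job with no dedup logic around it.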
| Feature | CTAS | CREATE OR REPLACE | COPY INTO |
| --- | --- | --- | --- |
| Creates table? | Yes (new) | Yes (replaces) | No (existing table) |
| Operation | One-time creation | Full refresh | Incremental append |
| Idempotent? | No (fails if exists) | Yes | Yes |
| Tracks loaded files? | No | No | Yes |
| Best for | Initial table creation | Full-refresh aggregates | Incremental file loading |
💡 COPY INTO vs Auto Loader

Both handle incremental file loading, but differently:

| Feature | COPY INTO | Auto Loader |
| --- | --- | --- |
| File tracking | Internal state per table | Checkpoint directory |
| Streaming | No (batch only) | Yes (continuous or triggered) |
| File discovery | Lists directory each run | Event-based notifications |
| Scale | Good for thousands of files | Better for millions of files |
| Schema evolution | mergeSchema option | Built-in schema inference |

Exam pattern: Small-to-medium file volumes → COPY INTO. Large-scale, continuous file ingestion → Auto Loader.

Change Data Capture (CDC)

CDC captures row-level changes (inserts, updates, deletes) from a source system:

Enabling the Delta change data feed

-- Enable change data feed on a table
ALTER TABLE silver.customers
SET TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true');

-- Read changes since version 5
SELECT * FROM table_changes('silver.customers', 5);

-- Read changes between timestamps
SELECT * FROM table_changes('silver.customers', '2026-04-01', '2026-04-21');

The change data feed adds metadata columns to each change row:

| Column | Values | Meaning |
| --- | --- | --- |
| _change_type | insert, update_preimage, update_postimage, delete | What happened |
| _commit_version | Version number | Which transaction |
| _commit_timestamp | Timestamp | When it happened |
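How a consumer acts on these columns can be shown with a plain-Python replay over a keyed target: inserts and update_postimage rows upsert, deletes remove, and update_preimage rows (the old values) need no action. The rows below are hypothetical and a dict stands in for the target table; this is not the Databricks API.

```python
# Replaying a change data feed into a keyed target table, driven by
# the _change_type values the feed attaches to each row.

def apply_feed(target, changes):
    """target: dict key -> row; changes: list of (key, change_type, row)."""
    for key, change_type, row in changes:
        if change_type in ("insert", "update_postimage"):
            target[key] = row           # upsert the new values
        elif change_type == "delete":
            target.pop(key, None)       # remove the row if present
        # "update_preimage" carries the old values: no action needed
    return target

feed = [
    (1, "insert", {"amount": 100}),
    (2, "insert", {"amount": 50}),
    (1, "update_preimage", {"amount": 100}),
    (1, "update_postimage", {"amount": 120}),
    (2, "delete", None),
]
print(apply_feed({}, feed))   # {1: {'amount': 120}}
```

The streaming MERGE shown in the next section is the production version of this same replay logic.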

Processing CDC feeds

Tomás processes NovaPay's transaction CDC feed to keep the fraud detection table in sync:

# Read CDC changes as a stream
changes = (spark.readStream
    .format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 0)
    .table("bronze.transactions"))

# Apply each micro-batch of changes to the target table with a MERGE.
# update_preimage rows match no clause and are ignored; deletes are
# propagated explicitly so the target stays in sync with the source.
# Note: UPDATE SET * / INSERT * require the source and target schemas
# to match, so exclude the CDC metadata columns from the row payload.
def apply_changes(batch_df, batch_id):
    batch_df.createOrReplaceTempView("changes")
    batch_df.sparkSession.sql("""
        MERGE INTO silver.fraud_transactions AS target
        USING changes AS source
        ON target.txn_id = source.txn_id
        WHEN MATCHED AND source._change_type = 'update_postimage'
          THEN UPDATE SET *
        WHEN MATCHED AND source._change_type = 'delete'
          THEN DELETE
        WHEN NOT MATCHED AND source._change_type = 'insert'
          THEN INSERT *
    """)

changes.writeStream.foreachBatch(apply_changes).start()
Question

What is the difference between CTAS, CREATE OR REPLACE TABLE, and COPY INTO?


Answer

CTAS: creates a new table from a query (one-time). CREATE OR REPLACE: replaces table contents entirely (idempotent full refresh). COPY INTO: appends new files to existing table, skipping already-loaded files (idempotent incremental).


Question

How does COPY INTO avoid loading duplicate files?


Answer

COPY INTO internally tracks which files have already been loaded. Running it multiple times is safe — it only processes new files that haven't been seen before.


Question

What metadata columns does the Delta change data feed add?


Answer

_change_type (insert, update_preimage, update_postimage, delete), _commit_version (transaction version), _commit_timestamp (when the change occurred).



Knowledge check

Ravi rebuilds DataPulse's daily revenue summary every morning from scratch. If he runs the job twice by accident, the result should be identical. Which SQL method should he use?


Freshmart suppliers upload new CSV files to ADLS daily. Mei Lin needs to load these files into a bronze table. Files should only be loaded once, even if the job runs multiple times. Which method should she use?


Next up: Streaming Ingestion: Structured Streaming & Event Hubs — real-time data processing with Spark and Azure Event Hubs.

← Previous

Ingesting Data: Lakeflow Connect & Notebooks

Next →

Streaming Ingestion: Structured Streaming & Event Hubs


© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.