Azure Databricks: Your Lakehouse Platform
Azure Databricks is where data engineering meets the lakehouse. Understand the architecture, workspace components, and how everything connects — from clusters to Unity Catalog.
What is Azure Databricks?
Think of Azure Databricks as a massive data kitchen.
Raw ingredients (your data) arrive from different suppliers (databases, files, streams). The kitchen has different stations — one for prep (cleaning data), one for cooking (transforming), and one for plating (serving to dashboards and reports).
The kitchen runs on Apache Spark — an engine that can process huge amounts of data by splitting the work across many cooks (machines) at once. Azure Databricks wraps Spark in a managed service so you don’t have to build and maintain the kitchen yourself.
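The "many cooks" idea is plain divide-and-combine: split the data into partitions, let each worker process its slice, then merge the partial results. A toy sketch in ordinary Python (no Spark involved; the function names and partition count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    # Each "cook" (worker) handles one slice of the data
    return sum(rows)

def parallel_sum(data, n_partitions=4):
    # Split the work into roughly equal partitions...
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    # ...process them in parallel, then combine the partial results
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return sum(pool.map(process_partition, partitions))

print(parallel_sum(list(range(1, 101))))  # 5050
```

Spark does the same thing at cluster scale: partitions live on different machines, and the engine handles scheduling, shuffling, and fault recovery for you.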
Everything is stored in a lakehouse — a single storage layer that combines the flexibility of a data lake (store anything) with the structure of a data warehouse (query it like a database).
The lakehouse architecture
The lakehouse merges two previously separate worlds:
| Traditional Approach | Problem | Lakehouse Solution |
|---|---|---|
| Data lake (store everything as files) | No ACID transactions, no schema enforcement, “data swamp” risk | Delta Lake format adds transactions, schema, time travel |
| Data warehouse (structured, query-optimised) | Expensive, rigid schema, can’t handle unstructured data | Lakehouse queries files directly with warehouse-grade performance |
| Two copies of data (lake + warehouse) | Data duplication, sync issues, higher cost | Single copy of data serves both workloads |
Key exam concept: The lakehouse is not just a buzzword. It’s the architectural foundation of everything in DP-750. Every question assumes you’re working in a lakehouse built on Delta Lake, governed by Unity Catalog.
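The "schema enforcement" the table mentions can be illustrated with a toy write-time check (plain Python, not the actual Delta Lake implementation; the column names are invented):

```python
def validate_row(row, schema):
    """Reject a write whose columns or types don't match the declared schema."""
    if set(row) != set(schema):
        raise ValueError(f"Columns {sorted(row)} don't match schema {sorted(schema)}")
    for col, expected_type in schema.items():
        if not isinstance(row[col], expected_type):
            raise TypeError(f"Column '{col}' expects {expected_type.__name__}")
    return row

schema = {"sale_id": int, "amount": float}
validate_row({"sale_id": 1, "amount": 9.99}, schema)  # accepted
# validate_row({"sale_id": "1", "amount": 9.99}, schema)  # would raise TypeError
```

A raw data lake has no such gate: anything can be written, which is how "data swamps" form. Delta Lake runs an equivalent check on every write.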
Workspace anatomy
When Dr. Sarah Okafor opens her Azure Databricks workspace at Athena Group, she sees:
- Workspace — the top-level container. One workspace per team or environment (dev, staging, prod). Created as an Azure resource in a resource group.
- Notebooks — interactive documents mixing code (Python, SQL, Scala, R) with markdown. Where most data engineering work happens.
- Clusters / Compute — the machines that run your code. You’ll learn to choose and configure these in the next two modules.
- Jobs — scheduled or triggered runs of notebooks or pipelines. The automation layer.
- Pipelines — Lakeflow Spark Declarative Pipelines for building production data flows.
- SQL Warehouses — serverless or classic endpoints for running SQL queries against your lakehouse tables.
- Catalog — Unity Catalog’s three-level namespace (catalog > schema > table) for organising and governing data.
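The three-level namespace behaves like name resolution in a SQL session: a partially qualified name is filled in from the current catalog and schema. A minimal sketch of that rule (plain Python; the catalog, schema, and table names are invented):

```python
def resolve_table(name, current_catalog="main", current_schema="default"):
    """Resolve a table reference to (catalog, schema, table),
    filling in the session's current catalog and schema as needed."""
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (current_catalog, current_schema, parts[0])
    raise ValueError(f"Invalid table reference: {name!r}")

print(resolve_table("orders"))             # ('main', 'default', 'orders')
print(resolve_table("sales.orders"))       # ('main', 'sales', 'orders')
print(resolve_table("prod.sales.orders"))  # ('prod', 'sales', 'orders')
```

This is why the same query can hit different tables in dev and prod: switching the current catalog changes how short names resolve.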
How Azure Databricks connects to Azure
Azure Databricks doesn’t exist in isolation. It integrates with:
| Azure Service | Integration |
|---|---|
| Azure Data Lake Storage Gen2 | Primary storage for lakehouse data (Delta tables live here) |
| Azure Key Vault | Securely store and retrieve secrets (connection strings, API keys) |
| Microsoft Entra ID | Identity and access management — SSO, service principals, managed identities |
| Azure Data Factory | Orchestrate data pipelines that include Databricks notebooks |
| Azure Monitor | Stream logs and metrics for cluster/job monitoring |
| Azure Event Hubs | Ingest real-time streaming data |
Exam tip: Questions often test whether you know which Azure service to use for a given task. Databricks handles compute and transformation — storage, identity, and monitoring are Azure’s job.
The control plane vs. data plane
Azure Databricks has two planes:
- Control plane (managed by Databricks) — the workspace UI, job scheduler, notebook service, cluster manager. Runs in the Databricks cloud account.
- Data plane (runs in your Azure subscription) — the actual compute clusters, DBFS, and network resources. Your data stays in your subscription.
This means Ravi’s data at DataPulse Analytics never leaves their Azure environment — only the control signals (start cluster, run job) go through the Databricks control plane.
Exam tip: Security boundaries
The exam may ask about where data resides. Key facts:
- Your data stays in your Azure subscription (data plane)
- Cluster VMs run in your VNet (or a Databricks-managed VNet)
- The workspace UI and job orchestration run in the Databricks control plane
- Unity Catalog metastore can be in your account or Databricks-managed
- Serverless compute runs in the Databricks account (not your subscription) — this is a key architectural difference
Delta Lake: the storage format
Every table in your lakehouse uses Delta Lake format by default. Delta Lake is:
- Parquet files underneath (columnar, compressed, efficient)
- Plus a transaction log (`_delta_log/`) that tracks every change
- Enables ACID transactions — reads never see partial writes
- Supports time travel — query data as it was at any previous version
- Enforces schema — you can’t accidentally write wrong column types
```sql
-- Time travel: see what the table looked like 3 versions ago
SELECT * FROM sales_data VERSION AS OF 3;

-- Time travel: see what it looked like at a specific timestamp
SELECT * FROM sales_data TIMESTAMP AS OF '2026-03-01';
```
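Time travel works because every change appends an entry to the transaction log, so any past version can be reconstructed by replaying the log up to that point. A toy model of the idea (plain Python, much simpler than the real `_delta_log/` format):

```python
class ToyDeltaTable:
    """Toy versioned table: one log entry per commit, one version per entry."""
    def __init__(self):
        self.log = []  # ordered commits

    def commit(self, rows):
        # A real Delta commit writes Parquet data files plus a JSON log entry;
        # here each commit just records the rows added at that version.
        self.log.append(list(rows))
        return len(self.log) - 1  # the new version number

    def read(self, version=None):
        # "Time travel": replay the log only up to the requested version
        upto = len(self.log) if version is None else version + 1
        return [row for commit in self.log[:upto] for row in commit]

t = ToyDeltaTable()
t.commit([{"sale": 1}])   # version 0
t.commit([{"sale": 2}])   # version 1
print(t.read(version=0))  # [{'sale': 1}]
print(t.read())           # latest: both rows
```

`VERSION AS OF 3` in the SQL above is the same operation: ignore every log entry after version 3.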
The medallion architecture
Most Databricks lakehouses follow a medallion pattern for organising data:
| Layer | Purpose | Data Quality |
|---|---|---|
| Bronze | Raw ingestion — data as-is from sources | Low (duplicates, nulls, messy formats) |
| Silver | Cleaned and conformed — validated, deduplicated, typed | Medium (usable for analysis) |
| Gold | Business-level aggregates — KPIs, dimensions, facts | High (ready for dashboards and reports) |
When Mei Lin builds Freshmart’s inventory pipeline, raw point-of-sale data lands in bronze, gets cleaned and joined with product master data in silver, and becomes daily stock reports in gold.
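A pipeline like Mei Lin's can be sketched end to end in a few lines. This is plain Python standing in for Spark transformations, with invented field names, just to make the three layers concrete:

```python
from collections import defaultdict

# Bronze: raw point-of-sale events, as-is (duplicates and bad records included)
bronze = [
    {"store": "A", "sku": "milk", "qty": 2},
    {"store": "A", "sku": "milk", "qty": 2},    # duplicate event
    {"store": "B", "sku": "milk", "qty": None},  # invalid record
    {"store": "B", "sku": "eggs", "qty": 5},
]

# Silver: validated and deduplicated
seen = set()
silver = []
for row in bronze:
    key = (row["store"], row["sku"], row["qty"])
    if row["qty"] is not None and key not in seen:
        seen.add(key)
        silver.append(row)

# Gold: business-level aggregate (total quantity per SKU across stores)
gold = defaultdict(int)
for row in silver:
    gold[row["sku"]] += row["qty"]

print(dict(gold))  # {'milk': 2, 'eggs': 5}
```

Each layer is a separate Delta table in practice, so downstream consumers can pick the quality level they need: data scientists might explore silver, while dashboards read only gold.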
🎬 Video coming soon
Knowledge check
Tomás needs to process real-time fraud alerts at NovaPay. The data arrives from Azure Event Hubs. Where does the compute that processes this data physically run?
Dr. Sarah Okafor is designing Athena Group's lakehouse. She wants ACID transactions, schema enforcement, and the ability to query historical versions of data. Which storage format should she use?
Ravi is explaining Azure Databricks to a new team member at DataPulse Analytics. Which statement about the lakehouse architecture is correct?
Next up: Choosing the Right Compute — job compute, serverless, warehouses, and when to use each.