Azure Databricks: Your Lakehouse Platform
Azure Databricks is where data engineering meets the lakehouse. Understand the architecture, workspace components, and how everything connects — from clusters to Unity Catalog.
What is Azure Databricks?
Think of Azure Databricks as a massive data kitchen.
Raw ingredients (your data) arrive from different suppliers (databases, files, streams). The kitchen has different stations — one for prep (cleaning data), one for cooking (transforming), and one for plating (serving to dashboards and reports).
The kitchen runs on Apache Spark — an engine that can process huge amounts of data by splitting the work across many cooks (machines) at once. Azure Databricks wraps Spark in a managed service so you don’t have to build and maintain the kitchen yourself.
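The "many cooks" idea is plain divide-and-combine: split the data into partitions, let each worker process its slice, then merge the partial results. A toy sketch in ordinary Python (no Spark involved; the function names and partition count are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def process_partition(rows):
    # Each "cook" (worker) handles one slice of the data
    return sum(rows)

def parallel_sum(data, n_partitions=4):
    # Split the work into roughly equal partitions...
    partitions = [data[i::n_partitions] for i in range(n_partitions)]
    # ...process them in parallel, then combine the partial results
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return sum(pool.map(process_partition, partitions))

print(parallel_sum(list(range(1, 101))))  # 5050
```

Spark does the same thing at cluster scale: partitions live on different machines, and the engine handles scheduling, shuffling, and fault recovery for you.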
Everything is stored in a lakehouse — a single storage layer that combines the flexibility of a data lake (store anything) with the structure of a data warehouse (query it like a database).
The lakehouse architecture
The lakehouse merges two previously separate worlds:
| Traditional Approach | Problem | Lakehouse Solution |
|---|---|---|
| Data lake (store everything as files) | No ACID transactions, no schema enforcement, “data swamp” risk | Delta Lake format adds transactions, schema, time travel |
| Data warehouse (structured, query-optimised) | Expensive, rigid schema, can’t handle unstructured data | Lakehouse queries files directly with warehouse-grade performance |
| Two copies of data (lake + warehouse) | Data duplication, sync issues, higher cost | Single copy of data serves both workloads |
Key exam concept: The lakehouse is not just a buzzword. It’s the architectural foundation of everything in DP-750. Every question assumes you’re working in a lakehouse built on Delta Lake, governed by Unity Catalog.
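The "schema enforcement" the table mentions can be illustrated with a toy write-time check (plain Python, not the actual Delta Lake implementation; the column names are invented):

```python
def validate_row(row, schema):
    """Reject a write whose columns or types don't match the declared schema."""
    if set(row) != set(schema):
        raise ValueError(f"Columns {sorted(row)} don't match schema {sorted(schema)}")
    for col, expected_type in schema.items():
        if not isinstance(row[col], expected_type):
            raise TypeError(f"Column '{col}' expects {expected_type.__name__}")
    return row

schema = {"sale_id": int, "amount": float}
validate_row({"sale_id": 1, "amount": 9.99}, schema)  # accepted
# validate_row({"sale_id": "1", "amount": 9.99}, schema)  # would raise TypeError
```

A raw data lake has no such gate: anything can be written, which is how "data swamps" form. Delta Lake runs an equivalent check on every write.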
Workspace anatomy
When Dr. Sarah Okafor opens her Azure Databricks workspace at Athena Group, she sees:
- Workspace — the top-level container. One workspace per team or environment (dev, staging, prod). Created as an Azure resource in a resource group.
- Notebooks — interactive documents mixing code (Python, SQL, Scala, R) with markdown. Where most data engineering work happens.
- Clusters / Compute — the machines that run your code. You’ll learn to choose and configure these in the next two modules.
- Jobs — scheduled or triggered runs of notebooks or pipelines. The automation layer.
- Pipelines — Lakeflow Spark Declarative Pipelines for building production data flows.
- SQL Warehouses — serverless or classic endpoints for running SQL queries against your lakehouse tables.
- Catalog — Unity Catalog’s three-level namespace (catalog > schema > table) for organising and governing data.
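The three-level namespace behaves like name resolution in a SQL session: a partially qualified name is filled in from the current catalog and schema. A minimal sketch of that rule (plain Python; the catalog, schema, and table names are invented):

```python
def resolve_table(name, current_catalog="main", current_schema="default"):
    """Resolve a table reference to (catalog, schema, table),
    filling in the session's current catalog and schema as needed."""
    parts = name.split(".")
    if len(parts) == 3:
        return tuple(parts)
    if len(parts) == 2:
        return (current_catalog, parts[0], parts[1])
    if len(parts) == 1:
        return (current_catalog, current_schema, parts[0])
    raise ValueError(f"Invalid table reference: {name!r}")

print(resolve_table("orders"))             # ('main', 'default', 'orders')
print(resolve_table("sales.orders"))       # ('main', 'sales', 'orders')
print(resolve_table("prod.sales.orders"))  # ('prod', 'sales', 'orders')
```

This is why the same query can hit different tables in dev and prod: switching the current catalog changes how short names resolve.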
How Azure Databricks connects to Azure
Azure Databricks doesn’t exist in isolation. It integrates with:
| Azure Service | Integration |
|---|---|
| Azure Data Lake Storage Gen2 | Primary storage for lakehouse data (Delta tables live here) |
| Azure Key Vault | Securely store and retrieve secrets (connection strings, API keys) |
| Microsoft Entra ID | Identity and access management — SSO, service principals, managed identities |
| Azure Data Factory | Orchestrate data pipelines that include Databricks notebooks |
| Azure Monitor | Stream logs and metrics for cluster/job monitoring |
| Azure Event Hubs | Ingest real-time streaming data |
Exam tip: Questions often test whether you know which Azure service to use for a given task. Databricks handles compute and transformation — storage, identity, and monitoring are Azure’s job.
The control plane vs. data plane
Azure Databricks has two planes:
- Control plane (managed by Databricks) — the workspace UI, job scheduler, notebook service, cluster manager. Runs in the Databricks cloud account.
- Data plane (runs in your Azure subscription) — the actual compute clusters, DBFS, and network resources. Your data stays in your subscription.
This means Ravi’s data at DataPulse Analytics never leaves their Azure environment — only the control signals (start cluster, run job) go through the Databricks control plane.
Exam tip: Security boundaries
The exam may ask about where data resides. Key facts:
- Your data stays in your Azure subscription (data plane)
- Cluster VMs run in your VNet (or a Databricks-managed VNet)
- The workspace UI and job orchestration run in the Databricks control plane
- Unity Catalog metastore can be in your account or Databricks-managed
- Serverless compute runs in the Databricks account (not your subscription) — this is a key architectural difference
Delta Lake: the storage format
Every table in your lakehouse uses Delta Lake format by default. Delta Lake is:
- Parquet files underneath (columnar, compressed, efficient)
- Plus a transaction log (`_delta_log/`) that tracks every change
- Enables ACID transactions — reads never see partial writes
- Supports time travel — query data as it was at any previous version
- Enforces schema — you can’t accidentally write wrong column types
```sql
-- Time travel: see what the table looked like 3 versions ago
SELECT * FROM sales_data VERSION AS OF 3;

-- Time travel: see what it looked like at a specific timestamp
SELECT * FROM sales_data TIMESTAMP AS OF '2026-03-01';
```
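Time travel works because every change appends an entry to the transaction log, so any past version can be reconstructed by replaying the log up to that point. A toy model of the idea (plain Python, much simpler than the real `_delta_log/` format):

```python
class ToyDeltaTable:
    """Toy versioned table: one log entry per commit, one version per entry."""
    def __init__(self):
        self.log = []  # ordered commits

    def commit(self, rows):
        # A real Delta commit writes Parquet data files plus a JSON log entry;
        # here each commit just records the rows added at that version.
        self.log.append(list(rows))
        return len(self.log) - 1  # the new version number

    def read(self, version=None):
        # "Time travel": replay the log only up to the requested version
        upto = len(self.log) if version is None else version + 1
        return [row for commit in self.log[:upto] for row in commit]

t = ToyDeltaTable()
t.commit([{"sale": 1}])   # version 0
t.commit([{"sale": 2}])   # version 1
print(t.read(version=0))  # [{'sale': 1}]
print(t.read())           # latest: both rows
```

`VERSION AS OF 3` in the SQL above is the same operation: ignore every log entry after version 3.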
The medallion architecture
Most Databricks lakehouses follow a medallion pattern for organising data:
| Layer | Purpose | Data Quality |
|---|---|---|
| Bronze | Raw ingestion — data as-is from sources | Low (duplicates, nulls, messy formats) |
| Silver | Cleaned and conformed — validated, deduplicated, typed | Medium (usable for analysis) |
| Gold | Business-level aggregates — KPIs, dimensions, facts | High (ready for dashboards and reports) |
When Mei Lin builds Freshmart’s inventory pipeline, raw point-of-sale data lands in bronze, gets cleaned and joined with product master data in silver, and becomes daily stock reports in gold.
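A pipeline like Mei Lin's can be sketched end to end in a few lines. This is plain Python standing in for Spark transformations, with invented field names, just to make the three layers concrete:

```python
from collections import defaultdict

# Bronze: raw point-of-sale events, as-is (duplicates and bad records included)
bronze = [
    {"store": "A", "sku": "milk", "qty": 2},
    {"store": "A", "sku": "milk", "qty": 2},    # duplicate event
    {"store": "B", "sku": "milk", "qty": None},  # invalid record
    {"store": "B", "sku": "eggs", "qty": 5},
]

# Silver: validated and deduplicated
seen = set()
silver = []
for row in bronze:
    key = (row["store"], row["sku"], row["qty"])
    if row["qty"] is not None and key not in seen:
        seen.add(key)
        silver.append(row)

# Gold: business-level aggregate (total quantity per SKU across stores)
gold = defaultdict(int)
for row in silver:
    gold[row["sku"]] += row["qty"]

print(dict(gold))  # {'milk': 2, 'eggs': 5}
```

Each layer is a separate Delta table in practice, so downstream consumers can pick the quality level they need: data scientists might explore silver, while dashboards read only gold.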
🎬 Video coming soon
Knowledge check
Tomás needs to process real-time fraud alerts at NovaPay. The data arrives from Azure Event Hubs. Where does the compute that processes this data physically run?
Dr. Sarah Okafor is designing Athena Group's lakehouse. She wants ACID transactions, schema enforcement, and the ability to query historical versions of data. Which storage format should she use?
Ravi is explaining Azure Databricks to a new team member at DataPulse Analytics. Which statement about the lakehouse architecture is correct?
Next up: Choosing the Right Compute — job compute, serverless, warehouses, and when to use each.