Data Stores & Tools: Make the Right Choice
Choose between lakehouses, warehouses, and KQL databases. Pick the right transformation tool — Dataflows Gen2, notebooks, KQL, or T-SQL — for every scenario.
How do you choose?
Think of three types of restaurant kitchens.
A lakehouse is a large open kitchen — you can cook anything (structured, semi-structured, unstructured), use any tools (PySpark, SQL, Python), and store ingredients in any container. Flexible but requires a skilled chef.
A warehouse is a precision kitchen — everything is structured, SQL-based, optimised for fast queries and BI reporting. Less flexible but blazingly fast for analytics.
A KQL database (Eventhouse) is a sushi bar — specialised for fast, real-time queries on streaming data. It excels at time-series analytics and log exploration.
The exam tests whether you pick the right kitchen for the meal you’re cooking.
Choosing a data store
| Feature | Lakehouse | Warehouse | KQL Database (Eventhouse) |
|---|---|---|---|
| Data format | Delta Lake tables + files (CSV, Parquet, JSON, images) | Relational tables (T-SQL DDL) | KQL tables (optimised columnar) |
| Query language | Spark SQL, PySpark, T-SQL (via SQL endpoint — read-only) | T-SQL (full DML — INSERT, UPDATE, DELETE) | KQL (Kusto Query Language) |
| Write method | Spark notebooks, pipelines, Dataflows Gen2 | T-SQL statements, pipelines, Dataflows Gen2 | Eventstreams, KQL ingestion, pipelines |
| Best for | Data engineering — large-scale ETL, ML, semi-structured data | BI analytics — structured data, complex SQL queries, reporting | Real-time analytics — streaming, time-series, log exploration |
| Unstructured data? | Yes — Files/ section stores any file type | No — structured tables only | No — structured time-series data |
| SQL endpoint? | Yes (read-only — SELECT, views) | Full T-SQL (read + write) | No — KQL only |
| Power BI integration | Via SQL endpoint or Direct Lake | Direct SQL connection | Via KQL querysets or Real-Time dashboards |
Decision patterns (exam-tested)
| Scenario | Best Store | Why |
|---|---|---|
| Store 500M rows of sales data for PySpark transformations and ML training | Lakehouse | Need Spark for ML, large-scale transforms |
| Serve a Power BI dashboard with complex SQL aggregations and stored procedures | Warehouse | T-SQL DML, stored procs, optimised for BI |
| Ingest and query 1M events/second from IoT sensors | KQL Database | Real-time ingestion, time-series KQL |
| Store raw JSON files from an API alongside structured Delta tables | Lakehouse | Files/ section handles unstructured; Tables/ handles structured |
| Build a dimensional model with fact and dimension tables for analysts | Warehouse (or Lakehouse with SQL endpoint) | Both work; warehouse preferred when analysts need full T-SQL |
| Real-time clickstream analysis with 10-second query latency | KQL Database | Sub-second KQL queries on streaming data |
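The decision patterns above can be condensed into a simple rule of thumb: real-time/streaming needs win first, then Spark/ML/unstructured needs, and structured BI defaults to the warehouse. A minimal sketch (a hypothetical study aid, not a Fabric API):

```python
# Hypothetical helper encoding the decision patterns above as ordered rules.
# The keyword sets are assumptions distilled from the table, not a Fabric feature.

def choose_store(needs: set[str]) -> str:
    """Pick a Fabric data store from a set of requirement keywords."""
    # Streaming/time-series requirements take priority (IoT, clickstream rows)
    if needs & {"streaming", "time-series", "real-time"}:
        return "KQL Database"
    # Spark-scale transforms, ML training, or unstructured files
    if needs & {"unstructured", "files", "pyspark", "ml"}:
        return "Lakehouse"
    # Structured data, full T-SQL, BI reporting
    return "Warehouse"

print(choose_store({"pyspark", "ml"}))            # Lakehouse
print(choose_store({"streaming"}))                # KQL Database
print(choose_store({"stored-procedures", "bi"}))  # Warehouse
```

The ordering matters: a streaming workload routes to the KQL database even when it is also large-scale, mirroring the IoT row in the table.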
Scenario: Anika's dual-store design
ShopStream uses both a lakehouse and a warehouse:
- Lakehouse: Raw order JSON files land in Files/. PySpark notebooks parse, validate, and transform them into Delta tables in Tables/. ML models train on these tables.
- Warehouse: The cleaned, modelled data (star schema) is loaded into the warehouse for the BI team. They write stored procedures for complex revenue calculations and build Power BI reports via DirectQuery or Import mode.
The lakehouse is the engineering workbench. The warehouse is the analytics serving layer.
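The lakehouse half of Anika's design boils down to parse, validate, and shape each raw JSON order into a clean row. In Fabric this runs in a PySpark notebook against Files/; the sketch below uses stdlib `json` for illustration, and the field names (`order_id`, `amount`, `currency`) are assumed, not from ShopStream's actual schema:

```python
# Schematic of the lakehouse parse/validate step: raw JSON order -> clean row.
# In Fabric this would be a PySpark notebook writing Delta tables to Tables/;
# plain-Python json stands in here, and the order fields are assumptions.

import json
from typing import Optional

def parse_order(raw: str) -> Optional[dict]:
    """Return a cleaned order row, or None if validation fails."""
    try:
        order = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed payloads stay in the raw zone
    if not order.get("order_id") or order.get("amount", 0) <= 0:
        return None  # reject incomplete or zero-value orders
    return {
        "order_id": str(order["order_id"]),
        "amount": float(order["amount"]),
        "currency": order.get("currency", "USD"),
    }

raw = '{"order_id": "A-1001", "amount": 49.99, "currency": "EUR"}'
print(parse_order(raw))
```

Only rows that survive validation land in Tables/ as Delta tables; the warehouse then consumes that curated output, not the raw files.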
Choosing a transformation tool
| Factor | Dataflows Gen2 | Notebooks | KQL | T-SQL |
|---|---|---|---|---|
| Interface | Visual (Power Query) | Code (PySpark, SQL, R) | Code (KQL) | Code (SQL) |
| Skill required | Low — drag-and-drop | High — PySpark/Python | Medium — KQL syntax | Medium — SQL |
| Scale | Small to medium | Large (distributed Spark) | Large (columnar engine) | Large (MPP engine) |
| Best for | Simple cleaning from 150+ connectors | Complex transforms, ML, custom logic | Time-series transforms, aggregations | Relational transforms, stored procs |
| Output to | Lakehouse, warehouse, KQL DB | Lakehouse (Delta tables) | KQL Database | Warehouse tables |
| Scheduling | Built-in refresh | Via pipeline notebook activity | Via pipeline or Eventhouse | Via pipeline stored proc activity |
Decision patterns
| Scenario | Best Tool | Why |
|---|---|---|
| Extract from Salesforce, clean, load to lakehouse (no code) | Dataflows Gen2 | 150+ connectors, visual, no code |
| Join 3 Delta tables (1B rows), calculate running averages, write to warehouse | Notebook (PySpark) | Scale + complex windowed calculations |
| Calculate hourly aggregates on streaming sensor data | KQL | Time-series optimised, bin() function |
| Run a stored procedure to update dimension tables in a warehouse | T-SQL | Direct SQL on warehouse, existing proc logic |
| Load CSVs from SharePoint, standardise dates, filter blanks | Dataflows Gen2 | Simple, visual, perfect for SharePoint connector |
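The KQL row above mentions `bin()`, which truncates each timestamp to a fixed bucket so aggregates can be grouped per hour. A minimal stdlib-Python sketch of the same idea (illustrative only; in Fabric you would write the KQL `summarize avg(...) by bin(timestamp, 1h)` directly):

```python
# Mimics KQL's bin(timestamp, 1h) + avg() on event data, in plain Python.
# Events are assumed to be (epoch_seconds, value) pairs for illustration.

from collections import defaultdict
from statistics import mean

def hourly_average(events):
    """Group events into hourly buckets and average each bucket's values."""
    buckets = defaultdict(list)
    for ts, value in events:
        hour = ts - (ts % 3600)  # truncate to the hour, like bin(ts, 1h)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}

sample = [(0, 10.0), (1800, 20.0), (3600, 30.0)]
print(list(hourly_average(sample).values()))  # [15.0, 30.0]
```

The KQL engine does this binning natively over streaming ingestion, which is why it wins the time-series scenarios in the table.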
Exam tip: 'Choose between' questions
When the exam asks “which tool should you use for data transformation,” look for these clues:
- “No coding” or “visual” → Dataflows Gen2
- “Large scale” or “ML” or “PySpark” → Notebook
- “Time-series” or “streaming” or “KQL” → KQL
- “Stored procedure” or “T-SQL” or “warehouse” → T-SQL
- “150+ connectors” or “Power Query” → Dataflows Gen2
If multiple tools could work, choose the simplest one that meets the requirement.
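The clue-matching above is mechanical enough to express as a tiny keyword matcher. A hypothetical study aid (not a Fabric feature; "ML" is spelled out as "machine learning" to avoid substring false matches):

```python
# Hypothetical exam-prep helper: map question wording to a transformation tool
# using the clue bullets above. Simple substring matching, first match wins.

CLUES = {
    "Dataflows Gen2": ["no coding", "visual", "150+ connectors", "power query"],
    "Notebook":       ["large scale", "machine learning", "pyspark"],
    "KQL":            ["time-series", "streaming", "kql"],
    "T-SQL":          ["stored procedure", "t-sql", "warehouse"],
}

def suggest_tool(question: str) -> str:
    q = question.lower()
    for tool, clues in CLUES.items():
        if any(clue in q for clue in clues):
            return tool
    # No strong signal: default to the simplest tool that could work
    return "Dataflows Gen2"

print(suggest_tool("Join 1B rows with PySpark and write results"))  # Notebook
```

The fallback encodes the final rule: when nothing forces a heavier tool, pick the simplest one.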
When to combine stores and tools
Most real-world solutions use multiple stores and tools together:
| Pattern | Flow |
|---|---|
| Lakehouse + Warehouse | Raw data → Lakehouse (PySpark transforms) → Warehouse (star schema for BI) |
| Lakehouse + KQL Database | Batch data → Lakehouse; Streaming data → KQL Database; both feed dashboards |
| Dataflow + Notebook | Dataflow Gen2 (extract/clean from source) → Notebook (complex transforms at scale) |
Quick check
ShopStream needs to store raw JSON order files from payment gateways alongside structured Delta tables for transformed data. Which Fabric data store should Anika use?
A team of business analysts (no Python skills) need to pull data from Salesforce, rename columns, filter out blank emails, and load into a lakehouse. Which transformation tool is most appropriate?
Zoe needs to calculate the average video watch time per hour from a stream of 500,000 playback events per minute. The results must be queryable within 10 seconds of the events occurring. Which data store and query language should she use?
Next up: OneLake Shortcuts: Data Without Duplication — access external data as if it were local, without copying a single byte.