Data Stores & Tools: Make the Right Choice
Choose between lakehouses, warehouses, and KQL databases. Pick the right transformation tool — Dataflows Gen2, notebooks, KQL, or T-SQL — for every scenario.
How do you choose?
Think of three types of restaurant kitchens.
A lakehouse is a large open kitchen — you can cook anything (structured, semi-structured, unstructured), use any tools (PySpark, SQL, Python), and store ingredients in any container. Flexible but requires a skilled chef.
A warehouse is a precision kitchen — everything is structured, SQL-based, optimised for fast queries and BI reporting. Less flexible but blazingly fast for analytics.
A KQL database (Eventhouse) is a sushi bar — specialised for fast, real-time queries on streaming data. It excels at time-series analytics and log exploration.
The exam tests whether you pick the right kitchen for the meal you’re cooking.
Choosing a data store
| Feature | Lakehouse | Warehouse | KQL Database (Eventhouse) |
|---|---|---|---|
| Data format | Delta Lake tables + files (CSV, Parquet, JSON, images) | Relational tables (T-SQL DDL) | KQL tables (optimised columnar) |
| Query language | Spark SQL, PySpark, T-SQL (via SQL endpoint — read-only) | T-SQL (full DML — INSERT, UPDATE, DELETE) | KQL (Kusto Query Language) |
| Write method | Spark notebooks, pipelines, Dataflows Gen2 | T-SQL statements, pipelines, Dataflows Gen2 | Eventstreams, KQL ingestion, pipelines |
| Best for | Data engineering — large-scale ETL, ML, semi-structured data | BI analytics — structured data, complex SQL queries, reporting | Real-time analytics — streaming, time-series, log exploration |
| Unstructured data? | Yes — Files/ section stores any file type | No — structured tables only | No — structured time-series data |
| SQL endpoint? | Yes (read-only — SELECT, views) | Full T-SQL (read + write) | No — KQL only |
| Power BI integration | Via SQL endpoint or Direct Lake | Direct SQL connection | Via KQL querysets or Real-Time dashboards |
Decision patterns (exam-tested)
| Scenario | Best Store | Why |
|---|---|---|
| Store 500M rows of sales data for PySpark transformations and ML training | Lakehouse | Need Spark for ML, large-scale transforms |
| Serve a Power BI dashboard with complex SQL aggregations and stored procedures | Warehouse | T-SQL DML, stored procs, optimised for BI |
| Ingest and query 1M events/second from IoT sensors | KQL Database | Real-time ingestion, time-series KQL |
| Store raw JSON files from an API alongside structured Delta tables | Lakehouse | Files/ section handles unstructured; Tables/ handles structured |
| Build a dimensional model with fact and dimension tables for analysts | Warehouse (or Lakehouse with SQL endpoint) | Both work; warehouse preferred when analysts need full T-SQL |
| Real-time clickstream analysis with 10-second query latency | KQL Database | Sub-second KQL queries on streaming data |
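The decision patterns above can be condensed into a simple rule of thumb: real-time/streaming needs win first, then Spark/ML/unstructured needs, and structured BI defaults to the warehouse. A minimal sketch (a hypothetical study aid, not a Fabric API):

```python
# Hypothetical helper encoding the decision patterns above as ordered rules.
# The keyword sets are assumptions distilled from the table, not a Fabric feature.

def choose_store(needs: set[str]) -> str:
    """Pick a Fabric data store from a set of requirement keywords."""
    # Streaming/time-series requirements take priority (IoT, clickstream rows)
    if needs & {"streaming", "time-series", "real-time"}:
        return "KQL Database"
    # Spark-scale transforms, ML training, or unstructured files
    if needs & {"unstructured", "files", "pyspark", "ml"}:
        return "Lakehouse"
    # Structured data, full T-SQL, BI reporting
    return "Warehouse"

print(choose_store({"pyspark", "ml"}))            # Lakehouse
print(choose_store({"streaming"}))                # KQL Database
print(choose_store({"stored-procedures", "bi"}))  # Warehouse
```

The ordering matters: a streaming workload routes to the KQL database even when it is also large-scale, mirroring the IoT row in the table.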
Scenario: Anika's dual-store design
ShopStream uses both a lakehouse and a warehouse:
- Lakehouse: Raw order JSON files land in Files/. PySpark notebooks parse, validate, and transform them into Delta tables in Tables/. ML models train on these tables.
- Warehouse: The cleaned, modelled data (star schema) is loaded into the warehouse for the BI team. They write stored procedures for complex revenue calculations and build Power BI reports via DirectQuery or Import mode.
The lakehouse is the engineering workbench. The warehouse is the analytics serving layer.
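The lakehouse half of Anika's design boils down to parse, validate, and shape each raw JSON order into a clean row. In Fabric this runs in a PySpark notebook against Files/; the sketch below uses stdlib `json` for illustration, and the field names (`order_id`, `amount`, `currency`) are assumed, not from ShopStream's actual schema:

```python
# Schematic of the lakehouse parse/validate step: raw JSON order -> clean row.
# In Fabric this would be a PySpark notebook writing Delta tables to Tables/;
# plain-Python json stands in here, and the order fields are assumptions.

import json
from typing import Optional

def parse_order(raw: str) -> Optional[dict]:
    """Return a cleaned order row, or None if validation fails."""
    try:
        order = json.loads(raw)
    except json.JSONDecodeError:
        return None  # malformed payloads stay in the raw zone
    if not order.get("order_id") or order.get("amount", 0) <= 0:
        return None  # reject incomplete or zero-value orders
    return {
        "order_id": str(order["order_id"]),
        "amount": float(order["amount"]),
        "currency": order.get("currency", "USD"),
    }

raw = '{"order_id": "A-1001", "amount": 49.99, "currency": "EUR"}'
print(parse_order(raw))
```

Only rows that survive validation land in Tables/ as Delta tables; the warehouse then consumes that curated output, not the raw files.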
Choosing a transformation tool
| Factor | Dataflows Gen2 | Notebooks | KQL | T-SQL |
|---|---|---|---|---|
| Interface | Visual (Power Query) | Code (PySpark, SQL, R) | Code (KQL) | Code (SQL) |
| Skill required | Low — drag-and-drop | High — PySpark/Python | Medium — KQL syntax | Medium — SQL |
| Scale | Small to medium | Large (distributed Spark) | Large (columnar engine) | Large (MPP engine) |
| Best for | Simple cleaning from 150+ connectors | Complex transforms, ML, custom logic | Time-series transforms, aggregations | Relational transforms, stored procs |
| Output to | Lakehouse, warehouse, KQL DB | Lakehouse (Delta tables) | KQL Database | Warehouse tables |
| Scheduling | Built-in refresh | Via pipeline notebook activity | Via pipeline or Eventhouse | Via pipeline stored proc activity |
Decision patterns
| Scenario | Best Tool | Why |
|---|---|---|
| Extract from Salesforce, clean, load to lakehouse (no code) | Dataflows Gen2 | 150+ connectors, visual, no code |
| Join 3 Delta tables (1B rows), calculate running averages, write to warehouse | Notebook (PySpark) | Scale + complex windowed calculations |
| Calculate hourly aggregates on streaming sensor data | KQL | Time-series optimised, bin() function |
| Run a stored procedure to update dimension tables in a warehouse | T-SQL | Direct SQL on warehouse, existing proc logic |
| Load CSVs from SharePoint, standardise dates, filter blanks | Dataflows Gen2 | Simple, visual, perfect for SharePoint connector |
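The KQL row above mentions `bin()`, which truncates each timestamp to a fixed bucket so aggregates can be grouped per hour. A minimal stdlib-Python sketch of the same idea (illustrative only; in Fabric you would write the KQL `summarize avg(...) by bin(timestamp, 1h)` directly):

```python
# Mimics KQL's bin(timestamp, 1h) + avg() on event data, in plain Python.
# Events are assumed to be (epoch_seconds, value) pairs for illustration.

from collections import defaultdict
from statistics import mean

def hourly_average(events):
    """Group events into hourly buckets and average each bucket's values."""
    buckets = defaultdict(list)
    for ts, value in events:
        hour = ts - (ts % 3600)  # truncate to the hour, like bin(ts, 1h)
        buckets[hour].append(value)
    return {hour: mean(values) for hour, values in sorted(buckets.items())}

sample = [(0, 10.0), (1800, 20.0), (3600, 30.0)]
print(list(hourly_average(sample).values()))  # [15.0, 30.0]
```

The KQL engine does this binning natively over streaming ingestion, which is why it wins the time-series scenarios in the table.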
Exam tip: 'Choose between' questions
When the exam asks “which tool should you use for data transformation,” look for these clues:
- “No coding” or “visual” → Dataflows Gen2
- “Large scale” or “ML” or “PySpark” → Notebook
- “Time-series” or “streaming” or “KQL” → KQL
- “Stored procedure” or “T-SQL” or “warehouse” → T-SQL
- “150+ connectors” or “Power Query” → Dataflows Gen2
If multiple tools could work, choose the simplest one that meets the requirement.
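The clue-matching above is mechanical enough to express as a tiny keyword matcher. A hypothetical study aid (not a Fabric feature; "ML" is spelled out as "machine learning" to avoid substring false matches):

```python
# Hypothetical exam-prep helper: map question wording to a transformation tool
# using the clue bullets above. Simple substring matching, first match wins.

CLUES = {
    "Dataflows Gen2": ["no coding", "visual", "150+ connectors", "power query"],
    "Notebook":       ["large scale", "machine learning", "pyspark"],
    "KQL":            ["time-series", "streaming", "kql"],
    "T-SQL":          ["stored procedure", "t-sql", "warehouse"],
}

def suggest_tool(question: str) -> str:
    q = question.lower()
    for tool, clues in CLUES.items():
        if any(clue in q for clue in clues):
            return tool
    # No strong signal: default to the simplest tool that could work
    return "Dataflows Gen2"

print(suggest_tool("Join 1B rows with PySpark and write results"))  # Notebook
```

The fallback encodes the final rule: when nothing forces a heavier tool, pick the simplest one.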
When to combine stores and tools
Most real-world solutions use multiple stores and tools together:
| Pattern | Flow |
|---|---|
| Lakehouse + Warehouse | Raw data → Lakehouse (PySpark transforms) → Warehouse (star schema for BI) |
| Lakehouse + KQL Database | Batch data → Lakehouse; Streaming data → KQL Database; both feed dashboards |
| Dataflow + Notebook | Dataflow Gen2 (extract/clean from source) → Notebook (complex transforms at scale) |
Quick check
ShopStream needs to store raw JSON order files from payment gateways alongside structured Delta tables for transformed data. Which Fabric data store should Anika use?
A team of business analysts (no Python skills) need to pull data from Salesforce, rename columns, filter out blank emails, and load into a lakehouse. Which transformation tool is most appropriate?
Zoe needs to calculate the average video watch time per hour from a stream of 500,000 playback events per minute. The results must be queryable within 10 seconds of the events occurring. Which data store and query language should she use?
Next up: OneLake Shortcuts: Data Without Duplication — access external data as if it were local, without copying a single byte.