Optimize Pipelines & Warehouses
Tune pipeline performance with parallelism and resource management. Optimize warehouse queries with statistics, result caching, and efficient T-SQL patterns.
Pipeline optimization
Think of an assembly line in a factory.
A slow assembly line usually has one of three problems: bottlenecks (one station takes 10x longer than the others), sequencing (stations wait for each other when they could work in parallel), or wasted effort (doing work that isn't needed).
Pipeline optimization follows the same principles: identify the slowest activity, run independent activities in parallel, and skip unnecessary work.
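The second principle, running independent activities in parallel while chaining only true dependencies, can be sketched in plain Python. The activity functions below are hypothetical stand-ins, not a Fabric API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for pipeline activities (illustrative names only).
def copy_sap():
    return "sap: copied"

def copy_iot():
    return "iot: copied"

def transform(copy_results):
    # Depends on BOTH copies, so it runs only after they finish.
    return f"transformed {len(copy_results)} source datasets"

# Independent activities are submitted together and run side by side;
# only the genuinely dependent step waits on their results.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(copy_sap), pool.submit(copy_iot)]
    copy_results = [f.result() for f in futures]  # wait for both copies

result = transform(copy_results)
print(result)
```

The key design point mirrors pipeline dependency arrows: the `submit` calls encode "no dependency", while `f.result()` encodes "wait for these before starting the next stage".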
Pipeline performance techniques
| Technique | How | Impact |
|---|---|---|
| Parallel ForEach | Increase batch count (up to 50) in ForEach activity | Process multiple items simultaneously instead of sequentially |
| Copy activity DIU | Increase Data Integration Units (DIU) for large data transfers | More parallelism in the copy operation itself |
| Incremental over full | Use watermark or CDC to load only changed data | 90%+ reduction in data volume for large tables |
| Source-side filtering | Push filters into the source query (not post-copy) | Reduce data transferred over the network |
| Avoid unnecessary activities | Remove logging, temp table creation, or intermediate copies that aren't needed | Fewer activities = faster pipeline |
| Activity dependencies | Only chain activities that truly depend on each other; let independent ones run in parallel | Reduces total pipeline duration |
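The incremental-over-full technique from the table hinges on a watermark: remember the highest change timestamp you have loaded, and next run fetch only rows modified after it. A minimal sketch, with made-up rows and column names:

```python
# Minimal watermark-based incremental load (illustrative, not a Fabric API).
rows = [
    {"id": 1, "modified": "2026-01-01"},
    {"id": 2, "modified": "2026-01-05"},
    {"id": 3, "modified": "2026-01-09"},
]

def incremental_load(rows, watermark):
    # Source-side filter: only rows changed since the last successful load.
    changed = [r for r in rows if r["modified"] > watermark]
    # Advance the watermark to the newest change we just picked up.
    new_watermark = max((r["modified"] for r in changed), default=watermark)
    return changed, new_watermark

changed, watermark = incremental_load(rows, watermark="2026-01-03")
print(len(changed), watermark)  # only rows 2 and 3 are transferred
```

In a real pipeline the filter would be pushed into the source query (source-side filtering, per the table above) and the new watermark persisted for the next run.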
Scenario: Carlos optimizes his nightly pipeline
Carlos's pipeline takes 3 hours. He profiles each activity:
| Activity | Duration | Depends On |
|---|---|---|
| Copy SAP data | 45 min | (none) |
| Copy IoT data | 30 min | (none) |
| Transform notebook | 90 min | Both copies |
| Semantic model refresh | 15 min | Transform |
Before: All activities run sequentially (45 + 30 + 90 + 15 = 180 min, or 3 hours). After: Copy SAP and Copy IoT run in parallel (45 min), then Transform (90 min), then Refresh (15 min) = 150 min (2.5 hours). He also switches SAP to an incremental load: Copy SAP drops to 8 min, so the parallel copy stage is now bounded by the 30-min IoT copy. Total: 30 + 90 + 15 = 135 min (2.25 hours), a 25% reduction from the original 3 hours.
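The durations can be recomputed directly from the activity table: the pipeline's wall-clock time is the slowest branch of each parallel stage plus the chained stages after it.

```python
# Pipeline durations in minutes, taken from Carlos's activity table.
sequential  = 45 + 30 + 90 + 15      # every activity waits for the previous one
parallel    = max(45, 30) + 90 + 15  # the two independent copies run side by side
incremental = max(8, 30) + 90 + 15   # SAP copy drops to 8 min with incremental load

print(sequential, parallel, incremental)  # 180 150 135 minutes
```

Note that once the SAP copy falls below 30 minutes, the IoT copy becomes the bottleneck of that stage, so shrinking SAP further buys nothing until IoT is also optimized.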
Warehouse optimization
Statistics
Fabric warehouses use statistics to create optimal query execution plans. Statistics describe data distribution (row counts, value histograms, distinct counts per column).
-- Update statistics on a table
UPDATE STATISTICS dbo.FactOrders;
-- Create statistics on specific columns
CREATE STATISTICS stat_order_date ON dbo.FactOrders (order_date);
When to update: After large data loads, after significant DELETE operations, or when query plans look suboptimal.
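A common way to operationalize "after large data loads or significant deletes" is a churn threshold: refresh statistics once the fraction of modified rows crosses some cutoff. The 20% figure below echoes the classic SQL Server auto-update heuristic and is an assumption, not a Fabric rule:

```python
# Illustrative staleness heuristic (assumed 20% threshold, not a Fabric rule):
# refresh statistics when a large fraction of the table has changed.
def stats_stale(rows_total, rows_modified, threshold=0.20):
    if rows_total == 0:
        return True  # empty-or-unknown table: always refresh
    return rows_modified / rows_total >= threshold

print(stats_stale(1_000_000, 50_000))   # 5% churn: stats still usable
print(stats_stale(1_000_000, 300_000))  # 30% churn: run UPDATE STATISTICS
```

A scheduled pipeline step could run this check and issue `UPDATE STATISTICS` only when it returns `True`, avoiding needless work on stable tables.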
Result-set caching
Fabric warehouses have a result-set caching feature, though availability may vary; check the latest Fabric documentation for current status. When available, cached results return instantly for repeated identical queries when the underlying data hasn't changed.
Current best practice: Use statistics updates and efficient query patterns (filter early, select only needed columns) to optimize warehouse query performance.
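The caching semantics described above (identical query text plus unchanged data returns the stored result) can be modeled in a few lines. This is a conceptual sketch of the mechanism, not how Fabric implements it internally:

```python
# Conceptual sketch of result-set caching: cache key = (query text, data version).
cache = {}
data_version = 1  # bumped whenever underlying data changes, invalidating old keys

def run_query(sql, execute):
    key = (sql, data_version)
    if key not in cache:
        cache[key] = execute()  # cache miss: do the expensive work once
    return cache[key]           # cache hit: return instantly

executions = []
slow_query = lambda: executions.append(1) or 42  # records each real execution

r1 = run_query("SELECT COUNT(*) FROM FactOrders", slow_query)
r2 = run_query("SELECT COUNT(*) FROM FactOrders", slow_query)
print(r1, r2, len(executions))  # same result twice, but only one real execution
```

This is why a dashboard firing the same query hundreds of times a day benefits so much: only the first run pays the 8-second cost until the data changes.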
Query optimization patterns
| Pattern | Poor | Optimized |
|---|---|---|
| Filter early | SELECT * FROM FactOrders JOIN DimCustomer ON ... WHERE year = 2026 | SELECT * FROM (SELECT * FROM FactOrders WHERE year = 2026) f JOIN DimCustomer ON ... |
| Select only needed columns | SELECT * FROM FactOrders | SELECT order_id, revenue, order_date FROM FactOrders |
| Avoid SELECT DISTINCT on large tables | SELECT DISTINCT customer_id FROM FactOrders | Use GROUP BY customer_id or get distinct from the dimension table |
| Use appropriate data types | VARCHAR(MAX) for a 10-character code | VARCHAR(10), which saves memory and improves performance |
Exam tip: Warehouse vs lakehouse query performance
The exam may compare query performance between the lakehouse SQL endpoint and the warehouse:
- Lakehouse SQL endpoint: Read-only, no statistics management, best for ad-hoc exploration
- Warehouse: Full T-SQL, statistics, result-set caching, stored procedures; best for production BI workloads
If a scenario describes slow BI dashboard queries, look for warehouse optimizations (statistics, caching, materialized views), not lakehouse tuning.
Carlos's pipeline has 3 Copy activities that take 30, 25, and 20 minutes respectively. They currently run sequentially (75 min total). None depends on the others. What is the best optimization?
A Power BI dashboard runs the same warehouse query 200 times per day. The query takes 8 seconds. What warehouse feature would most improve the user experience?
🎬 Video coming soon
Next up: Optimize Streaming: Real-Time Performance, where you'll tune Eventstreams, Eventhouses, and streaming queries for maximum throughput.