Optimize Pipelines & Warehouses
Tune pipeline performance with parallelism and resource management. Optimize warehouse queries with statistics, result caching, and efficient T-SQL patterns.
Pipeline optimization
Think of an assembly line in a factory.
A slow assembly line usually has one of three problems: bottlenecks (one station takes 10x longer than the others), sequencing (stations wait for each other when they could work in parallel), or wasted effort (doing work that isn't needed).
Pipeline optimization follows the same principles: identify the slowest activity, run independent activities in parallel, and skip unnecessary work.
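The second principle, running independent activities in parallel while chaining only true dependencies, can be sketched in plain Python. The activity functions below are hypothetical stand-ins, not a Fabric API:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-ins for pipeline activities (illustrative names only).
def copy_sap():
    return "sap: copied"

def copy_iot():
    return "iot: copied"

def transform(copy_results):
    # Depends on BOTH copies, so it runs only after they finish.
    return f"transformed {len(copy_results)} source datasets"

# Independent activities are submitted together and run side by side;
# only the genuinely dependent step waits on their results.
with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(copy_sap), pool.submit(copy_iot)]
    copy_results = [f.result() for f in futures]  # wait for both copies

result = transform(copy_results)
print(result)
```

The key design point mirrors pipeline dependency arrows: the `submit` calls encode "no dependency", while `f.result()` encodes "wait for these before starting the next stage".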
Pipeline performance techniques
| Technique | How | Impact |
|---|---|---|
| Parallel ForEach | Increase batch count (up to 50) in ForEach activity | Process multiple items simultaneously instead of sequentially |
| Copy activity DIU | Increase Data Integration Units (DIU) for large data transfers | More parallelism in the copy operation itself |
| Incremental over full | Use watermark or CDC to load only changed data | 90%+ reduction in data volume for large tables |
| Source-side filtering | Push filters into the source query (not post-copy) | Reduce data transferred over the network |
| Avoid unnecessary activities | Remove logging, temp table creation, or intermediate copies that aren't needed | Fewer activities = faster pipeline |
| Activity dependencies | Only chain activities that truly depend on each other; let independent ones run in parallel | Reduces total pipeline duration |
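The incremental-over-full technique from the table hinges on a watermark: remember the highest change timestamp you have loaded, and next run fetch only rows modified after it. A minimal sketch, with made-up rows and column names:

```python
# Minimal watermark-based incremental load (illustrative, not a Fabric API).
rows = [
    {"id": 1, "modified": "2026-01-01"},
    {"id": 2, "modified": "2026-01-05"},
    {"id": 3, "modified": "2026-01-09"},
]

def incremental_load(rows, watermark):
    # Source-side filter: only rows changed since the last successful load.
    changed = [r for r in rows if r["modified"] > watermark]
    # Advance the watermark to the newest change we just picked up.
    new_watermark = max((r["modified"] for r in changed), default=watermark)
    return changed, new_watermark

changed, watermark = incremental_load(rows, watermark="2026-01-03")
print(len(changed), watermark)  # only rows 2 and 3 are transferred
```

In a real pipeline the filter would be pushed into the source query (source-side filtering, per the table above) and the new watermark persisted for the next run.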
Scenario: Carlos optimizes his nightly pipeline
Carlos's pipeline takes 3 hours. He profiles each activity:
| Activity | Duration | Depends On |
|---|---|---|
| Copy SAP data | 45 min | (none) |
| Copy IoT data | 30 min | (none) |
| Transform notebook | 90 min | Both copies |
| Semantic model refresh | 15 min | Transform |
Before: All activities run sequentially (45 + 30 + 90 + 15 = 180 min, or 3 hours). After: Copy SAP and Copy IoT run in parallel (45 min), then Transform (90 min), then Refresh (15 min) = 150 min (2.5 hours). He also switches SAP to an incremental load: Copy SAP drops to 8 min, so the parallel copy stage is now bounded by the 30-min IoT copy. Total: 30 + 90 + 15 = 135 min (2.25 hours), a 25% reduction from the original 3 hours.
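The durations can be recomputed directly from the activity table: the pipeline's wall-clock time is the slowest branch of each parallel stage plus the chained stages after it.

```python
# Pipeline durations in minutes, taken from Carlos's activity table.
sequential  = 45 + 30 + 90 + 15      # every activity waits for the previous one
parallel    = max(45, 30) + 90 + 15  # the two independent copies run side by side
incremental = max(8, 30) + 90 + 15   # SAP copy drops to 8 min with incremental load

print(sequential, parallel, incremental)  # 180 150 135 minutes
```

Note that once the SAP copy falls below 30 minutes, the IoT copy becomes the bottleneck of that stage, so shrinking SAP further buys nothing until IoT is also optimized.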
Warehouse optimization
Statistics
Fabric warehouses use statistics to create optimal query execution plans. Statistics describe data distribution (row counts, value histograms, distinct counts per column).
-- Update statistics on a table
UPDATE STATISTICS dbo.FactOrders;
-- Create statistics on specific columns
CREATE STATISTICS stat_order_date ON dbo.FactOrders (order_date);
When to update: After large data loads, after significant DELETE operations, or when query plans look suboptimal.
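A common way to operationalize "after large data loads or significant deletes" is a churn threshold: refresh statistics once the fraction of modified rows crosses some cutoff. The 20% figure below echoes the classic SQL Server auto-update heuristic and is an assumption, not a Fabric rule:

```python
# Illustrative staleness heuristic (assumed 20% threshold, not a Fabric rule):
# refresh statistics when a large fraction of the table has changed.
def stats_stale(rows_total, rows_modified, threshold=0.20):
    if rows_total == 0:
        return True  # empty-or-unknown table: always refresh
    return rows_modified / rows_total >= threshold

print(stats_stale(1_000_000, 50_000))   # 5% churn: stats still usable
print(stats_stale(1_000_000, 300_000))  # 30% churn: run UPDATE STATISTICS
```

A scheduled pipeline step could run this check and issue `UPDATE STATISTICS` only when it returns `True`, avoiding needless work on stable tables.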
Result-set caching
Fabric warehouses have a result-set caching feature, though availability may vary; check the latest Fabric documentation for current status. When available, cached results return instantly for repeated identical queries when the underlying data hasn't changed.
Current best practice: Use statistics updates and efficient query patterns (filter early, select only needed columns) to optimize warehouse query performance.
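The caching semantics described above (identical query text plus unchanged data returns the stored result) can be modeled in a few lines. This is a conceptual sketch of the mechanism, not how Fabric implements it internally:

```python
# Conceptual sketch of result-set caching: cache key = (query text, data version).
cache = {}
data_version = 1  # bumped whenever underlying data changes, invalidating old keys

def run_query(sql, execute):
    key = (sql, data_version)
    if key not in cache:
        cache[key] = execute()  # cache miss: do the expensive work once
    return cache[key]           # cache hit: return instantly

executions = []
slow_query = lambda: executions.append(1) or 42  # records each real execution

r1 = run_query("SELECT COUNT(*) FROM FactOrders", slow_query)
r2 = run_query("SELECT COUNT(*) FROM FactOrders", slow_query)
print(r1, r2, len(executions))  # same result twice, but only one real execution
```

This is why a dashboard firing the same query hundreds of times a day benefits so much: only the first run pays the 8-second cost until the data changes.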
Query optimization patterns
| Pattern | Poor | Optimized |
|---|---|---|
| Filter early | SELECT * FROM FactOrders JOIN DimCustomer ON ... WHERE year = 2026 | SELECT * FROM (SELECT * FROM FactOrders WHERE year = 2026) f JOIN DimCustomer ON ... |
| Select only needed columns | SELECT * FROM FactOrders | SELECT order_id, revenue, order_date FROM FactOrders |
| Avoid SELECT DISTINCT on large tables | SELECT DISTINCT customer_id FROM FactOrders | Use GROUP BY customer_id or get distinct from the dimension table |
| Use appropriate data types | VARCHAR(MAX) for a 10-character code | VARCHAR(10), which saves memory and improves performance |
Exam tip: Warehouse vs lakehouse query performance
The exam may compare query performance between the lakehouse SQL endpoint and the warehouse:
- Lakehouse SQL endpoint: Read-only, no statistics management, best for ad-hoc exploration
- Warehouse: Full T-SQL, statistics, result-set caching, stored procedures; best for production BI workloads
If a scenario describes slow BI dashboard queries, look for warehouse optimizations (statistics, caching, materialized views), not lakehouse tuning.
Carlos's pipeline has 3 Copy activities that take 30, 25, and 20 minutes respectively. They currently run sequentially (75 min total). None depends on the others. What is the best optimization?
A Power BI dashboard runs the same warehouse query 200 times per day. The query takes 8 seconds. What warehouse feature would most improve the user experience?
🎬 Video coming soon
Next up: Optimize Streaming: Real-Time Performance, where you'll tune Eventstreams, Eventhouses, and streaming queries for maximum throughput.