Orchestration: Pick the Right Tool
Choose between Dataflows Gen2, pipelines, and notebooks for data orchestration. Design schedules and event-based triggers to automate your workflows.
Three tools, three use cases
Think of three ways to get to work.
Walking (Dataflows Gen2) — simple, visual, no special skills needed. Perfect for short trips. You see every step clearly.
Driving (Pipelines) — more powerful, handles complex routes, can carry passengers (other activities). But you need to know the roads.
Flying (Notebooks) — maximum power and flexibility. Go anywhere, do anything. But you need a pilot's licence (coding skills).
The exam tests whether you know which one to pick for a given scenario. The answer is usually the simplest tool that gets the job done.
The decision framework
| Factor | Dataflows Gen2 | Pipelines | Notebooks |
|---|---|---|---|
| Interface | Visual (Power Query drag-and-drop) | Visual (canvas with activities) + JSON | Code (PySpark, SQL, Scala, R) |
| Best for | Simple data cleaning and shaping from 150+ connectors | Orchestrating multiple activities (copy, dataflow, notebook, stored proc) | Complex transformations, ML, custom logic, large-scale processing |
| Coding required? | No — M language generated automatically | Minimal — expressions and parameters | Yes — PySpark, SQL, or Scala |
| Scale | Small to medium datasets (Power Query engine) | Orchestrates at any scale (delegates to other engines) | Large datasets (distributed Spark processing) |
| Error handling | Basic retry on refresh failure | Rich — retry, conditional paths, failure activities, alerts | Custom — try/except in code, widget notifications |
| Scheduling | Built-in refresh schedule | Triggers: schedule, tumbling window, event-based | Built-in schedule, via pipeline (notebook activity), or manual run |
| Output destinations | Lakehouse, warehouse, KQL database | No output itself — orchestrates other tools that produce output | Lakehouse (Delta tables), warehouse, files |
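The decision framework above can be condensed into a tiny helper. This is an illustrative sketch, not any Fabric API: the function name and the three boolean factors are assumptions made for the example.

```python
# Hypothetical helper condensing the decision framework above.
# The factor names are illustrative assumptions, not part of Fabric.
def pick_tool(needs_code: bool, multi_step: bool, large_scale: bool) -> str:
    """Return the simplest Fabric tool that satisfies the requirements."""
    if needs_code or large_scale:
        return "Notebook"        # custom logic or distributed Spark scale
    if multi_step:
        return "Pipeline"        # orchestrate copy/dataflow/notebook steps
    return "Dataflows Gen2"      # simple, visual ETL

print(pick_tool(needs_code=False, multi_step=False, large_scale=False))
```

The order of the checks mirrors the exam heuristic: reach for code only when scale or custom logic demands it, and default to the simplest tool otherwise.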
When to use what β exam decision patterns
| Scenario | Best Tool | Why |
|---|---|---|
| Load CSV from blob storage, clean column names, filter rows, write to lakehouse | Dataflows Gen2 | Simple ETL, no code needed, Power Query handles it |
| Run a dataflow, then a notebook, then refresh a semantic model — with retry on failure | Pipeline | Multi-step orchestration with error handling |
| Join 500M rows across three Delta tables, calculate running averages, write to warehouse | Notebook | Scale + complex logic needs distributed Spark |
| Copy data from Azure SQL to lakehouse (no transformation) | Pipeline (Copy activity) | Pure data movement — no transformation needed |
| Transform data using stored procedures in a warehouse | Pipeline (Stored Procedure activity) | Calls existing SQL logic without a notebook |
| Apply machine learning model to incoming data | Notebook | ML libraries (scikit-learn, MLflow) only available in code |
Scenario: Carlos's orchestration design
Carlos at Precision Manufacturing needs to load daily production data:
1. Copy raw CSV files from an SFTP server to the lakehouse → Pipeline Copy activity
2. Clean column names, filter invalid records, standardise date formats → Dataflows Gen2 (visual, quick)
3. Transform — join with dimension tables, calculate defect rates, build fact table → Notebook (500M rows, complex joins)
4. Refresh the Power BI semantic model → Pipeline (semantic model refresh activity)
Carlos wraps steps 1-4 in a single Pipeline that orchestrates all the activities in sequence, with retry logic on the copy activity and an email alert if the notebook fails.
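Carlos's design can be outlined as plain data. This is a sketch only: the field names below echo Fabric pipeline concepts, but they are assumptions made for readability, not the real pipeline JSON schema.

```python
# Illustrative outline of Carlos's pipeline as a plain data structure.
# Field and activity names are assumptions, not Fabric's JSON schema.
pipeline = {
    "name": "DailyProductionLoad",
    "activities": [
        {"type": "Copy", "name": "CopyFromSftp", "retry": 3},
        {"type": "Dataflow", "name": "CleanProductionData",
         "depends_on": "CopyFromSftp"},
        {"type": "Notebook", "name": "BuildFactTable",
         "depends_on": "CleanProductionData", "on_failure": "SendEmailAlert"},
        {"type": "SemanticModelRefresh", "name": "RefreshModel",
         "depends_on": "BuildFactTable"},
    ],
}

# Each activity depends on its predecessor, so the chain runs in sequence.
run_order = [a["name"] for a in pipeline["activities"]]
print(run_order)
```

The key design point survives even in this toy form: one pipeline owns the whole sequence, so retry and alerting live in one place instead of being scattered across four separately scheduled items.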
Schedules and triggers
Once you've built your orchestration, you need to make it run automatically.
Trigger types
| Trigger Type | How It Works | Best For |
|---|---|---|
| Schedule | Runs at fixed intervals (every 6 hours, daily at 3 AM, every Monday) | Regular batch processing on predictable cadence |
| Tumbling window | Like schedule, but windows don't overlap and catch up on missed runs | Time-partitioned data loads (process yesterday's data) |
| Event-based | Fires when something happens — new file in storage, message in Event Hub | Real-time or near-real-time ingestion |
| On-demand | Manual trigger or API call | Testing, ad-hoc runs, CI/CD-triggered deployments |
Side-by-side comparison:

| Feature | Schedule Trigger | Tumbling Window | Event-Based |
|---|---|---|---|
| Runs on | Fixed clock times | Fixed intervals, catches up on missed | External event (file arrival, message) |
| Overlap possible? | Yes — if previous run hasn't finished | No — windows don't overlap | N/A — each event triggers one run |
| Backfill? | No — missed runs are skipped | Yes — runs for each missed window | No — only fires on new events |
| Typical use | Daily refresh at 3 AM | Process data for each hour, catching up after downtime | New file in ADLS triggers ingestion immediately |
Exam tip: Tumbling window vs schedule
The exam often presents a scenario where a pipeline missed runs during a capacity outage. The question: "How do you ensure all missed time windows are processed?"
Answer: Tumbling window trigger. Unlike a schedule trigger (which skips missed runs), a tumbling window trigger keeps track of each window and catches up on any that were missed.
Pattern: "Guaranteed processing of every time window" → tumbling window.
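The backfill behaviour is easy to see in miniature. The sketch below is plain Python, not anything Fabric exposes; the dates and function name are made up to enumerate the windows a tumbling window trigger would catch up on after an outage.

```python
from datetime import datetime, timedelta

# Illustration of tumbling-window backfill: each fixed-size window is
# identified by its start time, so the missed windows are exactly those
# that fully elapsed between the last successful run and now.
def missed_windows(last_success: datetime, now: datetime,
                   size: timedelta) -> list:
    windows = []
    start = last_success + size
    while start + size <= now:      # only fully elapsed windows qualify
        windows.append(start)
        start += size
    return windows

# Capacity paused over the weekend; the trigger catches up on Monday.
friday = datetime(2024, 6, 7)       # last window processed
monday = datetime(2024, 6, 10, 3)   # Monday 3 AM
for w in missed_windows(friday, monday, timedelta(days=1)):
    print(w.date())                 # Saturday and Sunday windows re-run
```

A schedule trigger, by contrast, would simply fire once on Monday and never revisit the weekend windows.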
Scenario: Anika's event-driven pipeline
ShopStream receives order data as JSON files dropped into Azure Blob Storage by the payment gateway. Anika configures an event-based trigger:
- Event: New blob created in the orders/incoming/ container
- Action: Pipeline starts → Copy activity moves the file to the lakehouse → Notebook parses JSON, validates, and appends to the orders Delta table
Orders appear in the analytics dashboard within 5 minutes of payment. No scheduled polling — the pipeline runs only when there's work to do.
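The filtering behind such a trigger can be mimicked in a few lines. A minimal sketch, assuming a prefix-and-suffix filter of the kind blob-created events support; the paths and function name are hypothetical.

```python
# Minimal sketch of an event filter: only blobs under the configured
# prefix (and with the expected extension) should start the pipeline.
# The default paths here are illustrative assumptions.
def should_trigger(blob_path: str,
                   prefix: str = "orders/incoming/",
                   suffix: str = ".json") -> bool:
    return blob_path.startswith(prefix) and blob_path.endswith(suffix)

print(should_trigger("orders/incoming/order-1042.json"))   # True: pipeline runs
print(should_trigger("orders/archive/order-0999.json"))    # False: ignored
```

The filter is what keeps the design efficient: files landing anywhere else in the storage account never wake the pipeline at all.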
Check your understanding

A data engineer needs to load 800 million rows from three Delta tables, calculate rolling 7-day averages, and write results to a warehouse. Which tool should they use?
Carlos's pipeline runs on a daily schedule at 3 AM. Over the weekend, the Fabric capacity was paused for maintenance, and Saturday and Sunday runs were missed. On Monday, the pipeline runs once. How many days of data were processed?
🎬 Video coming soon
Next up: Pipeline Patterns: Parameters & Expressions — make your orchestration reusable with dynamic expressions and parameterised pipelines.