Ingesting Data: Dataflows Gen2 & Pipelines
Get data into Fabric: no-code Dataflows Gen2, orchestration pipelines, COPY INTO, and notebooks. Match the right ingestion tool to the right scenario.
How does data get into Fabric?
Think of a postal service with different delivery options.
Need to send a postcard? Drop it in the letterbox (COPY INTO: fast, simple, direct). Need to send 50 different parcels to 50 addresses on a schedule? Use a courier service (Pipeline: orchestrates multiple steps). Need to sort, repackage, and relabel items before delivery? Use a fulfilment centre (Dataflow Gen2: transforms along the way). Need a custom, one-off delivery with complex routing? Hire a specialist (Notebook: full code control).
Fabric gives you all four options. The exam tests your ability to pick the right one for each scenario.
Comparing ingestion tools
| Tool | Code Required | Best For | Target |
|---|---|---|---|
| Dataflows Gen2 | No code (Power Query GUI) | Connect → clean → load with visual transformations | Lakehouse, Warehouse, or other Fabric items |
| Pipeline | Low code (drag-and-drop activities) | Orchestrate multiple steps: copy, transform, load, schedule | Any Fabric item (coordinates other tools) |
| Notebook | Full code (PySpark/Scala/R) | Complex transformations, API calls, custom logic, ML prep | Lakehouse (Delta tables) |
| COPY INTO | T-SQL command | Bulk-load Parquet/CSV files into warehouse tables | Warehouse only |
Dataflows Gen2
Dataflows Gen2 is the no-code ingestion tool in Fabric. It is built on Power Query Online, the same engine behind Power BI Desktop's Get Data experience.
Key capabilities
- 350+ connectors: databases, SaaS apps, files, APIs
- Visual transformations: filter, merge, pivot, unpivot, split columns, add custom columns
- Scheduling: run on a schedule or trigger from a pipeline
- Staging: data is staged in OneLake before loading to the destination
- Destinations: load directly into lakehouses, warehouses, or KQL databases
When to choose Dataflows Gen2
- Your team prefers no-code/low-code tools
- The transformation is simple to moderate (cleaning, type conversion, merge/join)
- You need to connect to a SaaS source (Salesforce, Dynamics, Google Sheets) that does not have a native Fabric connector
- You want a repeatable, scheduled data load without writing pipelines
Scenario: Dr. Sarah's patient data feed
Dr. Sarah at Pacific Health receives daily patient survey results from a third-party SaaS tool. She creates a Dataflow Gen2 that:
- Connects to the SaaS API (using a web connector)
- Flattens the nested JSON responses into a tabular format
- Renames columns to match her lakehouse schema
- Filters out test records
- Loads the clean data into the `silver_patient_surveys` table in her lakehouse
Total setup time: 30 minutes. No code written. Runs daily at 6 AM.
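What the dataflow does visually can be sketched in plain Python for intuition. Everything below is invented for illustration: the sample payload shape, the field names (`responses`, `is_test`, `answers`), and the `flatten` helper; a real survey API will differ.

```python
import json

# Hypothetical sample of two nested survey responses, roughly the shape
# a SaaS survey API might return (all field names are invented).
raw = json.loads("""
{
  "responses": [
    {"id": "r1", "patient": {"mrn": "P-100", "site": "Auckland"},
     "answers": {"q1": 4, "q2": 5}, "is_test": false},
    {"id": "r2", "patient": {"mrn": "TEST", "site": "Auckland"},
     "answers": {"q1": 1, "q2": 1}, "is_test": true}
  ]
}
""")

def flatten(resp: dict) -> dict:
    """Collapse nested objects into one flat row with renamed columns."""
    return {
        "response_id": resp["id"],
        "patient_mrn": resp["patient"]["mrn"],
        "site": resp["patient"]["site"],
        **{f"score_{k}": v for k, v in resp["answers"].items()},
    }

# Flatten, then filter out test records -- the same two steps
# Dr. Sarah configures visually in Power Query.
rows = [flatten(r) for r in raw["responses"] if not r["is_test"]]
print(rows)  # one clean tabular row remains
```

In the dataflow these steps are clicks, not code, but the shape of the result landing in the lakehouse table is the same.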
Pipelines
Pipelines are the orchestration backbone of Fabric, based on Azure Data Factory. They coordinate multiple activities into a single workflow.
Common pipeline activities
| Activity | What It Does |
|---|---|
| Copy activity | Moves data from source to destination (the most common activity) |
| Dataflow activity | Runs a Dataflow Gen2 as a step in the pipeline |
| Notebook activity | Runs a Spark notebook |
| Stored procedure | Executes a warehouse stored procedure |
| For Each / If Condition | Control flow: loops and branching |
| Web activity | Calls a REST API |
| Wait | Pauses for a specified duration |
When to choose Pipelines
- You need to orchestrate multiple steps (copy → transform → load → notify)
- You need scheduling with retry logic and error handling
- You need to parameterise workflows (same pipeline, different source/target per run)
- You need to coordinate notebooks, dataflows, and stored procedures in sequence
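Under the hood, a pipeline is stored as an Azure Data Factory-style JSON definition. The fragment below is a simplified, illustrative sketch of how retry policy and step ordering are expressed; the exact property and activity type names in Fabric's exported definitions may differ.

```json
{
  "name": "nightly_ingestion",
  "activities": [
    {
      "name": "CopyFromBlob",
      "type": "Copy",
      "policy": { "retry": 3, "retryIntervalInSeconds": 60 }
    },
    {
      "name": "TransformSilver",
      "type": "Notebook",
      "dependsOn": [
        { "activity": "CopyFromBlob", "dependencyConditions": ["Succeeded"] }
      ]
    }
  ]
}
```

The `dependsOn` condition is what gives you sequencing ("run the notebook only if the copy succeeded"), and `policy.retry` is the built-in retry logic from the bullet list above.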
Scenario: Anita's nightly ingestion pipeline
Anita at FreshCart builds a pipeline that runs every night at midnight:
- Copy activity: Copy CSV files from Azure Blob Storage (2,000 stores) into lakehouse Bronze tables
- Notebook activity: Run a PySpark notebook that deduplicates, validates, and loads Silver tables
- Stored procedure activity: Call a warehouse stored procedure to rebuild Gold-layer aggregates
- Web activity: Send a Slack notification when the pipeline completes
If Step 2 fails, the pipeline retries 3 times. If all retries fail, it sends an alert to the on-call team.
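The retry-then-alert behaviour can be sketched as ordinary Python control flow. `run_with_retry` and `silver_notebook` are hypothetical stand-ins; in Fabric you configure this declaratively on the activity's retry settings rather than writing code.

```python
import time

def run_with_retry(step, max_retries=3, delay_seconds=0):
    """Run a pipeline step, retrying on failure; re-raise after the last attempt."""
    for attempt in range(1, max_retries + 1):
        try:
            return step()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)

# Simulated notebook step (Step 2) that fails twice, then succeeds.
calls = {"n": 0}
def silver_notebook():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient Spark failure")
    return "silver tables loaded"

try:
    result = run_with_retry(silver_notebook)
except RuntimeError:
    result = "alert on-call team"  # the Web-activity notification path

print(result)  # succeeds on the third attempt
```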
Notebooks
Spark notebooks give you full code control over data ingestion and transformation.
When to choose Notebooks
- Complex transformations: data quality rules, custom parsing, API pagination
- Semi-structured data: JSON, XML, nested structures that need flattening
- Machine learning prep: feature engineering, data sampling
- Exploratory work: ad-hoc investigation before building production pipelines
Common ingestion patterns in notebooks
```python
# Read CSV files from OneLake into a Spark DataFrame
# (the `spark` session is pre-created in a Fabric notebook)
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/pos_transactions/*.csv")

# Basic transformations
from pyspark.sql.functions import col, to_date, when

df_clean = df \
    .filter(col("amount") > 0) \
    .withColumn("transaction_date", to_date(col("date_string"), "yyyy-MM-dd")) \
    .withColumn("category", when(col("dept_code") == "GR", "Grocery")
                .when(col("dept_code") == "HW", "Hardware")
                .otherwise("Other"))

# Write to a Delta table in the lakehouse
df_clean.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_transactions")
```
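The "API pagination" case from the bullet list above can be sketched without a real network call. `fetch_page` is a stand-in for a paginated REST endpoint (invented for illustration); in a real notebook you would call the API with an HTTP client and then hand the accumulated records to Spark.

```python
def fetch_page(page: int, page_size: int = 2):
    """Stand-in for one page of a paginated REST endpoint (fake data)."""
    data = ["tx1", "tx2", "tx3", "tx4", "tx5"]
    start = page * page_size
    return data[start:start + page_size]

def fetch_all():
    """Keep requesting pages until an empty page signals the end."""
    records, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

all_records = fetch_all()
print(len(all_records))  # all five records collected across three pages
```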
COPY INTO
The fastest way to bulk-load data into a Fabric Warehouse:
```sql
COPY INTO dbo.daily_positions
FROM 'https://onelake.dfs.fabric.microsoft.com/workspace/lakehouse.Lakehouse/Files/exports/positions.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
```
When to choose COPY INTO
- Loading Parquet or CSV files directly into warehouse tables
- You need high throughput for large batch loads
- Your workflow is SQL-centric (no Spark, no Power Query)
Check your understanding
Dr. Sarah at Pacific Health needs to ingest daily patient survey data from a third-party SaaS tool. The data needs light cleaning (rename columns, filter test records, convert types). Her team has no coding skills. Which tool should she use?
Anita at FreshCart needs to orchestrate a nightly workflow: (1) copy CSV files from Azure Blob Storage, (2) run a PySpark notebook for transformations, (3) call a warehouse stored procedure for aggregations, (4) send a notification. Which tool coordinates these four steps?
🎬 Video coming soon
Next up: Star Schema Design, the data modelling pattern that underpins every high-performance lakehouse and warehouse.