Ingesting Data: Dataflows Gen2 & Pipelines
Get data into Fabric: no-code Dataflows Gen2, orchestration pipelines, COPY INTO, and notebooks. Match the right ingestion tool to the right scenario.
How does data get into Fabric?
Think of a postal service with different delivery options.
Need to send a postcard? Drop it in the letterbox (COPY INTO: fast, simple, direct). Need to send 50 different parcels to 50 addresses on a schedule? Use a courier service (Pipeline: orchestrates multiple steps). Need to sort, repackage, and relabel items before delivery? Use a fulfilment centre (Dataflow Gen2: transforms along the way). Need a custom, one-off delivery with complex routing? Hire a specialist (Notebook: full code control).
Fabric gives you all four options. The exam tests your ability to pick the right one for each scenario.
Comparing ingestion tools
| Tool | Code Required | Best For | Target |
|---|---|---|---|
| Dataflows Gen2 | No code (Power Query GUI) | Connect → clean → load with visual transformations | Lakehouse, Warehouse, or other Fabric items |
| Pipeline | Low code (drag-and-drop activities) | Orchestrate multiple steps: copy, transform, load, schedule | Any Fabric item (coordinates other tools) |
| Notebook | Full code (PySpark/Scala/R) | Complex transformations, API calls, custom logic, ML prep | Lakehouse (Delta tables) |
| COPY INTO | T-SQL command | Bulk-load Parquet/CSV files into warehouse tables | Warehouse only |
Dataflows Gen2
Dataflows Gen2 is the no-code ingestion tool in Fabric. It is built on Power Query Online, the same engine behind Power BI Desktop's Get Data experience.
Key capabilities
- 350+ connectors: databases, SaaS apps, files, APIs
- Visual transformations: filter, merge, pivot, unpivot, split columns, add custom columns
- Scheduling: run on a schedule or trigger from a pipeline
- Staging: data is staged in OneLake before loading to the destination
- Destinations: load directly into lakehouses, warehouses, or KQL databases
When to choose Dataflows Gen2
- Your team prefers no-code/low-code tools
- The transformation is simple to moderate (cleaning, type conversion, merge/join)
- You need to connect to a SaaS source (Salesforce, Dynamics, Google Sheets) that does not have a native Fabric connector
- You want a repeatable, scheduled data load without writing pipelines
Scenario: Dr. Sarah's patient data feed
Dr. Sarah at Pacific Health receives daily patient survey results from a third-party SaaS tool. She creates a Dataflow Gen2 that:
- Connects to the SaaS API (using a web connector)
- Flattens the nested JSON responses into a tabular format
- Renames columns to match her lakehouse schema
- Filters out test records
- Loads the clean data into the `silver_patient_surveys` table in her lakehouse
Total setup time: 30 minutes. No code written. Runs daily at 6 AM.
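What the dataflow does visually can be sketched in plain Python for intuition. Everything below is invented for illustration: the sample payload shape, the field names (`responses`, `is_test`, `answers`), and the `flatten` helper; a real survey API will differ.

```python
import json

# Hypothetical sample of two nested survey responses, roughly the shape
# a SaaS survey API might return (all field names are invented).
raw = json.loads("""
{
  "responses": [
    {"id": "r1", "patient": {"mrn": "P-100", "site": "Auckland"},
     "answers": {"q1": 4, "q2": 5}, "is_test": false},
    {"id": "r2", "patient": {"mrn": "TEST", "site": "Auckland"},
     "answers": {"q1": 1, "q2": 1}, "is_test": true}
  ]
}
""")

def flatten(resp: dict) -> dict:
    """Collapse nested objects into one flat row with renamed columns."""
    return {
        "response_id": resp["id"],
        "patient_mrn": resp["patient"]["mrn"],
        "site": resp["patient"]["site"],
        **{f"score_{k}": v for k, v in resp["answers"].items()},
    }

# Flatten, then filter out test records -- the same two steps
# Dr. Sarah configures visually in Power Query.
rows = [flatten(r) for r in raw["responses"] if not r["is_test"]]
print(rows)  # one clean tabular row remains
```

In the dataflow these steps are clicks, not code, but the shape of the result landing in the lakehouse table is the same.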
Pipelines
Pipelines are the orchestration backbone of Fabric, based on Azure Data Factory. They coordinate multiple activities into a single workflow.
Common pipeline activities
| Activity | What It Does |
|---|---|
| Copy activity | Moves data from source to destination (the most common activity) |
| Dataflow activity | Runs a Dataflow Gen2 as a step in the pipeline |
| Notebook activity | Runs a Spark notebook |
| Stored procedure | Executes a warehouse stored procedure |
| For Each / If Condition | Control flow: loops and branching |
| Web activity | Calls a REST API |
| Wait | Pauses for a specified duration |
When to choose Pipelines
- You need to orchestrate multiple steps (copy → transform → load → notify)
- You need scheduling with retry logic and error handling
- You need to parameterise workflows (same pipeline, different source/target per run)
- You need to coordinate notebooks, dataflows, and stored procedures in sequence
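Under the hood, a pipeline is stored as an Azure Data Factory-style JSON definition. The fragment below is a simplified, illustrative sketch of how retry policy and step ordering are expressed; the exact property and activity type names in Fabric's exported definitions may differ.

```json
{
  "name": "nightly_ingestion",
  "activities": [
    {
      "name": "CopyFromBlob",
      "type": "Copy",
      "policy": { "retry": 3, "retryIntervalInSeconds": 60 }
    },
    {
      "name": "TransformSilver",
      "type": "Notebook",
      "dependsOn": [
        { "activity": "CopyFromBlob", "dependencyConditions": ["Succeeded"] }
      ]
    }
  ]
}
```

The `dependsOn` condition is what gives you sequencing ("run the notebook only if the copy succeeded"), and `policy.retry` is the built-in retry logic from the bullet list above.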
Scenario: Anita's nightly ingestion pipeline
Anita at FreshCart builds a pipeline that runs every night at midnight:
- Copy activity: Copy CSV files from Azure Blob Storage (2,000 stores) into lakehouse Bronze tables
- Notebook activity: Run a PySpark notebook that deduplicates, validates, and loads Silver tables
- Stored procedure activity: Call a warehouse stored procedure to rebuild Gold-layer aggregates
- Web activity: Send a Slack notification when the pipeline completes
If Step 2 fails, the pipeline retries 3 times. If all retries fail, it sends an alert to the on-call team.
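The retry-then-alert behaviour can be sketched as ordinary Python control flow. `run_with_retry` and `silver_notebook` are hypothetical stand-ins; in Fabric you configure this declaratively on the activity's retry settings rather than writing code.

```python
import time

def run_with_retry(step, max_retries=3, delay_seconds=0):
    """Run a pipeline step, retrying on failure; re-raise after the last attempt."""
    for attempt in range(1, max_retries + 1):
        try:
            return step()
        except Exception:
            if attempt == max_retries:
                raise
            time.sleep(delay_seconds)

# Simulated notebook step (Step 2) that fails twice, then succeeds.
calls = {"n": 0}
def silver_notebook():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient Spark failure")
    return "silver tables loaded"

try:
    result = run_with_retry(silver_notebook)
except RuntimeError:
    result = "alert on-call team"  # the Web-activity notification path

print(result)  # succeeds on the third attempt
```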
Notebooks
Spark notebooks give you full code control over data ingestion and transformation.
When to choose Notebooks
- Complex transformations: data quality rules, custom parsing, API pagination
- Semi-structured data: JSON, XML, nested structures that need flattening
- Machine learning prep: feature engineering, data sampling
- Exploratory work: ad-hoc investigation before building production pipelines
Common ingestion patterns in notebooks
```python
# Read CSV files from OneLake into a Spark DataFrame
# (the `spark` session is pre-created in a Fabric notebook)
df = spark.read.format("csv") \
    .option("header", "true") \
    .option("inferSchema", "true") \
    .load("Files/raw/pos_transactions/*.csv")

# Basic transformations
from pyspark.sql.functions import col, to_date, when

df_clean = df \
    .filter(col("amount") > 0) \
    .withColumn("transaction_date", to_date(col("date_string"), "yyyy-MM-dd")) \
    .withColumn("category", when(col("dept_code") == "GR", "Grocery")
                .when(col("dept_code") == "HW", "Hardware")
                .otherwise("Other"))

# Write to a Delta table in the lakehouse
df_clean.write.format("delta") \
    .mode("append") \
    .saveAsTable("silver_transactions")
```
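The "API pagination" case from the bullet list above can be sketched without a real network call. `fetch_page` is a stand-in for a paginated REST endpoint (invented for illustration); in a real notebook you would call the API with an HTTP client and then hand the accumulated records to Spark.

```python
def fetch_page(page: int, page_size: int = 2):
    """Stand-in for one page of a paginated REST endpoint (fake data)."""
    data = ["tx1", "tx2", "tx3", "tx4", "tx5"]
    start = page * page_size
    return data[start:start + page_size]

def fetch_all():
    """Keep requesting pages until an empty page signals the end."""
    records, page = [], 0
    while True:
        batch = fetch_page(page)
        if not batch:
            break
        records.extend(batch)
        page += 1
    return records

all_records = fetch_all()
print(len(all_records))  # all five records collected across three pages
```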
COPY INTO
The fastest way to bulk-load data into a Fabric Warehouse:
```sql
COPY INTO dbo.daily_positions
FROM 'https://onelake.dfs.fabric.microsoft.com/workspace/lakehouse.Lakehouse/Files/exports/positions.parquet'
WITH (
    FILE_TYPE = 'PARQUET',
    CREDENTIAL = (IDENTITY = 'Shared Access Signature', SECRET = '<sas-token>')
);
```
When to choose COPY INTO
- Loading Parquet or CSV files directly into warehouse tables
- You need high throughput for large batch loads
- Your workflow is SQL-centric (no Spark, no Power Query)
Check your understanding
Dr. Sarah at Pacific Health needs to ingest daily patient survey data from a third-party SaaS tool. The data needs light cleaning (rename columns, filter test records, convert types). Her team has no coding skills. Which tool should she use?
Anita at FreshCart needs to orchestrate a nightly workflow: (1) copy CSV files from Azure Blob Storage, (2) run a PySpark notebook for transformations, (3) call a warehouse stored procedure for aggregations, (4) send a notification. Which tool coordinates these four steps?
🎬 Video coming soon
Next up: Star Schema Design, the data modelling pattern that underpins every high-performance lakehouse and warehouse.