
DP-700 Study Guide

Domain 1: Implement and Manage an Analytics Solution

  • Workspace Settings: Your Fabric Foundation
  • Version Control: Git in Fabric
  • Deployment Pipelines: Dev to Production
  • Access Controls: Who Gets In
  • Data Security: Control Who Sees What
  • Governance: Labels, Endorsement & Audit
  • Orchestration: Pick the Right Tool
  • Pipeline Patterns: Parameters & Expressions

Domain 2: Ingest and Transform Data

  • Delta Lake: The Heart of Fabric Free
  • Loading Patterns: Full, Incremental & Streaming Free
  • Dimensional Modeling: Prep for Analytics Free
  • Data Stores & Tools: Make the Right Choice Free
  • OneLake Shortcuts: Data Without Duplication
  • Mirroring: Real-Time Database Replication
  • PySpark Transformations: Code Your Pipeline
  • Transform Data with SQL & KQL
  • Eventstreams & Spark Streaming: Real-Time Ingestion
  • Real-Time Intelligence: KQL & Windowing

Domain 3: Monitor and Optimize an Analytics Solution

  • Monitoring & Alerts: Catch Problems Early
  • Troubleshoot Pipelines & Dataflows
  • Troubleshoot Notebooks & SQL
  • Troubleshoot Streaming & Shortcuts
  • Optimize Lakehouse Tables: Delta Tuning
  • Optimize Spark: Speed Up Your Code
  • Optimize Pipelines & Warehouses
  • Optimize Streaming: Real-Time Performance

Domain 2: Ingest and Transform Data · ⏱ ~14 min read

Transform Data with SQL & KQL

Use T-SQL in Fabric warehouses and KQL in Eventhouses to transform, aggregate, and shape data: stored procedures, views, materialized queries, and time-series patterns.

Two query languages, two data stores

☕ Simple explanation

Think of two different kitchens with different tools.

The warehouse kitchen uses T-SQL, the same language database admins have used for decades. It excels at joins, stored procedures, and creating views for BI reports. It's precise, powerful, and familiar.

The Eventhouse kitchen uses KQL (Kusto Query Language), a language built for speed on time-series data. It excels at slicing millions of events by time windows, calculating percentiles, and rendering real-time dashboards. It reads left to right like a pipeline.

Both transform data. The difference is what data they work on and how they think about it.

T-SQL in Fabric warehouses supports full DML (INSERT, UPDATE, DELETE, MERGE) and DDL (CREATE TABLE, ALTER, DROP), plus views, stored procedures, and functions. It's the primary tool for relational transformations and dimensional model maintenance.

KQL in Fabric Eventhouses is a read-optimised query language designed for log analytics and time-series data. It uses a pipe-based syntax where each operator transforms the output of the previous one. KQL supports powerful time-series functions (bin, summarize, make-series), rendering operators, and materialized views for pre-aggregation.

T-SQL transformations in warehouses

Views for reusable transformations

-- Create a view that denormalizes orders with customer info
CREATE VIEW vw_OrdersWithCustomers AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.region,
    o.quantity,
    o.revenue
FROM FactOrders o
INNER JOIN DimCustomer c ON o.customer_key = c.customer_key
WHERE c.is_current = 1;   -- SCD Type 2: only current dimension rows
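
Once defined, the view is queried like any other table, so report queries stay short and the join logic lives in one place. A minimal usage sketch (the date filter and aggregation are illustrative, not from the original):

```sql
-- Query the view like a table; the join and SCD filter are applied implicitly
SELECT region, SUM(revenue) AS total_revenue
FROM vw_OrdersWithCustomers
WHERE order_date >= '2026-01-01'   -- illustrative filter
GROUP BY region
ORDER BY total_revenue DESC;
```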

Stored procedures for repeatable ETL

-- Stored procedure: load and deduplicate daily orders
CREATE PROCEDURE usp_LoadDailyOrders
    @load_date DATE
AS
BEGIN
    -- Step 1: Stage new data
    INSERT INTO staging.DailyOrders
    SELECT * FROM external_source.Orders
    WHERE order_date = @load_date;

    -- Step 2: Deduplicate (keep latest per order_id)
    WITH RankedOrders AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY order_id
                ORDER BY modified_date DESC
            ) AS rn
        FROM staging.DailyOrders
    )
    DELETE FROM RankedOrders WHERE rn > 1;

    -- Step 3: MERGE into fact table
    MERGE INTO FactOrders AS target
    USING staging.DailyOrders AS source
    ON target.order_id = source.order_id
    WHEN MATCHED THEN
        UPDATE SET target.quantity = source.quantity,
                   target.revenue = source.revenue,
                   target.status = source.status
    WHEN NOT MATCHED THEN
        INSERT (order_id, order_date, customer_key, quantity, revenue, status)
        VALUES (source.order_id, source.order_date, source.customer_key,
                source.quantity, source.revenue, source.status);

    -- Step 4: Clean staging
    TRUNCATE TABLE staging.DailyOrders;
END;
ℹ️ What's happening: The stored procedure explained

This stored procedure runs four steps in order:

  1. Stage: load new data from the external source into a staging table
  2. Deduplicate: use ROW_NUMBER to rank duplicates and delete all but the latest
  3. MERGE: upsert into the fact table (update matches, insert new)
  4. Clean: truncate the staging table for the next run

The pipeline calls this procedure with a date parameter: EXEC usp_LoadDailyOrders @load_date = '2026-04-21'

Handling duplicates with T-SQL

-- Find duplicates
SELECT order_id, COUNT(*) as dupe_count
FROM FactOrders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Remove duplicates (keep the row with latest modified_date)
WITH Dupes AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY modified_date DESC) AS rn
    FROM FactOrders
)
DELETE FROM Dupes WHERE rn > 1;

KQL transformations in Eventhouses

KQL uses a pipe syntax: each line transforms the output of the previous line.

Basic KQL patterns

// Count events by hour for the last 24 hours
OrderEvents
| where Timestamp > ago(24h)                    // Filter: last 24 hours
| summarize EventCount = count() by bin(Timestamp, 1h)  // Group by hour
| order by Timestamp asc                         // Sort chronologically
| render timechart                               // Visualize as line chart

Aggregation patterns

// Revenue by region, with percentiles
OrderEvents
| where EventType == "purchase"
| summarize
    TotalRevenue = sum(Amount),
    AvgRevenue = avg(Amount),
    P95Revenue = percentile(Amount, 95),
    OrderCount = count()
    by Region
| order by TotalRevenue desc

Time-series functions

// Create a time series of hourly order counts (fills gaps with 0)
OrderEvents
| make-series OrderCount = count() on Timestamp step 1h
| render timechart
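
When from/to are omitted as above, make-series derives the range from the earliest and latest timestamps in the data. You can pin the window and the gap-fill value explicitly; a sketch assuming the same OrderEvents table:

```kusto
// Hourly order counts over a fixed 7-day window; empty buckets become 0
OrderEvents
| make-series OrderCount = count() default = 0
    on Timestamp from ago(7d) to now() step 1h
| render timechart
```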
💡 Scenario: Zoe's clickstream analysis

Zoe at WaveMedia needs to find the top 10 most-watched videos in the last hour, with average watch time and drop-off rate:

PlaybackEvents
| where Timestamp > ago(1h)
| where EventType == "video_play"
| summarize
    AvgWatchSeconds = avg(WatchDuration),
    TotalViews = count(),
    DropOffRate = countif(WatchDuration < 30) * 100.0 / count()
    by VideoId, VideoTitle
| top 10 by TotalViews desc

This runs in under 2 seconds on 50M events; the same scan-and-aggregate in T-SQL on a warehouse would typically take minutes.

T-SQL vs KQL

T-SQL for relational work; KQL for time-series speed
  • Syntax style: T-SQL is set-based (SELECT ... FROM ... WHERE ...); KQL is pipe-based (Table | where | summarize | render)
  • Write support: T-SQL has full DML (INSERT, UPDATE, DELETE, MERGE); KQL is limited to ingestion (no row-level UPDATE/DELETE)
  • Best for: T-SQL suits relational transforms, dimensional modeling, and stored procs; KQL suits time-series analysis, log exploration, and streaming aggregations
  • Time functions: T-SQL has DATEADD, DATEDIFF, FORMAT; KQL has ago(), bin(), datetime_diff(), make-series
  • Visualization: T-SQL has none built in (use Power BI); KQL has the render operator (timechart, barchart, piechart)
  • Stored procedures: T-SQL yes; KQL has stored functions only (no side effects)
  • Views: T-SQL yes (standard views, with materialized views planned); KQL has materialized views (pre-computed aggregations)
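
The last two rows deserve an example. In KQL, pre-aggregation is done with a materialized view rather than a stored procedure; a hedged sketch (the view name and aggregation are illustrative, reusing the OrderEvents table from earlier):

```kusto
// Materialized view: continuously maintained hourly revenue per region
.create materialized-view HourlyRevenue on table OrderEvents
{
    OrderEvents
    | summarize TotalRevenue = sum(Amount) by Region, bin(Timestamp, 1h)
}
```

Queries against HourlyRevenue then read the pre-computed aggregates instead of rescanning raw events.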

Handling data quality issues

Missing data

  • Replace with default: T-SQL ISNULL(column, default_value); PySpark fillna({"column": default})
  • Remove rows: T-SQL WHERE column IS NOT NULL; PySpark dropna(subset=["column"])
  • Flag for review: T-SQL CASE WHEN column IS NULL THEN 'missing' END; PySpark when(col("column").isNull(), "missing")
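
The three strategies can be combined in a single pass. A minimal T-SQL sketch against the FactOrders columns used earlier (the default value of 0 is an assumption, not from the original):

```sql
-- Default missing quantities, flag missing revenue, drop rows with no date
SELECT
    order_id,
    ISNULL(quantity, 0) AS quantity,           -- strategy 1: replace with default
    revenue,
    CASE WHEN revenue IS NULL THEN 'missing'
         ELSE 'ok' END AS revenue_quality      -- strategy 3: flag for review
FROM FactOrders
WHERE order_date IS NOT NULL;                  -- strategy 2: remove unusable rows
```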

Duplicate data

  • Detect: T-SQL GROUP BY key HAVING COUNT(*) > 1; PySpark groupBy("key").count().filter("count > 1")
  • Remove: T-SQL ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) then DELETE WHERE rn > 1; PySpark dropDuplicates(["key"]) or a window + filter
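
Both the T-SQL window pattern and PySpark's window + filter implement the same rule: keep the newest row per key. A tiny pure-Python sketch of that rule (not Fabric code, just the logic, with illustrative order rows):

```python
# Keep only the latest row per key, mirroring
# ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) ... keep rn = 1
def latest_per_key(rows, key, date):
    best = {}
    for row in rows:
        k = row[key]
        # Keep the row with the greatest date value for each key
        if k not in best or row[date] > best[k][date]:
            best[k] = row
    return list(best.values())

orders = [
    {"order_id": 1, "modified_date": "2026-04-20", "status": "pending"},
    {"order_id": 1, "modified_date": "2026-04-21", "status": "shipped"},
    {"order_id": 2, "modified_date": "2026-04-19", "status": "pending"},
]

deduped = latest_per_key(orders, key="order_id", date="modified_date")
# Order 1 keeps only its 2026-04-21 "shipped" version; order 2 is untouched
```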

Question

What is the key difference between T-SQL and KQL syntax?


Answer

T-SQL is set-based (SELECT ... FROM ... WHERE ...). KQL is pipe-based (Table | where ... | summarize ... | render ...). Each KQL operator transforms the output of the previous one, reading left-to-right.


Question

How do you remove duplicate rows in T-SQL?


Answer

Use ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC) to rank duplicates, then DELETE WHERE rn > 1. This keeps the most recent version of each duplicated key.


Question

What is KQL's bin() function?


Answer

bin() rounds timestamp values to a time bucket: bin(Timestamp, 1h) groups events into hourly buckets. Essential for time-series aggregations like 'count events per hour.'



Knowledge Check

A warehouse table has duplicate order records due to a pipeline bug. Each order_id appears 2-3 times with different modified_date values. Carlos needs to keep only the most recent version of each order. Which T-SQL pattern is correct?

Knowledge Check

Zoe needs to calculate the number of video plays per 15-minute interval for the last 6 hours. Which KQL query is correct?

🎬 Video coming soon

Next up: Eventstreams & Spark Streaming: Real-Time Ingestion. Process streaming data as it arrives using Eventstreams and Spark Structured Streaming.



© 2026 Sutheesh. All rights reserved.

Guided is an independent study resource and is not affiliated with, endorsed by, or officially connected to Microsoft. Microsoft, Azure, and related trademarks are property of Microsoft Corporation. Always verify information against Microsoft Learn.