# Transform Data with SQL & KQL
Use T-SQL in Fabric warehouses and KQL in Eventhouses to transform, aggregate, and shape data: stored procedures, views, materialized queries, and time-series patterns.
## Two query languages, two data stores
Think of two different kitchens with different tools.
The warehouse kitchen uses T-SQL, the same language database admins have used for decades. It excels at joins, stored procedures, and creating views for BI reports. It's precise, powerful, and familiar.

The Eventhouse kitchen uses KQL (Kusto Query Language), a language built for speed on time-series data. It excels at slicing millions of events by time windows, calculating percentiles, and rendering real-time dashboards. It reads left-to-right like a pipeline.
Both transform data. The difference is what data they work on and how they think about it.
## T-SQL transformations in warehouses
### Views for reusable transformations
```sql
-- Create a view that denormalizes orders with customer info
CREATE VIEW vw_OrdersWithCustomers AS
SELECT
    o.order_id,
    o.order_date,
    c.customer_name,
    c.region,
    o.quantity,
    o.revenue
FROM FactOrders o
INNER JOIN DimCustomer c ON o.customer_key = c.customer_key
WHERE c.is_current = 1; -- SCD Type 2: only current dimension rows
```
### Stored procedures for repeatable ETL
```sql
-- Stored procedure: load and deduplicate daily orders
CREATE PROCEDURE usp_LoadDailyOrders
    @load_date DATE
AS
BEGIN
    -- Step 1: Stage new data
    INSERT INTO staging.DailyOrders
    SELECT * FROM external_source.Orders
    WHERE order_date = @load_date;

    -- Step 2: Deduplicate (keep latest per order_id)
    WITH RankedOrders AS (
        SELECT *,
            ROW_NUMBER() OVER (
                PARTITION BY order_id
                ORDER BY modified_date DESC
            ) AS rn
        FROM staging.DailyOrders
    )
    DELETE FROM RankedOrders WHERE rn > 1;

    -- Step 3: MERGE into fact table
    MERGE INTO FactOrders AS target
    USING staging.DailyOrders AS source
        ON target.order_id = source.order_id
    WHEN MATCHED THEN
        UPDATE SET target.quantity = source.quantity,
                   target.revenue = source.revenue,
                   target.status = source.status
    WHEN NOT MATCHED THEN
        INSERT (order_id, order_date, customer_key, quantity, revenue, status)
        VALUES (source.order_id, source.order_date, source.customer_key,
                source.quantity, source.revenue, source.status);

    -- Step 4: Clean staging
    TRUNCATE TABLE staging.DailyOrders;
END;
```
#### What's happening: the stored procedure explained
This stored procedure runs four steps in order:
- Stage: load new data from the external source into a staging table
- Deduplicate: use `ROW_NUMBER` to rank duplicates and delete all but the latest
- MERGE: upsert into the fact table (update matches, insert new)
- Clean: truncate the staging table for the next run
The pipeline calls this procedure with a date parameter: `EXEC usp_LoadDailyOrders @load_date = '2026-04-21'`
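The deduplication step (keep only the latest row per `order_id`) is the heart of this procedure. A minimal Python sketch of the same keep-latest logic, with made-up field names mirroring the example, can make the pattern concrete:

```python
# Keep only the most recent record per order_id, mirroring the
# ROW_NUMBER() ... PARTITION BY order_id ORDER BY modified_date DESC pattern.
def deduplicate(rows):
    latest = {}
    for row in rows:
        key = row["order_id"]
        # Replace the stored row only if this one is newer
        if key not in latest or row["modified_date"] > latest[key]["modified_date"]:
            latest[key] = row
    return list(latest.values())

staged = [
    {"order_id": 1, "modified_date": "2026-04-21T08:00", "quantity": 2},
    {"order_id": 1, "modified_date": "2026-04-21T09:30", "quantity": 3},
    {"order_id": 2, "modified_date": "2026-04-21T08:15", "quantity": 1},
]
print(deduplicate(staged))  # order 1 keeps quantity 3; order 2 unchanged
```

The SQL version does the same thing declaratively: the window function assigns rank 1 to the newest row in each partition, and the `DELETE` removes everything else.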
### Handling duplicates with T-SQL
```sql
-- Find duplicates
SELECT order_id, COUNT(*) AS dupe_count
FROM FactOrders
GROUP BY order_id
HAVING COUNT(*) > 1;

-- Remove duplicates (keep the row with latest modified_date)
WITH Dupes AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY order_id ORDER BY modified_date DESC) AS rn
    FROM FactOrders
)
DELETE FROM Dupes WHERE rn > 1;
```
## KQL transformations in Eventhouses
KQL uses a pipe syntax: each line transforms the output of the previous line.
### Basic KQL patterns
```kusto
// Count events by hour for the last 24 hours
OrderEvents
| where Timestamp > ago(24h)                              // Filter: last 24 hours
| summarize EventCount = count() by bin(Timestamp, 1h)    // Group by hour
| order by Timestamp asc                                  // Sort chronologically
| render timechart                                        // Visualize as line chart
```
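The key move in the query above is `bin(Timestamp, 1h)`, which rounds each timestamp down to its hour bucket before counting. A minimal Python sketch of that bucketing, with made-up event times:

```python
from datetime import datetime
from collections import Counter

def bin_hour(ts: datetime) -> datetime:
    # Analog of KQL bin(Timestamp, 1h): floor the timestamp to the hour
    return ts.replace(minute=0, second=0, microsecond=0)

events = [
    datetime(2026, 4, 21, 10, 5),
    datetime(2026, 4, 21, 10, 42),
    datetime(2026, 4, 21, 11, 3),
]
# Analog of: summarize EventCount = count() by bin(Timestamp, 1h)
counts = Counter(bin_hour(ts) for ts in events)
print(counts)  # two events in the 10:00 bucket, one in the 11:00 bucket
```

Grouping by a floored timestamp is what turns a raw event stream into evenly spaced points a timechart can draw.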
### Aggregation patterns
```kusto
// Revenue by region, with percentiles
OrderEvents
| where EventType == "purchase"
| summarize
    TotalRevenue = sum(Amount),
    AvgRevenue = avg(Amount),
    P95Revenue = percentile(Amount, 95),
    OrderCount = count()
    by Region
| order by TotalRevenue desc
```
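`summarize` computes several aggregates per group in one pass. A minimal Python sketch of the same per-region sum/avg/p95, using a simple nearest-rank percentile (KQL's `percentile()` is an approximation, so exact values can differ slightly; the data is made up):

```python
import math
from collections import defaultdict

def percentile(values, p):
    # Nearest-rank percentile: the smallest value covering p% of the data
    ordered = sorted(values)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

purchases = [("West", 10.0), ("West", 30.0), ("East", 20.0)]

# Analog of: summarize ... by Region
by_region = defaultdict(list)
for region, amount in purchases:
    by_region[region].append(amount)

for region, amounts in by_region.items():
    print(region, sum(amounts), sum(amounts) / len(amounts), percentile(amounts, 95))
```

The p95 answers "what does a high-but-not-extreme order look like?", which is more robust to outliers than the maximum.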
### Time-series functions
```kusto
// Create a time series of hourly order counts (fills gaps with 0)
OrderEvents
| make-series OrderCount = count() on Timestamp step 1h
| render timechart
```
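What distinguishes `make-series` from a plain `summarize by bin(...)` is the gap filling: it emits one point per step even for hours with no events, defaulting them to 0. A minimal Python sketch of that behavior, with made-up events and range:

```python
from datetime import datetime, timedelta
from collections import Counter

def make_hourly_series(timestamps, start, end):
    # Count events per hour bucket, then emit every bucket in [start, end),
    # filling empty buckets with 0 -- mirroring make-series' gap filling.
    counts = Counter(ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps)
    series = []
    t = start
    while t < end:
        series.append((t, counts.get(t, 0)))
        t += timedelta(hours=1)
    return series

events = [datetime(2026, 4, 21, 10, 15), datetime(2026, 4, 21, 12, 5)]
series = make_hourly_series(events, datetime(2026, 4, 21, 10), datetime(2026, 4, 21, 13))
print(series)  # the empty 11:00 bucket appears with count 0
```

Without the zero-filled gaps, a timechart would silently skip quiet hours and distort the line.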
### Scenario: Zoe's clickstream analysis
Zoe at WaveMedia needs to find the top 10 most-watched videos in the last hour, with average watch time and drop-off rate:
```kusto
PlaybackEvents
| where Timestamp > ago(1h)
| where EventType == "video_play"
| summarize
    AvgWatchSeconds = avg(WatchDuration),
    TotalViews = count(),
    DropOffRate = countif(WatchDuration < 30) * 100.0 / count()
    by VideoId, VideoTitle
| top 10 by TotalViews desc
```

This runs in under 2 seconds on 50M events. The same query in T-SQL on a warehouse would take minutes.
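The interesting aggregate here is `DropOffRate`: `countif(WatchDuration < 30) * 100.0 / count()` counts only rows matching a condition, then divides by the total. A minimal Python sketch of the same per-video aggregation, with made-up plays and the example's 30-second cutoff:

```python
from collections import defaultdict

def video_stats(plays, drop_off_threshold=30):
    # plays: (video_id, watch_seconds) pairs
    by_video = defaultdict(list)
    for video_id, seconds in plays:
        by_video[video_id].append(seconds)
    stats = {}
    for video_id, durations in by_video.items():
        stats[video_id] = {
            "TotalViews": len(durations),
            "AvgWatchSeconds": sum(durations) / len(durations),
            # countif(d < threshold) * 100.0 / count()
            "DropOffRate": sum(d < drop_off_threshold for d in durations) * 100.0 / len(durations),
        }
    return stats

plays = [("v1", 10), ("v1", 90), ("v2", 5), ("v2", 12)]
print(video_stats(plays))  # v1 drop-off 50%, v2 drop-off 100%
```

Note the `* 100.0` in the KQL: without the float literal, integer division would round the rate down to 0 for most videos.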
## T-SQL vs KQL
| Factor | T-SQL (Warehouse) | KQL (Eventhouse) |
|---|---|---|
| Syntax style | `SELECT ... FROM ... WHERE ...` (set-based) | `Table \| where \| summarize \| render` (pipe-based) |
| Write support | Full DML (`INSERT`, `UPDATE`, `DELETE`, `MERGE`) | Limited (ingestion-oriented; no row-level `UPDATE`/`DELETE`) |
| Best for | Relational transforms, dimensional modeling, stored procs | Time-series analysis, log exploration, streaming aggregations |
| Time functions | `DATEADD`, `DATEDIFF`, `FORMAT` | `ago()`, `bin()`, `datetime_diff()`, `make-series` |
| Visualization | None built-in (use Power BI) | `render` operator (timechart, barchart, piechart) |
| Stored procedures? | Yes | Stored functions only (no side effects) |
| Views? | Yes (standard + materialized views planned) | Materialized views (pre-computed aggregations) |
## Handling data quality issues
### Missing data
| Strategy | T-SQL | PySpark |
|---|---|---|
| Replace with default | `ISNULL(column, default_value)` | `fillna({"column": default})` |
| Remove rows | `WHERE column IS NOT NULL` | `dropna(subset=["column"])` |
| Flag for review | `CASE WHEN column IS NULL THEN 'missing' END` | `when(col("column").isNull(), "missing")` |
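The three strategies in the table trade off data volume against traceability. A minimal plain-Python sketch of each, on a made-up two-row dataset (the function and column names are illustrative, not a library API):

```python
def replace_default(rows, column, default):
    # ISNULL(column, default) / fillna analog
    return [{**r, column: r[column] if r[column] is not None else default} for r in rows]

def drop_missing(rows, column):
    # WHERE column IS NOT NULL / dropna analog
    return [r for r in rows if r[column] is not None]

def flag_missing(rows, column):
    # CASE WHEN column IS NULL THEN 'missing' END analog
    return [{**r, "quality_flag": "missing" if r[column] is None else None} for r in rows]

rows = [{"region": "West"}, {"region": None}]
print(replace_default(rows, "region", "unknown"))  # keeps both rows, fills the gap
print(drop_missing(rows, "region"))                # keeps only the complete row
print(flag_missing(rows, "region"))                # keeps both, marks the bad one
```

Flagging is the most conservative option: nothing is lost or guessed, and the flagged rows can be reviewed later.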
### Duplicate data
| Strategy | T-SQL | PySpark |
|---|---|---|
| Detect | `GROUP BY key HAVING COUNT(*) > 1` | `groupBy("key").count().filter("count > 1")` |
| Remove | `ROW_NUMBER() OVER (PARTITION BY key ORDER BY date DESC)` then `DELETE WHERE rn > 1` | `dropDuplicates(["key"])` or window + filter |
Check your understanding:

- A warehouse table has duplicate order records due to a pipeline bug. Each order_id appears 2-3 times with different modified_date values. Carlos needs to keep only the most recent version of each order. Which T-SQL pattern is correct?
- Zoe needs to calculate the number of video plays per 15-minute interval for the last 6 hours. Which KQL query is correct?
Next up is Eventstreams & Spark Streaming: Real-Time Ingestion, which covers processing streaming data as it arrives using Eventstreams and Spark Structured Streaming.