# Data Integration & Analytics

Azure Data Factory, Synapse Analytics, Microsoft Fabric, and Stream Analytics — design data pipelines that move, transform, and analyse data across your Azure estate.
## Why data integration design matters

Data is useless if it's stuck in silos. Integration means connecting data from different sources (databases, APIs, files, streams), transforming it into a useful shape, and delivering it to where analysis happens.

Two patterns:
- Batch integration — move yesterday's data overnight (Data Factory)
- Real-time streaming — process events as they happen (Stream Analytics, Event Hubs)
For analysis, think Synapse Analytics for big data warehousing, Microsoft Fabric as the all-in-one analytics platform, and Power BI for business reporting.
## Data integration services
| Service | Pattern | Data Volume | Latency | Best For |
|---|---|---|---|---|
| Azure Data Factory | Batch ETL/ELT | Large (TB+) | Minutes to hours | Scheduled data movement and transformation pipelines |
| Synapse Pipelines | Batch ETL/ELT (same engine as ADF) | Large (TB+) | Minutes to hours | When analytics and integration are in the same workspace |
| Azure Stream Analytics | Real-time stream processing | Continuous | Sub-second to seconds | IoT telemetry, real-time dashboards, event-driven alerts |
| Azure Logic Apps | Workflow automation | Small to medium | Seconds to minutes | API integration, business workflows, event-triggered actions |
| Azure Event Hubs | Event ingestion (not processing) | Millions of events/sec | Milliseconds (ingestion) | High-throughput event ingestion before processing |
### ETL vs ELT
| Pattern | How It Works | Best For |
|---|---|---|
| ETL (Extract, Transform, Load) | Transform data BEFORE loading into destination | When destination has limited compute (smaller databases) |
| ELT (Extract, Load, Transform) | Load raw data first, transform IN the destination | When destination has powerful compute (Synapse, Databricks, Fabric) |
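The difference in the table above is purely about *where* the transform runs. A minimal sketch in plain Python makes the two orderings concrete — the rows, field names, and `transform` step here are invented for illustration, not part of any Azure API:

```python
# Illustrative contrast of ETL vs ELT using in-memory "source" and "warehouse".
# In a real pipeline the extract/load would be Data Factory copy activities and
# the ELT transform would run on destination compute (Synapse, Databricks, Fabric).

source_rows = [
    {"order_id": 1, "amount": "19.99", "country": "gb"},
    {"order_id": 2, "amount": "5.00", "country": "de"},
]

def transform(row):
    """Shared cleaning step: cast types, normalise country codes."""
    return {"order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": row["country"].upper()}

# ETL: transform BEFORE loading -- the destination only ever sees clean rows.
etl_warehouse = [transform(r) for r in source_rows]

# ELT: load raw rows first, then transform IN the destination,
# keeping the raw copy for cheap reprocessing (the data-lake advantage).
elt_raw_zone = list(source_rows)                      # landed as-is
elt_warehouse = [transform(r) for r in elt_raw_zone]  # transformed by destination compute

print(etl_warehouse == elt_warehouse)  # same end state, different place of transform
```

Note that ELT keeps the untouched raw copy (`elt_raw_zone`), which is why it pairs naturally with a data lake: if the transform logic changes, you re-run it over raw data instead of re-extracting from source systems.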
Priya's integration design: GlobalTech uses the ELT pattern:
- Extract: Data Factory copies raw data from 15 source systems (SQL, SAP, files)
- Load: Raw data lands in ADLS Gen2 (bronze/raw layer)
- Transform: Synapse Spark transforms raw → curated (silver) → aggregated (gold)
- Serve: Power BI connects to the gold layer for executive dashboards
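The medallion flow above can be sketched as three small functions — bronze rows land as-is, silver cleans and conforms, gold aggregates for reporting. In GlobalTech's design these steps run on Synapse Spark over ADLS Gen2; here plain Python (with made-up sample data) stands in so the shape of each layer is visible:

```python
# Medallion-architecture sketch: raw (bronze) -> cleaned (silver) -> aggregated (gold).

bronze = [  # raw landing zone: string-typed, possibly dirty
    {"region": "EMEA", "revenue": "1200.50"},
    {"region": "emea", "revenue": "300.00"},
    {"region": "APAC", "revenue": "bad-value"},
]

def to_silver(rows):
    """Clean and conform: drop unparseable rows, normalise keys and types."""
    out = []
    for r in rows:
        try:
            out.append({"region": r["region"].upper(),
                        "revenue": float(r["revenue"])})
        except ValueError:
            pass  # a real pipeline would quarantine bad rows, not silently drop them
    return out

def to_gold(rows):
    """Aggregate for reporting: total revenue per region (what Power BI reads)."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["revenue"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EMEA': 1500.5}
```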
## Data analytics services
| Service | Type | Compute Model | Best For |
|---|---|---|---|
| Synapse Dedicated SQL Pool | Data warehouse | Provisioned (DWU — always-on) | Large-scale, predictable warehouse workloads |
| Synapse Serverless SQL Pool | Query-on-demand | Serverless (pay per query) | Ad-hoc queries on data lake files without loading |
| Synapse Spark | Big data processing | Spark clusters (auto-scale) | Data engineering, ML, complex transformations |
| Microsoft Fabric | Unified analytics SaaS | Capacity-based | All-in-one: ingestion, transformation, warehouse, reporting |
| Azure Databricks | Spark-based analytics | Clusters (auto-scale) | Advanced ML, data science, delta lake architecture |
| Power BI | Business intelligence | Capacity or Pro licenses | Dashboards, reports, self-service analytics |
Marcus's analytics choice: NovaSaaS adopted Microsoft Fabric for their analytics stack:
- OneLake as the unified data lake (replacing separate ADLS accounts)
- Data Factory in Fabric for ingestion pipelines
- Fabric Data Warehouse for SQL-based analytics (serverless, no cluster management)
- Power BI embedded in their SaaS product for customer-facing dashboards
## Design decision: Synapse vs Fabric vs Databricks
**Choose Synapse Analytics when:**
- You need dedicated SQL pool for predictable data warehouse workloads
- Youβre already invested in the Synapse ecosystem
- You need Spark AND SQL in one workspace
**Choose Microsoft Fabric when:**
- You want a unified SaaS platform (no infrastructure management)
- Your team uses Power BI heavily (Fabric integrates natively)
- You want OneLake as a single data lake across all workloads
**Choose Azure Databricks when:**
- Advanced ML/data science is the primary workload
- You use Delta Lake architecture
- You need Spark ecosystem tools (MLflow, Delta Live Tables)
- Multi-cloud portability matters (Databricks runs on AWS/GCP too)
## Real-time streaming architecture
For scenarios needing sub-second processing:
Event Sources → Event Hubs (ingestion) → Stream Analytics (processing) → Outputs
| Component | Role | Scale |
|---|---|---|
| Event Hubs | Ingestion buffer — receives millions of events/sec | Partition-based, auto-inflate |
| Stream Analytics | SQL-like queries on streams — windowing, aggregation, joins | Streaming Units (auto-scale) |
| Outputs | Cosmos DB, SQL, Blob, Power BI, Functions | Multiple simultaneous outputs |
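Windowing is the core idea Stream Analytics adds on top of raw ingestion: events are grouped into time buckets and aggregated per bucket. A tumbling window (fixed-size, non-overlapping) can be simulated in a few lines of plain Python — the events and field names below are invented for illustration, roughly mirroring what a Stream Analytics `GROUP BY TumblingWindow(second, 10)` query would compute:

```python
# Tumbling-window count: fixed, non-overlapping 10-second buckets.
from collections import defaultdict

events = [  # (timestamp_seconds, device_id) -- hypothetical telemetry
    (1, "sensor-a"), (4, "sensor-b"), (9, "sensor-a"),
    (12, "sensor-a"), (18, "sensor-b"), (21, "sensor-a"),
]

WINDOW = 10  # window length in seconds

def tumbling_counts(events, window):
    """Count events per non-overlapping window, keyed by window start time."""
    counts = defaultdict(int)
    for ts, _device in events:
        window_start = (ts // window) * window  # bucket the timestamp
        counts[window_start] += 1
    return dict(counts)

print(tumbling_counts(events, WINDOW))  # {0: 3, 10: 2, 20: 1}
```

Hopping and sliding windows differ only in how buckets overlap; the bucketing step is what lets a stream processor emit continuous aggregates without ever holding the full stream in memory.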
Elena's real-time fraud detection:
- Card transactions → Event Hubs (millions/second ingestion)
- Stream Analytics applies fraud rules (unusual amounts, foreign locations, velocity checks)
- Suspicious transactions → Cosmos DB for investigation dashboard
- Alerts → Azure Functions → notify fraud team via Teams
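The fraud rules in Elena's pipeline boil down to simple per-transaction predicates. A minimal sketch — thresholds, field names, and the `is_suspicious` helper are all hypothetical, standing in for logic that would live in a Stream Analytics query or a downstream Function:

```python
# Per-transaction fraud rules: flag if ANY rule fires.

def is_suspicious(txn, recent_count_last_minute):
    """Return True if a card transaction trips any fraud rule."""
    rules = [
        txn["amount"] > 5000,                   # unusual amount
        txn["country"] != txn["home_country"],  # foreign location
        recent_count_last_minute > 10,          # velocity check (txns/minute on card)
    ]
    return any(rules)

ok = {"amount": 42.0, "country": "GB", "home_country": "GB"}
flagged = {"amount": 9000.0, "country": "RU", "home_country": "GB"}

print(is_suspicious(ok, recent_count_last_minute=2))       # False
print(is_suspicious(flagged, recent_count_last_minute=2))  # True
```

The velocity input is exactly where the windowing above earns its keep: the per-card count over the last minute is a streaming aggregate, computed continuously rather than queried from a database per transaction.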
## Knowledge check

GlobalTech needs to consolidate data from 15 source systems (SQL Server, SAP, CSV files) into a data lake for analytics. Data volumes are 2 TB daily. The analytics team wants to run Spark transformations and Power BI reports. Which architecture should Priya recommend?

FinSecure Bank processes 10 million card transactions per hour. They need real-time fraud detection with sub-second alerting. Suspicious transactions must be stored for investigation. Which architecture should Elena recommend?
Video coming soon

Domain 2 complete! You've designed relational databases, NoSQL with Cosmos DB, unstructured storage, and data integration pipelines.

Next up: design for when things go wrong — Recovery Objectives: RPO, RTO & SLA.