# Data Integration & Analytics

Azure Data Factory, Synapse Analytics, Microsoft Fabric, and Stream Analytics — design data pipelines that move, transform, and analyse data across your Azure estate.
## Why data integration design matters

Data is useless if it's stuck in silos. Integration means connecting data from different sources (databases, APIs, files, streams), transforming it into a useful shape, and delivering it to where analysis happens.

Two patterns:
- Batch integration — move yesterday's data overnight (Data Factory)
- Real-time streaming — process events as they happen (Stream Analytics, Event Hubs)
For analysis, think Synapse Analytics for big data warehousing, Microsoft Fabric as the all-in-one analytics platform, and Power BI for business reporting.
## Data integration services
| Service | Pattern | Data Volume | Latency | Best For |
|---|---|---|---|---|
| Azure Data Factory | Batch ETL/ELT | Large (TB+) | Minutes to hours | Scheduled data movement and transformation pipelines |
| Synapse Pipelines | Batch ETL/ELT (same engine as ADF) | Large (TB+) | Minutes to hours | When analytics and integration are in the same workspace |
| Azure Stream Analytics | Real-time stream processing | Continuous | Sub-second to seconds | IoT telemetry, real-time dashboards, event-driven alerts |
| Azure Logic Apps | Workflow automation | Small to medium | Seconds to minutes | API integration, business workflows, event-triggered actions |
| Azure Event Hubs | Event ingestion (not processing) | Millions of events/sec | Milliseconds (ingestion) | High-throughput event ingestion before processing |
### ETL vs ELT
| Pattern | How It Works | Best For |
|---|---|---|
| ETL (Extract, Transform, Load) | Transform data BEFORE loading into destination | When destination has limited compute (smaller databases) |
| ELT (Extract, Load, Transform) | Load raw data first, transform IN the destination | When destination has powerful compute (Synapse, Databricks, Fabric) |
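The difference in the table above is purely about *where* the transform runs. A minimal sketch in plain Python makes the two orderings concrete — the rows, field names, and `transform` step here are invented for illustration, not part of any Azure API:

```python
# Illustrative contrast of ETL vs ELT using in-memory "source" and "warehouse".
# In a real pipeline the extract/load would be Data Factory copy activities and
# the ELT transform would run on destination compute (Synapse, Databricks, Fabric).

source_rows = [
    {"order_id": 1, "amount": "19.99", "country": "gb"},
    {"order_id": 2, "amount": "5.00", "country": "de"},
]

def transform(row):
    """Shared cleaning step: cast types, normalise country codes."""
    return {"order_id": row["order_id"],
            "amount": float(row["amount"]),
            "country": row["country"].upper()}

# ETL: transform BEFORE loading -- the destination only ever sees clean rows.
etl_warehouse = [transform(r) for r in source_rows]

# ELT: load raw rows first, then transform IN the destination,
# keeping the raw copy for cheap reprocessing (the data-lake advantage).
elt_raw_zone = list(source_rows)                      # landed as-is
elt_warehouse = [transform(r) for r in elt_raw_zone]  # transformed by destination compute

print(etl_warehouse == elt_warehouse)  # same end state, different place of transform
```

Note that ELT keeps the untouched raw copy (`elt_raw_zone`), which is why it pairs naturally with a data lake: if the transform logic changes, you re-run it over raw data instead of re-extracting from source systems.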
Priya's integration design: GlobalTech uses the ELT pattern:
- Extract: Data Factory copies raw data from 15 source systems (SQL, SAP, files)
- Load: Raw data lands in ADLS Gen2 (bronze/raw layer)
- Transform: Synapse Spark transforms raw → curated (silver) → aggregated (gold)
- Serve: Power BI connects to the gold layer for executive dashboards
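The medallion flow above can be sketched as three small functions — bronze rows land as-is, silver cleans and conforms, gold aggregates for reporting. In GlobalTech's design these steps run on Synapse Spark over ADLS Gen2; here plain Python (with made-up sample data) stands in so the shape of each layer is visible:

```python
# Medallion-architecture sketch: raw (bronze) -> cleaned (silver) -> aggregated (gold).

bronze = [  # raw landing zone: string-typed, possibly dirty
    {"region": "EMEA", "revenue": "1200.50"},
    {"region": "emea", "revenue": "300.00"},
    {"region": "APAC", "revenue": "bad-value"},
]

def to_silver(rows):
    """Clean and conform: drop unparseable rows, normalise keys and types."""
    out = []
    for r in rows:
        try:
            out.append({"region": r["region"].upper(),
                        "revenue": float(r["revenue"])})
        except ValueError:
            pass  # a real pipeline would quarantine bad rows, not silently drop them
    return out

def to_gold(rows):
    """Aggregate for reporting: total revenue per region (what Power BI reads)."""
    totals = {}
    for r in rows:
        totals[r["region"]] = totals.get(r["region"], 0.0) + r["revenue"]
    return totals

silver = to_silver(bronze)
gold = to_gold(silver)
print(gold)  # {'EMEA': 1500.5}
```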
## Data analytics services
| Service | Type | Compute Model | Best For |
|---|---|---|---|
| Synapse Dedicated SQL Pool | Data warehouse | Provisioned (DWU — always-on) | Large-scale, predictable warehouse workloads |
| Synapse Serverless SQL Pool | Query-on-demand | Serverless (pay per query) | Ad-hoc queries on data lake files without loading |
| Synapse Spark | Big data processing | Spark clusters (auto-scale) | Data engineering, ML, complex transformations |
| Microsoft Fabric | Unified analytics SaaS | Capacity-based | All-in-one: ingestion, transformation, warehouse, reporting |
| Azure Databricks | Spark-based analytics | Clusters (auto-scale) | Advanced ML, data science, delta lake architecture |
| Power BI | Business intelligence | Capacity or Pro licenses | Dashboards, reports, self-service analytics |
Marcus's analytics choice: NovaSaaS adopted Microsoft Fabric for their analytics stack:
- OneLake as the unified data lake (replacing separate ADLS accounts)
- Data Factory in Fabric for ingestion pipelines
- Fabric Data Warehouse for SQL-based analytics (serverless, no cluster management)
- Power BI embedded in their SaaS product for customer-facing dashboards
## Design decision: Synapse vs Fabric vs Databricks
**Choose Synapse Analytics when:**
- You need dedicated SQL pool for predictable data warehouse workloads
- Youβre already invested in the Synapse ecosystem
- You need Spark AND SQL in one workspace
**Choose Microsoft Fabric when:**
- You want a unified SaaS platform (no infrastructure management)
- Your team uses Power BI heavily (Fabric integrates natively)
- You want OneLake as a single data lake across all workloads
**Choose Azure Databricks when:**
- Advanced ML/data science is the primary workload
- You use Delta Lake architecture
- You need Spark ecosystem tools (MLflow, Delta Live Tables)
- Multi-cloud portability matters (Databricks runs on AWS/GCP too)
## Real-time streaming architecture
For scenarios needing sub-second processing:
Event Sources → Event Hubs (ingestion) → Stream Analytics (processing) → Outputs
| Component | Role | Scale |
|---|---|---|
| Event Hubs | Ingestion buffer — receives millions of events/sec | Partition-based, auto-inflate |
| Stream Analytics | SQL-like queries on streams — windowing, aggregation, joins | Streaming Units (auto-scale) |
| Outputs | Cosmos DB, SQL, Blob, Power BI, Functions | Multiple simultaneous outputs |
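Windowing is the core idea Stream Analytics adds on top of raw ingestion: events are grouped into time buckets and aggregated per bucket. A tumbling window (fixed-size, non-overlapping) can be simulated in a few lines of plain Python — the events and field names below are invented for illustration, roughly mirroring what a Stream Analytics `GROUP BY TumblingWindow(second, 10)` query would compute:

```python
# Tumbling-window count: fixed, non-overlapping 10-second buckets.
from collections import defaultdict

events = [  # (timestamp_seconds, device_id) -- hypothetical telemetry
    (1, "sensor-a"), (4, "sensor-b"), (9, "sensor-a"),
    (12, "sensor-a"), (18, "sensor-b"), (21, "sensor-a"),
]

WINDOW = 10  # window length in seconds

def tumbling_counts(events, window):
    """Count events per non-overlapping window, keyed by window start time."""
    counts = defaultdict(int)
    for ts, _device in events:
        window_start = (ts // window) * window  # bucket the timestamp
        counts[window_start] += 1
    return dict(counts)

print(tumbling_counts(events, WINDOW))  # {0: 3, 10: 2, 20: 1}
```

Hopping and sliding windows differ only in how buckets overlap; the bucketing step is what lets a stream processor emit continuous aggregates without ever holding the full stream in memory.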
Elena's real-time fraud detection:
- Card transactions → Event Hubs (millions/second ingestion)
- Stream Analytics applies fraud rules (unusual amounts, foreign locations, velocity checks)
- Suspicious transactions → Cosmos DB for investigation dashboard
- Alerts → Azure Functions → notify fraud team via Teams
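The fraud rules in Elena's pipeline boil down to simple per-transaction predicates. A minimal sketch — thresholds, field names, and the `is_suspicious` helper are all hypothetical, standing in for logic that would live in a Stream Analytics query or a downstream Function:

```python
# Per-transaction fraud rules: flag if ANY rule fires.

def is_suspicious(txn, recent_count_last_minute):
    """Return True if a card transaction trips any fraud rule."""
    rules = [
        txn["amount"] > 5000,                   # unusual amount
        txn["country"] != txn["home_country"],  # foreign location
        recent_count_last_minute > 10,          # velocity check (txns/minute on card)
    ]
    return any(rules)

ok = {"amount": 42.0, "country": "GB", "home_country": "GB"}
flagged = {"amount": 9000.0, "country": "RU", "home_country": "GB"}

print(is_suspicious(ok, recent_count_last_minute=2))       # False
print(is_suspicious(flagged, recent_count_last_minute=2))  # True
```

The velocity input is exactly where the windowing above earns its keep: the per-card count over the last minute is a streaming aggregate, computed continuously rather than queried from a database per transaction.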
## Knowledge check

GlobalTech needs to consolidate data from 15 source systems (SQL Server, SAP, CSV files) into a data lake for analytics. Data volumes are 2 TB daily. The analytics team wants to run Spark transformations and Power BI reports. Which architecture should Priya recommend?

FinSecure Bank processes 10 million card transactions per hour. They need real-time fraud detection with sub-second alerting. Suspicious transactions must be stored for investigation. Which architecture should Elena recommend?
Video coming soon

Domain 2 complete! You've designed relational databases, NoSQL with Cosmos DB, unstructured storage, and data integration pipelines.

Next up: design for when things go wrong — Recovery Objectives: RPO, RTO & SLA.