Monitoring & Logging Design
A well-architected Azure solution needs eyes everywhere. Learn how to design a monitoring and logging strategy that gives you the right signals — without drowning in noise.
Why monitoring design matters
Monitoring is like the instrument panel in a cockpit. Every pilot needs altitude, speed, and fuel gauges — but too many gauges with no hierarchy means you’ll miss the one alarm that matters.
As an Azure architect, you need to design WHERE logs go, HOW they’re routed, and WHAT gets alerted on. Bad monitoring design means you either drown in data or miss critical failures.
The three design decisions: what to collect (platform logs, application telemetry, security events), where to send it (Log Analytics, Storage, Event Hubs), and who gets alerted (Azure Monitor alerts, action groups, ITSM integration).
The Azure Monitor ecosystem
Azure Monitor is the umbrella — everything else feeds into or out of it.
| Component | What It Does | Design Decision |
|---|---|---|
| Azure Monitor | Unified platform for metrics, logs, alerts | Central orchestrator — always the starting point |
| Log Analytics | Store and query logs using KQL | Workspace topology: how many, where, who owns them |
| Application Insights | Application performance monitoring (APM) | Workspace-based (recommended) vs classic deployment |
| Azure Monitor Agent | Collects data from VMs and virtual machine scale sets | Replaces legacy agents (Log Analytics agent, Diagnostics extension) |
| Diagnostic Settings | Routes platform logs to destinations | Every resource needs explicit diagnostic config |
| Action Groups | Defines notification and automation actions | Email, SMS, webhook, ITSM, Logic App, Azure Function |
Exam pattern: Questions often present a scenario and ask “which monitoring component should the architect recommend?” The answer depends on what’s being monitored (platform vs app vs infrastructure) and what action is needed (alert vs analyse vs archive).
Designing your logging solution
What gets logged?
Azure generates three categories of log data:
| Log Type | Source | Examples | Default Destination |
|---|---|---|---|
| Activity logs | Azure control plane | Resource creation, RBAC changes, deployments | Azure Monitor (auto, 90-day retention) |
| Resource logs | Individual Azure resources | SQL query stats, Storage access, Key Vault operations | Nowhere — you MUST configure diagnostic settings |
| Entra ID logs | Identity platform | Sign-ins, audit events, provisioning | Entra portal (7–30 days depending on licence) |
Exam tip: Resource logs aren't collected by default
This is a critical design point. Activity logs are automatically available, but resource logs require you to create diagnostic settings on each resource. If a scenario asks “logs aren’t appearing for a storage account” — the answer is almost always missing diagnostic settings. As an architect, you need to design a policy (Azure Policy) to automatically deploy diagnostic settings at scale.
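In practice, the enforcement mechanism is an Azure Policy with a `DeployIfNotExists` effect, but the underlying audit logic is simple. A minimal sketch, with an invented inventory and setting names (not a real Azure API):

```python
# Hypothetical sketch: flag resources that have no diagnostic setting.
# The resource IDs and setting names below are illustrative only.

def find_missing_diagnostics(resources, diagnostic_settings):
    """Return resource IDs with no diagnostic setting attached.

    resources: list of resource ID strings.
    diagnostic_settings: dict mapping resource ID -> list of setting names.
    """
    return [r for r in resources if not diagnostic_settings.get(r)]

inventory = [
    "/subscriptions/s1/resourceGroups/rg1/providers/Microsoft.Storage/storageAccounts/logsa",
    "/subscriptions/s1/resourceGroups/rg1/providers/Microsoft.KeyVault/vaults/kv1",
]
settings = {inventory[1]: ["send-to-law"]}  # only the Key Vault is configured

print(find_missing_diagnostics(inventory, settings))  # the storage account is the gap
```

An Azure Policy assignment effectively runs this check continuously across a management group and remediates the gaps automatically.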
Log Analytics workspace topology
This is one of the biggest monitoring design decisions. There’s no single right answer — it depends on your organisation’s structure, compliance needs, and cost tolerance.
| Factor | Centralised (1 workspace) | Distributed (per-team/app) | Hybrid (regional + central) |
|---|---|---|---|
| Management overhead | Low — one workspace to manage | High — many workspaces, many configs | Medium — clear ownership model |
| Cross-resource queries | Easy — everything in one place | Hard — requires cross-workspace queries | Medium — regional queries easy, global needs cross-workspace |
| Access control | Harder — need resource-context or table-level RBAC | Easy — workspace-level RBAC per team | Good — regional teams own their workspace |
| Data sovereignty | Risk — all data in one region | Good — data stays where team is | Good — regional workspaces respect boundaries |
| Cost optimisation | Good — easier to hit commitment tiers | Poor — each workspace has own cost baseline | Good — regional volumes help hit tiers |
| Best for | Small-medium orgs, single region | Large orgs with strict data boundaries | Global enterprises, regulated industries |
🏛️ David’s design: CloudPath Advisory recommends the hybrid model for government clients. “Each agency keeps data in their region’s workspace for sovereignty. A central workspace gets a copy of security events for the SOC team. Azure Lighthouse lets the central team query across without moving data.”
Designing log routing
Once you know what to collect and where to store it, you need to route logs efficiently.
Diagnostic settings destinations
Each diagnostic setting on a resource can send logs to multiple destinations simultaneously (one of each destination type). The three most common destinations:
| Destination | Use Case | Retention |
|---|---|---|
| Log Analytics workspace | Query, alert, analyse with KQL | Configurable (30 days to 2 years, archive to 12 years) |
| Storage account | Long-term archive, compliance, audit | Unlimited (lifecycle management) |
| Event Hubs | Stream to external SIEM (Splunk, Datadog) or custom consumers | Real-time (consumer controls retention) |
🏦 Elena’s scenario: FinSecure Bank must retain all Key Vault access logs for 7 years (PCI DSS). Elena routes logs to Log Analytics (90-day interactive query) AND Storage (7-year archive with immutable blobs). Security events go to Log Analytics with Microsoft Sentinel for SIEM correlation and threat detection.
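Elena's reasoning can be expressed as a small decision function. This is one possible rule of thumb, not an official decision tree — for example, Log Analytics archive tiers are an alternative to Storage for some long-retention cases:

```python
# Sketch: pick diagnostic-setting destinations from requirements.
# The thresholds and labels are illustrative assumptions.

def choose_destinations(needs_kql_queries, retention_years, external_siem):
    dests = set()
    if needs_kql_queries:
        dests.add("LogAnalytics")    # interactive KQL query and alerting
    if retention_years > 2:
        dests.add("StorageAccount")  # cheap long-term, immutable archive
    if external_siem:
        dests.add("EventHubs")       # stream to Splunk, Datadog, etc.
    return dests

# Elena's Key Vault requirement: 90-day queries plus a 7-year archive.
print(choose_destinations(True, 7, False))
```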
Azure Monitor Agent vs legacy agents
| Feature | Azure Monitor Agent (AMA) | Log Analytics Agent (MMA) | Diagnostics Extension |
|---|---|---|---|
| Status | Current — recommended | Deprecated Aug 2024 | Legacy — limited use |
| Multi-homing | Yes — data collection rules (DCRs) | Yes — manual config | No |
| Configuration | Centralised DCRs in Azure | Per-agent workspace config | Per-VM extension config |
| Filtering at source | Yes — DCR transformations | No — all or nothing | Limited |
| Best for | All new deployments | Legacy only — migrate away | Guest OS metrics only |
Design decision: Data Collection Rules (DCRs)
DCRs are the architect’s tool for log routing at scale. A single DCR can:
- Filter which events are collected (reducing cost)
- Transform data before ingestion (KQL transformations)
- Route different log types to different workspaces
- Apply to thousands of VMs via Azure Policy
Well-Architected connection (Cost Optimisation): DCR transformations can filter out noisy, low-value logs before they hit Log Analytics — directly reducing your monitoring bill. A common pattern: collect verbose logs in dev, filter to errors-only in production.
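What a DCR ingestion-time transformation does can be simulated in a few lines. A real transform is a KQL expression such as `source | where SeverityLevel >= 3`; the record shape and field names below are illustrative:

```python
# Sketch: simulate a DCR transformation that drops low-severity records
# before ingestion, shrinking the billed volume. Field names are assumed.

def errors_only(records, min_severity=3):
    """Keep only records at or above min_severity."""
    return [r for r in records if r["SeverityLevel"] >= min_severity]

raw = [
    {"SeverityLevel": 1, "Message": "verbose trace"},
    {"SeverityLevel": 3, "Message": "error: timeout"},
    {"SeverityLevel": 4, "Message": "critical: db down"},
]
kept = errors_only(raw)
print(f"kept {len(kept)}/{len(raw)} records")  # only the errors are billed
```

Because the filter runs before ingestion, the dropped records never reach the workspace and are never charged for.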
Designing your monitoring solution
Metrics vs logs: when to use which
| Factor | Metrics | Logs |
|---|---|---|
| Data type | Numeric time-series | Structured/semi-structured text |
| Query speed | Milliseconds | Seconds to minutes |
| Retention | 93 days (auto) | Configurable (up to 12 years) |
| Cost | Free (platform metrics) | Pay per GB ingested |
| Best for | Real-time alerts, dashboards, autoscale triggers | Root cause analysis, audit, compliance, complex queries |
| Alert latency | ~1 minute | ~5-15 minutes |
Design principle: Use metrics for real-time detection and logs for investigation. Alerting on metrics is faster and cheaper. Alerting on logs is more flexible but slower and costlier.
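The principle reduces to a simple decision function. The categories are a deliberate simplification of the table above, not an official decision tree:

```python
# Sketch: choose an alert signal type from two requirements.
# "metric alert" and "log search alert" follow the table above.

def pick_alert_signal(needs_fast_detection, needs_complex_query):
    if needs_complex_query:
        return "log search alert"  # KQL flexibility; slower, costlier
    return "metric alert"          # ~1 minute latency; platform metrics are free

print(pick_alert_signal(True, False))   # e.g. CPU threshold
print(pick_alert_signal(False, True))   # e.g. correlating audit events
```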
Alert design patterns
🏗️ Priya’s challenge: GlobalTech’s migration created 400+ Azure resources across 12 subscriptions. The operations team was getting 200+ alert emails per day — and ignoring all of them.
Priya’s redesign:
- Severity tiers: Sev 0 (critical — pages on-call), Sev 1 (important — Teams channel), Sev 2 (informational — dashboard only)
- Action group hierarchy: Different action groups per severity, per team
- Alert processing rules: Suppress alerts during maintenance windows, route by resource group tag
- Smart detection: Application Insights anomaly detection instead of static thresholds
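The severity-tier routing above can be sketched as a lookup from severity to action group, with maintenance-window suppression layered on top. Action group names and tier semantics are hypothetical:

```python
# Sketch of severity-tier alert routing. Action group names are made up.

SEVERITY_ROUTES = {
    0: "ag-oncall-pager",    # Sev 0: page the on-call engineer
    1: "ag-teams-channel",   # Sev 1: post to the team's channel
    2: "ag-dashboard-only",  # Sev 2: informational, no notification
}

def route_alert(severity, in_maintenance_window=False):
    if in_maintenance_window:
        return None  # alert processing rule: suppress during maintenance
    return SEVERITY_ROUTES.get(severity, "ag-dashboard-only")

print(route_alert(0))                              # pages on-call
print(route_alert(0, in_maintenance_window=True))  # suppressed
```

The key design point is that routing lives in one place (action groups and processing rules), not scattered across hundreds of individual alert rules.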
Well-Architected Framework connection
Reliability: Monitoring is the first line of defence against outages. Design alerts for SLO breaches, not just resource failures.
Operational Excellence: Alert fatigue is a real risk. If your monitoring design creates noise, operators will ignore it — which is worse than no monitoring at all.
Cost Optimisation: Log Analytics costs scale with data volume. Architect your log collection to capture what you need, not everything possible.
Application Insights for application monitoring
For custom applications, Application Insights provides:
- Distributed tracing across microservices
- Live metrics for real-time debugging
- Availability tests (URL ping, multi-step)
- Application Map visualising dependencies
- Smart detection for performance anomalies
Design decision: Always use workspace-based Application Insights (not classic). This sends telemetry to a Log Analytics workspace, enabling cross-resource queries and unified alerting.
🚀 Marcus’s approach: NovaSaaS runs 30+ microservices. Marcus uses a single workspace-based Application Insights instance with sampling at 20% in production (cost control) and 100% in staging (full visibility). Custom KQL alerts watch for error rate spikes across the distributed trace.
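Fixed-rate sampling of the kind Application Insights applies can be sketched with a deterministic hash over the operation ID, so every trace belonging to one operation is kept or dropped together. The hash function here is illustrative; the real SDK uses its own algorithm:

```python
# Sketch: hash-based fixed-rate sampling keyed on operation ID, so a
# distributed trace is sampled as a unit. CRC32 is an assumed stand-in
# for the SDK's internal hash.
import zlib

def keep_item(operation_id: str, sampling_percentage: float) -> bool:
    bucket = zlib.crc32(operation_id.encode()) % 100
    return bucket < sampling_percentage

ops = [f"op-{i}" for i in range(1000)]
kept = sum(keep_item(o, 20) for o in ops)
print(f"kept ~{kept / 10:.0f}% of operations")  # roughly 20%
```

Because the decision is deterministic per operation ID, the 20% that survives in production still contains complete end-to-end traces rather than random fragments.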
Knowledge check
🏦 Elena needs to retain all Azure Key Vault access logs for 7 years to meet PCI DSS compliance, while also enabling her SOC team to run real-time queries on the last 90 days. Which combination should she recommend?
🏗️ Priya is designing monitoring for GlobalTech's 12-subscription Azure environment. Teams in Europe and Asia need to query their own logs independently, but the central security team needs visibility across all regions. Which Log Analytics topology should she recommend?
🎬 Video coming soon
Next up: Now that you can see your environment, let’s design who gets in — Choosing Authentication Methods.