Monitoring & Logging Design
A well-architected Azure solution needs eyes everywhere. Learn how to design a monitoring and logging strategy that gives you the right signals — without drowning in noise.
Why monitoring design matters
Monitoring is like the instrument panel in a cockpit. Every pilot needs altitude, speed, and fuel gauges — but too many gauges with no hierarchy means you’ll miss the one alarm that matters.
As an Azure architect, you need to design WHERE logs go, HOW they’re routed, and WHAT gets alerted on. Bad monitoring design means you either drown in data or miss critical failures.
The three design decisions: what to collect (platform logs, application telemetry, security events), where to send it (Log Analytics, Storage, Event Hubs), and who gets alerted (Azure Monitor alerts, action groups, ITSM integration).
The Azure Monitor ecosystem
Azure Monitor is the umbrella — everything else feeds into or out of it.
| Component | What It Does | Design Decision |
|---|---|---|
| Azure Monitor | Unified platform for metrics, logs, alerts | Central orchestrator — always the starting point |
| Log Analytics | Store and query logs using KQL | Workspace topology: how many, where, who owns them |
| Application Insights | Application performance monitoring (APM) | Workspace-based (recommended) vs classic deployment |
| Azure Monitor Agent | Collects data from VMs and virtual machine scale sets | Replaces legacy agents (Log Analytics agent, Diagnostics extension) |
| Diagnostic Settings | Routes platform logs to destinations | Every resource needs explicit diagnostic config |
| Action Groups | Defines notification and automation actions | Email, SMS, webhook, ITSM, Logic App, Azure Function |
Exam pattern: Questions often present a scenario and ask “which monitoring component should the architect recommend?” The answer depends on what’s being monitored (platform vs app vs infrastructure) and what action is needed (alert vs analyse vs archive).
Designing your logging solution
What gets logged?
Azure generates three categories of log data:
| Log Type | Source | Examples | Default Destination |
|---|---|---|---|
| Activity logs | Azure control plane | Resource creation, RBAC changes, deployments | Azure Monitor (auto, 90-day retention) |
| Resource logs | Individual Azure resources | SQL query stats, Storage access, Key Vault operations | Nowhere — you MUST configure diagnostic settings |
| Entra ID logs | Identity platform | Sign-ins, audit events, provisioning | Entra portal (7–30 days depending on licence) |
Exam tip: Resource logs aren't collected by default
This is a critical design point. Activity logs are automatically available, but resource logs require you to create diagnostic settings on each resource. If a scenario asks “logs aren’t appearing for a storage account” — the answer is almost always missing diagnostic settings. As an architect, you need to design a policy (Azure Policy) to automatically deploy diagnostic settings at scale.
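In practice, the enforcement mechanism is an Azure Policy with a `DeployIfNotExists` effect, but the underlying audit logic is simple. A minimal sketch, with an invented inventory and setting names (not a real Azure API):

```python
# Hypothetical sketch: flag resources that have no diagnostic setting.
# The resource IDs and setting names below are illustrative only.

def find_missing_diagnostics(resources, diagnostic_settings):
    """Return resource IDs with no diagnostic setting attached.

    resources: list of resource ID strings.
    diagnostic_settings: dict mapping resource ID -> list of setting names.
    """
    return [r for r in resources if not diagnostic_settings.get(r)]

inventory = [
    "/subscriptions/s1/resourceGroups/rg1/providers/Microsoft.Storage/storageAccounts/logsa",
    "/subscriptions/s1/resourceGroups/rg1/providers/Microsoft.KeyVault/vaults/kv1",
]
settings = {inventory[1]: ["send-to-law"]}  # only the Key Vault is configured

print(find_missing_diagnostics(inventory, settings))  # the storage account is the gap
```

An Azure Policy assignment effectively runs this check continuously across a management group and remediates the gaps automatically.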
Log Analytics workspace topology
This is one of the biggest monitoring design decisions. There’s no single right answer — it depends on your organisation’s structure, compliance needs, and cost tolerance.
| Factor | Centralised (1 workspace) | Distributed (per-team/app) | Hybrid (regional + central) |
|---|---|---|---|
| Management overhead | Low — one workspace to manage | High — many workspaces, many configs | Medium — clear ownership model |
| Cross-resource queries | Easy — everything in one place | Hard — requires cross-workspace queries | Medium — regional queries easy, global needs cross-workspace |
| Access control | Harder — need resource-context or table-level RBAC | Easy — workspace-level RBAC per team | Good — regional teams own their workspace |
| Data sovereignty | Risk — all data in one region | Good — data stays where team is | Good — regional workspaces respect boundaries |
| Cost optimisation | Good — easier to hit commitment tiers | Poor — each workspace has own cost baseline | Good — regional volumes help hit tiers |
| Best for | Small-medium orgs, single region | Large orgs with strict data boundaries | Global enterprises, regulated industries |
🏛️ David’s design: CloudPath Advisory recommends the hybrid model for government clients. “Each agency keeps data in their region’s workspace for sovereignty. A central workspace gets a copy of security events for the SOC team. Azure Lighthouse lets the central team query across without moving data.”
Designing log routing
Once you know what to collect and where to store it, you need to route logs efficiently.
Diagnostic settings destinations
Each diagnostic setting on a resource can send logs to multiple destinations simultaneously (one of each destination type). The three most common destinations:
| Destination | Use Case | Retention |
|---|---|---|
| Log Analytics workspace | Query, alert, analyse with KQL | Configurable (30 days to 2 years, archive to 12 years) |
| Storage account | Long-term archive, compliance, audit | Unlimited (lifecycle management) |
| Event Hubs | Stream to external SIEM (Splunk, Datadog) or custom consumers | Real-time (consumer controls retention) |
🏦 Elena’s scenario: FinSecure Bank must retain all Key Vault access logs for 7 years (PCI DSS). Elena routes logs to Log Analytics (90-day interactive query) AND Storage (7-year archive with immutable blobs). Security events go to Log Analytics with Microsoft Sentinel for SIEM correlation and threat detection.
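Elena's reasoning can be expressed as a small decision function. This is one possible rule of thumb, not an official decision tree — for example, Log Analytics archive tiers are an alternative to Storage for some long-retention cases:

```python
# Sketch: pick diagnostic-setting destinations from requirements.
# The thresholds and labels are illustrative assumptions.

def choose_destinations(needs_kql_queries, retention_years, external_siem):
    dests = set()
    if needs_kql_queries:
        dests.add("LogAnalytics")    # interactive KQL query and alerting
    if retention_years > 2:
        dests.add("StorageAccount")  # cheap long-term, immutable archive
    if external_siem:
        dests.add("EventHubs")       # stream to Splunk, Datadog, etc.
    return dests

# Elena's Key Vault requirement: 90-day queries plus a 7-year archive.
print(choose_destinations(True, 7, False))
```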
Azure Monitor Agent vs legacy agents
| Feature | Azure Monitor Agent (AMA) | Log Analytics Agent (MMA) | Diagnostics Extension |
|---|---|---|---|
| Status | Current — recommended | Deprecated Aug 2024 | Legacy — limited use |
| Multi-homing | Yes — data collection rules (DCRs) | Yes — manual config | No |
| Configuration | Centralised DCRs in Azure | Per-agent workspace config | Per-VM extension config |
| Filtering at source | Yes — DCR transformations | No — all or nothing | Limited |
| Best for | All new deployments | Legacy only — migrate away | Guest OS metrics only |
Design decision: Data Collection Rules (DCRs)
DCRs are the architect’s tool for log routing at scale. A single DCR can:
- Filter which events are collected (reducing cost)
- Transform data before ingestion (KQL transformations)
- Route different log types to different workspaces
- Apply to thousands of VMs via Azure Policy
Well-Architected connection (Cost Optimisation): DCR transformations can filter out noisy, low-value logs before they hit Log Analytics — directly reducing your monitoring bill. A common pattern: collect verbose logs in dev, filter to errors-only in production.
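What a DCR ingestion-time transformation does can be simulated in a few lines. A real transform is a KQL expression such as `source | where SeverityLevel >= 3`; the record shape and field names below are illustrative:

```python
# Sketch: simulate a DCR transformation that drops low-severity records
# before ingestion, shrinking the billed volume. Field names are assumed.

def errors_only(records, min_severity=3):
    """Keep only records at or above min_severity."""
    return [r for r in records if r["SeverityLevel"] >= min_severity]

raw = [
    {"SeverityLevel": 1, "Message": "verbose trace"},
    {"SeverityLevel": 3, "Message": "error: timeout"},
    {"SeverityLevel": 4, "Message": "critical: db down"},
]
kept = errors_only(raw)
print(f"kept {len(kept)}/{len(raw)} records")  # only the errors are billed
```

Because the filter runs before ingestion, the dropped records never reach the workspace and are never charged for.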
Designing your monitoring solution
Metrics vs logs: when to use which
| Factor | Metrics | Logs |
|---|---|---|
| Data type | Numeric time-series | Structured/semi-structured text |
| Query speed | Milliseconds | Seconds to minutes |
| Retention | 93 days (auto) | Configurable (up to 12 years) |
| Cost | Free (platform metrics) | Pay per GB ingested |
| Best for | Real-time alerts, dashboards, autoscale triggers | Root cause analysis, audit, compliance, complex queries |
| Alert latency | ~1 minute | ~5-15 minutes |
Design principle: Use metrics for real-time detection and logs for investigation. Alerting on metrics is faster and cheaper. Alerting on logs is more flexible but slower and costlier.
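The principle reduces to a simple decision function. The categories are a deliberate simplification of the table above, not an official decision tree:

```python
# Sketch: choose an alert signal type from two requirements.
# "metric alert" and "log search alert" follow the table above.

def pick_alert_signal(needs_fast_detection, needs_complex_query):
    if needs_complex_query:
        return "log search alert"  # KQL flexibility; slower, costlier
    return "metric alert"          # ~1 minute latency; platform metrics are free

print(pick_alert_signal(True, False))   # e.g. CPU threshold
print(pick_alert_signal(False, True))   # e.g. correlating audit events
```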
Alert design patterns
🏗️ Priya’s challenge: GlobalTech’s migration created 400+ Azure resources across 12 subscriptions. The operations team was getting 200+ alert emails per day — and ignoring all of them.
Priya’s redesign:
- Severity tiers: Sev 0 (critical — pages on-call), Sev 1 (important — Teams channel), Sev 2 (informational — dashboard only)
- Action group hierarchy: Different action groups per severity, per team
- Alert processing rules: Suppress alerts during maintenance windows, route by resource group tag
- Smart detection: Application Insights anomaly detection instead of static thresholds
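The severity-tier routing above can be sketched as a lookup from severity to action group, with maintenance-window suppression layered on top. Action group names and tier semantics are hypothetical:

```python
# Sketch of severity-tier alert routing. Action group names are made up.

SEVERITY_ROUTES = {
    0: "ag-oncall-pager",    # Sev 0: page the on-call engineer
    1: "ag-teams-channel",   # Sev 1: post to the team's channel
    2: "ag-dashboard-only",  # Sev 2: informational, no notification
}

def route_alert(severity, in_maintenance_window=False):
    if in_maintenance_window:
        return None  # alert processing rule: suppress during maintenance
    return SEVERITY_ROUTES.get(severity, "ag-dashboard-only")

print(route_alert(0))                              # pages on-call
print(route_alert(0, in_maintenance_window=True))  # suppressed
```

The key design point is that routing lives in one place (action groups and processing rules), not scattered across hundreds of individual alert rules.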
Well-Architected Framework connection
Reliability: Monitoring is the first line of defence against outages. Design alerts for SLO breaches, not just resource failures.
Operational Excellence: Alert fatigue is a real risk. If your monitoring design creates noise, operators will ignore it — which is worse than no monitoring at all.
Cost Optimisation: Log Analytics costs scale with data volume. Architect your log collection to capture what you need, not everything possible.
Application Insights for application monitoring
For custom applications, Application Insights provides:
- Distributed tracing across microservices
- Live metrics for real-time debugging
- Availability tests (URL ping, multi-step)
- Application Map visualising dependencies
- Smart detection for performance anomalies
Design decision: Always use workspace-based Application Insights (not classic). This sends telemetry to a Log Analytics workspace, enabling cross-resource queries and unified alerting.
🚀 Marcus’s approach: NovaSaaS runs 30+ microservices. Marcus uses a single workspace-based Application Insights instance with sampling at 20% in production (cost control) and 100% in staging (full visibility). Custom KQL alerts watch for error rate spikes across the distributed trace.
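Fixed-rate sampling of the kind Application Insights applies can be sketched with a deterministic hash over the operation ID, so every trace belonging to one operation is kept or dropped together. The hash function here is illustrative; the real SDK uses its own algorithm:

```python
# Sketch: hash-based fixed-rate sampling keyed on operation ID, so a
# distributed trace is sampled as a unit. CRC32 is an assumed stand-in
# for the SDK's internal hash.
import zlib

def keep_item(operation_id: str, sampling_percentage: float) -> bool:
    bucket = zlib.crc32(operation_id.encode()) % 100
    return bucket < sampling_percentage

ops = [f"op-{i}" for i in range(1000)]
kept = sum(keep_item(o, 20) for o in ops)
print(f"kept ~{kept / 10:.0f}% of operations")  # roughly 20%
```

Because the decision is deterministic per operation ID, the 20% that survives in production still contains complete end-to-end traces rather than random fragments.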
Knowledge check
🏦 Elena needs to retain all Azure Key Vault access logs for 7 years to meet PCI DSS compliance, while also enabling her SOC team to run real-time queries on the last 90 days. Which combination should she recommend?
🏗️ Priya is designing monitoring for GlobalTech's 12-subscription Azure environment. Teams in Europe and Asia need to query their own logs independently, but the central security team needs visibility across all regions. Which Log Analytics topology should she recommend?
🎬 Video coming soon
Next up: Now that you can see your environment, let’s design who gets in — Choosing Authentication Methods.