Log Shipping and HA/DR Operations

Simple explanation

Log shipping is like mailing carbon copies of a diary. Every few minutes, you photocopy the latest pages and send them to a backup location. The backup is always a few pages behind, but if the original is lost, you have almost everything.

It’s the simplest DR technology — no clustering, no special editions. Just backup, copy, restore. Repeat.

Log shipping architecture

Component	Role
Primary server	Production database; automated log backups
Secondary server	Standby database; automated log restores
Monitor server	(Optional) Tracks backup/restore status, alerts on failures
Backup job	SQL Agent job on primary — backs up the log to a shared location
Copy job	SQL Agent job on secondary — copies log files from the share
Restore job	SQL Agent job on secondary — restores the copied log files

Setting up log shipping

Primary database must be in FULL recovery mode
Take an initial full backup and restore it on the secondary with NORECOVERY
Configure log shipping via SSMS wizard or T-SQL
Three SQL Agent jobs are created automatically

-- On the primary: verify recovery model
ALTER DATABASE NorthStarDR SET RECOVERY FULL;

-- Take initial full backup
BACKUP DATABASE NorthStarDR TO DISK = 'C:\Backups\NorthStarDR_Full.bak' WITH INIT;

-- On the secondary: restore with NORECOVERY (standby)
RESTORE DATABASE NorthStarDR FROM DISK = '\\share\NorthStarDR_Full.bak'
WITH NORECOVERY;  -- or STANDBY = 'C:\StandbyFile.ldf' for read-only access

NORECOVERY vs STANDBY

Mode	Database State	Readable?	Use Case
NORECOVERY	Restoring…	No	Pure standby — fastest restore, no user access
STANDBY	Standby / Read-Only	Yes (read-only)	Warm standby with reporting capability

Log shipping on Azure

Log shipping is primarily for SQL Server on Azure VMs:

Primary and secondary can be in different Azure regions (for DR)
Log files are typically stored on Azure Blob Storage or Azure File Share
SQL Agent runs the backup/copy/restore jobs
Azure SQL MI does NOT support user-configured log shipping (Azure manages replication internally)
Azure SQL Database does NOT support log shipping

Monitoring HA/DR solutions

What to monitor by solution

Solution	Key Metrics	Tools
Failover groups	Replication lag, failover status, grace period	Azure Portal, `sys.dm_geo_replication_link_status`
Geo-replication	Replication state, lag in seconds	`sys.dm_geo_replication_link_status`
Always On AGs	Synchronization state, data loss, latency	AG Dashboard (SSMS), `sys.dm_hadr_database_replica_states`
FCIs	Cluster node status, quorum health	Failover Cluster Manager, `sys.dm_os_cluster_nodes`
Log shipping	Backup/copy/restore job status, last restored file	`msdb.dbo.log_shipping_monitor_*` tables
Backups (general)	Last backup time, backup size, duration	`msdb.dbo.backupset`, Azure Monitor

Key DMVs for HA/DR monitoring

-- AG replication status
SELECT
    ag.name AS ag_name,
    ar.replica_server_name,
    drs.synchronization_state_desc,
    drs.synchronization_health_desc,
    drs.log_send_queue_size,
    drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id;

-- Geo-replication status (Azure SQL DB)
SELECT partner_server, replication_state_desc, replication_lag_sec
FROM sys.dm_geo_replication_link_status;

-- Log shipping status
SELECT primary_server, secondary_server, last_backup_date, last_restore_date,
       DATEDIFF(MINUTE, last_restore_date, GETDATE()) AS minutes_behind
FROM msdb.dbo.log_shipping_monitor_secondary;

Troubleshooting HA/DR

Problem	Likely Cause	Investigation
AG secondary not synchronizing	Network latency, redo queue backlog, log send queue full	Check `log_send_queue_size` and `redo_queue_size` in DMVs
Failover group failover failed	Grace period not elapsed, secondary not in sync	Check replication lag in Azure Portal; wait for grace period
Log shipping restore lag increasing	Copy job failing, restore job slow, network bottleneck	Check Agent job history; verify file share connectivity
FCI failover took too long	Shared storage latency, cluster quorum issues	Check Failover Cluster Manager events; verify storage health
Geo-secondary not readable	Replication paused or broken	Check `sys.dm_geo_replication_link_status` for state
After failover: logins missing	Logins not synced between replicas	Sync logins using `sp_help_revlogin` or contained database users

Post-failover checklist

Verify all databases are online on the new primary
Check application connectivity — connection strings point to listener/failover group
Verify logins and permissions — sync any missing logins
Check SQL Agent jobs — jobs may need to be enabled on the new primary
Monitor performance — the new primary may have different resource characteristics
Plan failback — when the original primary recovers, plan the reverse failover

Question

What are the three SQL Agent jobs that log shipping creates?

Click or press Enter to reveal answer

Answer

1) Backup job (on primary — backs up the transaction log), 2) Copy job (on secondary — copies log files from the share), 3) Restore job (on secondary — restores the copied log files).

Click to flip back

Question

What DMV shows AG replication status including log send and redo queue sizes?

Click or press Enter to reveal answer

Answer

sys.dm_hadr_database_replica_states — shows synchronization_state, synchronization_health, log_send_queue_size, and redo_queue_size for each database in the AG.

Click to flip back

Question

What is the difference between NORECOVERY and STANDBY mode for log shipping?

Click or press Enter to reveal answer

Answer

NORECOVERY: database is in 'Restoring' state, not readable. STANDBY: database is read-only between restores — users can query it (but active connections are kicked during restore).

Click to flip back

Knowledge Check

Kenji's AG secondary shows a growing redo_queue_size and synchronization_state = 'NOT SYNCHRONIZING.' The log_send_queue_size is 0. What is the likely issue?

Knowledge Check

After an automatic failover group failover, Tomas discovers that SQL Server Agent jobs are running on the old primary but not on the new primary. What should he do?

🎬 Video coming soon

Congratulations — you’ve completed all 5 domains of DP-300! You’re now equipped with the knowledge to administer Azure SQL solutions: plan, secure, monitor, automate, and protect.

Good luck on your exam! 🎓