Log Shipping and HA/DR Operations
Configure log shipping for disaster recovery. Monitor HA/DR solutions and troubleshoot common failover and replication issues.
Log shipping and HA/DR operations
Log shipping is like mailing carbon copies of a diary. Every few minutes, you photocopy the latest pages and send them to a backup location. The backup is always a few pages behind, but if the original is lost, you have almost everything.
Itβs the simplest DR technology β no clustering, no special editions. Just backup, copy, restore. Repeat.
Log shipping architecture
| Component | Role |
|---|---|
| Primary server | Production database; automated log backups |
| Secondary server | Standby database; automated log restores |
| Monitor server | (Optional) Tracks backup/restore status, alerts on failures |
| Backup job | SQL Agent job on primary β backs up the log to a shared location |
| Copy job | SQL Agent job on secondary β copies log files from the share |
| Restore job | SQL Agent job on secondary β restores the copied log files |
Setting up log shipping
- Primary database must be in FULL recovery mode
- Take an initial full backup and restore it on the secondary with NORECOVERY
- Configure log shipping via SSMS wizard or T-SQL
- Three SQL Agent jobs are created automatically
-- On the primary: verify recovery model
ALTER DATABASE NorthStarDR SET RECOVERY FULL;
-- Take initial full backup
BACKUP DATABASE NorthStarDR TO DISK = 'C:\Backups\NorthStarDR_Full.bak' WITH INIT;
-- On the secondary: restore with NORECOVERY (standby)
RESTORE DATABASE NorthStarDR FROM DISK = '\\share\NorthStarDR_Full.bak'
WITH NORECOVERY; -- or STANDBY = 'C:\StandbyFile.ldf' for read-only access
NORECOVERY vs STANDBY
| Mode | Database State | Readable? | Use Case |
|---|---|---|---|
| NORECOVERY | Restoringβ¦ | No | Pure standby β fastest restore, no user access |
| STANDBY | Standby / Read-Only | Yes (read-only) | Warm standby with reporting capability |
Log shipping on Azure
Log shipping is primarily for SQL Server on Azure VMs:
- Primary and secondary can be in different Azure regions (for DR)
- Log files are typically stored on Azure Blob Storage or Azure File Share
- SQL Agent runs the backup/copy/restore jobs
- Azure SQL MI does NOT support user-configured log shipping (Azure manages replication internally)
- Azure SQL Database does NOT support log shipping
Monitoring HA/DR solutions
What to monitor by solution
| Solution | Key Metrics | Tools |
|---|---|---|
| Failover groups | Replication lag, failover status, grace period | Azure Portal, sys.dm_geo_replication_link_status |
| Geo-replication | Replication state, lag in seconds | sys.dm_geo_replication_link_status |
| Always On AGs | Synchronization state, data loss, latency | AG Dashboard (SSMS), sys.dm_hadr_database_replica_states |
| FCIs | Cluster node status, quorum health | Failover Cluster Manager, sys.dm_os_cluster_nodes |
| Log shipping | Backup/copy/restore job status, last restored file | msdb.dbo.log_shipping_monitor_* tables |
| Backups (general) | Last backup time, backup size, duration | msdb.dbo.backupset, Azure Monitor |
Key DMVs for HA/DR monitoring
-- AG replication status
SELECT
ag.name AS ag_name,
ar.replica_server_name,
drs.synchronization_state_desc,
drs.synchronization_health_desc,
drs.log_send_queue_size,
drs.redo_queue_size
FROM sys.dm_hadr_database_replica_states drs
JOIN sys.availability_replicas ar ON drs.replica_id = ar.replica_id
JOIN sys.availability_groups ag ON ar.group_id = ag.group_id;
-- Geo-replication status (Azure SQL DB)
SELECT partner_server, replication_state_desc, replication_lag_sec
FROM sys.dm_geo_replication_link_status;
-- Log shipping status
SELECT primary_server, secondary_server, last_backup_date, last_restore_date,
DATEDIFF(MINUTE, last_restore_date, GETDATE()) AS minutes_behind
FROM msdb.dbo.log_shipping_monitor_secondary;
Troubleshooting HA/DR
| Problem | Likely Cause | Investigation |
|---|---|---|
| AG secondary not synchronizing | Network latency, redo queue backlog, log send queue full | Check log_send_queue_size and redo_queue_size in DMVs |
| Failover group failover failed | Grace period not elapsed, secondary not in sync | Check replication lag in Azure Portal; wait for grace period |
| Log shipping restore lag increasing | Copy job failing, restore job slow, network bottleneck | Check Agent job history; verify file share connectivity |
| FCI failover took too long | Shared storage latency, cluster quorum issues | Check Failover Cluster Manager events; verify storage health |
| Geo-secondary not readable | Replication paused or broken | Check sys.dm_geo_replication_link_status for state |
| After failover: logins missing | Logins not synced between replicas | Sync logins using sp_help_revlogin or contained database users |
Post-failover checklist
- Verify all databases are online on the new primary
- Check application connectivity β connection strings point to listener/failover group
- Verify logins and permissions β sync any missing logins
- Check SQL Agent jobs β jobs may need to be enabled on the new primary
- Monitor performance β the new primary may have different resource characteristics
- Plan failback β when the original primary recovers, plan the reverse failover
Kenji's AG secondary shows a growing redo_queue_size and synchronization_state = 'NOT SYNCHRONIZING.' The log_send_queue_size is 0. What is the likely issue?
After an automatic failover group failover, Tomas discovers that SQL Server Agent jobs are running on the old primary but not on the new primary. What should he do?
π¬ Video coming soon
Congratulations β youβve completed all 5 domains of DP-300! Youβre now equipped with the knowledge to administer Azure SQL solutions: plan, secure, monitor, automate, and protect.
Good luck on your exam! π