Azure Backup Vault, Azure Recovery Services Vault, and Azure Site Recovery make up Microsoft’s core suite of data protection and disaster recovery services. Azure’s vaults enable customers to store backups of entire Azure VMs, on-premises workloads, and workloads from Azure services such as Azure SQL Database, Azure Blob, and Azure Database for PostgreSQL. Azure Site Recovery integrates with Azure Recovery Services Vault to extend its backup services to support disaster recovery. In the case of an unexpected outage, Azure Site Recovery will reference the stored data in Azure Recovery Services Vault to replicate and fail over on-premises or cross-region workloads. Ensuring that these services stay healthy is critical to minimizing service downtime and data loss when outages occur.
Datadog now integrates with Azure Backup Vault, Azure Recovery Services Vault, and Azure Site Recovery to help you visualize your Azure data protection metrics and alert your engineers when backup jobs fail to complete. In this blog post, we’ll discuss how to track the status of your vaults’ backup jobs using our preconfigured dashboard and monitors, as well as how to measure your organization’s data loss following disaster recovery using recovery point objectives (RPO).
Track and alert on the status of your vaults’ backup jobs
Datadog’s preconfigured Azure Backup dashboard gives you insights into your backup vaults from both Azure Backup Vault and Azure Recovery Services Vault, as well as health status updates from recent backup jobs. Tracking and consolidating the vaults you operate can be important for both cost management and efficient disaster recovery. Vaults don’t share storage space, and by operating an excessive number of vaults, you risk duplicating backup data and fragmenting backups of protected instances across multiple vaults. This can slow down your time to recovery, especially if your developers don’t know which vault stores the backup data for a given protected instance.
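To make the fragmentation risk concrete, here is a minimal Python sketch that flags protected instances whose backups are spread across more than one vault. The inventory format (pairs of vault name and instance name), the function name, and the sample names are all hypothetical; in practice you would build this list from your own vault inventory.

```python
from collections import defaultdict

def instances_in_multiple_vaults(inventory):
    """Map each instance backed up in more than one vault to its vault names."""
    vaults_by_instance = defaultdict(set)
    for vault, instance in inventory:
        vaults_by_instance[instance].add(vault)
    return {
        instance: sorted(vaults)
        for instance, vaults in vaults_by_instance.items()
        if len(vaults) > 1
    }

# Hypothetical inventory of (vault name, protected instance) pairs.
inventory = [
    ("vault-east", "vm-web-01"),
    ("vault-west", "vm-web-01"),
    ("vault-east", "sql-orders"),
]

# Only vm-web-01 appears in two vaults, so it is the only instance reported.
print(instances_in_multiple_vaults(inventory))
```

Instances reported here are candidates for consolidation into a single vault.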
Typically, your backup jobs for Azure data protection services are scheduled to run at regular intervals in order to meet service level objectives (SLOs) on backup availability for specific workloads. For instance, you may back up your VMs and file shares nightly, but take hourly snapshots of high-transaction workloads such as SQL databases. You can use the backup job health graph to monitor your recently run backup jobs and quickly see which ones completed successfully and which ones failed.
Vault jobs may encounter errors because they’ve exceeded your storage quota or run into issues with disk space, network connectivity, and more. When these errors occur, your backup may fail to complete or load incomplete snapshots of your Azure VMs. The longer these errors go unresolved, the greater the time period since your last successful backup, and the more data you risk losing in the event of an outage. Using Datadog’s monitor template for Azure Backup Vault job errors, you can quickly set up alerting for these errors so that your engineers can immediately investigate and restore the health of your backup jobs.
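The logic behind an alert on job errors can be sketched in a few lines of Python. This is an illustration only, not the definition of Datadog’s monitor template: the job record fields (`vault`, `status`), the `"Failed"` status value, and the threshold are assumptions chosen for the example.

```python
from collections import Counter

def vaults_to_alert(jobs, threshold=1):
    """Return vaults whose failed-job count meets or exceeds the threshold."""
    failures = Counter(job["vault"] for job in jobs if job["status"] == "Failed")
    return sorted(vault for vault, count in failures.items() if count >= threshold)

# Hypothetical job records for a recent evaluation window.
recent_jobs = [
    {"vault": "vault-east", "status": "Completed"},
    {"vault": "vault-east", "status": "Failed"},
    {"vault": "vault-west", "status": "Completed"},
    {"vault": "vault-east", "status": "Failed"},
]

# vault-east has two failures; vault-west has none.
print(vaults_to_alert(recent_jobs, threshold=2))
```

Raising the threshold trades alert noise for slower detection, so a low threshold is usually the safer choice for backups that run infrequently.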
Measure data loss following disaster recovery using recovery point objectives
When outages occur, it is expected that your services and workloads will experience some degree of data loss. Recovery point objectives (RPO) define the duration of acceptable data loss in case of an outage, measured by the time elapsed since your most recent successful backup.
The Azure Site Recovery section within our dashboard enables you to monitor your RPOs in real time, so you can adjust your replication frequency in order to meet your SLOs. For example, if the dashboard consistently shows an RPO of several days for your service while your SLO for that service is under one day of data loss, you’re exceeding your data loss target and need to use higher-performance storage or back up replication data at daily intervals, or even multiple times each day. Ensuring that your RPOs fall within your services’ SLOs can be critical to maintaining compliance, especially in highly regulated industries such as healthcare and financial services, and helps you create and meet standards for operational resiliency.
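As a rough illustration of this comparison, the Python sketch below computes an RPO as the time elapsed since the most recent successful recovery point and checks it against an SLO. The function names, timestamps, and the one-day SLO are hypothetical values for the example, not figures taken from the dashboard.

```python
from datetime import datetime, timedelta, timezone

def current_rpo(last_recovery_point, now):
    """Time elapsed since the most recent successful recovery point."""
    return now - last_recovery_point

def exceeds_slo(last_recovery_point, now, slo):
    """True when the current RPO is larger than the allowed data loss window."""
    return current_rpo(last_recovery_point, now) > slo

# Hypothetical timestamps: the last recovery point is three days old.
now = datetime(2024, 1, 10, 12, 0, tzinfo=timezone.utc)
last_recovery_point = datetime(2024, 1, 7, 12, 0, tzinfo=timezone.utc)
slo = timedelta(days=1)  # assumed SLO: at most one day of data loss

print(current_rpo(last_recovery_point, now))
print(exceeds_slo(last_recovery_point, now, slo))
```

With these sample values the RPO is three days against a one-day SLO, which is the situation described above where replication needs to run more frequently.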
Start monitoring your Azure environment with Datadog
Datadog’s integrations with Azure Backup Vault, Azure Recovery Services Vault, and Azure Site Recovery help you ensure that your Azure environment’s backups stay healthy and that you’re prepared when disasters occur. You can learn more about these integrations along with our other Microsoft Azure integrations in our documentation.
If you don’t already have a Datadog account, sign up for a free 14-day trial today.