Enrich Your On-Call Experience With Observability Data at Your Fingertips by Using Datadog On-Call | Datadog

Author Brianne Bujnowski
Author Daljeet Sandu

Published: June 26, 2024

The stress, sudden disruptions, and high stakes of resolving issues while on call are among the most challenging aspects of an engineer’s job. Many organizations, from startups to large enterprises, still struggle with their on-call experience, which leads to longer resolution times and lower employee retention rates. Constant context switching, managing multiple tools, and racing against time to resolve issues can cause frustration, burnout, and inefficiency.

Having a single tool to observe your tech stack, detect issues quickly, and page the right people at the right time is crucial. That’s why we’ve introduced Datadog On-Call, now in private beta. Datadog On-Call integrates monitoring, paging, and incident response into one platform, helping responders quickly triage alerts by presenting pages alongside relevant observability data—together with important service and team ownership details. This not only enhances efficiency and reduces stress but also empowers your team to respond to incidents faster and more effectively, ultimately maintaining the reliability and performance of your systems.

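In this post, we’ll walk through how Datadog On-Call enables your teams to minimize context switching, ensure clear service and team ownership, implement intuitive scheduling and escalation policies, and gain actionable insights from pages.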

Minimize context switching by consolidating monitoring, paging, and resolution into a single platform

One of the major frustrations during an on-call shift is the need to juggle multiple tools and platforms to gather all the necessary information. Engineers often have to switch between Datadog and paging systems, which not only wastes precious time but also increases the risk of missing critical details.

With Datadog On-Call, the seamless integration of monitoring and paging ensures that you receive real-time notifications directly from the same platform where you can analyze the issue and collaborate throughout remediation. This workflow eliminates the inefficiencies caused by context switching, enabling your team to detect and respond to incidents faster. By having all the tools you need in one place, Datadog On-Call enhances productivity, reduces the stress of managing multiple systems, and improves overall incident response times.

Easily manage and take action on pages from one view.

For example, let’s say you’re an on-call backend engineer using Datadog and receive a page at three in the morning. With Datadog On-Call, when the alert is triggered, you are paged via a push notification that takes you into the Datadog Mobile App. From there, you can review the impact of the alert alongside relevant observability data and triage it from your phone. If the impact is severe enough, you can go further by declaring an incident from the mobile app and triggering workflow automations that quickly implement potential resolutions. This entire process, from page to incident resolution, can be done on the go from one platform.

Ensure clear service and team ownership to break down knowledge silos and minimize confusion

By separating monitoring and paging across multiple tools, organizations often end up with redundant service configurations. This fragmentation can lead to confusion about service ownership and responsibility, making it difficult to determine who should be paged for specific issues. The lack of clear team ownership results in delays and prolonged incidents as engineers scramble to identify the right contact.

Datadog On-Call addresses these challenges with a team-centric design that shows clear service and team ownership. With Datadog On-Call, you can associate a team with any service to reduce redundant configurations and ensure services are mapped to the appropriate owners for a single, at-a-glance source of truth. Additionally, after a page, on-call engineers can immediately see the upstream and downstream impact of issues with the Datadog Service Catalog and bring further details to the attention of the right owners.
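To make the idea of a single ownership mapping concrete, here is a minimal sketch in Python. It is not Datadog's data model or API: the `Service` class, the `CATALOG` entries, and the helper functions are all hypothetical names chosen for illustration. It assumes each service has exactly one owning team and a list of services it depends on, and shows how that one mapping can answer both "who owns this?" and "which other teams should hear about it?"

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    owner_team: str          # exactly one owning team per service
    depends_on: tuple = ()   # services this one calls


# Hypothetical catalog entries, for illustration only.
CATALOG = {
    "checkout": Service("checkout", "payments", depends_on=("cart", "auth")),
    "cart": Service("cart", "storefront", depends_on=("auth",)),
    "auth": Service("auth", "identity"),
}


def owner_of(service_name):
    """Return the team that owns the given service."""
    return CATALOG[service_name].owner_team


def dependents_of(service_name):
    """Return services that call service_name directly, sorted by name."""
    return sorted(s.name for s in CATALOG.values() if service_name in s.depends_on)


def teams_to_inform(service_name):
    """Owning team first, then owners of direct dependents, without duplicates."""
    teams = [owner_of(service_name)]
    for name in dependents_of(service_name):
        team = owner_of(name)
        if team not in teams:
            teams.append(team)
    return teams


print(teams_to_inform("auth"))
```

Because ownership lives in one place, the paging decision and the impact analysis read from the same record, which is the property the team-centric design is aiming for.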

Implement intuitive scheduling and escalation policies to ensure timely responses

Effective scheduling and escalation policies are essential for managing on-call duties without overburdening your team. Traditional scheduling methods can be cumbersome, leading to an uneven distribution of on-call shifts and an increased risk of burnout.

Schedule list of all on-call engineers on your team.

Datadog On-Call simplifies this process with intuitive scheduling tools that make it easy to create and manage on-call rotations. The On-Call page includes quality-of-life features such as drag-and-drop editing and live schedule previews, allowing you to set up schedules that distribute duties fairly, prevent fatigue, and maintain a balanced workload.

Creating and keeping an on-call schedule is seamless.
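As a rough illustration of what fair distribution means in a rotation, the sketch below assigns fixed-length shifts round-robin, so no engineer ever carries more than one shift beyond anyone else. This is a simplified model written for this post, not how Datadog On-Call builds schedules; `build_rotation`, `on_call_at`, the engineer names, and the seven-day shift length are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta


def build_rotation(engineers, start, shifts, shift_days=7):
    """Assign consecutive shifts round-robin across the given engineers."""
    schedule = []
    for i in range(shifts):
        shift_start = start + timedelta(days=i * shift_days)
        schedule.append({
            "engineer": engineers[i % len(engineers)],
            "start": shift_start,                             # inclusive
            "end": shift_start + timedelta(days=shift_days),  # exclusive
        })
    return schedule


def on_call_at(schedule, day):
    """Return who is on call on the given day, or None if no shift covers it."""
    for shift in schedule:
        if shift["start"] <= day < shift["end"]:
            return shift["engineer"]
    return None


# Hypothetical team and start date.
schedule = build_rotation(["ana", "bo", "cy"], date(2024, 7, 1), shifts=8)
load = Counter(shift["engineer"] for shift in schedule)
print(load)
print(on_call_at(schedule, date(2024, 7, 9)))
```

Counting shifts per engineer, as `load` does here, is a quick way to check a schedule for imbalance before publishing it.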

Beyond its scheduling benefits, Datadog On-Call’s robust escalation policies ensure that pages are promptly addressed. If the primary on-call engineer is unavailable or does not acknowledge a page, the next available team member is automatically notified. By implementing these intuitive scheduling and escalation features, Datadog On-Call helps maintain high levels of responsiveness and reliability in your incident management process.
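The escalation behavior described above can be sketched as a short simulation. This is an illustrative model only, not Datadog On-Call's implementation or configuration format: the `Step` class, the `escalate` function, the responder names, and the timeouts are hypothetical. It assumes each step notifies one responder and waits a fixed number of minutes for an acknowledgment before moving to the next step.

```python
from dataclasses import dataclass


@dataclass
class Step:
    target: str            # who gets notified at this step
    timeout_minutes: int   # how long to wait for an acknowledgment


def escalate(steps, ack_after_minutes):
    """Walk the policy until someone acknowledges within their timeout.

    ack_after_minutes maps a responder to the minutes they take to
    acknowledge once notified; responders missing from the map never
    acknowledge. Simplification: an acknowledgment that would arrive
    after the step's timeout is ignored.
    """
    elapsed = 0
    notified = []
    for step in steps:
        notified.append(step.target)
        delay = ack_after_minutes.get(step.target)
        if delay is not None and delay <= step.timeout_minutes:
            return {
                "acknowledged_by": step.target,
                "elapsed_minutes": elapsed + delay,
                "notified": notified,
            }
        elapsed += step.timeout_minutes
    return {"acknowledged_by": None, "elapsed_minutes": elapsed, "notified": notified}


# Hypothetical three-step policy.
policy = [Step("primary", 5), Step("secondary", 10), Step("manager", 15)]

# The primary never responds; the secondary acknowledges 3 minutes after being paged.
result = escalate(policy, {"secondary": 3})
print(result)
```

In this run the page is acknowledged 8 minutes after it fired: 5 minutes waiting on the primary plus 3 minutes for the secondary to respond.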

Gain actionable insights from pages with detailed analytics

Reviewing past pages is critical for teams to understand the root causes of incidents and identify opportunities for improvement. These reviews help teams answer core questions such as: What triggered the alert? How effective was the response? Were there any delays in detection or acknowledgment? What can be done to prevent similar incidents in the future? By conducting thorough page reviews, teams can analyze their incident response processes and make data-driven decisions to enhance their workflows.

Detailed analytics of all your pages.

Datadog On-Call provides detailed analytics that make page reviews more insightful and productive. Teams can easily access metrics such as the number of pages received, the time taken to respond to alerts, and the duration of the incident. These metrics enable teams to pinpoint inefficiencies and areas for improvement. For example, if recurring issues are identified, teams can adjust monitoring thresholds or update runbooks to ensure quicker resolutions in the future.
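To show how metrics like these are derived, here is a small sketch that computes them from a list of page records. The record fields (`service`, `triggered`, `acknowledged`, `resolved`), the sample data, and the `summarize` function are made up for illustration and do not reflect any Datadog export format or API.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical page history, for illustration only.
pages = [
    {"service": "checkout", "triggered": datetime(2024, 6, 1, 3, 0),
     "acknowledged": datetime(2024, 6, 1, 3, 4), "resolved": datetime(2024, 6, 1, 3, 40)},
    {"service": "checkout", "triggered": datetime(2024, 6, 2, 14, 0),
     "acknowledged": datetime(2024, 6, 2, 14, 2), "resolved": datetime(2024, 6, 2, 14, 20)},
    {"service": "auth", "triggered": datetime(2024, 6, 3, 9, 0),
     "acknowledged": datetime(2024, 6, 3, 9, 6), "resolved": datetime(2024, 6, 3, 9, 30)},
]


def minutes_between(start, end):
    """Elapsed time between two datetimes, in minutes."""
    return (end - start).total_seconds() / 60


def summarize(pages):
    """Page count, mean time to acknowledge and resolve, and pages per service."""
    return {
        "page_count": len(pages),
        "mean_minutes_to_ack": mean(
            minutes_between(p["triggered"], p["acknowledged"]) for p in pages),
        "mean_minutes_to_resolve": mean(
            minutes_between(p["triggered"], p["resolved"]) for p in pages),
        "pages_by_service": Counter(p["service"] for p in pages),
    }


summary = summarize(pages)
print(summary)
```

A per-service count like `pages_by_service` is often the fastest way to spot a recurring issue: a service that dominates the list is a natural candidate for a threshold adjustment or a runbook update.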

Improve your on-call experience today

Datadog On-Call brings monitoring, paging, and incident resolution into one unified platform, helping on-call engineers see which team members are most active in resolving issues and which services cause the most operational load. This reduces on-call burden and continuously improves team processes, making work more efficient and effective.

To try out Datadog On-Call for your team, sign up for our private beta. If you’re not already using Datadog, get started today with a 14-day free trial.