How to Create an Effective Paging Strategy

Empowered engineers and effective tools are the foundation of incident management, and having a solid on-call process can help facilitate both. In practice, however, many paging approaches have the opposite effect, often overwhelming responders and increasing burnout.

To create an effective paging strategy, organizations should focus responder attention on the most important issues and help facilitate a sense of ownership over them. When done successfully, this not only helps critical problems get resolved faster, it enables engineers to better understand how on-call work relates to both business goals and user satisfaction.

In this post, we’ll explore strategies for:

Designing symptom-based pages and monitors
Establishing on-call roles and responsibilities
Uniting paging, remediation, and analysis across platforms

Design symptom-based pages and monitors

Figuring out what to alert on can be a difficult balance. Monitors that are too high-level risk missing critical problems until they’ve already spiraled out of control, while noisy alerts can quickly lead to alert fatigue. You need to alert on the right issues, at the right sensitivity, and to the right people.

To do so, you should first aim to page on monitors that detect user impact. Because teams are often pushed to catch problems before they reach users and conduct fast root cause analysis, it can be tempting to expedite this process by alerting on system health metrics that seem to point to what’s going wrong, such as CPU or memory usage reaching a certain threshold. However, given the many upstream and downstream dependencies in distributed systems, these health metrics give a poor sense of severity and might not point to the true root cause in the first place. Additionally, alerts based on them tend to generate noise that distracts from critical issues, and they can miss those issues altogether.

By contrast, paging on user-facing symptoms keeps your response grounded in issues that are actively causing pain. These issues can be anything from noticeably slower loading times that increase user frustration to errors that completely prevent users from interacting with your app. In addition to more quickly surfacing critical incidents, this approach helps minimize the burden of on-call work on your team by cutting down on the amount of time they spend responding to hypothetical or unimportant problems. When your team members have a better understanding of user impact, they’re more likely to have a better sense of the importance of the issues that they’re working on and feel more invested in on-call work. Finally, these metrics often correlate more directly with SLOs, such as service availability, making it easier to prioritize issues based on business impact.

User-facing metrics you’ll want to focus on in your pages include:

Performance metrics, which can include both overall service latency as well as more granular metrics like Core Web Vitals.
Error metrics, like the percentage of 5xx server errors returned over time. You may need to pay careful attention to the severity you use for these metrics. For example, 4xx client errors can also be useful to alert on, but you’ll want to use a higher threshold as smaller spikes may simply be a result of user error.
Throughput metrics, usually consisting of the number of queries or requests being processed by your app. With throughput, you’ll often want to add downtimes to account for predictable dips in user activity, such as during app maintenance or certain holidays.
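
As a rough illustration of the error-metric guidance above, the following sketch evaluates one window of request counts and pages only on a user-facing symptom. The names (RequestWindow, should_page) and every threshold value are assumptions made for this example, not Datadog settings; you would tune them against your own traffic and SLOs.

```python
from dataclasses import dataclass


@dataclass
class RequestWindow:
    """Request counts for one evaluation window (for example, five minutes)."""
    total: int
    errors_5xx: int
    errors_4xx: int


# Hypothetical thresholds chosen for illustration only.
PAGE_THRESHOLD_5XX = 0.02  # page when more than 2% of requests fail server-side
PAGE_THRESHOLD_4XX = 0.15  # higher bar for client errors to absorb user mistakes
MIN_REQUESTS = 100         # ignore windows with too little traffic to judge


def should_page(window: RequestWindow) -> bool:
    """Return True when the window shows a user-facing error symptom."""
    if window.total < MIN_REQUESTS:
        return False
    rate_5xx = window.errors_5xx / window.total
    rate_4xx = window.errors_4xx / window.total
    return rate_5xx > PAGE_THRESHOLD_5XX or rate_4xx > PAGE_THRESHOLD_4XX
```

The minimum request count is one simple way to keep low-traffic periods from producing pages on a handful of failed requests; scheduled downtimes, as described above, are the better fit for predictable dips.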

Using user-facing metrics to identify your high-priority monitors can also help you decide which channel to use when notifying responders. Alerts that are sent directly to a mobile phone or dedicated paging device are the most disruptive for on-call team members, especially when they occur late at night or otherwise outside normal work hours. Ideally, on-call responders shouldn’t receive more than two of these pages per shift.

Instead, lower-priority health alerts that are more useful for long-term maintenance can be sent to ticketing or task management systems, such as Datadog Case Management. This enables them to be tracked without immediately disrupting other work. Datadog helps you achieve this by giving you the ability to fine-tune your notification settings directly within the monitor itself.

The notification configuration and automation options for a monitor.
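
The routing idea in this section can be sketched in a few lines. This is a simplified model, not how Datadog monitors are configured: Alert, Route, choose_route, and over_page_budget are hypothetical names, and the only rule encoded is "page on user-facing symptoms, ticket everything else," plus the two-pages-per-shift guideline mentioned above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Route(Enum):
    PAGE = "page"      # phone or paging device; interrupts the responder
    TICKET = "ticket"  # ticketing or task management queue; reviewed later


@dataclass
class Alert:
    monitor: str
    user_facing: bool  # True when the monitor tracks a user-facing symptom


# Guideline from the text: ideally no more than two pages per shift.
PAGE_BUDGET_PER_SHIFT = 2


def choose_route(alert: Alert) -> Route:
    """Page only for user-facing symptoms; send health alerts to a ticket queue."""
    return Route.PAGE if alert.user_facing else Route.TICKET


def over_page_budget(routes: List[Route]) -> bool:
    """True when a shift's pages exceed the budget, a hint to retune monitors."""
    return sum(route is Route.PAGE for route in routes) > PAGE_BUDGET_PER_SHIFT
```

A shift that regularly exceeds the budget is a signal to revisit which monitors page, rather than a reason to suppress pages during the shift.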

Establish on-call roles and responsibilities

Customizing your monitors helps you determine when to page, but how do you know who should be paged and what is required of them? A key part of the incident response process is determining who owns which issues, as well as when issues need to be delegated or escalated. Doing so helps you ensure that on-call work is fairly distributed across teams, recurring problems are prevented, and quality of response is prioritized over the number of pages responded to—all of which can help prevent confusion among on-call team members. Additionally, when your team members know who to contact when issues rise in severity, they’re able to resolve the problem faster, prevent future ones, and build trust with users.

To set clear lines of responsibility, you should establish the idea that being paged means owning the incident response process end to end, from initial triage to resolution and postmortem. With smaller issues, this may mean that the on-call team member resolves the issue themselves. For example, let’s say they’re paged about a sudden spike in latency on one of their services. Investigating the issue reveals that it stems from high CPU usage, which they can resolve by scaling up the problematic cluster. From here, they can determine whether this case fits their team’s criteria for autoscaling to prevent future issues.

For pages that escalate to incidents, however, the on-call team member usually becomes the incident commander. Once they acknowledge the page, this team member will then declare the incident and start performing initial triage activities. While not necessarily responsible for carrying out all remediation activities, they are responsible for making critical decisions about the incident, including the severity level, other necessary responders, and a mitigation plan.

Many incidents require collaboration from owners of other services, subject matter experts, and customer communication specialists. The idea that pages require an immediate response should be an important organization-wide standard; to uphold it, you should only page other responders for high-urgency matters. When you do page someone, include these key details about the incident:

The current impact
Why you think their service is involved in the incident
What you need them to do
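
One way to make these three details hard to skip is to require them in whatever template or tool composes the page. The sketch below is a minimal example under that assumption; PageRequest and its field names are hypothetical and not part of any paging product.

```python
from dataclasses import dataclass


@dataclass
class PageRequest:
    current_impact: str  # what users are experiencing right now
    why_involved: str    # why you think the recipient's service is involved
    action_needed: str   # what you need the recipient to do

    def render(self) -> str:
        """Build the page text, refusing to send a page with a blank detail."""
        fields = {
            "Impact": self.current_impact,
            "Why your service": self.why_involved,
            "What we need": self.action_needed,
        }
        missing = [label for label, value in fields.items() if not value.strip()]
        if missing:
            raise ValueError(f"page is missing: {', '.join(missing)}")
        return "\n".join(f"{label}: {value.strip()}" for label, value in fields.items())
```

Rejecting incomplete pages at composition time means the person being interrupted never has to reply just to ask what is going on.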

For issues that aren’t as urgent, you should go through normal communication channels, such as your organization’s messaging platform.

Additionally, for complex incidents, engineers with more experience coordinating large-scale incidents may take over the incident commander role. Depending on your organization’s tooling and processes, these responders should be paged automatically when the severity of an incident is established. In these situations, the original on-call commander often remains part of the incident response process to help provide context and troubleshoot issues related to their service.

It may not always be the case that an on-call team member can immediately respond to pages, as anything from personal emergencies to mobile network problems can present unexpected challenges. When the designated on-call team member can’t be reached, escalation policies should help you determine who will be paged next. Many paging systems will automatically escalate issues based on the policies you define, contacting the next team member in line if a response is not received in time. Teams should come up with a time limit for page acknowledgment that ensures issues are addressed without causing unnecessary stress—usually this is between five and 15 minutes.
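
To make the escalation logic concrete, here is a minimal sketch of a fixed-interval policy: each responder in the chain gets one acknowledgment window before the page moves to the next person. The function name, the chain format, and the 10-minute default are assumptions for illustration; real paging systems typically offer richer policies.

```python
from typing import Optional, Sequence


def responder_to_page(
    chain: Sequence[str],
    minutes_since_page: float,
    ack_timeout_minutes: float = 10,
) -> Optional[str]:
    """Return who should currently hold an unacknowledged page.

    Returns None once every responder in the chain has had a full
    acknowledgment window without responding.
    """
    if ack_timeout_minutes <= 0:
        raise ValueError("ack_timeout_minutes must be positive")
    if minutes_since_page < 0:
        raise ValueError("minutes_since_page cannot be negative")
    step = int(minutes_since_page // ack_timeout_minutes)
    return chain[step] if step < len(chain) else None
```

For example, with the chain ["primary", "secondary", "manager"] and the default timeout, the page stays with "primary" until the 10-minute mark and then moves to "secondary". What happens after the chain is exhausted (repeat the chain, notify a channel) is a policy decision this sketch leaves open.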

Unite paging, remediation, and analysis across platforms

To help on-call team members keep track of the work that they’re responsible for, whatever paging methodology you use should go hand in hand with the tools you use to manage issue remediation and other on-call activities. Otherwise, on-call team members can quickly become overwhelmed and lose track of work, especially when it comes to larger incidents with lots of follow-up tasks or long post-resolution analyses.

One way to ensure that your work is effectively organized is to use cross-platform features and integrations. By automatically sharing data from your paging systems with communication tools like Slack and work management tools like Jira, you can easily triage alerts, document key details, and loop in other responders. Correlating and relaying findings between different platforms can be a major source of both confusion and time loss during incident response. By having your troubleshooting data in the same place where your incident management tools live, you can easily pivot between the two in the same app. Then, you can send insights directly to your communication tools and ensure that no context is lost between responders.

Datadog Incident Response helps you easily alert, track, and investigate issues within a single platform. First, Datadog On-Call enables you to manage your paging strategy on a granular level. For example, you can see an overview of your longest-running pages and most-paged monitors—areas where you can potentially optimize your alerts to focus on more relevant metrics or to be less sensitive.

The Analytics page in Datadog On-Call, with statistics such as frequently paged monitors, users, and services displayed.
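
If you keep your own record of pages, the same two statistics are straightforward to compute. The sketch below assumes a hypothetical log of (monitor name, minutes until resolved) tuples; it does not call any Datadog API and is only meant to show the kind of analysis involved.

```python
from collections import Counter
from typing import List, Sequence, Tuple

# Hypothetical record shape: (monitor name, minutes until the page was resolved).
PageRecord = Tuple[str, float]


def most_paged_monitors(pages: Sequence[PageRecord], top_n: int = 3) -> List[Tuple[str, int]]:
    """Count pages per monitor and return the noisiest monitors first."""
    return Counter(name for name, _ in pages).most_common(top_n)


def longest_running_pages(pages: Sequence[PageRecord], top_n: int = 3) -> List[PageRecord]:
    """Return the pages that stayed open the longest."""
    return sorted(pages, key=lambda page: page[1], reverse=True)[:top_n]
```

Monitors that top the first list are candidates for a less sensitive threshold or a different metric, while pages that top the second may point to missing runbooks or unclear ownership.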

When an on-call team member is paged, On-Call enables them to easily acknowledge the page, evaluate the severity, and begin investigating. Teams can create dynamic runbooks within Datadog Notebooks, helping on-call responders quickly determine the first steps they should take to resolve the issue. Responders can also easily pivot from the page details to relevant monitors and service metrics to view critical context around the problem.

For larger issues, on-call team members can declare incidents directly from a page. Within an incident, Datadog automatically collects all context related to an incident, including conversations or meetings that may be happening on other platforms. You can easily choose key aspects of these discussions to highlight within your incident timeline alongside other critical actions.

The timeline for an incident in Incident Management, showing a page being received and escalated to an incident.

To make handling high-urgency issues even easier, you can access incident response features directly within the Datadog mobile app. With the app, you can pivot from receiving a page to triaging the issue and declaring an incident without ever leaving your phone. The app also includes access to Datadog Notebooks, so you can begin consulting your runbooks and strategizing a response right away. By the time you pivot to your computer for in-depth investigation, you’ve already alerted the relevant responders and kicked off communication.

When you do need to pivot between platforms, Datadog provides over 850 integrations to make this process seamless. These include Slack, Teams, Zoom, and Jira. And once an issue has been resolved, you can access Datadog Case Management to organize any follow-up tasks and delegate them as needed.

Optimize your paging strategy for better incident response

On-call shifts have the potential to burn out your team members, lowering productivity and potentially leading to more issues going unresolved or overlooked. Additionally, recurring issues due to inefficient on-call and incident management processes often mean higher costs in the long run.

By refining your alerts to highlight problems that have the most customer impact and integrating features that help your team members track the entire lifecycle of an incident, you can empower engineers and keep their focus on high-impact work. Effective on-call strategy and tools require continuous fine-tuning, but having clear processes in place can help your team members feel confident and in control even in the face of ever-changing situations.

Datadog Incident Response helps you better manage your on-call strategy, from the initial page through remediation and post-incident analysis. You can get started with Datadog On-Call using our documentation. Or, if you’re not yet a Datadog user, you can sign up for a 14-day free trial.
