How We Structure On-Call Rotations at Datadog

Structuring successful on-call rotations

Datadog teams that operate critical services—those whose availability must be maintained to prevent user-facing issues and meet SLA obligations—create on-call rotations to ensure continuous uptime. These rotations designate the team members who will respond to pages when their service experiences issues.

The size of the team largely determines the structure of a rotation: the number of responders and the days, times, and frequency of their on-call shifts. Each team designs its rotation to balance service coverage with a sustainable workload for responders, ensuring availability without overwhelming them. Our scheduling process takes into account responders’ preferences, including their preferred on-call times, inconvenient times, and scheduled leave dates. Gathering this input is key to building a successful rotation that accommodates individual needs and minimizes the impact of on-call work on responders’ personal lives.

We’re careful not to put a responder on call too often, which can raise the risk of burnout and prevent engineers from focusing on their usual priorities. But we also aim to ensure that each responder has on-call duty often enough to stay up to date and invested in their team’s on-call procedures. Responders may be less motivated to improve the on-call process if it represents only a small fraction of their time.

In this section, we’ll look at the guidelines we use to shape on-call rotations and the process we use to create and manage them.

Screenshot of the Datadog On-Call schedule page, displaying a monthly calendar with color-coded on-call assignments for team members. The left sidebar shows rotation details, including start time, shift length, and assigned personnel.

Accounting for team size

Team size shapes how the rotation works and can affect the experience of everyone in the rotation. We generally aim to create on-call rotations of six to eight engineers so that each of them serves on-call duty no more than once per month—even with scheduling variations like attrition, leave time, and sick time.
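
To make the arithmetic behind this guideline concrete, here is a minimal sketch that estimates how often each engineer comes up for duty in a simple round-robin. The one-week turn length and the function name `weeks_between_turns` are assumptions made for illustration; they are not figures or tooling described in this article.

```python
def weeks_between_turns(rotation_size: int,
                        turn_length_weeks: float = 1.0,
                        unavailable: int = 0) -> float:
    """Estimate the gap, in weeks, between the starts of one engineer's turns.

    Assumes a plain round-robin: every available engineer takes one turn,
    then the cycle repeats. `unavailable` models people out on leave.
    """
    available = rotation_size - unavailable
    if available <= 0:
        raise ValueError("a rotation needs at least one available engineer")
    return available * turn_length_weeks


# Compare small and larger rotation sizes (one-week turns are an assumption).
for size in (3, 4, 6, 8):
    print(f"{size} engineers: on call every {weeks_between_turns(size):.0f} weeks")

# A six-person rotation with one engineer on leave.
print(weeks_between_turns(6, unavailable=1))
```

Under these assumptions, a six-person rotation still spaces turns five weeks apart when one person is out, while a three-person rotation brings each engineer back every three weeks.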

If a team is large and dispersed across multiple time zones, we sometimes create a rotation of several engineers from each location. With this arrangement, shifts can be scheduled to follow the sun and distribute on-call responsibilities across time zones in a way that minimizes nighttime work for everyone.
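
One way to picture a follow-the-sun schedule is to give each hour of the day to the site where it is closest to early afternoon locally. The sketch below does that for three hypothetical sites. The site names, fixed UTC offsets (daylight saving time is ignored), and the 13:00 target are all assumptions chosen for illustration, not a description of how any Datadog team is scheduled.

```python
# Hypothetical sites with fixed UTC offsets; daylight saving time is ignored.
SITES = {"New York": -5, "Paris": 1, "Tokyo": 9}

TARGET_LOCAL_HOUR = 13  # assumed "middle of the working day"


def local_hour(utc_hour: int, utc_offset: int) -> int:
    """Convert an hour of the day in UTC to the local hour at a site."""
    return (utc_hour + utc_offset) % 24


def distance_from_target(hour: int) -> int:
    """Circular distance, in hours, between a local hour and the target hour."""
    diff = abs(hour - TARGET_LOCAL_HOUR)
    return min(diff, 24 - diff)


def follow_the_sun(sites: dict[str, int]) -> dict[int, str]:
    """Map each UTC hour to the site whose local time is nearest the target."""
    return {
        utc_hour: min(
            sites,
            key=lambda site: distance_from_target(local_hour(utc_hour, sites[site])),
        )
        for utc_hour in range(24)
    }


schedule = follow_the_sun(SITES)
for utc_hour, site in schedule.items():
    print(f"{utc_hour:02d}:00 UTC -> {site} "
          f"(local {local_hour(utc_hour, SITES[site]):02d}:00)")
```

With these example offsets, every hour is covered by a site where it is daytime locally, which is the property a follow-the-sun rotation is meant to provide.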

Many of our teams aren’t big enough to run a rotation of six to eight, so we encourage them to share responders in a cross-team rotation. Even so, we often build rotations of only three or four responders. Our small teams are more likely to operate from a single location and usually can’t create a follow-the-sun rotation based on multiple time zones. Instead, they use 24/7 rotations in which responders take turns covering all hours, such as in 12- or 24-hour shifts. Each on-call engineer is typically scheduled for only a few days at a time, since the after-hours work has a greater impact on engineers’ work-life balance. With brief turns in the on-call role and fewer teammates available to alternate, small teams’ rotations often cycle faster, so each member is on duty more frequently.

Optimizing shift length

Each rotation’s shift schedule aims to maximize the effectiveness of the on-call response along with the well-being and productivity of the engineers in the rotation. Shift lengths vary quite a bit across Datadog—generally between eight and 12 hours—due to team characteristics such as size, geographical distribution, and the scheduling constraints and preferences of the responders. Smaller teams generally use shorter shifts to offset the fact that responders work on-call duty more frequently. Short shifts also tend to work better for teams whose on-call experience is particularly intense. For example, rotations that see heavy pager loads or that require fast response might schedule shifts of only eight hours to help minimize fatigue and prevent burnout.

But short shifts can increase the risk of miscommunication. Each time an outgoing responder hands off to an incoming one, they need to share information about unresolved issues and temporary fixes to ensure continuity. Because frequent handoffs increase the risk of disruption, we use 12 hours as a guideline for shift lengths, which is long enough to minimize handoffs but still short enough to manage the risk of fatigue.
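
The trade-off between shift length and handoffs is simple to quantify. The sketch below counts handoffs per week for a rotation with continuous coverage, assuming every shift ends in exactly one handoff and that shifts divide the day evenly. The function name is hypothetical.

```python
def handoffs_per_week(shift_hours: int) -> int:
    """Count weekly handoffs for 24/7 coverage with fixed-length shifts.

    Assumes one handoff at the end of every shift and a shift length
    that divides evenly into 24 hours.
    """
    if shift_hours <= 0 or 24 % shift_hours != 0:
        raise ValueError("shift length must divide evenly into 24 hours")
    return (24 // shift_hours) * 7


for hours in (8, 12, 24):
    print(f"{hours}-hour shifts: {handoffs_per_week(hours)} handoffs per week")
```

Moving from 8-hour to 12-hour shifts cuts weekly handoffs from 21 to 14 under these assumptions, which is the effect the 12-hour guideline relies on.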

Structuring responsibilities

During their on-call workdays, engineers should focus on on-call work as much as possible. This means responding to pages, troubleshooting, and coordinating incidents, but it also includes maintaining resources such as dashboards, monitors, runbooks, and other documentation that guides our on-call practice. These responsibilities take the place of the work they would normally do to contribute to their team’s feature velocity. We believe that separating feature work from on-call responsibilities improves service reliability, leads to more accurate development timelines, and ultimately increases the team’s velocity by preventing urgent and unplanned work.

Providing support for our on-call staff

We give responders comprehensive support before, during, and after their on-call duties to position them for success. In this section, we’ll show you how we train responders, equip them with essential resources, and support them with the redundancies and hands-on leadership they need to effectively maintain the reliability of our services.

Training

Before engineers join a rotation, we train them in the responsibilities and expectations that come with on-call duty. It’s important for all responders to understand that our on-call practice is critical to our commitment to our customers, and to gain the expertise they need to triage and mitigate issues as they arise. They learn key requirements such as when and how they need to be available for on-call work and how quickly they need to be able to acknowledge an issue. If an issue escalates into an incident, the on-call engineer takes ownership, seeing it through to resolution and coordinating any necessary follow-up to prevent the problem from recurring.

Because these activities often span multiple shifts, we also provide training on how to execute effective handoffs. This includes documenting the state of the relevant systems to note any temporary fixes or abnormalities that the next on-call engineer should know about. Responders are also trained to evaluate and refine any alerts that triggered during their shift. This is an opportunity for each responder to contribute to maintaining an effective paging strategy, such as by revising the severity or threshold of an alert that paged them.

Resources

In addition to training, we provide responders with resources and tools to use while they’re on call. Datadog On-Call is our primary resource for conducting on-call activities, as it seamlessly integrates monitoring, paging, and incident response into a single platform and helps us collaborate efficiently to respond to emerging issues.

Responders can easily set up On-Call on their mobile device to receive pages and manage their on-call responsibilities. Teammates and stakeholders throughout the organization can then page that individual from within the Datadog platform.

On-call responders also receive pages automatically from alert notifications, for example, if their service’s performance breaches a defined threshold. Teams create and manage their own alerts, and by sending real-time notifications via Datadog On-Call, they can effectively monitor their own services and react quickly at the first signs of an issue.

If an issue meets the criteria for an incident, the on-call responder uses Datadog Incident Management to declare and manage the incident, coordinating mitigation activities and communication until the incident is resolved. The screenshot below shows the form for creating an incident in Datadog using default attributes. You can customize this form to add fields specific to your needs, ensuring that Incident Management supports your organization’s existing process.

Datadog Incident Management interface with a Declare Incident form, showing fields for title, type, severity, and incident commander.

Backup

We believe it’s important to support on-call responders by ensuring that they are not working in isolation, so we schedule secondary responders in our rotations whenever possible. Just as our on-call practice helps us respond to unexpected issues with service availability, our secondary responders keep us safe when a primary responder becomes unavailable. We define escalation policies in On-Call to automatically page the secondary responder if the primary becomes unavailable due to illness, internet outages, or other circumstances that we can’t predict. This ensures that no pages are missed, and it can help reduce the stress of the on-call experience and improve responders’ performance.
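
The general shape of an escalation policy can be sketched in a few lines. The example below is a generic model, not the Datadog On-Call API: the `EscalationStep` class, the `current_target` function, and the timeout values are all hypothetical, and it treats "unavailable" as "has not acknowledged the page within the step's timeout."

```python
from dataclasses import dataclass


@dataclass
class EscalationStep:
    responder: str            # who gets paged at this step
    ack_timeout_minutes: int  # how long to wait for an acknowledgment


def current_target(steps: list[EscalationStep], minutes_unacknowledged: int) -> str:
    """Return who should be paged after a page has gone unacknowledged this long."""
    if not steps:
        raise ValueError("an escalation policy needs at least one step")
    remaining = minutes_unacknowledged
    for step in steps:
        if remaining < step.ack_timeout_minutes:
            return step.responder
        remaining -= step.ack_timeout_minutes
    # Every timeout has elapsed: keep paging the last responder in the policy.
    return steps[-1].responder


# Illustrative timeouts only; real policies set their own values.
policy = [
    EscalationStep("primary", ack_timeout_minutes=5),
    EscalationStep("secondary", ack_timeout_minutes=10),
]

for minutes in (0, 4, 5, 20):
    print(f"{minutes:>2} min unacknowledged -> page {current_target(policy, minutes)}")
```

The design choice worth noting is that the primary never has to hand off explicitly. The page moves to the secondary whenever an acknowledgment fails to arrive in time, whatever the cause.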

Direct manager involvement

Our managers participate in their teams’ on-call rotations so that they stay familiar with procedures and are incentivized to improve the experience. Managers can better understand the intensity of their teams’ on-call work, and teams gain trust in their leadership when they see their manager participating in their rotation.

Some teams may see a higher pager load than others, and it’s important for a team’s leader to be aware if the on-call workload becomes unsustainable. Managers’ experience on call should inform how they manage their rotations—for example, by scheduling shorter turns if they know their responders routinely deal with multiple pages per shift. A sustained high pager load also signals a need to collaborate with stakeholders and prioritize reliability work that improves the affected services and the on-call experience.

Hands-on managers can also recognize when another responder’s on-call experience has been particularly stressful. They understand what a typical shift in their rotation looks like (for example, no more than two pages and one incident) and can see when a team member has carried an unusually large burden. Our managers encourage and enable responders to take time off to recover after a difficult on-call shift, which helps prevent burnout.
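
A baseline like the one in the example above (no more than two pages and one incident per shift) can be turned into a simple check. The sketch below flags shifts that exceed it. The `ShiftSummary` record, the sample data, and the responder names are hypothetical, and the thresholds simply reuse the example figures rather than any official limit.

```python
from dataclasses import dataclass

# Baseline taken from the example in the text; each team would set its own.
TYPICAL_MAX_PAGES = 2
TYPICAL_MAX_INCIDENTS = 1


@dataclass
class ShiftSummary:
    responder: str
    pages: int
    incidents: int


def exceeded_typical_load(shift: ShiftSummary) -> bool:
    """True if a shift saw more pages or incidents than the team's baseline."""
    return shift.pages > TYPICAL_MAX_PAGES or shift.incidents > TYPICAL_MAX_INCIDENTS


# Made-up shift data for illustration.
shifts = [
    ShiftSummary("alex", pages=1, incidents=0),
    ShiftSummary("sam", pages=6, incidents=2),
    ShiftSummary("jo", pages=2, incidents=1),
]

heavy = [shift.responder for shift in shifts if exceeded_typical_load(shift)]
print("Shifts above the typical load:", heavy)
```

A check like this only surfaces candidates for follow-up. Deciding whether someone needs recovery time remains a judgment call for the manager and the responder.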

Build your sustainable on-call practice with Datadog

Ensuring the reliability of our services is a key priority for teams at Datadog. We implement sustainable on-call rotations that prioritize team well-being and enable each engineer to efficiently and effectively respond to emerging issues. See our documentation to learn more about how Datadog can support your on-call and incident management processes. And if you’re not yet using Datadog, get started today with a free 14-day trial.
