When troubleshooting an issue or remediating an outage, engineers need tools that are accessible and easy to use, closely integrated with their services, and tailored to their teams’ specific requirements. In reality, they often have to jump between their monitoring platform and the other tools their team uses (such as internal full-stack tools, the CLI, and cloud provider consoles) to get a clear picture of the status and scope of the problem while also communicating internally, informing customers, and remediating the issue. This creates bottlenecks and increases mean time to resolution (MTTR).
Datadog App Builder—now generally available—is a low-code solution that enables you to create apps in Datadog that facilitate collaboration and direct action by teams throughout your organization. These apps integrate natively into Datadog’s monitoring platform, providing interactive visibility into your telemetry data. Apps are easy to build using drag-and-drop UI components, Datadog monitoring tools and data sources, custom HTTP requests and JavaScript code, and 550+ out-of-the-box actions for popular platforms and a host of AWS, Azure, and Google Cloud services.
In this post, we’ll show how you can use a custom runbook app built with App Builder to accelerate remediation by centralizing context and action into one unified view in Datadog.
Accelerate remediation by using a runbook app
App Builder enables you not only to create apps that identify issues in your systems but also to take direct, immediate action to remediate those issues from the same view within Datadog. Let’s look at an example of a runbook app that enables you to troubleshoot and remediate an incident.
Let’s say your team owns the checkout service for an ecommerce website and uses Datadog, GitLab, Atlassian Opsgenie, and Atlassian Statuspage. A monitor notifies you that the service is experiencing an outage. By reviewing telemetry data on the service health dashboard, you discover that the root cause is a bug introduced by a broken commit. To remediate, you decide to roll back the commit and retry the GitLab deployment. To start this process, you open the Checkout Service Runbook app, which is linked directly as a runbook on the service page of Service Catalog for easy access. The app enables you to trigger an incident, update the public status page, roll back the commit, and kick off a new build in the pipeline, all without leaving the Datadog platform.
Your runbook app is organized into four sections, with each section focused on a specific task. The first section shows a custom snapshot of your service alongside its most relevant monitors, giving you a quick summary of its health.
Scrolling down to the second section, you can immediately create an incident directly from within the app by clicking the “Create New Incident” button. You’re also given useful context from Opsgenie, including the current on-call engineer, the number of active incidents on this service, and any past incidents associated with it. If you need to page the on-call engineer, you can click the “Page on-call” button and do that directly from the app as well.
Now that you’ve created the incident, your next priority is to inform your customers about the outage. To update the public status page on Atlassian Statuspage, you simply click “Modify status,” which provides a modal where you can adjust the incident impact level, status, name, and description before posting the update.
With your customers informed, you turn your attention to fixing the issue. The runbook app pulls data directly from GitLab, so you can easily review the pipelines and commits associated with each of the most recent deployments, then roll back the problematic commit with just a few clicks.
Finally, it’s time to update your customers and team about the remediation steps taken. You scroll back up to the incident management and status page sections of the app and change the statuses to “In Progress” and “Identified” respectively, provide an explanation of your solution, and indicate the expected turnaround time for the systems to go back to normal.
How we built this app
To build this app, we added actions from the Actions Catalog that communicate with third-party services. In this case, we used App Builder’s Opsgenie and GitLab actions, as well as custom HTTP actions to call public APIs for operations that don’t have out-of-the-box App Builder actions (such as listing Statuspage updates and retrying a deployment in GitLab). We also used some additional JavaScript in the post-query transform to format data, such as dates, for display.
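As a rough illustration of the kind of date formatting a post-query transform might do, here is a minimal sketch in plain JavaScript. It is not the code from the app described above: the function names (formatTimestamp, transformDeployments) and the record fields (id, status, created_at) are assumptions made for this example, and how the function is wired into a query's transform step depends on App Builder itself.

```javascript
// Hypothetical helper: turn an ISO 8601 timestamp into a short,
// human-readable UTC string such as "2024-05-01 14:03 UTC".
function formatTimestamp(isoString) {
  const date = new Date(isoString);
  if (Number.isNaN(date.getTime())) {
    // Mark unparseable or missing values instead of throwing.
    return "unknown";
  }
  // toISOString() always returns UTC as YYYY-MM-DDTHH:mm:ss.sssZ,
  // so fixed-position slices are safe here.
  const iso = date.toISOString();
  return `${iso.slice(0, 10)} ${iso.slice(11, 16)} UTC`;
}

// Hypothetical transform: reshape raw deployment records (assumed field
// names) into rows that are ready to show in a table component.
function transformDeployments(deployments) {
  return deployments.map((deployment) => ({
    id: deployment.id,
    status: deployment.status,
    deployedAt: formatTimestamp(deployment.created_at),
  }));
}
```

Formatting in UTC keeps the output identical for every viewer. If local times are preferred, Intl.DateTimeFormat is the standard alternative, at the cost of locale-dependent output.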
This is just one example of the many types of apps that you can build with App Builder. Others include database consoles that empower customer support teams to solve a wider range of issues on their own, custom analyses that surface new insights, custom visualizations of your company’s health for leadership teams, and portals that let developers easily spin up new GitHub projects on their own.
Build self-service tools to streamline DevOps processes
By helping DevOps teams not only collect and analyze monitoring data but also perform remediations, Datadog App Builder expands the scope of what Datadog can do. By enabling you to pull in data from and interact with key Datadog products as well as many integrated services and platforms, App Builder can help personnel across your organization take action on their monitoring insights. App Builder can improve teams’ productivity, facilitate smoother collaboration, and limit the cost of outages.
App Builder is now generally available for all Datadog customers. For more information, see our App Builder documentation. If you’re brand new to Datadog, sign up for a free trial.