Bits AI: Reimagining the Way You Run Operations With Autonomous Investigations | Datadog

Author Kai Xin Tai

Published: June 26, 2024

Last year, we introduced Bits AI, a generative AI-based copilot designed to streamline the process of investigating alerts and responding to incidents. Here’s how it works: As you’re debugging an issue, you can converse with Bits AI to get insights from your observability and security data, find active issues, summarize incidents, search your organization’s knowledge base, and generate code fixes.

But ultimately, our goal is to automate as much of the DevSecOps lifecycle as possible, which means that having Bits AI available to answer your questions on demand, while valuable, is not enough. To approach this more advanced level of automation, Bits AI should also actively work alongside you during investigations and effectively drive towards resolution, much like how a human teammate would.

Today, we’re excited to announce the next evolution of Bits AI. We’ve transformed Bits AI into an autonomous agent capable of performing complex operational tasks—such as investigating alerts and coordinating incidents—without constant human prompting. The process is simple: You determine ahead of time which monitor alerts you’d like Bits to help investigate. Then, when you’re paged, Bits AI immediately springs into action. Drawing upon its comprehensive knowledge of your systems, documented troubleshooting procedures, and best practices, it starts to identify potential root causes. This way, by the time you get to your laptop, you have investigation notes ready to review and take action upon. If you decide to escalate the alert into a full-fledged incident, Bits AI joins the response effort as an additional responder, surfacing key telemetry data, handling customer communications, and continuously monitoring for signs of recovery.

Let’s take a look at an example of what Bits AI can do today.

Investigating alerts

To illustrate the power of Bits AI today, let’s walk through an example scenario that involves my team and me. The scenario begins when I am first alerted, in my team’s Slack channel dedicated to ops, about elevated errors on a service I own called restaurants-api. Within seconds, Bits AI tells me that it has started looking into the issue and is working in a Datadog Notebook. Since I’ve configured this monitor to both notify me in Slack and open a case in Datadog Case Management, Bits AI responds in both places.

Bits AI autonomous activity in Slack in response to triggered alert

In the notebook, we can see that Bits AI has started formulating a plan to diagnose the issue. While Bits AI is capable of building investigation plans without any additional context, you can help it better understand your preferred troubleshooting procedure by adding information to the monitor’s description. This can take the form of a link to a Confluence page, a Datadog Notebook, or just plain-language instructions. The investigation plan that Bits AI creates includes a number of steps to identify the potential cause of the issue. Just as an engineer might scan through dashboards, look for interesting patterns in error logs, analyze slow traces, or find change events and Watchdog alerts when there’s an issue, we’ve built these same capabilities into tools that Bits can leverage to accomplish its goal.

The steps in the plan are executed one at a time. Based on the results, existing steps may be reordered, or new ones may be created. As high-confidence signals are found, Bits AI mentions them in the Slack thread and Case Management case. This continues until the investigation concludes.
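The loop described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Bits AI is implemented: the `Step` and `Finding` types, the numeric `confidence` score, the `report_threshold` value, and the `max_steps` cap are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Finding:
    summary: str
    confidence: float  # hypothetical score between 0.0 and 1.0
    follow_ups: List["Step"] = field(default_factory=list)


@dataclass
class Step:
    name: str
    priority: int  # lower values run first
    run: Callable[[], Optional[Finding]]


def investigate(plan: List[Step], report_threshold: float = 0.8,
                max_steps: int = 20) -> List[str]:
    """Run steps one at a time, re-planning after each result."""
    reported: List[str] = []
    executed = 0
    while plan and executed < max_steps:
        plan.sort(key=lambda s: s.priority)  # reorder the remaining steps
        step = plan.pop(0)
        finding = step.run()
        executed += 1
        if finding is None:
            continue  # this step returned nothing meaningful
        plan.extend(finding.follow_ups)  # results may create new steps
        if finding.confidence >= report_threshold:
            # Only high-confidence signals are surfaced to responders.
            reported.append(f"{step.name}: {finding.summary}")
    return reported


# Hypothetical steps standing in for real telemetry queries.
trace_step = Step("analyze slow traces", 0,
                  lambda: Finding("errors originate upstream in takeouts-rpc", 0.9))
log_step = Step("scan error logs", 1,
                lambda: Finding("spike in 5xx responses", 0.6, follow_ups=[trace_step]))
dash_step = Step("check dashboards", 2, lambda: None)

notes = investigate([dash_step, log_step])
```

In this sketch the log scan is not confident enough to be reported on its own, but it adds a trace analysis step that jumps ahead of the dashboard check and produces the one finding that gets surfaced.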

Coordinating incident response and remediation

Bits AI can help you not only investigate alerts but also orchestrate incidents that involve multiple responders. In our example, Bits AI has determined from APM and RUM data that this issue is impacting customers, so it suggests declaring an incident. Datadog Incident Management streamlines communication by providing easy access to a dedicated Slack channel, a conference bridge, and the other collaboration tools you use.

Right at the top, you can see that Bits AI has carried over context from the alert thread. In this example, Bits has identified that the errors on restaurants-api originate from an upstream service, takeouts-rpc, and recommends paging the team that owns that upstream service. Here, Bits has used APM’s topological information together with metadata in the Service Catalog to understand which dependencies are impacted, who owns them, and who’s on call at the moment. Bits AI has also recommended an update to the status page to inform customers of the ongoing issue.
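To make the dependency and ownership lookup concrete, here is a minimal sketch of how topology and catalog metadata could be combined into a paging recommendation. It does not call any Datadog API. The `DEPENDENCIES` and `SERVICE_CATALOG` dictionaries, the `menus-db` service, and the team and on-call names are hypothetical stand-ins for what APM and the Service Catalog would provide.

```python
from typing import Dict, List, Optional, Set

# Hypothetical topology: each service maps to the services it calls.
DEPENDENCIES: Dict[str, List[str]] = {
    "restaurants-api": ["menus-db", "takeouts-rpc"],
    "takeouts-rpc": [],
    "menus-db": [],
}

# Hypothetical ownership metadata keyed by service name.
SERVICE_CATALOG: Dict[str, Dict[str, str]] = {
    "restaurants-api": {"team": "restaurants-team", "on_call": "responder-a"},
    "takeouts-rpc": {"team": "takeouts-team", "on_call": "responder-b"},
    "menus-db": {"team": "storage-team", "on_call": "responder-c"},
}


def find_error_origin(service: str, dependencies: Dict[str, List[str]],
                      erroring: Set[str],
                      seen: Optional[Set[str]] = None) -> str:
    """Follow erroring dependencies until a service has none of its own."""
    if seen is None:
        seen = set()
    seen.add(service)  # guard against cycles in the dependency graph
    for dep in dependencies.get(service, []):
        if dep in erroring and dep not in seen:
            return find_error_origin(dep, dependencies, erroring, seen)
    return service


def recommend_page(alerting_service: str, dependencies: Dict[str, List[str]],
                   catalog: Dict[str, Dict[str, str]],
                   erroring: Set[str]) -> Dict[str, str]:
    """Return the service where errors originate plus who to page for it."""
    origin = find_error_origin(alerting_service, dependencies, erroring)
    owner = catalog[origin]
    return {"service": origin, "team": owner["team"], "on_call": owner["on_call"]}


recommendation = recommend_page(
    "restaurants-api", DEPENDENCIES, SERVICE_CATALOG,
    erroring={"restaurants-api", "takeouts-rpc"},
)
```

Here the alert fires on restaurants-api, but because takeouts-rpc is also erroring and has no erroring dependencies of its own, the recommendation points at the team that owns takeouts-rpc.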

Not only is Bits AI aware of the incident you’re currently in, but it also keeps track of other ongoing incidents. This way, it can flag any potentially related incidents to prevent duplicative efforts. As new responders join the primary channel, Bits AI automatically brings them up to speed with a summary highlighting the observed issue, contributing factors, and customer impact.

Throughout the incident, Bits AI will proactively chime in if it has valuable insights that will guide you towards the root cause. These may include any additional telemetry data to supplement what responders have already found, suggested workflows, or information from your runbooks. And once the incident is resolved, Bits AI will wrap up its investigation with a final summary and a first draft of a postmortem, which your teams can continue iterating on.

Core principles

When building Bits AI, we wanted to ensure that we were adhering to a few core principles:

  • Steerability: We wanted Bits AI to be steerable. It continuously listens to the conversation in the associated Slack thread and to comments in the case, and it dynamically updates its plan based on any new information it receives.
  • Explainability: As Bits AI investigates an issue, you can follow its full chain of thought in the notebook it creates at the start of the investigation. Complex issues that span multiple services can often result in lengthy investigations involving dozens of steps, not all of which return meaningful results. To cut through the noise, Bits AI posts to the Slack thread and the case only those findings it is highly confident are relevant. Each finding also includes a link back to the raw data in the Datadog platform so that you can validate it yourself.
  • Human-in-the-loop: Finally, we’ve designed Bits AI to be a copilot that keeps you in the driver’s seat. Bits AI only recommends remediation actions; it will not make any changes to your systems before receiving confirmation from you. This allows you to gauge the risks of a change and weigh them against its impact.
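The human-in-the-loop principle amounts to a confirmation gate in front of every change. The sketch below shows one way such a gate could look. The `RemediationAction` type, the `confirm` callback, and the rollback action are hypothetical names for this example and do not describe Bits AI's actual interface.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RemediationAction:
    description: str
    apply: Callable[[], None]  # the change itself, deferred until approved


def propose_and_apply(action: RemediationAction,
                      confirm: Callable[[str], bool]) -> bool:
    """Recommend an action and apply it only if a human confirms."""
    if not confirm(action.description):
        return False  # declined: nothing is changed
    action.apply()
    return True


# Hypothetical example: a list records whether any change was made.
changes: List[str] = []
rollback = RemediationAction(
    description="Roll back the latest takeouts-rpc deployment",
    apply=lambda: changes.append("rollback"),
)

declined = propose_and_apply(rollback, confirm=lambda description: False)
changes_after_decline = list(changes)

approved = propose_and_apply(rollback, confirm=lambda description: True)
```

Passing the approval step in as a callback keeps the gate independent of where confirmation comes from, whether that is a Slack button, a case comment, or a command-line prompt.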

Get started with Bits AI

By harnessing the power of state-of-the-art large language models to reason, make decisions, and orchestrate complex processes, we can help you operate more efficiently across the end-to-end DevSecOps lifecycle and drive better business outcomes. We’re working every day to make Bits AI smarter and more capable of assisting with a wider range of your daily operational tasks. We’re also continuing to invest in AI research, including new ways to evaluate Bits AI’s reasoning, investigation, and interpretation skills.

Bits AI’s autonomous investigation capabilities are now available in Preview; sign up here. You can also get started immediately with our generally available AI features built into Datadog Incident Management. If you don’t already have a Datadog account, you can sign up for a free trial to get started.