Troubleshoot and Resolve Pod Issues Easily With Kubernetes Active Remediation

As organizations increasingly turn to Kubernetes to support their cloud-native applications, it has become critical for teams to analyze and respond to the dense telemetry data related to this orchestration layer. However, even as DevOps and infrastructure teams gain expertise in scaling and managing large Kubernetes environments, application development teams often lack the Kubernetes experience needed to confidently launch investigations, gather context, and determine the root causes of Kubernetes issues. And even once a root cause is identified, developers who are new to Kubernetes are typically not familiar enough with complex Kubernetes commands or their impacts to safely apply the needed changes.

Datadog Kubernetes Active Remediation helps solve these problems by offering clear, contextual recommendations and next steps, launched from the Kubernetes Explorer, for the discovery and automated remediation of Kubernetes-specific errors. By using this new feature, application teams can get clarity about how to proceed with Kubernetes troubleshooting, gain visibility into related issues, and take immediate action to resolve them.

In this post, we’ll show you how Kubernetes Active Remediation allows you to assess and fix Kubernetes workload issues by:

Providing context and root cause analysis for Kubernetes errors
Giving you best practices and actions to resolve issues

See recommendations for your most urgent Kubernetes issues

Cryptic Kubernetes errors such as CrashLoopBackOff are typically difficult even for experienced engineers to interpret and respond to. Properly diagnosing these errors requires expertise that can take years to acquire. On top of this, if a Kubernetes-related alert is tied to a large cluster, simply reviewing all the related telemetry is often time-consuming, which further complicates troubleshooting. Despite these difficulties, it is essential to remediate these errors quickly because of the central role Kubernetes plays in supporting applications. If the errors are widespread among pods, they could indicate that the services hosted within those pods are failing.
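To make the error itself more concrete: Kubernetes reports CrashLoopBackOff as the waiting reason on a container's status. The following is a minimal sketch, assuming a pod list shaped like the output of `kubectl get pods -o json`. The function name and the sample data are hypothetical, and this is not a description of how Datadog detects these errors.

```python
def crashlooping_containers(pod_list):
    """Return (namespace, pod, container) for every container waiting in CrashLoopBackOff."""
    hits = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        for cs in pod.get("status", {}).get("containerStatuses", []):
            # A container's state holds exactly one of: waiting, running, terminated.
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                hits.append((meta.get("namespace"), meta.get("name"), cs.get("name")))
    return hits


# Hypothetical sample data in the shape of `kubectl get pods -o json`.
SAMPLE_PODS = {
    "items": [
        {
            "metadata": {"namespace": "shop", "name": "checkout-abc12"},
            "status": {"containerStatuses": [
                {"name": "app", "restartCount": 12,
                 "state": {"waiting": {"reason": "CrashLoopBackOff"}}},
            ]},
        },
        {
            "metadata": {"namespace": "shop", "name": "cart-def34"},
            "status": {"containerStatuses": [
                {"name": "app", "restartCount": 0,
                 "state": {"running": {"startedAt": "2024-01-01T00:00:00Z"}}},
            ]},
        },
    ]
}

if __name__ == "__main__":
    for namespace, pod, container in crashlooping_containers(SAMPLE_PODS):
        print(f"{namespace}/{pod} container={container} is in CrashLoopBackOff")
```

A list like this says which containers are restarting, but not why, which is the gap the rest of this post addresses.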

To help you investigate these time-sensitive alerts within your Kubernetes environment, Datadog’s Kubernetes Active Remediation feature makes a Start Remediation button available when you hover over an alert in Kubernetes Explorer.

A Kubernetes error whose pop-up window shows a Start Remediation button.

Clicking this Start Remediation button opens a new Remediation tab that provides more specific information about the alert and the affected workload(s), such as the clusters and services that are most impacted. By then clicking on the alert on this Remediation tab, you can open a side panel that indicates areas for investigation, listed in order of priority.

Beginning with the first investigation area listed, you can review associated sections that show recommended next steps, an explanation of what happened to trigger the alert, and key contextual information. These sections are linked with Datadog Monitors, Service Catalog, and other sources, and they can include helpful information such as relevant code repositories or teams to page (accompanied by appropriate Slack channels or other contact information). After performing the steps and reviewing the information associated with the first investigation area, you can click the second investigation area to reveal its associated next steps, and so on, until the issue is resolved.

A side panel for a Kubernetes error that shows a root cause and investigation areas.

Providing investigation areas alongside critical contextual information is especially helpful for resolving issues like CrashLoopBackOff, which can stem from a variety of root causes, such as incorrect environment variables, restrictive memory limits, or invalid permissions. For example, the side panel above shows a sequence of investigation areas for a CrashLoopBackOff error. Kubernetes Active Remediation can recommend raising the memory limit as a next step after it has processed the latest events and container memory metrics for the workload in question. In this particular case, the feature has also recommended investigating the container's probes (listed as Liveness/Readiness Probes) to address the CrashLoopBackOff issue in a more fundamental way.
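As a rough illustration of how such root causes surface in raw Kubernetes data, the sketch below maps a container's last termination and recent event messages to candidate investigation areas. It is a simple heuristic under stated assumptions, not Datadog's analysis: the function name, the area labels, and the ordering are all hypothetical. It relies only on two standard signals, the `OOMKilled` termination reason in a container status and the kubelet's "probe failed" event messages.

```python
def investigation_areas(container_status, event_messages=()):
    """Suggest places to look for a restarting container (heuristic only)."""
    areas = []
    terminated = (container_status.get("lastState") or {}).get("terminated") or {}

    # The kernel OOM killer ending the container points at the memory limit.
    if terminated.get("reason") == "OOMKilled":
        areas.append("memory limit")

    # The kubelet emits events such as "Liveness probe failed: ..." when a probe fails.
    if any("probe failed" in message.lower() for message in event_messages):
        areas.append("liveness/readiness probes")

    # Fall back to application configuration (environment variables, permissions)
    # when the process exited with an error and nothing more specific matched.
    if not areas and terminated.get("exitCode", 0) != 0:
        areas.append("application configuration")

    return areas


if __name__ == "__main__":
    status = {"lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}
    events = ["Liveness probe failed: HTTP probe failed with statuscode: 503"]
    print(investigation_areas(status, events))
```

The point of the sketch is that several signals can be present at once, which is why a prioritized list of investigation areas is more useful than a single verdict.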

Guided actions to efficiently stabilize your environment

Kubernetes Active Remediation not only informs you of investigation areas and associated next steps, but it also gives you an opportunity to perform those steps within the same context.

For example, to resolve the initial memory limit issue, you can update the memory limit for the workload in the UI. Active Remediation validates the value you enter (for example, to catch one that is too low) and lets you preview the change. Additional information that helps you choose a value is provided as well, such as the value of the memory limit over time for this particular workload. If the memory limit has historically been higher, you can be more confident that raising it will resolve the issue.

Changing the memory limit assigned to a workload from within an error’s side panel in Kubernetes Explorer.
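For readers who want to see what an equivalent change looks like outside the UI, here is a minimal sketch, assuming the workload is a Deployment. It parses a Kubernetes memory quantity, runs two basic sanity checks on a proposed limit, and builds a strategic merge patch body. The function names and the specific checks are hypothetical and are not the validation that Active Remediation performs; the quantity parser handles only common suffixes.

```python
import json

# Common Kubernetes memory suffixes (binary and decimal); not the full quantity grammar.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
          "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12}


def parse_memory(quantity):
    """Convert a memory quantity such as '512Mi' to bytes."""
    for suffix in sorted(_UNITS, key=len, reverse=True):
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * _UNITS[suffix])
    return int(quantity)  # a plain number is already in bytes


def validate_memory_limit(new_limit, request, observed_peak_bytes):
    """Return a list of problems with a proposed limit (an empty list means none found)."""
    problems = []
    limit_bytes = parse_memory(new_limit)
    if limit_bytes < parse_memory(request):
        problems.append("limit is below the memory request")
    if limit_bytes <= observed_peak_bytes:
        problems.append("limit does not exceed observed peak usage")
    return problems


def memory_limit_patch(container_name, new_limit):
    """Build a strategic merge patch that sets one container's memory limit on a Deployment."""
    return {"spec": {"template": {"spec": {"containers": [
        {"name": container_name, "resources": {"limits": {"memory": new_limit}}},
    ]}}}}


if __name__ == "__main__":
    # Hypothetical values: a 256Mi request and an observed peak of 300 MiB.
    print(validate_memory_limit("128Mi", "256Mi", 300 * 2**20))
    print(json.dumps(memory_limit_patch("app", "512Mi")))
```

The resulting JSON could be passed to `kubectl patch deployment <name> -p '<json>'`, which uses a strategic merge patch by default and matches the container by its `name`. Previewing the change first, as the UI does, remains the safer habit.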

Once the configuration is confirmed, you can review the live pod statuses to ensure the success of the deployed change. Actions will generate a historical log, which can help you devise new processes and workflows to prevent the same errors from occurring in the future.

As your teams adopt best practices and become more efficient at handling tricky issues, they can focus on implementing preventative actions based on common patterns.

Quickly resolve your Kubernetes issues

Datadog Kubernetes Active Remediation expands your organization’s ability to address business-critical Kubernetes issues, providing deep context on issues and root causes. Engineering teams can quickly understand Kubernetes alerts, engage in guided root cause analysis, and take immediate action. And as they adopt best practices and become more efficient at handling these tricky issues, your teams can focus more on implementing preventative measures.

Kubernetes Active Remediation is currently in preview. You can sign up here to join the preview and receive updates. If you’re not already using Datadog, you can start today with a 14-day free trial.
