The Datadog Action Catalog provides more than 1,400 actions to help you accelerate remediation across your infrastructure directly within Datadog. With actions, you can use Workflow Automation to configure workflows that automatically address issues as they happen and build custom apps in App Builder that empower anyone in your organization to act when incidents occur.
Until now, the Action Catalog has been focused on public cloud infrastructure integrations such as cloud providers (e.g., AWS and Azure), software delivery (e.g., GitHub), and ticketing (e.g., Jira). However, 45 percent of enterprise workloads are hosted in private environments, including self-hosted Kubernetes clusters, on-prem databases (e.g., PostgreSQL), and internal CI/CD systems (e.g., private GitLab instances).
We are pleased to announce the availability of private actions, which enable you to build workflows and apps to securely manage your self-hosted infrastructure. Private actions support 300+ actions across six connection types (Kubernetes, GitLab, Jenkins, PostgreSQL, Temporal, and HTTP). To use private actions, you need to install a private action runner on a host in your network.
This blog post explores how your team can use private actions to facilitate remediation in Kubernetes by using any of the 150+ supported Kubernetes actions. In particular, we describe how you can:
- Build a workflow that automatically restarts deployments to reduce downtime
- Create an app that lets you manage deployments, pods, and containers directly from Datadog
Build workflows to automatically restart Kubernetes deployments
Kubernetes deployments can fail without warning due to misconfigurations, resource constraints, or sudden crashes. When this happens, teams need to act quickly to restore service availability.
Private actions enable you to configure workflows that restart deployments as soon as a monitor is triggered, check whether the incident has been resolved, and escalate the issue to the team if necessary.
Let’s say your team has a metric monitor that checks for CPU usage that exceeds 90 percent for five consecutive minutes. This would indicate that your application is under significant load and that, without quick intervention, your users might experience latency or even service outages.
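To make the alert condition concrete, here is a minimal Python sketch of the rule described above. It assumes one CPU sample per minute and is a simplification for illustration, not how Datadog evaluates monitors internally. The function name `monitor_should_alert` and its parameters are hypothetical.

```python
from typing import Sequence


def monitor_should_alert(
    cpu_percent_by_minute: Sequence[float],
    threshold: float = 90.0,
    window_minutes: int = 5,
) -> bool:
    """Return True if each of the last `window_minutes` samples exceeds `threshold`.

    Assumes one sample per minute, oldest first (an illustrative simplification).
    """
    if len(cpu_percent_by_minute) < window_minutes:
        # Not enough data to cover the full window yet.
        return False
    recent = cpu_percent_by_minute[-window_minutes:]
    return all(sample > threshold for sample in recent)
```

A single sample at or below the threshold inside the window is enough to keep the rule from firing, which is what "five consecutive minutes" implies.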
Without automation, responding to this issue might take your team 15–30 minutes or longer, depending on how quickly the on-call engineer responds and how long it takes to assess the problem. Here’s what that process might look like:
- Receive an alert from Datadog and notify the on-call engineer.
- Use kubectl to investigate logs and check resource utilization.
- Manually restart the deployment.
- Monitor performance afterward to ensure CPU usage has stabilized.
This manual, error-prone process increases your mean time to resolution (MTTR) and can lead to extended downtime. Instead, you can use Workflow Automation to build a workflow that restarts the Kubernetes deployment whenever the monitor is triggered, remediating the issue in near real time.
When the monitor is triggered, the workflow sends the relevant information about the Kubernetes deployment, along with the auto-remediation plan, to the team via Slack. The same information is passed to the restart deployment action so that it targets the right deployment.
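For context on what a deployment restart involves at the Kubernetes level, the sketch below builds the patch body that kubectl rollout restart applies: it sets the `kubectl.kubernetes.io/restartedAt` annotation on the pod template, which causes the deployment to roll out new pods. Whether the restart deployment action works the same way is an assumption here, and `build_restart_patch` is a hypothetical helper name.

```python
from datetime import datetime, timezone
from typing import Optional


def build_restart_patch(now: Optional[datetime] = None) -> dict:
    """Build a patch body that triggers a rolling restart of a deployment.

    Changing an annotation on the pod template alters the template, so the
    deployment controller replaces the existing pods with new ones.
    Pass a timezone-aware datetime so the timestamp carries a UTC offset.
    """
    now = now or datetime.now(timezone.utc)
    return {
        "spec": {
            "template": {
                "metadata": {
                    "annotations": {
                        "kubectl.kubernetes.io/restartedAt": now.isoformat(
                            timespec="seconds"
                        )
                    }
                }
            }
        }
    }
```

The patch only describes the change. Applying it to a specific namespace and deployment name is left to whatever client you use.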
To confirm that the restart deployment action has solved the issue, the workflow then uses a conditional check action to poll the monitor status for five minutes to see whether CPU usage has fallen back below 90 percent. If it has, the workflow sends a message to the team reporting that the issue has been resolved. Otherwise, it triggers a PagerDuty incident with details about the issue. With this workflow, your self-hosted services remain stable and your team does not need to intervene manually unless absolutely necessary.
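The verify-then-escalate logic above can be sketched in Python as follows. This is an illustration of the control flow only, not the workflow's actual implementation. The callables, the `"OK"` status string, and the 30-second poll interval are assumptions, and `verify_remediation` is a hypothetical name. The clock and sleep functions are injectable so the logic can be exercised without waiting five minutes.

```python
import time
from typing import Callable


def verify_remediation(
    get_monitor_status: Callable[[], str],
    notify_resolved: Callable[[], None],
    escalate: Callable[[], None],
    timeout_seconds: float = 300.0,
    poll_interval_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll the monitor until it recovers or the timeout elapses.

    Returns True if the monitor recovered (and the team was notified),
    False if the timeout was reached (and the issue was escalated).
    """
    deadline = clock() + timeout_seconds
    while True:
        if get_monitor_status() == "OK":  # assumed status value
            notify_resolved()
            return True
        if clock() >= deadline:
            escalate()
            return False
        sleep(poll_interval_seconds)
```

In this sketch, `notify_resolved` would correspond to the Slack message and `escalate` to the PagerDuty incident.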
Manage Kubernetes deployments, pods, and containers with a custom app
Without a unified interface that enables both visibility and action, teams have to jump between multiple platforms to track down deployment failures, monitor resource availability, and perform remediations, leading to unnecessary context switching that can extend service disruptions.
With private actions, you can use App Builder to build an app that consolidates mission-critical information about your deployments, pods, and containers. This gives your team visibility and control without requiring Kubernetes expertise or even a login to the Kubernetes console.
The app in the following screenshot gives you real-time visibility into your deployments, with no CLI commands required. It surfaces the same information as kubectl get deploy and enables you to restart failed deployments with a single click.
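As a rough illustration of the data behind such a view, the following Python sketch derives the READY, UP-TO-DATE, and AVAILABLE values from a Deployment object in the shape returned by the Kubernetes API (for example, kubectl get deploy -o json). The `summarize_deployment` helper and the `needs_attention` flag are hypothetical and are not part of the app described in this post.

```python
from typing import Any, Dict


def summarize_deployment(deployment: Dict[str, Any]) -> Dict[str, Any]:
    """Summarize a Deployment object (as a dict) into table-friendly fields."""
    spec = deployment.get("spec", {})
    status = deployment.get("status", {})
    desired = spec.get("replicas", 1)  # the API defaults replicas to 1
    # Status counters are omitted from the JSON when they are zero.
    ready = status.get("readyReplicas", 0)
    available = status.get("availableReplicas", 0)
    return {
        "name": deployment["metadata"]["name"],
        "ready": f"{ready}/{desired}",
        "up_to_date": status.get("updatedReplicas", 0),
        "available": available,
        # Illustrative flag: fewer available pods than desired.
        "needs_attention": available < desired,
    }
```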
Choosing a specific deployment opens a new page with all its pods, formatted with the same information that kubectl get pods -o wide would provide, including each pod’s status. From there, you can delete any pod with a single click, enabling you to quickly resolve issues such as CrashLoopBackOff.
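For reference, CrashLoopBackOff appears in a pod's status as the waiting reason of an individual container. The sketch below shows one way to detect it from a Pod object in the shape returned by the Kubernetes API. The `crashlooping_containers` helper is a hypothetical name used for illustration.

```python
from typing import Any, Dict, List


def crashlooping_containers(pod: Dict[str, Any]) -> List[str]:
    """Return the names of containers currently waiting in CrashLoopBackOff."""
    statuses = pod.get("status", {}).get("containerStatuses", [])
    return [
        cs["name"]
        for cs in statuses
        if cs.get("state", {}).get("waiting", {}).get("reason")
        == "CrashLoopBackOff"
    ]
```

When a pod belongs to a deployment, deleting it causes the deployment's ReplicaSet to create a replacement pod.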
When you choose a pod, a new tab opens with a detailed view of its containers, including their images, ports, and states. If an issue requires escalation, you can create a Jira ticket or PagerDuty incident with just a few clicks. The app automatically pulls in relevant details about the issue and the resources involved, which enables you to focus on remediation rather than information gathering.
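The container details mentioned above (images, ports, and states) come from two parts of a Pod object: the spec lists each container's image and ports, and the status reports each container's current state. A minimal Python sketch of combining them, with the hypothetical helper name `describe_containers`, might look like this:

```python
from typing import Any, Dict, List


def describe_containers(pod: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Combine container spec and status into one row per container."""
    # Each container status holds a single state key:
    # "running", "waiting", or "terminated".
    states = {
        cs["name"]: next(iter(cs.get("state", {})), "unknown")
        for cs in pod.get("status", {}).get("containerStatuses", [])
    }
    rows = []
    for container in pod.get("spec", {}).get("containers", []):
        rows.append(
            {
                "name": container["name"],
                "image": container.get("image", ""),
                "ports": [p["containerPort"] for p in container.get("ports", [])],
                # Containers with no reported status are marked "unknown".
                "state": states.get(container["name"], "unknown"),
            }
        )
    return rows
```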
Get started with private actions in your workflows and apps
The workflow and app in this post showcase remediation examples that focus on critical Kubernetes resources. Private actions also cover GitLab, PostgreSQL, Jenkins, Temporal, and custom HTTP requests, for a total of 300+ actions that you can use to create workflows and apps that help your team resolve incidents across your self-hosted infrastructure faster and with less effort than manual processes.
To explore how you can start auto-remediating your on-prem environments today, take a look at the private actions we support in our catalog as well as the pre-built blueprints for the workflow and app discussed in this post. To learn more, check out our private actions documentation as well as our Workflow Automation documentation and App Builder documentation. If you aren’t already a Datadog customer, get started with a 14-day free trial.