Remediate Google Cloud issues with new actions in Workflow Automation and App Builder

Author: Syed Sarjeel Yusuf

Published: June 18, 2024

Datadog Actions help you respond to alerts and manage your infrastructure directly from within Datadog. You can create workflows that automate end-to-end processes, or use App Builder to build resource management tools and self-serve developer platforms. The more than 550 available actions cover tasks such as creating Jira tickets, resizing autoscaling groups, and triggering GitHub pipelines.

Now, Datadog is excited to introduce its first set of Google Cloud Actions. These actions let you build custom automations and apps that manage and remediate issues in your Google Cloud-based applications from within Datadog. For example, you can:

  • Scale GKE node pools to ensure sufficient resources to deal with changing usage
  • Block IPs with Google Cloud Armor in response to security signals to protect against malicious actors
  • Snapshot and remove unused Google Compute Engine instances to reduce cloud costs

In this post, we’ll look at how you can use a workflow to automatically scale critical GKE services. Then we’ll look at how an app built using App Builder enables you to easily manage Google compute resources.

Use a workflow to automatically scale critical GKE services

With the new Google Cloud Actions, you can set up automated workflows to scale your GKE node pools with little to no manual intervention. Let’s say your team is running applications on Google Kubernetes Engine (GKE) and using Datadog to measure critical Kubernetes metrics. You set up change alert monitors to measure changes in the load on your nodes and alert your on-call team when change thresholds are exceeded.

During a major traffic event, Datadog detects that CPU usage across your GKE nodes has surged past 80 percent, triggering an alert. This indicates that your application is under significant load and that, without quick intervention, your users might experience latency or even service outages. Without automation, your team would need to manually scale the node pools. Here’s how that process could look:

  1. The responder manually logs in to the GKE Console.
  2. The responder manually reviews the metrics and logs to confirm the need for scaling.
  3. The responder updates the node pool settings, specifying the number of additional nodes required.
  4. The responder monitors the deployment to ensure nodes are added and the system stabilizes.
  5. Finally, the responder informs the team about the actions taken.

Manual processes are inherently time-consuming, and the response time also depends on the responder’s availability and their ability to quickly assess and act. These factors can lead to longer downtimes, increased MTTR, and potential revenue loss. They can also negatively affect the responder’s quality of life, causing stress and burnout due to frequent interruptions, especially during critical off-hours.

With the new Google Cloud Actions, you can use Workflow Automation to automate this remediation process. For example, you can configure the “Scale GKE Node Pools” workflow to run as soon as Datadog triggers the alert. This workflow retrieves the current node pool details, computes the necessary scale increase to handle the surge in traffic, and scales the node pools accordingly.

Scale GKE Node Pools workflow.
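
The "computes the necessary scale increase" step can be pictured as a small proportional calculation. The following Python sketch is illustrative only: the function name, the 60 percent target utilization, and the node cap are assumptions made for this example, not details of the blueprint, which may compute the increase differently.

```python
import math


def target_node_count(current_nodes: int, cpu_percent: float,
                      target_percent: float = 60.0, max_nodes: int = 20) -> int:
    """Return a node count that would bring average CPU near target_percent.

    Hypothetical helper: assumes load spreads evenly across nodes, so total
    demand is roughly current_nodes * cpu_percent.
    """
    if current_nodes < 1:
        raise ValueError("current_nodes must be at least 1")
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")

    # Nodes needed so that the same total demand lands at the target level.
    needed = math.ceil(current_nodes * cpu_percent / target_percent)

    # This remediation path only scales up, and it never exceeds the cap
    # (unless the pool is already larger than the cap).
    cap = max(max_nodes, current_nodes)
    return min(max(needed, current_nodes), cap)
```

Under these assumptions, a pool of five nodes at 85 percent CPU would be scaled to eight nodes, while a pool that is already below the target utilization is left at its current size.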

The workflow also checks whether scaling up the node pools stabilizes the application and sends a notification to the team via Slack, keeping everyone in the loop about the scaling action taken. This is all done without involving the on-call responder, who is only notified with an incident in PagerDuty if the remediation process is not successful and the monitor is still alerting. This automation ensures that your application remains reliable and performant, and frees up your team to focus on strategic tasks rather than repeatedly firefighting scaling issues.
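
The notification logic described above amounts to a simple branch on the monitor state after scaling. Here is a minimal sketch of that decision in Python. The step labels and the function name are hypothetical and are not Datadog action identifiers. The sketch treats "the monitor is still alerting" as the signal that remediation did not succeed.

```python
def follow_up_steps(monitor_still_alerting: bool) -> list[str]:
    """Return the follow-up steps after the scale-up has been attempted.

    Step names are illustrative labels, not real action names.
    """
    # The team is always told which scaling action was taken.
    steps = ["notify_team_in_slack"]

    # The on-call responder is paged only when remediation did not help.
    if monitor_still_alerting:
        steps.append("open_pagerduty_incident")

    return steps
```

Keeping the escalation behind a single condition is what lets the responder stay out of the loop whenever the automated scale-up resolves the alert.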

Easily manage Google compute resources with an app

When essential applications and services are hosted on Google Cloud, the ability to easily make timely adjustments to them can prevent performance bottlenecks, reduce downtime, and optimize resource utilization. Without quick access, your team might struggle to respond promptly to traffic spikes, security threats, and inefficiencies—potentially leading to service disruptions, increased operational costs, and compromised security.

By using App Builder with the new Google Cloud Actions, you can create an app that gives your team a centralized, efficient platform for managing Google Cloud resources directly from within Datadog. The following screenshot shows a Google Cloud Compute Management Console app that provides a single interface where your team can view and manage all of your Google Cloud instances:

Google Cloud Compute Management Console app.

The Status column shows in real time whether an instance is running, stopped, or terminated. This immediate visibility enables you to make quick decisions to ensure that resources are allocated efficiently based on current needs. Responders can start or stop an instance with a single click based on demand, which reduces unnecessary costs associated with idle resources and improves your security posture by enabling you to rapidly terminate or restart instances associated with potential threats.

Additionally, apps can be embedded in dashboards to ensure that all of these responder capabilities are coupled with the rich observability and insights into your Google Cloud infrastructure that Datadog provides.

Google Cloud Compute Management Console app in a Datadog dashboard.

Get started using Google Cloud Actions in your workflows and apps

This post highlights just some of the crucial tasks that can be performed using Google Cloud Actions with Datadog Workflow Automation and App Builder. With these and 550+ other actions, you can create intricate automations tailored to specific needs, improving operational efficiency and ensuring that your infrastructure adapts dynamically to changing demands—all integrated seamlessly with your existing Datadog monitors, dashboards, and security signals.

To see what’s possible, explore the full spectrum of actions in the Actions Catalog and the many blueprints available, including blueprints for the workflow and app described in this post. To learn more, check out our Workflow Automation documentation and App Builder documentation. If you aren’t already a Datadog customer, get started with a 14-day free trial.