Reduce your AWS Step Functions’ error remediation time by redriving executions directly from Datadog

Author Jake Greenberg

Published: October 2, 2024

AWS lets customers retry or redrive Step Functions executions. A retry re-runs an execution in full, while a redrive continues a failed Standard Workflow execution from its point of failure while maintaining all inputs. For example, if you find broken downstream logic in your code or experience unexpected errors upon execution, you can remediate those errors either by fully re-running the execution or by redriving it from the failed step. Previously, the only way for customers to take action on failed Step Functions executions was to leave Datadog and do so in the AWS console. This back and forth between platforms created a disjointed user experience: customers had to first identify in Datadog which failed executions to address, then manually copy that information and execute the retries or redrives in the AWS console.

To overcome this challenge, customers can now find and take action on failed executions directly from the Datadog UI using Datadog App Builder. Executing retry or redrive on failed executions directly from Datadog significantly reduces error remediation time and the need to manually document failed execution ARNs. In this post, we’ll walk through how our support for Step Functions monitoring enables you to quickly remediate errors and reduce costs by redriving executions directly from Datadog.

How to redrive or re-run Step Functions executions directly from Datadog

To redrive or retry your Step Functions executions from Datadog, you must already be monitoring Step Functions using Datadog’s native Step Functions monitoring (note that this is different from Datadog’s Step Functions integration, which provides access to Step Functions CloudWatch metrics in Datadog). When redriving or retrying an execution for the first time, you may be prompted to configure an AWS connection. Once you’ve configured your connection, Datadog will remember it, and you will not have to reconfigure it unless you choose to change your connection, which you can do at any time from the action modal.

After you’ve configured a connection, you will immediately be able to redrive failed Standard Workflows and retry all Workflows that are being monitored on Datadog. Simply navigate to the Serverless view, click on the Step Functions execution you’re troubleshooting, and from that workflow’s State Machine Map click “Redrive from Failed Step” to redrive or “Retry Execution” to retry the Step Functions execution.

Figure: Redriving a Step Functions execution directly from Datadog
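For readers curious about what the two actions correspond to on the AWS side, here is a minimal Python sketch of the redrive-versus-retry decision. It is an illustration under stated assumptions, not a description of how Datadog implements these buttons. The function name remediate_execution is hypothetical, and the sfn argument is assumed to be a boto3 Step Functions client, whose describe_execution, redrive_execution, and start_execution calls are used here.

```python
from typing import Any


def remediate_execution(sfn: Any, execution_arn: str) -> str:
    """Redrive an execution if AWS reports it as redrivable; otherwise re-run it.

    Assumption: `sfn` behaves like a boto3 Step Functions client, for example
    boto3.client("stepfunctions"). Returns "redrive" or "retry" to indicate
    which action was taken.
    """
    execution = sfn.describe_execution(executionArn=execution_arn)

    if execution.get("redriveStatus") == "REDRIVABLE":
        # Redrive: continue the same execution from its failed step,
        # keeping the original input.
        sfn.redrive_execution(executionArn=execution_arn)
        return "redrive"

    # Retry: start a brand-new execution of the same state machine
    # with the same input as the original execution.
    sfn.start_execution(
        stateMachineArn=execution["stateMachineArn"],
        input=execution["input"],
    )
    return "retry"
```

With boto3 installed and AWS credentials configured, this could be called as remediate_execution(boto3.client("stepfunctions"), execution_arn), where execution_arn is the ARN of the failed execution. The sketch checks the redrive status first because redrive applies only to failed Standard Workflows, while a full re-run is available for any workflow.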

Get started today

If you are interested in monitoring your AWS Step Functions effectively, the ability to take action on failed executions directly from the Datadog platform will reduce your time to remediation. As you begin using the feature in your remediation workflows, we would love to hear your thoughts through this form. If you don’t already have a Datadog account, you can sign up to get started.