Monitor Temporal Cloud with Datadog
Bowen Chen

David Pointeau

Brittany Coppola

Temporal Cloud is the fully managed service that enables you to quickly scale the Temporal workflow orchestration engine across your organization. Using Temporal Cloud, you can offload the infrastructure management of Temporal Services and focus on developing Workflows that increase the resiliency of your applications and help them remain functional through service errors and system outages.

Datadog’s Temporal Cloud integration gives you granular insights into your Temporal Services, Temporal’s task polling, Workflow activity, and more. You can use these insights to quickly identify errors and bottlenecks that risk slowing down the applications that rely on Temporal Workflows. In this blog post, we’ll discuss how the Temporal Cloud metrics found in our preconfigured dashboard enable you to do the following:

  • Visualize the health and performance of your Temporal Frontend Services
  • Monitor your Temporal Workers’ task polling
  • Quickly identify errors in your Temporal Workflows

Visualize the health and performance of your Temporal Frontend Services

Temporal Cloud handles infrastructure management for you. However, you’ll still need to rely on Temporal’s Frontend Service to accept and process your applications’ API requests. If your Frontend Service is struggling to handle heavy traffic, it can bottleneck your entire orchestration pipeline, even if you have properly configured Workers and task queues. Using Datadog’s preconfigured Temporal Cloud dashboard, you can visualize the current load on this service by monitoring the gRPC request rate over time, as well as how frequently the service is throttling incoming requests or encountering errors. If you notice spikes in the service’s gRPC error rate, you’ll need to investigate your Temporal SDK logs to determine the error code. The error logs can help you identify whether the issue is network-related, the result of an SDK misconfiguration, or caused by your Workflow request rate exceeding your quota.
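As an illustration, here’s a minimal sketch (using Temporal’s Python SDK) of catching a failed request and logging its gRPC status code. The endpoint, namespace, credential paths, and Workflow name are placeholders, not values from the integration:

```python
import asyncio

from temporalio.client import Client, TLSConfig
from temporalio.service import RPCError


async def main() -> None:
    # Placeholder Temporal Cloud endpoint, namespace, and mTLS credentials.
    with open("client.pem", "rb") as f:
        client_cert = f.read()
    with open("client.key", "rb") as f:
        client_key = f.read()

    client = await Client.connect(
        "your-namespace.a1b2c.tmprl.cloud:7233",
        namespace="your-namespace.a1b2c",
        tls=TLSConfig(client_cert=client_cert, client_private_key=client_key),
    )

    try:
        await client.start_workflow(
            "OrderWorkflow",  # hypothetical Workflow, referenced by name
            id="order-1234",
            task_queue="orders",
        )
    except RPCError as err:
        # err.status carries the gRPC status code, e.g. RESOURCE_EXHAUSTED
        # when the request rate exceeds your namespace quota.
        print(f"gRPC error {err.status.name}: {err.message}")


asyncio.run(main())
```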

Visualize the current load on your service by monitoring its gRPC request rate.

Using the metrics in the dashboard, you can load-test your clusters to validate how your system responds under pressure and detect when Temporal is bottlenecking your request lifecycle. State transitions measure the amount of work done by Workflow Executions. This can be a more reliable throughput metric than the rate of completed Workflows, whose runtimes can vary widely based on the Workflow Definition. By comparing the average state transition rate over time with service latency, you can determine how well your system responds to increases in load. For instance, if you increase the number of parallel Workflows, you should see an increase in the rate of state transitions over time. When that increase begins to be reflected in your Temporal Cloud service latency, you’ve reached the upper limit of load your service can handle.
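To make that comparison concrete, the sketch below pulls both series through Datadog’s metrics API using the datadog-api-client package. The metric names here are assumptions based on the integration’s v0 naming; confirm them against the integration’s metric list:

```python
import time

from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.metrics_api import MetricsApi

# Assumed metric names; check the integration docs for the exact names.
STATE_TRANSITION_RATE = "avg:temporal.cloud.v0_state_transition.count{*}.as_rate()"
SERVICE_LATENCY_P99 = "avg:temporal.cloud.v0_service_latency.p99{*}"

# Reads DD_API_KEY and DD_APP_KEY from the environment.
with ApiClient(Configuration()) as api_client:
    api = MetricsApi(api_client)
    now = int(time.time())
    # Compare the last hour of throughput against latency.
    for query in (STATE_TRANSITION_RATE, SERVICE_LATENCY_P99):
        resp = api.query_metrics(_from=now - 3600, to=now, query=query)
        for series in resp.series or []:
            print(series.metric, series.pointlist[-1])
```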

Monitor different service latency metrics using our integration.

Datadog’s preconfigured Temporal Cloud dashboard enables you to visualize Temporal Cloud’s service latency by different operations, the most important being the following:

  • StartWorkflowExecution: the time from when a Workflow is requested to when Temporal acknowledges it. This response time can increase if too many Workflows end up routed to the same history shard or if there’s high CPU pressure on Temporal’s Frontend Service. Workflows are routed to history shards based on hashes of their Workflow IDs, which can occasionally result in hot shards. If you notice this service latency responding poorly to load testing, try to distribute your Workflow IDs more evenly by using universally unique identifiers (UUIDs) instead of numeric or timestamp-based IDs, which can result in overlapping hashes (see the sketch after this list).
  • SignalWorkflowExecution: the time it takes to route a Signal to a running Workflow. This response time can increase if your Workflows aren’t available in local caches and need to be rehydrated from shard storage. If you notice this service latency responding poorly to load testing, consider enabling sticky executions so that your Workflows are more readily available in memory.
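Here’s a minimal sketch of the UUID approach with Temporal’s Python SDK; the Workflow name and task queue are hypothetical:

```python
import uuid

from temporalio.client import Client


async def start_order_workflow(client: Client) -> None:
    # A random UUID spreads Workflow IDs evenly across history shards,
    # unlike sequential or timestamp-based IDs that can hash to hot shards.
    await client.start_workflow(
        "OrderWorkflow",  # hypothetical Workflow, referenced by name
        id=f"order-{uuid.uuid4()}",
        task_queue="orders",
    )
```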

Monitor your Temporal Workers’ task polling

Temporal Cloud’s task polling is responsible for efficiently load balancing tasks across available Temporal Workers. A Worker actively polls a task queue for tasks to process, and Temporal’s Matching Service is responsible for assigning tasks within each task queue to the Worker. When this happens efficiently, the task is assigned from the Matching Service’s memory, which is known as synchronous matching. However, if no Worker is available to match to the task, the Matching Service sends the task to Temporal’s persistence database, where it needs to be reloaded once a Worker becomes available. This is known as asynchronous matching, and it increases both the load on the database and the overall latency in your system (since tasks are waiting to be assigned). You can monitor the rate of synchronous matching using the Task Sync Match Percentage in our dashboard. Generally, you should aim for a synchronous match rate of 99 percent or higher.

Monitor your Workers' task sync match percentage to ensure the highest amount of synchronous matching.
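To get notified when you fall below that target, you can alert on the ratio of synchronously matched polls to all successful polls. The sketch below creates such a monitor through Datadog’s API client; the metric names and query arithmetic are assumptions to adapt to your account:

```python
from datadog_api_client import ApiClient, Configuration
from datadog_api_client.v1.api.monitors_api import MonitorsApi
from datadog_api_client.v1.model.monitor import Monitor
from datadog_api_client.v1.model.monitor_type import MonitorType

# Assumed metric names; confirm against the integration's metric list.
query = (
    "avg(last_15m):100 * "
    "sum:temporal.cloud.v0_poll_success_sync.count{*}.as_count() / "
    "sum:temporal.cloud.v0_poll_success.count{*}.as_count() < 99"
)

monitor = Monitor(
    name="Temporal Cloud sync match percentage below 99%",
    type=MonitorType.QUERY_ALERT,
    query=query,
    message="Sync match rate dropped. Consider adding Worker pods or pollers.",
)

with ApiClient(Configuration()) as api_client:
    MonitorsApi(api_client).create_monitor(body=monitor)
```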

Temporal Cloud manages the scaling of Temporal Services. However, you’ll still need to manage your Worker pods and pollers. If you notice that the sync match percentage is consistently below this threshold, consider increasing the number of active Worker pods or their respective number of task pollers, as in the sketch below.
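Here’s what tuning pollers can look like with Temporal’s Python SDK. The Workflow, activity, and task queue are hypothetical, and the poller counts are illustrative rather than recommended values:

```python
from datetime import timedelta

from temporalio import activity, workflow
from temporalio.client import Client
from temporalio.worker import Worker


@activity.defn
async def charge_card(order_id: str) -> str:
    return f"charged {order_id}"


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        return await workflow.execute_activity(
            charge_card, order_id, start_to_close_timeout=timedelta(seconds=30)
        )


async def run_worker(client: Client) -> None:
    # Raising the poller counts lets each Worker pull tasks from its queue
    # faster, which helps keep the sync match percentage above 99 percent.
    worker = Worker(
        client,
        task_queue="orders",
        workflows=[OrderWorkflow],
        activities=[charge_card],
        max_concurrent_workflow_task_polls=10,  # SDK default is 5
        max_concurrent_activity_task_polls=10,  # SDK default is 5
    )
    await worker.run()
```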

Quickly identify errors in your Temporal Workflows

Temporal Workflows serve as the building blocks of Temporal’s programming model, and ensuring that your Workflows run smoothly is critical to maintaining the health of your applications. Datadog’s Temporal Cloud integration enables you to monitor the rate of different Workflow end states, including cancellations, failures, terminations, and more. While these end state metrics may sound similar, they have very different meanings and implications. For example, cancellations are typically user-initiated and result in the Workflow exiting gracefully, while terminations forcefully kill the running processes without conducting standard cleanup operations, leaving your systems at risk of orphaned processes.

Monitoring these Workflow metrics not only notifies you when Workflows fail to complete successfully, but also helps you surface underlying issues with your Workflow Definition code or your Workers’ provisioned resources. A high average Workflow failure rate indicates that, during execution, your Workflows are encountering unhandled exceptions or improper error handling that causes them to fail. This is usually an issue with your Workflow Definition and requires you to investigate and make changes to your Temporal code. On the other hand, high Workflow timeout rates can stem from several causes, including an absence of active task pollers, unprovisioned Workers, or errors in your retry logic.
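For instance, here’s a sketch of pairing an explicit retry policy with error handling inside a Workflow Definition, using Temporal’s Python SDK; the activity name and error type are hypothetical:

```python
from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy
from temporalio.exceptions import ActivityError


@workflow.defn
class PaymentWorkflow:
    @workflow.run
    async def run(self, order_id: str) -> str:
        try:
            return await workflow.execute_activity(
                "charge_card",  # hypothetical activity, referenced by name
                order_id,
                start_to_close_timeout=timedelta(seconds=30),
                retry_policy=RetryPolicy(
                    maximum_attempts=3,
                    non_retryable_error_types=["InvalidCardError"],
                ),
            )
        except ActivityError:
            # Handle the failure explicitly rather than letting the exception
            # propagate and fail the entire Workflow.
            return f"payment for {order_id} failed after retries"
```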

Ensure that your Temporal Workflows complete successfully.

After you discover unusual Workflow activity using the dashboard, you can query Temporal’s DescribeWorkflowExecution API for additional context to help troubleshoot your Workflow. For example, pending Activities can indicate asynchronous matching issues, while pending Workflow Tasks can indicate that your Workers may not be polling the correct task queue or are too busy to be assigned new tasks.
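Here’s a minimal sketch using the Python SDK, where handle.describe() wraps the DescribeWorkflowExecution API; the way we read pending activities off the raw response is an assumption based on the SDK’s generated protos:

```python
from temporalio.client import Client


async def inspect_workflow(client: Client, workflow_id: str) -> None:
    handle = client.get_workflow_handle(workflow_id)
    # describe() calls the DescribeWorkflowExecution API under the hood.
    desc = await handle.describe()
    print("status:", desc.status)
    # Pending activities can point to matching issues or to Workers that
    # aren't polling the right task queue.
    for pending in desc.raw_description.pending_activities:
        print("pending activity:", pending.activity_type.name, pending.state)
```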

Get started with Datadog

Datadog’s Temporal Cloud integration gives you granular visibility into your Temporal Workers, Workflows, and more to help you catch issues such as service latency spikes, failed Workflows, and inefficient task polling. If your organization self-hosts Temporal services, you can learn more about how to monitor your Temporal Server in this blog post.

To start monitoring your Temporal Cloud instances in Datadog, you’ll first need to generate a metrics endpoint URL in Temporal Cloud and connect your Temporal Cloud account to Datadog. Review our documentation for step-by-step instructions and a comprehensive list of all of the Temporal Cloud metrics provided by our integration. If you don’t already have a Datadog account, sign up for a free trial.
