
Debug application issues with APM and Cloud Network Monitoring

Author Thomas Sobolik

Editor’s note: This post covers Cloud Network Monitoring, a Datadog feature that was originally called Network Performance Monitoring.

With the advanced containerization that has become the norm in the modern cloud, your infrastructure is likely more distributed, and thus more exposed to networking issues, than ever before. When troubleshooting application performance issues, this can make it difficult to link the symptoms you observe through monitoring the “golden signals” (requests, latency, and errors) on individual endpoints in your application to their underlying root causes. In a highly distributed environment, network monitoring becomes a key piece of the puzzle in understanding the behavior of your networked services.

With Datadog, you can quickly and easily pivot between network telemetry and application performance data to get a more comprehensive picture of your infrastructure’s health. Datadog APM lets you easily track distributed traces end to end across your entire stack, helping you identify and resolve application-level bugs and resource bottlenecks. And Datadog’s Cloud Network Monitoring (CNM) provides observability into network communication by aggregating granular network data into application-layer dependencies across your entire environment. Datadog CNM and APM are tightly integrated, letting you aggregate and analyze both your application and network data using unified service tagging to monitor across services, cloud regions, containers, and more. This makes it easy for your teams to integrate network telemetry into their root cause analysis workflows.
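To illustrate what unified service tagging looks like in practice, here is a minimal sketch for a Node.js service instrumented with dd-trace. The service, env, and version values below are placeholders, and the same tags can instead be supplied through the DD_SERVICE, DD_ENV, and DD_VERSION environment variables:

```typescript
// tracer.ts -- load and initialize the tracer before any other instrumented modules
import tracer from 'dd-trace';

tracer.init({
  service: 'web-store',  // the same service tag CNM uses to aggregate this service's network traffic
  env: 'production',     // scopes both traces and network metrics to an environment
  version: '1.4.2',      // placeholder version tag, useful for correlating deployments with regressions
});

export default tracer;
```

Because APM traces and CNM flows share these tags, filtering on the same service or env in either product returns the same slice of your environment.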

In this post, we will cover how to use Datadog to analyze application and network performance data in tandem to more efficiently troubleshoot the root cause of application issues. We’ll walk through how to debug application issues with CNM and how to use Datadog APM to investigate observed network issues.

Debug application issues with CNM

A triggered APM alert indicates high latency on an endpoint in one of our app's services.

Let’s say we’ve received an alert reporting high average latency on one of our services, web-store. We can start by jumping directly from the alert to a list of web-store’s traces in APM to look for the slowest requests. From here, we can drill into a specific trace to examine spans that show runtime errors or that have an unusually long execution time, which may suggest that a particular process or function call needs to be optimized.
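If a suspect function call isn’t already captured as its own span, custom instrumentation can make it visible in the flame graph. A minimal sketch with dd-trace for Node.js follows; the function and operation names here are hypothetical, not taken from the actual web-store code:

```typescript
import tracer from 'dd-trace';

// Wrap a suspect call in its own span so its execution time shows up
// as a distinct bar in the trace's flame graph.
async function getRecommendations(cartId: string): Promise<string[]> {
  return tracer.trace('web-store.get_recommendations', { resource: cartId }, async () => {
    // ...call the downstream recommendation service here...
    return [];
  });
}
```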

Datadog APM unifies telemetry from our application and from the underlying infrastructure. This means we can easily correlate the increased request latency with ingested logs from the service, which may also help us locate code errors.
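One way to get that correlation automatically is trace-log injection. The sketch below assumes a Node.js service logging with winston; with logInjection enabled (or DD_LOGS_INJECTION=true), dd-trace adds trace and span IDs to each log record so Datadog can tie a log line back to the exact request that produced it:

```typescript
// The tracer should be loaded before other modules (e.g., via node --require dd-trace/init)
// so that supported logging libraries are patched.
import tracer from 'dd-trace';
import { createLogger, format, transports } from 'winston';

tracer.init({ logInjection: true });  // injects dd.trace_id and dd.span_id into log records

const logger = createLogger({
  format: format.json(),                  // JSON output keeps the injected IDs machine-readable
  transports: [new transports.Console()],
});

// Hypothetical log call: this error line would appear alongside the slow trace in Datadog
logger.error('checkout request timed out', { endpoint: '/shopping_cart/checkout' });
```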

The Metrics tab in APM helps us validate the health of the trace's computing resources.

To look for possible resource bottlenecks as a cause, we can examine metrics for each span’s underlying host or container to understand whether the latency may be caused by a lack of available CPU or disk space. Resource contention can significantly slow down request execution if, for instance, a host is trying to execute many requests simultaneously. The Processes tab breaks down the resource consumption of all processes running on the host to reveal any heavy colocated processes that may need to be streamlined.

Turn to CNM to find the root cause

In our case, everything about our service that we can observe directly through the trace appears normal—we see no relevant code errors and have also validated that the issue is not with our underlying infrastructure. We can take advantage of CNM’s Network tab within APM to refocus our analysis on network issues.

The Network tab in APM facilitates correlation of network and trace data on a single pane of glass.

Examining the trace’s flame graph, we observe significant gaps between the execution of some spans, representing time that elapses between the operations of different services. In these gaps, we can infer that network communication is occurring between web-store and its dependencies, which we can investigate further in the Network tab.

The Network tab shows aggregated network performance metrics like network throughput, errors (TCP retransmits), and latency (TCP round-trip time) for network traffic on the selected service. We can group the destination by the availability-zone facet to see which AZs web-store is communicating with. Sorting by received volume, we notice that our web-store endpoint is sending a significant amount of traffic to us-east4-c, even though we expect it to communicate mostly with us-east4-b. This unexpected cross-AZ traffic points to a potential misconfiguration on the client side, which would explain the latency we observed on that endpoint.

We can verify the hypothesis we developed in APM with the Network Map in CNM.

The network map reveals network dependency relationships between services in our application.

The Network Map visualizes network traffic between any tagged objects in our environment, from services to pods to cloud regions. Grouping by availability zone and selecting received volume as the metric for the edge weights, we can see that us-east4-b and us-east4-c are communicating more than usual, which reflects the unexpected traffic we observed our web-store service sending to us-east4-c.

Now that we’ve determined the root cause, we can focus our remediation efforts. By digging into our code’s deployment history, we can surface any recent deployments that line up with the spike in cross-AZ traffic volume, and then audit the changes. By issuing a timely rollback, we can relieve our web-store service of the undue overhead and mitigate what could have been a significant increase in our cloud provider bill.

Use Datadog APM to investigate observed network issues

The network page reveals network dependency relationships between services in our application, along with vital network metrics at a glance.

Thanks to unified tagging, we can use CNM to identify potential application issues to troubleshoot in APM. For example, we can start in the Network page to get a high-level view of pod communication and health across our environment. When sorting the table by TCP round-trip time, we observe that requests from rails-storefront-5df44896b5-8csqf to product-recommendation-lite-bd8c8d487-g56zp are experiencing high latency as well as connectivity issues, indicated by a large number of TCP retransmits. This could indicate that a misconfiguration or code error on either the source or destination endpoint is causing requests to hang, leading to packet loss.

When inspecting one of these faulty dependencies, we can use the Traces tab to home in on traces that may be throwing errors or suffering from exceptionally long execution time.

We can jump from the network page to a list of relevant traces for a given network path.

In the table shown, we can see some extremely slow requests being generated, some of which are leading to timeouts. CNM and APM are linked through this view, so we can open this list in APM and filter and sort the traces by other facets, including duration and status, to find the slowest traces or those throwing errors. We can then drill down into one of the traces that is throwing errors to diagnose the cause of its exceptionally long duration.

This trace shows a failing function call that some requests on our network path depend on.

By correlating logs with this trace, we observe a series of timeout errors thrown by requests to the ShoppingCartController#checkout endpoint. Our trace surfaces spans for the function recommendationController.addPurchases that attempt to run for over 15 seconds before timing out. This prevents the checkout process from completing and triggers TCP retransmits, so we can conclude that this issue is likely a significant contributor to the connectivity issues we previously spotted in CNM. We’ve used APM to root out an issue on a particular endpoint of our service, and we can now work toward releasing a code fix.
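The exact fix depends on the code, but one plausible direction is to bound how long checkout waits on the recommendation service so a slow dependency fails fast instead of holding connections open. A hypothetical sketch in Node.js follows; the URL and function name are illustrative, not the service’s real code:

```typescript
// Hypothetical sketch: cap the time spent waiting on the recommendation service
// so checkout can degrade gracefully instead of hanging until the 15-second timeout.
async function addPurchases(cartItems: string[]): Promise<void> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 2000); // fail fast after 2 seconds

  try {
    await fetch('http://product-recommendation/purchases', {  // illustrative URL
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify({ items: cartItems }),
      signal: controller.signal,
    });
  } catch (err) {
    // Degrade gracefully: complete checkout without updating recommendations
    console.warn('skipping recommendation update', err);
  } finally {
    clearTimeout(timer);
  }
}
```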

Start debugging with Datadog CNM alongside APM

Together, Datadog CNM and APM provide your team with a number of workflows for quick, precise root cause analysis of application performance issues across your infrastructure, all from a single pane of glass. By enabling quick correlation between application traces and networking metrics, these tools make network telemetry accessible to every engineer in your organization, so teams get more context into problems and can streamline the process of managing services and triaging issues in production.

To get started with APM, follow the configuration steps to start tracing your application. Then, leverage Cloud Network Monitoring to start monitoring the health of your application dependencies. Be sure to implement unified service tagging to get the most comprehensive correlation and the broadest set of facets. Or, if you’re brand new to Datadog, sign up for a 14-day free trial.