Datadog NPM now supports Istio networking

Authors: Yael Goldstein and Paul Gottschling

Published: January 15, 2021

Istio is an open source service mesh that provides an abstraction layer for network traffic between applications, so you can run canary deployments, implement circuit breakers, and otherwise manage the architecture of your network using high-level configuration files. As service meshes become increasingly popular in containerized environments, dev and ops teams need to ensure that Istio is healthy, performant, and routing traffic as intended, both to keep their network infrastructure running smoothly and to avoid Istio networking issues.

Datadog Network Performance Monitoring automatically visualizes the topology of your Istio-managed network, giving you instant insight into dependencies between services, pods, and containers. You can then track the health and performance of these dependencies in the context of traces, logs, and process data from your infrastructure and applications. This makes it easier to understand your mesh at a glance, keep mesh communication healthy and efficient, get fine-grained visibility into Envoy traffic, and see all the context you need in one place.

Get visibility into Istio network traffic using the Network Page.

How Istio networking works

In a Kubernetes cluster, Istio makes it possible to re-architect your network on the fly by deploying an Envoy proxy as a sidecar within each application pod. When one application pod sends traffic to another, the Envoy proxy within the first pod intercepts the traffic and, using configurations it receives from Istio’s control plane, routes the traffic to another Envoy proxy (or multiple proxies). Since application containers only need to handle traffic to and from their local sidecars, the control plane can reroute mesh traffic by sending new configurations to Envoy—without modifying application code.

The way Istio implements traffic re-routing can make it difficult to track network dependencies. Each Kubernetes pod has a dedicated kernel network namespace within the underlying host. Istio edits the namespace’s iptables rules so that any traffic flowing into or out of the pod will be redirected to the namespace’s loopback interface on a port used by Envoy. Because this process uses network address translation, it modifies the source and destination addresses within IP packets. This means that, at different points in your mesh, the same traffic can appear to flow between different endpoints.

How network address translation works in Istio.
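To make the effect of this redirection concrete, here is a minimal Python sketch of how the same connection can look different depending on where you observe it, and how a lookup table can map the translated flow back to the original. It is a toy model for illustration only, not how Istio, the kernel, or NPM is implemented. The `Flow` and `NatTable` names, the IP addresses, and the use of port 15001 for Envoy are all assumptions made for the example.

```python
from dataclasses import dataclass, replace
from typing import Dict

# Assumption: the port Envoy listens on for redirected outbound traffic.
# The value is illustrative only.
ENVOY_OUTBOUND_PORT = 15001


@dataclass(frozen=True)
class Flow:
    """A connection as seen at one observation point."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


class NatTable:
    """Toy stand-in for a connection tracking table (hypothetical)."""

    def __init__(self) -> None:
        self._original_by_translated: Dict[Flow, Flow] = {}

    def redirect_to_sidecar(self, flow: Flow) -> Flow:
        """Rewrite the destination to the sidecar on the loopback interface."""
        translated = replace(
            flow, dst_ip="127.0.0.1", dst_port=ENVOY_OUTBOUND_PORT
        )
        # Remember the mapping so the original endpoints can be recovered.
        self._original_by_translated[translated] = flow
        return translated

    def resolve(self, observed: Flow) -> Flow:
        """Return the pre-translation flow if recorded, else the flow as observed."""
        return self._original_by_translated.get(observed, observed)


nat = NatTable()
# What the application believes it is connecting to.
app_view = Flow("10.0.1.5", 43210, "10.0.2.9", 9080)
# What the same connection looks like after redirection.
redirected_view = nat.redirect_to_sidecar(app_view)
```

Calling `nat.resolve(redirected_view)` returns `app_view`, which is the kind of bookkeeping a monitoring tool needs in order to report the real endpoints of a connection rather than the loopback address.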

To address this challenge, Network Performance Monitoring automatically accounts for Istio’s network address translation logic. This gives you complete visibility into your Istio traffic with no configuration.

Instantly understand your mesh

The Network Map automatically visualizes the topology of your Istio mesh, so you can instantly tell which parts of your network are communicating, inspect the results of your traffic management configuration, and spot any unintended traffic.

For example, if you are using weight-based routing to canary a new version of a service, you can easily compare traffic volume between releases by grouping Network Map nodes by image_tag, which indicates the version of each container image in your mesh. Since the color of each node indicates whether any monitors for that service have entered an alerting state, you can see at a glance whether it is safe to increase the proportion of traffic that flows to the version you are canarying.

In this example, we are using node size and edge width to visualize the volume of traffic sent between Kubernetes services and, later, pods.
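For reference, a weight-based canary might be configured along the following lines. In Istio, the weights are set on the routes of a virtual service, and the subsets they point to are defined in a destination rule. This Python sketch builds such a virtual service manifest as a dictionary and prints it as JSON. The `canary_virtual_service` helper and the `v1` and `v2` subset names are hypothetical, and the sketch assumes that a destination rule defining those subsets already exists.

```python
import json
from typing import Dict, List


def canary_virtual_service(name: str, host: str, weights: Dict[str, int]) -> dict:
    """Build a VirtualService manifest that splits traffic across subsets by weight."""
    if sum(weights.values()) != 100:
        raise ValueError("route weights must add up to 100")
    routes: List[dict] = [
        {"destination": {"host": host, "subset": subset}, "weight": weight}
        for subset, weight in weights.items()
    ]
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "VirtualService",
        "metadata": {"name": name},
        "spec": {"hosts": [host], "http": [{"route": routes}]},
    }


# Hypothetical example: send 10 percent of traffic to the canary subset.
manifest = canary_virtual_service("reviews", "reviews", {"v1": 90, "v2": 10})
print(json.dumps(manifest, indent=2))
```

Once the Network Map shows healthy traffic for the canary's image_tag, you could raise the `v2` weight and lower `v1` by the same amount.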

The Network Map can also reveal misconfigured communication between services. For example, traffic between service pods in different cloud regions can create unnecessary costs as well as compliance risks. You can aggregate traffic by region to see how much cross-regional traffic your Istio cluster is managing—and to identify possible issues.
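The aggregation described above amounts to summing traffic by source and destination region. The Python sketch below shows the arithmetic on a handful of made-up flow records. The record shape, the region names, and the byte counts are all hypothetical, and the sketch does not use any Datadog API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# Hypothetical flow record: (source region, destination region, bytes sent).
FlowRecord = Tuple[str, str, int]


def bytes_by_region_pair(flows: Iterable[FlowRecord]) -> Dict[Tuple[str, str], int]:
    """Sum bytes sent for each (source region, destination region) pair."""
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for src_region, dst_region, sent in flows:
        totals[(src_region, dst_region)] += sent
    return dict(totals)


def cross_region_share(flows: Iterable[FlowRecord]) -> float:
    """Return the fraction of all bytes that crossed a region boundary."""
    records: List[FlowRecord] = list(flows)
    total = sum(sent for _, _, sent in records)
    if total == 0:
        return 0.0
    crossing = sum(sent for src, dst, sent in records if src != dst)
    return crossing / total


sample_flows: List[FlowRecord] = [
    ("us-east-1", "us-east-1", 600),
    ("us-east-1", "eu-west-1", 300),
    ("eu-west-1", "eu-west-1", 100),
]
totals = bytes_by_region_pair(sample_flows)
share = cross_region_share(sample_flows)
```

A high cross-region share for a pair of services that should be co-located is the kind of signal worth investigating further.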

Keep mesh communication healthy and efficient

Network Performance Monitoring helps you ensure that communication between services in your Istio mesh is healthy and performant at the container, pod, and service layers. The Network Page visualizes key network performance metrics for monitoring the volume, errors, and latency of network dependencies and can be scoped by any combination of Datadog tags, giving you insight into particular parts of your mesh.

For example, you can use the Network Page to check whether downstream dependencies of a service are rapidly opening and closing a high number of TCP connections, leading to heavy CPU utilization on the service’s containers. If the number of TCP connections is increasing, this could indicate that your service is establishing connections without reusing them.

You can correlate connection churn with the CPU utilization of the underlying infrastructure by clicking one of the series within the graph, then selecting the option to “View related containers.” A Live Container view will indicate which containers are showing a high rate of CPU utilization. You can use this data to guide the value of maxConnections within a destination rule for your service, which caps the number of connections to the service so that no new connections are established once the maximum is reached, protecting the service from resource saturation.

Get context for your NPM data with real-time resource metrics for containers in your Istio mesh.
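As a rough illustration of where maxConnections lives, the Python sketch below builds a destination rule manifest with a TCP connection pool limit and prints it as JSON. The `connection_limit_rule` helper, the rule name, and the limit of 100 are hypothetical. The right value depends on what your own connection and CPU data show.

```python
import json


def connection_limit_rule(name: str, host: str, max_connections: int) -> dict:
    """Build a DestinationRule manifest that caps connections to a host."""
    if max_connections < 1:
        raise ValueError("max_connections must be at least 1")
    return {
        "apiVersion": "networking.istio.io/v1beta1",
        "kind": "DestinationRule",
        "metadata": {"name": name},
        "spec": {
            "host": host,
            "trafficPolicy": {
                "connectionPool": {"tcp": {"maxConnections": max_connections}}
            },
        },
    }


# Hypothetical example: allow at most 100 connections to the reviews service.
rule = connection_limit_rule("reviews-connection-limit", "reviews", 100)
print(json.dumps(rule, indent=2))
```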

You can also use the Network Page to identify problematic spikes in metrics like TCP round-trip time and retransmits, so you can quickly resolve connection issues between network dependencies.

Fine-grained visibility into Envoy traffic

While Istio offers flexibility in architecting and managing network connections, it introduces another layer of network infrastructure where errors and latency can arise: the mesh of Envoy proxies. There are various reasons why traffic from an application container might stop at the local Envoy sidecar without entering the rest of your mesh. In one scenario, the sidecar could be misconfigured to parse HTTPS requests from the application container as HTTP. In another, the host may have an insufficient file descriptor limit to support connections with all the proxies in your mesh.
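The file descriptor scenario can be checked with simple arithmetic. The Python sketch below uses the standard library's `resource` module (available on Unix-like systems) to read the soft limit on open files and compares it with an expected connection count. It assumes that each connection consumes one descriptor, and it reads the limit of the process it runs in, so the numbers only apply to a proxy if the check runs under the same limits. The function names and figures are hypothetical.

```python
import resource


def current_soft_limit() -> int:
    """Return the soft limit on open file descriptors for this process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return soft


def has_fd_headroom(soft_limit: int, open_fds: int, expected_connections: int) -> bool:
    """Check whether the limit covers current descriptors plus expected connections.

    Assumes one file descriptor per connection.
    """
    if soft_limit == resource.RLIM_INFINITY:
        return True
    return open_fds + expected_connections <= soft_limit


# Hypothetical figures: 100 descriptors in use, 2,000 connections expected.
enough = has_fd_headroom(current_soft_limit(), open_fds=100, expected_connections=2000)
print("descriptor limit is sufficient:", enough)
```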

With Network Performance Monitoring, you can monitor not only traffic between Istio-managed services, but also cross-pod traffic through Envoy proxies. Envoy is tagged as a container, meaning that you can map traffic in your Istio mesh at the smallest possible granularity: between the application container originating traffic and its sidecar, and through the destination sidecar to its respective application container. This helps you pinpoint the source of latency and errors in your container traffic.

The Network Page showing connections between each Envoy container and a container running the “reviews” service.

All the context you need in one place

While you can resolve some network traffic issues by changing your Istio configuration, for others you need to examine the applications or infrastructure running at Istio-managed endpoints. For example, an application could include its own connection-handling logic or communicate with different upstream endpoints based on the content of an incoming HTTP request. Network Performance Monitoring allows you to distinguish between issues in your Istio configuration (or the performance of Istio’s control plane) and in the health and performance of services running in your Istio mesh.

You can easily correlate network performance metrics from your Istio mesh with data from your applications and infrastructure. When you click on a connection between Istio-managed services within the Network Page, you can view metrics, traces, and logs for that connection within a sidebar. If the volume of traffic sent between two services falls dramatically, for example, you can inspect traces related to the connection and see if recently released application code is to blame.

The Network Page sidebar lets you view Istio traffic data in context with logs.

You can also create custom dashboards by exporting graphs from the Network Page, service pages, log analytics views, and out-of-the-box dashboards for the technologies running in your environment, including the integration for the Istio control plane. You can use a custom dashboard to tell, for example, if a recent xDS push correlates with a decline in network volume between two key services. Datadog makes it easy to get context so you can address issues in your mesh more quickly.

And if your Istio-managed services communicate with infrastructure outside your mesh, NPM can provide full visibility into that traffic as well. This means that if a dependency outside your mesh appears to be causing issues, you can use the Network Map to see if any alerts have triggered for that dependency, and the Network Page to see if you should investigate further dependencies outside your mesh.

Every connection in your mesh

Datadog Network Performance Monitoring automatically monitors network traffic flowing through Envoy proxies in your Istio mesh, including traffic modified by network address translation. NPM runs on your Istio hosts as a low-overhead eBPF program managed by the Datadog Agent, and you can install it by editing your Agent configuration files. (Make sure you’ve upgraded the Agent to at least version 7.25.) You can use NPM to locate possible root causes of Istio networking issues, as well as get real-time architecture visualizations and spot inefficient designs. If you’re not yet a Datadog customer, you can sign up for a free trial.