Etcd is a distributed key-value data store that provides highly available, durable storage for distributed applications. In Kubernetes, etcd functions as part of the control plane, storing data about the actual and desired state of the resources in a cluster. Kubernetes controllers use etcd’s data to reconcile the cluster’s actual state to its desired state.
This series focuses on monitoring etcd in Kubernetes. Later in this post, we’ll show you key metrics you should monitor to ensure that etcd is effectively supporting the health and performance of your Kubernetes cluster. But first, we’ll explore how etcd provides highly available data storage and how Kubernetes uses that data to manage the state of the cluster.
How etcd works
In this section, we’ll look at how etcd stores and manages state data for Kubernetes clusters and how it provides high availability and data consistency.
Kubernetes cluster data in etcd
Each object stored in etcd defines a Kubernetes resource as a collection of key-value pairs. By default, each etcd object has a key name that begins with /registry. This key also includes the type, namespace, and name of the resource. As an example, consider a pod called <POD_NAME> in the default namespace. The etcd object that defines the pod will use the key name /registry/pods/default/<POD_NAME>. The value of that object will contain data that describes the pod, including its name, container image(s), and resource requests and limits.
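To make the key naming concrete, here is a minimal sketch that uses etcd's Go client (go.etcd.io/etcd/client/v3) to read a pod's object directly from the data store. The endpoint address and pod name are placeholders, a real control plane typically requires TLS client certificates, and the value comes back in Kubernetes' binary protobuf encoding rather than readable YAML.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; a real cluster typically also requires TLS credentials.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Keys follow the /registry/<resource type>/<namespace>/<name> convention.
	resp, err := cli.Get(ctx, "/registry/pods/default/<POD_NAME>")
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		// The value is Kubernetes' protobuf-encoded representation of the pod.
		fmt.Printf("key: %s (value: %d bytes, mod revision: %d)\n",
			kv.Key, len(kv.Value), kv.ModRevision)
	}
}
```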
The Kubernetes API server—kube-apiserver—acts as etcd's client. As the API server receives requests from its own clients, it reads and writes cluster state information on their behalf by sending gRPC requests to etcd's API. For example, when the kube-scheduler needs to assign a workload to a node, it can update the cluster's desired state by communicating the change to kube-apiserver, which then sends a write request to etcd. Other Kubernetes components interact with etcd via kube-apiserver—including the kube-controller-manager and the cloud controller manager—and so does the Kubernetes command-line tool, kubectl.
Etcd uses multi-version concurrency control (MVCC) to provide visibility into the history of the cluster’s state. When etcd creates, updates, or deletes an object, it first commits that change, writing the new version of the data to disk in the write-ahead log (WAL). Next, it applies the change, persisting it to the key-value store and incrementing the revision—a counter of all the changes that have taken place.
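As an illustration of how revisions work, the fragment below (reusing the cli client and ctx from the sketch above) reads the same key at the current revision and at an earlier one. This is illustrative only; older revisions remain readable only until they are compacted away.

```go
// Read the current version of a key and note the store's revision at that point.
current, err := cli.Get(ctx, "/registry/pods/default/<POD_NAME>")
if err != nil {
	log.Fatal(err)
}
rev := current.Header.Revision

// Read the same key as it existed at an earlier revision (rev-1 is only an
// example; it must not predate the key or a compaction).
previous, err := cli.Get(ctx, "/registry/pods/default/<POD_NAME>", clientv3.WithRev(rev-1))
if err != nil {
	log.Fatal(err)
}
fmt.Printf("current revision: %d, keys returned at revision %d: %d\n",
	rev, rev-1, len(previous.Kvs))
```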
By logging the revisions of the data store and the value of each key across all of its versions, etcd maintains a history that enables kube-apiserver to read previous versions of the cluster's state. It also enables kube-apiserver to watch the value of objects in the data store on behalf of clients that need to stay informed of relevant changes. For example, kube-scheduler is a client of kube-apiserver, which notifies it when etcd's data reflects a change in the cluster's desired state. Controllers also rely on kube-apiserver to watch etcd data on their behalf so that they know when to make a change in the cluster. For example, the node controller can create a watch to detect nodes that have become unresponsive and evict the pods running on them. Etcd sends events to kube-apiserver via gRPC streams to inform it of these changes, and kube-apiserver notifies the relevant controller.
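The fragment below sketches the same watch mechanism using etcd's Go client, again reusing the cli client from the earlier example. It watches every key under the /registry/pods/ prefix and prints each event; kube-apiserver's real watch handling is, of course, far more involved.

```go
// Watch for changes to any key under /registry/pods/ and print each event.
watchChan := cli.Watch(context.Background(), "/registry/pods/", clientv3.WithPrefix())
for wresp := range watchChan {
	for _, ev := range wresp.Events {
		// ev.Type is PUT for creates and updates, DELETE for deletions.
		fmt.Printf("%s %s (revision %d)\n", ev.Type, ev.Kv.Key, ev.Kv.ModRevision)
	}
}
```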
High availability and data consistency
To provide high availability, etcd is designed to operate in a cluster of three or more redundant nodes, with each node having a local copy of the key-value store. In a cluster, etcd can continue to function even if some members become unavailable—for example, due to a failed node or a network partition, in which a network failure prevents some nodes from participating in the cluster.
The size of the cluster determines its fault tolerance, since it needs only a majority of its nodes—in other words, a quorum—to continue to operate. Clusters with an odd number of nodes provide the best fault tolerance: an odd-sized cluster can tolerate the failure of half of its members, rounded down to the nearest whole number. For example, a five-node cluster can lose up to two nodes. Increasing the cluster size to six doesn't increase its fault tolerance—you still need four of the six nodes running to maintain quorum—it only adds another node that could fail. Kubernetes recommends storing cluster state data in a five-node etcd cluster, which can tolerate the loss of two nodes. Larger etcd clusters don't perform as well because they need to replicate data across more nodes, so etcd recommends a maximum cluster size of seven nodes.
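To make the quorum arithmetic concrete, here is a small, purely illustrative Go helper that computes the quorum size and fault tolerance for a few cluster sizes.

```go
package main

import "fmt"

// quorum returns the number of members that must be available for the cluster
// to keep operating; faultTolerance returns how many members can fail.
func quorum(members int) int         { return members/2 + 1 }
func faultTolerance(members int) int { return members - quorum(members) }

func main() {
	for _, n := range []int{3, 4, 5, 6, 7} {
		fmt.Printf("%d-node cluster: quorum %d, tolerates %d failure(s)\n",
			n, quorum(n), faultTolerance(n))
	}
}
```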
While distributing the data across multiple nodes gives etcd high availability, it requires a way to manage data consistency so that querying any node gives the same results. Etcd uses the Raft consensus algorithm to synchronize the data on all nodes in the cluster. Raft enables the nodes to elect a leader, which handles all requests to write or update data. All other nodes are followers, which forward requests to the leader when necessary to ensure consistency.
The leader sends a heartbeat message to all followers every 100 ms (by default) to confirm its leader status. If a follower doesn’t receive the heartbeat within the election timeout (which is 1000 ms, by default), it assumes that the cluster has lost leadership. This can happen, for example, if the leader fails or if a network partition disrupts the flow of heartbeats. The follower can then nominate itself, initiating a new election.
When the leader receives a write request, either from a client or forwarded from a follower, it commits the proposed change to its WAL and then sends it as a proposal to all followers. Followers commit the proposal to their WALs and send confirmation to the leader. Once a majority of followers have confirmed, the leader applies the change, updating its copy of the key-value store to reflect the change. The leader then notifies the followers to apply the change to their local copies of the key-value store.
By default, etcd guarantees that it will respond to each read request with the most current value, regardless of which node serves the request. If a follower receives a read request but has a pending change it has not yet applied, it forwards the request to the leader to ensure that the client receives the latest data. This adds latency, though, so clients can optionally request serializable reads instead of the default linearizable reads, which allows the node that receives the request to respond from its local copy of the data, even if that data is stale.
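In etcd's Go client, for example, this choice is a per-request option. The fragment below (reusing the cli client and ctx from the earlier sketches) issues a default linearizable read and then a serializable read, which lets whichever node receives the request answer from its local data.

```go
// Default read: linearizable; guaranteed to reflect the latest committed state.
latest, err := cli.Get(ctx, "/registry/pods/default/<POD_NAME>")
if err != nil {
	log.Fatal(err)
}

// Serializable read: served from the local copy of the key-value store on the
// node that receives the request, which is faster but may return stale data.
local, err := cli.Get(ctx, "/registry/pods/default/<POD_NAME>", clientv3.WithSerializable())
if err != nil {
	log.Fatal(err)
}
fmt.Println("revisions:", latest.Header.Revision, local.Header.Revision)
```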
Key etcd metrics
So far in this post, we've looked at how Kubernetes stores data in etcd and how etcd provides high availability and data consistency. In this section, we'll look at metrics you can monitor to ensure the health and performance of your etcd cluster. We'll explore metrics from these categories:
- Resource metrics
- Network performance metrics
- Watch metrics
- Raft metrics
- Kubernetes metrics
Terminology in this section comes from our Monitoring 101 series of blog posts, which provide a framework for effective monitoring.
Resource metrics
Each node in your etcd cluster uses its local resources to run the etcd process; receive, forward, and respond to requests; and maintain a copy of the key-value store. The metrics described in this section can keep you informed of the health of individual nodes and can help you correlate your etcd cluster’s resource usage to the performance of your Kubernetes cluster.
Process metrics
Name | Description | Metric type | Availability |
---|---|---|---|
process_open_fds, process_max_fds | The number of file descriptors in use by the etcd process (process_open_fds) and the maximum number available to it (process_max_fds) | Resource: Utilization | Prometheus |
process_resident_memory_bytes | The amount of memory used by the etcd process, in bytes | Resource: Utilization | Prometheus |
Metric to alert on: process_open_fds, process_max_fds
Each network connection that etcd uses—for example, to communicate with other nodes in the cluster and to receive requests from kube-apiserver—is identified using a unique file descriptor. Etcd also uses file descriptors for reading and writing data in its WAL and key-value store. Like all processes, etcd has a limited number of file descriptors available, and if it exhausts them, it will crash.
You should create a monitor to notify you if any node has used 80 percent of its available file descriptors, so that you have time to allocate more. If you're using a monitoring platform that provides forecasting capabilities, you can configure a monitor that predicts when this threshold is likely to be crossed and alerts you ahead of time.
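If you scrape these metrics with Prometheus, one way to check file descriptor utilization programmatically is through the Prometheus HTTP API and its Go client (github.com/prometheus/client_golang/api). The sketch below is full of assumptions: the Prometheus address, the job="etcd" label, and the 80 percent threshold are all placeholders you would adapt to your own environment.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/prometheus/client_golang/api"
	promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
	"github.com/prometheus/common/model"
)

func main() {
	// Placeholder Prometheus address.
	client, err := api.NewClient(api.Config{Address: "http://localhost:9090"})
	if err != nil {
		log.Fatal(err)
	}
	promAPI := promv1.NewAPI(client)

	ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
	defer cancel()

	// Ratio of file descriptors in use to the maximum available, per instance.
	// The job label is an assumption; match it to how you scrape etcd.
	query := `process_open_fds{job="etcd"} / process_max_fds{job="etcd"}`
	result, warnings, err := promAPI.Query(ctx, query, time.Now())
	if err != nil {
		log.Fatal(err)
	}
	if len(warnings) > 0 {
		log.Println("warnings:", warnings)
	}

	// Flag any instance above the (assumed) 80 percent threshold.
	if vector, ok := result.(model.Vector); ok {
		for _, sample := range vector {
			if sample.Value > 0.8 {
				fmt.Printf("%s is using %.0f%% of its file descriptors\n",
					sample.Metric["instance"], float64(sample.Value)*100)
			}
		}
	}
}
```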
Metric to watch: process_resident_memory_bytes
The etcd process uses memory to index and cache data. If your etcd nodes suffer OOM errors or degraded performance, check to see if there’s a correlated increase in this metric. You may be able to resolve issues like this by allocating more memory to etcd, though this may require you to scale up your nodes.
Disk metrics
Etcd’s documentation includes baseline performance data that shows the rate of queries you can expect your etcd server to process. If your cluster’s performance isn’t in line with these figures, disk performance may be a root cause.
If etcd takes longer than expected to write data to disk, it could degrade the performance of your Kubernetes cluster and slow down your application. Using high-performance disks can minimize latency, but it’s important to use the following metrics to understand your risk of performance degradation due to disk latency, particularly if you’re running large and busy Kubernetes clusters.
Name | Description | Metric type | Availability |
---|---|---|---|
etcd_disk_backend_commit_duration_seconds | The time required for etcd to persist its most recent changes to disk, in seconds | Work: Performance | Prometheus |
etcd_disk_wal_fsync_duration_seconds | The time required for etcd to persist to disk the pending changes that exist in the write-ahead log (WAL), in seconds | Work: Performance | Prometheus |
etcd_mvcc_db_total_size_in_bytes | The size of the etcd database, in bytes | Resource: Utilization | Prometheus |
Metric to alert on: etcd_disk_backend_commit_duration_seconds
After the leader receives a write request, it commits the change to its WAL. It then forwards the proposed change to the followers, who update their WALs. This metric reflects the latency in the commit process on each node, whether it’s the leader or a follower. Increased commit duration can be caused by network bottlenecks and disk I/O constraints, and can affect the performance of the Kubernetes cluster.
Commit duration can fluctuate, but you should create an automated monitor to notify you if it increases substantially over a short period of time, e.g., more than 25 percent over the last five minutes. You may be able to reduce commit duration by switching to a storage technology that offers greater throughput.
Metric to alert on: etcd_disk_wal_fsync_duration_seconds
Before it applies changes from the WAL to the key-value store, the etcd process calls fsync to persist the pending log entries to disk. If fsync duration increases, it could be a sign that the disk doesn't have sufficient I/O available. In this case, you may see a correlation with the etcd_disk_backend_commit_duration_seconds metric.
You should configure a monitor to notify you if etcd_disk_wal_fsync_duration_seconds substantially increases over a short window of time—for example, if it rises more than 50 percent within the last five minutes. If you get notified, you can provision more I/O on the affected node(s).
Metric to alert on: etcd_mvcc_db_total_size_in_bytes
By default, etcd can store up to 2 GiB of data. That maximum is configurable, but etcd recommends setting the maximum size to no larger than 8 GiB. Etcd’s database grows each time Kubernetes makes a change to the state of the cluster. If you reach the maximum size, etcd won’t accept any more writes, and Kubernetes can’t execute cluster management tasks, such as scheduling new pods.
You should create a monitor to notify you if your database is using over 80 percent of its configured maximum storage. This will give you enough time to remediate the problem before you run out of space—for example, by compacting and defragmenting etcd or revising kube-apiserver’s auto-compaction interval (which has a default value of five minutes).
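For reference, etcd's Go client also exposes maintenance calls you can use to check database size and to compact and defragment manually. The fragment below (reusing the cli client and ctx from the earlier sketches, with a placeholder endpoint) is only a sketch; in practice, compaction is usually handled by kube-apiserver's auto-compaction, and defragmentation is typically scheduled rather than triggered ad hoc.

```go
// Check the current database size on one endpoint (placeholder address).
status, err := cli.Status(ctx, "https://127.0.0.1:2379")
if err != nil {
	log.Fatal(err)
}
fmt.Printf("db size: %d bytes (current revision: %d)\n", status.DbSize, status.Header.Revision)

// Compact the key-value store's history up to the current revision...
if _, err := cli.Compact(ctx, status.Header.Revision); err != nil {
	log.Fatal(err)
}

// ...then defragment the endpoint to release the freed space back to the filesystem.
if _, err := cli.Defragment(ctx, "https://127.0.0.1:2379"); err != nil {
	log.Fatal(err)
}
```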
Network performance metrics
Etcd relies on a healthy network for efficient communication with the kube-apiserver and among the etcd cluster’s nodes. If your application’s latency increases, the metrics in this section may help you identify a root cause in the performance of your network.
Name | Description | Metric type | Availability |
---|---|---|---|
etcd_network_peer_round_trip_time_seconds | The time required for a request to make a round trip between two etcd nodes, in seconds | Work: Performance | Prometheus |
grpc_server_handled_total | The total number of gRPC requests handled by the server | Work: Throughput | Prometheus |
Metric to alert on: etcd_network_peer_round_trip_time_seconds
The Raft protocol requires etcd nodes to communicate with each other in order to maintain data consistency—for example, when the leader forwards proposals to followers. An increase in round-trip time that affects any part of your network can degrade Kubernetes performance. You should set a monitor to notify you if round-trip time increases more than 15 percent over a five-minute period. An increase like this can cause high commit latency, so you should troubleshoot and resolve any network-level issues to avoid slowing down your cluster’s performance.
Metric to watch: grpc_server_handled_total
This metric counts the number of gRPC requests that etcd has completed, regardless of the outcome. You can break it down to show the gRPC method and service requested, for example, to see how many calls are going to etcd's watch and kv APIs. You can also analyze etcd's activity based on the status codes resulting from the gRPC calls, for example, by filtering for DeadlineExceeded results to see whether calls are timing out or PermissionDenied results to identify potential security concerns.
You can compare this metric across the nodes in your cluster to understand the relative amount of request traffic flowing to each node. If any single node shows a drop in requests handled, you should investigate the relevant network segments and look at node-level metrics to check for CPU, memory, or network constraints. And you can compare it to grpc_server_started_total—which counts the number of gRPC requests that etcd started to process (even if it didn’t complete them)—to see the proportion of requests that weren’t completed. If a node is failing to process requests, you should look for errors in its gRPC logs.
Watch metrics
The Kubernetes API server establishes watches with the etcd API to detect changes that controllers need to know about to update the cluster configuration, for example by exposing a new Service or replacing a failed pod. Monitoring these metrics can help you ensure that kube-apiserver’s watches are healthy, enabling controllers to keep the cluster in its desired state.
Name | Description | Metric type | Availability |
---|---|---|---|
etcd_debugging_store_watchers | The number of queries watching for changes in etcd’s data | Other | Prometheus |
etcd_debugging_mvcc_slow_watcher_total | The number of unsynchronized watchers | Other | Prometheus |
Metric to watch: etcd_debugging_store_watchers
Etcd should automatically balance watches to distribute them proportionally among the cluster’s nodes. Significantly unbalanced connections could be caused by outdated gRPC client or server versions, as described in the etcd documentation.
Metric to watch: etcd_debugging_mvcc_slow_watcher_total
This metric increments when kube-apiserver is unable to consume events (i.e., updates to the data it's watching) as fast as etcd streams them. This could be caused by network latency or by resource contention on a control plane node that prevents kube-apiserver from keeping up with the server.
Raft metrics
Etcd uses the Raft consensus algorithm to ensure that data is consistent across all nodes in the cluster. The metrics described in this section can help you see whether the cluster’s consensus is healthy or at risk.
Name | Description | Metric type | Availability |
---|---|---|---|
etcd_server_leader_changes_seen_total | The number of times the etcd cluster has seen a change in leadership | Other | Prometheus |
etcd_server_proposals_failed_total | The number of proposals that failed to get confirmed by a majority of nodes | Other | Prometheus |
etcd_server_proposals_committed_total, etcd_server_proposals_applied_total | The number of times the node has successfully committed and applied changes | Other | Prometheus |
Metric to alert on: etcd_server_leader_changes_seen_total
Cluster leadership can change at any time due to a failed leader node or network latency that prevents followers from receiving heartbeats within the configured timeout. While a new leader is being elected, the cluster is unable to add or update data, so your applications may experience increased latency.
You should alert on a sustained high rate of leadership changes—for example, more than 100 times in an hour. This can be caused by insufficient network throughput. It could also be due to resource constraints on etcd nodes—which could cause the leader to fail if the etcd cluster is too busy—so you may need to scale up your etcd nodes.
Metric to watch: etcd_server_proposals_failed_total
When the etcd leader proposes a change to the data and that proposal is not confirmed by the majority of nodes, the proposal fails. This metric tracks the number of times this occurs. Failed proposals can happen sporadically due to leader elections, but if you see an ongoing increase in this metric, it could indicate a network issue that has caused the cluster to lose quorum.
Metrics to watch: etcd_server_proposals_committed_total, etcd_server_proposals_applied_total
When a quorum agrees to a proposal, all nodes commit the change to their WALs, then apply the change to their local copy of the key-value store. If a node is healthy, it will show only a small difference between the number of committed and applied proposals. If you see an ongoing decline in the ratio of these two metrics, it means that the node isn’t applying the changes quickly enough, which could be due to insufficient CPU or disk I/O resources. If the issue correlates with increasing CPU usage on your hosts, you can provide more resources by scaling up the nodes.
You should also look for a correlated increase in the etcd_disk_backend_commit_duration_seconds metric to determine if disk performance is a factor. You may be able to mitigate this by upgrading to faster disks and dedicating disks that the nodes can use solely for etcd storage.
Kubernetes metrics
Kubernetes provides some metrics that describe etcd’s performance. Although they are still in alpha, these metrics can give you an additional perspective to help you understand etcd performance. In this section, we’ll look at how you can use one of these metrics to see whether etcd is the source of errors or latency that affect your Kubernetes cluster or your application.
Name | Description | Metric type | Availability |
---|---|---|---|
etcd_request_duration_seconds | The time it takes for etcd to respond to a request from kube-apiserver, in seconds | Work: Performance | Prometheus (Kubernetes) |
Metric to watch: etcd_request_duration_seconds
When kube-apiserver sends a request to read the desired state of the Kubernetes cluster, it must wait for etcd’s response. Any delay in this operation can in turn affect the controllers that rely on kube-apiserver for the data they use to manage the state of the cluster.
You can monitor etcd_request_duration_seconds to track how quickly etcd responds to different types of requests, including create, update, get, and list. If you see an increase in etcd's response time, you should try to correlate it to etcd performance: all of these request types can be affected by the performance of etcd's network and disks. But you should track the different request types separately, as well. The list operation can become particularly inefficient, and you may be able to optimize etcd's performance by implementing the informer pattern in place of relying on list requests.
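For reference, the informer pattern is available through Kubernetes' client-go library: an informer performs one initial list, then keeps a local cache up to date from a watch, so controllers don't need to issue repeated list requests. The sketch below is a minimal example that assumes it runs inside the cluster; error handling and workqueue logic are omitted.

```go
package main

import (
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/cache"
)

func main() {
	// Assumes the program runs inside the cluster; use clientcmd for out-of-cluster config.
	config, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	clientset := kubernetes.NewForConfigOrDie(config)

	// The informer lists pods once, then relies on a watch to keep its cache
	// current, avoiding repeated list requests against kube-apiserver (and etcd).
	factory := informers.NewSharedInformerFactory(clientset, 10*time.Minute)
	podInformer := factory.Core().V1().Pods().Informer()

	podInformer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		AddFunc: func(obj interface{}) {
			pod := obj.(*corev1.Pod)
			fmt.Println("pod added:", pod.Namespace+"/"+pod.Name)
		},
		DeleteFunc: func(obj interface{}) {
			if pod, ok := obj.(*corev1.Pod); ok {
				fmt.Println("pod deleted:", pod.Namespace+"/"+pod.Name)
			}
		},
	})

	stopCh := make(chan struct{})
	defer close(stopCh)
	factory.Start(stopCh)
	factory.WaitForCacheSync(stopCh)

	// Block so the informer keeps receiving events.
	select {}
}
```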
Monitor etcd to ensure the performance of your Kubernetes clusters
In this post, we’ve explored the key metrics you can monitor to understand etcd performance and explained how these metrics relate to the performance of your Kubernetes cluster. In Part 2 of this series, we’ll look at the tools available to collect and visualize key etcd metrics. And in Part 3, we’ll show you how you can use Datadog to monitor the health and performance of your etcd cluster.