
How to support a growing Kubernetes cluster with a small etcd

Author: David M. Lentz

Published: December 20, 2024

Etcd plays a critical role in your Kubernetes setup: it stores the ever-changing state of your cluster and its objects, and the API server uses this data to manage cluster resources. As your applications thrive and your Kubernetes clusters see more traffic, etcd handles an increasing amount of data. But etcd’s storage space is limited: the recommended maximum is 8 GiB, and a large and dynamic cluster can easily generate enough data to reach that limit. If your etcd data store exceeds its configured storage limit, it can’t accept any new data, and your Kubernetes cluster will become inoperable.

This post explores some best practices that can help you avoid outgrowing your etcd storage, even while your Kubernetes cluster becomes larger and busier. We’ll show you how you can ensure sufficient resources for your etcd cluster, manage the size of your etcd data store, and deploy multiple etcd clusters to isolate event data.

But first, we’ll explore the key data storage and cluster architecture features that shape how you manage etcd to support a growing Kubernetes cluster.

Key features of etcd storage and architecture

Etcd holds the data that describes the objects in a Kubernetes cluster—such as pods, endpoints, ConfigMaps, and events—and uses multi-version concurrency control (MVCC) to maintain a history of those objects. Kubernetes controllers use this history to detect changes in the state of the cluster and update objects as necessary to align the cluster’s actual state with its desired state. For example, the ReplicaSet controller receives a notification whenever the API server detects a change in etcd data indicating that a pod has been evicted, and the controller then creates a new replica to restore the cluster to its desired state.

To maintain this history, etcd creates a new version of an object’s data whenever the object is updated, generating additional data instead of overwriting the object’s existing data. This means that every revision of every object occupies space in etcd and that every update increases the amount of data etcd needs to store.
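You can see this versioning behavior directly with etcdctl. The sketch below assumes etcdctl v3, a reachable endpoint, and a made-up key named /demo/greeting; authentication and --endpoints flags are omitted for brevity. In practice you wouldn’t write test keys to the etcd cluster backing your control plane—this is only to illustrate how MVCC retains prior revisions until compaction removes them.

```
# Write the same key twice; each put creates a new revision rather than overwriting the old one
etcdctl put /demo/greeting "hello"
etcdctl put /demo/greeting "hello again"

# JSON output shows the key's create_revision, mod_revision, and version
etcdctl get /demo/greeting -w json

# Read the value as of an earlier revision (substitute a mod_revision reported above)
etcdctl get /demo/greeting --rev=<earlier-revision>
```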

To prevent its data store from expanding perpetually, etcd frees up space by periodically compacting old version data. But if the Kubernetes cluster experiences unexpected activity or rapid growth, etcd could run out of space, putting Kubernetes into an unpredictable state.

Etcd operates in a highly available (HA) cluster and uses the Raft protocol to ensure data consistency among its members even in the presence of node failures. Raft designates a single node as the leader, which processes data updates (such as a write request from the API server to add a Kubernetes object) and then replicates those changes to all of the follower nodes. Both the leader and the followers (other nodes in the cluster) can process read requests. The cluster may also contain learners—nodes that have recently joined the cluster, which become followers only after they have received all of the data necessary to participate in the cluster.
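For example, with etcdctl (v3.4 and later) you can add a new member as a learner and promote it once it has caught up; the member name, peer URL, and member ID below are placeholders.

```
# Add the new node as a learner so it can replicate data without affecting quorum
etcdctl member add etcd-node-4 --peer-urls=https://10.0.0.14:2380 --learner

# After starting etcd on the new node and letting it catch up, promote it to a voting member
etcdctl member promote <member-id>

# Verify the member list and each member's role
etcdctl member list -w table
```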

With multiple nodes (leaders and followers) able to respond to requests from kube-apiserver, it may seem like adding more nodes would improve performance. However, this strategy does not offer a performance advantage for etcd the way scaling out the Kubernetes control plane does, because in larger etcd clusters, every write operation requires more Raft activity. The resultant overhead of maintaining consistency across a larger number of nodes can cause a bottleneck that potentially degrades performance instead of improving it. Kubernetes advises against autoscaling etcd, and for optimal performance, etcd recommends maintaining a five-node cluster.

For a deeper look at how etcd works, see our etcd monitoring guide.

If you can’t store more than 8 GiB, how do you manage etcd to support a large and growing Kubernetes cluster? And if you can’t scale out an etcd cluster, how do you safeguard performance as Kubernetes activity increases? In the following sections, we’ll look at how you can configure and use etcd in ways that can help ensure its performance even as your Kubernetes cluster grows.

Ensure sufficient resources for your etcd cluster

Although increasing the number of etcd nodes won’t improve your cluster’s performance, you can provision more resources for each node to ensure that it performs reliably while it supports increasing Kubernetes activity. In this section, we’ll look at how you can ensure etcd performance by carefully provisioning and configuring your nodes’ disk, network, and memory resources.

Note that the guidance we’ll give here assumes that you’ve configured external etcd. As opposed to stacked etcd—in which etcd runs on the same infrastructure that hosts your Kubernetes control plane—external etcd provides separate, dedicated infrastructure. This can increase etcd availability and give you the flexibility to provision etcd resources independent of your Kubernetes infrastructure.

Use low-latency storage

Ensuring that etcd is backed by low-latency storage is one of the keys to maintaining a healthy cluster. When etcd creates, updates, or deletes data, each node must commit the change to its disk as an entry in its write-ahead log (WAL). Once etcd has replicated the change to a majority of the nodes in the cluster, all nodes apply the change, updating their local copies of the key-value store on disk. Until etcd finishes these disk operations, the change is not complete, and Kubernetes can’t update the state of the cluster. Any latency in the process of updating the etcd data store can degrade the performance of the Kubernetes cluster, indirectly affecting the containerized applications running there.

A screenshot of the etcd.disk.backend.commit.duration.seconds metric, which shows disk latency.
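One commonly used way to gauge whether a disk can keep up with etcd’s WAL is to benchmark fdatasync latency with fio, following the parameters suggested in etcd’s disk-benchmarking guidance. In the sketch below, the target directory is a placeholder; run the benchmark on the same volume that will hold the etcd data directory.

```
# Sequential writes with an fdatasync after each write approximate etcd's WAL workload
fio --rw=write --ioengine=sync --fdatasync=1 \
    --directory=/var/lib/etcd-bench --size=22m --bs=2300 --name=etcd-wal-test
```

As a rough rule of thumb from that guidance, the 99th percentile of fdatasync durations reported by fio should stay below about 10 ms for etcd to perform well.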

Even if your nodes’ disks are fast enough to support your cluster’s commit and apply activity, latency could still cause a bottleneck if you need to recover a failed node. Etcd’s HA architecture allows it to continue to function even if some nodes fail, as long as the majority of the cluster’s nodes are still operable. For example, you can lose one node of a three-node cluster, but you’ll have no tolerance for further failure until you replace the failed node with a new one. When a replacement node joins the cluster, it needs to write the entire etcd keyspace to its disk to achieve data consistency with the rest of the cluster. A fast disk reduces the time it takes to restore your cluster’s HA status, limiting your risk.

See our guide to monitoring etcd for a look at key metrics you can track to ensure the performance of the disks in your etcd cluster.

You can minimize latency by configuring your etcd nodes to store their data on a local disk rather than a remote one. For example, if the nodes in your etcd cluster are Amazon EC2 instances, you can get better performance by storing data on their instance store volumes than on Amazon EBS volumes.

But using EBS to store etcd data can be safer, since those volumes persist independently of the EC2 instance lifecycle, whereas instance stores provide only temporary storage and are tied to a specific EC2 instance. You can manage EBS latency by using EBS-optimized instances, which provide dedicated bandwidth between the instance and the volume. You can also use provisioned IOPS to guarantee a minimum performance level for your EBS volumes.
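As an illustrative sketch, you could use the AWS CLI to create a provisioned IOPS volume and attach it to the instance hosting an etcd member; the volume type, size, IOPS value, availability zone, device name, and IDs below are all placeholders.

```
# Create a provisioned IOPS (io2) volume sized for the etcd data directory
aws ec2 create-volume --volume-type io2 --iops 4000 --size 100 \
    --availability-zone us-east-1a

# Attach the volume to the EC2 instance that hosts the etcd member
aws ec2 attach-volume --volume-id <volume-id> --instance-id <instance-id> --device /dev/sdf
```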

Provide sufficient network throughput

A healthy, performant network is critical to ensuring your etcd cluster’s reliability and fast failure recovery. The Kubernetes API server frequently sends etcd both read requests and write requests, the latter of which incur the overhead of Raft’s data consistency activity. Sufficient network throughput can help ensure that etcd can quickly execute routine operations like these. A fast network is also critical for occasional maintenance operations, such as when you need to restart kube-apiserver to revise its logging level, rotate keys, or update certificates. When kube-apiserver restarts, it needs to fetch a large volume of etcd data to populate its in-memory cache before it can process any updates to Kubernetes’ state. Maintenance activities like these require a properly resourced and configured network to avoid degrading the performance of a growing Kubernetes cluster.

See our guide to monitoring etcd for a look at key metrics you can track to ensure that your network is fast and reliable.

To minimize network latency, you can deploy nodes in close proximity—for example, in the same data center. You can also use network-optimized nodes to host your etcd members, such as enhanced networking EC2 instances or Azure Accelerated Networking.

Allocate enough memory and monitor its utilization

Because a relatively small etcd data store can support a large and busy Kubernetes cluster, etcd nodes don’t require a large amount of memory. Etcd stores its entire data store in memory—including an index of keys—so the amount of memory used increases as the size of the data store grows. Etcd documentation recommends up to 64 GB of memory in each node, depending on the size of the data store.

Memory requirements are also influenced by the size and activity level of the Kubernetes cluster. Etcd uses memory for caching query results and for serving watches, which allow etcd clients—such as the Kubernetes API server—to track changes in the state of the cluster. Memory usage increases as the number of watch clients increases, such as when controllers are added to the Kubernetes cluster.

Manage the size of your etcd data store

The amount of data stored by etcd increases as your Kubernetes cluster grows to comprise more objects and as the history of each object incorporates a record of each change. But etcd enforces a limit on how large its data store can grow. The storage limit is configurable—its default is 2 GiB, and etcd recommends storing no more than 8 GiB. If the data store grows beyond the configured limit, etcd becomes read-only, which prevents Kubernetes from adding or updating objects. To ensure the health and performance of your growing Kubernetes cluster, you must manage the size of your etcd data store to avoid breaching the storage limit.

Etcd provides some automated maintenance operations that prevent the data store from growing faster than necessary, and there are also manual steps and configuration options you can use to manage etcd’s growth. In this section, we’ll describe etcd’s compaction and defragmentation operations, explain how you can remove old event data and optimize your pod specs to use less space, and look at increasing your etcd storage quota.

Compact etcd to prune old data

Recall that etcd uses MVCC to store each revision of each object, so when it writes a new version—for example, when you apply a new label to a pod—etcd retains the data that represents the previous revision of the object. Recent revisions can be useful for cluster administration or troubleshooting, but older revisions can become unnecessary as they age, adding to the size of the data store without providing any value.

Etcd’s compaction operation removes old revisions, based either on an object’s age or on the number of newer revisions in the data store. The kube-apiserver automatically prompts etcd to perform compaction every five minutes by default, and you can also run it manually. Compaction reclaims the storage space that had been occupied by the outdated revisions, making it available for etcd to use for new data. Compaction is a preventive measure, without which etcd would experience degraded performance and eventually run out of space.
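The automatic interval is controlled by kube-apiserver’s --etcd-compaction-interval flag (five minutes by default). If you need to compact manually, the sketch below, adapted from etcd’s maintenance guidance, reads the current revision from the endpoint status and compacts everything older than it; --endpoints and TLS flags are omitted for brevity.

```
# Look up the current revision of the keyspace
rev=$(etcdctl endpoint status --write-out=json | egrep -o '"revision":[0-9]*' | egrep -o '[0-9].*')

# Discard the history of all revisions older than the current one
etcdctl compaction "$rev"
```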

Compaction helps check etcd’s growth by removing old revisions, but you should maintain sufficient history to troubleshoot your Kubernetes cluster or perform routine administration tasks such as rolling back a problematic deployment to a previous version.

You can adjust your deployments’ revisionHistoryLimit field to influence how much historical data is kept in etcd. By default, Kubernetes stores 10 revisions of each deployment, but if you can safely reduce the amount of deployment history, you can optimize the size of your etcd data store by reducing this number.
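For example, you could lower the limit on an existing deployment with kubectl patch; the deployment name and the value of 3 below are illustrative.

```
# Keep only the three most recent ReplicaSet revisions for this deployment
kubectl patch deployment my-app -p '{"spec":{"revisionHistoryLimit":3}}'
```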

Compaction frees up space, but it doesn’t shrink the size of the database: etcd can use the reclaimed space to store new data, but that space isn’t yet available to the host’s operating system (OS). Next, we’ll look at defragmentation, which lets you shrink the data store itself.

Defragment to reclaim disk space

Defragmentation reclaims the freed space and returns it for use by the OS, effectively making the data store smaller. Defragmentation complements the compaction operation to let you manage the growth of your etcd data store.

Etcd doesn’t automatically execute defragmentation because the process can significantly impact performance, disrupting Kubernetes activity and causing user-facing performance and reliability issues. You must manually trigger defragmentation on each node in the etcd cluster, and etcd recommends defragmenting only one node at a time. Because defragmentation can cause 5xx errors, you should first remove the node from the cluster to minimize the risk of detectable performance degradation. To further mitigate the performance impact of defragmentation, you can speed up the process by limiting the size of your etcd data store.
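Here’s a hedged sketch of checking database sizes and then defragmenting members one at a time with etcdctl; the endpoint URLs are placeholders, and TLS flags are omitted.

```
# Review each member's on-disk database size before deciding to defragment
etcdctl endpoint status --cluster --write-out=table

# Defragment one member at a time, waiting for each to finish before starting the next
etcdctl defrag --endpoints=https://etcd-0:2379
etcdctl defrag --endpoints=https://etcd-1:2379
etcdctl defrag --endpoints=https://etcd-2:2379
```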

Defragmentation is a resource-intensive operation, so you should run it only as often as your cluster requires, based on the growth of your data store. Because this is a manual operation, you should monitor your etcd storage, watching the total size of the database to determine when defragmentation is needed. Once you’ve defragmented, you should see the size of your data store decrease sharply, as illustrated in the screenshot below.

A screenshot of a Datadog graph showing a drop in the value of the etcd.mvcc.db.total.size.in_use.bytes metric.

You can alert on the data store’s growth to ensure that you have enough time to defragment all of your etcd nodes before they breach their storage quota. In the screenshot below, an alert is defined that will automatically notify the k8s-infra team if the size of the etcd data store on any node grows above 2 GiB.

A screenshot of Datadog's new monitor page, defining an alert that triggers if etcd's data store grows above 2 GiB.

See the etcd blog for further guidance on using compaction and defragmentation to manage the size of your etcd data store.

Clear event objects from etcd

Kubernetes events are records of your cluster’s activity, chronicling every change that takes place there. Each time an object in the cluster changes state—for example, when a pod is added or removed as part of an autoscaling activity—etcd records the event. In a busy cluster, event objects often far outnumber other types of objects stored in etcd, and as a result they account for a substantial portion of etcd’s stored data. Many events only document normal Kubernetes behavior and may not offer any value in cluster administration or troubleshooting. Some types of events can be initially useful for troubleshooting your cluster’s performance—such as investigating a pod that can’t be created or a volume that can’t be attached—but the value of those events declines as they age.

Kubernetes removes events automatically upon expiration of their time-to-live (TTL), which is one hour by default. You can decrease the TTL—prompting Kubernetes to delete events sooner—by passing your preferred value in the --event-ttl option when you start kube-apiserver. But waiting for events to age out may not be sufficient to manage etcd’s growth in a highly dynamic Kubernetes cluster. Instead, you can manually remove event data from etcd at any time to regain that storage space. For example, once you’ve identified events—or any other type of Kubernetes object—that you no longer need, you can use kubectl delete to remove those records from etcd. After clearing unnecessary event objects, you’ll need to run etcd’s compaction and defragmentation processes to reclaim the space they had occupied.
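As a sketch, you might shorten the event TTL on the API server and clear events you’ve confirmed you no longer need; the 30-minute TTL and the staging namespace are illustrative, and the kube-apiserver flag is typically set in the control plane’s static pod manifest rather than on an interactive command line.

```
# On kube-apiserver, shorten the event TTL from the default of one hour (set in the
# control plane's static pod manifest or however you manage API server flags):
#     --event-ttl=30m

# Manually delete all events in a namespace once you've confirmed they're no longer needed
kubectl delete events --all --namespace=staging
```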

Manage the size of your pod specs

While you can efficiently create and maintain the pods in your Kubernetes cluster using lightweight YAML files, it’s possible for pod specifications to become very large. For example, if your pod specs contain embedded libraries or large amounts of configuration data, they can cause bloat in your etcd data store, especially if they’re updated frequently. Remember that because of etcd’s MVCC capabilities, each time a pod is updated—for example, when you add a label—etcd creates a new revision of its pod spec, but the previous version is retained until it’s removed by compaction.

You can better manage your etcd storage space by optimizing your pod specifications, such as by using a ConfigMap or a mounted volume to store configuration data. These approaches keep that data outside of your pod specs, so when Kubernetes creates a new pod revision, etcd doesn’t also need to store a new copy of the associated data.
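As a minimal sketch, you could move a bulky configuration file out of the pod spec and into a ConfigMap that the pod mounts as a volume; the ConfigMap name and file name are placeholders.

```
# Store the configuration in a ConfigMap instead of embedding it in the pod spec
kubectl create configmap app-config --from-file=app-settings.json

# In the pod spec, reference the ConfigMap as a volume (spec.volumes[].configMap.name: app-config)
# so that label changes and other pod updates don't duplicate the configuration data in etcd
```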

Provision more than 8 GiB for etcd

Although the documentation suggests 8 GiB as the maximum size of the data store, etcd can support a storage quota larger than that, and you can configure your cluster with more storage if necessary. To ensure capacity for a larger data store, you may need to increase the amount of memory and disk space on your etcd hosts. And with a larger data store, compaction and defragmentation operations may take longer, so ensuring that your etcd hosts have sufficient resources becomes even more important.
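The quota is set per member with etcd’s --quota-backend-bytes flag (or the equivalent ETCD_QUOTA_BACKEND_BYTES environment variable); the 16 GiB value below is only an example, and every member of the cluster should use the same quota.

```
# Start each etcd member with a 16 GiB storage quota (value in bytes); other flags omitted
etcd --quota-backend-bytes=17179869184 ...
```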

Deploy multiple etcd clusters

Recall that a substantial portion of etcd’s storage is used to hold event objects. Cluster activity that causes a spike in events—such as rolling back a large deployment, which generates event records for each affected pod—can quickly exhaust the available space in your data store and render your Kubernetes cluster inoperable. To mitigate the risk that an influx of events will affect your cluster’s performance, you can configure etcd to store events in a cluster separate from the one that holds other object types. (This assumes an external etcd architecture and isn’t an option in the case of a stacked etcd cluster.) By isolating event objects onto dedicated infrastructure, you can ensure that the rest of the control plane won’t be affected by a spike in events.

Once you’ve created your event-specific etcd cluster, you can restart kube-apiserver using the --etcd-servers-overrides parameter to configure Kubernetes to send event data to that cluster, as described in the etcd blog. Make sure to create appropriate monitors on the new cluster as well as the original one so you can proactively manage storage space in both.
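As a hedged example, the flags below route only event objects to a dedicated etcd cluster while everything else continues to use the main cluster; the endpoint URLs are placeholders, and within an override the server URLs are separated by semicolons.

```
# Point the API server's event storage at the dedicated cluster (other flags omitted)
kube-apiserver \
  --etcd-servers=https://etcd-main-0:2379,https://etcd-main-1:2379,https://etcd-main-2:2379 \
  --etcd-servers-overrides='/events#https://etcd-events-0:2379;https://etcd-events-1:2379;https://etcd-events-2:2379' \
  ...
```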

Proactively maintain etcd to operate healthy Kubernetes clusters

As your Kubernetes cluster grows in size and complexity, it’s increasingly important to effectively manage your etcd data store. By leveraging etcd’s built-in management functions, allocating sufficient cluster resources, and actively monitoring and maintaining your etcd data, you can continue to grow your Kubernetes cluster without breaching etcd’s data storage limits.

Datadog’s etcd integration collects key metrics from your cluster and allows you to visualize etcd performance. To quickly start monitoring etcd, see the out-of-the-box dashboards for etcd version 2 or version 3, and create monitors to notify you of changes in key metrics that could indicate cluster health issues.

See our documentation for more information about monitoring your etcd and Kubernetes clusters. For a deep dive into key etcd metrics to watch and alert on, see our etcd monitoring guide. If you’re not already using Datadog, you can get started with a free trial.