The use of version control systems, continuous integration (CI), container services, and other tools in software development has enabled developers to ship code more quickly and efficiently. However, as organizations expand their build and packaging ecosystems, they also increase the number of entry points for malicious code injections that can ultimately make their way to production environments. CI/CD pipelines have privileged permissions and access to downstream container registries, making them a valuable target for attackers. To make matters worse, many organizations continue to rely on risky long-lived credentials for their CI/CD systems.
To increase resiliency across the software supply chain, organizations—including Google, as well as us at Datadog—have implemented various systems to mitigate the risk of compromise of their cloud registries, network perimeter, and Kubernetes control plane. One of these security solutions is to establish cryptographic provenance for container images through signing and runtime verification. By signing container images in CI and verifying them at various points downstream, you’re able to verify that the image has not been tampered with and is identical to the one that was originally built in CI.
In this post, we’ll discuss the following:
- how image signing helps protect your images from supply chain attacks
- how signing and verification work under the hood
- the benefits of implementing image signing as a service
- when to verify your image signatures
- whether signature verification is right for your organization
Are your container images secure?
Organizations that rely on container services to manage and deploy their applications often operate on hundreds of self-hosted Kubernetes clusters and thousands of nodes. This scale of infrastructure is supported by several services such as a version control system, CI provider, container registries, and more—all of which increase the overall supply chain surface area that needs to be secured.
While organizations are building hundreds of thousands of container images each day, attackers are constantly attempting to exploit gaps in container security to inject malicious payloads. Each stage in the pipeline—from the source code itself to the building, packaging, and deploying of container images—presents a potential gateway that attackers can use to compromise your image with code that may eventually reach production environments, even if they don’t have direct production access.
Compromised container registries can enable attackers to insert malicious images to be deployed. If attackers gain access to internal deployment platforms, they can run arbitrary Helm charts in your organization’s Kubernetes clusters, and direct access to a cluster can enable them to create malicious workloads. This cryptojacking exploit covered by Datadog security researchers demonstrates how threat actors accessed exposed Docker API endpoints to create a container that initialized scripts responsible for lateral movement between nodes within a cluster. The infected cluster would then retrieve an XMRig cryptocurrency miner setup script and dedicate its resources to crypto mining.
How image signing and verification works
One method to protect your container images as they move through the software supply chain is to establish a guarantee of provenance, which is done via image signing and verification. This involves generating a unique signature for each container image as it is built in CI, using a public-key signing algorithm, and then verifying these signatures downstream to ensure that the image is bit-for-bit the same as when it was built. To maximize the span of the guarantee of provenance, images should be signed as close to build time as possible, and verification should be done as close to runtime as possible. Using the cryptojacking incident as an example, if image signing and verification had been implemented, the initial container used to laterally infect the cluster would never have been deployed. Because the image was injected into a Docker container registry by the attacker, it would have been unsigned and caught prior to runtime.
The image signature verification process begins when your CI provider builds your application image and pushes it to your OCI registry, where it is stored. The image then needs to be signed: the CI job sends a signing request to a signing client, which signs the image using a private key stored in a key vault (such as HashiCorp Vault’s Transit secrets engine). The signing client then pushes the signature to the OCI registry, where it is stored with a reference to its corresponding image.
When downstream image consumers, such as Kubernetes or containerd, attempt to deploy a container using an image, they first pull the image’s signature and send a request to verify it through a verification client. The client retrieves a public key from the key vault and uses it to verify the signature; if the verification succeeds, the image consumer continues its deployment. The use of private and public key pairs establishes a root of trust between image producers and image consumers, ensuring that the application image is not tampered with between the time of signing and the time of verification.
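To make the mechanics concrete, here’s a minimal sketch of detached signing and verification over an image digest using Ed25519 in Go. It’s illustrative only: in practice the private key lives in a key vault and is never exposed to CI jobs, the digest comes from the image manifest in the registry, and you’d use an established signing tool or service rather than hand-rolled crypto.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

func main() {
	// In practice the key pair is generated and held in a key vault;
	// generating it inline keeps this example self-contained.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// The signing client signs the image digest rather than the image
	// bytes; the digest uniquely identifies the image content.
	imageManifest := []byte("...image manifest bytes...")
	digest := sha256.Sum256(imageManifest)
	signature := ed25519.Sign(priv, digest[:])
	fmt.Println("detached signature:", hex.EncodeToString(signature))

	// Downstream, the verification client re-derives the digest of the
	// pulled image and checks the signature against the public key.
	if ed25519.Verify(pub, digest[:], signature) {
		fmt.Println("signature verified: image matches what was built in CI")
	} else {
		fmt.Println("verification failed: refuse to run the image")
	}
}
```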
Storing signatures using OCI-compliant registries
Before we discuss various solutions for signing and verification (as well as the method we use at Datadog), we’ll cover the open standard consensus for storing and distributing signatures at scale.
Image signatures represent an additional build and runtime component. Thus, adopting image signing at scale requires a storage and distribution solution that has low overhead and is easy to adopt into diverse build environments. Fortunately, the container registry, which follows the Open Container Initiative’s (OCI) distribution specification and is commonly used to store container images, can also store other artifacts, including signature metadata, as long as they conform to the OCI specifications. This framework of distributing artifacts across OCI-compliant registries is known as ORAS (OCI Registry As Storage) and has become the open standard for image signing and verification solutions today.
Storing signature metadata in the OCI registry provides several benefits. First, it does not introduce an additional runtime dependency (which can lead to performance overhead and increased complexity), as you would by, say, introducing a new database. For organizations that run a number of isolated data centers, it also presents an opportune solution for signature replication. An image is built once and then needs to be transferred to the various data centers where it will eventually run. For each image that is built, a detached signature needs to be generated; this is a separate artifact that will be used to verify the integrity of the image during the verification process. Using the ORAS framework, both the detached signature and the image are stored within the OCI registry, and they are asynchronously distributed from the data center where they are built to the various remote data centers where image consumers can access them.
The slight downside to this method is that it demands the additional storage of a signature payload for each image stored in the registry. However, these are small artifacts in comparison to the image—each signature is a JSON document storing a 64-byte encoded signature and a 64-byte digest that references the corresponding image.
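As a rough illustration of how small these payloads are, the detached signature can be modeled as a short JSON document that pairs the encoded signature with the digest of the image it covers. The structure and field names below are hypothetical, not Datadog’s actual schema or any particular standard’s format.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SignaturePayload is a hypothetical detached-signature document stored
// alongside the image in the OCI registry. Real formats differ in detail,
// but the idea is the same: a signature plus a reference to the image
// digest it covers.
type SignaturePayload struct {
	// ImageDigest identifies the signed image, e.g. "sha256:abc123...".
	ImageDigest string `json:"imageDigest"`
	// Signature is the encoded detached signature over the digest.
	Signature string `json:"signature"`
	// KeyID optionally identifies which key in the vault produced it.
	KeyID string `json:"keyId,omitempty"`
}

func main() {
	payload := SignaturePayload{
		ImageDigest: "sha256:0123456789abcdef...", // truncated for readability
		Signature:   "MEUCIQDexample...",          // placeholder value
		KeyID:       "transit/keys/image-signing",
	}
	out, _ := json.MarshalIndent(payload, "", "  ")
	// This small document is what gets pushed to the OCI registry and
	// replicated to remote data centers along with the image.
	fmt.Println(string(out))
}
```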
Signing as a service
A key decision you’ll have to make when adopting image signature verification is how you plan on generating signatures from container image digests. At Datadog, we adopted a service-oriented approach using a gRPC service. This service is responsible for pushing signature metadata to the registry and handling the signing of artifacts.
When images are built, jobs in CI push an image to the OCI registry and send an authenticated remote procedure call (RPC) to the signing client. The client uses a secret management service, such as HashiCorp Vault, to generate the image signature. It then builds the signature payload (composed of the signature and a reference to the image) and pushes it to the OCI registry, which now stores both the image and its signature. The registry can then asynchronously distribute images and their signatures across isolated data centers.
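A minimal sketch of what the signing RPC might look like in Go follows. The service, request fields, and interfaces are hypothetical (the post doesn’t publish Datadog’s actual API); the shape simply mirrors the flow described above: receive an image digest, sign it with a key held in the secrets manager, and push the resulting payload to the registry.

```go
package signing

import (
	"context"
	"fmt"
)

// SignRequest and SignResponse are hypothetical message types; a real
// implementation would define them in a .proto file and generate gRPC
// bindings from it.
type SignRequest struct {
	ImageDigest string // e.g. "sha256:abc123..."
	CIJobID     string // identity of the calling CI job, used for auditing
}

type SignResponse struct {
	SignatureRef string // registry reference where the signature was pushed
}

// KeySigner abstracts the secrets manager (e.g. a vault transit backend)
// that holds the private key and signs digests on the service's behalf.
type KeySigner interface {
	Sign(ctx context.Context, digest string) ([]byte, error)
}

// RegistryClient abstracts pushing the detached signature to the OCI registry.
type RegistryClient interface {
	PushSignature(ctx context.Context, imageDigest string, sig []byte) (string, error)
}

// Service is the signing service invoked by CI jobs over an authenticated RPC.
type Service struct {
	signer   KeySigner
	registry RegistryClient
}

// Sign handles a single signing request: sign the digest, then store the
// detached signature next to the image in the registry.
func (s *Service) Sign(ctx context.Context, req *SignRequest) (*SignResponse, error) {
	sig, err := s.signer.Sign(ctx, req.ImageDigest)
	if err != nil {
		return nil, fmt.Errorf("signing %s for job %s: %w", req.ImageDigest, req.CIJobID, err)
	}
	ref, err := s.registry.PushSignature(ctx, req.ImageDigest, sig)
	if err != nil {
		return nil, fmt.Errorf("pushing signature for %s: %w", req.ImageDigest, err)
	}
	return &SignResponse{SignatureRef: ref}, nil
}
```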
One possible alternative was to have CI jobs sign images directly using HashiCorp Vault; however, we chose to adopt the signing service for a few reasons. Image signing needed to be implemented globally across all of our CI jobs. Considering the number of branches spread across production and testing environments, updating each one to include signing would have been incredibly time-consuming. By abstracting signing into a dedicated client, we could simply have each CI job make a request to the client. This also made the system easy to update: after the initial signing service was released, we were able to ship new performance and reliability features at relatively low cost because signing was isolated to the client rather than contained within each CI job.
By centralizing permissions in our signing service, we were able to apply least-privilege principles to the signing process and limit the necessary security controls to our signing API. Rather than giving every CI job access to the private key, we were able to restrict access to a single source (our signing client). Full control over the signing service also enabled us to produce audit logs that are more meaningful and granular than HashiCorp Vault’s default audit logs. We were able to configure our logs to include request context, such as identifiers for the CI job requesting the signature and additional parsed image fields, as shown in the example below.
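The exact fields Datadog logs aren’t reproduced here; the sketch below just illustrates the shape of such an entry, emitted with Go’s log/slog for brevity. All field names and values are hypothetical.

```go
package main

import (
	"log/slog"
	"os"
)

func main() {
	// A JSON handler keeps each audit entry on a single structured line.
	logger := slog.New(slog.NewJSONHandler(os.Stdout, nil))

	// Hypothetical request context attached to a signing request; a real
	// signing service would emit these with its own schema.
	logger.Info("image signing request",
		"ci_job_id", "1234567",
		"ci_pipeline", "build/app-image",
		"image_registry", "registry.example.internal",
		"image_repository", "team-a/app",
		"image_tag", "v1.42.0",
		"image_digest", "sha256:0123456789abcdef...",
		"signing_key", "transit/keys/image-signing",
		"result", "signed",
	)
}
```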
When to verify your signature
After you’ve implemented image signing, you’ll need to decide when to verify your image signature. The earlier you verify an image, the quicker you’ll be able to receive developer feedback in case an image fails to verify. But the closer you verify an image to runtime, the longer you’re able to extend the image’s guarantee of integrity. Ideally, you’ll want to verify the image at several different points in order to capitalize on both of these benefits.
One method that is typically recommended in open source projects is to verify signatures in the Kubernetes control plane using admission webhooks. Admission webhooks enable you to run custom code that determines whether the control plane accepts or rejects requests to create, change, or delete Kubernetes resources. In the case of verification, the Kubernetes API client makes a request to the Kubernetes API server to create a pod. The server then runs an admission webhook that pulls the image signature from the registry, verifies it using a public key, and returns a response telling the server whether to create the pod.
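For illustration, a validating admission webhook for this flow might look like the sketch below, built on the standard Kubernetes admission types. The verifyImageSignature helper is a placeholder for the actual registry lookup and public-key check, and the handler is deliberately minimal.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"

	admissionv1 "k8s.io/api/admission/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// verifyImageSignature is a placeholder for pulling the detached signature
// from the OCI registry and verifying it against the trusted public key.
func verifyImageSignature(image string) error {
	// ...fetch the signature from the registry and verify it...
	return nil
}

// handleAdmission rejects pod creation when any container image fails
// signature verification.
func handleAdmission(w http.ResponseWriter, r *http.Request) {
	var review admissionv1.AdmissionReview
	if err := json.NewDecoder(r.Body).Decode(&review); err != nil || review.Request == nil {
		http.Error(w, "malformed admission review", http.StatusBadRequest)
		return
	}

	response := &admissionv1.AdmissionResponse{UID: review.Request.UID, Allowed: true}
	var pod corev1.Pod
	if err := json.Unmarshal(review.Request.Object.Raw, &pod); err == nil {
		for _, c := range pod.Spec.Containers {
			if err := verifyImageSignature(c.Image); err != nil {
				response.Allowed = false
				response.Result = &metav1.Status{
					Message: fmt.Sprintf("image %s failed signature verification: %v", c.Image, err),
				}
				break
			}
		}
	}

	review.Response = response
	json.NewEncoder(w).Encode(review)
}

func main() {
	http.HandleFunc("/validate", handleAdmission)
	// A real admission webhook must be served over TLS with a certificate
	// the API server trusts; plain HTTP here keeps the sketch short.
	if err := http.ListenAndServe(":8443", nil); err != nil {
		panic(err)
	}
}
```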
However, this method requires the API server to wait on the admission webhook, which needs to communicate with the OCI registry. As a result, slow response times may build backpressure on the API server if the latency of the request to the registry extends beyond your latency budget. Additionally, this method introduces a new cluster-level dependency directly within the hot path of the control plane. This brings new reliability concerns, as previously Kubernetes only pulled images at the node level.
The solution we settled on at Datadog was to verify images further downstream within the container runtime—containerd in our case—which carries over many of the benefits of verifying at the control plane without the same latency and reliability concerns. containerd is located one level below the kubelet, and it receives commands to create or start a container via the container runtime interface (CRI). To verify images within containerd, we implemented a plugin for image verification that runs custom code to verify image signatures. If the plugin returns that the image is verified, containerd continues the image pull. If the plugin returns that the image is unverified, containerd returns an image pull error upstream to the kubelet. This allows us to take advantage of the kubelet’s pull retry loop and cache image verifications that occur at the node level, while bypassing the latency concerns we had with the admission webhook strategy.
Image pulls conducted by the runtime are already expected to be slow, so by performing verification during the pull process, we’re able to implement signature verification with little to no increase in the pull’s overall duration. However, since this implementation requires granular, node-level access, it’s only feasible for organizations that self-host their Kubernetes clusters.
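Containerd’s actual plugin interface isn’t reproduced here; the hypothetical hook below simply illustrates the decision flow described above: verify the image’s signature during the pull, and return a negative verdict that the runtime turns into an image pull error for the kubelet.

```go
package verifier

import (
	"context"
)

// Judgement is a hypothetical verdict type returned by the verification
// plugin; the real containerd image-verifier interface differs in detail.
type Judgement struct {
	OK     bool
	Reason string
}

// SignatureChecker pulls the detached signature for an image from the OCI
// registry and verifies it against the trusted public key.
type SignatureChecker interface {
	Verify(ctx context.Context, imageRef, digest string) error
}

// Plugin is the node-level verification hook invoked during image pulls.
type Plugin struct {
	checker SignatureChecker
}

// VerifyImage is called before an image pull is allowed to complete. A
// negative judgement causes the runtime to abort the pull and return an
// image pull error upstream to the kubelet, which then retries through
// its normal pull retry loop.
func (p *Plugin) VerifyImage(ctx context.Context, imageRef, digest string) (*Judgement, error) {
	if err := p.checker.Verify(ctx, imageRef, digest); err != nil {
		return &Judgement{OK: false, Reason: err.Error()}, nil
	}
	return &Judgement{OK: true, Reason: "signature verified"}, nil
}
```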
Things to consider before adopting image signing and verification
We’ve discussed the benefits of signing and verifying images; however, there are a few things to consider before you decide to adopt them.
Is adopting image signing worth it for my organization?
Despite the benefits we previously discussed, implementing image signing as a service may not be the best fit for every organization. If you have low exposure to deploying containerized services and only build images in a few places, implementing image signing likely won’t be worth the effort and overhead. Additionally, you need to consider the complexity of your current deployment pipeline and whether you can guarantee image integrity without signature verification. For instance, if your CI pushes images to the same registry that your Kubernetes cluster pulls images from, you can tighten image security with simple access control rules.
Should we create our own signature format or adopt an open standard?
Adopting cryptographic open standards will almost always benefit the interoperability of your systems, especially if you rely on communication between services orchestrated with open source tooling. However, for large organizations, it may be difficult to work open standards into an existing web of internal services and dependencies, especially if there is no reigning standard. At the time Datadog developed our solution for image signing and verification, we determined that the open-standard solutions did not meet our requirements. That said, open standards are continuing to evolve and become increasingly compatible with diverse build environments, and it’s likely that signing tools such as cosign by Sigstore, Notary V2, and other services would be able to fulfill the needs of most organizations.
How difficult will it be to integrate signatures into our existing CI configurations?
While implementing signature verification only requires a few additional steps, it can still be a lot of work to implement across your CI, especially if your organization relies on a diverse set of CI configurations. For this reason, organizations that rely on monorepos or consistent build tooling such as Bazel may have an easier time globally onboarding image signing to their CI. Adopting an image signing service as outlined above may also help reduce onboarding and maintenance costs.
Your organization’s container runtime may also natively support signing and verification; examples include containerd and CRI-O. And if your organization uses cloud provider services such as Amazon EKS to deploy containers, the provider may offer managed services such as AWS Signer that make it easier to onboard container image verification into your existing environment.
Guarantee your image integrity today
Image signing and runtime verification can greatly mitigate the risk of compromise within your software supply chain and improve security for your container services. You can learn more about how we implemented our own image verification solution at Datadog in this talk, where we discuss in depth the specific methods we used to sign and encode image payloads, as well as our custom solution for verifying signatures at the node level.
If you don’t already have a Datadog account, sign up for a free 14-day trial today.