Linux has evolved sandboxing and isolation techniques beyond simple virtual machines (VMs) that harden it against current and future vulnerabilities. Sometimes these sandboxes are called micro VMs.
These sandboxes combine parts of all previous container and VM approaches. You would use them to protect sensitive workloads and data, as they focus on rapid deployment and high performance on shared infrastructure.
In this chapter we’ll discuss different types of micro VMs that use virtual machines and containers together to protect your running Linux kernel and userspace. The generic term sandboxing is used to cover the entire spectrum: each tool in this chapter combines software and hardware virtualization technologies and uses Linux’s Kernel Virtual Machine (KVM), which is widely used to power VMs in public cloud services, including Amazon Web Services and Google Cloud.
You run a lot of workloads at BCTL, and you should remember that while these techniques may also protect against Kubernetes mistakes, all of your web-facing software and infrastructure is a more obvious place to defend first. Zero-days and container breakouts are rare in comparison to simple security-sensitive misconfigurations.
Hardened runtimes are newer, and have fewer, and generally less dangerous, CVEs than the kernel or more established container runtimes, so we’ll focus less on historical breakouts and more on the history of micro VM design and rationale.
kubeadm installs Kubernetes with runc as its container runtime, using CRI-O or containerd to manage it. The old dockershim way of running runc was deprecated in Kubernetes v1.20 (and removed in v1.24), so although Kubernetes doesn’t use Docker any more, the runc container runtime that Docker is built on continues to run containers for us. Figure 3-1 shows three ways Kubernetes can consume the runc container runtime: CRI-O, containerd, and Docker.
We’ll get into container runtimes in a lot of detail later on in this chapter.
You have two main reasons for isolating a workload or pod—it may have access to sensitive information and data, or it may be untrusted and potentially hostile to other users of the system:
A sensitive workload is one whose data or code is too important to permit unauthorized access to. This may include fraud detection systems, pricing engines, high-frequency trading algorithms, personally identifiable information (PII), financial records, passwords that may be reused in other systems, machine learning models, or an organization’s “secret sauce.” Sensitive workloads are precious.
Untrusted workloads are those that may be dangerous to run. They may allow high-risk user input or run external software.
Examples of potentially untrusted workloads include parsers and processors of user-supplied files, such as BCTL’s upload-handling pods described later in this chapter. Untrusted workloads may also include software with published or suspected zero-day Common Vulnerabilities and Exposures (CVEs)—if no patch is available and the workload is business-critical, isolating it further may decrease the potential impact of the vulnerability if exploited.
The threat to a host running untrusted workloads is the workload, or process, itself. By sandboxing a process and removing the system APIs available to it, the attack surface presented by the host to the process is decreased. Even if that process is compromised, the risk to the host is less.
BCTL allows users to upload files to import data and shipping manifests, so you have a risk that threat actors will try to upload badly formatted or malicious files to try to force exploitable software errors. The pods that run the batch transformation and processing workloads are a good candidate for sandboxing, as they are processing untrusted inputs as shown in Figure 3-2.
Any data supplied to an application by users can be considered untrusted; however, most input will be sanitized in some way (for example, validating against an integer or string type). Complex files like PDFs or videos cannot be sanitized in this way, and rely upon the encoding libraries to be secure, which they sometimes are not. Bugs of this type are often “escapable,” like CVE-X or ImageTragick.
Your threat model may include:
An untrusted user input triggers a bug in a workload that an attacker uses to execute malicious code
A sensitive application is compromised and the attacker tries to exfiltrate data
A malicious user on a compromised node attempts to read memory of other processes on the host
New sandboxing code is less well tested, and may contain exploitable bugs
A container image build pulls malicious dependencies and code from unauthenticated external sources that may contain malware
Existing container runtimes come with some hardening by default, and Docker uses default seccomp and AppArmor profiles that drop a large number of unused system calls. These are not enabled by default in Kubernetes and must be enforced with admission control or PodSecurityPolicy. The SeccompDefault=true kubelet feature gate in v1.22 restores this container runtime default behavior.
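For an individual workload you can opt back into the runtime’s default profile explicitly via the pod’s securityContext; a minimal sketch (the pod name and image are illustrative; securityContext.seccompProfile is stable since Kubernetes v1.19):

```
# Opt one pod into the container runtime's default seccomp profile.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: seccomp-default-demo      # illustrative name
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault        # use the runtime's default profile
  containers:
  - name: app
    image: nginx                  # illustrative image
EOF
```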
Now that we have an idea of the dangers to your systems, let’s take a step back. We’ll look at virtualization: what it is, why we use containers, and how to combine the best bits of containers and VMs.
A major difference between a container and a VM is that containers exist on a shared host kernel. VMs boot a kernel every time they start, use hardware-assisted virtualization, and have a more secure but traditionally slower runtime.
A common perception is that containers are optimized for speed and portability, and virtual machines sacrifice these features for more robust isolation from malicious behavior and higher fault tolerance.
This perception is not entirely true. Both technologies share a lot of common code pathways in the kernel itself. Containers and virtual machines have evolved like co-orbiting stars, never fully able to escape each other’s gravity. Container runtimes are a form of kernel virtualization. The OCI (Open Container Initiative) container image specifications have become the standardized atomic unit of container deployment.
Next-generation sandboxes combine container and virtualization techniques (see Figure 3-3) to reduce workloads’ access to the kernel. They do this by emulating kernel functionality in userspace or the isolated guest environment, thus reducing the host’s attack surface to the process inside the sandbox. Well-defined interfaces can help to reduce complexity, minimizing the opportunity for untested code paths. And, by integrating the sandboxes with containerd, they are also able to interact with OCI images and with a software proxy (“shim”) to connect two different interfaces, which can be used with orchestrators like Kubernetes.
These sandboxing techniques are especially relevant to public cloud providers, for which multitenancy and bin packing are highly lucrative. Aggressively multitenanted systems such as Google Cloud Functions and AWS Lambda are running “untrusted code as a service,” and this isolation software is born from cloud vendor security requirements to isolate serverless runtimes from other tenants. Multitenancy will be discussed in depth in the next chapter.
Cloud providers use virtual machines as the atomic unit of compute, but they may also wrap the root virtual machine process in container-like technologies. Customers then use the virtual machine to run containers—virtualized inception.
Traditional virtualization emulates a physical hardware architecture in software. Micro VMs emulate as small an API as possible, removing features like I/O devices and even system calls to ensure least privilege. However, they are still running the same Linux kernel code to perform low-level program operations such as memory mapping and opening sockets—just with additional security abstractions to create a secure by default runtime. So even though VMs are not sharing as much of the kernel as containers do, some system calls must still be executed by the host kernel.
Software abstractions require CPU time to execute, and so virtualization must always be a balance of security and performance. It is possible to add enough layers of abstraction and indirection that a process is considered “highly secure,” but it is unlikely that this ultimate security will result in a viable user experience. Unikernels go in the other direction, tracing a program’s execution and then removing almost all kernel functionality except what the program has used. Observability and debuggability are perhaps the reasons that unikernels have not seen widespread adoption.
To understand the trade-offs and compromises inherent in each approach, it is important to grok a comparison of virtualization types. Virtualization has existed for a long time and has many variations.
Although virtual machines and associated technologies have existed since the late 1950s, a lack of hardware support in the 1990s led to their temporary demise. During this time “process virtual machines” became more popular, especially the Java virtual machine (JVM). In this chapter we are exclusively referring to system virtual machines: a form of virtualization not tied to a specific programming language. Examples include KVM/QEMU, VMware, Xen, VirtualBox, etc.
Virtual machine research began in the 1960s to facilitate sharing large, expensive physical machines between multiple users and processes (see Figure 3-4). To share a physical host safely, some level of isolation must be enforced between tenants—and in case of hostile tenants, there should be much less access to the underlying system.
This is performed in hardware (the CPU), software (in the kernel, and userspace), or from cooperation between both layers, and allows many users to share the same large physical hardware. This innovation became the driving technology behind public cloud adoption: safe sharing and isolation for processes, memory, and the resources they require from the physical host machine.
The host machine is split into smaller isolated compute units, traditionally referred to as guests (see Figure 3-5). These guests interact with a virtualized layer above the physical host’s CPU and devices. That layer intercepts system calls to handle them itself: either by proxying them to the host kernel, or handling the request itself—doing the kernel’s job where possible. Full virtualization (e.g., VMware) emulates hardware and boots a full kernel inside the guest. Operating-system–level virtualization (e.g., a container) emulates the host’s kernel (i.e., using namespaces, cgroups, capabilities, and seccomp) so it can start a containerized process directly on the host kernel. Processes in containers share many of the kernel pathways and security mechanisms that processes in VMs execute.
To boot a kernel, a guest operating system will require access to a subset of the host machine’s functionality, including BIOS routines, devices and peripherals (e.g., keyboard, graphical/console access, storage, and networking), an interrupt controller and an interval timer, a source of entropy (for random number seeds), and the memory address space that it will run in.
Inside each guest virtual machine is an environment in which processes (or workloads) can run. The virtual machine itself is owned by a privileged parent process that manages its setup and interaction with the host, known as a virtual machine monitor or VMM (as in Figure 3-6). This has also been known as a hypervisor, but the distinction is blurred with more recent approaches so the original term VMM is preferred.
Linux has a built-in virtual machine manager called KVM that allows a host kernel to run virtual machines. Along with QEMU, which emulates physical devices and provides memory management to the guest (and can run by itself if necessary), an operating system can run fully emulated inside the guest (as contrasted with the Xen hypervisor in Figure 3-7). This emulation narrows the interface between the VM and the host kernel and reduces the amount of kernel code the process inside the VM can reach directly. This provides a greater level of isolation from unknown kernel vulnerabilities.
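Before installing any of the sandboxes discussed later, you can verify that a host supports this style of virtualization; a quick check using standard Linux interfaces:

```
# Does the CPU advertise hardware virtualization extensions?
grep -cE 'vmx|svm' /proc/cpuinfo   # non-zero: Intel VT-x or AMD-V present

# Is the KVM module loaded, and is its device node exposed to userspace?
lsmod | grep kvm
ls -l /dev/kvm
```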
Despite many decades of effort, “in practice no virtual machine is completely equivalent to its real machine counterpart” (“The Ideal Versus the Real”). This is due to the complexities of emulating hardware, and hopefully decreases the chance that we’re living in a simulation.
Like all things we try to secure, virtualization must balance performance with security: decreasing the risk of running your workloads using the minimum possible number of extra checks at runtime. For containers, a shared host kernel is an avenue of potential container escape—the Linux kernel has a long heritage and monolithic codebase.
Linux is mainly written in the C language, which has classes of memory management and range checking vulnerabilities that have proven notoriously difficult to entirely eradicate. Many applications have experienced these exploitable bugs when subjected to fuzzers. This risk means we want to keep hostile code away from trusted interfaces in case they have zero-day vulnerabilities. This is a pretty serious defensive stance—it’s about reducing any window of opportunity for an attacker that has access to zero-day Linux vulnerabilities.
Google’s OSS-Fuzz was born from the swirling maelstrom around the Heartbleed OpenSSL bug, which may have been raging in the wild for up to two years. Critical, internet-bolstering projects like OpenSSL are poorly funded and much goodwill exists in the open source community, so finding these bugs before they are exploited is a vital step in securing critical software.
The sandboxing model defends against zero-days by abstractions. It moves processes away from the Linux system call interface to reduce the opportunities to exploit it, using an assortment of containers and capabilities, LSMs and kernel modules, hardware and software virtualization, and dedicated drivers. Most recent sandboxes use a type-safe language like Golang or Rust, which makes their memory management safer than software programmed in C (which requires manual and potentially error-prone memory management).
Let’s further define what we mean by containers by looking at how they interact with the host kernel, as shown in Figure 3-8.
Containers talk directly to the host kernel, but the layers of LSMs, capabilities, and namespaces ensure they do not have full host kernel access. Conversely, instead of sharing one kernel, VMs use a guest kernel (a dedicated kernel running in a hypervisor). This means if the VM’s guest kernel is compromised, more work is required to break out of the hypervisor and into the host.
Containers are created by a low-level container runtime, and as users we talk to the high-level container runtime that controls it.
The diagram in Figure 3-9 shows the high-level interfaces, with the container managers on the left. Then Kubernetes, Docker, and Podman interact with their respective libraries and runtimes. These perform useful container management features including pushing and pulling container images, managing storage and network interfaces, and interacting with the low-level container runtime.
In the middle column of Figure 3-9 are the container runtimes that your Kubernetes cluster interacts with, while in the right column are the low-level runtimes responsible for starting and managing the container.
That low-level container runtime is directly responsible for starting and managing containers, interfacing with the kernel to create the namespaces and configuration, and finally starting the process in the container. It is also responsible for handling your process inside the container, and getting its system calls to the host kernel at runtime.
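You can drive that low-level runtime by hand to see what the high-level runtimes do for you; a hedged sketch assuming runc is installed (the bundle path and container name are illustrative):

```
# Build an OCI bundle from an image's filesystem, then run it with runc.
mkdir -p bundle/rootfs
docker export "$(docker create alpine)" | tar -C bundle/rootfs -xf -
cd bundle
runc spec                      # generate a default config.json for the bundle
sudo runc run demo-container   # create namespaces/cgroups and start the process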
Linux was written with a core assumption: that the root user is always in the host namespace. This assumption held true while there were no other namespaces, but it changed with the introduction of user namespaces (the last major kernel namespace to be completed), which required extensive changes to code concerning the root user.
User namespaces allow you to map users inside a container to other users on the host, so ID 0 (root) inside the container can create files on a volume that from within the container look to be root-owned. But when you inspect the same volume from the host, they show up as owned by the user root was mapped to (e.g., user ID 1000, or 110000, as shown in Figure 3-10). User namespaces are not enabled in Kubernetes, although work is underway to support them.
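You can demonstrate this mapping from an unprivileged shell (a minimal sketch, assuming util-linux’s unshare and a host with unprivileged user namespaces enabled):

```
# Map the current unprivileged user to UID 0 in a new user namespace.
id -u                                    # e.g., 1000 on the host
unshare --user --map-root-user sh -c 'id -u; cat /proc/self/uid_map'
# 0                  <- "root" inside the namespace
# 0  1000  1         <- UID 0 inside maps to UID 1000 outside, range of 1
```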
Everything in Linux is a file, and files are owned by users. This makes user namespaces wide-reaching and complex, and they have been a source of privilege escalation bugs in previous versions of Linux:
The clone system-call implementation in the Linux kernel before 3.8.3 does not properly handle a combination of the CLONE_NEWUSER and CLONE_FS flags, which allows local users to gain privileges by calling chroot and leveraging the sharing of the / directory between a parent process and a child process.

The capabilities implementation in the Linux kernel before 3.14.8 does not properly consider that namespaces are inapplicable to inodes, which allows local users to bypass intended chmod restrictions by first creating a user namespace, as demonstrated by setting the setgid bit on a file with group ownership of root.

The overlayfs implementation in the Linux kernel package before 3.19.0-21.21 in Ubuntu versions until 15.04 did not properly check permissions for file creation in the upper filesystem directory, which allowed local users to obtain root access by leveraging a configuration in which overlayfs is permitted in an arbitrary mount namespace.

In the Linux kernel 4.15.x through 4.19.x before 4.19.2, map_write() in kernel/user_namespace.c allows privilege escalation because it mishandles nested user namespaces with more than 5 UID or GID ranges. A user who has CAP_SYS_ADMIN in an affected user namespace can bypass access controls on resources outside the namespace, as demonstrated by reading /etc/shadow. This occurs because an ID transformation takes place properly for the namespaced-to-kernel direction but not for the kernel-to-namespaced direction.
Containers are not inherently “insecure,” but as we saw in Chapter 2, they can leak some information about a host, and a root-owned container runtime is a potential exploitation path for a hostile process or container image.
Operations such as creating network adapters in the host network namespace, and mounting host disks, are historically root-only, which has made rootless containers harder to implement. Rootfull container runtimes were the only viable option for the first decade of popularized container use.
Exploits that have abused this rootfulness include CVE-2019-5736, replacing the runc binary from inside a container via /proc/self/exe, and CVE-2019-14271, attacking the host from inside a container responding to docker cp.
Underlying concerns about a root-owned daemon can be assuaged by running rootless containers in “unprivileged user namespaces” mode: creating containers using a nonroot user, within their own user namespace. This is supported in Docker 20.10 and Podman.
Rootless means the low-level container runtime process that creates the container is owned by an unprivileged user, and so container breakout via the process tree only escapes to a nonroot user, nullifying some potential attacks.
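You can see the effect under rootless Podman (illustrative; the username and subordinate range will differ on your host):

```
# "root" inside a rootless container is just the invoking user on the host.
id -un                           # e.g., dev
podman run --rm alpine id -un    # prints "root" inside the user namespace

# The subordinate UID range backing the mapping (values are illustrative):
grep "$(id -un)" /etc/subuid     # dev:100000:65536
```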
Rootless containers introduce a different, hopefully less dangerous, risk—user namespaces have historically been a rich source of vulnerabilities. The answer to whether it is riskier to run a root-owned daemon or user namespaces isn’t clear-cut, although any reduction of root privileges is likely to be the more effective security boundary. There have been more high-profile breakouts from root-owned Docker, but this may well be down to adoption and widespread use.
Rootless containers (without a root-owned daemon) provide a security boundary that rootful runtimes do not: when code owned by the host’s root user is compromised by a malicious process, it can potentially read and write other users’ files, attack the network and its traffic, or install malware on the host.
The mapping of user identifiers (UIDs) in the guest to actual users on the host depends on the user mappings of the host user namespace, container user namespace, and rootless runtime, as shown in Figure 3-11.
User namespaces allow nonroot users to pretend to be the host’s root user. The “root-in-userns” user can have a “fake” UID 0 and permission to create new namespaces (mount, net, uts, ipc), change the container’s hostname, and manage mount points.
This allows root-in-userns, which is unprivileged in the host namespace, to create new containers. To achieve this, additional work must be done: network connections into the host network namespace can only be created by the host’s root. For rootless containers, an unprivileged slirp4netns networking device (guarded by seccomp) is used to create a virtual network device.
Unfortunately, mounting remote filesystems becomes difficult when the remote system, e.g., NFS home directories, does not understand the host’s user namespaces.
In the rootless Podman guide, Dan Walsh says:
If you have a normal process creating files on an NFS share and not taking advantage of user-namespaced capabilities, everything works fine. The problem comes in when the root process inside the container needs to do something on the NFS share that requires special capability access. In that case, the remote kernel will not know about the capability and will most likely deny access.
While rootless Podman has SELinux support (and dynamic profile support via udica), rootless Docker does not yet support AppArmor and, for both runtimes, CRIU (Checkpoint/Restore In Userspace, a feature to freeze running applications) is disabled.
Both rootless runtimes require configuration for some networking features: CAP_NET_BIND_SERVICE is required by the kernel to bind to ports below 1024 (historically considered a privileged boundary), and ping is not supported for users with high UIDs if the ID is not in /proc/sys/net/ipv4/ping_group_range (although this can be changed by host root). Host networking is not permitted (as it breaks the network isolation), cgroups v2 are functional but only when running under systemd, and cgroups v1 is not supported by either rootless implementation. There are more details in the docs for shortcomings of rootless Podman.
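For example, the host’s root user can widen the group range permitted to open unprivileged ICMP sockets so that ping works in rootless containers:

```
# Allow all groups to use unprivileged ICMP echo sockets (host root required).
sudo sysctl -w net.ipv4.ping_group_range='0 2147483647'
```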
Docker and Podman share similar performance and features as both use runc, although Docker has an established networking model that doesn’t support host networking in rootless mode, whereas Podman reuses Kubernetes’ Container Network Interface (CNI) plug-ins for greater networking deployment flexibility.
Rootless containers decrease the risk of running your container images. Rootlessness prevents an exploit escalating to root via many host interactions (although some use of SETUID and SETGID binaries is often needed by software aiming to avoid running processes as root).
While rootless containers protect the host from the container, it may still be possible to read some data from the host, although an adversary will find this a lot less useful. Root capabilities are needed to interact with potential privilege escalation points including /proc, host devices, and the kernel interface, among others.
Throughout these layers of abstraction, system calls are still ultimately handled by software written in potentially unsafe C. Is the rootless runtime’s exposure to C-based system calls in the Linux kernel really that bad? Well, the C language powers the internet (and world?) and has done so for decades, but its lack of memory management leads to the same critical bugs occurring over and over again. When the kernel, OpenSSL, and other critical software are written in C, we just want to move everything as far away from trusted kernel space as possible.
Whitesource suggests that C has accounted for 47% of all reported vulnerabilities in the last 10 years. This may largely be due to its proliferation and longevity, but highlights the inherent risk.
While “trimmed-down” kernels exist (like unikernels and rump kernels), many traditional and legacy applications are portable onto a container runtime without code modifications. To achieve this feat for a unikernel would require the application to be ported to the new reduced kernel. Containerizing an application is a generally frictionless developer experience, which has contributed to the success of containers.
If a process can exploit the kernel, it can take over the system the kernel is running on. This is a risk that adversaries like Captain Hashjack will attempt to exploit, and so cloud providers and hardware vendors have been pioneering different approaches to moving the guest away from Linux system call interaction.
Linux containers are a lightweight form of isolation as they allow workloads to use kernel APIs directly, minimizing the layers of abstraction. Sandboxes take a variety of other approaches, and generally use container techniques as well.
Linux’s Kernel Virtual Machine (KVM) is a module that allows the kernel to run a nested version of itself as a hypervisor. It uses the processor’s hardware virtualization commands and allows each “guest” to run a full Linux or Windows operating system in the virtual machine with private, virtualized hardware. A virtual machine differs from a container as the guest’s processes are running on their own kernel: container processes always share the host kernel.
Sandboxes combine the best of virtualization and container isolation to optimize for specific use cases.
gVisor and Firecracker (written in Golang and Rust, respectively) both operate on the premise that their statically typed system call proxying (between the workload/guest process and the host kernel) is more secure for consumption by untrusted workloads than the Linux kernel itself, and that performance is not significantly impacted.
gVisor starts a KVM or operates in ptrace mode (using the debug ptrace system call to monitor and control its guest), and inside starts a userspace kernel, which proxies system calls down to the host using a “sentry” process. This trusted process reimplements 237 Linux system calls and only needs 53 host system calls to operate. It is constrained to that list of system calls by seccomp. It also starts a companion “filesystem interaction” side process called Gofer to prevent a compromised sentry process interacting with the host’s filesystem, and finally implements its own userspace networking stack to isolate it from bugs in the Linux TCP/IP stack.
Firecracker, on the other hand, while also using KVM, starts a stripped-down device emulator instead of implementing the heavyweight QEMU process to emulate devices (as traditional Linux virtual machines do). This reduces the host’s attack surface and removes unnecessary code, requiring 36 system calls itself to function.
And finally, at the other end of the diagram in Figure 3-12, KVM/QEMU VMs emulate hardware and so provide a guest kernel and full device emulation, which increases startup times and memory footprint.
Virtualization provides better hardware isolation through CPU integration, but is slower to start and run due to the abstraction layer between the guest and the underlying host.
Containers are lightweight and suitably secure for most workloads. They run in production for multinational organizations around the world. But high-sensitivity workloads and data need greater isolation. You can categorize workloads by risk:
Does this application access a sensitive or high-value asset?
Is this application able to receive untrusted traffic or input?
Have there been vulnerabilities or bugs in this application before?
If the answer to any of those is yes, you may want to consider a next-generation sandboxing technology to further isolate workloads.
gVisor, Firecracker, and Kata Containers all take different approaches to virtual machine isolation, while sharing the aim of challenging the perception of slow startup time and high memory overhead.
Kata Containers is a container runtime that starts a VM and runs a container inside. It is widely compatible and can use Firecracker as its VMM.
Table 3-1 compares these sandboxes and some key features.
|  | Supported container platforms | Dedicated guest kernel | Support different guest kernels | Open source | Hot-plug | Direct access to HW | Required hypervisors | Backed by |
|---|---|---|---|---|---|---|---|---|
| gVisor | Docker, K8s | Yes | No | Yes | No | No | None | Google |
| Firecracker | Docker | Yes | Yes | Yes | No | No | KVM | Amazon |
| Kata | Docker, K8s | Yes | Yes | Yes | Yes | Yes | KVM or Xen | OpenStack |
Each sandbox combines virtual machine and container technologies: some VMM process, a Linux kernel within the virtual machine, a Linux userspace in which to run the process once the kernel has booted, and some mix of kernel-based isolation (that is, container-style namespaces, cgroups, or seccomp) either within the VM, around the VMM, or some combination thereof.
Let’s have a closer look at each one.
Google’s gVisor was originally built to allow untrusted, customer-supplied workloads to run in AppEngine on Borg, Google’s internal orchestrator and the progenitor to Kubernetes. It now protects Google Cloud products: App Engine standard environment, Cloud Functions, Cloud ML Engine, and Cloud Run, and it has been modified to run in GKE. It has the best Docker and Kubernetes integrations from among this chapter’s sandboxing technologies.
To run the examples, the gVisor runtime binary must be installed on the host or worker node.
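A hedged install sketch follows; the release URL layout and the runsc install behavior are taken from the gVisor docs at the time of writing, so check gvisor.dev for current instructions:

```
# Fetch the runsc binary and its containerd shim, then register with Docker.
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
wget ${URL}/runsc ${URL}/containerd-shim-runsc-v1
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin

sudo /usr/local/bin/runsc install   # adds "runsc" to /etc/docker/daemon.json
sudo systemctl restart docker
```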
Docker supports pluggable container runtimes, and a simple docker run -it --runtime=runsc starts a gVisor sandboxed OCI container. Let’s have a look at what’s in /proc in a vanilla gVisor container to compare it with standard runc:
```
user@host:~ [0]$ docker run -it --runtime=runsc sublimino/hack \
    ls -lasp /proc/1
total 0
0 dr-xr-xr-x 1 root root 0 May 23 16:22 ./
0 dr-xr-xr-x 2 root root 0 May 23 16:22 ../
0 -r--r--r-- 0 root root 0 May 23 16:22 auxv
0 -r--r--r-- 0 root root 0 May 23 16:22 cmdline
0 -r--r--r-- 0 root root 0 May 23 16:22 comm
0 lrwxrwxrwx 0 root root 0 May 23 16:22 cwd -> /root
0 -r--r--r-- 0 root root 0 May 23 16:22 environ
0 lrwxrwxrwx 0 root root 0 May 23 16:22 exe -> /usr/bin/coreutils
0 dr-x------ 1 root root 0 May 23 16:22 fd/
0 dr-x------ 1 root root 0 May 23 16:22 fdinfo/
0 -rw-r--r-- 0 root root 0 May 23 16:22 gid_map
0 -r--r--r-- 0 root root 0 May 23 16:22 io
0 -r--r--r-- 0 root root 0 May 23 16:22 maps
0 -r-------- 0 root root 0 May 23 16:22 mem
0 -r--r--r-- 0 root root 0 May 23 16:22 mountinfo
0 -r--r--r-- 0 root root 0 May 23 16:22 mounts
0 dr-xr-xr-x 1 root root 0 May 23 16:22 net/
0 dr-x--x--x 1 root root 0 May 23 16:22 ns/
0 -r--r--r-- 0 root root 0 May 23 16:22 oom_score
0 -rw-r--r-- 0 root root 0 May 23 16:22 oom_score_adj
0 -r--r--r-- 0 root root 0 May 23 16:22 smaps
0 -r--r--r-- 0 root root 0 May 23 16:22 stat
0 -r--r--r-- 0 root root 0 May 23 16:22 statm
0 -r--r--r-- 0 root root 0 May 23 16:22 status
0 dr-xr-xr-x 3 root root 0 May 23 16:22 task/
0 -rw-r--r-- 0 root root 0 May 23 16:22 uid_map
```
Removing special files from this directory prevents a hostile process from accessing the relevant feature in the underlying host kernel.
There are far fewer entries in /proc than in a runc container, as this diff shows:
```
user@host:~ [0]$ diff -u \
    <(docker run -t sublimino/hack ls -1 /proc/1) \
    <(docker run -t --runtime=runsc sublimino/hack ls -1 /proc/1)
-arch_status
-attr
-autogroup
 auxv
-cgroup
-clear_refs
 cmdline
 comm
-coredump_filter
-cpu_resctrl_groups
-cpuset
 cwd
 environ
 exe
@@ -16,39 +8,17 @@
 fdinfo
 gid_map
 io
-limits
-loginuid
-map_files
 maps
 mem
 mountinfo
 mounts
-mountstats
 net
 ns
-numa_maps
-oom_adj
 oom_score
 oom_score_adj
-pagemap
-patch_state
-personality
-projid_map
-root
-sched
-schedstat
-sessionid
-setgroups
 smaps
-smaps_rollup
-stack
 stat
 statm
 status
-syscall
 task
-timens_offsets
-timers
-timerslack_ns
 uid_map
-wchan
```
The sentry process that simulates the Linux system call interface reimplements over 235 of the ~350 possible system calls in Linux 5.3.11. This shows you a “masked” view of the /proc and /dev virtual filesystems. These filesystems have historically leaked the container abstraction by sharing information from the host (memory, devices, processes, etc.) so are an area of special concern.
Let’s look at system devices under /dev in gVisor and runc:
```
user@host:~ [0]$ diff -u \
    <(docker run -t sublimino/hack ls -1p /dev) \
    <(docker run -t --runtime=runsc sublimino/hack ls -1p /dev)
-console
-core
 fd
 full
 mqueue/
+net/
 null
 ptmx
 pts/
```
We can see that the runsc gVisor runtime drops the console and core devices, but includes a /dev/net/tun device (under the net/ directory) for its netstack networking stack, which also runs inside Sentry. Netstack can be bypassed for direct host network access (at the cost of some isolation), or host networking disabled entirely for fully host-isolated networking (depending on the CNI or other network configured within the sandbox).
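The network mode is chosen with runsc’s --network flag when the runtime is registered; a hedged sketch of a Docker daemon configuration (the runtime names and install path are assumptions):

```
# Register gVisor variants with different network modes.
# runsc supports --network=sandbox (default, netstack), host, and none.
sudo tee /etc/docker/daemon.json <<'EOF'
{
  "runtimes": {
    "runsc":         { "path": "/usr/local/bin/runsc" },
    "runsc-hostnet": { "path": "/usr/local/bin/runsc",
                       "runtimeArgs": ["--network=host"] },
    "runsc-nonet":   { "path": "/usr/local/bin/runsc",
                       "runtimeArgs": ["--network=none"] }
  }
}
EOF
sudo systemctl restart docker
```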
Apart from these giveaways, gVisor is kind enough to identify itself at boot time, which you can see in a container with dmesg:
```
$ docker run --runtime=runsc sublimino/hack dmesg
[    0.000000] Starting gVisor...
[    0.340005] Feeding the init monster...
[    0.539162] Committing treasure map to memory...
[    0.688276] Searching for socket adapter...
[    0.759369] Checking naughty and nice process list...
[    0.901809] Rewriting operating system in Javascript...
[    1.384894] Daemonizing children...
[    1.439736] Granting licence to kill(2)...
[    1.794506] Creating process schedule...
[    1.917512] Creating bureaucratic processes...
[    2.083647] Checking naughty and nice process list...
[    2.131183] Ready!
```
Notably this is not the real time it takes to start the container, and the quirky messages are randomized—don’t rely on them for automation. If we time the process we can see it start faster than it claims:
```
$ time docker run --runtime=runsc sublimino/hack dmesg
[    0.000000] Starting gVisor...
[    0.599179] Mounting deweydecimalfs...
[    0.764608] Consulting tar man page...
[    0.821558] Verifying that no non-zero bytes made their way into /dev/zero...
[    0.892079] Synthesizing system calls...
[    1.381226] Preparing for the zombie uprising...
[    1.521717] Adversarially training Redcode AI...
[    1.717601] Conjuring /dev/null black hole...
[    2.161358] Accelerating teletypewriter to 9600 baud...
[    2.423051] Checking naughty and nice process list...
[    2.437441] Generating random numbers by fair dice roll...
[    2.855270] Ready!

real    0m0.852s
user    0m0.021s
sys     0m0.016s
```
Unless an application running in a sandbox explicitly checks for these features of the environment, it will be unaware that it is in a sandbox. Your application makes the same system calls as it would to a normal Linux kernel, but the Sentry process intercepts the system calls as shown in Figure 3-13.
Sentry prevents the application interacting directly with the host kernel, and has a seccomp profile that limits its possible host system calls. This helps prevent escalation in case a tenant breaks into Sentry and attempts to attack the host kernel.
Implementing a userspace kernel is a Herculean undertaking and does not cover every system call. This means some applications are not able to run in gVisor, although in practice this doesn’t happen very often: there are millions of workloads running on GCP under gVisor.
The Sentry has a side process called Gofer. It handles disks and devices, which are historically common VM attack vectors. Separating out these responsibilities increases your resistance to compromise; if Sentry has an exploitable bug, it can’t be used to attack the host’s devices directly because they’re all proxied through Gofer.
gVisor is written in Go to avoid security pitfalls that can plague kernels. Go is strongly typed, with built-in bounds checks, no uninitialized variables, no use-after-free bugs, no stack overflow bugs, and a built-in race detector. However, using Go has its challenges, and the runtime often introduces a little performance overhead.
This approach also comes at the cost of some reduced application compatibility and a high per-system-call overhead. Of course, not all applications make a lot of system calls, so the impact depends on usage.
Application system calls are redirected to Sentry by a Platform Syscall Switcher, which intercepts the application when it tries to make system calls to the kernel. Sentry then makes the required system calls to the host for the containerized process, as shown in Figure 3-14. This proxying prevents the application from directly controlling system calls.
Sentry sits in a loop waiting for a system call to be generated by the application, as shown in Figure 3-15.
It captures the system call with ptrace, handles it, and returns a response to the process (often without making the expected system call to the host). This simple model protects the underlying kernel from any direct interaction with the process inside the container.
The decreasing number of permitted calls shown in Figure 3-16 limits the exploitable interface of the underlying host kernel to 68 system calls, while the containerized application process believes it has access to all ~350 kernel calls.
The Platform Syscall Switcher, gVisor’s system call interceptor, has two modes: ptrace and KVM. The ptrace (“process trace”) system call provides a mechanism for a parent process to observe and modify another process’s behavior. PTRACE_SYSEMU forces the traced process to stop on entry to the next syscall, and gVisor is able to respond to it or proxy the request to the host kernel, going via Gofer if I/O is required.
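The mode is selected with runsc’s --platform flag; a quick hedged test using runsc’s built-in do subcommand (which runs a command in a throwaway sandbox, per the gVisor docs):

```
# ptrace works anywhere, including VMs without nested virtualization;
# KVM is faster per-syscall but requires access to /dev/kvm.
sudo runsc --platform=ptrace do uname -r
sudo runsc --platform=kvm    do uname -r
```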
Firecracker is a virtual machine monitor (VMM) that boots a dedicated VM for its guest using KVM. Instead of using KVM’s traditional device emulation pairing with QEMU, Firecracker implements its own memory management and device emulation. It has no BIOS (instead implementing Linux Boot Protocol), no PCI support, and stripped down, simple, virtualized devices with a single network device, a block I/O device, timer, clock, serial console, and keyboard device that only simulates Ctrl-Alt-Del to reset the VM, as shown in Figure 3-17.
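A microVM is configured and booted through Firecracker’s REST API over a Unix socket; a hedged sketch (the kernel and rootfs paths are illustrative; the endpoints are from Firecracker’s published API):

```
# Start the VMM with its API socket, then configure and boot a microVM.
firecracker --api-sock /tmp/fc.sock &

curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/boot-source' \
  -H 'Content-Type: application/json' \
  -d '{"kernel_image_path": "./vmlinux",
       "boot_args": "console=ttyS0 reboot=k panic=1"}'

curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/drives/rootfs' \
  -H 'Content-Type: application/json' \
  -d '{"drive_id": "rootfs", "path_on_host": "./rootfs.ext4",
       "is_root_device": true, "is_read_only": false}'

curl --unix-socket /tmp/fc.sock -X PUT 'http://localhost/actions' \
  -H 'Content-Type: application/json' \
  -d '{"action_type": "InstanceStart"}'
```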
The Firecracker VMM process that starts the guest virtual machine is in turn started by a jailer process. The jailer sets up the security configuration of the VMM sandbox (GID and UID assignment, network namespaces, chroot, and cgroups), then terminates and passes control to Firecracker, where seccomp is enforced around the KVM guest kernel and userspace that it boots.
Instead of using a second process for I/O like gVisor, Firecracker uses the KVM’s virtio drivers to proxy from the guest’s Firecracker process to the host kernel, via the VMM (shown in Figure 3-18). When the Firecracker VM image starts, it boots into protected mode in the guest kernel, never running in its real mode.
Firecracker is compatible with Kubernetes and OCI using the firecracker-containerd shim.
Firecracker invokes far less host kernel code than traditional LXC or gVisor once it has started, although they all touch similar amounts of kernel code to start their sandboxes.
Performance improvements are gained from an isolated memory stack, and lazily flushing data to the page cache instead of disk to increase filesystem performance. It supports arbitrary Linux binaries but does not support generic Linux kernels. It was created for AWS’s Lambda service, forked from Google’s ChromeOS VMM, crosvm:
What makes crosvm unique is a focus on safety within the programming language and a sandbox around the virtual devices to protect the kernel from attack in case of an exploit in the devices.
Firecracker is a statically linked Rust binary that is compatible with Kata Containers, Weave Ignite, firekube, and firecracker-containerd. It provides soft allocation (not allocating memory until it’s actually used) for more aggressive “bin packing,” and so greater resource utilization.
Finally, Kata Containers consists of lightweight VMs containing a container engine. They are highly optimized for running containers. They are also the oldest, and most mature, of the recent sandboxes. Compatibility is wide, with support for most container orchestrators.
Grown from a combination of Intel Clear Containers and Hyper.sh RunV, Kata Containers (Figure 3-19) wraps containers with a dedicated KVM virtual machine and device emulation from a pluggable backend: QEMU, QEMU-lite, NEMU (a custom stripped-down QEMU), or Firecracker. It is an OCI runtime and so supports Kubernetes.
The Kata Containers runtime launches each container on a guest Linux kernel. Each Linux system is on its own hardware-isolated VM, as you can see in Figure 3-20.
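If Kata is registered with Docker, the VM boundary is visible from inside a container (a hedged example; the runtime name varies by installation, and the kernel versions are illustrative):

```
# Each Kata container boots its own guest kernel, so the container reports
# a different kernel version from the host.
uname -r                                                  # e.g., 5.15.0 (host)
docker run --rm --runtime=kata-runtime alpine uname -r    # guest kernel differs
```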
The kata-runtime process is the VMM, and the interface to the OCI runtime. kata-proxy handles I/O for the kata-agent (and therefore the application) using KVM’s virtio-serial, and multiplexes a command channel over the same connection. kata-shim is the interface to the container engine, handling container lifecycles, signals, and logs.
The guest is started using KVM and either QEMU or Firecracker. The project has forked QEMU twice to experiment with lightweight start times and has reimplemented a number of features back into QEMU, which is now preferred to NEMU (the most recent fork).
Inside the VM, QEMU boots an optimized kernel, and systemd starts the kata-agent process. kata-agent, which uses libcontainer and so shares a lot of code with runc, manages the containers running inside the VM.
Networking is provided by integrating with CNI (or Docker’s CNM), and a network namespace is created for each VM. Because of its networking model, the host network can’t be joined.
SELinux and AppArmor are not currently implemented, and some OCI inconsistencies limit the Docker integration.
Many new VMM technologies have some Rust components. So is Rust any good?
It is similar to Golang in that it is memory safe (memory model, virtio, etc.) but it is built atop a memory ownership model, which avoids whole classes of bugs including use after free, double free, and dangling pointer issues.
It has safe and simple concurrency and no garbage collector (which could otherwise add runtime overhead and latency to virtualization), instead using compile-time analysis to catch the memory errors that cause segmentation faults.
rust-vmm is a development toolkit for new VMMs as shown in Figure 3-21. It is a collection of building blocks (Rust packages, or “crates”) comprising virtualization components. These are well tested (and therefore better secured) and provide a simple, clean interface. For example, the vm-memory crate is a guest memory abstraction, providing a guest address, memory regions, and guest shared memory.
The project was birthed from ChromeOS’s cross-vm (crosvm), which was forked by Firecracker and subsequently abstracted into “hypervisor from scratch” Rust crates. This approach will enable the development of a plug-and-play hypervisor architecture.
To see how a runtime is built, you can check out Youki. It’s an experimental container runtime written in Rust that implements the OCI runtime-spec, as runc does.
The degree of access and privilege that a guest process has to host features, or virtualized versions of them, impacts the attack surface available to an attacker in control of the guest process.
This new tranche of sandbox technologies is under active development. It’s code and, like all new code, is at risk of exploitable bugs. This is a fact of software, however, and is infinitely better than no new software at all!
It may be that these sandboxes are not yet a target for attackers. The level of innovation and baseline knowledge to contribute means the barrier to entry is set high. Captain Hashjack is likely to prioritize easier targets.
From an administrator’s perspective, modifying or debugging applications within the sandbox becomes slightly more difficult, similar to the difference between bare metal and containerized processes. These difficulties are not insurmountable but require administrator familiarization with the underlying runtime.
It is still possible to run privileged sandboxes that have elevated capabilities within the guest. And although the risks are fewer than for privileged containers, users should be aware that any reduction of isolation increases the risk of running the process inside the sandbox.
Kubernetes and Docker support running multiple container runtimes simultaneously; in Kubernetes, Runtime Class is stable from v1.20 on. This means a Kubernetes worker node can host pods running under different Container Runtime Interfaces (CRIs), which greatly enhances workload separation.
With spec.template.spec.runtimeClassName you can target a sandbox for a Kubernetes workload via CRI. Docker is able to run any OCI-compliant runtime (e.g., runc, runsc), but the Kubernetes kubelet uses CRI. While Kubernetes has not yet distinguished between types of sandboxes, we can still set node affinity and tolerations so pods are scheduled on to nodes that have the relevant sandbox technology installed.
To use a new CRI runtime in Kubernetes, create a non-namespaced RuntimeClass:

```
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  # The name the RuntimeClass will be referenced by
  # RuntimeClass is a non-namespaced resource
  name: gvisor
handler: gvisor  # The name of the corresponding CRI configuration
```
Then reference the CRI runtime class in the pod definition:
```
apiVersion: v1
kind: Pod
metadata:
  name: my-gvisor-pod
spec:
  runtimeClassName: gvisor
  # ...
```
This starts a new pod using gVisor. Remember that runsc (gVisor’s runtime component) must be installed on the node that the pod is scheduled on.
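RuntimeClass can also carry scheduling constraints so that pods using it only land on suitably equipped nodes; a hedged sketch (the node label is a hypothetical one you would apply to your gVisor-enabled nodes):

```
# Pods referencing this RuntimeClass are scheduled only to matching nodes.
kubectl apply -f - <<'EOF'
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: gvisor
handler: gvisor
scheduling:
  nodeSelector:
    example.com/sandbox: gvisor   # hypothetical label on gVisor nodes
EOF
```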
Generally sandboxes are more secure, and containers are less complex.
When running sensitive or untrusted workloads, you want to narrow the interface between a sandboxed process and the host. There are trade-offs—debugging a rogue process becomes much harder, and traditional tracing tools may not have good compatibility.
There is a general, minor performance overhead for sandboxes over containers (~50–200ms startup), which may be negligible for some workloads, and benchmarking is strongly encouraged. Options may also be limited by platform or nested virtualization options.
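A rough comparison you can run yourself (illustrative only; results vary by hardware and configuration):

```
# Compare cold-start latency of runc and gVisor for a trivial command.
time docker run --rm sublimino/hack true                   # runc baseline
time docker run --rm --runtime=runsc sublimino/hack true   # gVisor sandbox
```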
As next-generation runtimes have focused on stripping down legacy compatibility, they are very small and very fast to start up (compared to traditional VMs)—not as fast as LXC or runc, but fast enough for FaaS providers to offer aggressive scale rates.
Traditional container runtimes like LXC and runc are faster to start as they run a process on an existing kernel. Sandboxes need to configure their own guest kernel, which leads to slightly longer start times.
Managed services are easiest to adopt, with gVisor in GKE and Firecracker in AWS Fargate. Both of them, and Kata, will run anywhere virtualization is supported, and the future is bright with the rust-vmm library promising many more runtimes to keep valuable workloads safe.
Segregating the most sensitive workloads on dedicated nodes in sandboxes gives your systems the greatest resistance to practical compromise.