Chapter 3. Container Runtime Isolation

Linux has evolved sandboxing and isolation techniques beyond simple virtual machines (VMs) that strengthen it against current and future vulnerabilities. Sometimes these sandboxes are called micro VMs.

These sandboxes combine parts of all previous container and VM approaches. You would use them to protect sensitive workloads and data, as they focus on rapid deployment and high performance on shared infrastructure.

In this chapter we’ll discuss different types of micro VMs that use virtual machines and containers together, to protect your running Linux kernel and userspace. The generic term sandboxing is used to cover the entire spectrum: each tool in this chapter combines software and hardware virtualization technologies and uses Linux’s Kernel-based Virtual Machine (KVM), which is widely used to power VMs in public cloud services, including Amazon Web Services and Google Cloud.

You run a lot of workloads at BCTL, and you should remember that while these techniques may also protect against Kubernetes mistakes, all of your web-facing software and infrastructure is a more obvious place to defend first. Zero-days and container breakouts are rare in comparison to simple security-sensitive misconfigurations.

Hardened runtimes are newer, and have fewer and generally less dangerous CVEs than the kernel or more established container runtimes, so we’ll focus less on historical breakouts and more on the history of micro VM design and rationale.

Threat Model

You have two main reasons for isolating a workload or pod: it may have access to sensitive information and data, or it may be untrusted and potentially hostile to other users of the system.

Examples of potentially untrusted workloads include:

Untrusted workloads may also include software with published or suspected zero-day Common Vulnerabilities and Exposures (CVEs)—if no patch is available and the workload is business-critical, isolating it further may decrease the potential impact of the vulnerability if exploited.

BCTL allows users to upload files to import data and shipping manifests, so you have a risk that threat actors will try to upload badly formatted or malicious files to try to force exploitable software errors. The pods that run the batch transformation and processing workloads are a good candidate for sandboxing, as they are processing untrusted inputs as shown in Figure 3-2.

Figure 3-2. Sandboxing a risky batch workload

Your threat model may include:

Now that we have an idea of the dangers to your systems, let’s take a step back. We’ll look at virtualization: what it is, why we use containers, and how to combine the best bits of containers and VMs.

Containers, Virtual Machines, and Sandboxes

A major difference between a container and a VM is that containers exist on a shared host kernel. VMs boot a kernel every time they start, use hardware-assisted virtualization, and have a more secure but traditionally slower runtime.

A common perception is that containers are optimized for speed and portability, and virtual machines sacrifice these features for more robust isolation from malicious behavior and higher fault tolerance.

This perception is not entirely true. Both technologies share a lot of common code pathways in the kernel itself. Containers and virtual machines have evolved like co-orbiting stars, never fully able to escape each other’s gravity. Container runtimes are a form of kernel virtualization. The OCI (Open Container Initiative) container image specifications have become the standardized atomic unit of container deployment.

Next-generation sandboxes combine container and virtualization techniques (see Figure 3-3) to reduce workloads’ access to the kernel. They do this by emulating kernel functionality in userspace or the isolated guest environment, thus reducing the host’s attack surface exposed to the process inside the sandbox. Well-defined interfaces can help to reduce complexity, minimizing the opportunity for untested code paths. And, by integrating the sandboxes with containerd, they are also able to interact with OCI images and with a software proxy (“shim”) to connect two different interfaces, which can be used with orchestrators like Kubernetes.

These sandboxing techniques are especially relevant to public cloud providers, for which multitenancy and bin packing are highly lucrative. Aggressively multitenanted systems such as Google Cloud Functions and AWS Lambda are running “untrusted code as a service,” and this isolation software is born from cloud vendor security requirements to isolate serverless runtimes from other tenants. Multitenancy will be discussed in depth in the next chapter.

Cloud providers use virtual machines as the atomic unit of compute, but they may also wrap the root virtual machine process in container-like technologies. Customers then use the virtual machine to run containers—virtualized inception.

Traditional virtualization emulates a physical hardware architecture in software. Micro VMs emulate as small an API as possible, removing features like I/O devices and even system calls to ensure least privilege. However, they are still running the same Linux kernel code to perform low-level program operations such as memory mapping and opening sockets—just with additional security abstractions to create a secure by default runtime. So even though VMs are not sharing as much of the kernel as containers do, some system calls must still be executed by the host kernel.

Software abstractions require CPU time to execute, and so virtualization must always be a balance of security and performance. It is possible to add enough layers of abstraction and indirection that a process is considered “highly secure,” but it is unlikely that this ultimate security will result in a viable user experience. Unikernels go in the other direction, tracing a program’s execution and then removing almost all kernel functionality except what the program has used. Limited observability and debuggability are perhaps the reasons that unikernels have not seen widespread adoption.

To understand the trade-offs and compromises inherent in each approach, it helps to compare the types of virtualization. Virtualization has existed for a long time and has many variations.

How Virtual Machines Work

Although virtual machines and associated technologies have existed since the late 1950s, a lack of hardware support in the 1990s led to their temporary demise. During this time “process virtual machines” became more popular, especially the Java virtual machine (JVM). In this chapter we are exclusively referring to system virtual machines: a form of virtualization not tied to a specific programming language. Examples include KVM/QEMU, VMware, Xen, VirtualBox, etc.

Virtual machine research began in the 1960s to facilitate sharing large, expensive physical machines between multiple users and processes (see Figure 3-4). To share a physical host safely, some level of isolation must be enforced between tenants—and in case of hostile tenants, there should be much less access to the underlying system.

Figure 3-4. Family tree of virtualization; source: “The Ideal Versus the Real”

This isolation is enforced in hardware (the CPU), in software (the kernel and userspace), or through cooperation between both layers, and allows many users to share the same large physical hardware. This innovation became the driving technology behind public cloud adoption: safe sharing and isolation for processes, memory, and the resources they require from the physical host machine.

The host machine is split into smaller isolated compute units, traditionally referred to as guests (see Figure 3-5). These guests interact with a virtualized layer above the physical host’s CPU and devices. That layer intercepts system calls and either proxies them to the host kernel or handles the request itself, doing the kernel’s job where possible. Full virtualization (e.g., VMware) emulates hardware and boots a full kernel inside the guest. Operating-system–level virtualization (e.g., a container) emulates the host’s kernel (i.e., using namespaces, cgroups, capabilities, and seccomp) so it can start a containerized process directly on the host kernel. Processes in containers share many of the kernel pathways and security mechanisms that processes in VMs execute.

To boot a kernel, a guest operating system will require access to a subset of the host machine’s functionality, including BIOS routines, devices and peripherals (e.g., keyboard, graphical/console access, storage, and networking), an interrupt controller and an interval timer, a source of entropy (for random number seeds), and the memory address space that it will run in.

Inside each guest virtual machine is an environment in which processes (or workloads) can run. The virtual machine itself is owned by a privileged parent process that manages its setup and interaction with the host, known as a virtual machine monitor or VMM (as in Figure 3-6). This has also been known as a hypervisor, but the distinction is blurred with more recent approaches so the original term VMM is preferred.

Figure 3-6. A virtual machine manager

Linux has a built-in virtual machine manager called KVM that allows a host kernel to run virtual machines. KVM is usually paired with QEMU, which emulates physical devices and provides memory management to the guest (and can run by itself if necessary). Together they let a guest operating system run on fully emulated hardware (as contrasted with the Xen hypervisor in Figure 3-7). This emulation narrows the interface between the VM and the host kernel and reduces the amount of kernel code the process inside the VM can reach directly. This provides a greater level of isolation from unknown kernel vulnerabilities.

Figure 3-7. KVM contrasted with Xen and QEMU; source: What Is the Difference Between KVM and QEMU
Note

Despite many decades of effort, “in practice no virtual machine is completely equivalent to its real machine counterpart” (“The Ideal Versus the Real”). This is due to the complexities of emulating hardware, and hopefully decreases the chance that we’re living in a simulation.

Benefits of Virtualization

Like all things we try to secure, virtualization must balance performance with security: decreasing the risk of running your workloads using the minimum possible number of extra checks at runtime. For containers, a shared host kernel is an avenue of potential container escape—the Linux kernel has a long heritage and monolithic codebase.

Linux is mainly written in the C language, which has classes of memory management and range checking vulnerabilities that have proven notoriously difficult to entirely eradicate. Many applications have experienced these exploitable bugs when subjected to fuzzers. This risk means we want to keep hostile code away from trusted interfaces in case they have zero-day vulnerabilities. This is a pretty serious defensive stance—it’s about reducing any window of opportunity for an attacker that has access to zero-day Linux vulnerabilities.

The sandboxing model defends against zero-days through abstraction. It moves processes away from the Linux system call interface to reduce the opportunities to exploit it, using an assortment of containers and capabilities, LSMs and kernel modules, hardware and software virtualization, and dedicated drivers. Most recent sandboxes are written in a memory-safe language like Golang or Rust, which makes their memory management safer than software programmed in C (which requires manual and potentially error-prone memory management).

What’s Wrong with Containers?

Let’s further define what we mean by containers by looking at how they interact with the host kernel, as shown in Figure 3-8.

Containers talk directly to the host kernel, but the layers of LSMs, capabilities, and namespaces ensure they do not have full host kernel access. Conversely, instead of sharing one kernel, VMs use a guest kernel (a dedicated kernel running in a hypervisor). This means if the VM’s guest kernel is compromised, more work is required to break out of the hypervisor and into the host.

Figure 3-8. Host kernel boundary

Containers are created by a low-level container runtime, and as users we talk to the high-level container runtime that controls it.

The diagram in Figure 3-9 shows the high-level interfaces, with the container managers on the left. Then Kubernetes, Docker, and Podman interact with their respective libraries and runtimes. These perform useful container management features including pushing and pulling container images, managing storage and network interfaces, and interacting with the low-level container runtime.

Figure 3-9. Container abstractions; source: “What’s up with CRI-O, Kata Containers and Podman?”

In the middle column of Figure 3-9 are the container runtimes that your Kubernetes cluster interacts with, while in the right column are the low-level runtimes responsible for starting and managing the container.

That low-level container runtime is directly responsible for starting and managing containers, interfacing with the kernel to create the namespaces and configuration, and finally starting the process in the container. It is also responsible for handling your process inside the container, and getting its system calls to the host kernel at runtime.

User Namespace Vulnerabilities

Linux was written with a core assumption: that the root user is always in the host namespace. This assumption held true while there were no other namespaces. But this changed with the introduction of user namespaces (the last major kernel namespace to be completed): developing user namespaces required many changes to code concerning the root user.

User namespaces allow you to map users inside a container to other users on the host, so ID 0 (root) inside the container can create files on a volume that from within the container look to be root-owned. But when you inspect the same volume from the host, they show up as owned by the user root was mapped to (e.g., user ID 1000, or 110000, as shown in Figure 3-10). User namespaces are not enabled in Kubernetes, although work is underway to support them.

Figure 3-10. User namespace user ID remapping
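The remapping described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the three-column format the kernel exposes in /proc/<pid>/uid_map (first ID inside the namespace, first ID outside it, and the length of the range). The names IdMapEntry, parse_uid_map, and to_host_uid, and the example mapping itself, are hypothetical and chosen only to mirror the 110000 example in the text.

```python
from typing import List, NamedTuple, Optional


class IdMapEntry(NamedTuple):
    """One uid_map line: first ID inside, first ID outside, range length."""
    inside_start: int
    outside_start: int
    length: int


def parse_uid_map(text: str) -> List[IdMapEntry]:
    """Parse whitespace-separated three-column uid_map text."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        entries.append(IdMapEntry(*(int(field) for field in fields)))
    return entries


def to_host_uid(entries: List[IdMapEntry], container_uid: int) -> Optional[int]:
    """Translate a UID seen inside the container to the UID the host sees."""
    for entry in entries:
        offset = container_uid - entry.inside_start
        if 0 <= offset < entry.length:
            return entry.outside_start + offset
    return None  # the UID has no mapping in this namespace


# Hypothetical mapping: container root (0) maps to host UID 110000.
example_map = parse_uid_map("0 110000 65536\n")
print(to_host_uid(example_map, 0))      # 110000
print(to_host_uid(example_map, 1000))   # 111000
print(to_host_uid(example_map, 70000))  # None
```

A file created by UID 0 inside this container would therefore appear on the host as owned by UID 110000, and an ID outside the mapped range has no host equivalent at all.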

Everything in Linux is a file, and files are owned by users. This makes user namespaces wide-reaching and complex, and they have been a source of privilege escalation bugs in previous versions of Linux:

CVE-2013-1858 (user namespace & CLONE_FS)

The clone system-call implementation in the Linux kernel before 3.8.3 does not properly handle a combination of the CLONE_NEWUSER and CLONE_FS flags, which allows local users to gain privileges by calling chroot and leveraging the sharing of the / directory between a parent process and a child process.

CVE-2014-4014 (user namespace & chmod)

The capabilities implementation in the Linux kernel before 3.14.8 does not properly consider that namespaces are inapplicable to inodes, which allows local users to bypass intended chmod restrictions by first creating a user namespace, as demonstrated by setting the setgid bit on a file with group ownership of root.

CVE-2015-1328 (user namespace & OverlayFS (Ubuntu only))

The overlayfs implementation in the Linux kernel package before 3.19.0-21.21 in Ubuntu versions until 15.04 did not properly check permissions for file creation in the upper filesystem directory, which allowed local users to obtain root access by leveraging a configuration in which overlayfs is permitted in an arbitrary mount namespace.

CVE-2018-18955 (user namespace & complex ID mapping)

In the Linux kernel 4.15.x through 4.19.x before 4.19.2, map_write() in kernel/user_namespace.c allows privilege escalation because it mishandles nested user namespaces with more than 5 UID or GID ranges. A user who has CAP_SYS_ADMIN in an affected user namespace can bypass access controls on resources outside the namespace, as demonstrated by reading /etc/shadow. This occurs because an ID transformation takes place properly for the namespaced-to-kernel direction but not for the kernel-to-namespaced direction.

Containers are not inherently “insecure,” but as we saw in Chapter 2, they can leak some information about a host, and a root-owned container runtime is a potential exploitation path for a hostile process or container image.

Tip

Operations such as creating network adapters in the host network namespace, and mounting host disks, are historically root-only, which has made rootless containers harder to implement. Rootful container runtimes were the only viable option for the first decade of popularized container use.

Exploits that have abused this rootfulness include CVE-2019-5736, replacing the runc binary from inside a container via /proc/self/exe, and CVE-2019-14271, attacking the host from inside a container responding to docker cp.

Underlying concerns about a root-owned daemon can be assuaged by running rootless containers in “unprivileged user namespaces” mode: creating containers using a nonroot user, within their own user namespace. This is supported in Docker 20.0X and Podman.

Rootless means the low-level container runtime process that creates the container is owned by an unprivileged user, and so container breakout via the process tree only escapes to a nonroot user, nullifying some potential attacks.

The mapping of user identifiers (UIDs) in the guest to actual users on the host depends on the user mappings of the host user namespace, container user namespace, and rootless runtime, as shown in Figure 3-11.

Figure 3-11. User mapping for rootless and user namespace containers; source: “Experimenting with Rootless Docker”

User namespaces allow nonroot users to pretend to be the host’s root user. The “root-in-userns” user can have a “fake” UID 0 and permission to create new namespaces (mount, net, uts, ipc) and to change the container’s hostname and mount points.

This allows root-in-userns, which is unprivileged in the host namespace, to create new containers. To achieve this, additional work must be done: network connections into the host network namespace can only be created by the host’s root. For rootless containers, an unprivileged slirp4netns networking device (guarded by seccomp) is used to create a virtual network device.

Unfortunately, mounting remote filesystems becomes difficult when the remote system, e.g., NFS home directories, does not understand the host’s user namespaces.

In the rootless Podman guide, Dan Walsh says:

If you have a normal process creating files on an NFS share and not taking advantage of user-namespaced capabilities, everything works fine. The problem comes in when the root process inside the container needs to do something on the NFS share that requires special capability access. In that case, the remote kernel will not know about the capability and will most likely deny access.

While rootless Podman has SELinux support (and dynamic profile support via udica), rootless Docker does not yet support AppArmor and, for both runtimes, CRIU (Checkpoint/Restore In Userspace, a feature to freeze running applications) is disabled.

Both rootless runtimes require configuration for some networking features: CAP_NET_BIND_SERVICE is required by the kernel to bind to ports below 1024 (historically considered a privileged boundary), and ping is not supported for users with high UIDs if the ID is not in /proc/sys/net/ipv4/ping_group_range (although this can be changed by host root). Host networking is not permitted (as it breaks the network isolation), cgroup v2 is functional but only when running under systemd, and cgroup v1 is not supported by either rootless implementation. The Podman documentation on the shortcomings of rootless mode has more details.
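To make the two kernel checks above concrete, here is a small Python sketch. It assumes the documented format of ping_group_range (two integers giving an inclusive range of group IDs, with a kernel default of "1 0", which matches no group). The function names are hypothetical, and the port check only models the default boundary of 1024 mentioned in the text.

```python
from typing import Tuple


def parse_ping_group_range(text: str) -> Tuple[int, int]:
    """Parse the two integers found in /proc/sys/net/ipv4/ping_group_range."""
    low, high = (int(field) for field in text.split())
    return low, high


def gid_may_ping(range_text: str, gid: int) -> bool:
    """Return True if the group ID falls inside the inclusive allowed range."""
    low, high = parse_ping_group_range(range_text)
    return low <= gid <= high


def needs_net_bind_service(port: int) -> bool:
    """Ports below 1024 are historically privileged (default boundary only)."""
    return 0 < port < 1024


print(gid_may_ping("1\t0", 1000))            # False: default range is empty
print(gid_may_ping("0\t2147483647", 100000)) # True: range widened by host root
print(needs_net_bind_service(80))            # True
print(needs_net_bind_service(8080))          # False
```

On a real host you would read the range text from the file itself; passing it in as a string keeps the sketch runnable anywhere.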

Docker and Podman share similar performance and features as both use runc, although Docker has an established networking model that doesn’t support host networking in rootless mode, whereas Podman reuses Kubernetes’ Container Network Interface (CNI) plug-ins for greater networking deployment flexibility.

Rootless containers decrease the risk of running your container images. Rootlessness prevents an exploit escalating to root via many host interactions (although some use of SETUID and SETGID binaries is often needed by software aiming to avoid running processes as root).

While rootless containers protect the host from the container, it may still be possible to read some data from the host, although an adversary will find this a lot less useful. Root capabilities are needed to interact with potential privilege escalation points including /proc, host devices, and the kernel interface, among others.

Throughout these layers of abstraction, system calls are still ultimately handled by software written in potentially unsafe C. Is the rootless runtime’s exposure to C-based system calls in the Linux kernel really that bad? Well, the C language powers the internet (and world?) and has done so for decades, but its lack of memory management leads to the same critical bugs occurring over and over again. When the kernel, OpenSSL, and other critical software are written in C, we just want to move everything as far away from trusted kernel space as possible.

While “trimmed-down” kernels exist (like unikernels and rump kernels), many traditional and legacy applications are portable onto a container runtime without code modifications. To achieve this feat for a unikernel would require the application to be ported to the new reduced kernel. Containerizing an application is a generally frictionless developer experience, which has contributed to the success of containers.

Sandboxing

If a process can exploit the kernel, it can take over the system the kernel is running on. This is a risk that adversaries like Captain Hashjack will attempt to exploit, and so cloud providers and hardware vendors have been pioneering different approaches to moving away from Linux system call interaction for the guest.

Linux containers are a lightweight form of isolation as they allow workloads to use kernel APIs directly, minimizing the layers of abstraction. Sandboxes take a variety of other approaches, and generally use container techniques as well.

Sandboxes combine the best of virtualization and container isolation to optimize for specific use cases.

gVisor and Firecracker (written in Golang and Rust, respectively) both operate on the premise that their statically typed system call proxying (between the workload/guest process and the host kernel) is more secure for consumption by untrusted workloads than the Linux kernel itself, and that performance is not significantly impacted.

gVisor starts a KVM or operates in ptrace mode (using a debug ptrace system call to monitor and control its guest), and inside starts a userspace kernel, which proxies system calls down to the host using a “sentry” process. This trusted process reimplements 237 Linux system calls and only needs 53 host system calls to operate. It is constrained to that list of system calls by seccomp. It also starts a companion “filesystem interaction” side process called Gofer to prevent a compromised sentry process interacting with the host’s filesystem, and finally implements its own userspace networking stack to isolate it from bugs in the Linux TCP/IP stack.

Firecracker, on the other hand, while also using KVM, starts a stripped-down device emulator instead of using the heavyweight QEMU process to emulate devices (as traditional Linux virtual machines do). This reduces the host’s attack surface and removes unnecessary code, and Firecracker itself needs 36 system calls to function.

And finally, at the other end of the diagram in Figure 3-12, KVM/QEMU VMs emulate hardware and so provide a guest kernel and full device emulation, which increases startup times and memory footprint.

Figure 3-12. Spectrum of isolation

Virtualization provides better hardware isolation through CPU integration, but is slower to start and run due to the abstraction layer between the guest and the underlying host.

Containers are lightweight and suitably secure for most workloads. They run in production for multinational organizations around the world. But high-sensitivity workloads and data need greater isolation. You can categorize workloads by risk:

  • Does this application access a sensitive or high-value asset?

  • Is this application able to receive untrusted traffic or input?

  • Have there been vulnerabilities or bugs in this application before?

If the answer to any of those is yes, you may want to consider a next-generation sandboxing technology to further isolate workloads.
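The three questions above amount to a simple "any of" rule. As a rough sketch, assuming you record the answers per workload, the triage could look like this in Python. The Workload class, its field names, and the example workloads are hypothetical and only illustrate the rule.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    name: str
    accesses_sensitive_asset: bool
    receives_untrusted_input: bool
    has_vulnerability_history: bool


def should_consider_sandbox(workload: Workload) -> bool:
    """A single 'yes' to any risk question makes the workload a candidate."""
    return any((
        workload.accesses_sensitive_asset,
        workload.receives_untrusted_input,
        workload.has_vulnerability_history,
    ))


# Hypothetical examples: a batch importer that parses uploaded files,
# and an internal job with no risk factors recorded.
importer = Workload("manifest-importer", False, True, False)
reporter = Workload("internal-report", False, False, False)

print(should_consider_sandbox(importer))  # True
print(should_consider_sandbox(reporter))  # False
```

The rule is deliberately conservative: it flags candidates for review and does not decide which sandbox, if any, is appropriate.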

gVisor, Firecracker, and Kata Containers all take different approaches to virtual machine isolation, while sharing the aim of challenging the perception of slow startup time and high memory overhead.

Table 3-1 compares these sandboxes and some key features.

Table 3-1. Comparison of sandbox features; source: “Making Containers More Isolated: An Overview of Sandboxed Container Technologies”
| Sandbox | Supported container platforms | Dedicated guest kernel | Support different guest kernels | Open source | Hot-plug | Direct access to HW | Required hypervisors | Backed by |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| gVisor | Docker, K8s | Yes | No | Yes | No | No | None | Google |
| Firecracker | Docker | Yes | Yes | Yes | No | No | KVM | Amazon |
| Kata | Docker, K8s | Yes | Yes | Yes | Yes | Yes | KVM or Xen | OpenStack |

Each sandbox combines virtual machine and container technologies: some VMM process, a Linux kernel within the virtual machine, a Linux userspace in which to run the process once the kernel has booted, and some mix of kernel-based isolation (that is, container-style namespaces, cgroups, or seccomp) either within the VM, around the VMM, or some combination thereof.

Let’s have a closer look at each one.

gVisor

Google’s gVisor was originally built to allow untrusted, customer-supplied workloads to run in App Engine on Borg, Google’s internal orchestrator and the progenitor of Kubernetes. It now protects Google Cloud products: App Engine standard environment, Cloud Functions, Cloud ML Engine, and Cloud Run, and it has been modified to run in GKE. It has the best Docker and Kubernetes integrations among this chapter’s sandboxing technologies.

Docker supports pluggable container runtimes, and a simple docker run -it --runtime=runsc starts a gVisor sandboxed OCI container. Let’s have a look at what’s in /proc in a vanilla gVisor container to compare it with standard runc:

user@host:~ [0]$ docker run -it --runtime=runsc sublimino/hack \
  ls -lasp /proc/1

total 0
0 dr-xr-xr-x 1 root root 0 May 23 16:22 ./
0 dr-xr-xr-x 2 root root 0 May 23 16:22 ../
0 -r--r--r-- 0 root root 0 May 23 16:22 auxv
0 -r--r--r-- 0 root root 0 May 23 16:22 cmdline
0 -r--r--r-- 0 root root 0 May 23 16:22 comm
0 lrwxrwxrwx 0 root root 0 May 23 16:22 cwd -> /root
0 -r--r--r-- 0 root root 0 May 23 16:22 environ
0 lrwxrwxrwx 0 root root 0 May 23 16:22 exe -> /usr/bin/coreutils
0 dr-x------ 1 root root 0 May 23 16:22 fd/
0 dr-x------ 1 root root 0 May 23 16:22 fdinfo/
0 -rw-r--r-- 0 root root 0 May 23 16:22 gid_map
0 -r--r--r-- 0 root root 0 May 23 16:22 io
0 -r--r--r-- 0 root root 0 May 23 16:22 maps
0 -r-------- 0 root root 0 May 23 16:22 mem
0 -r--r--r-- 0 root root 0 May 23 16:22 mountinfo
0 -r--r--r-- 0 root root 0 May 23 16:22 mounts
0 dr-xr-xr-x 1 root root 0 May 23 16:22 net/
0 dr-x--x--x 1 root root 0 May 23 16:22 ns/
0 -r--r--r-- 0 root root 0 May 23 16:22 oom_score
0 -rw-r--r-- 0 root root 0 May 23 16:22 oom_score_adj
0 -r--r--r-- 0 root root 0 May 23 16:22 smaps
0 -r--r--r-- 0 root root 0 May 23 16:22 stat
0 -r--r--r-- 0 root root 0 May 23 16:22 statm
0 -r--r--r-- 0 root root 0 May 23 16:22 status
0 dr-xr-xr-x 3 root root 0 May 23 16:22 task/
0 -rw-r--r-- 0 root root 0 May 23 16:22 uid_map
Note

Removing special files from this directory prevents a hostile process from accessing the relevant feature in the underlying host kernel.

There are far fewer entries in /proc than in a runc container, as this diff shows:

user@host:~ [0]$ diff -u \
  <(docker run -t sublimino/hack ls -1 /proc/1) \
  <(docker run -t --runtime=runsc sublimino/hack ls -1 /proc/1)

-arch_status
-attr
-autogroup
 auxv
-cgroup
-clear_refs
 cmdline
 comm
-coredump_filter
-cpu_resctrl_groups
-cpuset
 cwd
 environ
 exe
@@ -16,39 +8,17 @@
 fdinfo
 gid_map
 io
-limits
-loginuid
-map_files
 maps
 mem
 mountinfo
 mounts
-mountstats
 net
 ns
-numa_maps
-oom_adj
 oom_score
 oom_score_adj
-pagemap
-patch_state
-personality
-projid_map
-root
-sched
-schedstat
-sessionid
-setgroups
 smaps
-smaps_rollup
-stack
 stat
 statm
 status
-syscall
 task
-timens_offsets
-timers
-timerslack_ns
 uid_map
-wchan
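If you capture both listings programmatically, the same comparison can be expressed as a set difference. The Python sketch below uses a small subset of the entries from the diff above as sample data. The variable names and the hidden_entries helper are hypothetical, and in practice you would populate the lists from the output of the two docker run commands.

```python
from typing import Iterable, List


def hidden_entries(baseline: Iterable[str], sandboxed: Iterable[str]) -> List[str]:
    """Entries present in the baseline listing but absent in the sandbox."""
    return sorted(set(baseline) - set(sandboxed))


# Subset of /proc/1 entries taken from the diff above (not the full listing).
runc_proc = ["auxv", "cgroup", "cmdline", "comm", "cwd", "environ", "exe",
             "limits", "maps", "mem", "pagemap", "stack", "syscall"]
runsc_proc = ["auxv", "cmdline", "comm", "cwd", "environ", "exe",
              "maps", "mem"]

print(hidden_entries(runc_proc, runsc_proc))
# ['cgroup', 'limits', 'pagemap', 'stack', 'syscall']
```

A check like this can be useful as a regression test, to notice when a runtime upgrade starts exposing an entry that was previously masked.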

The Sentry process that simulates the Linux system call interface reimplements over 235 of the ~350 possible system calls in Linux 5.3.11. It also shows the container a “masked” view of the /proc and /dev virtual filesystems. These filesystems have historically leaked the container abstraction by sharing information from the host (memory, devices, processes, etc.) so are an area of special concern.

Let’s look at system devices under /dev in gVisor and runc:

user@host:~ [0]$ diff -u \
  <(docker run -t sublimino/hack ls -1p /dev) \
  <(docker run -t --runtime=runsc sublimino/hack ls -1p /dev)

-console
-core
 fd
 full
 mqueue/
+net/
 null
 ptmx
 pts/

We can see that the runsc gVisor runtime drops the console and core devices, but includes a /dev/net/tun device (under the net/ directory) for its netstack networking stack, which also runs inside Sentry. Netstack can be bypassed for direct host network access (at the cost of some isolation), or host networking disabled entirely for fully host-isolated networking (depending on the CNI or other network configured within the sandbox).

Apart from these giveaways, gVisor is kind enough to identify itself at boot time, which you can see in a container with dmesg:

$ docker run --runtime=runsc sublimino/hack dmesg
[   0.000000] Starting gVisor...
[   0.340005] Feeding the init monster...
[   0.539162] Committing treasure map to memory...
[   0.688276] Searching for socket adapter...
[   0.759369] Checking naughty and nice process list...
[   0.901809] Rewriting operating system in Javascript...
[   1.384894] Daemonizing children...
[   1.439736] Granting licence to kill(2)...
[   1.794506] Creating process schedule...
[   1.917512] Creating bureaucratic processes...
[   2.083647] Checking naughty and nice process list...
[   2.131183] Ready!

Notably, this is not the real time it takes to start the container, and the quirky messages are randomized, so don’t rely on them for automation. If we time the process, we can see it starts faster than it claims:

$ time docker run --runtime=runsc sublimino/hack dmesg
[   0.000000] Starting gVisor...
[   0.599179] Mounting deweydecimalfs...
[   0.764608] Consulting tar man page...
[   0.821558] Verifying that no non-zero bytes made their way into /dev/zero...
[   0.892079] Synthesizing system calls...
[   1.381226] Preparing for the zombie uprising...
[   1.521717] Adversarially training Redcode AI...
[   1.717601] Conjuring /dev/null black hole...
[   2.161358] Accelerating teletypewriter to 9600 baud...
[   2.423051] Checking naughty and nice process list...
[   2.437441] Generating random numbers by fair dice roll...
[   2.855270] Ready!

real    0m0.852s
user    0m0.021s
sys     0m0.016s
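To compare the claimed boot time with the measured one, you can parse the last dmesg timestamp and the "real" line from time. The following Python sketch does this for the output shown above. The helper names are hypothetical, and the regular expressions assume the bracketed timestamp format and the NmN.NNNs format seen in this listing.

```python
import re

DMESG_LINE = re.compile(r"^\[\s*(\d+\.\d+)\]")
TIME_REAL = re.compile(r"^(\d+)m(\d+(?:\.\d+)?)s$")


def last_dmesg_timestamp(dmesg_output: str) -> float:
    """Return the largest bracketed timestamp found in dmesg-style output."""
    stamps = []
    for line in dmesg_output.splitlines():
        match = DMESG_LINE.match(line)
        if match:
            stamps.append(float(match.group(1)))
    if not stamps:
        raise ValueError("no dmesg timestamps found")
    return max(stamps)


def parse_real_time(value: str) -> float:
    """Convert a 'time' value such as '0m0.852s' to seconds."""
    match = TIME_REAL.match(value.strip())
    if not match:
        raise ValueError(f"unrecognized time value: {value!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


sample = """[   0.000000] Starting gVisor...
[   2.423051] Checking naughty and nice process list...
[   2.855270] Ready!"""

claimed = last_dmesg_timestamp(sample)
measured = parse_real_time("0m0.852s")
print(claimed > measured)  # True: the log claims more time than really elapsed
```

This reinforces the point in the text: the timestamps are cosmetic and should not be used to measure sandbox startup.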

Unless an application running in a sandbox explicitly checks for these features of the environment, it will be unaware that it is in a sandbox. Your application makes the same system calls as it would to a normal Linux kernel, but the Sentry process intercepts the system calls as shown in Figure 3-13.

Figure 3-13. gVisor container components and privilege boundaries

Sentry prevents the application interacting directly with the host kernel, and has a seccomp profile that limits its possible host system calls. This helps prevent escalation in case a tenant breaks into Sentry and attempts to attack the host kernel.

Implementing a userspace kernel is a Herculean undertaking, and Sentry does not cover every system call. This means some applications are not able to run in gVisor, although in practice this doesn’t happen very often and there are millions of workloads running on GCP under gVisor.

The Sentry has a side process called Gofer. It handles disks and devices, which are historically common VM attack vectors. Separating out these responsibilities increases your resistance to compromise; if Sentry has an exploitable bug, it can’t be used to attack the host’s devices directly because they’re all proxied through Gofer.

However, this comes at the cost of some reduced application compatibility and a high per-system-call overhead. Of course, not all applications make a lot of system calls, so this depends on usage.
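A back-of-the-envelope sketch makes this trade-off concrete. The per-call costs and call rates below are hypothetical placeholders chosen for illustration, not gVisor measurements; substitute figures you have measured on your own workloads.

```python
def syscall_time_share(syscalls_per_sec, cost_per_call_us):
    """Fraction of each wall-clock second spent handling system calls."""
    return syscalls_per_sec * cost_per_call_us / 1_000_000


# Hypothetical per-call costs in microseconds (assumptions, not benchmarks).
NATIVE_COST_US = 0.2
SANDBOX_COST_US = 2.0

# Hypothetical workloads, expressed as system calls per second.
workloads = {"cpu-bound batch job": 1_000, "chatty I/O service": 200_000}

for name, rate in workloads.items():
    native = syscall_time_share(rate, NATIVE_COST_US)
    sandboxed = syscall_time_share(rate, SANDBOX_COST_US)
    print(f"{name}: native {native:.2%}, sandboxed {sandboxed:.2%}")
```

With these made-up numbers the batch job barely notices the sandbox, while the chatty service spends a large share of its time in system call handling. The point is the shape of the relationship, not the specific values.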

Application system calls are redirected to Sentry by a Platform Syscall Switcher, which intercepts the application when it tries to make system calls to the kernel. Sentry then makes the required system calls to the host for the containerized process, as shown in Figure 3-14. This proxying prevents the application from directly controlling system calls.

image
Figure 3-14. gVisor container components and privilege levels

Sentry sits in a loop waiting for a system call to be generated by the application, as shown in Figure 3-15.

image
Figure 3-15. gVisor sentry pseudocode; source: Resource Sharing

It captures the system call with ptrace, handles it, and returns a response to the process (often without making the expected system call to the host). This simple model protects the underlying kernel from any direct interaction with the process inside the container.
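The loop can be modeled in a few lines of Python. This is a toy sketch under stated assumptions, not gVisor's implementation: the `Sentry` class, the three-entry `HOST_ALLOWLIST`, and the handled calls are all hypothetical, and the real Sentry routes file I/O through Gofer rather than calling the host directly.

```python
from dataclasses import dataclass, field

# Hypothetical, tiny allowlist standing in for the host calls that
# Sentry's seccomp profile permits. The real list is much longer.
HOST_ALLOWLIST = {"futex", "read", "write"}


@dataclass
class Sentry:
    """Toy userspace kernel: answers calls itself, rarely touches the host."""
    pid: int = 1
    host_calls: list = field(default_factory=list)

    def host_syscall(self, name):
        # Every call to the host is checked against the allowlist first.
        if name not in HOST_ALLOWLIST:
            raise PermissionError(f"host call {name!r} blocked")
        self.host_calls.append(name)

    def handle(self, name, *args):
        if name == "getpid":
            # Served from Sentry's own state; the host is never asked.
            return self.pid
        if name == "write":
            # Output eventually needs the host, via an allowed call.
            self.host_syscall("write")
            return len(args[0])
        # Unimplemented calls fail inside the sandbox, not on the host.
        return "ENOSYS"

    def run(self, trapped_calls):
        # The loop: take each trapped call, handle it, return the result.
        return [self.handle(name, *args) for name, *args in trapped_calls]


sentry = Sentry()
results = sentry.run([("getpid",), ("write", b"hello"), ("kexec_load",)])
print(results)            # [1, 5, 'ENOSYS']
print(sentry.host_calls)  # ['write']
```

Three calls arrive from the application, but only one reaches the host, and the dangerous one is refused without the host kernel ever seeing it.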

The decreasing number of permitted calls shown in Figure 3-16 limits the exploitable interface of the underlying host kernel to 68 system calls, while the containerized application process believes it has access to all ~350 kernel calls.

The Platform Syscall Switcher, gVisor’s system call interceptor, has two modes: ptrace and KVM. The ptrace (“process trace”) system call provides a mechanism for a parent process to observe and modify another process’s behavior. PTRACE_SYSEMU forces the traced process to stop on entry to the next syscall, and gVisor is able to respond to it or proxy the request to the host kernel, going via Gofer if I/O is required.

image
Figure 3-16. gVisor system call hierarchy

Firecracker

Firecracker is a virtual machine monitor (VMM) that boots a dedicated VM for its guest using KVM. Instead of using KVM’s traditional device emulation pairing with QEMU, Firecracker implements its own memory management and device emulation. It has no BIOS (instead implementing Linux Boot Protocol), no PCI support, and stripped down, simple, virtualized devices with a single network device, a block I/O device, timer, clock, serial console, and keyboard device that only simulates Ctrl-Alt-Del to reset the VM, as shown in Figure 3-17.

image
Figure 3-17. Firecracker and KVM interaction; source: Resource Sharing

The Firecracker VMM process that starts the guest virtual machine is in turn started by a jailer process. The jailer sets up the security configuration of the VMM sandbox (assigning the GID and UID, setting up network namespaces, and creating the chroot and cgroups), then terminates and passes control to Firecracker, where seccomp is enforced around the KVM guest kernel and userspace that it boots.
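To see how those settings map onto an invocation, here is a minimal sketch that builds, but does not run, a jailer command line. The flag names follow the jailer's documented options as we understand them, so check `jailer --help` for your version; the helper `jailer_argv`, the paths, the IDs, and the cgroup value are hypothetical.

```python
import shlex


def jailer_argv(vm_id, uid, gid, netns=None, cgroups=None,
                exec_file="/usr/bin/firecracker",   # hypothetical path
                chroot_base="/srv/jailer"):         # hypothetical path
    """Build (but do not run) a jailer command line for one microVM."""
    argv = [
        "jailer",
        "--id", vm_id,
        "--exec-file", exec_file,         # the VMM binary to run once jailed
        "--uid", str(uid),                # unprivileged user for the VMM
        "--gid", str(gid),
        "--chroot-base-dir", chroot_base,
    ]
    if netns:
        argv += ["--netns", netns]        # network namespace for the VMM
    for key, value in (cgroups or {}).items():
        argv += ["--cgroup", f"{key}={value}"]
    return argv


argv = jailer_argv("vm-1", 1001, 1001,
                   netns="/var/run/netns/vm-1",
                   cgroups={"cpu.shares": 10})
print(shlex.join(argv))
```

Keeping the command as a list, rather than a shell string, avoids quoting mistakes if you later pass it to `subprocess.run`.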

Instead of using a second process for I/O like gVisor, Firecracker uses KVM’s virtio drivers to proxy from the guest’s Firecracker process to the host kernel, via the VMM (shown in Figure 3-18). When the Firecracker VM image starts, it boots straight into protected mode in the guest kernel, never running in real mode.

image
Figure 3-18. Firecracker sandboxing the guest kernel from the host

Tip

Firecracker is compatible with Kubernetes and OCI using the firecracker-containerd shim.

Firecracker invokes far less host kernel code than traditional LXC or gVisor once it has started, although they all touch similar amounts of kernel code to start their sandboxes.

Performance improvements are gained from an isolated memory stack, and lazily flushing data to the page cache instead of disk to increase filesystem performance. It supports arbitrary Linux binaries but does not support generic Linux kernels. It was created for AWS’s Lambda service, forked from Google’s ChromeOS VMM, crosvm:

What makes crosvm unique is a focus on safety within the programming language and a sandbox around the virtual devices to protect the kernel from attack in case of an exploit in the devices.

Chrome OS Virtual Machine Monitor

Firecracker is a statically linked Rust binary that is compatible with Kata Containers, Weave Ignite, firekube, and firecracker-containerd. It provides soft allocation (not allocating memory until it’s actually used) for more aggressive “bin packing,” and so greater resource utilization.
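The effect of soft allocation on bin packing can be sketched with simple arithmetic. The host size and per-VM figures below are hypothetical, and `vms_that_fit` is an illustrative helper rather than anything Firecracker provides.

```python
def vms_that_fit(host_mib, vm_sizes_mib):
    """Count how many VMs fit, placing them in order until memory runs out."""
    used = count = 0
    for size in vm_sizes_mib:
        if used + size > host_mib:
            break
        used += size
        count += 1
    return count


HOST_MIB = 16_384                 # hypothetical 16 GiB host
CONFIGURED_MIB = [512] * 100      # each VM is configured with 512 MiB
TOUCHED_MIB = [128] * 100         # assumption: each VM touches only 128 MiB

print(vms_that_fit(HOST_MIB, CONFIGURED_MIB))  # 32
print(vms_that_fit(HOST_MIB, TOUCHED_MIB))     # 100
```

Packing by memory actually touched fits roughly three times as many VMs in this example. The flip side is overcommitment: if every guest later uses its full configured allocation, the host runs short of memory.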

Kata Containers

Finally, Kata Containers consists of lightweight VMs containing a container engine. They are highly optimized for running containers. They are also the oldest, and most mature, of the recent sandboxes. Compatibility is wide, with support for most container orchestrators.

Grown from a combination of Intel Clear Containers and Hyper.sh RunV, Kata Containers (Figure 3-19) wraps containers with a dedicated KVM virtual machine and device emulation from a pluggable backend: QEMU, QEMU-lite, NEMU (a custom stripped-down QEMU), or Firecracker. It is an OCI runtime and so supports Kubernetes.

image
Figure 3-19. Kata Containers architecture

The Kata Containers runtime launches each container on a guest Linux kernel. Each Linux system is on its own hardware-isolated VM, as you can see in Figure 3-20.

The kata-runtime process is the VMM, and the interface to the OCI runtime. kata-proxy handles I/O for the kata-agent (and therefore the application) using KVM’s virtio-serial, and multiplexes a command channel over the same connection.

kata-shim is the interface to the container engine, handling container lifecycles, signals, and logs.

image
Figure 3-20. Kata Containers components

The guest is started using KVM and either QEMU or Firecracker. The project has forked QEMU twice to experiment with lightweight start times and has contributed a number of features back into QEMU, which is now preferred to NEMU (the most recent fork).

Inside the VM, QEMU boots an optimized kernel, and systemd starts the kata-agent process. kata-agent, which uses libcontainer and so shares a lot of code with runc, manages the containers running inside the VM.

Networking is provided by integrating with CNI (or Docker’s CNM), and a network namespace is created for each VM. Because of its networking model, the host network can’t be joined.

SELinux and AppArmor are not currently implemented, and some OCI inconsistencies limit the Docker integration.

rust-vmm

Many new VMM technologies have some Rustlang components. So is Rust any good?

It is similar to Golang in that it is memory safe (memory model, virtio, etc.), but it is built atop a memory ownership model, which avoids whole classes of bugs including use-after-free, double-free, and dangling pointer issues.

It has safe and simple concurrency and no garbage collector (garbage collection may incur some virtualization overhead and latency), instead using build-time analysis to catch memory issues that would otherwise surface as segmentation faults at runtime.

rust-vmm is a development toolkit for new VMMs, as shown in Figure 3-21. It is a collection of building blocks (Rust packages, or “crates”) made up of virtualization components. These are well tested (and therefore better secured) and provide a simple, clean interface. For example, the vm-memory crate is a guest memory abstraction, providing a guest address, memory regions, and guest shared memory.

image
Figure 3-21. Kata Containers components; source: Resource Sharing
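To give a feel for what a guest memory abstraction does, here is a toy model in Python. It is a sketch of the idea only: the `Region` and `GuestMemory` classes and their methods are illustrative names, not the vm-memory crate's Rust API.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass
class Region:
    guest_base: int      # first guest physical address covered
    data: bytearray      # backing memory for this region


class GuestMemory:
    """Toy guest memory: non-overlapping regions keyed by guest address."""

    def __init__(self, regions):
        self.regions = sorted(regions, key=lambda r: r.guest_base)
        self.bases = [r.guest_base for r in self.regions]

    def _find(self, addr, length):
        # Locate the last region that starts at or before addr.
        i = bisect_right(self.bases, addr) - 1
        if i < 0:
            raise ValueError(f"address {addr:#x} is not mapped")
        region = self.regions[i]
        offset = addr - region.guest_base
        # Reject accesses that fall in a gap or run past the region's end.
        if offset + length > len(region.data):
            raise ValueError(f"access at {addr:#x} leaves its region")
        return region, offset

    def write(self, addr, payload):
        region, offset = self._find(addr, len(payload))
        region.data[offset:offset + len(payload)] = payload

    def read(self, addr, length):
        region, offset = self._find(addr, length)
        return bytes(region.data[offset:offset + length])


mem = GuestMemory([Region(0x0, bytearray(0x1000)),
                   Region(0x10000, bytearray(0x1000))])
mem.write(0x10010, b"boot")
print(mem.read(0x10010, 4))  # b'boot'
```

Every access is translated from a guest address to an offset in a known region and bounds-checked first, so a guest address that falls in a gap between regions is rejected instead of touching host memory.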

The project grew out of ChromeOS’s cross-vm (crosvm), which was forked by Firecracker and subsequently abstracted into “hypervisor from scratch” Rust crates. This approach will enable the development of a plug-and-play hypervisor architecture.

Note

To see how a runtime is built, you can check out Youki. It’s an experimental container runtime written in Rust that implements the OCI runtime-spec, as runc does.