Valentina Palmiotti
Sep 10, 2022

Attacking Firecracker: AWS' microVM Monitor Written in Rust

By: Valentina Palmiotti, @chompie1337

At Grapl we believe that in order to build the best defensive system we need to deeply understand attacker behaviors. As part of that goal we're investing in offensive security research. Keep up with our blog for new research on high risk vulnerabilities, exploitation, and advanced threat tactics.

This blog post covers attacking a vulnerability in Firecracker, an open source micro-virtual machine (microVM) monitor written in the Rust programming language. It was developed for use in AWS Lambda, a serverless compute service. Firecracker is also used for AWS’ similar Fargate service, which provides a way to run containers without having to manage servers for container orchestration. Because of the risks introduced by multi-tenancy, Firecracker was intentionally designed with security in mind.

In this post, we’ll cover the following topics:

  • What is Firecracker?
  • Why attack it?
  • How does it work?
  • Root cause analysis of a memory corruption vulnerability, CVE-2019-18960
  • Exploit primitives and analysis of exploitability
  • Reflections and takeaways as they relate to security

I had no knowledge of Firecracker (or Rust) prior to conducting this research. My hope is that this post will be useful for those wanting to learn about virtualization, Firecracker, and KVM, and that it provides some clarity on the various layers of virtualization and VM escape exploitation.

Firecracker: What is it?

Firecracker is an open source virtual machine monitor (VMM) created and maintained by Amazon Web Services (AWS). Per Amazon’s website, Firecracker is a “new virtualization and open source technology that enables service owners to operate secure multi-tenant container-based services by combining the speed, resource efficiency, and performance enabled by containers with the security and isolation offered by traditional VMs.” [1].  

Firecracker is comparable to QEMU-KVM; they are both VMMs that utilize KVM, a hypervisor built into the Linux kernel. Firecracker was designed to prioritize security and efficiency for serverless workloads. This led to some key design differences from QEMU. Firecracker is much less flexible than QEMU. In order to minimize complexity and attack surface, Firecracker forgoes non-essential functionality. QEMU, on the other hand, has had many vulnerabilities arise from its complex device implementations.

Why Firecracker?

Technology like Firecracker is of particular interest to Grapl because we’re building a multi-tenant system with customer-provided code execution. Therefore, it is of utmost importance that multi-tenant boundaries cannot be violated. Firecracker is used by AWS to isolate runtimes from each other. Before deciding to use Firecracker in production, we conducted a security review of the product to evaluate whether it was appropriate for our use case. We also wanted to conduct offense-driven research to come up with hardening measures that are effective and worthwhile to implement in our environment. Because Grapl’s use case is specific, unlike AWS which has to run arbitrary applications, we can enforce more constraints on our application (such as execution time, resource usage, credential limitations, the files available to it, etc.). This research came as a result of our security review.

How Does it Work?

First, I’ll briefly explain generally how a virtual machine monitor (VMM) uses KVM and then get into the specifics of Firecracker.

KVM

KVM (Kernel-based Virtual Machine) is a type-1 hypervisor built into the Linux kernel (for x86) that allows a host to run multiple isolated virtual machines. It consists of two loadable kernel modules. The first, kvm.ko, provides the virtualization infrastructure. The second is a processor specific module (for either Intel or AMD) which takes a slice of the host’s physical CPU and maps it directly to the guest’s virtual CPU.

Each guest VM runs as a regular Linux process in the host. KVM in the kernel exposes a low level API to user space processes via ioctls to the /dev/kvm device. Through this API, the VMM user space process can create new VMs, assign vCPUs and physical memory, and intercept I/O or memory accesses to provide the guest access to emulated or virtualization-aware hardware devices [2].

Firecracker Design

Firecracker is a VMM that uses the Linux Kernel’s KVM virtualization infrastructure to provide Linux and OSv microVMs on Linux hosts. On the host, there is one Firecracker process per microVM.

There were some important design decisions with respect to security. The goal of Firecracker is to be a minimal VMM, so it only provides a limited number of emulated devices. These devices are: block storage (virtio-blk), network (virtio-net), vsock (virtio-vsock), balloon driver (virtio-balloon), a serial console, and a partial I8042 keyboard controller used only to stop the VM [4]. For comparison, QEMU has support for over 40 emulated devices, from which vulnerabilities are reported often.

Storage is provided via block device rather than filesystem passthrough, to avoid giving the guest access to the host’s Linux kernel filesystem code, which is complex (and often has exploitable bugs). Firecracker also exposes a REST-based configuration API over a UNIX domain socket [3].

The Firecracker virtio-vsock design, to support host-guest communication via socket, is also security conscious. The standard way is to use vhost (like what QEMU does), which requires a guest to pass data directly to a vhost kernel module on the host. Instead, Firecracker has its own vsock device as a backend to avoid exposing this additional attack surface. I will describe this design more in detail in the next section.

Firecracker can be further constrained using the jailer program, which applies a set of sandboxing restrictions (such as seccomp) to the process.

virtio-vsock

The vulnerability we’ll discuss is found in the vsock implementation of Firecracker. I will explain this design a bit more in depth in the current section.

virtio-vsock is a guest/host communication device that allows applications on the guest and host to communicate via socket [5]. The standard way of implementing vsock, like what is done by QEMU, is by using the vhost-vsock kernel module. The vhost-vsock kernel module provides virtio device emulation in the kernel, handling the communication with the guest [6]. This allows the guest to pass untrusted data directly to a module running on the host’s kernel.

Firecracker, on the other hand, emulates the virtio-vsock device itself in user space, implementing the device model over MMIO. The vsock device is exposed to the host via a UNIX socket. Firecracker mediates communication between an AF_VSOCK socket (on the guest end) and an AF_UNIX socket (on the host end) [7]. This solution has the advantage of avoiding a new kernel attack surface and there’s also less dependency on host kernel features, like vhost.

The Vulnerability

There have only been three CVEs registered for Firecracker since its creation, and only one that can potentially lead to RCE on the host. In addition to being an RCE vulnerability, I chose to look at CVE-2019-18960 because it is a memory corruption vulnerability. Being completely new to Rust, I thought it would be worthwhile to examine how memory corruption vulnerabilities can still occur in a memory safe language.

The vulnerability is found in the vsock device implementation of Firecracker. As explained in a previous section, Firecracker implements the virtio-vsock device model over MMIO. That means that Firecracker reads directly from the guest’s memory, which also resides in Firecracker’s own process memory.

When a VM is created, Firecracker allocates the memory used for the guest’s RAM using mmap. This is represented by a vector of MemoryRegion structures.

pub struct MemoryRegion {
    mapping: MemoryMapping,
    guest_base: GuestAddress,
}

Here guest_base, a GuestAddress structure, stores a 64 bit base physical address on the guest. The MemoryMapping structure, mapping, stores a pointer to the associated memory in the Firecracker process along with the size.

Firecracker performs I/O on the vsock device using the standard virtio interface. The drivers running in the guest’s kernel communicate with Firecracker through shared buffers. The guest allocates one or more buffers representing the request, registers these buffers with a descriptor table (an array), and signals that the buffers are ready to be consumed via a ring data structure (called a virtqueue). Each index of a descriptor table contains a descriptor which contains information about the guest provided buffer [9].

struct Descriptor {
    addr: u64,
    len: u32,
    flags: u16,
    next: u16,
}

If specified in flags, descriptors can be chained together, with next containing the descriptor table index of the chained descriptor. In virtio-vsock, the buffers in a descriptor chain are used to construct a vsock packet. Something to note at this point: the buffer information in the descriptor comes from the guest, and it should be treated as untrusted.

When creating a new DescriptorChain, the function is_valid is called. Here is where addr and len are checked to make sure the buffer received from the guest is valid.

fn is_valid(&self) -> bool {
    !(self
        .mem
        .checked_offset(self.addr, self.len as usize)
        .is_none()
        || (self.has_next() && self.next >= self.queue_size))
}

Let’s take a look at the checked_offset function.

/// Returns the address plus the offset if it is in range.
pub fn checked_offset(&self, base: GuestAddress, offset: usize) -> Option<GuestAddress> {
    if let Some(addr) = base.checked_add(offset) {
        for region in self.regions.iter() {
            if addr >= region.guest_base && addr < region_end(region) {
                return Some(addr);
            }
        }
    }
    None
}

In the code snippet above, checked_add verifies that adding the base address to offset (the size of the I/O buffer in this case) does not result in an integer overflow. If that check passes, the guest’s MemoryRegions are iterated through to see if the resulting address falls within a valid region. However, this check is not sufficient. There are two problems: the base and resulting addresses may belong to two different regions, and the base address may not even lie in a valid region at all.

Now, for this bug to be exploitable, we need a way for the out of bounds buffer to be used. That is where vsock comes in. Recall that vsock packets are constructed from descriptor chains.

Let’s look at the VsockPacket structure and how it is created. The first buffer in a descriptor chain contains the packet header, and the following buffers contain the packet data. Both the header and the data are stored as raw pointers, along with the data buffer’s size, inside the VsockPacket structure.

pub struct VsockPacket {
    hdr: *mut u8,
    buf: Option<*mut u8>,
    buf_size: usize,
}

The pointers to both are copied into the structure after being returned from get_host_address.

let mut pkt = Self {
    hdr: head
        .mem
        .get_host_address(head.addr)
        .map_err(VsockError::GuestMemory)? as *mut u8,
    buf: None,
    buf_size: 0,
};

pkt.buf_size = buf_desc.len as usize;
pkt.buf = Some(
    buf_desc
        .mem
        .get_host_address(buf_desc.addr)
        .map_err(VsockError::GuestMemory)? as *mut u8,
);

The get_host_address function takes a physical address from the guest and returns the corresponding address in the Firecracker process’ memory.

pub fn get_host_address(&self, guest_addr: GuestAddress) -> Result<*const u8> {
    self.do_in_region(guest_addr, 1, |mapping, offset| {
        // This is safe; `do_in_region` already checks that offset is in
        // bounds.
        Ok(unsafe { mapping.as_ptr().add(offset) } as *const u8)
    })
}

A memory region base address and the offset of the guest address from that base are calculated in do_in_region, and the sum of the two is returned as the resulting pointer. Note the unsafe block in the code snippet above. In Rust, a block of code can be prefixed with the unsafe keyword to permit operations such as dereferencing a raw pointer, reading or writing a mutable static variable, accessing a field of a union (other than to assign to it), or calling an unsafe function [10]. Here, the comment states that the operation in the unsafe block is safe to allow because do_in_region checks that the offset is in bounds. Let’s take a look:

fn do_in_region<F, T>(&self, guest_addr: GuestAddress, size: usize, cb: F) -> Result<T>
where
    F: FnOnce(&MemoryMapping, usize) -> Result<T>,
{
    for region in self.regions.iter() {
        if guest_addr >= region.guest_base && guest_addr < region_end(region) {
            let offset = guest_addr.offset_from(region.guest_base);
            if size <= region.mapping.size() - offset {
                return cb(&region.mapping, offset);
            }
            break;
        }
    }
    Err(Error::InvalidGuestAddressRange(guest_addr, size))
}

As seen above, there is a bounds check performed. The function takes a parameter, size, and checks if the size of the buffer fits within the region. This ensures that the pointer being returned has space inside the MemoryRegion for the expected amount of memory that will be accessed.

Now referring back to the calling function, get_host_address, note that 1 is always passed in as the size, instead of the actual size of the corresponding buffer. This means that as long as the buffer address starts in a valid region, it can overrun the region if its size is large enough. Due to the first check in checked_offset, the overrun has to end in a valid memory region to get this far, though.

This is interesting, because without this second bug, the previously discussed bug would not be exploitable.

Now after constructing a VsockPacket, the raw pointer stored in buf will be used to do read/write operations with the packet data to manage communications with the UNIX socket on the host. This can be used to obtain a read/write primitive outside of the guest’s memory space within the Firecracker process.

Exploit Primitives

To exploit this vulnerability an attacker has to have kernel execution in a guest VM. This is in order to execute at the level of the guest’s virtio-vsock driver. The first step of writing an exploit for this vulnerability is to write a kernel module to trigger it. The module has to register an invalid buffer with the vsock device. This is done by writing an invalid address and length combination in a descriptor table entry.

Before beginning to write code, I wanted to first look at what exploit primitives can be constructed with the vulnerability, theoretically. I had some concerns:

a) The out-of-bounds memory that can be read or written is limited to a specific area.

and

b) Rust’s runtime mitigations are restrictive.

The first step is to investigate the area of memory that can be controlled. To trigger the vulnerability, there must be more than one MemoryRegion associated with a guest’s memory space.

Let’s look at how the regions are created for x86_64 VMs:

const MEM_32BIT_GAP_SIZE: usize = (768 << 20);
 
/// Returns a Vec of the valid memory addresses.
/// These should be used to configure the GuestMemory structure for the platform.
/// For x86_64 all addresses are valid from the start of the kernel except a
/// carve out at the end of 32bit address space.
pub fn arch_memory_regions(size: usize) -> Vec<(GuestAddress, usize)> {
    let memory_gap_start = GuestAddress(FIRST_ADDR_PAST_32BITS - MEM_32BIT_GAP_SIZE);
    let memory_gap_end = GuestAddress(FIRST_ADDR_PAST_32BITS);
    let requested_memory_size = GuestAddress(size);
    let mut regions = Vec::new();

    // case1: guest memory fits before the gap
    if requested_memory_size <= memory_gap_start {
        regions.push((GuestAddress(0), size));
    // case2: guest memory extends beyond the gap
    } else {
        // push memory before the gap
        regions.push((GuestAddress(0), memory_gap_start.offset()));
        regions.push((
            memory_gap_end,
            requested_memory_size.offset_from(memory_gap_start),
        ));
    }

    regions
}

Here we can see that if the guest requires more than 0xD0000000 bytes of memory, a second MemoryRegion is created for the remaining memory. I also looked at the aarch64 implementation, but it’s not possible to trigger the creation of more than one MemoryRegion for a VM in that architecture.

With this information, we know what to do: create a buffer descriptor with a physical address lower than the boundary of the first MemoryRegion (0xD0000000) and provide a length that overruns this address. The diagram below shows the basic exploit primitive we can theoretically achieve:

Exploitability

In order to evaluate the exploitability of this vulnerability we need to investigate what memory can be accessed with the exploit primitive.

To answer this question, I did some debugging from within Firecracker. First, I configured a Firecracker microVM to require enough memory to create two MemoryRegions and printed their addresses during runtime. Below is a screenshot of Firecracker’s memory map after the MemoryRegions have been created for the guest.

Note that the mappings for the two MemoryRegions are contiguous. However, the mapping for the first MemoryRegion occurs at a higher address than the second MemoryRegion. Since our exploit primitive gives us the ability to overflow the mapping for the first MemoryRegion, we have the ability to overwrite at addresses higher than 0x7f1b3f118000 in the Firecracker process*.

There are some interesting areas of memory, such as the stack, that reside at higher addresses in the process. However, the pages mapped at address 0x7f1b3f11a000 are marked with PROT_NONE permissions, and act as a guard page. This means that we cannot overwrite onto the stack - if we have to do a contiguous write beginning from within the first MemoryRegion mapping we will segfault. This gives us DoS of the Firecracker process, which isn’t very powerful if the attacker already has guest kernel execution.

I looked further into how MemoryRegions are mapped, and found nothing that would help gain a more favorable allocation. I dumped the limited accessible area of memory at 0x7f1b3f118000-0x7f1b3f11a00 and found it was entirely NULL bytes. My inclination is that it is unlikely there is anything of interest there.

Since this vulnerability has been patched, the MMIO code that Firecracker uses has been overhauled. Now, “guard” pages are created to surround every guest memory region. The guard region is mapped with PROT_NONE, so that any access to this region will cause a SIGSEGV segfault. This mitigation protects against the exploitation of the exact type of vulnerability we are trying to exploit here.

While the aforementioned protection hadn’t been implemented at the time this vulnerability was patched, it’s an interesting coincidence that a guard page is inhibiting exploitation. The guard page in this case is being mapped somewhere else, possibly at ELF load time. I looked at the memory maps of the other processes on the Firecracker host machine, and they did not consistently have guard (PROT_NONE) mappings. To experiment further, I wrote a small Rust program and saw that it did have a guard mapping in its memory map, albeit of a different size. I speculate it comes as a result of some sort of Rust mitigation.

This creates a big roadblock for exploitation, as the memory we can overflow into doesn’t contain anything interesting. At this time I decided to move on, but I have some ideas if I were to continue. Out of curiosity, I would do more analysis to figure out what is creating the mystery guard page. I would also try to see if triggering an offset copy is possible; that is, a way such that the VsockPacket's data buffer is accessed at an offset, missing the guard page completely. VsockPackets are exchanged to and from Firecracker’s vsock backend, which manages the UNIX socket on the host. I would analyze this part of the code to find other possible primitives. I encourage anyone interested to pick up where I left off on this exploit and share their ideas.

*The size of the overflow is restricted to vsock packet size limits, among other things.

Hardening

While Firecracker’s design is security focused, there are some hardening measures that can be used to further lock down the attack surface.

First, limit untrusted code to running with the lowest privileges possible. Additionally, hardening the guest operating system and running a fully patched kernel is crucial. Without guest kernel execution, an attacker has no way to exploit the vulnerability covered in this post.

The primary recommendation from the authors of Firecracker is to use jailer, a program designed to isolate the Firecracker process in order to enhance security. In the case of exploiting the discussed vulnerability, a takeover of the Firecracker process yields a restrictive execution environment. An attacker would need to bypass all the restrictions imposed by jailer to escalate privileges and execute outside of the Firecracker process. Read a step by step account of what the jailer program does on startup here.

Among other things, jailer loads a seccomp filter for Firecracker with a per-thread profile. This means the different threads in the Firecracker process are allowed different sets of system calls, depending on the thread’s job. This is nice, but an attacker already in the Firecracker process can trivially hijack another thread that has access to different system calls. Therefore, jailer’s seccomp policy should be treated as the union of all threads’ allowable system calls. Currently, io_uring system calls are included in Firecracker’s seccomp filter. Because it redefines how system calls are executed, io_uring offers a seccomp bypass for the system calls it supports: seccomp filtering occurs on normal system call entry, but operations submitted via io_uring never go through that path. Therefore, Firecracker’s seccomp policy should additionally be treated as its union with all system calls supported by io_uring.

Security Reflections and Takeaways

These are some of the major security takeaways gleaned from doing this short research project exploiting Firecracker:

On the Kernel:

Kernel hardening and attack surface reduction are critical, despite the potential to impose restrictions on use or negatively impact performance. For a Firecracker vulnerability like the one covered in this post, protecting the guest kernel denies an attacker access to the attack surface entirely. If an attacker did successfully exploit this vulnerability, they could gain access to the host and any other VMs executing on that host.

Because of the nature of system call filtering via seccomp, io_uring still presents a major disruption to sandboxing. While it seems most appropriate to use an LSM to restrict io_uring, that introduces requirements on the host that may be suboptimal. You can read more about io_uring in my blog post here.

On Firecracker Design:

The Firecracker team’s decision to forgo vhost and implement the vsock backend in user space resulted in a critical vulnerability being introduced. However, the same vulnerability would have been much more critical had it been found in the vhost kernel code. Due to the relatively small size of the code base, the memory safety of Rust, the limited attack surface, and the newly introduced mitigations, it’s unlikely these types of vulnerabilities will be common or practically exploitable.

On Rust:

Though Rust is a memory safe language, memory corruption vulnerabilities are still possible. Rust uses & references, which are like pointers in C, but with many restrictions that allow Rust to guarantee memory safety [11]. However, Rust provides an escape hatch, the unsafe keyword, for bypassing these restrictions. This is how Rust programs are able to call into native libraries while still validating the safety of & references in the rest of the code. Rust does not permit converting a pointer returned from a C library into an & reference, because Rust is unable to validate the safety of the external library. Such raw pointers are stored as *const T and *mut T in Rust, which we see in the vulnerable code snippets in this post. Given that the developer must explicitly tell Rust to skip safety checks within unsafe blocks, it is the developer's responsibility to ensure the operations are safe in all possible cases.

Although not all exploitable bugs are that of memory safety, an interesting project for a vulnerability researcher is to search for unsafe blocks in Rust codebases and look for cases where they can be abused. Code comments asserting the safety of these blocks are clues into the assumptions the developer has made, indicating exactly what should be checked. To this aim, a researcher might be interested in cargo-geiger, which can help identify unsafe blocks in a codebase as well as their dependencies.

In a recent blog post, the Kani Rust Verifier was used to formally verify the correctness of Firecracker’s virtio device code with respect to a simple virtio requirement. The proof is for a property described in the virtio device specification, and is a requirement on the behavior of the guest’s virtio driver. There, they prove the property is always upheld regardless of malicious device requests from the guest. An interesting experiment would be to repeat the process with all the requirements found in the virtio spec, in particular those that apply to the guest’s driver.

Takeaways for Grapl’s Multi-tenant Architecture

This research was critical to understanding what strategies work best for hardening our multi-tenant architecture. Based on this work, we concluded there should be a focus on hardening the guest operating system. This limits an attacker’s ability to exploit the guest kernel, thus cutting off a considerable attack surface.

As such, we’ve focused on restricting the process within the VM to make escalation to the kernel more difficult. This involves leveraging multiple Linux sandboxing primitives, primarily through systemd’s native sandboxing features, including a restrictive seccomp filter.

Given the difficulty in exploiting Firecracker even with control over the kernel, we feel confident in our solution.

Conclusion

Security research is critical to Grapl as a company. It helps us keep customer data safe, understand the technology we use at a deeper level, and think through advanced attack scenarios. As part of this research we generated ideas for detection logic, areas for further hardening, and more, which feeds back into product development.

Ultimately, we walk away from this research with a very positive view of Firecracker, a much deeper understanding of its internals, and confidence in our mitigations.

Acknowledgements

My amazing colleagues at Grapl:

Ian Nickles, for his help with instrumentation, Rust, and general research.

Andréa, for her incredible work on the diagrams.

Colin O’Brien, for his help with Rust.

Max Wittek, for his help with Firecracker.

References

  1. https://aws.amazon.com/about-aws/whats-new/2018/11/firecracker-lightweight-virtualization-for-serverless-computing/
  2. https://googleprojectzero.blogspot.com/2021/06/an-epyc-escape-case-study-of-kvm.html
  3. https://assets.amazon.science/96/c6/302e527240a3b1f86c86c3e8fc3d/firecracker-lightweight-virtualization-for-serverless-applications.pdf
  4. https://www.talhoffman.com/2021/07/18/firecracker-internals/
  5. https://wiki.qemu.org/Features/VirtioVsock#:~:text=virtio-vsock%20is%20a%20host,-agent%20or%20SPICE%20vdagent
  6. https://stefano-garzarella.github.io/posts/2019-11-08-kvmforum-2019-vsock/
  7. https://github.com/firecracker-microvm/firecracker/blob/main/docs/vsock.md
  8. https://developer.ibm.com/articles/l-virtio/
  9. https://model-checking.github.io/kani-verifier-blog/2022/07/13/using-the-kani-rust-verifier-on-a-firecracker-example.html
  10. https://doc.rust-lang.org/reference/unsafety.html
  11. https://doc.rust-lang.org/reference/types/pointer.html

Interested in our product? Check out our Github. Reach out for a demo!

Connect with us on the Discord and Slack - we'd love to answer any questions you may have about our product.