Kernel Pwning with eBPF: a Love Story

Updated: Aug 11

By: Valentina Palmiotti, @chompie1337


At Grapl we believe that in order to build the best defensive system we need to deeply understand attacker behaviors. As part of that goal we're investing in offensive security research. Keep up with our blog for new research on high risk vulnerabilities, exploitation, and advanced threat tactics.


Find the released local privilege escalation (LPE) Proof-of-Concept for CVE-2021-3490 here: https://github.com/chompie1337/Linux_LPE_eBPF_CVE-2021-3490. It targets Ubuntu 20.10 (Groovy Gorilla) kernels 5.8.0-25.26 through 5.8.0-52.58, and Ubuntu 21.04 (Hirsute Hippo) 5.11.0-16.17.


This blog post is intended to give a detailed overview of eBPF from the perspective of an exploit developer. In this post, I cover:

  • The basics of how eBPF works

  • The internals of the eBPF verifier

  • A vulnerability (CVE-2021-3490) I exploit for local privilege escalation [13].

  • Debugging eBPF bytecode

  • Exploitation techniques for DoS, information leak, and LPE

  • Some interesting observations and other bugs I discovered

  • New mitigations that make the bug class more difficult to exploit

  • Weaknesses that are still present in eBPF today


I had no knowledge of eBPF going into this. My hope is that by sharing a PoC as well as my experience developing it, it can help others get started with eBPF exploitation.



eBPF: a Primer




Extended Berkeley Packet Filter: What is it?

Berkeley Packet Filter (BPF) was initially created as a way to perform packet filtering in the kernel. Its capabilities were later redesigned and extended to create extended Berkeley Packet Filter (eBPF) [1].


Put simply, eBPF provides a way for a user mode application to run code in the kernel without needing to write a kernel module. The purported benefits of using eBPF versus a kernel module are ease of use, stability, and security. There are also performance improvements gained by doing certain tasks directly in the kernel compared to a pure user mode program. eBPF programs are used to do a myriad of things, such as: tracing, instrumentation, hooking system calls, debugging, and of course, packet capturing/filtering.


eBPF programs are written in a high level language and compiled into eBPF bytecode using a toolchain (such as BCC [18]). The eBPF VM uses a simple instruction set that uses eleven* 64-bit registers, a program counter, and a 512 byte fixed-size stack. Nine registers are general purpose read-write, one is a read-only stack pointer and the program counter is implicit [2]. The instruction set is similar to x86, and operates on both 64 and 32 bit values.


BPF_MOV64_IMM(BPF_REG_0, 0)                    /* r0 = 0 */
BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4)  /* *(u32 *)(r10 - 4) = r0 */
BPF_MOV64_REG(BPF_REG_2, BPF_REG_10)           /* r2 = r10 (frame pointer) */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4)          /* r2 += -4 */
BPF_JMP_IMM(BPF_JNE, BPF_REG_0, 0, 1)          /* if r0 != 0 goto +1 */
BPF_MOV32_IMM(BPF_REG_3, 0xFFFFFFFF)           /* w3 = 0xFFFFFFFF (32 bit mov) */

Example of eBPF bytecode instructions

*Technically, it uses 12 registers, but the 12th register is an auxiliary register only used to perform ALU sanitation operations [12].

A user mode application loads the bytecode into the kernel using the bpf() [14] syscall, where the eBPF verifier will perform a number of checks to ensure the program is “safe” to run in the kernel. This verification step is critical, because eBPF exposes a path for unprivileged users to execute code in ring 0.


After the program is loaded, the user mode application attaches the program to a “hook point”. A hook point is a place in the kernel where eBPF programs can be attached [5]. eBPF programs are event driven, meaning the program will execute when certain events occur at the hook point. The classic use case is attaching an eBPF program to a socket, where the program will execute when data is written to it.
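
For the socket case, for example, a loaded program can be attached with setsockopt. The following is only a rough sketch (attach_prog_to_socket is my own hypothetical helper, and prog_fd is assumed to come from a prior BPF_PROG_LOAD call):

#include <sys/socket.h>

/* SO_ATTACH_BPF may not be exposed by older libc headers; its value
 * comes from the kernel's asm-generic/socket.h.
 */
#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50
#endif

/* Sketch: attach an already-loaded eBPF program (prog_fd) to one end of a
 * socketpair. Writing to socks[1] will then cause the program to run
 * against the data received on socks[0].
 */
int attach_prog_to_socket(int prog_fd, int socks[2])
{
    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, socks) < 0)
        return -1;

    if (setsockopt(socks[0], SOL_SOCKET, SO_ATTACH_BPF,
                   &prog_fd, sizeof(prog_fd)) < 0)
        return -1;

    return 0;
}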


If the kconfig knob CONFIG_BPF_JIT is set, the eBPF program is JIT compiled into native assembly instructions after it is verified and loaded. Otherwise, when the program is executed it is run in the eBPF interpreter which decodes and executes the eBPF bytecode instructions.

User mode applications can interact with and get data from the eBPF program running in the kernel using eBPF maps and eBPF helper functions, which are accessed via the bpf() syscall.


The sysctl knob kernel.unprivileged_bpf_disabled determines whether unprivileged users are allowed to run eBPF programs. If it is not set, unprivileged users are allowed to attach an eBPF program to a socket that the user owns. In many Linux distributions, such as Ubuntu, unprivileged_bpf_disabled is not enabled by default. Because of this, I decided to look into eBPF more closely, as allowing unprivileged users to run code in the kernel is a ripe attack surface.

eBPF Maps

I mentioned above that user mode processes can interact with an eBPF program in the kernel using eBPF maps. They can also be used by multiple eBPF programs to interact with each other. They are a generic key/value store with an arbitrary data structure [6]. There are various types of maps including: arrays, queues, and stacks.


A map is described by five different attributes:

  • type - the data structure of the map

  • key_size - the size in bytes of the key used to index an element (used in array maps)

  • value_size - the size in bytes of each element

  • max_entries - the maximum number of entries in the map

  • map_flags - describes special characteristics of the map, such as if the entire map memory should be preallocated or not.

eBPF maps can be created and altered from user space via the bpf() syscall: they are created using the BPF_MAP_CREATE command, updated using the BPF_MAP_UPDATE_ELEM command, and their contents retrieved using the BPF_MAP_LOOKUP_ELEM command. eBPF maps can be accessed by eBPF programs using the file descriptor returned by BPF_MAP_CREATE and calling eBPF helper functions, which return pointers to values within the map.
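
For example, creating a small array map, writing an element, and reading it back from user space might look roughly like the following sketch (error handling omitted):

#include <linux/bpf.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    /* Create an array map with a single 64-byte value */
    union bpf_attr create_attr = {
        .map_type    = BPF_MAP_TYPE_ARRAY,
        .key_size    = sizeof(uint32_t),
        .value_size  = 64,
        .max_entries = 1,
    };
    int map_fd = syscall(SYS_bpf, BPF_MAP_CREATE, &create_attr, sizeof(create_attr));

    uint32_t key = 0;
    char value[64] = "hello from user space";

    /* Write the value into element 0 */
    union bpf_attr update_attr = {
        .map_fd = map_fd,
        .key    = (uint64_t)&key,
        .value  = (uint64_t)&value,
        .flags  = BPF_ANY,
    };
    syscall(SYS_bpf, BPF_MAP_UPDATE_ELEM, &update_attr, sizeof(update_attr));

    /* Read element 0 back into the same buffer */
    union bpf_attr lookup_attr = {
        .map_fd = map_fd,
        .key    = (uint64_t)&key,
        .value  = (uint64_t)&value,
    };
    syscall(SYS_bpf, BPF_MAP_LOOKUP_ELEM, &lookup_attr, sizeof(lookup_attr));

    return 0;
}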


The eBPF Verifier

The exploit I wrote leverages a bug in the eBPF verifier. So before I delve into the vulnerability it is important to briefly explain the internals of the verifier.


The verifier starts by building a control flow graph of the program. Then, it will verify each instruction is valid and all memory accesses are safe through each possible flow of control [3]. Afterwards, it will add in runtime checks to the program. This process, called ALU Sanitation, inserts patches to the eBPF bytecode to ensure permitted memory ranges are not violated during runtime when performing pointer arithmetic [4].


The verifier tries to enforce this general set of rules:

  • No back edges, loops, or unreachable instructions.

  • No pointer comparisons can be performed, and only scalar values can be added to or subtracted from a pointer. A scalar value in the eBPF verifier is any value that is not derived from a pointer. The verifier keeps track of which registers contain pointers and which contain scalar values.

  • Pointer arithmetic can not leave the “safe” bounds of a map. Meaning, the program can not access anything outside the predefined map memory. To do so, the verifier keeps track of the upper and lower bounds of the values for each register.

  • No pointers can be stored in maps or stored as a return value, in order to avoid leaking kernel addresses to user space.
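
For instance, assuming an unprivileged loader and that R2 and R6 both hold pointers into a map value, a store like the following sketch is rejected under the last rule with a message along the lines of “R6 leaks addr into map” (an example of this message appears in the verifier log excerpt later in this post):

BPF_STX_MEM(BPF_DW, BPF_REG_2, BPF_REG_6, 8)   /* *(u64 *)(r2 + 8) = r6 */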


Range Tracking

The verifier stores the following bound values, for every register in each possible path of execution, to ensure there are no out-of-bound memory accesses:

  • umin_value, umax_value store the min/max value of the register when interpreted as an unsigned (64 bit) integer

  • smin_value, smax_value store the min/max value of the register when interpreted as a signed (64 bit) integer.

  • u32_min_value, u32_max_value store the min/max value of the register when interpreted as an unsigned (32 bit) integer.

  • s32_min_value, s32_max_value store the min/max value of the register when interpreted as a signed (32 bit) integer.

  • var_off contains information about the bits of the register that are known. It is stored in a structure called tnum which contains two 64 bit fields: mask and value. Every bit that is set in mask means the value of that bit is unknown. The unset bits are known, and their true values are stored in value. For example, if var_off = {mask = 0x0; value = 0x1}, all bits of the register are known, and the register is known to have a value of 1. If var_off = {mask = 0xFFFFFFFF00000000; value = 0x3} it means that the lower 32 bits of the register are known to be 0x00000003 and the upper 32 bits are unknown.

These bounds are used to update each other. In particular, if var_off indicates the register is a known constant, the min/max bounds are updated to reflect the known value. We will see why this is important later!
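
To make the var_off semantics concrete, here is a small sketch (my own illustration, not kernel code) of what it means for a concrete runtime value to be consistent with a tnum:

#include <stdbool.h>
#include <stdint.h>

/* Simplified model of the kernel's tnum: every bit set in mask is unknown,
 * every clear bit is known and its true value is taken from value.
 */
struct tnum {
    uint64_t value;
    uint64_t mask;
};

/* A concrete value x is possible for the register iff all of its known bits
 * match: the unknown bits (mask) are ignored, the rest must equal value.
 */
static bool tnum_contains(struct tnum t, uint64_t x)
{
    return (x & ~t.mask) == t.value;
}

/* Example from the text: {mask = 0xFFFFFFFF00000000, value = 0x3} admits any
 * value whose lower 32 bits are 0x00000003, e.g. 0x0000000000000003 or
 * 0xdeadbeef00000003, but not 0x0000000000000004.
 */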


ALU Sanitation

ALU Sanitation is a feature that was introduced to supplement the static range tracking of the verifier. The idea is to prevent OOB memory accesses if the value of registers do not fall within their expected range during runtime. This was added to help mitigate potential vulnerabilities in the verifier and protect against speculative attacks.

For every arithmetic operation that involves a pointer and a scalar register, an alu_limit is calculated. This represents the maximum absolute value that can be added to or subtracted from the pointer [4]. Before each of these operations, the bytecode is patched with the following instructions:

*patch++ = BPF_MOV32_IMM(BPF_REG_AX, aux->alu_limit);
*patch++ = BPF_ALU64_REG(BPF_SUB, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_REG(BPF_OR, BPF_REG_AX, off_reg);
*patch++ = BPF_ALU64_IMM(BPF_NEG, BPF_REG_AX, 0);
*patch++ = BPF_ALU64_IMM(BPF_ARSH, BPF_REG_AX, 63);
*patch++ = BPF_ALU64_REG(BPF_AND, BPF_REG_AX, off_reg);

Note that off_reg represents the scalar register being added to the pointer register, and BPF_REG_AX represents the auxiliary register.

The above instructions do the following:

  1. The value of alu_limit is loaded into BPF_REG_AX.

  2. The value of off_reg at runtime is subtracted from alu_limit and stored into BPF_REG_AX. If off_reg > alu_limit, the highest bit of BPF_REG_AX is set (the sign bit).

  3. If the difference stored in BPF_REG_AX is positive and off_reg is negative, indicating that alu_limit and the register’s value have opposing signs, the BPF_OR operation will set the sign bit.

  4. The BPF_NEG operation negates BPF_REG_AX, flipping the sign bit: if the sign bit was set, it becomes 0, and if not, it becomes 1.

  5. The BPF_ARSH operation does an arithmetic right shift of 63 bits. This fills BPF_REG_AX with either all 0s or 1s, the value of the sign bit.

  6. Depending on the result of the above operation, the BPF_AND operation will either null out off_reg or leave it unchanged.

This means that if off_reg exceeds alu_limit, or if off_reg and alu_limit have opposing signs, the value of off_reg will be replaced with 0, nulling the pointer arithmetic operation.
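
To see how this works arithmetically, here is a small user space model of the patched sequence (my own sketch, not the kernel implementation), which returns the value that actually gets used in the pointer arithmetic:

#include <stdint.h>

/* Model of the ALU sanitation sequence: returns off_reg unchanged when
 * 0 <= off_reg <= alu_limit, and 0 otherwise, mirroring the patched
 * bytecode above.
 */
static uint64_t sanitize_off_reg(uint64_t alu_limit, uint64_t off_reg)
{
    uint64_t ax = alu_limit;

    ax -= off_reg;                       /* BPF_SUB: sign bit set if off_reg > alu_limit */
    ax |= off_reg;                       /* BPF_OR: sign bit also set if off_reg is negative */
    ax = -ax;                            /* BPF_NEG: flips the sign bit */
    ax = (uint64_t)((int64_t)ax >> 63);  /* BPF_ARSH 63: all ones or all zeroes */
    return ax & off_reg;                 /* BPF_AND: off_reg, or 0 if the check failed */
}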

alu_limit Computation:


The way alu_limit is calculated was recently updated [15]. The new implementation may not have been adopted yet by some Linux distributions. For completeness, I will cover both, and revisit why the differences matter as they become relevant in the next sections.


Previously:

The alu_limit is determined by the boundaries of the pointer register. Meaning, if the pointer register points to the beginning of a map, the alu_limit for subtraction is 0, and the alu_limit for addition is equal to the size of the map (minus 1). The alu_limit is updated with subsequent operations on the pointer register.


Now:

The alu_limit is determined by the boundaries of the offset register. Meaning, the value of the offset register at runtime is compared against that register’s boundaries as computed during the verifier’s static range tracking.


My initial knowledge of the eBPF verifier came from this excellent blog post by Manfred Paul detailing his exploitation of CVE-2020-8835. I highly recommend checking it out!

The Vulnerability

We now have enough background to do a root cause analysis of CVE-2021-3490.

Recall that the eBPF instruction set can operate on both the entire 64 bits of registers or just the lower 32 bits. For this reason, the verifier range tracking contains separate bounds for the lower 32 bits of a register: {u,s}32_{min,max}_value.

These bounds are updated for every operation. Each operation has two tracking functions: a 64 bit and a 32 bit counterpart. Both are called for a 64 bit operation in the function adjust_scalar_min_max_vals.

/* WARNING: This function does calculations on 64-bit values, but
 * the actual execution may occur on 32-bit values. Therefore, things
 * like bitshifts need extra checks in the 32-bit case.
 */
static int adjust_scalar_min_max_vals(struct bpf_verifier_env *env,
                                      struct bpf_insn *insn,
                                      struct bpf_reg_state *dst_reg,
                                      struct bpf_reg_state src_reg)
{
...
        case BPF_AND:
                dst_reg->var_off = tnum_and(dst_reg->var_off, src_reg.var_off);
                scalar32_min_max_and(dst_reg, &src_reg);
                scalar_min_max_and(dst_reg, &src_reg);
                break;
        case BPF_OR:
                dst_reg->var_off = tnum_or(dst_reg->var_off, src_reg.var_off);
                scalar32_min_max_or(dst_reg, &src_reg);
                scalar_min_max_or(dst_reg, &src_reg);
                break;
        case BPF_XOR:
                dst_reg->var_off = tnum_xor(dst_reg->var_off, src_reg.var_off);
                scalar32_min_max_xor(dst_reg, &src_reg);
                scalar_min_max_xor(dst_reg, &src_reg);
                break;
...
}

The bug, CVE-2021-3490, is found in the 32 bit tracking function for BPF_AND, BPF_OR, and BPF_XOR operations. It is the same in each of the functions.


Let’s take a look at an excerpt of the offending code for BPF_AND:

static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
                                 struct bpf_reg_state *src_reg)
{
    bool src_known = tnum_subreg_is_const(src_reg->var_off);
    bool dst_known = tnum_subreg_is_const(dst_reg->var_off);
    struct tnum var32_off = tnum_subreg(dst_reg->var_off);
    s32 smin_val = src_reg->s32_min_value;
    u32 umax_val = src_reg->u32_max_value;


    /* Assuming scalar64_min_max_and will be called so its safe
    * to skip updating register for known 32-bit case.
    */
    if (src_known && dst_known)
        return;
...
}

As shown in the code snippet above, if the lower 32 bits of both the source and destination register are known, the function skips updating the 32 bit bounds.


The comment above the return states that this is OK, because the 64 bit counterpart will take care of it. Let’s take a look:

static void scalar_min_max_and(struct bpf_reg_state *dst_reg,
                              struct bpf_reg_state *src_reg)
{
    bool src_known = tnum_is_const(src_reg->var_off);
    bool dst_known = tnum_is_const(dst_reg->var_off);
    s64 smin_val = src_reg->smin_value;
    u64 umin_val = src_reg->umin_value;

    if (src_known && dst_known) {
            __mark_reg_known(dst_reg, dst_reg->var_off.value);
            return;
    }
  ...
}

Indeed, we can see that if src_known and dst_known are true, the function __mark_reg_known will be called. Can you spot the problem?


In scalar32_min_max_and, the _known variables are calculated using tnum_subreg_is_const. The 64 bit counterpart, scalar_min_max_and, uses tnum_is_const. The difference is that the former returns true if the lower 32 bits of the register are known constants, and the latter returns true only if the entire 64 bits are constant. If the operation involves registers where the lower 32 bits are known but the upper 32 bits are unknown, the assumption stated in the comment is violated.


In the function adjust_scalar_min_max_vals, before returning, the bounds of the destination register are updated a last time by calling the following three functions:

__update_reg_bounds(dst_reg);
__reg_deduce_bounds(dst_reg);
__reg_bound_offset(dst_reg);
return 0;

Each of these functions has 32 and 64 bit counterparts. I’ll just cover the 32 bit case, since that is what the bug affects.


First, the register’s bounds are updated using the current bounds and var_off.

static void __update_reg32_bounds(struct bpf_reg_state *reg)
{
    struct tnum var32_off = tnum_subreg(reg->var_off);

    /* min signed is max(sign bit) | min(other bits) */
    reg->s32_min_value = max_t(s32, reg->s32_min_value,
                               var32_off.value | (var32_off.mask & S32_MIN));
    /* max signed is min(sign bit) | max(other bits) */
    reg->s32_max_value = min_t(s32, reg->s32_max_value,
                               var32_off.value | (var32_off.mask & S32_MAX));
    reg->u32_min_value = max_t(u32, reg->u32_min_value, (u32)var32_off.value);
    reg->u32_max_value = min(reg->u32_max_value,
                             (u32)(var32_off.value | var32_off.mask));
}

Notice that the min bounds are set to either the current min or the known value of the register, whichever is larger. Similarly, the max bounds are set to either the current max or the known value of the register, whichever is smaller.

Then, the signed and unsigned bounds are used to update each other in __reg32_deduce_bounds.

/* Uses signed min/max values to inform unsigned, and vice-versa */
static void __reg32_deduce_bounds(struct bpf_reg_state *reg)
{
    /* Learn sign from signed bounds.
     * If we cannot cross the sign boundary, then signed and unsigned bounds
     * are the same, so combine.  This works even in the negative case, e.g.
     * -3 s<= x s<= -1 implies 0xf...fd u<= x u<= 0xf...ff.
     */
    if (reg->s32_min_value >= 0 || reg->s32_max_value < 0) {
        reg->s32_min_value = reg->u32_min_value =
            max_t(u32, reg->s32_min_value, reg->u32_min_value);
        reg->s32_max_value = reg->u32_max_value =
            min_t(u32, reg->s32_max_value, reg->u32_max_value);
        return;
    }
...
}



Finally, the unsigned bounds are used to update var_off in __reg_bound_offset.

static void __reg_bound_offset(struct bpf_reg_state *reg)
{
    struct tnum var64_off = tnum_intersect(reg->var_off,
                                           tnum_range(reg->umin_value,
                                                      reg->umax_value));
    struct tnum var32_off = tnum_intersect(tnum_subreg(reg->var_off),
                                           tnum_range(reg->u32_min_value,
                                                      reg->u32_max_value));

    reg->var_off = tnum_or(tnum_clear_subreg(var64_off), var32_off);
}

Here,

  • tnum_range returns a tnum representing the possible values given a range of unsigned integers.

  • tnum_intersect takes two tnums and combines the knowledge conveyed by both into a single tnum.

Let’s go through the steps using an example so we can understand why this is a critical vulnerability.

Suppose we have the instruction BPF_ALU64_REG(BPF_AND, R2, R3). This instruction performs an AND operation on registers R2 and R3 and saves the results in R2.

  • R2 has var_off = {mask = 0xFFFFFFFF00000000; value = 0x1}, meaning the lower 32 bits are known to have a value of 1, and the upper 32 bits are unknown. Because the lower 32 bits of the register are known, its 32 bit bounds are equal to that value.

  • R3 has var_off = {mask = 0x0; value = 0x100000002}, meaning the entire 64 bits are known and equal to 0x100000002.

The steps to update the 32 bit bounds of R2 are as follows:

  1. As shown in the BPF_AND case of the adjust_scalar_min_max_vals snippet, the function tnum_and is called. This performs an AND operation and saves the result in var_off of the destination register, R2. Recall, the lower 32 bits of both registers are known, and all of the bits of R3 are known: in its upper half, only bit 32 (counting from 0) is set. This means that R2 is left with var_off = {mask = 0x100000000; value = 0x0}: the lower 32 bits are known to be 0 because 2 & 1 = 0, and every upper bit except bit 32 is known to be 0, since bit 32 is the only bit set in the upper half of R3.

  2. On the next line, scalar32_min_max_and is called. We already know that this function will return immediately and make no changes to the bounds, because the lower 32 bits of both registers are known.

  3. Then __update_reg32_bounds is called. This will set u32_max_value = 0, because min(u32_max_value = 1, var32_off.value | var32_off.mask = 0) = 0. It will leave u32_min_value = 1, because max(u32_min_value = 1, var32_off.value = 0) = 1. The same goes for the signed bounds.

  4. The functions __reg32_deduce_bounds and __reg_bound_offset will not make any changes to the bounds.

Now we can see that in this case, we are left with a register where {u,s}32_max_value = 0 < {u,s}32_min_value = 1!
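
For reference, one way a program might construct registers in this state (a sketch derived from the example above, assuming BPF_REG_0 already holds a non-NULL map value pointer returned by the bpf_map_lookup_elem helper):

BPF_LDX_MEM(BPF_DW, BPF_REG_2, BPF_REG_0, 0)   /* r2 = unknown 64 bit value from the map */
BPF_ALU64_IMM(BPF_LSH, BPF_REG_2, 32)          /* upper 32 bits unknown, lower 32 bits = 0 */
BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, 1)           /* lower 32 bits now known to be 1 */
BPF_LD_IMM64(BPF_REG_3, 0x100000002)           /* r3 = constant 0x100000002 */
BPF_ALU64_REG(BPF_AND, BPF_REG_2, BPF_REG_3)   /* leaves 32 bit bounds with max = 0 < min = 1 */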


Now, let’s look at the patch.

@@ -7084,11 +7084,10 @@ static void scalar32_min_max_and(struct bpf_reg_state *dst_reg,
         s32 smin_val = src_reg->s32_min_value;
         u32 umax_val = src_reg->u32_max_value;
 
-        /* Assuming scalar64_min_max_and will be called so its safe
-         * to skip updating register for known 32-bit case.
-         */
-        if (src_known && dst_known)
+        if (src_known && dst_known) {
+                __mark_reg32_known(dst_reg, var32_off.value);
                 return;
+        }


Above we can see that now, __mark_reg32_known is called on the destination register before returning, if the lower 32 bits of the source and destination registers are known constants.

Why does this matter? Let’s take a look at what __mark_reg32_known does:

/* Mark the unknown part of a register (variable offset or scalar
 * value) as known to have the value @imm.
 */
static void __mark_reg32_known(struct bpf_reg_state *reg, u64 imm)
{
    reg->var_off = tnum_const_subreg(reg->var_off, imm);
    reg->s32_min_value = (s32)imm;
    reg->s32_max_value = (s32)imm;
    reg->u32_min_value = (u32)imm;
    reg->u32_max_value = (u32)imm;
}

The function above sets all of the 32 bit boundaries to the value of the register’s lower 32 bits, which are known to be constant. The correct bounds will be preserved when the final boundary updating functions are called, fixing the bug.


Debugging eBPF Programs


Before getting into exploitation, I’ll briefly cover a couple of ways to debug eBPF programs when writing exploits. I specifically say “when writing exploits”, because there are many tools available to help write and debug eBPF programs for legitimate uses, via emulation. One such tool is rbpf [10], which is a Rust user space virtual machine for eBPF. However, because I exploited a bug in the eBPF verifier, it was important to be able to debug it directly in the kernel to replicate the exact behavior. Additionally, I wrote the bytecode by hand (as opposed to using a toolchain to compile a program into bytecode) so it made using these tools less practical.

Verifier Log Output


When loading an eBPF program, you have the option to specify a log level and a buffer to receive log output from the verifier.

char verifier_log_buff[0x200000] = {0};

union bpf_attr prog_attrs =
{
    .prog_type = BPF_PROG_TYPE_SOCKET_FILTER,
    .insn_cnt = cnt,
    .insns = (uint64_t)insn,
    .license = (uint64_t)"",
    .log_level = 2,
    .log_size = sizeof(verifier_log_buff),
    .log_buf = (uint64_t)verifier_log_buff
};
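
The attribute struct above is then passed to the BPF_PROG_LOAD command of bpf(). A minimal sketch (assuming the usual headers such as <sys/syscall.h>, <unistd.h>, and <stdio.h>, and that insn and cnt hold the bytecode array and instruction count):

int prog_fd = syscall(SYS_bpf, BPF_PROG_LOAD, &prog_attrs, sizeof(prog_attrs));

/* On rejection, the verifier's reasoning is in the log buffer */
if (prog_fd < 0)
    printf("verifier output:\n%s\n", verifier_log_buff);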


This is useful for debugging loading failures when the verifier rejects the program.

34: (bf) r6 = r3
35: R0_w=invP0 R2_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R3_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R4_w=invP0 R5_w=invP4294967298 R6_w=map_value(id=0,off=0,ks=4,vs=4919,imm=0) R7_w=invP(id=0) R10=fp0 fp-8=mmmm????
35: (7b) *(u64 *)(r2 +8) = r6
R6 leaks addr into map

Example verifier log output


Keep in mind that if you choose to provide a log output buffer, it should be large enough to receive the entire output. If it is not large enough, loading the program will fail, even if the program is valid and passes the verifier checks. This is especially likely with long programs, since the verifier prints output for every instruction on each path it traverses. I ran into this issue while writing my exploit, so I thought it was worth noting.

Runtime Debugging