i got handed a zig binary called llvm_and_you. statically linked, not stripped, challenge description yapping something about a compiler bug. i throw it in IDA and the main function is 54 instructions that end in jmp rax. cool. love that.

it doesn’t even run

$ ./llvm_and_you test
Illegal instruction (core dumped)

so gdb tells me it's dying at vpbroadcastb %esi, %xmm0 inside memset. that's an EVEX-prefixed instruction from AVX-512. my machine has AVX2 but not AVX-512, and apparently zig's compiler decided that a basic memset needs bleeding-edge vector extensions (or, more likely, that was the point). 369 EVEX instructions and 122 VEX-encoded mask operations throughout the binary… great.

ended up writing a gdb python script that catches every SIGILL and emulates the offending instruction in software. turns out only six instruction types actually appear at runtime:

| instruction | what it does | encoding |
| --- | --- | --- |
| vpbroadcastb reg, xmm/ymm | fill vector with byte from GPR | EVEX 0F38 7A |
| vptestnmb ymm, ymm, k | find zero bytes -> mask register | EVEX 0F38 26 |
| vmovdqu8 ymm, ymm{k} | masked byte move | EVEX 0F 6F |
| vpternlogq xmm, xmm, xmm, imm8 | ternary logic (XOR when imm=0x96) | EVEX 0F3A 25 |
| kortestd k, k | OR two mask regs, set EFLAGS | VEX 0F 98 |
| kmovd k, mem / mem, k | move to/from mask registers | VEX 0F 90/91 |

the emulator parses EVEX/VEX3 prefixes, decodes ModR/M + SIB, and reads/writes XMM/YMM bytes and mask registers through gdb. about 500 traps per execution, ~5 seconds per run. it's awfully slow, but it works.
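for flavor, here's roughly what the prefix-parsing half of that script has to do. this is a from-scratch sketch (field layout per the Intel SDM, R/X/B/V' extension bits skipped for brevity; the example bytes are my hand-assembled vpbroadcastb xmm0, esi, not pulled from the binary):

```python
def parse_evex(insn: bytes) -> dict:
    """split a 4-byte EVEX prefix (0x62 P0 P1 P2) into the fields
    an emulator cares about"""
    assert insn[0] == 0x62, "not EVEX-prefixed"
    p0, p1, p2 = insn[1], insn[2], insn[3]
    return {
        "map":    p0 & 0x03,                  # 1=0F, 2=0F38, 3=0F3A
        "w":      (p1 >> 7) & 1,
        "vvvv":   ~(p1 >> 3) & 0x0F,          # extra source reg (stored inverted)
        "pp":     p1 & 0x03,                  # 0=none, 1=66, 2=F3, 3=F2
        "z":      (p2 >> 7) & 1,              # zeroing vs merging masking
        "vlen":   128 << ((p2 >> 5) & 0x03),  # L'L: 128/256/512-bit
        "aaa":    p2 & 0x07,                  # opmask register k0-k7
        "opcode": insn[4],
    }

# hand-assembled vpbroadcastb xmm0, esi (EVEX.66.0F38.W0 7A /r);
# double-check against a real disassembler before trusting it
fields = parse_evex(bytes([0x62, 0xF2, 0x7D, 0x08, 0x7A, 0xC6]))
```

the actual script also has to chew through ModR/M and SIB after this, but the prefix fields above are what tell you which of the six instruction types you're looking at.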

annoying note: gdb’s set args eats shell metacharacters. } # $ & all get swallowed. had to make a wrapper script that reads the argument from a file:

#!/bin/sh
exec "$1" "$(cat /tmp/_flag_arg.txt)"

the nop

so the binary runs now. time to figure out what this obfuscation actually is.

i open llvm_and_you.main in IDA. 54 instructions. the first thing that catches my eye is a single nop at 0x100ce2c, sitting between what looks like a perfectly normal function prologue and a block of bizarre arithmetic that ends in jmp rax.

; this all makes sense, it's a standard prologue that loads argc/argv
push    rbp
mov     rbp, rsp
push    r15 / r14 / r13 / r12 / rbx
and     rsp, 0xFFFFFFFFFFFFFFE0
sub     rsp, 0xDC0
mov     rax, [os_argv_1]     ; argc
sub     rax, 2
setz    al
mov     [rsp+0x60F], al
mov     rax, [os_argv_0]     ; argv
mov     [rsp+0x600], rax

; and then... this?
nop

; what follows is 30+ instructions of multiply-shift-XOR arithmetic
; ending in jmp rax

i scroll down past the jmp rax. more code. another block of normal-looking instructions, another nop, another block of the same arithmetic-then-jump pattern. and again. and again… 💀 every single basic block in the function ends the same way with real code, then a nop, then a templated chunk of arithmetic that computes a target address and jumps to it.

from various clues and the name of the challenge I gathered it was some form of LLVM control flow obfuscation. every basic block’s successor got replaced with an indirect computed jump through a dispatch table. and that lone nop is the seam (presumably a clue to make it feasible to do within a CTF like pwnEd).

flowchart TB
  real["actual code"]
  nop["NOP"]
  disp["dispatcher (we think): load state, hash, table lookup, XOR decrypt, jmp rax"]
  real --- nop --- disp
  style nop fill:#f96,stroke:#333,color:#000

dispatcher

i spent a while staring at the dispatcher code before i understood what it was doing. it’s actually kinda elegant in a dumb way. here’s the full thing:

nop                                       ; clue (thank god)

mov     rcx, [qword_1040330]              ; load a 64-bit "state" from .data,  every block seems to use a DIFFERENT global

; hash the state into a table index
mov     rax, rcx
shr     rax, 0x2F                         ; grab top 17 bits
imul    rax, 0x1A4                        ; x 420 (lol)
add     rax, 0x1D7                        ; + 471

; reduce mod 581
mov     r8, 0x70CC728FA459E3              ; magic reciprocal for div 581
mulx    rdx, rdx, r8                      ; high 64 bits of 128-bit multiply
imul    edx, 0x245                        ; x 581
sub     eax, edx                          ; remainder

; decrypt the target
mov     rax, [dispatch_table + rax*8]     ; look up encrypted pointer
xor     rax, rcx                          ; XOR with state
xor     rax, 1                            ; flip low bit (why??)
bzhi    rax, rax, rsi                     ; mask to 47 bits (userspace range)

jmp     rax                               ; gone

the dispatch table is 581 entries of u64 at 0x1040D70, shared across every dispatcher. each basic block has its own state global (qword_1040330, qword_10402A8, qword_10400C0, etc.), all at different .data addresses, all initialized at load time. the multiplier and offset (420/471 for the first block, 390/171 for the next, etc.) change per block, but the mod-581 reduction and the table are always the same.

the 0x70CC728FA459E3 magic number is a trick I had seen before for x % 581 without a division instruction. mulx gives the high 64 bits of a 128-bit multiply, which is effectively x / 581. multiply back by 581, subtract, you’ve got your remainder. same thing gcc seems to emit.
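the trick is easy to sanity-check in python. magic constant straight from the dispatcher; the input bound is my own reasoning from the 17-bit hash input:

```python
MAGIC = 0x70CC728FA459E3   # ceil(2**64 / 581): 581 * MAGIC == 2**64 + 47

def div581(x: int) -> int:
    # mulx: keep only the high 64 bits of the 128-bit product
    return (x * MAGIC) >> 64

def mod581(x: int) -> int:
    # the imul edx, 0x245 / sub eax, edx pair from the dispatcher
    return x - div581(x) * 581
```

the +47 error term means the quotient stays exact for any x below roughly 2**58, and the hash input here tops out around (2**17 - 1) * mult + off, so it never comes close.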

bzhi is a BMI2 instruction that zeros all bits at and above a given position. here they load 0x2F (47) into a register first, so it masks the result to the lower 47 bits. valid userspace address space on x86-64. clever touch.

branchless conditionals

some blocks need to branch. the obf handles this by computing TWO targets and doing a branchless select:

flowchart TD
    S1["decrypt(state_global_1)"] --> T1["target_true"]
    S2["decrypt(state_global_2)"] --> T2["target_false"]
    T1 --> MUX
    T2 --> MUX
    C["condition byte<br/>(setz/setb before NOP)"] --> MUX
    MUX["target_true × cond + target_false × (1 − cond)"] --> J(["jmp rax"])
; compute both targets with the same dance...
; rax = target_true, rcx = target_false
movzx   edx, sil              ; sil = condition (0 or 1, from setz above)
mov     rdx, rsi
xor     rdx, 1                ; !condition
imul    rax, rsi              ; target_true x condition
imul    rcx, rdx              ; target_false x !condition
add     rax, rcx
jmp     rax

costs ~170 bytes per conditional block instead of ~80 for unconditional, but hey, security.
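the select itself is just arithmetic; the imul/imul/add dance boils down to this (toy sketch, obviously):

```python
def branchless_select(cond: int, target_true: int, target_false: int) -> int:
    """cond is the 0-or-1 setz byte: exactly one product survives,
    the other is multiplied to zero, and the add merges them"""
    return target_true * cond + target_false * (1 - cond)
```

no flags, no conditional jump, nothing for a naive CFG recovery pass to latch onto.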

realizing things

here’s the thing though: those state globals are initialized at load time and nothing ever writes to them. the dispatch table is read-only. the multipliers, offsets, everything is a compile-time constant. there is zero runtime variability lol.

the entire control flow graph is recoverable without running the binary.

putting stuff together

finding dispatchers

the NOP marks the seam, so we scan .text for 0x90 and check whether what follows matches the dispatcher template: specifically, a 48 8B 05/48 8B 0D-style RIP-relative qword load targeting a .data address, before hitting an FF E0 (jmp rax).

import struct

for i in range(len(text_data)):
    if text_data[i] != 0x90:
        continue

    state_loads = []
    for j in range(i + 1, min(i + 210, len(text_data) - 1)):
        b = text_data[j:j+7]
        # mov rax/rcx/rsi/rdi, [rip+disp32]: 48 8B /r with mod=00, rm=101
        if len(b) == 7 and b[0] == 0x48 and b[1] == 0x8B and b[2] in (0x05, 0x0D, 0x35, 0x3D):
            disp = struct.unpack_from('<i', b, 3)[0]
            state_va = text_va + j + 7 + disp
            if 0x103F000 <= state_va <= 0x1042000:  # in .data?
                targets = resolve(state_va)
                if targets:
                    state_loads.append(targets)
        if text_data[j] == 0xFF and text_data[j+1] == 0xE0:  # jmp rax
            break

one state load before the jmp rax means unconditional, two means conditional. scanning the binary turned up 686 NOP bytes in .text, of which 470 were followed by a valid dispatcher pattern: 348 unconditional, 122 conditional.

the lazy way to resolve targets

i didn’t bother reimplementing the hash. why would i? there’s only 581 table entries. just try them all haha:

def resolve(state_va):
    state = read_u64(state_va)
    decrypted = ((entry ^ state ^ 1) & MASK47 for entry in table)
    return [t for t in decrypted if TEXT_START <= t <= TEXT_END]

581 XOR operations don't take long at all, and every time exactly ONE entry produces a valid .text address. the rest XOR to garbage in kernel space or unmapped regions. all that hash/modular arithmetic is just a fancy index computation; since we already know the state value, we can skip it entirely.
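a toy reconstruction of why this works. everything here (addresses, state value, slot 123) is made up, but the decrypt-and-filter logic is the same as the snippet above:

```python
MASK47 = (1 << 47) - 1
TEXT_START, TEXT_END = 0x1000000, 0x1040000   # toy .text range, not the real binary's

state = 0xDEADBEEF_CAFEF00D                   # pretend per-block state global
real_target = 0x100DF07                       # pretend successor address

# 580 decoy entries that decrypt to addresses far above .text,
# plus one slot holding the actual encrypted successor
table = [(0x4000_0000_0000 + i) ^ state ^ 1 for i in range(581)]
table[123] = real_target ^ state ^ 1

hits = [
    t for t in ((e ^ state ^ 1) & MASK47 for e in table)
    if TEXT_START <= t <= TEXT_END
]
```

in the real binary the other 580 entries are presumably other blocks' targets encrypted under their own states, which is why decrypting with the wrong state lands you in garbage.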

patching

for each dispatcher, overwrite the NOP and the first few bytes of dispatcher code with a direct jump. the old dispatcher bytes stay as dead code. IDA ignores unreachable instructions after an unconditional jump.

# unconditional: 6 bytes
patch = b'\x90' + b'\xe9' + struct.pack('<i', target - (nop_va + 6))

# conditional: 12 bytes
patch = b'\x90'
patch += b'\x0f\x84' + struct.pack('<i', true_target - (nop_va + 7))   # JE
patch += b'\xe9' + struct.pack('<i', false_target - (nop_va + 12))     # JMP

(je for the true branch: setz writes 1 when ZF=1, the dispatcher picks target_true when that byte is 1, and je jumps when ZF=1, so it all lines up.)
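the displacement math is worth a round-trip check. encode_uncond mirrors the 6-byte patch above; decode is just the rel32 arithmetic run backwards (function names are mine):

```python
import struct

def encode_uncond(nop_va: int, target: int) -> bytes:
    # NOP, then E9 rel32; rel32 is relative to the END of the 5-byte jmp
    return b'\x90\xe9' + struct.pack('<i', target - (nop_va + 6))

def decode_jmp(va: int, code: bytes) -> int:
    # invert it: absolute target = end-of-instruction + rel32
    assert code[1] == 0xE9
    return va + 6 + struct.unpack_from('<i', code, 2)[0]

patch = encode_uncond(0x100CE2C, 0x100DF07)
```

getting the "+6" wrong by even one byte sends every patched jump into the middle of an instruction, so the round trip is a cheap sanity check before writing anything back.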

also TIL: do NOT fill the leftover dispatcher bytes with int3. tried that. IDA saw the padding, lost track of the stack frame, and decided the function had a 660MB stack. just leave the dead code alone lmao.

flowchart LR
    subgraph before["before: obfuscated"]
        direction TB
        A1["block A"] --> D1["dispatcher -> jmp rax"]
        B1["block B"] --> D2["dispatcher -> jmp rax"]
        C1["block C"] --> D3["dispatcher -> jmp rax"]
        D1 -.->|"???"| B1
        D2 -.->|"???"| C1
        D3 -.->|"???"| A1
    end
    subgraph after["after: deobfuscated"]
        direction TB
        A2["block A"] -->|"jmp"| B2["block B"]
        B2 -->|"jmp"| C2["block C"]
        C2 -->|"jmp"| A2
    end

wrestling IDA into submission

loaded the deobfuscated binary in IDA and… it doesn’t see the function. the jump targets scatter across a wide address range and IDA hasn’t analyzed them as code. had to write a stupid IDA Python script that BFS-walks from the entry point, force-creates code at every reachable target, nukes any overlapping auto-detected functions, and defines the function manually.

then i hit F5 and got “decompilation failure: stack frame too big.” of course. hex-rays can’t lift the AVX-512 instructions (vptestnmb, kortestd, kmovd) to microcode (or i had some other unknown issue), and the dead dispatcher code between live blocks confuses its stack analysis. NOPed out 10 AVX-512 instructions, 21 YMM operations, and 3413 bytes of dead code. hex-rays finally cooperated:

void llvm_and_you_main(__m128 _XMM0) {
    if (os_argv_1 == 2) {
        v30 = *(argv + 8);    // flag string pointer
        if ((v30 & 0xFFF) == 0xFE1 || (v30 & 0x1F) == 0)
            while (1);        // bad alignment -> hang
    }
    // substitution cipher?? ...
    v27 = lookup_table[v28];
    _EDX = 37 * v27;
    vpbroadcastb(xmm0, edx); vpaddb(...); vpternlogq(..., 0x96);
    // horizontal OR -> output array
}

messy output cuz too many instructions got NOPed. but the deobfuscated disassembly is perfectly clean. so i just read the asm directly.

figuring out the cipher thing

the cipher block at 0x100df07 processes one flag character at a time. i traced through it instruction by instruction and found two parallel computations running side by side.

from what I understood, the SIMD path handles bytes 1–4 of the output. it broadcasts (37 × pos) & 0xFF across an XMM register, adds a constant vector from .rodata ([0x25, 0x4A, 0x6F, 0x94]), then does a three-way XOR (vpternlogq with imm8=0x96) against the flag character and another .rodata vector ([0x7C, 0x9D, 0xED, 0xBC]). the results get zero-extended to qwords and shifted left by [8, 16, 24, 32] bits to land at the right byte positions.
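vpternlogq's imm8 is literally a 3-input truth table: each result bit looks up the immediate at the index formed by the three input bits. 0x96 happens to be the XOR3 table, which is easy to confirm:

```python
def ternlog_bit(a: int, b: int, c: int, imm8: int) -> int:
    """one-bit vpternlog: the three input bits index into the
    8-bit truth table in the immediate"""
    return (imm8 >> ((a << 2) | (b << 1) | c)) & 1

# 0x96 = 0b10010110: set exactly when an odd number of inputs are set,
# i.e. a XOR b XOR c
xor3_table = [ternlog_bit(a, b, c, 0x96)
              for a in (0, 1) for b in (0, 1) for c in (0, 1)]
```

(XOR is symmetric, so it doesn't matter which operand maps to which index bit here; for asymmetric truth tables it would.)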

keep in mind during the actual ctf my approach was just pure bruteforce lol

vpbroadcastb xmm0, edx            ; broadcast (37*pos) & 0xFF
vmovdqa xmm1, [0x1000330]         ; [0x25, 0x4a, 0x6f, 0x94, ...]
vpaddb  xmm2, xmm0, xmm1         ; position-dependent offsets
vpbroadcastb xmm1, edx            ; broadcast flag_char
vmovdqa xmm0, [0x1000480]         ; [0x7c, 0x9d, 0xed, 0xbc, ...]
vpternlogq xmm0, xmm1, xmm2, 0x96 ; three-way XOR
vpmovzxbq ymm0, xmm0              ; bytes -> qwords
vpsllvq ymm0, ymm0, [0x1000300]   ; shift left by [8, 16, 24, 32]

the scalar path handles bytes 0, 5, 6, and 7. same operation, different constants, done with plain XOR instructions:

mov     al, r8b         ; flag_char
xor     al, bl          ; ^ (37*pos)
xor     al, 0xD3        ; byte 0

mov     dl, r8b
xor     dl, sil         ; ^ (37*pos + 0xB9)
xor     dl, 0xF2        ; byte 5

; same pattern for bytes 6 (offset 0xDE, xor 0x93) and 7 (offset 0x03, xor 0x1C)

everything gets shifted to the right byte offset and OR’d into a single 64-bit value:

flowchart LR
    subgraph input
        POS["pos (0-31)"]
        CHAR["flag_char"]
    end
    POS --> MUL["(37 × pos) & 0xFF"]
    MUL --> SIMD
    MUL --> SCALAR
    CHAR --> SIMD
    CHAR --> SCALAR
    subgraph SIMD["SIMD path -> bytes 1-4"]
        direction TB
        S1["vpaddb: add .rodata offsets"]
        S2["vpternlogq: ⊕ char ⊕ .rodata constants"]
        S3["shift to byte positions 1-4"]
        S1 --> S2 --> S3
    end
    subgraph SCALAR["scalar path -> bytes 0, 5, 6, 7"]
        direction TB
        X0["byte 0: char ⊕ key ⊕ 0xD3"]
        X1["byte 5: char ⊕ key ⊕ 0xF2"]
        X2["byte 6: char ⊕ key ⊕ 0x93"]
        X3["byte 7: char ⊕ key ⊕ 0x1C"]
    end
    SIMD --> OR
    SCALAR --> OR
    OR["OR -> 64-bit output block"]

finally

i stared at it long enough and figured it out. every single byte of the 8-byte output block is the exact same operation with different constants:

output_byte[j] = flag_char ^ ((37 * pos + OFFSET[j]) & 0xFF) ^ XOR[j]
| byte | offset | xor | source |
| --- | --- | --- | --- |
| 0 | 0x00 | 0xD3 | scalar |
| 1 | 0x25 | 0x7C | SIMD |
| 2 | 0x4A | 0x9D | SIMD |
| 3 | 0x6F | 0xED | SIMD |
| 4 | 0x94 | 0xBC | SIMD |
| 5 | 0xB9 | 0xF2 | scalar |
| 6 | 0xDE | 0x93 | scalar |
| 7 | 0x03 | 0x1C | scalar |

the offsets are evenly spaced 0x25 apart (37 decimal), the same as the position multiplier; byte 7's 0x03 is just 0xDE + 0x25 wrapped mod 256. the XOR constants come from the two .rodata vectors. the SIMD path computes 4 of these in parallel, the scalar path does the other 4, then they get OR'd together.
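the whole cipher fits in a few lines of python. constants are straight from the table above; the function names and byte layout are my own reconstruction:

```python
OFFSET = [0x00, 0x25, 0x4A, 0x6F, 0x94, 0xB9, 0xDE, 0x03]
XORK   = [0xD3, 0x7C, 0x9D, 0xED, 0xBC, 0xF2, 0x93, 0x1C]

def encrypt_char(ch: int, pos: int) -> int:
    """one flag character -> one 64-bit block, output byte j at bits 8j..8j+7"""
    out = 0
    for j in range(8):
        out |= (ch ^ ((37 * pos + OFFSET[j]) & 0xFF) ^ XORK[j]) << (8 * j)
    return out

def decrypt_block(block: int, pos: int) -> int:
    # every byte decodes to the same character; byte 0 is the cheapest
    return (block & 0xFF) ^ ((37 * pos) & 0xFF) ^ XORK[0]
```

since XOR is its own inverse, decrypting any of the eight bytes with its (offset, xor) pair recovers the same character, which is exactly the redundancy the solve exploits.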

solve

flag_char = output_byte[j] ^ ((37 * pos + OFFSET[j]) & 0xFF) ^ XOR[j]

all eight bytes give the same answer, so just use byte 0: offset 0x00, xor 0xD3:

flag = []
# TARGET: presumably the four expected-output dumps as hex strings,
# 8 qwords each, most significant qword first (hence 7 - pos % 8)
for pos in range(32):
    combo = pos // 8
    block = 7 - (pos % 8)
    target_val = int(TARGET[combo][block*16 : block*16+16], 16)
    byte0 = target_val & 0xFF
    flag.append(chr(byte0 ^ ((37 * pos) & 0xFF) ^ 0xD3))

print("".join(flag))
PWNED{Why_d0n7_w3_s33_m0r3_11vM}