Analyzing Shellcode

Shellcode is a sequence of machine instructions, or opcodes, that can be represented in any format and is generally used as the payload executed once an exploit succeeds. Because it is a list of raw instructions that the CPU understands, it is architecture-specific: x86 Linux shellcode will not work on SPARC Solaris. Example 10-2 is a simple piece of shellcode for Linux x86 platforms. Its only role is to call the execve( ) system call with the arguments needed to execute the /bin/sh program.

Example 10-2. Linux x86 shellcode that executes /bin/sh

\x31\xc0\x68\x2f\x73\x68\xaa\x88\x44\x24\x03\x68\x2f\x62\x69\x6e
\x89\xe3\x50\x53\x89\xe1\xb0\x0b\x33\xd2\xcd\x80\xcc

This example is a good illustration of how small a simple shellcode can be while remaining robust enough to send as part of an HTTP request or in the payload of a custom packet. While it may make little obvious sense to a human, we will discover how it makes perfect sense to a computer.

Understanding a piece of existing shellcode begins with translating its machine instructions into something more human-readable. The best tool for this is a disassembler, an application that translates raw machine code into assembly language. The ndisasm program provided in the Netwide Assembler (nasm) suite of tools is well suited to the job, and it can read raw binary from standard input. Here is the result of disassembling the shellcode from Example 10-2:

$ echo -ne "\x31\xc0\x68\x2f\x73\x68\xaa\x88\x44\x24\x03\x68\x2f\x62\x69\x6e\x89\xe3\x50\x53\x89\xe1\xb0\x0b\x33\xd2\xcd\x80\xcc" | ndisasm -u -
00000000  31C0              xor eax,eax
00000002  682F7368AA        push dword 0xaa68732f
00000007  88442403          mov [esp+0x3],al
0000000B  682F62696E        push dword 0x6e69622f
00000010  89E3              mov ebx,esp
00000012  50                push eax
00000013  53                push ebx
00000014  89E1              mov ecx,esp
00000016  B00B              mov al,0xb
00000018  33D2              xor edx,edx
0000001A  CD80              int 0x80
0000001C  CC                int3

Because the shellcode in Example 10-2 is not in any traditional structure, we have to use the -u parameter to tell ndisasm that the binary input is in 32-bit mode.
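The two push dword operands in the listing above are easier to read once they are laid out as they would sit in memory. The following is a minimal sketch, not part of the original example, that rebuilds the bytes the shellcode leaves on the stack. It assumes the usual x86 little-endian byte order and a stack that grows downward. The 0xaa byte appears to be a placeholder that the mov [esp+0x3],al instruction overwrites with zero, presumably so that the encoded shellcode itself contains no null byte.

```python
import struct

# Immediate operands of the two "push dword" instructions, in execution order.
first_push = 0xAA68732F
second_push = 0x6E69622F

# x86 is little-endian, so each dword lands in memory lowest byte first.
upper = bytearray(struct.pack("<I", first_push))
lower = struct.pack("<I", second_push)

# "mov [esp+0x3],al" runs right after the first push, while al is still 0,
# so the 0xaa placeholder becomes the string terminator.
upper[3] = 0x00

# The stack grows downward: the second push sits at the lower address,
# so it comes first when memory is read upward from esp.
stack_string = bytes(lower) + bytes(upper)
print(stack_string)  # b'/bin/sh\x00'
```

With that string in place, the remaining instructions read naturally: ebx points at the path, ecx points at a two-entry argument array (the pointer followed by a zero), edx is cleared, and al holds 0xb, the execve( ) system call number on 32-bit x86 Linux.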

Unfortunately, ndisasm is tied to the x86 architecture. Another tool that has disassembly capabilities and works on many platforms is GNU objdump from the GNU binutils package. It is supported[29] on many popular architectures, such as i386, MIPS, and SPARC, and it also handles many binary formats, such as ELF, PE, and Mach-O. Like ndisasm, it works with raw instructions devoid of any structure, which makes it well suited to working with shellcode:

$ objdump -m i386 -b binary -D /tmp/shellcode
/tmp/shellcode:     file format binary
Disassembly of section .data:
0000000000000000 <.data>:
   0:   31 c0                   xor    %eax,%eax
   2:   68 2f 73 68 aa          push   $0xaa68732f
   7:   88 44 24 03             mov    %al,0x3(%esp)
   b:   68 2f 62 69 6e          push   $0x6e69622f
  10:   89 e3                   mov    %esp,%ebx
  12:   50                      push   %eax
  13:   53                      push   %ebx
  14:   89 e1                   mov    %esp,%ecx
  16:   b0 0b                   mov    $0xb,%al
  18:   33 d2                   xor    %edx,%edx
  1a:   cd 80                   int    $0x80
  1c:   cc                      int3
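The /tmp/shellcode file passed to objdump has to contain the raw bytes, not the \x escape text printed in Example 10-2. As a minimal sketch, the conversion could be done as follows. The helper name unescape and the use of the system temporary directory are assumptions made for illustration; they are not part of the original example.

```python
import os
import re
import tempfile

# Example 10-2 as it appears in print: text escapes, not raw bytes.
escaped = (
    r"\x31\xc0\x68\x2f\x73\x68\xaa\x88\x44\x24\x03\x68\x2f\x62\x69\x6e"
    r"\x89\xe3\x50\x53\x89\xe1\xb0\x0b\x33\xd2\xcd\x80\xcc"
)


def unescape(text):
    """Turn a string of \\xNN escapes into the bytes it denotes (hypothetical helper)."""
    pairs = re.findall(r"\\x([0-9a-fA-F]{2})", text)
    return bytes(int(pair, 16) for pair in pairs)


raw = unescape(escaped)

# Hypothetical output location; on most Linux systems this is /tmp/shellcode.
path = os.path.join(tempfile.gettempdir(), "shellcode")
with open(path, "wb") as handle:
    handle.write(raw)

print(len(raw), "bytes written to", path)
```

The resulting file can then be given to objdump with the -b binary switch, or piped to ndisasm.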

The -b binary switch instructs objdump to treat the file as raw binary without any format. Because this leaves it ambiguous which platform the instructions are for, we also need to provide the architecture, hence -m i386. The assembly code printed by objdump differs from that printed by ndisasm because the two tools use AT&T and Intel syntax, respectively. If you prefer working with the Intel syntax, the -M intel parameter makes objdump use it. As we can see here, objdump works exactly the same way for other platforms, such as MIPS:

$ objdump -m mips -b binary -D /tmp/shellcode_mips
/tmp/shellcode_mips:     file format binary
Disassembly of section .data:
0000000000000000 <.data>:
   0:   ffff1004        bltzal  zero,0x0
   4:   ab0f0224        li      v0,4011
   8:   55f04620        addi    a2,v0,-4011
   c:   6606ff23        addi    ra,ra,1638
  10:   c2f9ec23        addi    t4,ra,-1598
  14:   6606bd23        addi    sp,sp,1638
  18:   9af9acaf        sw      t4,-1638(sp)
  1c:   9ef9a6af        sw      a2,-1634(sp)
  20:   9af9bd23        addi    sp,sp,-1638
  24:   21208001        move    a0,t4
  28:   2128a003        move    a1,sp
  2c:   cccd4403        syscall 0xd1337
  30:   2f62696e        ldr     t1,25135(s3)
  34:   2f736800        0x68732f

Here we can see that the last eight bytes have been decoded as instructions, although they are data and should have been read as the /bin/sh string. Because a raw binary has no structure, the disassembler has no way to tell data and code apart.
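To check that claim by hand, the last two words of the MIPS listing can be read as characters instead of instructions. The following is a minimal sketch; the printable_ratio helper is a hypothetical, deliberately crude heuristic and not something objdump provides. Many valid instruction encodings also contain printable bytes, so it is a hint at best.

```python
def printable_ratio(chunk):
    """Fraction of the bytes in chunk that are printable ASCII (hypothetical helper)."""
    return sum(0x20 <= b < 0x7F for b in chunk) / len(chunk)


# Bytes copied from the listing, in file order.
code_word = bytes.fromhex("ffff1004")      # first word, a real instruction
tail = bytes.fromhex("2f62696e2f736800")   # last two words

print(tail)  # b'/bin/sh\x00'
print(printable_ratio(code_word), printable_ratio(tail))
```

A disassembler that sweeps linearly through the buffer has no such hint and simply decodes every word it meets.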

Sometimes the existing packaged disassemblers will not do what you need. Whether you want to process the instructions programmatically or write a custom disassembler for a more advanced project, there is no need to start from scratch, because libraries exist to help. One such library is libopcodes, used by the GNU binutils programs to handle assembly language on supported architectures. It is tightly linked with libbfd, which handles binary formats for binutils. Both libopcodes and libbfd can be complicated to use, but it is nice to have a mainstream library that handles many architectures, and using them is still simpler than writing something from scratch.

Example 10-3 is a program that uses libopcodes to disassemble the Linux /bin/sh shellcode from Example 10-2.

Here is the result of compiling and running such a program:

$ gcc -g -o rawdisass rawdisass.c -lopcodes  -lbfd
$ ./rawdisass
       0 : xor    %eax,%eax
       2 : push   $0xaa68732f
       7 : mov    %al,0x3(%esp)
       B : push   $0x6e69622f
      10 : mov    %esp,%ebx
      12 : push   %eax
      13 : push   %ebx
      14 : mov    %esp,%ecx
      16 : mov    $0xb,%al
      18 : xor    %edx,%edx
      1A : int    $0x80
      1C : int3

Even though the libopcodes library supports many architectures and is widely distributed through the GNU binutils package, it is rarely used by other programs that need opcode disassembly. There are several reasons for this: it is difficult to use, it requires many structures to be initialized even for simple disassembly, and, last but not least, it provides no metadata with the disassembled opcodes.

A project that should be mentioned is mammon's libdisasm library. It is a standalone library from the authors of the Bastard disassembly environment. However, libdisasm is more than just a disassembling library: it also provides metadata on the disassembled instructions (e.g., their operands and whether each one is read or written). This makes it easy to perform complex analyses such as data propagation or determining whether two instructions can be swapped.

The libdisasm library can be used from multiple languages. The following is an example using the library with Python. Disassembly is done on special buffers, DisasmBuffer objects, that hold the machine code. Once disassembly has run, the buffer also provides a list of address/opcode pairs. We only have to iterate over it and print its elements. The operands( ) method returns the operand list for the instruction along with the operand-associated metadata.

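#!/usr/bin/env python

import sys
from libdisasm import disasm, disasmbuf

# Wrap the raw machine code read from standard input in a buffer.
dbuf = disasmbuf.DisasmBuffer(sys.stdin.read())

# Disassemble linearly, one instruction after another.
d = disasm.LinearDisassembler()
d.disassemble(dbuf)

# Print each address and instruction, followed by the access flags
# and text of every operand.
for rva, opcode in dbuf.instructions():
    operands = ["%s %-13s" % (x.access(), "[%s]" % str(x))
                for x in opcode.operands()]
    print("%08x: %-20s %s" % (rva, str(opcode), "".join(operands)))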

When applied to a shellcode, this small program will output something like this:

$ ./eggdis.py  < /tmp/binsh.egg
00000000: push 11              r-- [11]         rw- [esp]
00000002: pop eax              -w- [eax]        rw- [esp]
00000003: cdq                  rw- [eax]        -w- [edx]
00000004: push edx             r-- [edx]        rw- [esp]
00000005: push 0x68732F6E      r-- [0x68732F6E] rw- [esp]
0000000a: push 0x69622F2F      r-- [0x69622F2F] rw- [esp]
0000000f: mov ebx, esp         -w- [ebx]        r-- [esp]
00000011: push edx             r-- [edx]        rw- [esp]
00000012: push ebx             r-- [ebx]        rw- [esp]
00000013: mov ecx, esp         -w- [ecx]        r-- [esp]
00000015: int -128             r-- [-128]
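The read/write flags in this output are what make questions such as "can these two instructions be swapped?" tractable. The following is a minimal sketch of that check. It does not use the libdisasm API: the operand lists are written out by hand as (access, name) pairs in the format of the listing above, and the function names are assumptions made for illustration. The mov al / xor edx pair comes from Example 10-2 rather than from this listing. The sketch ignores register aliasing (al versus eax), the flags register, and memory operands, all of which a real analysis would have to handle.

```python
def reads_writes(operands):
    """Split (access, name) pairs into the sets of names read and written."""
    reads = {name for access, name in operands if "r" in access}
    writes = {name for access, name in operands if "w" in access}
    return reads, writes


def can_swap(first, second):
    """True if neither instruction writes something the other reads or writes."""
    r1, w1 = reads_writes(first)
    r2, w2 = reads_writes(second)
    return not (w1 & r2 or w2 & r1 or w1 & w2)


# Operand metadata transcribed from the listing above.
mov_ebx_esp = [("-w-", "ebx"), ("r--", "esp")]
push_edx = [("r--", "edx"), ("rw-", "esp")]

# Hand-written metadata for two instructions of Example 10-2.
mov_al_0xb = [("-w-", "al"), ("r--", "0xb")]
xor_edx_edx = [("rw-", "edx"), ("r--", "edx")]

print(can_swap(mov_ebx_esp, push_edx))    # False: push changes esp, which mov reads
print(can_swap(mov_al_0xb, xor_edx_edx))  # True: no shared operand is written
```

Almost every instruction in the listing touches esp, which is why so few of them can be reordered safely.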


[29] The objdump program you will find on most platforms is usually tailored to those very platforms. But the GNU objdump can also be compiled to handle many other architectures. Debian users can use the binutils-multiarch package.