There are a lot of methods to obfuscate code but they all share the objective of making the code less readable in order to thwart reverse engineering. One common method is inserting junk instructions to blow up code size making it hard to distinguish between meaningful instructions and code that effectively does nothing.
I propose a solution to automatically filter out junk instructions by tracking the execution state. Using Triton we statically analyze a code trace over obfuscated code and keep track of which instruction has which side effect (for example register and stack writes).
At an arbitrary stopping point, we can then use this information to recursively scan backwards for instructions responsible for state changes and mark them as good, or meaningful. This automatically excludes junk instructions as they ought to not make meaningful state changes by definition.
I have implemented a proof-of-concept script in Python that can be run in IDA Pro and statically remove junk instructions given an entry point. It is only about 200 lines of code.
I have been looking for an excuse to play around with obfuscated/virtualized code for a while. Recently I found a malware sample that was heavily obfuscated making reading the code almost impossible. Reports identified the sample to be packed with VMProtect, a commercial virtualization packer. Looking at it in IDA revealed that not only does it use virtualization but also heavily obfuscates the code by inserting junk instructions.
My goal then was to see if I can statically remove those junk instructions in order to be able to just read the code. I did a quick Google search for automatic junk instruction removal but did not find anything really. As I wanted to try myself at the problem anyway, I stopped searching for prior research. It is therefore possible that my approach has been done before and I have just reinveted the wheel. I'd be happy to read about how others solved the problem!
When I first set out to solve the problem, I meant to document the journey of solving the problem because I myself like to read about those the most but quickly noticed it is very hard to document the whole process with all the trials and errors and the thought process behind them. Reverse engineering is just an inherently non-linear process and as such that did not work.
I still hope the article is interesting enough in itself while describing only what worked eventually.
What are junk instructions
If we want to remove junk instructions, the first question to answer is what a junk instruction even is.
Some obvious examples:
lea eax, dword ptr [esp] mov eax, 7
lea is junk as it writes
eax which is overwritten in the very next instruction.
Or using the flags:
clc add eax, num
clc (clear carry flag) is junk because
add sets the carry flag.
In general I decided on the following definition:
A junk instruction is an instruction for which no other instruction cares about its side effects
Basically, an instruction that writes a register, or a flag, or the stack, or the memory but nobody ever reads that value. Above examples are just that, they have a side effect (written register and written CF) but nobody cares about them before they are overwritten.
A very practical question I had to answer first is: Where do I start my automated code analysis, and where do I end? What are the boundaries of code I analyze in order to remove junk instructions.
In the previous section, you may have noticed the emphasis on ever. That is an actual, hard problem, because we need to start and end our analysis for junk instructions somewhere.
In order to dismiss an instruction, we must be certain that no single instruction in the whole code ever cares about any side effects it has. But that would require perfect knowledge of the whole program and I'm sure this sooner or later fails on the halting problem because everything does. In any case, it's commonly known that for computers based on the Von Neumann architecture this is a hard problem.
For very simple programs (think no conditionals, just one basic block) we can easily gain total knowledge of it, but what about a programm where some register is written and only a conditional block inside an
if later on accesses it. Or worse, some conditonal block inside a loop only after the 1000th iteration accesses it. We would need to explore every possible code path and that would explode in our face. I looked into algorithms to find every trace through loops without repeating blocks, which exist but even for simple loops the number of pathes simply exploded.
So if we can't satisfy the ever requirement, what do we do?
This puzzled me a while when I had kind of an epiphany. The trick is thinking about how a code obfuscator knows where to insert junk instructions. Because in order to not change code semantics, the obfuscator itself must be sure that whatever side effects the junk instruction has doesn't matter. So if they want to insert a
clc, they must be sure nothing ever relies on the carry flag afterwards, until some of the original code writes it.
So effectively, they have to solve the very same problem, requiring perfect knowledge, if they want a general solution!
Which is why I assumed they just narrowed the problem down to something like:
We only insert junk instructions before other writing instructions, thus making sure their side effects are meaningless
Basically, they just do
dead stores as the examples above.
The question still standing is, where do we stop to deobfuscate code? The starting point shall be arbitrary for whatever the reverser thinks looks interesting. We also cannot analyze the whole program. Assuming the
dead store hypothesis holds, we could stop at any writing instruction and start finding junk instructions. That would leave junk instructions for which a write happens afterwards.
I chose a different approach. The start for the analysis is arbitrary, and so is the end. I then assume any written register, memory, stack location at that point in time is relevant. In hindsight that is obvious. It will give us on overapproximation because there may be someone overwriting current state later on but that is fine.
So that was a lot of text for "where do we start, where do we end" - wherever.
The final algorithm I chose works like this:
- At some arbitrary point we start tracing
- For every instruction encountered, we track state changes. We save a structure together with every instruction that records what state variable was last written by which instruction, which include side effects the current instruction itself has.
- At some arbitrary point we stop tracing and declare everything in the state structure meaningful or relevant.
After the trace we have to work backwards.
- The state structure of the last instruction holds every state variable and who wrote to it. We mark every instruction recorded inside as meaningful.
- For each meaningful instruction, we query what the instruction reads (e.g. what register it reads, or memory location). We then look up the instruction who last wrote to that state variable and also mark them meaningful as a meaningful instruction depends on them.
- Repeat until no more instructions are implicitly meaningful.
An example probably explains it best.
Here is our made-up obfuscated code:
1 mov eax, 1337 2 mov ecx, eax 3 add ecx, edx 4 mov edi, 42 5 mov ebp, 3 6 lea ebp, dword ptr [eax+3] 7 xor ebp, ebp
We trace over each instruction recording state changes. The state initially is empty, and inside the state we record the instruction who wrote it.
# Instruction State after instruction 0 initial empty 1 mov eax, 1337 eax:1 2 mov ecx, eax eax:1, ecx:2 3 add ecx, edx eax:1, ecx:3 4 mov edi, 42 eax:1, ecx:3, edi:4 5 mov ebp, 3 eax:1, ecx:3, edi:4, ebp:5 6 lea ebp, dword ptr [eax+3] eax:1, ecx:3, edi:4, ebp:6 7 xor ebp, ebp eax:1, ecx:3, edi:4, ebp:7
Now we check the last state and mark every instruction inside good, which would be instructions (1,3,4,7). Then we walk over the list of good instructions and see what they depend on.
- Instruction 1 depends on nothing, done
- Instruction 3 depends on
edxwe have no information, but for
ecxwe look up who wrote to it in the previous' instructions state and find that
ecxwas written by 2. Add 2 to the good list.
- Instruction 4 depends on nothing, done
- Instruction 7 depends on nothing, done
This iteration added 2 to the good list, so we check 2 again:
- Instruction 2 reads
eax, we look up in the previous' instruction state who wrote it and find 1. Instruction 1 is already in the good list, and the algorithm is done.
The final list of meaningful instructions is (1,2,3,4,7), instruction 5 and 6 got implicitly discared because nobody cared about their side effect and we have our deobfuscated code:
1 mov eax, 1337 2 mov ecx, eax 3 add ecx, edx 4 mov edi, 42 7 xor ebp, ebp
We of course do not know if anyone ever cares about
edi being 42, or
ebp being 0, but at this point in time, the instruction 4 and 7 are responsible for their value and as such cannot be discarded safely.
Of course it is not that easy (but at least close). In practice a bunch of things complicate the process.
Problem 1 - Register aliasing
mov eax, 0xFFFFFFFF mov ax, 1337
Both instructions are meaningful. The 2nd one partially overwrites eax, but the upper 16 bit still depend on the first one. Luckily I didn't find any code in my VMProtected binary that does this. Therefore, I was lazy and simply converted each register to its largest form before recording writes. VMProtect does use a lot of
bh and so on but does never seem to alias them, so that works.
Problem 2 - What do we consider state?
This was a bit more difficult. Regarding every register for example as relevant does not work. If we consider
eip relevant at all, every instruction will be marked meaningful as every instruction obviously modifies
eip. I therefore excluded
I also excluded memory writes for now but that should be a straight-forward fix.
But what about
esp. Obviously stack writes are quite relevant but what about changes to
Problem 3 - The stack
This also is not that easy. VMProtect likes to use junk instructions that write to the stack and then later discards the values without using them.
pushad lea esp, dword ptr [esp+xxx]
When I tried to consider
esp as relevant and tracked it, a lot of stack junk instructions kept being included so I decided on the following approach:
espis completely excluded from state tracking
- Stack writes are tracked just as register writes by their address
- Before adding an instruction to the list of instructions, I discard every stack write in the state, if
espis above it.
This means that the above example is handled like this:
Instruction State pushad stack0:1, stack1:1, stack2:1 ... stack7:1 lea esp, dword ptr [esp+40] empty
pushad writes to 8 memory locations on the stack and we track them just like registers.
lea then modifies
esp and moves it up 40 bytes. Now
esp is above all the previously written stack locations and are therefore removed from the state. As a result, the
pushad is considered junk.
This worked surprisingly well and had no issue with ignoring junk instructions that affect the stack.
Finally, I'll talk about my code. From the start I wanted to have a static tool I could run in IDA that gives me readable code. I use Triton, a dynamic binary analysis framework to do the static code tracing. It also provides all the information I need, such as read/written registers, memory locations and so on. It is quite an amazing tool.
I will now write a bit about the evolution of the tool.
First, this is a code trace from the entry point of the VMProtected binary using Triton, in obfuscated form:
100a63e2  push 0x8beef346 100a63e7  push dword ptr [esp] 100a63ea  mov dword ptr [esp + 4], 0x6fdd8840 100a63f2  mov byte ptr [esp], dh 100a63f5  pushal 100a63f6  pushal 100a63f7  mov dword ptr [esp + 0x40], 0x2383346b 100a63ff  mov byte ptr [esp + 8], 0x5d 100a6404  lea esp, dword ptr [esp + 0x40] 100a6408  jmp 0x100c8e85 100c8e85  pushal 100c8e86  pushfd 100c8e87  pushfd 100c8e88  mov dword ptr [esp + 0x24], eax 100c8e8c  push eax 100c8e8d  pushfd 100c8e8e  mov dword ptr [esp + 0x28], edx 100c8e92  pushfd 100c8e93  mov byte ptr [esp + 4], 0x6f 100c8e98  call 0x100c87bd 100c87bd  jmp 0x100c895e 100c895e  mov dword ptr [esp + 0x2c], ebp 100c8962  pushfd 100c8963  call 0x100c8a92 100c8a92  mov dword ptr [esp + 0x30], ebx 100c8a96  mov byte ptr [esp], dh 100c8a99  mov byte ptr [esp], ah 100c8a9c  push edi 100c8a9d  jmp 0x100c8d55 100c8d55  mov dword ptr [esp + 0x30], esi 100c8d59  mov byte ptr [esp + 4], ch 100c8d5d  mov dword ptr [esp + 0x2c], edi 100c8d61  mov word ptr [esp], 0x79d1 100c8d67  mov byte ptr [esp], cl 100c8d6a  mov dword ptr [esp + 0x28], ebx 100c8d6e  push ebx 100c8d6f  push ecx 100c8d70  push edi 100c8d71  lea esp, dword ptr [esp + 0x34] 100c8d75  jmp 0x100c8c80 100c8c80  bswap si 100c8c83  not si 100c8c86  pushal 100c8c87  pushfd 100c8c89  pop dword ptr [esp + 0x1c] 100c8c8d  shld si, di, cl 100c8c91  jmp 0x100c7dbd 100c7dbd  mov dword ptr [esp + 0x18], ecx 100c7dc1  bsf di, si 100c7dc5  push dword ptr [0x100c6d15] 100c7dcb  pop dword ptr [esp + 0x14] 100c7dcf  movzx cx, bl 100c7dd3  xadd edi, ebp 100c7dd6  mov dword ptr [esp + 0x10], 0 100c7dde  cmc 100c7ddf  jmp 0x100c7c7b 100c7c7b  mov esi, dword ptr [esp + 0x40] 100c7c7f  bts di, ax 100c7c83  btr di, 4 100c7c88  add esi, 0x1644be2b 100c7c8e  neg di 100c7c91  pushal 100c7c92  clc 100c7c93  xor esi, 0x69d1f651 100c7c99  mov word ptr [esp], 0xfaa6 100c7c9f  or edi, eax 100c7ca1  xchg di, bp 100c7ca4  not esi 100c7ca6  and bp, 0x4b2b 100c7cab  rcr bl, cl 100c7cad  dec bp 100c7cb0  stc 100c7cb1  lea ebp, dword ptr [esp + 0x30] 100c7cb5  bts bx, bp 100c7cb9  rcl bx, 4 100c7cbd  add di, di 100c7cc0  neg bx 100c7cc3  sub esp, 0x90 100c7cc9  push esp 100c7cca  lea edi, dword ptr [esp + 4] 100c7cce  lea esp, dword ptr [esp + 4] 100c7cd2  adc bl, dl 100c7cd4  mov ebx, esi 100c7cd6  dec al 100c7cd8  xor cl, cl 100c7cda  cmc 100c7cdb  cmp esi, 0xae51dbba 100c7ce1  add esi, dword ptr [ebp] 100c7ce4  clc 100c7ce5  not cl 100c7ce7  mov al, byte ptr [esi] 100c7ce9  shr cl, 3 100c7cec  rol cl, 6 100c7cef  add al, bl 100c7cf1  movsx cx, dl 100c7cf5  rcl ch, 2 100c7cf8  movzx cx, al 100c7cfc  pushal 100c7cfd  rol al, 7 100c7d00  or ch, dl 100c7d02  shrd cx, di, 6 100c7d07  shl cl, cl 100c7d09  pushal 100c7d0a  neg al 100c7d0c  mov cl, 0x47 100c7d0e  call 0x100c6d19 100c6d19  clc 100c6d1a  sub esi, -1 100c6d1d  test dh, 0x18 100c6d20  not al 100c6d22  dec ecx 100c6d23  add bl, al 100c6d25  rcl cx, 3 100c6d29  rcr ch, 7 100c6d2c  rcr ch, 2 100c6d2f  movzx eax, al 100c6d32  mov byte ptr [esp], bl 100c6d35  clc 100c6d36  not cl 100c6d38  mov ecx, dword ptr [eax*4 + 0x100c7e65] 100c6d3f  lea esp, dword ptr [esp + 0x44] 100c6d43  jb 0x100c8390 100c6d49  push 0x968f74f5 100c6d4e  push dword ptr [esp] 100c6d51  ror ecx, 0xc 100c6d54  test bx, ax 100c6d57  add ecx, 0 100c6d5d  mov byte ptr [esp + 4], bl 100c6d61  mov dword ptr [esp + 4], ecx 100c6d65  pushal 100c6d66  push esp 100c6d67  push dword ptr [esp + 0x28] 100c6d6b  ret 0x2c
And this is the deobfuscated code from my tool at the beginning:
 mov dword ptr [esp + 4], 0x6fdd8840  mov dword ptr [esp + 0x40], 0x2383346b  mov dword ptr [esp + 0x24], eax  mov dword ptr [esp + 0x28], edx  mov dword ptr [esp + 0x2c], ebp  mov dword ptr [esp + 0x30], ebx  mov dword ptr [esp + 0x30], esi  mov dword ptr [esp + 0x2c], edi  mov dword ptr [esp + 0x28], ebx  bswap si  not si  pushal  pushfd  pop dword ptr [esp + 0x1c]  shld si, di, cl  mov dword ptr [esp + 0x18], ecx  bsf di, si  push dword ptr [0x100c6d15]  pop dword ptr [esp + 0x14]  movzx cx, bl  xadd edi, ebp  mov dword ptr [esp + 0x10], 0  mov esi, dword ptr [esp + 0x40]  bts di, ax  btr di, 4  add esi, 0x1644be2b  neg di  pushal  xor esi, 0x69d1f651  mov word ptr [esp], 0xfaa6  not esi  lea ebp, dword ptr [esp + 0x30]  lea edi, dword ptr [esp + 4]  mov ebx, esi  add esi, dword ptr [ebp]  mov al, byte ptr [esi]  add al, bl  rol al, 7  neg al  sub esi, -1  not al  add bl, al  movzx eax, al  mov ecx, dword ptr [eax*4 + 0x100c7e65]  ror ecx, 0xc  add ecx, 0  mov dword ptr [esp + 4], ecx  push dword ptr [esp + 0x28]  ret 0x2c
The obfuscated code was 133 instructions, the deobfuscated form (which looks much more readable) is 49 instructions.
If you actually try to read the code you will find it problematic when it comes to stack accesses. As described above, I do not track
esp so instructions that just modify
sub esp, 0x40
are not in there. Therefore the code is unreadable as we never know what stack location is accessed. I thought about how to solve this and chose the following approach.
Thanks to Triton I know what actual stack location each access is going to. So I just give each unique address a name. This gives us the following:
 mov dword ptr [svar0], 0x6fdd8840  mov dword ptr [svar1], 0x2383346b  mov dword ptr [svar2], eax  mov dword ptr [svar3], edx  mov dword ptr [svar4], ebp  mov dword ptr [svar5], ebx  mov dword ptr [svar6], esi  mov dword ptr [svar7], edi  mov dword ptr [svar8], ebx  bswap si  not si  pushal  pushfd  pop dword ptr [svar9]  shld si, di, cl  mov dword ptr [svar10], ecx  bsf di, si  push dword ptr [0x100c6d15]  pop dword ptr [svar11]  movzx cx, bl  xadd edi, ebp  mov dword ptr [svar12], 0  mov esi, dword ptr [svar0]  bts di, ax  btr di, 4  add esi, 0x1644be2b  neg di  pushal  xor esi, 0x69d1f651  mov word ptr [svar13], 0xfaa6  not esi  lea ebp, dword ptr [svar12]  lea edi, dword ptr [svar14]  mov ebx, esi  add esi, dword ptr [ebp]  mov al, byte ptr [esi]  add al, bl  rol al, 7  neg al  sub esi, -1  not al  add bl, al  movzx eax, al  mov ecx, dword ptr [eax*4 + 0x100c7e65]  ror ecx, 0xc  add ecx, 0  mov dword ptr [svar15], ecx  push dword ptr [svar15]  ret 0x2c
We now can easily read the code and stack accesses are obvious too.
I was also required to add some workarounds because Triton does not implement x86 semantics cleanly. I've opened an issue on github for it. For example the
rol instruction does not read the carry flag but due to the implementation (it reads the old value to write it back if the
rol is a special edge case) Triton would actually say
rol reads the carry flag leading to instruction dependencies that just aren't there.
The deobfuscated code above looks nice and all, but I have yet to prove the result is worth anything. After all I just might have discarded important instructions.
So as a proof I will use the above code and calculate the jump location at the end. I've annotated the disassembly:
 mov dword ptr [svar0], 0x6fdd8840  mov dword ptr [svar1], 0x2383346b  mov dword ptr [svar2], eax  mov dword ptr [svar3], edx  mov dword ptr [svar4], ebp  mov dword ptr [svar5], ebx  mov dword ptr [svar6], esi  mov dword ptr [svar7], edi  mov dword ptr [svar8], ebx  bswap si  not si  pushal  pushfd  pop dword ptr [svar9]  shld si, di, cl  mov dword ptr [svar10], ecx  bsf di, si  push dword ptr [0x100c6d15]  pop dword ptr [svar11]  movzx cx, bl  xadd edi, ebp  mov dword ptr [svar12], 0 svar12 = 0  mov esi, dword ptr [svar0] esi = 0x6fdd8840  bts di, ax  btr di, 4  add esi, 0x1644be2b esi = 0x8622466B  neg di  pushal  xor esi, 0x69d1f651 esi = 0xEFF3B03A  mov word ptr [svar13], 0xfaa6  not esi esi = 100C4FC5  lea ebp, dword ptr [svar12] ebp = pointer to svar12 (which is 0)  lea edi, dword ptr [svar14]  mov ebx, esi ebx = 100C4FC5  add esi, dword ptr [ebp] esi = 100C4FC5  mov al, byte ptr [esi] al = 0x41 (lookup done in IDA)  add al, bl al = 0x06  rol al, 7 al = 3  neg al al = 0xFD  sub esi, -1 esi = 100C4FC6  not al al = 2  add bl, al  movzx eax, al eax = 2  mov ecx, dword ptr [eax*4 + 0x100c7e65] ecx = 0xc8bd0100 (lookup done in IDA)  ror ecx, 0xc ecx = 100c8bd0  add ecx, 0  mov dword ptr [svar15], ecx  push dword ptr [svar15]  ret 0x2c jmp 100c8bd0
And to confirm that result, here's the target in OllyDbg:
So the deobfuscated code actually seems to be good enough for that. This doesn't prove it didn't discard any important instructions but the result is usable.
I quickly noticed my stack var system is fine but when you move on to the next code blob obviously none of the names help anymore. So to actually make the tool practical I may need to see how to handle the stack issue globally.
As mentioned above, register aliasing is on the TODO list, as is just a general code cleanup, it's really not pretty. Memory writes aren't considered relevant yet, which I need to add in, too.
And obviously the code only works for x86 for now but porting it to x64 should be easy.
I'm quite happy with the result for now. Thank you for reading so far!
You can find the script on my github
If you want to see the unadorned raw output, see output.txt