Mastering the Art of Decompilation
When writing code, it is often extremely useful to understand how things work under the hood. It helps you debug your own code and write faster, more efficient programs. It also helps you track down and fix bugs in existing programs and libraries.
Learning assembly language is the best way to understand what happens under the hood. It is also tedious: you need to absorb a lot of detail about the architecture, the processor and the assembler. It’s a long process, but worth the time.
Decompilation offers another route to the same understanding. In this post, I will explain how to use IDA Pro to see how things work under the hood without writing assembly code yourself.
It’s not as difficult as it seems at first glance. David Solomon (author of Inside Windows NT) once said that just a couple of hours are enough for learning Windows programming. And he was right!
The same thing applies here: you don’t need to be an expert in x86 assembly language; you only need enough skill to read the low-level code generated by the compiler. Those who have already worked with languages like C or C++ will find it much easier.
Decompilation is the process of recovering source code, or a close approximation of it, from compiled machine code. It is a reverse engineering technique used to understand how a target application works. In this blog, I will share my experiences with decompilation and explain how it can become a differentiating skill in your career.
Decompilation is not taught in most software classes and people have a misconception that it is difficult to learn. In reality, it’s much easier than you think if you have some experience with programming. Let’s start with the basics.
When you listen to a piece of music or look at a work of art, you can often appreciate the underlying structure. In music, you may notice that the rhythm, tone and tempo change dramatically between verses and choruses. In a painting, you may notice that the artist uses different brush strokes in different parts of the canvas.
But in order to recognize these characteristics, you need to be able to build an abstract model of what is going on. You must be able to recognize patterns and structures, and then use those patterns as a guide for interpreting what else is happening in the work.
In software engineering we often face similar problems. We may want to understand how our own program works, or how other people’s programs work, and there are many reasons for wanting either.
The process by which we build up abstract models is called decomposition. Decomposition means breaking down a complex system into parts that we can understand more easily. We use decomposition when we look at a disassembled program, but we also use it when we look at source code, documentation and even specifications.
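One common way to decompose a disassembled program is to split the instruction listing into basic blocks: straight-line runs of instructions that are entered only at the top and left only at the bottom. The sketch below shows the idea on a toy listing. The instruction set, the `LISTING` data and the `split_basic_blocks` helper are all made up for illustration and do not correspond to any real architecture or tool.

```python
# Toy listing of (address, mnemonic, jump target). The instruction set is
# hypothetical and only exists to illustrate the idea.
LISTING = [
    (0, "cmp", None),
    (1, "jz", 4),    # conditional jump to address 4
    (2, "add", None),
    (3, "jmp", 5),   # unconditional jump to address 5
    (4, "sub", None),
    (5, "ret", None),
]

JUMPS = {"jmp", "jz"}


def split_basic_blocks(listing):
    """Split a linear listing into basic blocks (lists of addresses)."""
    addresses = [addr for addr, _, _ in listing]

    # A "leader" is the first instruction of a block.
    leaders = {addresses[0]}  # the first instruction always starts a block
    for i, (_, op, target) in enumerate(listing):
        if op in JUMPS:
            leaders.add(target)  # a jump target starts a block
            if i + 1 < len(listing):
                # the instruction following a jump starts a block too
                leaders.add(listing[i + 1][0])

    # Walk the listing in order and cut a new block at every leader.
    blocks, current = [], []
    for addr in addresses:
        if addr in leaders and current:
            blocks.append(current)
            current = []
        current.append(addr)
    blocks.append(current)
    return blocks


blocks = split_basic_blocks(LISTING)
print(blocks)
```

Each resulting block is small enough to understand on its own, and the jumps between blocks describe how the parts fit together.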
When people think of software developers, they tend to imagine us sitting in front of our computers coding away. However, this is not the case most of the time.
In fact, most of the time we are looking at source code that someone else wrote. It might be a library you use for your web application, a microservice that integrates with your app, or a piece of code written by a teammate. You will spend more time looking at code you didn’t write than code you did write.
If you’re lucky, the code will be well documented and easy to understand. But that’s rarely the case. So if you want to be able to work quickly and efficiently as a software developer, you need to learn how to read other people’s code quickly and efficiently.
When the source is not available and all you have is a compiled binary, reading other people’s code means decompilation, which is one technique within the broader practice of reverse engineering. In this series I’ll go through the process step by step so that it becomes second nature when you read other people’s code.
In the previous article we discussed my approach to decompilation, which consists of three basic steps:
Emulating code in order to figure out what it does at the lowest level possible. This step allows us to understand the logic of the code, but tells us nothing about its data structures.
Extracting data types and data structures from the code. This allows us to know what types of objects are being created, manipulated and destroyed by the code. It also allows us to understand what kind of data is stored in a given variable or field.
Restructuring the code so that it becomes more human-readable and closer to its original source. We already discussed this step in an earlier article, so I won’t go into details here.
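To make the first and third steps concrete, here is a minimal sketch on a toy stack machine. The instruction set, `PROGRAM`, `emulate` and `restructure` are hypothetical names invented for this illustration and stand in for real machine code and real tooling. Type recovery (the second step) is left out to keep the example short.

```python
# Hypothetical stack-machine program that stands in for real machine code.
PROGRAM = [
    ("load", "a"),   # push the value of variable a
    ("load", "b"),   # push the value of variable b
    ("add", None),   # pop two values, push their sum
    ("push", 2),     # push the constant 2
    ("mul", None),   # pop two values, push their product
    ("ret", None),   # return the top of the stack
]

SYMBOLS = {"add": "+", "mul": "*"}


def emulate(program, env):
    """Step 1: run the program on concrete inputs to observe what it computes."""
    stack = []
    for op, arg in program:
        if op == "load":
            stack.append(env[arg])
        elif op == "push":
            stack.append(arg)
        elif op == "add":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs + rhs)
        elif op == "mul":
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(lhs * rhs)
        elif op == "ret":
            return stack.pop()
    raise ValueError("program ended without ret")


def restructure(program):
    """Step 3: replay the program on expression strings instead of values."""
    stack = []
    for op, arg in program:
        if op in ("load", "push"):
            stack.append(str(arg))
        elif op in SYMBOLS:
            rhs, lhs = stack.pop(), stack.pop()
            stack.append(f"({lhs} {SYMBOLS[op]} {rhs})")
        elif op == "ret":
            return f"return {stack.pop()};"
    raise ValueError("program ended without ret")


result = emulate(PROGRAM, {"a": 3, "b": 4})
source = restructure(PROGRAM)
print(result)  # 14
print(source)  # return ((a + b) * 2);
```

Emulation tells us what the code computes for one set of inputs, while the restructured form shows the general expression in a C-like shape. Real decompilers face registers, memory and control flow on top of this, so treat the sketch as an illustration of the idea only.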
In this article I would like to discuss how we can use this approach in practice on a simple example: disassembling a simple application and turning it into something that looks like C source code.
I have been using Java for over a decade now, both at my job and in my personal projects. However, I had never learned proper Java bytecode until recently. It was not that hard to pick it up because I already knew the difference between high-level languages, like C