What is a disassembler?
A disassembler does what its name suggests, it takes a program in binary format and outputs an assembly file which contains human readable instructions. It should be possible to re-assemble this file and obtain the original binary file. For my disassembler I did not need this part so I made it output both the assembly instructions and some other useful informations such as the address and the binary value of each opcode. The output file contains the data as well. Data is everything that is not code, such as sprites or any other information the original developer might have put into the rom.
How does a disassembler work?
I don’t know. I’ve never looked into the theory of disassemblers and when I started this one I had no clue. So I came up with ideas and implemented several versions of it. Each one was a bit better and less buggy than the previous one. Until I got something that did the job. At this point I think I learned something about how disassemblers work.
Chip8 and SuperCHIP rom format
Chip8 and SuperCHIP roms do not provide a lot of informations about what is inside the rom. All I knew is that the interpreter loads the rom in memory at address 0x200 and the execution starts from there. That means the very first instruction in the file is at position 0. From there I had to trace every single destination address for jump and call opcodes to figure out which parts of the file contain actual code.
Disassembling binaries for other platforms might be easier from this point of view, they might provide some sort of symbol table that the disassembler can use to find the code. This is not the case with Chip8 and SuperCHIP. Roms also contain sprites, and as for the code there is no easily readable information about where they are in the file.
My approach to this problem is very basic, first find and disassemble all the code segments and then disassemble everything else as pure data, outputting this data as hexadecimal and binary values. Having the binary values makes it a bit easier to see the sprites when scrolling the disassembled output.
Static vs Dynamic Disassembly
With the basic idea I started coding my disassembler. I didn’t immediately recognise the flaw in it. The algorithm goes like this
- Create a Set of all addresses to visit, initially this only contains address 0x200, the beginning of the program. Create another Set which will contain all visited addresses. I need this to avoid getting stuck in a loop visiting the same address over and over again.
- Retrieve one address from the Set of addresses to visit. Remove that address from the Set and add it to the Set of addresses already visited.
- Start reading the opcodes at that address. Keep track of all the addresses that contain code. For every jump (JP) and call (CALL) instruction read their destination address and, if not yet visited, add it to the Set of addresses to visit.
- The current code segment ends at the first return (RET) or unconditional jump instruction. Go back to point 2) until the Set of addresses to visit is empty.
- Disassemble as data any address that doesn’t contain code. Simply loop over the whole file and dump these locations as hexadecimal and binary value.
You can think of this approach as static disassembly, it simply looks at the opcodes and follows the jumps and calls. Pretty easy, it works.
However there is one opcode, just one, that complicates things. The opcode is
JP V0, nnn. It jumps to a location specified by register V0 plus nnn where nnn is a three bytes offset. Now this is a problem, what is the destination address? A static disassembly approach cannot fix this problem, instead I need to actually execute the program so to have the value of register V0 at this point, this would be dynamic disassembly.
Unfortunately in this case it is not possible to track the destination address of the jump and so that code segment will not be disassembled. Luckily however, I tested the disassembler on several roms and never encountered this opcode, but I did not test it on all of them. I stopped here as I was happy with the result and it was already doing what I wanted. Time to move on to another project.
Learning is good
Just by writing this simple tool for my emulator I managed to learn a whole lot of new things. Even something as simple sounding as a disassembler hides some non trivial problems.
I am still wondering if it is always possible to disassemble an entire program with such few informations. A static disassembly approach will not work 100%, and while a dynamic disassembly approach could work there is still an issue. While static disassembly actively looks for executable code segments, dynamic disassembly follows the execution flow. How do you make sure that the execution goes through all possible code segments? Maybe mixing the two approaches is the way to go?