The Basics of Reverse Engineering File Formats

Overview
This small page has some helpful information to get you started reverse engineering files. It isn't very detailed, but hopefully it covers the basics you need to begin.

Tools of the Trade
If you are interested in reverse engineering, you are going to need some basic tools.

-Hex Editor (which one to use is a personal choice, but lately I've been using UltraEdit32)
-Programming Language of your choice
-Something to convert Hex to Decimal (brain, calculator, etc.)
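
If your programming language of choice happens to be Python, it can also do the hex-to-decimal conversion for you. This is just a minimal sketch. The helper names are my own, and it assumes the bytes you copied out of the hex editor are little-endian unless you say otherwise.

```python
def hex_to_int(text: str) -> int:
    """Convert a hex string such as '1F4' or '0x1F4' to a decimal integer."""
    return int(text, 16)


def bytes_to_int(raw: bytes, little_endian: bool = True) -> int:
    """Interpret raw bytes (as they appear in the file) as an unsigned integer."""
    return int.from_bytes(raw, "little" if little_endian else "big")


print(hex_to_int("1F4"))                           # 500
print(bytes_to_int(bytes.fromhex("F4 01 00 00")))  # 500
```

The second call is the one you will use most. A hex editor shows bytes in file order, so a little-endian DWORD of 500 shows up as F4 01 00 00 rather than 00 00 01 F4.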

Basics & Intuition
The people who made the file format you are trying to reverse engineer were most likely humans. So ask yourself: if I were going to make a file to do such and such, how would I lay it out? As you figure out a file, you will slowly adjust to thinking like its authors. This lets you start making predictions, and it helps when reverse engineering other files by the same group.

When reverse engineering files, you are looking for certain structures. Programmers often use fwrite to write out an array of fixed-size structures. The first step is to determine the size of the structure. Do this by looking for something that seems to repeat every so many bytes; most likely it is some attribute that has the same value in every structure. The second step is to find the boundaries of the array. Look for the spot where the data seems to change into something new, or where the structures start. Since the structures should generally be a fixed size, the length of the array has to be a multiple of the structure size, which helps you pin down where the array starts and ends. From here it's a matter of slowly breaking the structure into fields using your most likely guesses, and eventually testing those guesses by making changes and observing the program's response.
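
The "repeats every so many bytes" search can be roughed out in code. The sketch below is only one way to do it, not a standard algorithm: for each candidate size it counts how often a byte equals the byte that many positions later, on the assumption that fixed-size structures share some constant fields. The function name and the size limits are my own choices.

```python
def score_record_sizes(data: bytes, min_size: int = 2, max_size: int = 64):
    """Rank candidate structure sizes, best guess first.

    The score for a size is the fraction of positions i where
    data[i] == data[i + size]. Constant fields in a fixed-size
    structure push the score up at the true size.
    """
    results = []
    for size in range(min_size, max_size + 1):
        if size >= len(data):
            break
        compared = len(data) - size
        matches = sum(1 for i in range(compared) if data[i] == data[i + size])
        results.append((size, matches / compared))
    # Highest score first; on a tie prefer the smaller size, because
    # multiples of the real size score just as well as the real size.
    results.sort(key=lambda pair: (-pair[1], pair[0]))
    return results
```

Feed it the bytes of the region you suspect is an array (for example the result of open(path, "rb").read(), sliced down to the interesting part) and look at the first few entries. Treat the result as a hint to check in the hex editor, since long runs of zeros or padding will also score well.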

Somewhere before the array you will most likely find a number that tells you how many structures were written to the file. This brings us to headers and similar information. To make things easier on themselves, programmers often include file size info, a file ID, and file layout type info. The beginning of the file often holds a 4-character file ID. After this, programmers often include a DWORD (a 32-bit unsigned integer) that specifies the file size. In general, expect to see numbers telling you how many of some type of structure you can expect to find.
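
To make the header idea concrete, here is a minimal sketch that reads a made-up header: a 4-character ID, a DWORD file size, and a DWORD structure count, all little-endian. That exact layout is an assumption for illustration only; a real format will put things in its own order, so adjust the format string as you learn more.

```python
import struct

# Hypothetical layout: 4-char file ID, DWORD file size, DWORD structure count.
HEADER = struct.Struct("<4sII")


def parse_header(data: bytes) -> dict:
    """Decode the assumed 12-byte header from the start of the file."""
    if len(data) < HEADER.size:
        raise ValueError("file is too short to hold the assumed header")
    file_id, file_size, count = HEADER.unpack_from(data, 0)
    return {
        "file_id": file_id.decode("ascii", errors="replace"),
        "file_size": file_size,
        # If this is True, the DWORD really is the total file size.
        "size_matches": file_size == len(data),
        "record_count": count,
    }
```

The size_matches check is the useful part. When a number in the header equals the actual file length (or the length minus the header), you have good evidence for what that field means.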

Some Basic Tips
-Think simple for datatypes, since in general they are
-Remember that strings usually end in a 0 byte
-Look for patterns in the data
-Work slowly; if you rush, you are likely to miss something
-If you think it might be a picture, try making a RAW file and looking at it in an image program
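
For the picture tip, a variation I find convenient is to wrap the bytes in a PGM header instead of a bare RAW file, so the image program does not have to be told the dimensions. This is a sketch under assumptions: it treats the data as 8-bit grayscale, one byte per pixel, and the width is your guess. The function name is my own.

```python
def bytes_to_pgm(data: bytes, width: int) -> bytes:
    """Wrap raw bytes as an 8-bit grayscale binary PGM (P5) image.

    Any leftover bytes that do not fill a complete row are dropped.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = len(data) // width
    if height == 0:
        raise ValueError("not enough data for one row at this width")
    header = f"P5\n{width} {height}\n255\n".encode("ascii")
    return header + data[: width * height]
```

Write the result to a .pgm file and try a few widths. When the width is wrong a real image looks like diagonal streaks, and when it is right the picture snaps into place. Common guesses are powers of two such as 64, 128, 256, and 512.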

Challenges
Often, in the midst of reverse engineering a file format, you run into barriers. The basic ones are compression and encryption.

Compression, while sometimes a formidable barrier, is often one that can be overcome. In general, the files you are looking at use some publicly available compression library (unless, of course, it is a compression file format you are reverse engineering). The basic thing I do if I feel a file might be compressed is check whether there is a zlib.dll in the directory of the program associated with the file. Then, with a little playing around, I can often get the file to decompress. The compressed file or portion will most likely be preceded by its compressed size and uncompressed size (so you know what size of buffer to create). It's hard to say exactly how to spot compressed data, but it often looks pretty random. If it doesn't appear to be zlib compression, I normally don't bother.
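
The "little playing around" can look something like the sketch below, which uses Python's zlib module. The first function assumes a hypothetical layout of two little-endian DWORDs (compressed size, then uncompressed size) right before the data; the real order and position will vary by format. The second function just tries every offset that starts with 0x78, the byte zlib streams commonly begin with, and keeps the offsets that decompress cleanly.

```python
import struct
import zlib


def try_inflate(data: bytes, offset: int = 0) -> bytes:
    """Decompress a block laid out as: compressed size, uncompressed size, data.

    That layout is an assumption for illustration, not a known format.
    """
    comp_size, raw_size = struct.unpack_from("<II", data, offset)
    start = offset + 8
    blob = data[start : start + comp_size]
    if len(blob) != comp_size:
        raise ValueError("compressed size runs past the end of the file")
    out = zlib.decompress(blob)
    if len(out) != raw_size:
        raise ValueError("uncompressed size does not match the size field")
    return out


def find_zlib_streams(data: bytes) -> list:
    """Return offsets where a complete zlib stream appears to start."""
    hits = []
    for offset in range(len(data) - 1):
        if data[offset] != 0x78:  # common first byte of a zlib stream
            continue
        decoder = zlib.decompressobj()
        try:
            decoder.decompress(data[offset:])
        except zlib.error:
            continue
        if decoder.eof:  # the stream ended properly, so this is a real hit
            hits.append(offset)
    return hits
```

If find_zlib_streams turns up an offset, look at the 8 or so bytes just before it in the hex editor. That is where the size fields tend to be, and matching them against the actual compressed and decompressed lengths tells you which is which.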

Encryption, well, that's a pain. In general, if a file is encrypted and the program never asks you for an encryption key, then the key has to be stored in the file, within the program, or in some other data file. More commonly you will see some type of mask applied when the authors just want to obscure the data. It's normally about the same amount of pain, though depending on the mask you are likely to still see patterns. Unfortunately, breaking an encrypted file often comes down to reverse engineering, to some extent, the executable that uses the file. This is quite a formidable task for a beginner and not one I would recommend unless the file format is important to you. Before taking on such a task, I would recommend getting some experience with basic file cracking, the legality of which is questionable to say the least. Then again, if you live in the US, the legality of reverse engineering a file with encryption depends on your intent.
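
As an example of the patterns a mask can leave behind, here is a sketch for the simplest case I can think of, an XOR mask. That the mask is XOR at all is an assumption, and so is the guess that the most common byte in the original data is 0 (true of many binary files full of padding, not true of text). The function names are my own.

```python
from collections import Counter


def apply_xor(data: bytes, key: bytes) -> bytes:
    """XOR data with a repeating key. Applying it twice restores the input."""
    if not key:
        raise ValueError("key must not be empty")
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def guess_single_byte_key(masked: bytes) -> int:
    """Guess a one-byte XOR key, assuming the most common plain byte is 0."""
    if not masked:
        raise ValueError("no data to analyse")
    most_common_byte, _count = Counter(masked).most_common(1)[0]
    return most_common_byte  # 0 XOR key == key


def recover_key_from_known(masked: bytes, known_plain: bytes) -> bytes:
    """Recover key bytes from text you expect at the start, such as a file ID."""
    return bytes(m ^ k for m, k in zip(masked, known_plain))
```

If you already know what the first few bytes should be (say, the 4-character file ID from an unmasked file of the same type), recover_key_from_known gives you the start of the key directly. A result where every byte is the same, or where a short sequence repeats, is a good sign you are looking at a simple mask rather than real encryption.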

Summary
This document should give you the basic ideas you need to start reverse engineering file formats. At first it will seem tough, but with perseverance you will find it rewarding, and it helps improve your programming skills as you get to see how others lay out their structures. The skill also helps you figure out other people's code and data structures by getting you into the mindset of the people who wrote them.