Each cell is made up of the following components:
- Payload length (varint)
- Row ID (varint)
- Payload header:
- Payload header length (varint)
- Array of serial types (varints)
- Payload
The payload length describes the overall length of the cell. The Row ID is the unique key in the actual database corresponding to this record. The serial types array in the payload header contains the length and type of data in the payload. We can subtract the payload length by the payload header length to determine how many bytes of the cell is actually recorded data.
Notice that most of these values are varints, or variable length integers. Varints in SQLite are integers that can be anywhere from 1 to 9 bytes in size based on the first bit of each byte. If the first bit is set, that is, a value of 1, then the next byte is a part of the varint. This continues until you have a 9 byte varint or the first bit of a byte isn't set. The first bit isn't set for all 8-bit integers less than 128. This allows large numbers to be stored flexibly within this file format. More details on varints is available at https://www.sqlite.org/src4/doc/trunk/www/varint.wiki.
For example, if the first byte that's processed is 0x03 or 0b00000011, we know the varint is just one-byte long and has the value of 3. If the first byte that's processed is 0x9A or 0b10011010, then the first bit is set and the varint is at least two-bytes long depending on the next byte, using the same decision making process. For our purposes, we will only support varints up to 2 bytes in length. A detailed tutorial on parsing a WAL file can be read at http://www.forensicsfromthesausagefactory.blogspot.com/2011/05/analysis-of-record-structure-within.html. It's highly recommended to use a hex editor and parse a page by hand before attempting to develop the code. Handling varints can be a lot easier through examination in a hex editor and helps cement your understanding of the database structure.
Most of the varints are found in the serial types array. This array immediately follows the payload header length and has a value of 1. The resulting table of varint values dictate the size and data type of the cells:
Varint value |
Size (bytes) |
Data type |
0 |
0 |
Null |
1 |
1 |
8-bit integer |
2 |
2 |
Big-endian 16-bit integer |
3 |
3 |
Big-endian 24-bit integer |
4 |
4 |
Big-endian 32-bit integer |
5 |
6 |
Big-endian 48-bit integer |
6 |
8 |
Big-endian 64-bit integer |
7 |
8 |
Big-endian 64-bit float |
8 |
0 |
Integer constant: 0 |
9 |
0 |
Integer constant: 1 |
10, 11 |
Not used |
|
X >= 12 and even |
(X-12)/2 |
BLOB of length (X-12)/2 |
X >= 13 and odd |
(X-13)/2 |
String of length (X-13)/2 |
The payload begins immediately following the final serial type. Let's look at how we can use varints to properly parse the contents of the payload properly. For example, if given the following serial types: 0, 2, 6, 8, and 25, we would expect a 16-byte payload containing a Null value, a 2-byte 16-bit integer, an 8-byte 64-bit integer, a constant 0, and a 6-byte string. The size of the string is calculated by the equation (25-13) / 2. The following pseudocode highlights this process:
Serial Types = 0, 2, 6, 8, and 25
Payload = 0x166400000009C5BA3649737069646572
Split_Payload = N/A , 0x1664, 0x00000009C5BA3649, N/A, 0x737069646572
Converted_Payload = Null, 5732, 163953206, 0, "spider"
The preceding example illustrates how the 16-byte payload would be decoded using the known serial types. We will employ this same approach when developing our program. Notice that serial types 0, 8, and 9 don't require any space in the payload as their values are static.