Processing cells with the cell_parser() function

The cell_parser() function is the heart of our program. It's responsible for actually extracting the data stored within the cells. As we'll see, varints add another wrinkle to the code; however, for the most part, we're still ultimately parsing binary structures using struct and making decisions based on those values:

173 def cell_parser(wal_dict, x, y, frame):
174     """
175     The cell_parser function processes WAL cells.
176     :param wal_dict: The dictionary containing parsed WAL objects.
177     :param x: An integer specifying the current frame.
178     :param y: An integer specifying the current cell.
179     :param frame: The content within the frame read from the WAL
180     file.
181     :return: Nothing.
182     """

Before we begin to parse the cells, we instantiate a few variables. The index variable, which we create on line 183, keeps track of our current location within the cell. Remember that we're no longer dealing with the entire file but with a subset of it representing a cell. The frame variable holds a page-sized amount of data. For example, if the page size is 1,024, then frame contains 1,024 bytes of data, which correspond to a page in the database. The struct module requires that the data it parses is exactly the length of the data types specified in the struct format string. Because of these two facts, we need to use slicing to provide only the data we want to parse with struct:

183     index = 0
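To illustrate why slicing is needed, here is a minimal sketch using a made-up 16-byte buffer in place of a real frame. The buffer contents and the offset are assumptions chosen for the example; only the struct behavior is standard:

```python
import struct

# Hypothetical 16-byte buffer standing in for a frame of WAL data.
frame = bytes(range(16))

# '>H' describes exactly 2 bytes, so we slice out exactly 2 bytes.
offset = 4
value = struct.unpack('>H', frame[offset:offset + 2])[0]
print(value)  # 1029, that is, bytes 0x04 0x05 read as big-endian

# Handing struct a buffer of the wrong length raises struct.error.
error_seen = False
try:
    struct.unpack('>H', frame)
except struct.error:
    error_seen = True
print(error_seen)  # True
```

Keeping an offset and an index, and slicing relative to both, is what lets the same frame buffer be parsed piece by piece.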

On line 186, we create cell_root, which is essentially a shortcut to the nested cell dictionary within the wal_dict dictionary. This isn't just about being lazy; referring to a variable that points to a nested dictionary, rather than typing out the full path each time, improves readability and reduces clutter. For the same reason, we create the cell_offset variable on line 187:

184     # Create alias to cell_root to shorten navigating the WAL
185     # dictionary structure.
186     cell_root = wal_dict['frames'][x]['cells'][y]
187     cell_offset = cell_root['offset']

Starting on line 191, we encounter our first varint, the cell payload length. This varint dictates the overall size of the cell. To extract it, we call the single_varint() helper function, supplying it with a 9-byte slice of data (the maximum length of a varint). This function, which we will explain later, checks whether the first byte is greater than or equal to 128; if so, it processes the second byte. In addition to the varint, the single_varint() helper function also returns a count of how many bytes made up the varint. This allows us to keep track of our current position in the frame data. We use that returned index to parse the row ID varint in a similar fashion:

189     # Parse the payload length and rowID Varints.
190     try:
191         payload_len, index_a = single_varint(
192         frame[cell_offset:cell_offset + 9])
193         row_id, index_b = single_varint(
194         frame[cell_offset + index_a: cell_offset + index_a + 9])
195     except ValueError:
196         logging.warning(('Found a potential three-byte or greater '
197         'varint in cell {} from frame {}').format(y, x))
198         return
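For reference, the general varint rule can be sketched as follows. This is not the single_varint() helper from the listing, which is explained later and which gives up with a ValueError on longer varints; decode_varint() is a hypothetical name, and the sketch assumes the standard SQLite encoding (seven payload bits per byte, high bit set on every byte except the last, and a ninth byte that contributes all eight bits):

```python
def decode_varint(data):
    """Decode one SQLite-style varint from the start of data.

    Hypothetical helper for illustration. Returns a tuple of
    (value, number_of_bytes_consumed).
    """
    value = 0
    for i, byte in enumerate(data[:9]):
        if i == 8:
            # The ninth byte contributes all eight of its bits.
            return (value << 8) | byte, 9
        value = (value << 7) | (byte & 0x7F)
        if byte < 128:
            # High bit clear: this was the last byte of the varint.
            return value, i + 1
    raise ValueError('Data ended in the middle of a varint')


# Made-up cell start: payload length 128 (two bytes), row ID 5 (one byte).
cell = b'\x81\x00\x05'
payload_len, index_a = decode_varint(cell[0:9])
row_id, index_b = decode_varint(cell[index_a:index_a + 9])
print(payload_len, index_a, row_id, index_b)  # 128 2 5 1
```

As in the listing, the byte count returned by the first call tells the second call where to start.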

After processing the first two varints, we add their key-value pairs to the wal_dict dictionary through the cell_root alias. On line 204, we update our index variable to maintain our current position in the frame data. Next, we manually extract the 8-bit payload header length value without the dict_helper() function. We do this for two reasons:

We're only processing one value
Setting cell_root equal to the output of dict_helper() was found to erase all other keys in the individual cell nested dictionary described by cell_root, which, admittedly, isn't ideal

The following code block shows this functionality:

200     # Update the index. Following the payload length and rowID is
201     # the 1-byte header length.
202     cell_root['payloadlength'] = payload_len
203     cell_root['rowid'] = row_id
204     index += index_a + index_b
205     cell_root['headerlength'] = struct.unpack('>b',
206     frame[cell_offset + index: cell_offset + index + 1])[0]
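The difference between writing through an alias and rebinding it can be shown with a small sketch. This illustrates general Python behavior only; whether it is exactly what happened with dict_helper() is an assumption, and parsed_values() is a hypothetical stand-in for any helper that returns a fresh dictionary:

```python
# Hypothetical miniature of the nested structure used in the listing.
wal_dict = {'frames': {0: {'cells': {0: {'offset': 100}}}}}

# The alias and the nested dictionary are the same object, so writing
# through the alias updates wal_dict as well.
cell_root = wal_dict['frames'][0]['cells'][0]
cell_root['rowid'] = 7
print(wal_dict['frames'][0]['cells'][0])  # {'offset': 100, 'rowid': 7}


def parsed_values():
    """Stand-in for a helper that returns a new dictionary."""
    return {'headerlength': 4}


# Assigning the helper's output to a name yields a new object that has
# none of the earlier keys, and wal_dict never sees 'headerlength'.
rebound = parsed_values()
print('offset' in rebound)  # False
print('headerlength' in wal_dict['frames'][0]['cells'][0])  # False

# Merging in place keeps the existing keys and the link to wal_dict.
cell_root.update(parsed_values())
print(wal_dict['frames'][0]['cells'][0])
```

Assigning individual keys, as the listing does on lines 202 through 205, sidesteps the problem in the same way that update() does.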

After parsing the payload length, row ID, and payload header length, we can now parse the serial types array. As a reminder, the serial types array contains N varints and is headerlength - 1 bytes long. On line 210, we increment the index by 1 to account for the 1-byte header length we parsed on line 205. We then extract all of the varints within the appropriate range by calling the multi_varint() function. This function returns a tuple containing the list of serial types and the current index. On lines 218 and 219, we update the cell_root dictionary and the index variable, respectively:

208     # Update the index with the 1-byte header length. Next process
209     # each Varint in "headerlength" - 1 bytes.
210     index += 1
211     try:
212         types, index_a = multi_varint(
213         frame[cell_offset + index:cell_offset+index+cell_root['headerlength']-1])
214     except ValueError:
215         logging.warning(('Found a potential three-byte or greater '
216             'varint in cell {} from frame {}').format(y, x))
217         return
218     cell_root['types'] = types
219     index += index_a
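The header arithmetic can be sketched with a made-up header. read_varints() below is a hypothetical stand-in, not the multi_varint() function from the listing, and it assumes the standard SQLite varint encoding:

```python
def read_varints(data):
    """Decode every varint in data; return (values, bytes_consumed).

    Hypothetical helper for illustration only.
    """
    values = []
    value = 0
    length = 0  # bytes seen so far in the current varint
    for byte in data:
        length += 1
        if length == 9:
            # A ninth byte contributes all eight of its bits.
            value = (value << 8) | byte
        else:
            value = (value << 7) | (byte & 0x7F)
        if byte < 128 or length == 9:
            values.append(value)
            value, length = 0, 0
    if length:
        raise ValueError('Data ended in the middle of a varint')
    return values, len(data)


# Made-up header: a 1-byte header length (5) followed by
# headerlength - 1 = 4 bytes of serial types.
header = bytes([0x05, 0x01, 0x81, 0x00, 0x17])
header_length = header[0]
types, consumed = read_varints(header[1:header_length])
print(types, consumed)  # [1, 128, 23] 4
```

Note that four bytes yield only three serial types here, because one of them is a two-byte varint. This is why the slice is sized in bytes rather than in number of varints.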

Once the serial types array has been parsed, we can begin to extract the actual data stored in the cell. Recall that the size of the cell's data is the difference between the payload length and the payload header length. This value, calculated on line 224, is used to pass the remaining contents of the cell to the type_helper() helper function, which is responsible for parsing the data:

221     # Immediately following the end of the Varint headers begins
222     # the actual data described by the headers. Process them using
223     # the type_helper function.
224     diff = cell_root['payloadlength'] - cell_root['headerlength']
225     cell_root['data'] = type_helper(cell_root['types'],
226     frame[cell_offset + index: cell_offset + index + diff])
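To show how the serial types describe the data that follows them, here is a minimal sketch that splits a made-up record body into per-column chunks. It is not the type_helper() function; serial_type_size() is a hypothetical name, and the size table follows the SQLite record format:

```python
def serial_type_size(serial_type):
    """Return the number of data bytes a serial type occupies.

    Hypothetical helper based on the SQLite record format.
    """
    # NULL, 1/2/3/4/6/8-byte integers, 8-byte float, constants 0 and 1.
    fixed = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 8, 7: 8, 8: 0, 9: 0}
    if serial_type in fixed:
        return fixed[serial_type]
    if serial_type < 0 or serial_type in (10, 11):
        raise ValueError('Invalid or reserved serial type')
    if serial_type % 2 == 0:
        return (serial_type - 12) // 2  # BLOB
    return (serial_type - 13) // 2      # text


# Made-up record: a 1-byte integer, a 5-byte string, and a NULL.
types = [1, 23, 0]
body = b'\x2ahello'

chunks = []
offset = 0
for serial_type in types:
    size = serial_type_size(serial_type)
    chunks.append(body[offset:offset + size])
    offset += size
print(chunks)  # [b'*', b'hello', b'']

# With a 4-byte header (the length byte plus three 1-byte serial types),
# the payload length would be 10, and 10 - 4 is the size of the body.
header_length = 1 + len(types)
payload_length = header_length + len(body)
print(payload_length - header_length == offset)  # True
```

The final check mirrors the diff calculation on line 224: the bytes described by the serial types should add up to the payload length minus the header length.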