Processing cells with the cell_parser() function

The cell_parser() function is the heart of our program. It's responsible for actually extracting the data stored within the cells. As we'll see, varints add another wrinkle to the code; however, for the most part, we're still ultimately parsing binary structures using struct and making decisions based on those values:

173 def cell_parser(wal_dict, x, y, frame):
174     """
175     The cell_parser function processes WAL cells.
176     :param wal_dict: The dictionary containing parsed WAL objects.
177     :param x: An integer specifying the current frame.
178     :param y: An integer specifying the current cell.
179     :param frame: The content within the frame read from the WAL
180     file.
181     :return: Nothing.
182     """

Before we begin to parse the cells, we instantiate a few variables. The index variable, which we create on line 183, keeps track of our current location within the cell. Remember that we're no longer dealing with the entire file but with a subset of it representing a cell. The frame variable holds one page-size chunk of data read from the WAL file. For example, if the page size is 1,024, then the frame variable is 1,024 bytes of data, which corresponds to a page in the database. The struct module requires that the data it parses be exactly as long as the data types specified in the struct format string. Because of these two facts, we need to slice the frame so that struct receives only the bytes we want it to parse:

183     index = 0 
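To illustrate why the slice has to match the format string exactly, here is a minimal sketch. The frame contents are made-up bytes, not data from a real WAL file; only the struct calls themselves come from the standard library.

```python
import struct

# A stand-in for frame data: 8 arbitrary bytes (hypothetical values).
frame = bytes([0x00, 0x0A, 0x01, 0x02, 0x03, 0x04, 0x05, 0x06])

# '>H' describes exactly 2 bytes (big-endian unsigned short),
# so we hand struct a slice of exactly 2 bytes.
value = struct.unpack('>H', frame[0:2])[0]
print(value)  # 10

# Passing the whole 8-byte buffer to a 2-byte format fails.
try:
    struct.unpack('>H', frame)
    length_error = False
except struct.error:
    length_error = True
print(length_error)  # True
```

This is the reason every struct.unpack() call in cell_parser() is paired with an explicit slice of the frame.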

On line 186, we create cell_root, which is essentially a shortcut to the nested cell dictionary within the wal_dict dictionary. This isn't just about being lazy; referring to a variable that points to the nested dictionary, rather than typing out the full path each time, improves readability and reduces clutter. For the same reason, we create the cell_offset variable on line 187:

184     # Create alias to cell_root to shorten navigating the WAL
185     # dictionary structure.
186     cell_root = wal_dict['frames'][x]['cells'][y]
187     cell_offset = cell_root['offset']
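The alias works because cell_root refers to the same dictionary object that lives inside wal_dict, so anything written through the alias shows up in the full structure. A small sketch, using a toy wal_dict whose keys and offset value are assumptions for illustration:

```python
# Toy structure shaped like the nested WAL dictionary (values are made up).
wal_dict = {'frames': {0: {'cells': {0: {'offset': 1012}}}}}
x, y = 0, 0

# The alias points at the nested dictionary; it is not a copy.
cell_root = wal_dict['frames'][x]['cells'][y]
cell_root['rowid'] = 7

print(cell_root is wal_dict['frames'][x]['cells'][y])  # True
print(wal_dict['frames'][0]['cells'][0])  # {'offset': 1012, 'rowid': 7}
```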

Starting on line 191, we encounter our first varint: the cell payload length. This varint dictates the overall size of the cell. To extract it, we call the single_varint() helper function, passing it a 9-byte slice of data (a SQLite varint is at most 9 bytes long). This function, which we will explain later, checks whether the first byte is greater than or equal to 128; if so, it processes the second byte. In addition to the varint's value, single_varint() returns a count of how many bytes the varint occupied. This allows us to keep track of our current position in the frame data. We use that returned count to locate and parse the row ID varint in the same fashion:

189     # Parse the payload length and rowID Varints.
190     try:
191         payload_len, index_a = single_varint(
192             frame[cell_offset:cell_offset + 9])
193         row_id, index_b = single_varint(
194             frame[cell_offset + index_a: cell_offset + index_a + 9])
195     except ValueError:
196         logging.warning(('Found a potential three-byte or greater '
197             'varint in cell {} from frame {}').format(y, x))
198         return
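As a rough illustration of the behavior described above, the following sketch decodes a one- or two-byte varint and reports how many bytes it consumed. The name single_varint_sketch is hypothetical, and this is not the program's actual single_varint() helper; it only assumes the SQLite convention that each byte contributes its low 7 bits and that a set high bit means another byte follows.

```python
def single_varint_sketch(data):
    """Decode a one- or two-byte varint from the start of data.

    Returns (value, bytes_consumed). Raises ValueError for varints of
    three bytes or more, mirroring the warning logged by cell_parser().
    """
    if len(data) == 0:
        raise ValueError('no data to decode')
    first = data[0]
    if first < 128:
        # High bit clear: the varint is this single byte.
        return first, 1
    if len(data) < 2 or data[1] >= 128:
        # A second continuation bit means three or more bytes.
        raise ValueError('varint is three bytes or longer')
    # Combine the low 7 bits of the first byte with the second byte.
    return ((first & 0x7F) << 7) | data[1], 2


# Hypothetical cell bytes: a two-byte payload length, then a one-byte row ID.
cell = bytes([0x81, 0x00, 0x05, 0x03])
payload_len, index_a = single_varint_sketch(cell[0:9])
row_id, index_b = single_varint_sketch(cell[index_a:index_a + 9])
print(payload_len, index_a)  # 128 2
print(row_id, index_b)       # 5 1
```

Note how the byte count returned by the first call is what positions the second call, just as index_a does on line 194.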

After processing the first two varints, we add both values as key-value pairs to the cell's entry in the wal_dict dictionary. On line 204, we update our index variable to maintain our current position in the frame data. Next, we manually extract the 8-bit payload header length value without the dict_helper() function. We do this for two reasons:

The following code block shows this functionality:

200     # Update the index. Following the payload length and rowID is
201     # the 1-byte header length.
202     cell_root['payloadlength'] = payload_len
203     cell_root['rowid'] = row_id
204     index += index_a + index_b
205     cell_root['headerlength'] = struct.unpack('>b',
206         frame[cell_offset + index: cell_offset + index + 1])[0]
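One detail of the format string on line 205 is worth a quick check: in struct, 'b' is a signed byte and 'B' is an unsigned byte. The sketch below uses made-up byte values to show that the two agree for values below 128 and diverge above that.

```python
import struct

small = bytes([0x03])  # hypothetical header length of 3
large = bytes([0x90])  # a byte with its high bit set

signed_small = struct.unpack('>b', small)[0]
unsigned_small = struct.unpack('>B', small)[0]
signed_large = struct.unpack('>b', large)[0]
unsigned_large = struct.unpack('>B', large)[0]

print(signed_small, unsigned_small)  # 3 3
print(signed_large, unsigned_large)  # -112 144
```

For the short headers this code expects, the signed format yields the same result; a negative header length would be a hint that the byte was not a simple one-byte length.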

After parsing the payload length, row ID, and payload header length, we can now parse the serial types array. As a reminder, the serial types array contains N varints and is headerlength - 1 bytes long. On line 210, we increment the index by 1 to account for the 1-byte header length we parsed on line 205. We then extract all of the varints within the appropriate range by calling the multi_varint() function. This function returns a tuple containing the list of serial types and the number of bytes consumed. On lines 218 and 219, we update the cell_root dictionary and the index variable, respectively:

208     # Update the index with the 1-byte header length. Next process
209     # each Varint in "headerlength" - 1 bytes.
210     index += 1
211     try:
212         types, index_a = multi_varint(
213             frame[cell_offset + index:
214                   cell_offset + index + cell_root['headerlength'] - 1])
215     except ValueError:
216         logging.warning(('Found a potential three-byte or greater '
217             'varint in cell {} from frame {}').format(y, x))
218         return
219     cell_root['types'] = types
220     index += index_a
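The idea of walking a slice and collecting every varint in it can be sketched as follows. The function name multi_varint_sketch is hypothetical and this is not the program's multi_varint() helper; like the text above, it assumes varints of at most two bytes and raises ValueError otherwise.

```python
def multi_varint_sketch(data):
    """Decode consecutive one- or two-byte varints from data.

    Returns (list_of_values, bytes_consumed).
    """
    values = []
    index = 0
    while index < len(data):
        first = data[index]
        if first < 128:
            # Single-byte varint.
            values.append(first)
            index += 1
            continue
        if index + 1 >= len(data) or data[index + 1] >= 128:
            raise ValueError('varint is three bytes or longer')
        # Two-byte varint: 7 bits from the first byte, then the second byte.
        values.append(((first & 0x7F) << 7) | data[index + 1])
        index += 2
    return values, index


# Hypothetical serial types array: two one-byte varints and one two-byte varint.
header_body = bytes([0x01, 0x0F, 0x81, 0x11])
types, consumed = multi_varint_sketch(header_body)
print(types, consumed)  # [1, 15, 145] 4
```

Because the slice passed in is exactly headerlength - 1 bytes long, the loop stops at the end of the serial types array without needing to know N in advance.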

Once the serial types array has been parsed, we can begin to extract the actual data stored in the cell. Recall that the size of the cell's data is the difference between the payload length and the payload header length. This value, calculated on line 224, is used to pass the remaining contents of the cell to the type_helper() helper function, which is responsible for parsing the data:

221     # Immediately following the end of the Varint headers begins
222     # the actual data described by the headers. Process them using
223     # the type_helper function.
224     diff = cell_root['payloadlength'] - cell_root['headerlength']
225     cell_root['data'] = type_helper(cell_root['types'],
226         frame[cell_offset + index: cell_offset + index + diff])
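To make the arithmetic concrete, here is a small worked example on a made-up cell. The serial_type_size() function is a hypothetical helper, not the program's type_helper(); it only encodes the content sizes that the SQLite record format assigns to each serial type.

```python
def serial_type_size(serial_type):
    """Return the number of data bytes a SQLite serial type occupies."""
    fixed = {0: 0, 1: 1, 2: 2, 3: 3, 4: 4, 5: 6, 6: 8, 7: 8, 8: 0, 9: 0}
    if serial_type in fixed:
        return fixed[serial_type]
    if serial_type >= 12:
        # Even values are BLOBs of (N - 12) / 2 bytes and odd values are
        # text of (N - 13) / 2 bytes; floor division covers both cases.
        return (serial_type - 12) // 2
    raise ValueError('serial types 10 and 11 are reserved')


# Hypothetical cell: payload length 5, header length 3, two serial types.
payload_len, header_len = 5, 3
types = [1, 15]                  # 1-byte integer, then 1-byte text
data = bytes([0x2A, 0x41])       # the bytes that follow the header

diff = payload_len - header_len  # bytes of actual column data
sizes = [serial_type_size(t) for t in types]

# Split the data into one chunk per serial type.
chunks, pos = [], 0
for size in sizes:
    chunks.append(data[pos:pos + size])
    pos += size

print(diff, sizes)  # 2 [1, 1]
print(chunks)       # [b'*', b'A']
```

In this example the sizes implied by the serial types add up to diff, which is a useful sanity check that the header and the data slice line up.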