Practical Programming, Third Edition

Multiline Records

Not every data record will fit onto a single line. Here is a file in simplified Protein Data Bank (PDB) format that describes the arrangement of atoms in ammonia:

	COMPND AMMONIA
	ATOM 1 N 0.257 -0.363 0.000
	ATOM 2 H 0.257 0.727 0.000
	ATOM 3 H 0.771 -0.727 0.890
	ATOM 4 H 0.771 -0.727 -0.890
	END

The first line is the name of the molecule. Each subsequent line, down to the one containing END, specifies the ID, type, and XYZ coordinates of one atom in the molecule.
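To make the layout of an ATOM line concrete, here is a minimal sketch that splits one such line into its ID, type, and coordinates. The helper name parse_atom_line is hypothetical and is not part of this chapter's code; the sketch also assumes the line is well formed, with exactly the six whitespace-separated fields shown above.

```python
def parse_atom_line(line: str) -> tuple:
    """Split one 'ATOM id type x y z' line into (id, type, (x, y, z)).

    Hypothetical helper for illustration only; it assumes the line has
    all six fields and does no error checking.
    """
    parts = line.split()
    atom_id = int(parts[1])
    atom_type = parts[2]
    x, y, z = (float(value) for value in parts[3:6])
    return atom_id, atom_type, (x, y, z)


print(parse_atom_line('ATOM 1 N 0.257 -0.363 0.000'))
# prints (1, 'N', (0.257, -0.363, 0.0))
```

Converting the coordinates to float is a choice made for this sketch; the code later in this section keeps them as strings.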

Reading this file is straightforward using the techniques that we have built up in this chapter. But what if the file contained two or more molecules, like this:

	COMPND AMMONIA
	ATOM 1 N 0.257 -0.363 0.000
	ATOM 2 H 0.257 0.727 0.000
	ATOM 3 H 0.771 -0.727 0.890
	ATOM 4 H 0.771 -0.727 -0.890
	END
	COMPND METHANOL
	ATOM 1 C -0.748 -0.015 0.024
	ATOM 2 O 0.558 0.420 -0.278
	ATOM 3 H -1.293 -0.202 -0.901
	ATOM 4 H -1.263 0.754 0.600
	ATOM 5 H -0.699 -0.934 0.609
	ATOM 6 H 0.716 1.404 0.137
	END

As always, we tackle this problem by dividing it into smaller ones and solving each of those in turn. Our first algorithm is as follows:

	While there are more molecules in the file:
	    Read a molecule from the file
	    Append it to the list of molecules read so far

Simple, except the only way to tell whether there is another molecule left in the file is to try to read it. Our modified algorithm is as follows:

	reading = True
	while reading:
	    Try to read a molecule from the file
	    If there is one:
	        Append it to the list of molecules read so far
	    else: # nothing left
	        reading = False

In Python, this is as follows:

	from typing import TextIO
	from io import StringIO

	def read_molecule(reader: TextIO) -> list:
	    """Read a single molecule from reader and return it, or return None to
	    signal end of file. The first item in the result is the name of the
	    compound; each list contains an atom type and the X, Y, and Z coordinates
	    of that atom.

	    >>> instring = 'COMPND TEST\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n'
	    >>> infile = StringIO(instring)
	    >>> read_molecule(infile)
	    ['TEST', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']]
	    """

	    # If there isn't another line, we're at the end of the file.
	    line = reader.readline()
	    if not line:
	        return None

	    # Name of the molecule: "COMPND name"
	    parts = line.split()
	    name = parts[1]

	    # Other lines are either "END" or "ATOM num atom_type x y z"
	    molecule = [name]

	    reading = True
	    while reading:
	        line = reader.readline()
	        # Stop at END, or at end of file if the END line is missing;
	        # otherwise readline would return '' forever.
	        if not line or line.startswith('END'):
	            reading = False
	        else:
	            parts = line.split()
	            molecule.append(parts[2:])

	    return molecule

	def read_all_molecules(reader: TextIO) -> list:
	    """Read zero or more molecules from reader, returning a list of the
	    molecule information.

	    >>> cmpnd1 = 'COMPND T1\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n'
	    >>> cmpnd2 = 'COMPND T2\\nATOM 1 A 0.1 0.2 0.3\\nATOM 2 A 0.2 0.1 0.0\\nEND\\n'
	    >>> infile = StringIO(cmpnd1 + cmpnd2)
	    >>> result = read_all_molecules(infile)
	    >>> result[0]
	    ['T1', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']]
	    >>> result[1]
	    ['T2', ['A', '0.1', '0.2', '0.3'], ['A', '0.2', '0.1', '0.0']]
	    """

	    # The list of molecule information.
	    result = []

	    reading = True
	    while reading:
	        molecule = read_molecule(reader)
	        if molecule:  # None is treated as False in an if statement
	            result.append(molecule)
	        else:
	            reading = False
	    return result

	if __name__ == '__main__':
	    # The with statement closes the file even if reading raises an error.
	    with open('multimol.pdb', 'r') as molecule_file:
	        molecules = read_all_molecules(molecule_file)
	    print(molecules)

The work of actually reading a single molecule has been put in a function of its own that must return some false value (such as None) if it can’t find another molecule in the file. This function checks the first line it tries to read to see whether there is actually any data left in the file. If not, it returns immediately to tell read_all_molecules that the end of the file has been reached. Otherwise, it pulls the name of the molecule out of the first line and then reads the molecule’s atoms one at a time down to the END line.
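The end-of-file signal is easiest to see by calling the function twice on a file that holds only one molecule. The following is a sketch rather than the chapter's listing: it repeats a condensed version of read_molecule so that it runs on its own, and it uses a StringIO object to stand in for a real file.

```python
from io import StringIO
from typing import TextIO


def read_molecule(reader: TextIO) -> list:
    """Condensed variant of read_molecule, repeated so this example is
    self-contained. Returns None when there is nothing left to read."""
    line = reader.readline()
    if not line:
        return None
    molecule = [line.split()[1]]
    line = reader.readline()
    while line and not line.startswith('END'):
        molecule.append(line.split()[2:])
        line = reader.readline()
    return molecule


infile = StringIO('COMPND T1\nATOM 1 N 0.1 0.2 0.3\nEND\n')
first = read_molecule(infile)   # reads the only molecule
second = read_molecule(infile)  # nothing left, so this is None
print(first)   # prints ['T1', ['N', '0.1', '0.2', '0.3']]
print(second)  # prints None
```

The second call finds no first line at all, which is exactly the situation read_all_molecules relies on to stop its loop.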

Notice that read_molecule uses exactly the same trick to spot the END that marks the end of a single molecule as read_all_molecules uses to spot the end of the file.
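One way to see that the two loops share a shape is to write that shape once. The sketch below is an illustration only, not how the chapter's code is organized: collect_until is a hypothetical helper that keeps calling a "read the next thing" function until a test says the result marks the end.

```python
from io import StringIO
from typing import Callable


def collect_until(read_next: Callable, is_finished: Callable) -> list:
    """Call read_next repeatedly, collecting each result, and stop as soon
    as is_finished(result) is true. Hypothetical helper for illustration."""
    items = []
    reading = True
    while reading:
        item = read_next()
        if is_finished(item):
            reading = False
        else:
            items.append(item)
    return items


infile = StringIO('ATOM 1 N 0.1 0.2 0.3\nATOM 2 N 0.2 0.1 0.0\nEND\n')

# For atom lines, "finished" means the END line (or an unexpected end of file).
atom_lines = collect_until(
    infile.readline,
    lambda line: not line or line.startswith('END'))
print(len(atom_lines))  # prints 2
```

Reading whole molecules fits the same pattern: the "next thing" would be a molecule, and "finished" would mean the value None.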