Multiline Records

Not every data record will fit onto a single line. Here is a file in simplified Protein Data Bank (PDB) format that describes the arrangement of atoms in ammonia:

 COMPND AMMONIA
 ATOM 1 N 0.257 -0.363 0.000
 ATOM 2 H 0.257 0.727 0.000
 ATOM 3 H 0.771 -0.727 0.890
 ATOM 4 H 0.771 -0.727 -0.890
 END

The first line is the name of the molecule. Each subsequent line, down to the one containing END, specifies the ID, type, and XYZ coordinates of one of the atoms in the molecule.
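
As a rough illustration of that layout, the sketch below splits one ATOM line into its fields. The function name parse_atom_line and the choice to convert the ID to an int and the coordinates to floats are assumptions made for this example only; the code later in this section keeps the fields as strings.

```python
def parse_atom_line(line: str) -> tuple:
    """Split 'ATOM id type x y z' into (id, type, (x, y, z))."""
    parts = line.split()
    # parts[0] is the literal word 'ATOM'; the remaining fields follow in order.
    atom_id = int(parts[1])
    atom_type = parts[2]
    coordinates = tuple(float(value) for value in parts[3:6])
    return atom_id, atom_type, coordinates


print(parse_atom_line('ATOM 1 N 0.257 -0.363 0.000'))
# prints (1, 'N', (0.257, -0.363, 0.0))
```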

Reading this file is straightforward using the techniques that we have built up in this chapter. But what if the file contained two or more molecules, like this:

 COMPND AMMONIA
 ATOM 1 N 0.257 -0.363 0.000
 ATOM 2 H 0.257 0.727 0.000
 ATOM 3 H 0.771 -0.727 0.890
 ATOM 4 H 0.771 -0.727 -0.890
 END
 COMPND METHANOL
 ATOM 1 C -0.748 -0.015 0.024
 ATOM 2 O 0.558 0.420 -0.278
 ATOM 3 H -1.293 -0.202 -0.901
 ATOM 4 H -1.263 0.754 0.600
 ATOM 5 H -0.699 -0.934 0.609
 ATOM 6 H 0.716 1.404 0.137
 END

As always, we tackle this problem by dividing it into smaller ones and solving each of those in turn. Our first algorithm is as follows:

 While there are more molecules in the file:
  Read a molecule from the file
  Append it to the list of molecules read so far

Simple, except the only way to tell whether there is another molecule left in the file is to try to read it. Our modified algorithm is as follows:

 reading = True
 while reading:
  Try to read a molecule from the file
  If there is one:
   Append it to the list of molecules read so far
  else: # nothing left
   reading = False

In Python, this is as follows:

 from typing import TextIO
 from io import StringIO
 
 def read_molecule(reader: TextIO) -> list:
     """Read a single molecule from reader and return it, or return None to
     signal end of file. The first item in the result is the name of the
     compound; each list contains an atom type and the X, Y, and Z coordinates
     of that atom.
 
     >>> instring = 'COMPND TEST\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n'
     >>> infile = StringIO(instring)
     >>> read_molecule(infile)
     ['TEST', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']]
     """
 
     # If there isn't another line, we're at the end of the file.
     line = reader.readline()
     if not line:
         return None
 
     # Name of the molecule: "COMPND name"
     parts = line.split()
     name = parts[1]
 
     # Other lines are either "END" or "ATOM num atom_type x y z"
     molecule = [name]
 
     reading = True
     while reading:
         line = reader.readline()
         # Also stop if the file ends before END, so the loop cannot run forever.
         if not line or line.startswith('END'):
             reading = False
         else:
             parts = line.split()
             molecule.append(parts[2:])
 
     return molecule
 
 def read_all_molecules(reader: TextIO) -> list:
     """Read zero or more molecules from reader, returning a list of the
     molecule information.
 
     >>> cmpnd1 = 'COMPND T1\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n'
     >>> cmpnd2 = 'COMPND T2\\nATOM 1 A 0.1 0.2 0.3\\nATOM 2 A 0.2 0.1 0.0\\nEND\\n'
     >>> infile = StringIO(cmpnd1 + cmpnd2)
     >>> result = read_all_molecules(infile)
     >>> result[0]
     ['T1', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']]
     >>> result[1]
     ['T2', ['A', '0.1', '0.2', '0.3'], ['A', '0.2', '0.1', '0.0']]
     """
 
     # The list of molecule information.
     result = []
 
     reading = True
     while reading:
         molecule = read_molecule(reader)
         if molecule:  # None is treated as False in an if statement
             result.append(molecule)
         else:
             reading = False
     return result
 
 if __name__ == '__main__':
     molecule_file = open('multimol.pdb', 'r')
     molecules = read_all_molecules(molecule_file)
     molecule_file.close()
     print(molecules)

The work of actually reading a single molecule has been put in a function of its own that must return some false value (such as None) if it can’t find another molecule in the file. This function checks the first line it tries to read to see whether there is actually any data left in the file. If not, it returns immediately to tell read_all_molecules that the end of the file has been reached. Otherwise, it pulls the name of the molecule out of the first line and then reads the molecule’s atoms one at a time down to the END line.
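
To see the end-of-file signal in action, the sketch below calls a condensed version of read_molecule three times on an in-memory file that holds two molecules. The sample text and the names first, second, and third are made up for illustration. The function body is a shortened restatement of the one above, not a replacement for it, and its extra check for a file that ends without END is a defensive assumption.

```python
from io import StringIO
from typing import Optional, TextIO


def read_molecule(reader: TextIO) -> Optional[list]:
    """Condensed restatement of read_molecule for this demonstration."""
    line = reader.readline()
    if not line:  # readline returns '' at end of file
        return None
    molecule = [line.split()[1]]
    line = reader.readline()
    # Stop at END, or at end of file if END is missing (assumption).
    while line and not line.startswith('END'):
        molecule.append(line.split()[2:])
        line = reader.readline()
    return molecule


infile = StringIO('COMPND T1\nATOM 1 N 0.1 0.2 0.3\nEND\n'
                  'COMPND T2\nATOM 1 A 0.3 0.2 0.1\nEND\n')
first = read_molecule(infile)
second = read_molecule(infile)
third = read_molecule(infile)

print(first)   # ['T1', ['N', '0.1', '0.2', '0.3']]
print(second)  # ['T2', ['A', '0.3', '0.2', '0.1']]
print(third)   # None
```

The third call finds no line to read, so it returns None, which is the value read_all_molecules tests for when deciding whether to stop.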

Notice that read_molecule uses exactly the same trick to spot the END that marks the end of a single molecule as read_all_molecules uses to spot the end of the file.
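
The two stopping conditions can be observed directly on an in-memory file. This is a minimal sketch; the sample text and the variable names are invented for illustration.

```python
from io import StringIO

reader = StringIO('ATOM 1 N 0.1 0.2 0.3\nEND\n')

atom_line = reader.readline()  # an ordinary data line
end_line = reader.readline()   # the END marker that closes a molecule
past_end = reader.readline()   # nothing left: readline returns ''

print(end_line.startswith('END'))  # True
print(past_end == '')              # True
print(bool(past_end))              # False, so "if not line" detects end of file
```

In both cases a loop keeps reading until it sees a value that means "stop": a line starting with END for one molecule, an empty string for the whole file.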