Not every data record will fit onto a single line. Here is a file in simplified Protein Data Bank (PDB) format that describes the arrangement of atoms in ammonia:
| COMPND AMMONIA |
| ATOM 1 N 0.257 -0.363 0.000 |
| ATOM 2 H 0.257 0.727 0.000 |
| ATOM 3 H 0.771 -0.727 0.890 |
| ATOM 4 H 0.771 -0.727 -0.890 |
| END |
The first line gives the name of the molecule. Each subsequent line, down to the one containing END, specifies the ID, type, and XYZ coordinates of one atom in the molecule.
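As a rough illustration of the record layout just described, one ATOM line can be split into its fields as shown here. The helper name parse_atom_line is hypothetical, and converting the ID to an int and the coordinates to floats is an assumption made for this sketch; the chapter's own code keeps every field as a string.

```python
def parse_atom_line(line: str) -> tuple:
    """Split a line of the form 'ATOM id type x y z' into
    (id, type, (x, y, z)).

    Hypothetical helper for illustration only; it assumes the line is
    well formed and has exactly six whitespace-separated fields.
    """
    parts = line.split()
    atom_id = int(parts[1])      # second field: the atom's ID
    atom_type = parts[2]         # third field: the element symbol
    x, y, z = (float(value) for value in parts[3:6])
    return atom_id, atom_type, (x, y, z)


print(parse_atom_line('ATOM 1 N 0.257 -0.363 0.000'))
# (1, 'N', (0.257, -0.363, 0.0))
```

Splitting on whitespace is enough here because the simplified format separates fields with spaces.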
Reading this file is straightforward using the techniques that we have built up in this chapter. But what if the file contained two or more molecules, like this:
| COMPND AMMONIA |
| ATOM 1 N 0.257 -0.363 0.000 |
| ATOM 2 H 0.257 0.727 0.000 |
| ATOM 3 H 0.771 -0.727 0.890 |
| ATOM 4 H 0.771 -0.727 -0.890 |
| END |
| COMPND METHANOL |
| ATOM 1 C -0.748 -0.015 0.024 |
| ATOM 2 O 0.558 0.420 -0.278 |
| ATOM 3 H -1.293 -0.202 -0.901 |
| ATOM 4 H -1.263 0.754 0.600 |
| ATOM 5 H -0.699 -0.934 0.609 |
| ATOM 6 H 0.716 1.404 0.137 |
| END |
As always, we tackle this problem by dividing it into smaller ones and solving each of those in turn. Our first algorithm is as follows:
| While there are more molecules in the file: |
|     Read a molecule from the file |
|     Append it to the list of molecules read so far |
Simple, except the only way to tell whether there is another molecule left in the file is to try to read it. Our modified algorithm is as follows:
| reading = True |
| while reading: |
|     Try to read a molecule from the file |
|     If there is one: |
|         Append it to the list of molecules read so far |
|     else: # nothing left |
|         reading = False |
In Python, this is as follows:
| from typing import TextIO |
| from io import StringIO |
| |
| def read_molecule(reader: TextIO) -> list: |
|     """Read a single molecule from reader and return it, or return None to |
|     signal end of file. The first item in the result is the name of the |
|     compound; each list contains an atom type and the X, Y, and Z coordinates |
|     of that atom. |
| |
|     >>> instring = 'COMPND TEST\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n' |
|     >>> infile = StringIO(instring) |
|     >>> read_molecule(infile) |
|     ['TEST', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']] |
|     """ |
| |
|     # If there isn't another line, we're at the end of the file. |
|     line = reader.readline() |
|     if not line: |
|         return None |
| |
|     # Name of the molecule: "COMPND name" |
|     parts = line.split() |
|     name = parts[1] |
| |
|     # Other lines are either "END" or "ATOM num atom_type x y z" |
|     molecule = [name] |
| |
|     reading = True |
|     while reading: |
|         line = reader.readline() |
|         if line.startswith('END'): |
|             reading = False |
|         else: |
|             parts = line.split() |
|             molecule.append(parts[2:]) |
| |
|     return molecule |
| |
| def read_all_molecules(reader: TextIO) -> list: |
|     """Read zero or more molecules from reader, returning a list of the |
|     molecule information. |
| |
|     >>> cmpnd1 = 'COMPND T1\\nATOM 1 N 0.1 0.2 0.3\\nATOM 2 N 0.2 0.1 0.0\\nEND\\n' |
|     >>> cmpnd2 = 'COMPND T2\\nATOM 1 A 0.1 0.2 0.3\\nATOM 2 A 0.2 0.1 0.0\\nEND\\n' |
|     >>> infile = StringIO(cmpnd1 + cmpnd2) |
|     >>> result = read_all_molecules(infile) |
|     >>> result[0] |
|     ['T1', ['N', '0.1', '0.2', '0.3'], ['N', '0.2', '0.1', '0.0']] |
|     >>> result[1] |
|     ['T2', ['A', '0.1', '0.2', '0.3'], ['A', '0.2', '0.1', '0.0']] |
|     """ |
| |
|     # The list of molecule information. |
|     result = [] |
| |
|     reading = True |
|     while reading: |
|         molecule = read_molecule(reader) |
|         if molecule: # None is treated as False in an if statement |
|             result.append(molecule) |
|         else: |
|             reading = False |
|     return result |
| |
| if __name__ == '__main__': |
|     molecule_file = open('multimol.pdb', 'r') |
|     molecules = read_all_molecules(molecule_file) |
|     molecule_file.close() |
|     print(molecules) |
The work of actually reading a single molecule has been put in a function of its own that must return some false value (such as None) if it can’t find another molecule in the file. This function checks the first line it tries to read to see whether there is actually any data left in the file. If not, it returns immediately to tell read_all_molecules that the end of the file has been reached. Otherwise, it pulls the name of the molecule out of the first line and then reads the molecule’s atoms one at a time down to the END line.
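To make the end-of-file signal concrete, the following sketch calls a condensed restatement of read_molecule twice on a one-molecule input held in a StringIO. The condensed body and the Optional return annotation are assumptions of this sketch, not the chapter's listing; like the original, it assumes every molecule is terminated by an END line.

```python
from io import StringIO
from typing import Optional, TextIO


def read_molecule(reader: TextIO) -> Optional[list]:
    """Condensed restatement of read_molecule, for illustration only."""
    line = reader.readline()
    if not line:  # readline returns '' once the file is exhausted
        return None

    # First line is "COMPND name"; keep just the name.
    molecule = [line.split()[1]]

    # Remaining lines are "ATOM num atom_type x y z" until "END".
    line = reader.readline()
    while not line.startswith('END'):
        molecule.append(line.split()[2:])
        line = reader.readline()
    return molecule


data = StringIO('COMPND T1\nATOM 1 N 0.1 0.2 0.3\nEND\n')
first = read_molecule(data)   # consumes the only molecule
second = read_molecule(data)  # nothing left to read
print(first)   # ['T1', ['N', '0.1', '0.2', '0.3']]
print(second)  # None
```

The second call returns None because readline yields an empty string at end of file, which is what tells read_all_molecules to stop.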
Notice that read_molecule spots the END line that marks the end of a single molecule with exactly the same trick that read_all_molecules uses to spot the end of the file: each loops while a reading flag is True and sets the flag to False when its marker appears.
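One way to see that the two loops share a shape is to factor the pattern into a single helper. This is only a sketch: collect_until and its parameters read_next and is_stop are hypothetical names, not part of the chapter's code.

```python
from io import StringIO


def collect_until(read_next, is_stop) -> list:
    """Call read_next repeatedly, collecting results until is_stop
    reports that a stop marker has been read.

    Hypothetical helper that mirrors the 'reading' flag loops used in
    read_molecule and read_all_molecules.
    """
    items = []
    reading = True
    while reading:
        item = read_next()
        if is_stop(item):
            reading = False
        else:
            items.append(item)
    return items


reader = StringIO('ATOM 1 N 0.1 0.2 0.3\nATOM 2 H 0.2 0.1 0.0\nEND\nCOMPND NEXT\n')

# Inner loop: stop at the END line that closes one molecule.
atom_lines = collect_until(reader.readline, lambda line: line.startswith('END'))

# Outer loop: stop at end of file, where readline returns ''.
remaining = collect_until(reader.readline, lambda line: not line)

print(len(atom_lines))  # 2
print(remaining)        # ['COMPND NEXT\n']
```

In both calls the only way to learn that the data has run out is to attempt a read and inspect what comes back, which is the point the two functions have in common.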