There are several common ways to organize information in files. The rest of this chapter will show how to apply the various file-reading techniques to these situations and how to develop some algorithms to help with this.
Many data files begin with a header. As described in The Readline Technique, TSDL files begin with a one-line description followed by comments in lines beginning with a #, and the Readline technique can be used to skip that header. The technique ends when we read the first real piece of data, which will be the first line after the description that doesn’t start with a #.
In English, we might try this algorithm to process this kind of file:
| Skip the first line in the file |
| Skip over the comment lines in the file |
| For each of the remaining lines in the file: |
|     Process the data on that line |
The problem with this approach is that we can’t tell whether a line is a comment line until we’ve read it, but we can read a line from a file only once—there’s no simple way to “back up” in the file. An alternative approach is to read the line, skip it if it’s a comment, and process it if it’s not. Once we’ve processed the first line of data, we process the remaining lines:
| Skip the first line in the file |
| Find and process the first line of data in the file |
| For each of the remaining lines: |
|     Process the data on that line |
The thing to notice about this algorithm is that it processes lines in two places: once when it finds the first “interesting” line in the file and once when it handles all of the following lines:
| from typing import TextIO |
| from io import StringIO |
| |
| def skip_header(reader: TextIO) -> str: |
|     """Skip the header in reader and return the first real piece of data. |
| |
|     >>> infile = StringIO('Example\\n# Comment\\n# Comment\\nData line\\n') |
|     >>> skip_header(infile) |
|     'Data line\\n' |
|     """ |
| |
|     # Read the description line |
|     line = reader.readline() |
| |
|     # Find the first non-comment line |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
| |
|     # Now line contains the first real piece of data |
|     return line |
| |
| def process_file(reader: TextIO) -> None: |
|     """Read and print the data from reader, which must start with a single |
|     description line, then a sequence of lines beginning with '#', then a |
|     sequence of data. |
| |
|     >>> infile = StringIO('Example\\n# Comment\\nLine 1\\nLine 2\\n') |
|     >>> process_file(infile) |
|     Line 1 |
|     Line 2 |
|     """ |
| |
|     # Find and print the first piece of data |
|     line = skip_header(reader).strip() |
|     print(line) |
| |
|     # Read the rest of the data |
|     for line in reader: |
|         line = line.strip() |
|         print(line) |
| |
| if __name__ == '__main__': |
|     with open('hopedale.txt', 'r') as input_file: |
|         process_file(input_file) |
In skip_header, we return the first line of real data because once we've read it, we can't read it again (we can move forward in a file but not backward). We'll want to use skip_header in all of the file-processing functions in this section. Rather than copying its code each time we need it, we can put the function in a file called time_series.py (for the Time Series Data Library) and use it in other programs with import time_series, as shown in the next example. This lets us reuse the skip_header code, and if it ever needs to be modified, there is only one copy of the function to edit.
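For reference, here is a sketch of how time_series.py could be laid out: the module holds only the import it needs and the skip_header function shown above (the condensed docstring is our own wording).
| # time_series.py: helpers for reading TSDL-formatted data files. |
| from typing import TextIO |
| |
| def skip_header(reader: TextIO) -> str: |
|     """Skip the header in reader and return the first real piece of data.""" |
|     # Read the description line, then skip over the comment lines. |
|     line = reader.readline() |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
|     # Now line contains the first real piece of data. |
|     return line |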
The next program processes the Hopedale data set to find the smallest number of fox pelts produced in any year. As we progress through the file, we keep the smallest value seen so far in a variable called smallest. That variable is initially set to the value on the first line, since it's the smallest (and only) value seen so far:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def smallest_value(reader: TextIO) -> int: |
|     """Read and process reader and return the smallest value after the |
|     time_series header. |
| |
|     >>> infile = StringIO('Example\\n1\\n2\\n3\\n') |
|     >>> smallest_value(infile) |
|     1 |
|     >>> infile = StringIO('Example\\n3\\n1\\n2\\n') |
|     >>> smallest_value(infile) |
|     1 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
| |
|     # Now line contains the first data value; this is also the smallest value |
|     # found so far, because it is the only one we have seen. |
|     smallest = int(line) |
| |
|     for line in reader: |
|         value = int(line.strip()) |
| |
|         # If we find a smaller value, remember it. |
|         if value < smallest: |
|             smallest = value |
| |
|     return smallest |
| |
| if __name__ == '__main__': |
|     with open('hopedale.txt', 'r') as input_file: |
|         print(smallest_value(input_file)) |
As with any algorithm, there are other ways to write this; for example, we can replace the if statement with this single line:
| smallest = min(smallest, value) |
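With that change, the loop in smallest_value would read:
| for line in reader: |
|     value = int(line.strip()) |
|     smallest = min(smallest, value) |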
We also have data for colored fox fur production in Hebron, Labrador:
| Coloured fox fur production, Hebron, Labrador, 1834-1839 |
| #Source: C. Elton (1942) "Voles, Mice and Lemmings", Oxford Univ. Press |
| #Table 17, p.265--266 |
| #remark: missing value for 1836 |
| 55 |
| 262 |
| - |
| 102 |
| 178 |
| 227 |
The hyphen indicates that data for the year 1836 is missing. Unfortunately, calling read_smallest.smallest_value on the Hebron data produces this error:
| >>> import read_smallest |
| >>> read_smallest.smallest_value(open('hebron.txt', 'r')) |
| Traceback (most recent call last): |
|   File "<stdin>", line 1, in <module> |
|   File "./read_smallest.py", line 16, in smallest_value |
|     value = int(line.strip()) |
| ValueError: invalid literal for int() with base 10: '-' |
The problem is that '-' isn't an integer, so calling int('-') fails. This isn't an isolated problem. In general, we will often need to skip blank lines, comments, or lines containing other “nonvalues” in our data. Real data sets often contain omissions or contradictions; dealing with them is just a fact of scientific life.
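The failure is easy to reproduce at the Python prompt:
| >>> int('-') |
| Traceback (most recent call last): |
|   File "<stdin>", line 1, in <module> |
| ValueError: invalid literal for int() with base 10: '-' |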
To fix our code, we must add a check inside the loop so that a line is processed only if it contains a real value. We will assume that the first value is never missing: if it were, the time series would simply start at the following value. In the TSDL data sets, a missing entry is always marked with a hyphen, so we just need to check for that before trying to convert the string we have read to an integer:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def smallest_value_skip(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the smallest value after the header. Skip missing values, which |
|     are indicated with a hyphen. |
| |
|     >>> infile = StringIO('Example\\n1\\n-\\n3\\n') |
|     >>> smallest_value_skip(infile) |
|     1 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
|     # Now line contains the first data value; this is also the smallest value |
|     # found so far, because it is the only one we have seen. |
|     smallest = int(line) |
| |
|     for line in reader: |
|         line = line.strip() |
|         if line != '-': |
|             value = int(line) |
|             smallest = min(smallest, value) |
| |
|     return smallest |
| |
| if __name__ == '__main__': |
|     with open('hebron.txt', 'r') as input_file: |
|         print(smallest_value_skip(input_file)) |
Notice that the update to smallest is nested inside the check for hyphens.
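An equivalent way to write that loop, if you prefer not to nest the update, is to use continue to skip missing values; this is a variation of ours, not a change to the program above:
| for line in reader: |
|     line = line.strip() |
|     if line == '-': |
|         continue  # Skip missing values entirely. |
|     smallest = min(smallest, int(line)) |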
The file at http://robjhyndman.com/tsdldata/ecology1/lynx.dat (Time Series Data Library [Hyn06]) contains information about lynx pelts in the years 1821–1934. All data values are integers, each line contains many values, the values are separated by whitespace, and for reasons best known to the file’s author, each value ends with a period. (Note that author M. J. Campbell’s name below is misspelled in the original file.)
| Annual Number of Lynx Trapped, MacKenzie River, 1821-1934 |
| #Original Source: Elton, C. and Nicholson, M. (1942) |
| #"The ten year cycle in numbers of Canadian lynx", |
| #J. Animal Ecology, Vol. 11, 215--244. |
| #This is the famous data set which has been listed before in |
| #various publications: |
| #Cambell, M.J. and Walker, A.M. (1977) "A survey of statistical work on |
| #the MacKenzie River series of annual Canadian lynx trappings for the years |
| #1821-1934 with a new analysis", J.Roy.Statistical Soc. A 140, 432--436. |
| 269. 321. 585. 871. 1475. 2821. 3928. 5943. 4950. 2577. 523. 98. |
| 184. 279. 409. 2285. 2685. 3409. 1824. 409. 151. 45. 68. 213. |
| 546. 1033. 2129. 2536. 957. 361. 377. 225. 360. 731. 1638. 2725. |
| 2871. 2119. 684. 299. 236. 245. 552. 1623. 3311. 6721. 4245. 687. |
| 255. 473. 358. 784. 1594. 1676. 2251. 1426. 756. 299. 201. 229. |
| 469. 736. 2042. 2811. 4431. 2511. 389. 73. 39. 49. 59. 188. |
| 377. 1292. 4031. 3495. 587. 105. 153. 387. 758. 1307. 3465. 6991. |
| 6313. 3794. 1836. 345. 382. 808. 1388. 2713. 3800. 3091. 2985. 3790. |
| 674. 81. 80. 108. 229. 399. 1132. 2432. 3574. 2935. 1537. 529. |
| 485. 662. 1000. 1590. 2657. 3396. |
Now we’ll develop a program to find the largest value. To process the file, we will break each line into pieces and strip off the periods. Our algorithm is the same as it was for the fox pelt data: find and process the first line of data in the file, and then process each of the subsequent lines. However, the notion of “processing a line” needs to be examined further because there are many values per line. Our refined algorithm, shown next, uses nested loops to handle the notion of “for each line and for each value on that line”:
| Find the first line containing real data after the header |
| For each piece of data in the current line: |
|     Process that piece |
| |
| For each of the remaining lines of data: |
|     For each piece of data in the current line: |
|         Process that piece |
Once again we are processing lines in two different places. That is a strong hint that we should write a helper function to avoid duplicate code. Rewriting our algorithm and making it specific to the problem of finding the largest value makes this clearer:
| Find the first line of real data after the header |
| Find the largest value in that line |
| |
| For each of the remaining lines of data: |
|     Find the largest value in that line |
|     If that value is larger than the previous largest, remember it |
The helper function we need is one that finds the largest value in a line, which means it must split the line into pieces. The string method split will split around the whitespace, but we still have to remove the period at the end of each value.
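Here is what those two steps look like at the prompt:
| >>> '269. 321. 585.'.split() |
| ['269.', '321.', '585.'] |
| >>> int('269.'[:-1]) |
| 269 |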
We can also simplify our code by initializing largest to -1, since that value is guaranteed to be smaller than any of the (positive) values in the file. That way, no matter what the first real value is, it’ll be larger than the “previous” value (our -1) and replace it.
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # The largest value seen so far. |
|     largest = -1 |
|     for value in line.split(): |
|         # Remove the trailing period. |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it. |
|         if v > largest: |
|             largest = v |
| |
|     return largest |
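Since find_largest simply takes the maximum of the converted pieces, it could also be written with the built-in max and a generator expression; this is a variation of ours, not the version used in the rest of this section:
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # Convert each piece, dropping the trailing period, and take the maximum. |
|     return max(int(value[:-1]) for value in line.split()) |
One difference worth noting: on a blank line, this version raises a ValueError (max of an empty sequence), whereas the loop version returns -1, so we stick with the loop version below.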
We now face the same choice as with skip_header: we can put find_largest in a module (possibly time_series), or we can include it in the same file as the rest of the code. We choose the latter this time because the code is specific to this particular data set and problem:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # The largest value seen so far. |
|     largest = -1 |
|     for value in line.split(): |
|         # Remove the trailing period. |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it. |
|         if v > largest: |
|             largest = v |
| |
|     return largest |
| |
| def process_file(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the largest value after the header. There may be multiple pieces |
|     of data on each line. |
| |
|     >>> infile = StringIO('Example\\n 20. 3.\\n 100. 17. 15.\\n') |
|     >>> process_file(infile) |
|     100 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
|     # The largest value so far is the largest on this first line of data. |
|     largest = find_largest(line) |
| |
|     # Check the rest of the lines for larger values. |
|     for line in reader: |
|         large = find_largest(line) |
|         if large > largest: |
|             largest = large |
|     return largest |
| |
| if __name__ == '__main__': |
|     with open('lynx.txt', 'r') as input_file: |
|         print(process_file(input_file)) |
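If you save this code as, say, read_largest.py (a name of our choosing), you can check its doctests from the command line, provided time_series.py is in the same directory:
| $ python -m doctest -v read_largest.py |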
Notice how simple the code in process_file looks! This happened only because we decided to write helper functions. To show how much clearer this makes things, here is the same code written without using time_series.skip_header and find_largest as helper functions:
| from typing import TextIO |
| from io import StringIO |
| |
| def process_file(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the largest value after the header. There may be multiple pieces |
|     of data on each line. |
| |
|     >>> infile = StringIO('Example\\n 20. 3.\\n') |
|     >>> process_file(infile) |
|     20 |
|     >>> infile = StringIO('Example\\n 20. 3.\\n 100. 17. 15.\\n') |
|     >>> process_file(infile) |
|     100 |
|     """ |
| |
|     # Read the description line |
|     line = reader.readline() |
| |
|     # Find the first non-comment line |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
| |
|     # Now line contains the first real piece of data |
| |
|     # The largest value seen so far in the current line |
|     largest = -1 |
| |
|     for value in line.split(): |
| |
|         # Remove the trailing period |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it |
|         if v > largest: |
|             largest = v |
| |
|     # Check the rest of the lines for larger values |
|     for line in reader: |
| |
|         # The largest value seen so far in the current line |
|         largest_in_line = -1 |
| |
|         for value in line.split(): |
| |
|             # Remove the trailing period |
|             v = int(value[:-1]) |
|             # If we find a larger value, remember it |
|             if v > largest_in_line: |
|                 largest_in_line = v |
| |
|         if largest_in_line > largest: |
|             largest = largest_in_line |
|     return largest |
| |
| if __name__ == '__main__': |
|     with open('lynx.txt', 'r') as input_file: |
|         print(process_file(input_file)) |