There are several common ways to organize information in files. The rest of this chapter will show how to apply the various file-reading techniques to these situations and how to develop some algorithms to help with this.
Many data files begin with a header. As described in The Readline Technique, TSDL files begin with a one-line description followed by comments in lines beginning with a #, and the Readline technique can be used to skip that header. The technique ends when we read the first real piece of data, which will be the first line after the description that doesn’t start with a #.
In English, we might try this algorithm to process this kind of file:
| Skip the first line in the file |
| Skip over the comment lines in the file |
| For each of the remaining lines in the file: |
|     Process the data on that line |
The problem with this approach is that we can’t tell whether a line is a comment line until we’ve read it, but we can read a line from a file only once—there’s no simple way to “back up” in the file. An alternative approach is to read the line, skip it if it’s a comment, and process it if it’s not. Once we’ve processed the first line of data, we process the remaining lines:
| Skip the first line in the file |
| Find and process the first line of data in the file |
| For each of the remaining lines: |
|     Process the data on that line |
The thing to notice about this algorithm is that it processes lines in two places: once when it finds the first “interesting” line in the file and once when it handles all of the following lines:
| from typing import TextIO |
| from io import StringIO |
| |
| def skip_header(reader: TextIO) -> str: |
|     """Skip the header in reader and return the first real piece of data. |
| |
|     >>> infile = StringIO('Example\\n# Comment\\n# Comment\\nData line\\n') |
|     >>> skip_header(infile) |
|     'Data line\\n' |
|     """ |
| |
|     # Read the description line |
|     line = reader.readline() |
| |
|     # Find the first non-comment line |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
| |
|     # Now line contains the first real piece of data |
|     return line |
| |
| def process_file(reader: TextIO) -> None: |
|     """Read and print the data from reader, which must start with a single |
|     description line, then a sequence of lines beginning with '#', then a |
|     sequence of data. |
| |
|     >>> infile = StringIO('Example\\n# Comment\\nLine 1\\nLine 2\\n') |
|     >>> process_file(infile) |
|     Line 1 |
|     Line 2 |
|     """ |
| |
|     # Find and print the first piece of data |
|     line = skip_header(reader).strip() |
|     print(line) |
| |
|     # Read the rest of the data |
|     for line in reader: |
|         line = line.strip() |
|         print(line) |
| |
| if __name__ == '__main__': |
|     with open('hopedale.txt', 'r') as input_file: |
|         process_file(input_file) |
In skip_header, we return the first line of real data because once we've read it, we can't read it again (we can move forward in a file but not backward). We'll want to use skip_header in all of the file-processing functions in this section. Rather than copying its code each time we need it, we can put the function in a file called time_series.py (for the Time Series Data Library) and use it in other programs with import time_series, as shown in the next example. This lets us reuse the skip_header code, and if it ever needs to be modified, there is only one copy of the function to edit.
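For reference, here is a sketch of how time_series.py could be laid out: the module holds only the import it needs and the skip_header function shown above (the condensed docstring is our own wording).
| # time_series.py: helpers for reading TSDL-formatted data files. |
| from typing import TextIO |
| |
| def skip_header(reader: TextIO) -> str: |
|     """Skip the header in reader and return the first real piece of data.""" |
|     # Read the description line, then skip over the comment lines. |
|     line = reader.readline() |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
|     # Now line contains the first real piece of data. |
|     return line |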
The next program processes the Hopedale data set to find the smallest number of fox pelts produced in any year. As we progress through the file, we keep the smallest value seen so far in a variable called smallest. That variable is initially set to the value on the first line, since it's the smallest (and only) value seen so far:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def smallest_value(reader: TextIO) -> int: |
|     """Read and process reader and return the smallest value after the |
|     time_series header. |
| |
|     >>> infile = StringIO('Example\\n1\\n2\\n3\\n') |
|     >>> smallest_value(infile) |
|     1 |
|     >>> infile = StringIO('Example\\n3\\n1\\n2\\n') |
|     >>> smallest_value(infile) |
|     1 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
| |
|     # Now line contains the first data value; this is also the smallest value |
|     # found so far, because it is the only one we have seen. |
|     smallest = int(line) |
| |
|     for line in reader: |
|         value = int(line.strip()) |
| |
|         # If we find a smaller value, remember it. |
|         if value < smallest: |
|             smallest = value |
| |
|     return smallest |
| |
| if __name__ == '__main__': |
|     with open('hopedale.txt', 'r') as input_file: |
|         print(smallest_value(input_file)) |
As with any algorithm, there are other ways to write this; for example, we can replace the if statement with this single line:
| smallest = min(smallest, value) |
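With that change, the loop in smallest_value would read:
| for line in reader: |
|     value = int(line.strip()) |
|     smallest = min(smallest, value) |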
We also have data for colored fox fur production in Hebron, Labrador:
| Coloured fox fur production, Hebron, Labrador, 1834-1839 |
| #Source: C. Elton (1942) "Voles, Mice and Lemmings", Oxford Univ. Press |
| #Table 17, p.265--266 |
| #remark: missing value for 1836 |
| 55 |
| 262 |
| - |
| 102 |
| 178 |
| 227 |
The hyphen indicates that data for the year 1836 is missing. Unfortunately, calling read_smallest.smallest_value on the Hebron data produces this error:
| >>> import read_smallest |
| >>> read_smallest.smallest_value(open('hebron.txt', 'r')) |
| Traceback (most recent call last): |
|   File "<stdin>", line 1, in <module> |
|   File "./read_smallest.py", line 16, in smallest_value |
|     value = int(line.strip()) |
| ValueError: invalid literal for int() with base 10: '-' |
The problem is that '-' isn't an integer, so calling int('-') fails. This isn't an isolated problem. In general, we will often need to skip blank lines, comments, or lines containing other “nonvalues” in our data. Real data sets often contain omissions or contradictions; dealing with them is just a fact of scientific life.
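The failure is easy to reproduce at the Python prompt:
| >>> int('-') |
| Traceback (most recent call last): |
|   File "<stdin>", line 1, in <module> |
| ValueError: invalid literal for int() with base 10: '-' |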
To fix our code, we must add a check inside the loop so that a line is processed only if it contains a real value. We will assume that the first value is never missing: if it were, the time series would simply start at the following value. In the TSDL data sets, a missing entry is always marked with a hyphen, so we just need to check for that before trying to convert the string we have read to an integer:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def smallest_value_skip(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the smallest value after the header. Skip missing values, which |
|     are indicated with a hyphen. |
| |
|     >>> infile = StringIO('Example\\n1\\n-\\n3\\n') |
|     >>> smallest_value_skip(infile) |
|     1 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
|     # Now line contains the first data value; this is also the smallest value |
|     # found so far, because it is the only one we have seen. |
|     smallest = int(line) |
| |
|     for line in reader: |
|         line = line.strip() |
|         if line != '-': |
|             value = int(line) |
|             smallest = min(smallest, value) |
| |
|     return smallest |
| |
| if __name__ == '__main__': |
|     with open('hebron.txt', 'r') as input_file: |
|         print(smallest_value_skip(input_file)) |
Notice that the update to smallest is nested inside the check for hyphens.
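An equivalent way to write that loop, if you prefer not to nest the update, is to use continue to skip missing values; this is a variation of ours, not a change to the program above:
| for line in reader: |
|     line = line.strip() |
|     if line == '-': |
|         continue  # Skip missing values entirely. |
|     smallest = min(smallest, int(line)) |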
The file at http://robjhyndman.com/tsdldata/ecology1/lynx.dat (Time Series Data Library [Hyn06]) contains information about lynx pelts in the years 1821–1934. All data values are integers, each line contains many values, the values are separated by whitespace, and for reasons best known to the file’s author, each value ends with a period. (Note that author M. J. Campbell’s name below is misspelled in the original file.)
| Annual Number of Lynx Trapped, MacKenzie River, 1821-1934 |
| #Original Source: Elton, C. and Nicholson, M. (1942) |
| #"The ten year cycle in numbers of Canadian lynx", |
| #J. Animal Ecology, Vol. 11, 215--244. |
| #This is the famous data set which has been listed before in |
| #various publications: |
| #Cambell, M.J. and Walker, A.M. (1977) "A survey of statistical work on |
| #the MacKenzie River series of annual Canadian lynx trappings for the years |
| #1821-1934 with a new analysis", J.Roy.Statistical Soc. A 140, 432--436. |
| 269. 321. 585. 871. 1475. 2821. 3928. 5943. 4950. 2577. 523. 98. |
| 184. 279. 409. 2285. 2685. 3409. 1824. 409. 151. 45. 68. 213. |
| 546. 1033. 2129. 2536. 957. 361. 377. 225. 360. 731. 1638. 2725. |
| 2871. 2119. 684. 299. 236. 245. 552. 1623. 3311. 6721. 4245. 687. |
| 255. 473. 358. 784. 1594. 1676. 2251. 1426. 756. 299. 201. 229. |
| 469. 736. 2042. 2811. 4431. 2511. 389. 73. 39. 49. 59. 188. |
| 377. 1292. 4031. 3495. 587. 105. 153. 387. 758. 1307. 3465. 6991. |
| 6313. 3794. 1836. 345. 382. 808. 1388. 2713. 3800. 3091. 2985. 3790. |
| 674. 81. 80. 108. 229. 399. 1132. 2432. 3574. 2935. 1537. 529. |
| 485. 662. 1000. 1590. 2657. 3396. |
Now we’ll develop a program to find the largest value. To process the file, we will break each line into pieces and strip off the periods. Our algorithm is the same as it was for the fox pelt data: find and process the first line of data in the file, and then process each of the subsequent lines. However, the notion of “processing a line” needs to be examined further because there are many values per line. Our refined algorithm, shown next, uses nested loops to handle the notion of “for each line and for each value on that line”:
| Find the first line containing real data after the header |
| For each piece of data in the current line: |
|     Process that piece |
| |
| For each of the remaining lines of data: |
|     For each piece of data in the current line: |
|         Process that piece |
Once again we are processing lines in two different places. That is a strong hint that we should write a helper function to avoid duplicate code. Rewriting our algorithm and making it specific to the problem of finding the largest value makes this clearer:
| Find the first line of real data after the header |
| Find the largest value in that line |
| |
| For each of the remaining lines of data: |
|     Find the largest value in that line |
|     If that value is larger than the previous largest, remember it |
The helper function we need is one that finds the largest value in a line, which means it must split the line into pieces. The string method split will split around the whitespace, but we still have to remove the period at the end of each value.
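Here is what those two steps look like at the prompt:
| >>> '269. 321. 585.'.split() |
| ['269.', '321.', '585.'] |
| >>> int('269.'[:-1]) |
| 269 |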
We can also simplify our code by initializing largest to -1, since that value is guaranteed to be smaller than any of the (positive) values in the file. That way, no matter what the first real value is, it’ll be larger than the “previous” value (our -1) and replace it.
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # The largest value seen so far. |
|     largest = -1 |
|     for value in line.split(): |
|         # Remove the trailing period. |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it. |
|         if v > largest: |
|             largest = v |
| |
|     return largest |
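Since find_largest simply takes the maximum of the converted pieces, it could also be written with the built-in max and a generator expression; this is a variation of ours, not the version used in the rest of this section:
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # Convert each piece, dropping the trailing period, and take the maximum. |
|     return max(int(value[:-1]) for value in line.split()) |
One difference worth noting: on a blank line, this version raises a ValueError (max of an empty sequence), whereas the loop version returns -1, so we stick with the loop version below.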
We now face the same choice as with skip_header: we can put find_largest in a module (possibly time_series), or we can include it in the same file as the rest of the code. We choose the latter this time because the code is specific to this particular data set and problem:
| from typing import TextIO |
| from io import StringIO |
| import time_series |
| |
| def find_largest(line: str) -> int: |
|     """Return the largest value in line, which is a whitespace-delimited string |
|     of integers that each end with a '.'. |
| |
|     >>> find_largest('1. 3. 2. 5. 2.') |
|     5 |
|     """ |
|     # The largest value seen so far. |
|     largest = -1 |
|     for value in line.split(): |
|         # Remove the trailing period. |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it. |
|         if v > largest: |
|             largest = v |
| |
|     return largest |
| |
| def process_file(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the largest value after the header. There may be multiple pieces |
|     of data on each line. |
| |
|     >>> infile = StringIO('Example\\n 20. 3.\\n 100. 17. 15.\\n') |
|     >>> process_file(infile) |
|     100 |
|     """ |
| |
|     line = time_series.skip_header(reader).strip() |
|     # The largest value so far is the largest on this first line of data. |
|     largest = find_largest(line) |
| |
|     # Check the rest of the lines for larger values. |
|     for line in reader: |
|         large = find_largest(line) |
|         if large > largest: |
|             largest = large |
|     return largest |
| |
| if __name__ == '__main__': |
|     with open('lynx.txt', 'r') as input_file: |
|         print(process_file(input_file)) |
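If you save this code as, say, read_largest.py (a name of our choosing), you can check its doctests from the command line, provided time_series.py is in the same directory:
| $ python -m doctest -v read_largest.py |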
Notice how simple the code in process_file looks! This happened only because we decided to write helper functions. To show how much clearer this makes things, here is the same code written without using time_series.skip_header and find_largest as helper functions:
| from typing import TextIO |
| from io import StringIO |
| |
| def process_file(reader: TextIO) -> int: |
|     """Read and process reader, which must start with a time_series header. |
|     Return the largest value after the header. There may be multiple pieces |
|     of data on each line. |
| |
|     >>> infile = StringIO('Example\\n 20. 3.\\n') |
|     >>> process_file(infile) |
|     20 |
|     >>> infile = StringIO('Example\\n 20. 3.\\n 100. 17. 15.\\n') |
|     >>> process_file(infile) |
|     100 |
|     """ |
| |
|     # Read the description line |
|     line = reader.readline() |
| |
|     # Find the first non-comment line |
|     line = reader.readline() |
|     while line.startswith('#'): |
|         line = reader.readline() |
| |
|     # Now line contains the first real piece of data |
| |
|     # The largest value seen so far in the current line |
|     largest = -1 |
| |
|     for value in line.split(): |
| |
|         # Remove the trailing period |
|         v = int(value[:-1]) |
|         # If we find a larger value, remember it |
|         if v > largest: |
|             largest = v |
| |
|     # Check the rest of the lines for larger values |
|     for line in reader: |
| |
|         # The largest value seen so far in the current line |
|         largest_in_line = -1 |
| |
|         for value in line.split(): |
| |
|             # Remove the trailing period |
|             v = int(value[:-1]) |
|             # If we find a larger value, remember it |
|             if v > largest_in_line: |
|                 largest_in_line = v |
| |
|         if largest_in_line > largest: |
|             largest = largest_in_line |
|     return largest |
| |
| if __name__ == '__main__': |
|     with open('lynx.txt', 'r') as input_file: |
|         print(process_file(input_file)) |