Small computer programs are often educational and entertaining. This column tells the story of a tiny program that, in addition to those qualities, proved useful to a company.
It was the early 1980’s, and the company had just purchased their first personal computers. After I got their primary system up and running, I encouraged people to keep an eye open for tasks around the office that could be done by a program. The firm’s business was public opinion polling, and an alert employee suggested automating the task of drawing a random sample from a printed list of precincts. Because doing the job by hand required a boring hour with a table of random numbers, she proposed the following program.
The input consists of a list of precinct names and an integer m. The output is a list of m of the precincts chosen at random. There are usually a few hundred precinct names (each a string of at most a dozen characters), and m is typically between 20 and 40.
That’s the user’s idea for a program. Do you have any suggestions about the problem definition before we dive into coding?
My primary response was that it was a great idea; the task was ripe for automation. I then pointed out that typing several hundred names, while perhaps easier than dealing with long columns of random numbers, was still a tedious and error-prone task. In general, it’s foolish to prepare a lot of input when the program is going to ignore the bulk of it anyway. I therefore suggested an alternative program.
The input consists of two integers m and n, with m < n. The output is a sorted list of m random integers in the range 0..n – 1 in which no integer occurs more than once.† For probability buffs, we desire a sorted selection without replacement in which each selection occurs with equal probability.
† The real program produced m integers in the range 1..n; I have changed to a zero-based range in this column for consistency with the other ranges in the book, and so that we can use the programs to produce random samples of C arrays. Programmers may count from zero, but pollsters start at one.
When m = 20 and n = 200, the program might produce a 20-element sequence that starts 4, 15, 17, .... The user then draws a sample of size 20 from 200 precincts by counting through the list and marking the 4th, 15th, and 17th names, and so on. (The output is required to be sorted because the hardcopy list isn’t numbered.)
That specification met with the enthusiastic approval of its potential users. After the program was implemented, the task that previously required an hour could be accomplished in a few minutes.
Now look at the problem from the other side: how would you implement the program? Assume that you have a function bigrand( ) that returns a large random integer (much larger than m and n), and a function randint(i, j) that returns a random integer chosen uniformly in the range i..j. Problem 1 looks inside the implementation of such functions.
As soon as we settled on the problem to be solved, I ran to my nearest copy of Knuth’s Seminumerical Algorithms (having copies of Knuth’s three volumes both at home and at work has been well worth the investment). Because I had studied the book carefully a decade earlier, I vaguely recalled that it contained several algorithms for problems like this. After spending a few minutes considering several possible designs that we’ll study shortly, I realized that Algorithm S in Knuth’s Section 3.4.2 was the ideal solution to my problem.
The algorithm considers the integers 0, 1, 2, ..., n – 1 in order, and selects each one by an appropriate random test. By visiting the integers in order, we guarantee that the output will be sorted.
To understand the selection criterion, let’s consider the example that m = 2 and n = 5. We should select the first integer 0 with probability 2/5; a program implements that by a statement like
if (bigrand() % 5) < 2
Unfortunately, we can’t select 1 with the same probability: doing so might or might not give us a total of 2 out of the 5 integers. We will therefore bias the decision and select 1 with probability 1/4 if 0 was chosen but with probability 2/4 if 0 was not chosen. In general, to select s numbers out of r remaining, we’ll select the next number with probability s/r. Here is the pseudocode:
select = m
remaining = n
for i = [0, n)
    if (bigrand() % remaining) < select
        print i
        select--
    remaining--
As long as m ≤ n, the program selects exactly m integers: it can’t select more because when select goes to zero no integer is selected, and it can’t select fewer because when select/remaining goes to one an integer is always selected. The for statement ensures that the integers are printed in sorted order. The above description should help you believe that each subset is equally likely to be picked; Knuth gives a probabilistic proof of that fact.
Knuth’s second volume made the program easy to write. Even including titles, input, output, range checking and the like, the final program required only thirteen lines of Basic. It was finished half an hour after the problem was defined, and was used for years without problems. Here is an implementation in C++:
void genknuth(int m, int n)
{   for (int i = 0; i < n; i++)
        /* select m of remaining n-i */
        if ((bigrand() % (n-i)) < m) {
            cout << i << "\n";
            m--;
        }
}
The program used only a few dozen bytes of memory, and was lightning fast for the company’s problems. This code can be slow, however, when n is very large. Using this algorithm to generate a few random 32-bit positive integers (that is, n = 2³²), for instance, requires about twelve minutes on my computer. Back-of-the-envelope quiz: how long would it take to generate one random 48- or 64-bit integer with this code?
One part of a programmer’s job is solving today’s problem. Another, and perhaps more important, part of the job is to prepare for solving tomorrow’s problems. Sometimes that preparation involves taking classes or studying books like Knuth’s. More often, though, programmers learn by the mental exercise of asking how we might have solved a problem differently. Let’s do that now by exploring the space of possible designs for the sampling problem.
When I talked about the problem to a seminar at West Point, I asked for a better approach than the original problem statement (typing all 200 names to the program). One student suggested photocopying the precinct list, cutting the copy with a paper slicer, shaking the slips in a paper bag, and then pulling out the required number of slips. That cadet showed the “conceptual blockbusting” that is the subject of Adams’s book cited in Section 1.7.†
† Page 57 of that book sketches Arthur Koestler’s views on three kinds of creativity. Ah! insights are his name for originality, and aha! insights are acts of discovery. He would call this cadet’s solution a haha! insight: the low-tech answer to a high-tech question is an act of comic inspiration (as in Solutions 1.10, 1.11 and 1.12).
From now on we’ll confine our search to programs that generate m sorted random integers from 0..n – 1. We’ll start by evaluating Program 1. The algorithmic idea is straightforward, the code is short, it uses just a few words of space, and the run time is fine for this application. The run time is proportional to n, though, which might be prohibitive for some applications. It’s worth a few minutes to study other ways of solving the problem. Sketch as many high-level designs as you can before reading on; don’t worry about implementation details yet.
One solution inserts random integers into an initially empty set until there are enough. In pseudocode, it can be sketched as
initialize set S to empty
size = 0
while size < m do
    t = bigrand() % n
    if t is not in S
        insert t into S
        size++
print the elements of S in sorted order
The algorithm is not biased towards any particular element; its output is random. We are still left with the problem of implementing the set S; think about an appropriate data structure.
In the old days, I would have worried about the relative merits of sorted linked lists, binary search trees, and all the other usual data structure suspects. Today, though, I can exploit the work put into the C++ Standard Template Library and call a set a set:
void gensets(int m, int n)
{   set<int> S;
    while (S.size() < m)
        S.insert(bigrand() % n);
    set<int>::iterator i;
    for (i = S.begin(); i != S.end(); ++i)
        cout << *i << "\n";
}
I was delighted that the real code is the same length as the pseudocode. This program generates and prints a million sorted, distinct 31-bit random integers in about 20 seconds on my machine. Since it takes about 12.5 seconds just to generate and to print a million unsorted integers without worrying about duplicates, the set operations consume just 7.5 seconds.
The C++ STL specification guarantees that each insertion will run in O(log m) time, and iterating through the set will take O(m) time, so the complete program will take O(m log m) time (when m is small compared to n). The data structure is, however, expensive in terms of space: my 128-megabyte machine starts thrashing around m = 1,700,000. The next column considers possible implementations of the set.
Another way to generate a sorted subset of random integers is to shuffle an n-element array that contains the numbers 0..n – 1, and then sort the first m for the output. Knuth’s Algorithm P in Section 3.4.2 shuffles the array x[0..n – 1].
for i = [0, n)
    swap(i, randint(i, n-1))
Ashley Shepherd and Alex Woronow observed that in this problem we need shuffle only the first m elements of the array, which gives this C++ program:
void genshuf(int m, int n)
{   int i, j;
    int *x = new int[n];
    for (i = 0; i < n; i++)
        x[i] = i;
    for (i = 0; i < m; i++) {
        j = randint(i, n-1);
        int t = x[i]; x[i] = x[j]; x[j] = t;
    }
    sort(x, x+m);
    for (i = 0; i < m; i++)
        cout << x[i] << "\n";
    delete [] x;
}
The algorithm uses n words of memory and O(n+m log m) time, but the technique of Problem 1.9 reduces the time to O(m log m). We can view this algorithm as an alternative to Program 2 in which we represent the set of selected elements in x[0..i – 1] and the set of unselected elements in x[i..n – 1]. By explicitly representing the unselected elements we avoid testing whether the new element was previously chosen. Unfortunately, because this method uses O(n) time and memory, it is usually dominated by the algorithm from Knuth.
The functions we have seen so far offer several different solutions to the problem, but they by no means cover the design space. Suppose, for instance, that n is a million and m is n – 10. We might generate a sorted random sample of 10 elements, and then report the integers that aren’t there. Next, suppose that n is 2³¹ and m is ten million. We could generate eleven million integers, sort them, scan through to remove duplicates, and then generate a ten-million-element sorted sample of that. Solution 9 describes a particularly clever search-based algorithm due to Bob Floyd.
This column illustrates several important steps in the programming process. Although the following discussion presents the stages in one natural order, the design process is more active: we hop from one activity to another, usually iterating through each many times before arriving at an acceptable solution.
Understand the Perceived Problem. Talk with the user about the context in which the problem arises. Problem statements often include ideas about solutions; like all early ideas, they should be considered but not to the exclusion of others.
Specify an Abstract Problem. A clean, crisp problem statement helps us first to solve this problem and then to see how this solution can be applied to other problems.
Explore the Design Space. Too many programmers jump too quickly to “the” solution to their problem; they think for a minute and code for a day rather than thinking for an hour and coding for an hour. Using informal high-level languages helps us to describe designs: pseudocode represents control flow and abstract data types represent the crucial data structures. Knowledge of the literature is invaluable at this stage of the design process.
Implement One Solution. On lucky days our exploration of the design space shows that one program is far superior to the rest; at other times we have to prototype the top few to choose the best. We should strive to implement the chosen design in straightforward code, using the most powerful operations available.†
† Problem 6 describes a class exercise that I graded on programming style. Most students turned in one-page solutions and received mediocre grades. Two students who had spent the previous summer on a large software development project turned in beautifully documented five-page programs, broken into a dozen functions, each with an elaborate heading. They received failing grades. The best programs worked in five lines of code, and the inflation factor of sixty was too much for a passing grade. When the pair complained that they were employing standard software engineering tools, I should have quoted Pamela Zave: “The purpose of software engineering is to control complexity, not to create it.” A few more minutes spent looking for a simple program might have spared them hours documenting their complex program.
Retrospect. Polya’s delightful How to Solve It can help any programmer become a better problem solver. On page 15 he observes that “There remains always something to do; with sufficient study and penetration, we could improve any solution, and, in any case, we can always improve our understanding of the solution.” His hints are particularly helpful for looking back at programming problems.
1. The C library rand( ) function typically returns about fifteen random bits. Use that function to implement a function bigrand( ) to return at least 30 random bits, and a function randint(l, u) to return a random integer in the range [l, u].
2. Section 12.1 specified that all m-element subsets be chosen with equal probability, which is a stronger requirement than choosing each integer with probability m/n. Describe an algorithm that chooses each element equiprobably, but chooses some subsets with greater probability than others.
3. Show that when m<n/2, the expected number of member tests made by the set-based algorithm before finding a number not in the set is less than 2.
4. Counting the member tests in the set-based program leads to many interesting problems in combinatorics and probability theory. How many member tests does the program make on the average as a function of m and n? How many does it make when m = n? When is it likely to make more than m tests?
5. This column described several algorithms for one problem; all are available at this book’s web site. Measure their performance on your system, and describe when each is appropriate as a function of constraints on run time, space, etc.
6. [Class Exercise] I assigned the problem of generating sorted subsets twice in an undergraduate course on algorithms. Before the unit on sorting and searching, students had to write a program for m = 20 and n = 400; the primary grading criterion was a short, clean program — run time was not an issue. After the unit on sorting and searching they had to solve the problem again with m = 5,000,000 and n = 1,000,000,000, and the grade was based primarily on run time.
7. [V. A. Vyssotsky] Algorithms for generating combinatorial objects are often profitably expressed as recursive functions. Knuth’s algorithm can be written as
void randselect(m, n)
    pre  0 <= m <= n
    post m distinct integers from 0..n-1 are
         printed in decreasing order
    if m > 0
        if (bigrand() % n) < m
            print n-1
            randselect(m-1, n-1)
        else
            randselect(m, n-1)
This program prints the random integers in decreasing order; how could you make them appear in increasing order? Argue the correctness of the resulting program. How could you use the basic recursive structure of this program to generate all m-element subsets of 0.. n – 1?
8. How would you generate a random selection of m integers from 0..n – 1 with the constraint that the final output must appear in random order? How would you generate a sorted list if duplicate integers were allowed in the list? What if both duplicates and random order were desired?
9. [R. W. Floyd] When m is near n, the set-based algorithm generates many random numbers that are discarded because they are already present in the set. Can you find an algorithm that uses only m random numbers, even in the worst case?
10. How could you select one of n objects at random, where you see the objects sequentially but you don’t know the value of n beforehand? For concreteness, how would you read a text file, and select and print one random line, when you don’t know the number of lines in advance?
11. [M. I. Shamos] A promotional game consists of a card containing 16 spots, which hide a random permutation of the integers 1..16. The player rubs the spots off the card to expose the hidden integers. If the integer 3 is ever exposed then the card loses; if 1 and 2 (in either order) are both revealed then the card wins. Describe the steps you would take to compute the probability that randomly choosing a sequence of spots wins the game; assume that you may use at most one hour of CPU time.
12. My first version of one of the programs in this column had a nasty bug that caused the program to die when m = 0. For other values of m, it produced output that looked random but wasn’t. How would you test a program that produces a sample to ensure that its output is indeed random?
Volume 2 of Don Knuth’s Art of Computer Programming is Seminumerical Algorithms. The third edition was published by Addison-Wesley in 1998. Chapter 3 (the first half of the book) is about random numbers, and Chapter 4 (the second half) is about arithmetic. Section 3.4.2 on “Random Sampling and Shuffling” is especially relevant to this column. If you need to build a random number generator or functions that perform advanced arithmetic, then you need this book.