Chapter 8. Threads and Collection Classes

In this chapter, we’ll look at how threads interact with the collection classes provided by Java. We’ll examine some synchronization issues and how they affect our choice and usage of collection classes.

The collection classes comprise many of the classes in the java.util package (and, in J2SE 5.0, some of the classes in the java.util.concurrent package). Collection classes are used to store objects in some data structure: a hashtable, an array, a queue, and so on. Collection classes interact with Java threads in a few areas:

We begin this chapter with an overview of the collection classes; the overview addresses the thread-safety of the various classes. Next, we show how some of the newer collection classes interact with threads. And finally, we show a common design pattern in which multiple threads use the collections: the producer-consumer model.

In the beginning, Java provided only a few collection classes. In fact, in the first version of Java, these classes weren’t even referred to as collection classes; they were simply utility classes that Java provided. For the most part, these classes were all threadsafe; the early collection classes were designed to prevent developers from inadvertently corrupting the data structures by using them in different threads without appropriate data synchronization.

JDK 1.2 introduced the formal idea of collection classes. The few existing data collection classes from JDK 1.0 and 1.1 were integrated into this framework, which was expanded to include new classes and new interfaces. Defining the collection classes in terms of interfaces made it possible to write programs that could use different collection implementations at runtime.

The controversial change introduced in JDK 1.2 is that most of the collection classes are now, by default, not threadsafe. Threadsafe versions of the classes exist, but the decision was made to allow the developer to manage the thread-safety of the classes. Two factors inform this decision: the performance of synchronization and the requirements of algorithms that use the collection. We’ll have more to say on those issues in the next section. JDK 1.3 and 1.4 added some minor extensions to these collection classes.

J2SE 5.0 introduces a number of new collection classes. Some of these classes are simple extensions to the existing collections framework, but many of them have two distinguishing features. First, their internal implementation makes heavy use of the new J2SE 5.0 synchronization tools (in particular, atomic variables). Second, most of these classes are designed to be used by multiple threads and support the idea of thread notification when data in the collection becomes available.

The majority of collection classes are not threadsafe. When used in multithreaded programs, access to them must always be controlled by some synchronization. This synchronization can be accomplished either by using a “wrapper” class that synchronizes every access operation (using the Collections class, which we’ll show later) or by using explicit synchronization:

java.util.BitSet

A bit set stores an array of boolean (1-bit) values. The size of the array can grow at will. A BitSet saves space compared to an array of booleans since the bit set can store multiple values in one long variable. Despite its name, it does not implement the Set interface.

java.util.HashSet (a Set)

A class that implements an unordered set collection.

java.util.TreeSet (a SortedSet)

A class that implements a sorted (and ordered) set collection.

java.util.HashMap (a Map)

A class that implements an unordered map collection.

java.util.WeakHashMap (a Map)

This class is similar to the HashMap class. The difference is that the key is a weak reference—it is not counted as a reference by the garbage collector. The class therefore deletes key-value pair entries from the map when the key has been garbage collected.

java.util.TreeMap (a SortedMap)

A class that implements a sorted (and ordered) map collection. This map is based on binary trees (so operations require log(n) time to perform).

java.util.ArrayList (a List)

A class that implements a list collection. Internally, it is implemented using arrays.

java.util.LinkedList (a List and a Queue)

A class that implements a list and a queue collection, providing a doubly linked list.

java.util.LinkedHashSet (a Set)

A set collection that sorts its items based on the order in which they are added to the set.

java.util.LinkedHashMap (a Map)

A map collection that sorts its items based on the order in which they are added to the map.

java.util.IdentityHashMap (a Map)

A map collection. Unlike all other maps, this class uses == for key comparison instead of the equals() method.

java.util.EnumSet (a Set)

A specialized set collection that holds only Enum values.

java.util.EnumMap (a Map)

A specialized map collection that uses only Enum values as keys.

java.util.PriorityQueue (a Queue)

An unbounded queue in which retrieval is not based on order (LIFO or FIFO); instead, objects are removed according to which is the smallest (as determined by the Comparable or Comparator interface).

A number of classes in the java.util.concurrent package are designed to provide thread notification when their contents change. They are inherently threadsafe since they are expected to be used by multiple threads simultaneously. They simplify usage of collections by providing semantics to handle out-of-space and out-of-data conditions within the collection. We’ll see examples of this later in the chapter.

When writing a multithreaded program, the most important question when using a collection class is how to manage its synchronization. Synchronization can be managed by the collection class itself or managed explicitly in your program code. In the examples in this section, we’ll explore both of these options.

Let’s take the simple case first. In the simple case, you’re going to use the collection class to store shared data. Other threads retrieve data from the collection, but there won’t be much (if any) manipulation of the data.

In this case, the easiest object to use is a threadsafe collection (e.g., a Vector or Hashtable). That’s what we’ve done all along in our CharacterEventHandler class:

package javathreads.examples.ch08.example1;

import java.util.*;

public class CharacterEventHandler {
    private Vector listeners = new Vector( );

    public void addCharacterListener(CharacterListener cl) {
        listeners.add(cl);
    }

    public void removeCharacterListener(CharacterListener cl) {
        listeners.remove(cl);
    }

    public void fireNewCharacter(CharacterSource source, int c) {
        CharacterEvent ce = new CharacterEvent(source, c);
        CharacterListener[] cl = (CharacterListener[] )
                                  listeners.toArray(new CharacterListener[0]);
        for (int i = 0; i < cl.length; i++)
            cl[i].newCharacter(ce);
    }
}

In this case, using a vector is sufficient for our purposes. If multiple threads call methods of this class at the same time, there is no conflict. Because the listeners collection is threadsafe, we can call its add(), remove(), and toArray() methods at the same time without corrupting the internal state of the Vector object. Strictly speaking, there is a race condition here in our use of the toArray() method; we’ll talk about that a little more in the next section. But the point is that none of the methods on the vector see data in an inconsistent state because the Vector class itself is threadsafe.

A second option would be to use a thread-unsafe class (e.g., the ArrayList class) and manage the synchronization explicitly:

package javathreads.examples.ch08.example2;
...
public class CharacterEventHandler {
    private ArrayList listeners = new ArrayList( );
    public synchronized void addCharacterListener(CharacterListener cl) {
        ...
    }
    public synchronized void removeCharacterListener(CharacterListener cl) {
        ...
    }
    public synchronized void fireNewCharacter(CharacterSource source, int c) {
        ...
    }
}

Or we could have synchronized the class like this:

package javathreads.examples.ch08.example3;
...
public class CharacterEventHandler {
    private ArrayList listeners = new ArrayList( );

    public void addCharacterListener(CharacterListener cl) {
        synchronized(listeners) {
            listeners.add(cl);
        }
    }

    public void removeCharacterListener(CharacterListener cl) {
        synchronized(listeners) {
            listeners.add(cl);
        }
    }
    public void fireNewCharacter(CharacterSource source, int c) {
        CharacterEvent ce = new CharacterEvent(source, c);
        CharacterListener[] cl;
        synchronized(listeners) {
            cl = (CharacterListener[])
                                  listeners.toArray(new CharacterListener[0]);
        }
        for (int i = 0; i < cl.length; i++)
            cl[i].newCharacter(ce);
    }
}

In this example, it doesn’t matter whether we synchronize on the collection object or the event handler object (this); either one ensures that two threads are not simultaneously calling methods of the ArrayList class.

Our third option is to use a synchronized version of the thread-unsafe collection class. Most thread-unsafe collection classes have a synchronized counterpart that is threadsafe. The threadsafe collections are constructed by calling one of these static methods of the Collections class:

Set s = Collections.synchronizedSet(new HashSet(...));
Set s = Collections.synchronizedSet(new LinkedHashSet(...));
SortedSet s = Collections.synchronizedSortedSet(new TreeSet(...));
Set s = Collections.synchronizedSet(EnumSet.noneOf(obj.class));
Map m = Collections.synchronizedMap(new HashMap(...));
Map m = Collections.synchronizedMap(new LinkedHashMap(...));
SortedMap m = Collections.synchronizedSortedMap(new TreeMap(...));
Map m = Collections.synchronizedMap(new WeakHashMap(...));
Map m = Collections.synchronizedMap(new IdentityHashMap(...));
Map m = Collections.synchronizedMap(new EnumMap(...));
List list = Collections.synchronizedList(new ArrayList(...));
List list = Collections.synchronizedList(new LinkedList(...));

Any of these options protect access to the data held in the collection. This is accomplished by wrapping the collection in an object that synchronizes every method of the collection interface: it is not designed as an optimally synchronized class. Also note that the queue collection is not supported: the Collections class supplies only wrapper classes that support the Set, Map, and List interfaces. This is not a problem in most cases since the majority of the queue implementations are synchronized (and synchronized optimally).

A more complex case arises when you need to perform multiple operations atomically on the data held in the collection. In the previous section, we were able to use simple synchronization because the methods that needed to access the data in the collection performed only a single operation. The addCharacterListener() method has only a single statement that uses the listeners vector, so it doesn’t matter if the data changes after the addCharacterListener( ) method calls the listeners.add() method. As a result, we could rely on the container to provide the synchronization.

We alluded to a race condition in the fireNewCharacter() method. After we call the listeners.toArray( ) method, we cycle through the listeners to call each of them. It’s entirely possible that another thread will call the removeCharacterListener() method while we’re looping through the array. That won’t corrupt the array or the listeners vector, but in some algorithms, it could be a problem: we’d be operating on data that has been removed from the vector. In our program, that’s okay: we have a benign race condition. In other programs, that may not necessarily be the case.

Suppose we want to keep track of all the characters that players typed correctly (or incorrectly). We could do that with the following:

package javathreads.examples.ch08.example4;

import java.util.*;
import javax.swing.*;
import javax.swing.table.*;

public class CharCounter {
    public HashMap correctChars = new HashMap( );
    public HashMap incorrectChars = new HashMap( );
    private AbstractTableModel atm;

    public void correctChar(int c) {
        synchronized(correctChars) {
            Integer key = new Integer(c);
            Integer num = (Integer) correctChars.get(key);
            if (num == null)
                correctChars.put(key, new Integer(1));
            else correctChars.put(key, new Integer(num.intValue( ) +1));
            if (atm != null)
                atm.fireTableDataChanged( );
        }
    }

    public int getCorrectNum(int c) {
        synchronized(correctChars) {
            Integer key = new Integer(c);
            Integer num = (Integer) correctChars.get(key);
            if (num == null)
                return 0;
            return num.intValue( );
        }
    }

    public void incorrectChar(int c) {
        synchronized(incorrectChars) {
            Integer key = new Integer(c);
            Integer num = (Integer) incorrectChars.get(key);
            if (num == null)
                incorrectChars.put(key, new Integer(-1));
            else incorrectChars.put(key, new Integer(num.intValue( ) -1));
            if (atm != null)
                atm.fireTableDataChanged( );
        }
    }

    public int getIncorrectNum(int c) {
        synchronized(incorrectChars) {
            Integer key = new Integer(c);
            Integer num = (Integer) incorrectChars.get(key);
            if (num == null)
                return 0;
            return num.intValue( );
        }
    }

    public void addModel(AbstractTableModel atm) {
        this.atm = atm;
    }
}

Here we use thread-unsafe collections to hold the data and explicitly synchronize access around the code that uses the collections. It would be insufficient to use Hashtable collections in this code without also synchronizing as we did earlier. Although retrieving a value from a hashtable is threadsafe, and replacing an element in a hashtable is also threadsafe, the overall operation is not threadsafe: both collection operations must be atomic for the algorithm to succeed. Otherwise, two threads could simultaneously retrieve the stored value, increment it, and store it; the net result would be a score that is one less than it should be.

The moral of the story is that using a threadsafe collection does not guarantee the correctness of your program. Because of the explicit synchronization required in this example, we were able to use a thread-unsafe collection (although, as we’ll see in Chapter 14, if you use a threadsafe collection, it’s unlikely you’ll see much difference.)

Many situations call for using each element of a collection. Such is the case in our example. We called the toArray() method, which returns an array containing every element in the vector. The Vector and Hashtable classes also have methods that return a java.util.Enumeration object that contains every element in the collection. More generally, all collection classes implement one or more methods that return a java.util.Iterator object. The iterator also contains every element in the collection.

Each of these techniques presents special synchronization concerns. We’ve already seen that looping through the array returned by the toArray() method can lead to a situation where we’re accessing an element in the array that no longer appears in the collection. That may or may not be a problem for your program; if it is a problem, the solution is to synchronize access around the loop that uses the array.

Enumeration objects are difficult to use without explicit synchronization. The enumeration keeps state information about the collection; if the collection is modified while the enumeration is active, the enumeration may become confused. The enumeration fails in some random way, possibly through an unexpected runtime exception (e.g., a NullPointerException).

To use an enumeration of a collection that may also be used by multiple threads, you should synchronize on the collection object itself:

package javathreads.examples.ch08.example5;
...
    public void fireNewCharacter(CharacterSource source, int c) {
        CharacterEvent ce = new CharacterEvent(source, c);
        Enumeration e;
               synchronized(listeners) {
               e = listeners.elements( );
               while (e.hasMoreElements( )) {
                       ((CharacterListener) e.nextElement( )).newCharacter(ce);
                  }
         }
    }
}

You could synchronize the method instead, as long as your collection is not used in any unsynchronized method. The point is that the enumeration and all uses of the collection must be locked by the same synchronization object.

Iterators behave somewhat differently. If the underlying collection of an iterator is modified while the iterator is active, the next access to the iterator throws a ConcurrentModificationException, which is also a runtime exception. Unlike enumerations, if the iterator fails, the underlying collection can still be used. The way in which iterators fail immediately after a modify operation is called “fail-fast.”

The safest way to use an iterator is to make sure its use is synchronized by its underlying collection (just as we did with the enumeration)—or to make sure that it and the collection are protected by the same synchronization lock.

You can’t rely upon the fail-fast nature of iterators. Iterators make a best effort at determining when the underlying collection has changed, but in the absence of synchronization, it’s impossible to predict when the failure occurs. Once a failure has occurred, the iterator is not useful for further processing. Therefore, you’re left with a situation where some elements of the collection have been processed and others have not.

Two classes—CopyOnWriteArrayList and CopyOnWriteArraySet —provide special iteration semantics. These classes are designed to copy the underlying collection when necessary so that iterators operate on a snapshot of the data from the time the iterator was created. Modifying the collection while the iterator is active creates a copy of the collection for the iterator.

This is an expensive operation, both in terms of time and memory usage. However, it ensures that iterators can be used from unsynchronized code because the iterators end up operating on old copies of the data. So, the iterators never throw a concurrent modification exception.

These classes are designed for cases where modifications to the collection are rare and the iterator of the collection is used frequently by multiple threads. This allows the iterators to be unsynchronized and still be threadsafe; as long as the updates are rare enough, this yields better overall performance. Note, however, that race conditions are still possible with this technique; it’s essentially the same type of operation as we saw earlier with the toArray() method. The difference is when the copying occurs: when you call the toArray() method, a copy of the collection is made at that time. With the copy-on-write classes, the copy is made whenever the collection is modified.

One of the more common patterns in threaded programming is the producer/consumer pattern. The idea is to process data asynchronously by partitioning requests among different groups of threads. The producer is a thread (or group of threads) that generates requests (or data) to be processed. The consumer is a thread (or group of threads) that takes those requests (or data) and acts upon them. This pattern provides a clean separation that allows for better thread design and makes development and debugging easier. This pattern is shown in Figure 8-1.

The producer/consumer pattern is common for threaded programs because it is easy to make threadsafe. We just need to provide a safe way to pass data from the producer to the consumer. Data needs to be synchronized only during the small period of time when it is being passed between producer and consumer. We can use simple synchronization since the acts of inserting and removing from the collection are single operations. Therefore, any threadsafe vector, list, or queue can be used.

The queue-based collection classes added to J2SE 5.0 were specifically designed for this model. The queue data type is perfect to use for this pattern since it has the simple semantics of adding and removing a single element (with an optional ordering of the requests). Furthermore, blocking queues provide thread-control functionality: this allows you to focus on the functionality of your program while the queue takes care of thread and space management issues. Of course, if you need control over such issues, you can use a nonblocking queue and use your own explicit synchronization and notification.

Here’s a simple producer that uses a blocking queue:

package javathreads.examples.ch08.example6;

import java.util.*;
import java.util.concurrent.*;

public class FibonacciProducer implements Runnable {
    private Thread thr;
    private BlockingQueue<Integer> queue;

    public FibonacciProducer(BlockingQueue<Integer> q) {
        queue = q;
        thr = new Thread(this);
        thr.start( );
    }

    public void run( ) {
        try {
            for(int x=0;;x++) {
                Thread.sleep(1000);
                queue.put(new Integer(x));
                System.out.println("Produced request " + x);
            }
        } catch (InterruptedException ex) {
        }
    }
}

The producer is implemented to run in a separate thread; it uses the queue to store requests to be processed. We’re using a blocking queue because we want the queue to handle the case where the producer gets too far ahead of the consumer. When that happens, we want the producer to block (so that it does not produce any more requests until the consumer catches up).

Here’s the consumer:

package javathreads.examples.ch08.example6;

import java.util.concurrent.*;

public class FibonacciConsumer implements Runnable {
    private Fibonacci fib = new Fibonacci( );
    private Thread thr;
    private BlockingQueue<Integer> queue;

    public FibonacciConsumer(BlockingQueue<Integer> q) {
        queue = q;
        thr = new Thread(this);
        thr.start( );
    }

    public void run( ) {
        int request, result;
        try {
            while (true) {
                request = queue.take( ).intValue( );
                result = fib.calculateWithCache(request);
                System.out.println(
                        "Calculated result of " + result + " from " + request);
            }
        } catch (InterruptedException ex) {
        }
    }
}

The consumer also runs in its own thread. It blocks until a request is in the queue, at which point it calculates a Fibonacci number based on the request. The actual calculation is performed by the Fibonacci class available in the online examples (along with a testing program).

Notice that the producer and consumer threads are decoupled: the producer never directly calls the consumer (and vice versa). This allows us to interchange different producers without affecting the consumer. It also allows us to have multiple producers serviced by a single consumer, or multiple consumers servicing a single producer. More generally, we can vary the number of either based on performance needs or user requirements.

The queue has also hidden all of the interesting thread code. When the queue is full, the producer blocks: it waits on a condition variable. Later, when the consumer takes an element from the queue, it notifies the waiting producer. A similar situation arises when the consumer calls the take() method on an empty queue. You could write all the condition variable code to handle this, but it’s far easier to allow the queue to do it for you.

We chose to calculate a Fibonacci number in our test program because we used a recursive algorithm that takes an increasingly long time to compute. It’s interesting to watch how the producer and consumer interact in this case. In the beginning, the consumer is blocked a lot of the time because it can calculate the Fibonacci number in less than one second (the time period between requests from the producer). Later, the producer spends most of its time blocked because it has overwhelmed the consumer and filled the queue.

If you have a multiprocessor machine, you can run the example with multiple consumer threads, but eventually the result is the same: the calculations take too long for the consumers to keep up.

So, which are the best collections to use? Obviously, no single answer fits all cases. However, here are some general suggestions. By adhering to these suggestions, we can narrow the choice of which collection to use.

In this chapter, we have examined how threads interact with Java’s collection classes. We’ve seen the synchronization requirements imposed by different classes and how to handle those requirements effectively. We’ve also examined how these classes can be used for the common design pattern known as the producer/consumer pattern.