Chapter 5

Reference counting

The algorithms considered so far have all been indirect. Each has traced the graph of live objects from a set of known roots to identify all live objects. In this chapter, we consider the last class of fundamental algorithms, reference counting [Collins, 1960]. Rather than tracing reachable objects and then inferring that all unvisited objects must be garbage, reference counting operates directly on objects as references are created or destroyed.

Reference counting maintains a simple invariant: an object is presumed to be live if and only if the number of references to that object is greater than zero.1 Reference counting therefore associates a reference count with each object managed; typically this count is stored as an additional slot in the object’s header. In its most naïve implementation, shown in Algorithm 5.1, reference counts are incremented or decremented as references to objects are created or destroyed. Procedure Write increments the reference count of the new target and then decrements the count of the old target. Note that it is called even for updates of local variables. We also assume it is called to write null into local variables before each procedure returns. The operations addReference and deleteReference increment and decrement respectively the reference counts of their object argument. Note that it is essential that the reference counts are adjusted in this order (lines 9–10) to prevent premature reclamation of the target in the case when the old and the new targets of the pointer are the same, that is, src[i]=ref. Once a reference count is zero (line 20), the object can be freed and the reference counts of all its children decremented, which may in turn lead to their reclamation and so on recursively.

The Write method in Algorithm 5.1 is an example of a write barrier. For these, the compiler emits a short code sequence around the actual pointer write. As we shall see later in this book, mutators are required to execute barriers in many systems. More precisely, they are required whenever collectors do not consider the liveness of the entire object graph, atomically with respect to the mutator. Such collectors may execute concurrently, either in lock-step with the mutator as for reference counting or asynchronously in another thread. Alternatively, the collector may process different regions of the heap at different frequencies, as do generational collectors. In all these cases, mutator barriers must be executed in order to preserve the invariants of the collector algorithm.

Algorithm 5.1: Simple reference counting

 1 New():

 2 ref ← allocate()

 3 if ref = null

 4  error “Out of memory”

 5 rc(ref) ← 0

 6 return ref

 7

 8 atomic Write(src, i, ref):

 9 addReference(ref)

10 deleteReference(src[i])

11 src[i] ← ref

12

13 addReference(ref):

14 if ref ≠ null

15  rc(ref) ← rc(ref) + 1

16

17 deleteReference(ref):

18 if ref ≠ = null

19  rc(ref) ← rc(ref) — 1

20  if rc(ref) = 0

21   for each fld in Pointers(ref)

22    deleteReference(*fld)

23   free(ref)

5.1    Advantages and disadvantages of reference counting

There are a number of reasons why reference counting might be an attractive option. Memory management costs are distributed throughout the computation. Potentially, reference counting can recycle memory as soon as an object becomes garbage (but we shall see below why this may not always be desirable). Consequently, it may continue to operate satisfactorily in a nearly full heap, unlike tracing collectors which need some headroom. Since reference counting operates directly on the sources and targets of pointers, the locality of a reference counting algorithm may be no worse than that of its client program. Client programs can use destructive updates rather than copying objects if they can prove that an object is not shared. Reference counting can be implemented without assistance from or knowledge of the run-time system. In particular, it is not necessary to know the roots of the program. Reference counting can reclaim some memory even if parts of the system are unavailable: this is particularly useful in distributed systems [Rodrigues and Jones, 1998].

For these reasons, reference counting has been adopted for many systems including programming language implementations (early versions of Smalltalk and Lisp; also awk, perl and python); applications such as Photoshop, Real Networks’ Rhapsody Music Service, Océ printers, scanners and document management systems; as well as operating systems’ file managers. Libraries to support safe reclamation of objects are widely available for languages like C++ that do not yet require automatic memory management. Such libraries often use smart pointers to access objects. Smart pointers typically override constructors and operators such as assignment, either to enforce unique ownership of objects or to provide reference counting. Unique pointers ensure that an object has a single ‘owner’. When this owner is destroyed, the object also can be destroyed. For example, the next C++ standard is expected to include a unique_ptr template. Many C++ programmers use smart pointers to provide reference counting to manage memory automatically. The best known smart pointer library for C++ is the Boost library,2 which provides reference counting through shared pointer objects. One drawback of smart pointers is that they have semantics different from those of the raw pointers that they imitate [Edelson, 1992].

Unfortunately, there are also a number of disadvantages to reference counting. First, reference counting imposes a time overhead on the mutator. In contrast to the tracing algorithms we considered in earlier chapters, Algorithm 5.1 redefined all pointer Read and Write operations in order to manipulate reference counts. Even non-destructive operations such as iteration require the reference counts of each element in the list to be incremented and then decremented as a pointer moves across a data structure such as a list. From a performance point of view, it is particularly undesirable to add overhead to operations that manipulate registers or thread stack slots. For this reason alone, this naïve algorithm is impractical for use as a general purpose, high volume, high performance memory manager. Fortunately, as we shall see, the cost of reference counted pointer manipulations can be reduced substantially.

Second, both the reference count manipulations and the pointer load or store must be a single atomic action in order to prevent races between mutator threads which would risk premature reclamation of objects. It is insufficient to protect the integrity of the reference count operation alone. For now, we simply assert that actions are atomic, without explaining how this might be achieved. We reconsider this in Chapter 18 when examine reference counting and concurrency in detail. Some smart pointer libraries that provide reference counting require careful use by the programmer if races are to be avoided. For example, in the Boost library, concurrent threads can read the same shared_ptr instance simultaneously, or can modify different shared_ptr instances simultaneously, but the library enforces atomicity only upon reference count manipulations. The combination of pointer read or write and reference count increment is not a single atomic action. Thus, the application programmer must take care to prevent races to update a pointer slot, which might lead to undefined behaviour.

Third, naïve reference counting turns read-only operations into ones requiring stores to memory (to update reference counts). Similarly, it requires reading and writing the old referent of a pointer field when changing that field to refer to a different object. These writes ‘pollute’ the cache and induce extra memory traffic.

Fourth, reference counting cannot reclaim cyclic data structures (that is, data structures that contain references to themselves). Even if such a structure is isolated from the rest of the object graph — it is unreachable — the reference counts of its components will never drop to zero. Unfortunately, self-referential structures are common (doubly-linked lists, trees whose nodes hold a back pointer to the root, and so on), although their frequency varies widely between applications [Bacon and Rajan, 2001].

Fifth, in the worst case, the number of references to an object could be equal to the number of objects in the heap. This means that the reference count field must be pointer sized, that is, a whole slot. Given that the average size of nodes in object-oriented languages is small (for example, Java instance objects are typically 20–64 bytes long [Dieckmann and Hölzle, 1999, 2001; Blackburn et al, 2006a], and Lisp cons cells usually fit into two or three slots), this overhead can be significant.

Finally, reference counting may still induce pauses. When the last reference to the head of a large pointer structure is deleted, reference counting must recursively delete each descendant of the root. Boehm [2004] suggests that thread-safe implementations of reference counting may even lead to longer maximum pause times than tracing collectors. Weizenbaum [1969] suggested lazy reference counting: rather than immediately freeing garbage pointer structures, deleteReference adds an object with a zero reference count to a to-be-freed list, without destroying its contents. When the object is later acquired by the allocator, its children can be processed similarly, without recursive freeing. Unfortunately, this technique allows large garbage structures to be hidden by smaller ones, and hence increases overall space requirements [Boehm, 2004].

Let us now see the extent to which we can resolve two of the major problems facing reference counting: the cost of reference count manipulations and collecting cyclic garbage. It turns out that common solutions to both of these problems involve a stop-the-world pause. We mention these in this chapter but examine how this requirement can be relaxed in Chapter 18.

5.2    Improving efficiency

There are two ways in which the efficiency of reference counting can be improved. Either the number of barrier operations must be reduced or expensive synchronised operations must be replaced by cheaper, unsynchronised ones. Blackburn and McKinley [2003] define a useful taxonomy of solutions.

Deferral Deferred reference counting trades fine grained incrementality (the immediate recovery of all garbage) for efficiency. The identification of some garbage objects is deferred to a reclamation phase at the end of some period. These schemes eliminate some barrier operations.

Coalescing Many reference count adjustments are temporary and hence ‘unnecessary’; programmers often remove them by hand. In some special cases, this can also be done by the compiler. However, it is also possible to do this more generally at run time by tracking only the state of an object at the beginning and end of some period. This coalesced reference counting ignores all but the first modification to a field of an object in each period.

Buffering Buffered reference counting also defers identification of garbage. However, unlike deferred or coalesced reference counting, it buffers all reference count increments and decrements for later processing. Only the collector thread is allowed to apply reference count modifications. Buffering considers when to apply reference count modifications, not whether to.

One fundamental obstacle to efficiency is that object reference counts are part of the global state of the program, but operations on (thread) local state are usually more efficient. The three classes of solution above share a common approach to this problem: they divide execution into periods or epochs. Within an epoch, some or all synchronised reference counting operations can be either eliminated or replaced by unsynchronised writes (to thread-local buffers). Identification of garbage is performed only at the end of an epoch, either with mutator threads halted, or concurrently by a separate collector thread (or threads). In all cases, changes to local reference count state are not revealed (that is, applied to the global state using an atomic action) until an epoch boundary.

In this chapter, we consider two sequential approaches, deferred and coalesced reference counting. Here, collection epochs are separated by stop-the-world pauses to repair reference counts. In Chapter 18, we shall see first how buffered reference counting devolves responsibility for reference count manipulations to a concurrent thread, and then how we can coalesce reference counts concurrently.

Image

Figure 5.1: Deferred reference counting schematic, showing whether reference counting operations on pointer loads or stores should be deferred or be performed eagerly. The arrows indicate the source and target of pointers loaded or stored.

Blackburn and McKinley [2003], doi: 10.1145/949305.949336. © 2003 Association for Computing Machinery, Inc. Reprinted by permission.

5.3    Deferred reference counting

Manipulating reference counts is expensive compared with the cost to the mutator of simple tracing algorithms. Although, as we shall see in later chapters, generational and concurrent algorithms also impose a small overhead on the mutator, these are much lower than the overhead of safely manipulating reference counts. To overwrite a pointer, Write in Algorithm 5.1 executed a dozen or so instructions (though in some cases the compiler could statically elide some tests). The reference count adjustments must be atomic operations and be kept consistent with pointer updates. Furthermore, Write modifies both the old and new targets of the field in question, possibly polluting the cache with dirty words that will not be reused soon. Optimisation to remove matching increments and decrements is error prone if done by hand, but has proved effective as a compiler optimisation [Cann and Oldehoeft, 1988].

Most high-performance reference counting systems (for example, that of Blackburn and McKinley [2003]) use deferred reference counting. The overwhelming majority of pointer loads are to local and temporary variables, that is, to registers or stack slots. Long ago, Deutsch and Bobrow [1976] showed how to remove reference count manipulations from these operations by adjusting counts only when pointers are stored into heap objects. Figure 5.1 shows an abstract view of deferred reference counting in which operations on heap objects are performed immediately but those involving stacks or registers are deferred. There is, of course, a cost to pay. If reference count manipulations on local variables are ignored, then counts will no longer be accurate. It is therefore no longer safe to reclaim an object just because its reference count is zero. In order to reclaim any garbage, deferred reference counting must periodically correct counts during stop-the-world pauses. Fortunately, these pauses are likely to be short compared with those of tracing collectors, such as mark-sweep [Ungar, 1984].

Algorithm 5.2 loads object references using the simple, unbarriered implementation of Read from Chapter 1. Similarly, references can also be written to roots using an unbarriered store (line 14). In contrast, writes to heap objects must be barriered. In this case, the reference count of the new target is incremented as usual (line 17). However, if decrementing the reference count of the old target causes it to drop to zero, the Write barrier adds the object whose reference count is zero to a zero count table (ZCT) rather than immediately reclaiming it (line 26). The zero count table can be implemented in a variety of ways, for example with a bitmap [Baden, 1983] or a hash table [Deutsch and Bobrow, 1976]. An object with a reference count of zero cannot be reclaimed at this point because there might be an uncounted reference to it from the program stack. Conceptually, the zero count table contains objects whose reference counts are zero but may be live. Depending on the implementation of the zero count table and whether it is desirable to limit the size of the zero count table, we can also choose to remove the new target from the zero count table when writing a reference into a heap object, as its true reference count must be positive (line 19).

Algorithm 5.2: Deferred reference counting

 1 New():

 2 ref ← allocate()

 3 if ref = null

 4  collect()

 5  ref ← allocate()

 6  if ref = null

 7   error “Out of memory”

 8 rc(ref) ← 0

 9 add(zct, ref)

10 return ref

11

12 Write(src, i, ref):

13 if src = Roots

14  src[i] ← ref

15 else

16  atomic

17   addReference(ref)

18   remove(zct, ref)

19   deleteReferenceToZCT(src[i])

20   src[i] ← ref

21

22 deleteReferenceToZCT(ref):

23 if ref ≠ null

24  rc(ref) ← rc(ref) – 1

25  if rc(ref) = 0

26   add(zct, ref)

/* defer freeing */

27

28 atomic collect():

29 for each fld in Roots

/* mark the stacks */

30  addReference(*fld)

31 sweepZCT()

32 for each fld in Roots

/* unmark the stacks */

33  deleteReferenceToZCT(*fld)

34

35 sweepZCT():

36 while not isEmpty(zct)

37  ref ← remove(zct)

38  if rc(ref) = 0

/* now reclaim garbage */

39   for each fld in Pointers(ref)

40    deleteReference(*fld)

41   free(ref)

However, at some point garbage objects must be collected if the program is not to run out of memory. Periodically, for example when the allocator fails to return memory to New, all threads are stopped while each object in the zero count table is scrutinised to determine whether its true reference count should be zero. An object in the zero count table with reference count zero can only be live if there are one or more references to it from the roots. The simplest way to discover this is to scan the roots and ‘mark’ any objects found by incrementing their reference counts (line 29). After this, no object referenced from the stack can have a reference count of zero, so any object with a zero count must be garbage. We could now sweep the entire heap, as with mark-sweep collection (for example, Algorithm 2.3), looking for and reclaiming ‘unmarked’ objects with zero reference counts. However, it is sufficient to confine this search to the zero count table. The entries in the zero count table are scanned and any objects with zero counts are immediately processed and freed, in the same way as in the simple Algorithm 5.1. Finally, the ‘mark’ operations must be reversed: the stack is scanned again and the reference counts of any objects found are decremented (reverted to their previous value). If an object’s reference count becomes zero, it is reinstated in the zero count table.

Deferred reference counting removes the cost of manipulating reference counts on local variables from the mutator. Several, older, studies have suggested that it can reduce the cost of pointer manipulations by 80% or more [Ungar, 1984; Baden, 1983]. Given the increased importance of locality, we can speculate that its performance advantage over naïve reference counting will be even larger on modern hardware. However, reference count adjustments due to object field updates must still be performed eagerly rather than deferred, and must be atomic. Next, we explore how to replace expensive atomic reference count manipulations caused by updates to objects’ fields with simple instructions, and how to reduce the number of modifications necessary.

5.4    Coalesced reference counting

Deferred reference counting addresses the problem of the cost of applying reference count operations to local variables. This leaves the question of how to reduce the reference counting overhead of writing to objects’ pointer fields. Levanoni and Petrank [1999] observed that, in any period and for any object field, only the values of the field at the start and the end of the period need be considered; intermediate operations can be ignored. Thus several states of the object can be coalesced to just these two. For example, suppose an object X has a field f which originally refers to object O0, and that this field is repeatedly modified to refer to objects O1, O2,…, On. This would incur the reference counting operations

rc(O0), rc(O1)++, rc(O1) , rc(O2)++, , rc(On)++.

The pairs of intermediate operations (shown boxed) cancel each other and can be omitted. Levanoni and Petrank eliminate them by copying objects to a local log before their first modification in an epoch. When a pointer slot of a clean object is updated, Algorithm 5.3 logs the object by saving its address and the values of its pointer fields to a local update buffer (line 5). The modified object is marked as dirty. Note that an initialising store need not be logged as the object’s reference fields will be null. Instead, we need only to record the object as dirty.

The log procedure attempts to avoid duplicating entries in the thread’s local log by first appending the original values of the object’s pointer fields to the log (line 11). Next it checks that src is still not dirty, and only then is the entry committed by writing src to the log (appendAndCommit), tagged so that it can be recognised as an object entry rather than a field entry, and the log’s internal cursor is advanced (line 13). The object is marked dirty by writing a pointer to this log entry in its header. Note that even if a race leads to records being created in more than one thread’s local buffer, the algorithm guarantees that all these records will contain identical information so it does not matter to which log’s entry the header points. Note that, depending on the processor’s memory consistency model, this write barrier may not require any synchronisation operations.

Algorithm 5.3: Coalesced reference counting: write barrier

 1 me ← myThreadId

 2

 3 Write(src, i, ref):

 4 if not dirty(src)

 5  log (src)

 6 src[i] ← ref

 7

 8 log (src) :

 9 for each fld in Pointers(src)

10  if *fld ≠ null

11   append(updates[me], *fld)

12 if not dirty(src)

13  slot ← appendAndCommit(updates[me], src)

14  setDirty(src, slot)

15

16 dirty(src):

17 return logPointer(src) ≠ CLEAN

18

19 setDirty(src, slot)

20 logPointer(src) ← slot

/* address of entry for src in updates[me] */

Later, we will discuss how coalesced reference counts can be processed concurrently with mutator threads, but here we simply stop the world periodically to process the logs. At the start of each collection cycle, Algorithm 5.4 halts every thread, transfers their update buffers to the collector’s log, and allocates fresh ones. As we noted above, race conditions mean that an entry for an object may appear in more than one thread’s update buffer. This is harmless provided the collector processes each dirty object only once. The processReferenceCounts procedure tests whether an object is still dirty before updating the reference counts. The counts of the children of an object before its first modification in this epoch are decremented, and then those of its children at the time of the collection are incremented. In a simple system, any object whose reference count drops to zero could be freed recursively. However, if reference counting on local variables is deferred, or if for efficiency the algorithm does not guarantee to process all increments before decrements, we simply remember any object whose count has dropped to zero. The algorithm cleans the object so that it will not be processed again in this cycle. Pointers to an object’s previous children can be found directly from the log entry. Its current children can be found from the object itself (recall that the log contains a reference to that object). Notice that there is opportunity for prefetching objects or reference count fields in both the increment and decrement loops [Paz and Petrank, 2007].

Algorithm 5.4: Coalesced reference counting: update reference counts

 1 atomic collect():

 2 collectBuffers()

 3 processReferenceCounts()

 4 sweepZCT()

 5

 6 collectBuffers():

 7 collectorLog ← []

 8 for each t in Threads

 9  collectorLog ← collectorLog + updates[t]

10

11 processReferenceCounts():

12 for each entry in collectorLog

13  obj ← objFromLog(entry)

14  if dirty(obj)

/* Do not process duplicates */

15   logPointer(obj) ← CLEAN

16   incrementNew(obj )

17   decrementOld(entry)

18

19 decrementOld(entry):

20 for each fld in Pointers(entry)

/* use the values in the collector’s log */

21  child ← *fld

22  if child ≠ null

23   rc(child) ← rc(child) – 1

24   if rc(child) = 0

25    add(zct, child)

26

27 incrementNew(obj):

28 for each fld in Pointers(obj)

/* use the values in the object */

29  child ← *fld

30  if child ≠ null

31   rc(child) ← rc(child) + 1

Image

Figure 5.2: Coalesced reference counting: if A was modified in the previous epoch, for example by overwriting the reference to C with a reference to D, A’s reference fields will have been copied to the log. The old referent C can be found in the collector’s log and the most recent new referent D can be found directly from A.

Let us look at the example in Figure 5.2. Suppose that A was modified in the previous epoch to swing its pointer from C to D. The old values of the object’s fields (B and C) will have been recorded in a log which has been passed to the collector (shown on the left of the figure). The collector will therefore decrement the reference counts of B and C and increment those of B and D. This retains the original value of B’s reference count since the pointer from A to B was never modified.

Thus, through a combination of deferred reference counting and coalescing, much of reference counting’s overhead on the mutator has been removed. In particular, we have removed any necessity for mutator threads to employ expensive synchronisation operations. However, this benefit has come at some cost. We have reintroduced pauses for garbage collection although we expect these to be shorter than those required for tracing collection. We have reduced the promptness of collection (since no object is reclaimed until the end of an epoch) and added space overheads for the buffers and zero count table. Coalesced reference counting may also require the collector to decrement and then increment the same children of unmodified slots.

5.5    Cyclic reference counting

Because the reference counts of objects making up a cyclic data structure must necessarily be at least one, reference counting on its own cannot reclaim such structures. However, cycles are common, created both by application programmers and by the run-time system. Applications often use doubly-linked lists and circular buffers. Object-relation mapping systems may require that databases know their tables and vice versa. Some real-world structures are naturally cyclic, such as roads in geographical information systems. Lazy functional language implementations commonly use cycles to express recursion [Turner, 1979, the Y combinator]. A number of techniques have been proposed to solve the cycle problem; we review some of these now.

The simplest approach is to combine reference counting with occasional, backup tracing collection. The hope is that most objects will not be reachable from cycles and hence will be reclaimed promptly by reference counting; the remaining cyclic structures will be handled by the tracing collector. This simply reduces the frequency of full, tracing collections. At the language level, Friedman and Wise [1979] observed that cycles can only be created in pure functional languages by recursive definitions, and hence can be treated specially provided certain restrictions are observed. Bobrow [1980] required the programmer to identify reference counted groups of cells, which are then collected en masse.

Several authors have suggested distinguishing writes of pointers that close cycles from those of other pointers [Friedman and Wise, 1979; Brownbridge, 1985; Salkild, 1987; Pepels et al, 1988; Axford, 1990]. Normal references are denoted strong and cycle-closing ones weak. If strong pointers are never allowed to form a cycle, then the strong-pointer graph will be amenable to standard reference counting. Brownbridge’s algorithm has been widely cited. In brief, each object is given a strong and a weak reference count. The write barrier checks the strength of pointers and targets, and weakens any pointer that would close a cycle. Reference deletion may require the strength of pointers to be modified in order to preserve the invariants that all reachable objects are strongly reachable without creating any cycles of strong references. Unfortunately, this algorithm is unsafe and may reclaim objects prematurely: see Salkild’s counter-example [Jones, 1996, Chapter 6.5]. Salkild [1987] amended the algorithm to make it safe but at the cost of non-termination in some cases. Pepels et al [1988] provided a very complex solution but it is expensive both in terms of space, with double the space overheads of normal reference counting, and in terms of performance, having twice the cost of standard reference counting in most cases and being exponential in the worst case.

The most widely adopted mechanisms for handling cycles through reference counting use a technique called trial deletion. The key observation is that it is not necessary for a backup tracing collector to visit the whole live object graph. Instead, its attention can be confined to those parts of the graph where a pointer deletion might have created a garbage cycle. Note that:

•  In any garbage pointer structure, all reference counts must be due to internal pointers (that is, pointers between objects within the structure).

•  Garbage cycles can arise only from a pointer deletion that leaves a reference count greater than zero.

Partial tracing algorithms take advantage of these observations by tracing the subgraph rooted at an object suspected of being garbage. These algorithms trial-delete each reference encountered by temporarily decrementing reference counts, in effect removing the contribution of these internal pointers. If the reference count of any object remains non-zero, it must be because there is a pointer to the object from outside the subgraph, and hence neither the object nor its transitive closure is garbage.

The Recycler [Bacon et al, 2001; Bacon and Rajan, 2001; Paz et al, 2007] supports concurrent cyclic reference counting. In Algorithm 5.5, we show the simpler, synchronous, version, deferring the asynchronous collector to Chapter 15. The cycle collection algorithm operates in three phases.

1.  First, the collector traces partial graphs, starting from objects identified as possible members of garbage cycles, decrementing reference counts due to internal pointers (markCandidates). Objects visited are coloured grey.

2.  Second, the reference count of each node in these subgraphs is checked: if it is nonzero, the object must be live due to a reference external to the subgraph being traced, and so the effect of the first phase must be undone (scan), recolouring live grey objects black. Other grey objects are coloured white.

3.  Finally, any members of the subgraph that are still white must be garbage and are reclaimed (collectCandidates).

Algorithm 5.5: The Recycler

 1 New() :

 2 ref ← allocate()

 3 if ref = null

 4  collect()

/* the cycle collector */

 5  ref ← allocate()

 6  if ref = null

 7   error “Out of memory”

 8 rc(ref) ← 0

 9 return ref

10

11 addReference(ref):

12 if ref ≠ null

13  rc(ref) ← rc(ref) + 1

14  colour(ref) ← black

/* cannot be in a garbage cycle */

15

16 deleteReference(ref ) :

17 if ref ≠ null

18  rc(ref) ← rc(ref) – 1

19  if rc(ref) = 0

20   release(ref)

21  else

22   candidate(ref )

/* might isolate a garbage cycle */

23

24 release(ref):

25 for each fld in Pointers(ref)

26  deleteReference(fld)

27 colour(ref) ← black

/* objects on the free—list are black */

28 if not ref in   candidates

/* deal with candidates later */

29  free(ref)

30

31 candidate(ref) :

/* colour as a candidate and add to the set */

32 if colour(ref) ≠ purple

33  colour(ref) ← purple

34  candidates ← candidates ∪ {ref}

35

36 atomic collect():

37 markCandidates()

38 for each ref in candidates

39  scan(ref)

40 collectCandidates()

41 markCandidates():

42 for ref in candidates

43  if colour(ref) = purple

44   markGrey(ref)

45  else

46   remove(candidates, ref)

47   if colour(ref) = black && rc(ref) = 0

48    free(ref)

49

50 markGrey(ref):

51 if colour(ref) ≠ grey

52  colour(ref) ← grey

53  for each fld in Pointers(ref)

54   child ← *fld

55   if child ≠ null

56    rc(child) ← rc(child) — 1

/* trial deletion */

57    markGrey(child)

58

59 scan(ref):

60 if colour(ref) = grey

61  if rc(ref) > 0

62   scanBlack(ref)

/* there must be an external reference */

63  else

64   colour(ref) ← white

/* looks like garbage… */

65   for each fld in Pointers(ref)

/* …so continue */

66    child ← *fld

67    if child ≠ null

68     scan(child)

69

70 scanBlack(ref):

/* repair the reference counts of live data */

71 colour(ref) ← black

72 for each fld in Pointers(ref)

73  child ← *fld

74  if child ≠ null

75   rc(child) ← rc(child) + 1

/* undo the trial deletion */

76   if colour(child) ≠ black

77    scanBlack(child)

78 collectCandidates():

79 while not isEmpty(candidates)

80  ref ← remove(candidates)

81  collectWhite(ref)

82

83 collectWhite(ref):

84 if colour(ref) = white && not ref in candidates

85  colour(ref) ← black

/* free—list objects are black */

86  for each fld in Pointers(ref)

87   child ← *fld

88   if child ≠ null

89    collectWhite(child)

90  free(ref)

In its synchronous mode, the Recycler uses five colours to identify nodes. As usual, black means live (or free) and white is garbage. Grey is now a possible member of a garbage cycle, and we add the colour purple to indicate objects that are candidates for roots of garbage cycles.

Deleting any reference other than the last to an object may isolate a garbage cycle. In this case, Algorithm 5.5 colours the object purple and adds it to a set of candidate members of garbage cycles (line 22). Otherwise, the object is garbage and its reference count must be zero. Procedure release resets its colour to black, processes its children recursively and, if it is not a candidate, frees the object. The reclamation of any objects in the candidates set is postponed to the markCandidates phase. For example, in Figure 5.3a, some reference to object A has been deleted. A’s reference count was non-zero, so it has been added to the candidates set.

In the first phase of collecting cycles, the markCandidates procedure establishes the extent of possible garbage structures, and removes the effect of internal references from the counts. It considers every object in the set of garbage candidates. If the object is still purple (hence, no references to the object have been added since it was added to the set), its transitive closure is marked grey. Otherwise it is removed from the set and, if it is a black object with reference count zero, it is freed. As markGrey traces a reference, the reference count of its target is decremented. Thus, in Figure 5.3b, the subgraph rooted at A has been marked grey and the contribution of references internal to the subgraph has been removed from the reference counts.

In the second phase of collection, each candidate and its grey transitive closure is scanned for external references. If a reference count is non-zero, it can only be because there is a reference to this object from outside the grey sub-graph. In this case, the effect of markGrey is undone by scanBlack: reference counts are incremented and objects are reverted to black. On the other hand, if the reference count is zero, the object is coloured white and the scan continues to its children. Note that at this point we cannot say that a white object is definitely garbage as it might be revisited later by scanBlack starting from another node in the subgraph. For example, in Figure 5.3b, objects reference count was non-zero, Y and Z have zero reference counts but are externally reachable via X. When scan reaches X, which has a non-zero reference count, it will invoke scanBlack on the grey transitive closure of X, restoring reference counts, leaving the graph in the state shown in Figure 5.3c.

Image

Figure 5.3: Cyclic reference counting. The first field of each object is its reference count.

Image

Figure 5.4: The synchronous Recycler state transition diagram, showing mutator and collector operations and the colours of objects.

With kind permission from Springer Science+Business Media: Bacon and Rajan [2001], figure 3, page 214.

Finally, the third, collectWhite phase reclaims white (garbage) objects. The set of candidates is emptied, and the colour of each object inspected. Each white object found is freed (its colour reset to black) and its children inspected recursively. Note that the algorithm does not process child objects found to be in the candidates set, but defers reclaiming these to a subsequent iteration of the inner loop of collectCandidates in this cycle.

The asynchronous Recycler algorithm improves over earlier trial deletion algorithms, such as that of Martinez et al [1990], which performed the trial deletion eagerly as soon as a candidate was discovered. Lins [1992] also processed candidates lazily like the Recycler in the hope that candidates will be eliminated by later mutator actions, which might either remove the last reference so that the object can be freed immediately or add a fresh reference to it. Unfortunately, Lins performed all three phases of the collection cycle separately on each object, which led to complexity quadratic in the size of the graph in the worst case. In contrast, the complexity of the Recycler is O(N + E), where N is the number of nodes and E the number of edges. This seemingly small change made a substantial difference to performance, reducing the garbage collection time for moderately sized Java programs from minutes (Lins) to a maximum of a few milliseconds (Recycler).

Further improvements can be gained by recognising statically that certain classes of object, including but not limited to those that contain no pointers, can never be members of cycles. The Recycler allocates objects of these types as green rather than black, and never adds them to the candidate set nor traces through them. Bacon and Rajan [2001] found that this reduced the size of the candidate set by an order of magnitude. Figure 5.4 illustrates the full state transition system of the synchronous Recycler, including green nodes.

5.6    Limited-field reference counting

The remaining concern is the space overhead incurred by storing reference counts in object headers. In theory, the reference count field of an object should be large enough to hold a pointer-sized value since an object may be referenced by every other object in the heap. An additional field of this size represents a significant overhead to small objects. However, only contrived applications will cause counts to grow so large; in practice most objects have small reference counts. Indeed, most objects are not shared at all and so the space they use could be reused immediately the pointer to them is deleted [Clark and Green, 1977; Stoye et al, 1984; Hartel, 1988]. In functional languages, this allows objects such as arrays to be updated in place rather than having to modify a copy of the object. Given a priori knowledge of the upper bound on reference counts, it would be possible to use a smaller field for the reference count. Unfortunately, it is common for some objects to be very popular [Printezis and Garthwaite, 2002].

However, it is still possible to limit the size of the reference count field provided that some backup mechanism is occasionally invoked to deal with reference count overflow. Once a reference count has been incremented to the maximum permissible value, it becomes a sticky reference count, not changed by any subsequent pointer updates. The most extreme option is to use a single bit for the reference count, thus concentrating reference counting on the common case of objects that are not shared. The bit can either be stored in the object itself [Wise and Friedman, 1977] or in the pointer [Stoye et al, 1984]. The corollary of limited-field reference counts is that once objects become stuck they can no longer be reclaimed by reference counting. A backup tracing collector is needed to handle such objects. As the tracing collector traverses each pointer, it can restore the correct reference counts (wherever this is no greater than the sticky value); Wise [1993a] shows that, with some effort, a mark-compact or copying collector can also restore uniqueness information. Such a backup tracing collector would be needed to reclaim garbage cycles in any case.

5.7    Issues to consider

Reference counting is attractive for the promptness with which it reclaims garbage objects and its good locality properties. Simple reference counting can reclaim the space occupied by an object as soon as the last pointer to that object is removed. Its operation involves only the targets of old and new pointers read or written, unlike tracing collection which visits every live object in the heap. However, these strengths are also the weaknesses of simple reference counting. Because it cannot reclaim an object until the last pointer to that object has been removed, it cannot reclaim cycles of garbage. Reference counting taxes every pointer read and write operation and thus imposes a much larger tax on throughput than tracing does. Furthermore, multithreaded applications require the manipulation of reference counts and updating of pointers to be expensively synchronised. This tight coupling of mutator actions and memory manager risks some fragility, especially if ‘unnecessary’ reference count updates are optimised away by hand. Finally, reference counts increase the sizes of objects.

The environment

Despite these concerns, it would be wrong to dismiss reference counting without further thought. Certainly, its drawbacks make simple reference counting uncompetitive as a general purpose memory management component of a virtual machine, especially if the majority of objects managed are small, cycles are common and the rate of pointer mutation is high. However, there are environments which are favourable to reference counting. Reference counting can play well in a mixed ecology where the lifetimes of most objects are sufficiently simple to be managed explicitly. It can be restricted to managing a smaller number of resources with more complex owner relationships. Often such resources will be large, in which case the overhead for an additional reference count field in the header will be negligible. Data such as bitmaps for images and so on will not contain any pointers, so the problem of reclaiming cyclic structures does not arise. Furthermore, reference counting can be implemented as part of a library rather than being baked into the language’s run-time system. It can therefore give the programmer complete control over its use, allowing her to make the decision between performance overhead and guarantees of safety. Nevertheless, it is essential that reference counting be used carefully. In particular, the programmer must ensure that races between pointer modifications and reference count updates are avoided. If reference counting is implemented through smart pointers, he must also be aware that the semantics of pointers and smart pointers differ. As Edelson [1992] wrote, ‘They are smart, but they are not pointers’.

Advanced solutions

Sophisticated reference counting algorithms can offer solutions to many of the problems faced by naïve reference counting but, paradoxically, these algorithms often introduce behaviours similar to those of stop-the-world tracing collectors. We examine this duality further in the next chapter.

Garbage cycles can be reclaimed by a backup, tracing collector or by using the trial deletion algorithms we discussed in Section 5.5. In both cases, this requires mutator threads to be suspended while we reclaim cyclic data (although we show how these stop-the-world pauses can be removed in later chapters).

Although the worst case requires reference count fields to be almost as large as pointer fields, most applications hold only a few references to most objects. Often, it is possible for the reference count to hijack a few bits from an existing header word (for example, one used for object hashing or for locks). However, it is common for a very few objects to be heavily referenced. If limited-field reference counting is used, these objects will either leak — which may not be a serious problem if they are few in number or have very long lifetimes — or must be reclaimed by a backup tracing collector. Note, however, that in comparing the space overhead of reference counting and, say, mark-sweep collection it is not sufficient simply to measure the cost of the reference count fields. In order not to thrash, tracing collectors require some headroom in the heap. If the application is given a heap of, say, 20% larger than its maximum volume of live data, then at least 10% of the heap will be ‘wasted’ on average. This fraction may be similar to the overhead of reference counting (depending on the average size of objects it manages).

The throughput overhead of reference counting can be addressed by omitting to count some pointer manipulations and by reducing the cost of others. Deferred reference counting ignores mutator operations on local variables. This allows the counts of objects reachable from roots to be lower than their true value, and hence prevents their prompt reclamation (since a reference count of zero no longer necessarily means that the object is garbage). Coalesced reference counting accounts for the state of an object only at the beginning and end of an epoch: it ignores pointer manipulations in between. In one sense, this automates the behaviour of programmers who often optimise away temporary adjustments to reference counts (for example, to an iterator as it traverses a list). However, once again, one consequence of deferred and coalesced reference counting is to reintroduce stop-the-world pauses during which reference counts are corrected.

As well as removing some reference counting operations, both deferred and coalesced reference counting reduce the synchronisation cost of other operations. Deferred reference counting does so simply by omitting to manipulate reference counts on local variables. Coalesced reference counting does not need synchronisation because races are benign: at worst, the same values might be written to the logs of two different threads. However, both solutions add space overhead to the cost of reference counting, either to store the zero count table or to store update logs.

A further attraction of these advanced reference counting techniques is that they scale well with large heaps. Their cost is proportional only to the number of pointer writes made, and not to the volume of live data. As we shall see in Chapter 10, hybrid collectors are possible, combining tracing collection for short-lived, heavily mutated data with reference counting for longer-lived, more stable data.

In the next chapter, we compare all four forms of collection we have examined so far: mark-sweep, mark-compact, copying and reference counting. We then consider a remarkable abstraction of tracing and advanced reference counting collection that reveals that they are surprisingly similar in many ways.

1Reference listing algorithms, commonly used by distributed systems such as Java’s RMI libraries, modify this invariant so that an object is deemed to be live if and only if the set of clients believed to be holding a reference to the object is non-empty. This offers certain fault tolerance benefits, for example, set insertion or deletion is idempotent, unlike counter arithmetic.

2The Boost libraries for C++, www.boost.org.