Chapter 18

Concurrent reference counting

We discussed reference counting in Chapter 5. The two chief issues facing naïve reference counting were its inability to collect garbage cycles and the high cost of manipulating reference counts, particularly in the face of races between different mutator threads. The solution to cyclic garbage was trial deletion (partial tracing). We used deferred reference counting to avoid having mutators manipulate reference counts on local variables and coalescing to avoid having to make ‘redundant’ changes to reference counts that would be cancelled out by later mutations; a useful side-effect of coalescing is that it tolerates mutator races. All three solutions required stopping the world while the collector reconciled reference counts and reclaimed any garbage. In this chapter, we relax this requirement and consider the changes that need to be made in order to allow a reference counting collector thread to run concurrently with mutator threads.

18.1  Simple reference counting revisited

To be correct, reference counting algorithms must preserve the invariant that an object’s reference count is equal to the number of references to that object. Maintaining this invariant becomes more complicated with multiple mutator threads. At first sight, it may seem that it is more difficult to Write safely than to Read safely. Updating a pointer slot requires three actions: the reference count of the new target must be incremented, that of the old target be decremented and the pointer written. It is important that these three actions be coordinated, even though multiple mutator threads may manipulate pointers to the objects in question.1 Objects must not be reclaimed prematurely (for example, because their reference count has temporarily dropped to zero) nor is it desirable for garbage to float indeinitely in the heap. Figure 18.1 illustrates the problem. Here, even if all the reference count increments and decrements are be performed atomically, some thread interleavings may lead to an incorrect result because the reference count of old may be decremented twice and the reference count of one of the new targets may be too high.

The difficulty of concurrent reference counting does not lie solely with incrementing or decrementing reference count fields. This can be done easily enough with one of the atomic primitive operations like AtomicIncrement discussed in Chapter 13. The harder problem is to synchronise reference count modifications with pointer loads or stores; in Algorithm 5.1 we simply required the mutator Read and Write actions to be atomic. The simplest way to do this is to lock the object containing the field that is being read or written, src, as illustrated in Algorithm 18.1. This is safe. After Read has locked src, the value of field i cannot change. If it is null, Read is trivially correct. Otherwise, src holds a reference to some object tgt. The reference counting invariant ensures that tgt’s reference count cannot drop to zero before src is unlocked since there is a reference to tgt from src. Thus, we can guarantee that tgt cannot be freed during the Read and that addReference will be able to update the count rather than potentially corrupting memory. A similar argument establishes the safety of Write.

Image

Figure 18.1: Reference counting must synchronise the manipulation of counts with pointer updates. Here, two threads race to update an object field. Note that old is a local variable of each thread’s Write method.

Algorithm 18.1: Eager reference counting with locks

 1 Read(src, i):

 2  lock(src)

 3   tgt ← src[i]

 4   addReference(tgt)

 5  unlock(src)

 6  return tgt

 7

 8 Write(src, i, ref):

 9  addReference(ref)

10  lock(src)

11   old ← src[i]

12   src[i] ← ref

13   deleteReference(old)

14  unlock(src)

It is appealing to hope that we can find a lock-free solution, using commonly available primitive operations. Unfortunately, single memory location primitives are insufficient to guarantee safety. The problem does not lie in Write. Imagine that, instead of the coarser grain lock, we use atomic increments and decrements to update reference counts and CompareAndSwap for the pointer write, as in Algorithm 18.2. If ref is non-null, then the writing thread holds a reference to it so ref cannot be reclaimed until Write returns (whether or not we use eager or deferred reference counting). Write spins, attempting to set the pointer field until we are successful: at that point, we know that next we will decrement the count of the correct old object and that only the winning thread will do this. Note that the reference count of this old target remains an overestimate until deleteReference(old) is called, and so old cannot be prematurely deleted.

We cannot apply the same tactic in Read, though. Even if Read uses a primitive atomic operation to update the reference count, unless we lock src it is possible that some other thread will delete the pointer src[i] and reclaim its target between the point that we load the reference (line 19) and the increment of the reference count. The attempt to increment the reference count may corrupt memory that has been freed and maybe reallocated.

Algorithm 18.2: Eager reference counting with CompareAndSwap is broken

 1 Write(src, i, ref):

 2  if ref ≠ null

 3   Atomicincrement(&rc(ref))

/* ref guaranteed to be non–free */

 4  loop

 5   old ← src[i]

 6   if CompareAndSet(&src[i], old, ref)

 7    deleteReference(old)

 8    return

 9

10 deleteReference(ref) :

/* ref guaranteed to be null or non–free */

11  if ref ≠ null

12   AtomicDecrement(&rc(ref ))

13   if rc(ref) = 0

14    for each fld in Pointers(ref)

15     deleteReference(*fld)

16    free(ref)

17

18 Read(src, i):

19  tgt = src[i]

20  AtomicIncrement(&rc(tgt))

/* oops! */

21  return tgt

Algorithm 18.3: Eager reference counting with CompareAndSwap2

 1 Read(src, i, root):

 2  loop

 3   tgt ← src[i]

 4   if tgt = null

 5    return null

 6   rc ← rc(tgt)

 7   if CompareAndSet2(&src[i], &rc(tgt), tgt, rc, tgt, rc + 1)

 8    return tgt

Although single memory location primitive operations are insufficiently powerful to provide a solution, Detlefs et al [2001, 2002b] show that the CompareAndSwap2 primitive discussed in Section 13.3 is sufficient. CompareAndSwap2 can atomically update two independent memory locations. Although this is not sufficient to maintain accurate reference counts at all times, it is sufficient to guarantee the weaker invariant that (i) while there remains a pointer to an object, its reference count cannot be zero, and (ii) if there are no pointers to an object, its reference count will eventually become zero. In Algorithm 18.3, CompareAndSwap2 is used to increment the reference count and simultaneously to check that the pointer to the object still exists, thus avoiding the possibility of modifying an object’s header after it has been freed.

18.2  Buffered reference counting

Eager reference counting schemes require either locks or multi-word atomic primitives, which are (currently) not widely available. Deferred reference counting partially finessed the problem we saw in the previous section by not applying reference count operations to local variables and deferring reclamation of objects with zero reference counts (see Section 5.3). This leaves the question of how to reduce the overhead of pointer writes to object fields. We now turn to look at buffered reference counting techniques that use only simple loads and stores in the mutator write barrier, yet support multithreaded applications.

In order to avoid the cost of synchronising reference count manipulations by different mutator threads, DeTreville [1990] had mutators log the old and new referents of each pointer update to a buffer (in a hybrid collector for Modula-2+ that used mark-sweep as an occasional backup collector to handle cycles). A single, separate reference counting thread processed the log and adjusted objects’ reference counts, thereby ensuring that the modifications were trivially atomic. In order to prevent inadvertently applying a reference count decrement before an increment that causally preceded it (and hence prematurely reclaiming an object), increments were applied before decrements. Unfortunately, buffering updates does not resolve the problem of coordinating the reference count manipulations with the pointer write. DeTreville offered two solutions, neither of which is entirely satisfactory. His first approach was, as above, to protect the entire Write operation with a lock. This ensures that records are correctly appended to the shared buffer as well as synchronising the updates. To avoid the cost of making every write atomic, his second solution provided each mutator thread with its own buffer, which was periodically passed to the reference counting thread, but this required the programmer to take care to ensure that pointer writes were performed atomically, if necessary performing the locking manually, to avoid the problems illustrated by Figure 18.1.

Bacon and Rajan [2001] also provided each thread with a local buffer but required the update of the pointer field to be atomic, as for example in Algorithm 18.4; a CompareAnd-Swap with retry could be used to do this. The mutator write barrier on a processor adds the old and new values of slot i to its local myUpdates buffer (line 9). Once again, reference counting of local variables is deferred, and time is divided into ragged epochs to ensure that objects are not prematurely deleted, by using a single shared epoch number plus per-thread local epoch numbers. Periodically, just as with deferred reference counting, a processor will interrupt a thread and scan all the processor’s local stacks, logging references found to a local myStackBuffer. The processor then transfers its myStackBuffer and myUpdates to the collector, and updates its local epoch number, e. Finally, it schedules the collection thread of the next processor before resuming the interrupted thread.

The collector thread runs on the last processor. In each collection cycle k, the collector applies the increments of epoch k and the decrements of epoch k – 1. Finally it increments the global epoch counter (for simplicity, we assume an unbounded number of global updatesBuffers in Algorithm 18.4). The advantage of this technique is that it is never necessary to halt all mutators simultaneously: the collector is on-the-fly. Note how the collector uses a variant of deferred reference counting. At the start of the collection the counts of objects directly referenced from thread stacks (in this epoch) are incremented; at the end of the cycle, the reference counts of those directly reachable from the stacks in the previous epoch are decremented.

18.3  Concurrent, cyclic reference counting

This leaves the question of collecting garbage cycles by reference counting without introducing stop-the-world pauses. The Recycler [Bacon et al, 2001; Bacon and Rajan, 2001] reclaims cyclic garbage by tracing candidate subgraphs, applying trial deletion to reference counts. Although buffering successfully devolves reference counting to a spare processor, the Recycler faces three problems in collecting cycles in a concurrent world.

Algorithm 18.4: Concurrent buffered reference counting

 1 shared epoch

 2 shared updatesBuffer [ ]

/* one buffer per epoch */

 3

 4 Write(src, i, ref):

 5  if src = Roots

 6   src[i] ← ref

 7  else

 8   old ← AtomicExchange(&src[i], ref)

 9   log(old, ref)

10

11 log(old, new):

12  myUpdates ← myUpdates + [〈old, new〉]

13

14 collect():

15  /* each processor passes its buffers on to a global updatesBuffer */

16   myStackBuffer ← []

17  for each local ref in myStacks

/* deferred reference counting */

18   myStackBuffer ← myStackBuffer + [〈ref, ref〉]

19  atomic

20   updatesBuffer[e] ← updatesBuffer [e] + myStackBuffer

21  atomic

22   updatesBuffer[e] ← updatesBuffer [e] + myUpdates

23  myUpdates ← []

24  e ← e + 1

25

26  me ← myProcessorld

27  if me < MAX_PROCESSORS

28   schedule(collect, me+1)

/* schedule collect() on the next processor */

29  else

30   /* the last processor updates the reference counts*/

31   for each 〈old, new〉 in updatesBuffer[epoch]

32    addReference(new)

33   for each 〈old, new〉 in updatesBuffer[epoch ←1]

34    deleteReference(old)

35   release(updatesBuffer[epoch ←1])

/* free the old buffer */

36    epoch ← epoch + 1

•  It cannot guarantee that it will retrace the same subgraph since the graph may be modified by mutators while the Recycler is detecting garbage cycles.

•  Pointer deletions may disconnect portions of the subgraph.

•  Reference counts may be out of date.

To resolve these problems, the asynchronous Recycler operates in two phases. The first phase is much the same as the synchronous collector described in Chapter 5. However the asynchronous collector defers the freeing of objects discovered by collectWhite (Algorithm 5.5) to the next phase which checks that these objects are indeed still garbage. There are several disadvantages to this approach. In theory, but probably not in practice, it is possible for some garbage cycles not to be collected — the collector is not guaranteed to be complete. Further, trial deletion cannot use the original reference count but must add a second, cyclic reference count field to object headers. Third, the algorithm must trace candidate cycles again, in the second phase, in order to avoid incorrectly reclaiming live objects. It also adds overhead to the reference counting write barrier as it must fix the colours of objects left white or grey by improper traversals.

Image

Figure 18.2: Concurrent coalesced reference counting: in the previous epoch A was modified to point to C and the values of its reference fields logged. However, A has been modified again in this epoch (to point to D), and so marked dirty and logged again. The original referent B can be found in the collector’s global log, just as in Figure 5.2. The reference to C that was added in the previous epoch will be in some thread’s current log: this log can be found from A’s getLogPointer field.

The fundamental problem is that the Recycler is trying to apply an algorithm designed for synchronous collection in a world where the topology of the object graph is continually changing. Next, we see below how this circle can be squared by providing the Recycler with a fixed snapshot of the heap.

18.4  Taking a snapshot of the heap

We saw in Chapter 5 how coalesced reference counting provided the collector with a snapshot of the heap. Thread-local buffers, passed synchronously to the collector, held replicas of objects into which pointers had been written. Every thread was halted at the start of a collection cycle, its buffers were transferred to the collector, and fresh buffers allocated. The collector simply used the replica to find and decrement the reference counts of the old targets and the current version of the object to find and increment the new targets. All dirty objects were cleaned.

Let us see first how we can allow the reference counting thread to run concurrently with the mutators (after a brief pause to transfer buffers), and then consider how to make that concurrent algorithm on-the-fly. In the first case, all the mutator threads can be stopped temporarily while their buffers are transferred to the collector. However, once all the mutator threads have transferred their buffers, they can be restarted. The collector’s task is to modify the reference counts of the old and new children of every modified object. Reference decrements can be handled as before, using the replicas in the logs, but handling increments is more involved (Algorithm 18.5). The task is to increment the reference counts of the children of each object in the collector’s log, using the state of the object at the time that the log was transferred. There are two cases to consider, since the logged object may have been modified since the logs were transferred to the collector.

Algorithm 18.5: Sliding views: update reference counts

 1 incrementNew(entry):

/* use the entry in the collector’s log */

 2  obj ← objFromLog(entry)

/* the current object */

 3  if not dirty(obj)

 4   replica ← copy(obj)

/* copy the object’s reference slots */

 5   if dirty(obj)

 6    replica ← getLogPointer(obj)

/* entry in some thread’s log */

 7  else

 8   replica ← getLogPointer(obj)

 9

10  for each fld in Pointers(replica)

11   child ← *fld

12  if child ≠ null

13    rc(child) ← rc(child) + 1

14    mark(child)

/* if tracing young generation */

If the object remains clean, its state has not changed, so the reference counts of its current children are incremented. Note that incrementNew in Algorithm 18.5 must check again after making a replica of a clean object in case it was dirtied while the copy was being taken.

If the object has been modified since the logs were transferred, then it will have been re-marked dirty and its state at the time of the transfer can be found in a fresh log buffer of some mutator. The object’s dirty pointer will now refer to this log, which can be read without synchronising with that thread. Consider the example in Figure 18.2. A has been modified again in this epoch, which complicates finding C, the target of the last update to A in the previous epoch. As A is dirty, its previous contents will be held in some thread’s current local log (shown on the right of the figure): the log refers to C. Thus, we can decrement the reference count of B and increment the reference count of C. In the next epoch, C’s reference count will be decremented to reflect the action Write(A, 0, D).

18.5  Sliding views reference counting

For the snapshot of the heap, we stopped the world while threads’ modification buffers were transferred to the collector. We relax that restriction now, instead stopping threads one at a time, on the fly. This gives a distorted view of the heap. In this sliding view, the values of different objects are recorded (and transferred to the collector thread) at different times. Sliding views require neither locks nor use of atomic instructions (at least, assuming sequential consistency), but coordinates mutators and collector threads with four handshakes between each mutator thread and the collector thread, similar to those used by Doligez and Gonthier [1994]. We consider what modifications need to be made to the algorithm to support weaker consistency models later. Sliding views can be used in several contexts: for plain reference counting [Levanoni and Petrank, 1999, 2001, 2006], for managing the old generation of generational [Azatchi and Petrank, 2003] and age-oriented [Paz et al, 2003, 2005b] collectors, and for integration with cyclic reference counting collectors [Paz et al, 2005a, 2007]. Here, we consider how sliding views can be used in an age-oriented collector and then extend it to reclaim cyclic structures.

Age-oriented collection

Age-oriented collectors partition the heap into young and old generations. Unlike traditional generational collectors, both generations are collected at the same time: there are no nursery collections and inter-generational pointers do not need to be trapped. Appropriate policies and techniques are chosen for the management of each generation. Since the weak generational hypothesis expects most objects to die young, and young objects are likely to have high mutation rates (for example, as they are initialised), a young generation benefits from a collector tuned to low survivor rates. In contrast, the old generation can be managed by a collector tuned to lower death and mutation rates. Paz et al [2003] adopt a mark-sweep collector for the young generation (since it need not trace large volumes of dead objects) and a sliding views reference counting collector for the old generation (as it can handle huge live heaps). Their age-oriented collector does not move objects: instead, each object has a bit in its header denoting its generation.

The algorithm

On-the-fly collection starts by gathering a sliding view (Algorithm 18.6). Incremental collection of a sliding view requires careful treatment of modifications made while the view is being gathered. Pointer writes are protected by adding an incremental update write barrier called snooping to the Write operation of Algorithm 5.3 (see Algorithm 18.7). This barrier prevents missing a referent o whose only reference is removed from a slot s1 before the sliding view reads s1, and then is written to another slot s2 after s2 is added to the view.

At the start of a cycle, each thread’s snoopFlag is raised (without synchronisation). While the sliding view is being collected (and the snoopFlag is up for this thread), the new referent of any modified object is recorded in the thread’s local mySnoopedBuffer (line 25 of Algorithm 18.7). In terms of the tricolour abstraction, this Dijkstra-style barrier marks ref grey. Objects are allocated dirty in the young generation (Algorithm 18.8) in order to avoid activating the write barrier when their slots are initialised.

After the collector has raised the snoopFlag for each mutator thread, it executes the first handshake. The handshake stops each thread, one at a time, and transfers its local log and young set to the collector’s updates buffer.

Next, all modified and young objects are cleaned. This risks a race. As cleaning is performed while mutator threads are running, it may erase the dirty state of objects modified concurrently with cleaning. A second handshake therefore pauses each thread, again on the fly, and scans its local log to identify objects modified during cleaning. The dirty state of these objects is restored (‘reinforced’).

A third, empty handshake ensures that no thread has fallen behind. The collector is now ready to begin to mark the young generation and update the reference counts of the old generation.

Concurrent marking starts with a fourth handshake, again suspending threads one at a time to scan their stacks in the usual way for mark-sweep and deferred reference counting collectors. Each thread’s snoop flag can then be dropped.

The items in each thread’s mySnoopedBuffer are transferred to the work list asynchronously. The collector is then ready to process reference counts in the old generation. Notice that the nursery cannot be marked until the old generation has been processed since an update may have added a pointer from an old to a young object.

Processing the old generation also requires the counts of old and new objects in the updates buffer to be processed: if the reference count of a young object is updated (it is reachable from the old generation and will be promoted), and that object has not been marked, it is added to the marker’s work list.

Algorithm 18.6: Sliding views: the collector

 1 shared updates

 2 shared snoopFlag[MAX_PROCESSORS]

/* one per processor */

 3

 4 collect():

 5  collectSlidingView()

 6  on–the–fly handshake 4:

 7   for each thread t

 8    suspend(t)

 9    scanStack(t)

10    snoopFlag[t] ← false

11    resume(t)

12  processReferenceCounts()

13  markNursery()

14  sweepNursery()

15  sweepZCT()

16  collectCycles()

17

18 collectSlidingView() :

19  on–the–fly handshake 1:

20   for each thread t

21    suspend(t)

22    snoopFlag[t] ← true

23    transfer t’s buffers to updates

24    resume(t)

25  clean modified and young objects

26  on–the–fly handshake 2:

27   for each thread t

28    suspend(t)

29    find modify-clean conflicts

30    resume(t)

31  reinforce dirty objects

32  on–the–fly handshake 3:

33   for each thread t

34    suspend(t)

35    resume(t) 36

37 processReferenceCounts():

38  for each obj in updates

39   decrementOld(obj)

40   incrementNew(obj)

41

42 collectCycles():

43  markCandidates()

44  markLiveBlack()

45  scan()

46  collectWhite()

47  processBuffers()

Algorithm 18.7: Sliding views: Write

 1 shared logs [MAX_PROCESSORS]

/* one per processor */

 2 shared snoopFlag[MAX_PROCESSORS]

/* one per processor */

 3 me ← myProcessorld

 4

 5 Write(src, i, ref):

 6  if src = Roots

 7   src[i] ← ref

 8  else

 9   if not dirty(src)

10    log(src)

$

11   src[i] ← ref

$

12   snoop(ref)

/* for sliding view */

13

14 log(ref):

15  for each fld in Pointers(ref)

16   if *fld ≠ null

17    add(logs[me], *fld)

18  if not dirty(ref)

19   /* commit the entry if ref is still clean */

20   entry ← add(logs[me], ref)

21   logPointer(ref) ← entry

22

23 snoop(ref):

24  if snoopFlag[me] && ref ≠ null

25   mySnoopedBuffer ← mySnoopedBuffer + [ref]

/* mark grey */

Algorithm 18.8: Sliding views: New

 1 New() :

 2  ref ← allocate()

 3  add(myYoungSet, ref)

 4  setDirty(ref)

/* allocate dirty */

 5  return ref

Once the old generation has been processed and all inter-generational references have been discovered, the young generation is traced (markNursery), marking objects with the incrementNew procedure, and swept (sweepNursery), freeing any unmarked object.

Objects in the old generation can be reclaimed in the same way as with deferred reference counting. Any object that is unmarked, has a zero reference count and is not directly referenced by a root, is reclaimed (sweepZCT). If an object with a zero reference count is dirty, recursive freeing decrements the reference counts of its descendants, found from its log entry (see Figure 18.3); otherwise, its current fields are used.

Sliding views cycle reclamation

As presented so far, the age-oriented collector can reclaim cycles in the nursery but not in the old generation. Paz et al [2007] combine the Recycler’s cycle collection algorithm with an age-oriented collector. The difficulty that faced the asynchronous Recycler was that the topology of the heap could vary under its feet. In contrast, sliding views presents the collector with a fixed view of the heap, using the original version of unmodified objects or the logged copy of modified objects (see Figure 18.3). It is therefore possible for each pass of the trial deletion algorithm to retrace the steps of the first pass. Thus, by working on the sliding view of the heap, we can apply the simpler synchronous algorithm rather than the more complicated, multi-coloured asynchronous algorithm of Bacon and Rajan [2001].

Image

Figure 18.3: Sliding views allow a fixed snapshot of the graph to be traced by using values stored in the log. Here, the shaded objects indicate the state of the graph at the time that the pointer from X to Y was overwritten to refer to Z. The old version of the graph can be traced by using the value of X’s field stored in the log.

Paz et al introduce a number of optimisations that can further reduce the number of objects that trial deletion must trace. Like Bacon and Rajan [2001], they ignore scalar objects that cannot be members of cycles. Mature objects are considered for cycle reclamation only if they have survived several collections. This requires a queue of candidate buffers rather than a single one (they found a delay of two collection cycles to be effective). Paz et al also try to avoid considering objects that might be live, including root referents, snooped objects and objects modified after the sliding view was collected. An additional markBlack phase pre-processes these objects, marking them and their sliding view descendants black. This raises a dilemma. The set of objects known to be live (actually, a subset of the dirty objects) is not fixed during the collection, so it is not possible to identify how many reference count modifications the collector might have made to an object before it became dirty. Hence, it is not possible to restore its original count. Instead, cycle detection operates on a second, cyclic, reference count. The alternative, to consider these objects regardless, would lead to more objects being processed by the reference counter.

Memory consistency

The sliding views algorithms presented above assume sequential consistency, which modern processors do not always guarantee. On the mutator side, it is important that the order of operations in Write are preserved to ensure that (i) the values seen in the log are the correct ones (that is, they represent a snapshot of the modified object as it was before the collection cycle started); (ii) the collector reads only completed log entries; and (iii) object fields cannot be updated after a collection starts without being snooped. The handshakes used by the algorithm solve some dependency issues on weakly consistent platforms (en suring that the collector only sees complete log entries, or that new references are snooped during the collection cycle). Two further modifications are necessary. First, on the mutator side, synchronisation must be added in the write barrier around the logging to ensure that it is seen before the modification of the pointer. Levanoni and Petrank [2006] do this by placing a memory fence before and after log(src). Second, something similar is needed on the collector side. However, the approach above would be inefficient since most objects are unlikely to be dirty since their log pointers have been reset. Instead, on a weakly consistent platform, the collector can reduce the cost of synchronisation by reading batches of values from the buffers into a local array before processing them.

18.6  Issues to consider

The immediate problem facing reference counting is how to ensure that objects’ reference counts are correct in the face of concurrent modifications to the object graph. The simplest solution is to require mutators to lock an object before it is modified. If the cost of locking is considered to be too high, then an alternative solution must be found. Current solutions rest on avoiding races between mutators that compromise the consistency of reference counts. Note that the memory manager is concerned only to preserve the consistency of the heap; whether or not mutator races violate the correctness of the user program is of no interest to it.

To preserve coherence, we must ask how we can serialise pointer writes and reference count operations. A partial solution is to use deferred reference counting, since this defers reclamation of garbage objects and, in particular, devolves the task to a single collector thread. However, this accounts only for pointer loads and stores, and not for writes into object fields. Thus, the question becomes, how can we devolve reference count modifications necessitated by writes to pointer fields from the mutator threads to a collector thread? One solution is for each mutator to buffer its reference counting operations and periodically pass them to the collector. A further step, coalesced reference counting, extends this by taking a snapshot of objects before they are modified: this allows the collector thread to avoid applying any redundant reference count modifications. In both cases reference count manipulation and object reclamation is separated from the action of writing pointers and is performed by a single collector thread (although it would be relatively straightforward to use parallel collector threads). Taking a snapshot of the state of the heap also simplifies concurrent cyclic reference counting. Trial deletion algorithms need to traverse a subgraph of the heap multiple times. By traversing the snapshot, the collector can ensure that it traces the same subgraph each time, even in the face of concurrent mutator activity.

Finally we note that there is a large literature on safe reclamation of memory when using dynamic memory structures, from the ABA-prevention tags used in IBM’s System 370 onwards. Other lock-free reference counting methods that require multi-word atomic primitives include Michael and Scott [1995] and Herlihy et al [2002]. Techniques that use timestamps to delay releasing an object until it is safe to do so are scheduler-dependent and tend to be vulnerable to the delay or failure of a single thread. For example, the Read-Copy-Update method [McKenney and Slingwine, 1998], used in the Linux kernel, delays reclamation of an object until all threads that have accessed it reach a ‘quiescence’ point. Other mechanisms that use immediate (rather than deferred) reference counting require a particular programming style, for example hazard pointers [Michael, 2004] or announcement schemes [Sundell, 2005].

1Note that we are not concerned about the correctness of the user program in the face of races, but we must ensure the consistency of the heap.