So far we have assumed a monolithic approach to garbage collection: all objects are managed by the same collection algorithm and all are collected at the same time. However there is no reason why this should be so and substantial performance benefits accrue from a more discriminating treatment of objects. The best known example is generational collection [Lieberman and Hewitt, 1983; Ungar, 1984], which segregates objects by age and preferentially collects younger objects. There are many reasons why it might be beneficial to treat different categories of object in different ways. Some but not all of these reasons are related to the collector technology that might be used to manage them. As we saw in earlier chapters, objects can be managed either by a direct algorithm (such as reference counting) or by an indirect, tracing algorithm. Tracing algorithms may move objects (mark-compact or copying) or not (mark-sweep). We might therefore consider whether or not we wish to have the collector move different categories of object and, if so, how we might wish to move them. We might wish to distinguish, quickly by their address, which collection or allocation algorithm to apply to different objects. Most commonly, we might wish to distinguish when we collect different categories of object.
It is useful to distinguish the sets of objects to which we want to apply certain memory management policies from the mechanisms that are used to implement those policies efficiently. We shall use the term space to indicate a logical set of objects that receive similar treatment. A space may use one or more chunks of address space. Chunks are contiguous and often power-of-two sized and aligned.
It is often effective to split the heap into partitions, each managed under a different policy or with a different mechanism. These ideas were first explored in Bishop’s influential thesis [1977]. These reasons include object mobility, size, lower space overheads, easier identification of object properties, improved garbage collection yield, reduced pause time, better locality, and so on. We examine these motivations now, before considering particular models of garbage collection and object management that take advantage of heap partitioning.
In a hybrid collector it may be necessary to distinguish objects that can be moved from those that either cannot be moved or which it is costly to move. It may be impossible to move objects due to lack of communication between the run-time system and the compiler, or because an object is passed to the operating system (for example, an I/O buffer). Chase [1987, 1988] suggests that asynchronous movement may also be particularly detrimental to compiler optimisations. In order to move an object, we must be able to discover every reference to that object so that each can be updated to point to the object’s new location. In contrast, if collection is non-moving, it suffices that a tracing collector finds at least one reference. Thus, objects cannot be moved if a reference has been passed to a library (for example, through the Java Native Interface) that does not expect garbage collection. Either such objects must be pinned or we must ensure that garbage collection is not enabled for that space while the object is accessible to the library.1
The references that must be updated in order to move objects include the root set. Determining an accurate map of root references is one of the more challenging parts of building the interface between a managed language and its run-time. We discuss this in detail in Chapter 11. One commonly chosen route, sometimes to an initial implementation, is to scan roots (thread stacks and registers) conservatively rather than construct a type-accurate map of which stack frame slots and so on contain object references. This tactic is inevitable if the compiler does not provide type-accurate information (for example, compilers for languages like C and C++). Conservative stack scanning [Boehm and Weiser, 1988] treats every slot in every stack frame as a potential reference, applying tests to discard those values found that cannot be pointers (for example, because they ‘point’ outside the range of the heap or to a location in the heap at which no object has been allocated). Since conservative stack scanning identifies a superset of the true pointer slots in the stack, it is not possible to change the values of any of these (since we might inadvertently change an integer that just happened to look like a pointer). Thus, conservative collection cannot move any object directly referenced by the roots. However, if appropriate information (which need not be full type information) is provided for objects in the heap, then a mostly-copying collector can safely move any object except for one which appears to be directly reachable from ambiguous roots [Bartlett, 1988a].
It may also be undesirable (rather than impossible) to move some objects. For example, the cost of moving large objects may outweigh the fragmentation costs of not moving them. A common strategy is to allocate objects larger than a certain threshold into a separate large object space (LOS). We have already seen how segregated-fits allocators treat large and small objects differently. Large objects are typically placed on separate pages (so a minimum size might be half a page), and are managed by a non-moving collector such as mark-sweep. Notice that, by placing an object on its own page, it can also be ‘copied’ virtually, either by Baker’s Treadmill [1992a] or by remapping virtual memory pages [Withington, 1991].
It may be useful to segregate objects in order to reduce overall heap space requirements. It is desirable to create objects in a space managed by a strategy that supports fast allocation and offers good spatial locality (as a sequence of objects is allocated and initialised). Blackburn et al [2004a] showed that the difference in cost between sequential and free-list allocation is small (accounting for only 1% of total execution time) and is dominated by the second order effect of improved locality, particularly for young objects which benefit from being laid out in allocation order.
Both copying and sliding collectors eliminate fragmentation and allow sequential allocation. However, copying collectors require twice the address space of non-moving collectors and mark-compact collection is comparatively slow. It is therefore often useful to segregate objects so that different spaces can be managed by different memory managers. Those objects that are expected to live for some time, and for which fragmentation is not likely to be an immediate concern, can be kept in a space that is primarily non-moving but visited by occasional compaction passes. Those objects with higher rates of allocation and higher expected mortality can be placed in a space managed by a copying collector for fast allocation and cheap collection (proportional to the number of survivors, which is expected to be low). Note that the expense of reserving copy space for large objects is a further reason for managing large object spaces with a non-copying collector.
Physically segregating objects of different categories also allows a particular property, such as type, to be determined simply from the address of the object rather than by retrieving the value of one of its field or, worse, by chasing a pointer. This has several benefits. First, it offers a cache advantage since it removes the necessity to load a further field (particularly if the placement of objects of a particular category is made statically and so the address comparison is against a compile-time constant). Second, segregation by property, whereby all objects sharing the same property are placed in the same contiguous chunk in order to allow a quick address-based identification of the space, allows the property to be associated with the space rather than replicated in each object’s header. Third, the kind of the object is significant for some collectors. Objects that do not contain pointers do not need to be scanned by a tracing collector. Large pointer-free objects may benefit from being stored in their own space, whereas the cost of processing a large array of pointers is likely to be dominated by the cost of tracing the pointers rather than, say, the cost of moving the object. Conservative collectors benefit particularly from placing large compressed bitmaps in areas that are never scanned as they are a frequent source of false pointers [Boehm, 1993]. Cycle-collecting tracing collectors can also benefit from segregating inherently acyclic objects which cannot be candidate roots of garbage cycles.
Virtual machines often generate and store code sequences in the heap. Moving and reclaiming code has special problems such as identifying, and keeping consistent, references to code, or determining when code is no longer used and hence can be unloaded (note that class reloading is generally not transparent since the class may have state). Code objects also tend to be large and long lived. For these reasons, it is often desirable not to relocate code objects [Reppy, 1993], and to consider unloading code as an exceptional case particular to certain applications.
The best known reason for segregation is to exploit object demographics. It is common for some objects to remain in use from the moment they are allocated to the end of the program while others have very short lives. As long ago as 1976 Deutsch and Bobrow noted that “statistics show that a newly allocated datum is likely to be either ‘nailed down’ or abandoned within a relatively short time”. Indeed, it is even common for a significant fraction of allocation points in Java programs to create objects with a bimodal lifetime distribution [Jones and Ryder, 2008]. Numerous studies have confirmed that the object lifetime behaviour of many (but not all) applications supports the weak generational hypothesis that “most objects die young” [Ungar, 1984]. The insight behind a range of strategies, both generational and quasi-generational, is that the best way to reclaim the most storage space for the least effort is to concentrate collection effort on those objects most likely to be garbage.
If the distribution of object lifetimes is sufficiently skewed, then it is worth repeatedly collecting a subset (or subsets) of the heap rather than the entire heap [Baker, 1993].
For example, generational collectors typically collect a single space of the heap (the young generation or nursery) many times for every time that they collect the entire heap. Note that there is a trade-off here. By not tracing the whole heap at every collection, the collector allows some garbage to go unreclaimed (to float in the heap). This means that the space available for the allocation of new objects is smaller than it would have been otherwise, and hence that the collector is invoked more often. Furthermore, as we shall see later, segregating the heap into collected and uncollected spaces imposes more bookkeeping effort on both the mutator and the collector. Nevertheless, provided that the space chosen for collection has a sufficiently low survival rate, a partitioned collection strategy can be very effective.
Partitioning for responsiveness
The cost of collection to a tracing collector is largely dependent on the volume of live objects to be traced. If a copying collector is used, the cost of the scavenge depends only on the volume of live objects; even in a mark-sweep collector, the cost of tracing dominates the cost of sweeping. By restricting the size of the condemned space that the collector traces, we bound the volume of objects scavenged or marked, and hence the time required for a garbage collection. In a stop-the-world collector, this means shorter pause times. Unfortunately, collecting a subset of the heap improves only expected times. Since collection of a single space may return insufficient free space for computation to continue, it may still be necessary to collect the whole heap. Thus, in general, partitioned collection cannot reduce worst-case pause times.
The extreme case for partitioning is to allow a space to be reclaimed in constant time. If no objects within a condemned region are reachable from outside that region, then there is no tracing work for a collector to do to reclaim the region: the memory occupied by that region can be returned en masse to the allocator. Determining that a region is unreachable requires the combination of appropriate object access disciplines and heap structures (such as stacks of scoped regions). The responsibility for correct usage is typically placed entirely on the programmer (as for example with the Real-time Specification for Java). However, given a suitably tractable language, such as ML, regions can also be inferred automatically [Tofte and Talpin, 1994]. The Cyclone extension to C reduces the burden on programmers though a complex type system which allows some type and region inferencing [Grossman et al, 2002].
The importance of locality for good performance continues to increase as the memory hierarchy becomes more complex (more levels, multiple CPU cores and sockets, and nonuniform memory access). Simple collectors tend to interact poorly with virtual memory and caches. Tracing collectors touch every live object as part of the trace. Mark-sweep collectors may touch dead objects as well. Copying collectors may touch every page of the heap even though only half of it is in use for live objects at any time.
Researchers have long argued that a collector should not be used simply to reclaim garbage but should also be used to improve the locality of the system as a whole [Fenichel and Yochelson, 1969; White, 1980]. We saw in Chapter 4 how the traversal order of copying collectors can be varied in order to improve mutator locality by co-locating parents and children. Generational collectors can obtain further locality improvements for both the collector the mutator. The collector benefits from concentrating most effort on a subsection of the heap likely to return the most free space for the least effort. The mutator benefits from reducing its working set size, since younger objects typically have higher mutation rates than older ones [Blackburn et al, 2004b].
Garbage collection requires synchronisation between mutator and collector threads. On-the-fly collection, which never pauses more than one mutator thread at a time, may require a complex system of handshakes with the mutator threads but even stop-the-world collection requires synchronisation to bring all mutator threads to a halt. This cost can be reduced if we halt just a single thread at a time and collect only those objects that were allocated by that thread and which cannot have escaped to become reachable by other threads. To achieve this, the collector must be able to distinguish those objects that are accessible from only one thread from those that may be shared, for example by allocating in thread-local heaplets [Doligez and Leroy, 1993; Doligez and Gonthier, 1994; Steensgaard, 2000; Jones and King, 2005].
At a larger granularity, it may be desirable to distinguish the objects accessible to particular tasks, where a task comprises a set of cooperating threads. For example, a server may run multiple managed applications, each of which usually requires its own complete virtual machine to be loaded and initialised. In contrast, a multi-tasking virtual machine (MVM) allows many applications (tasks) to run within a single invocation of the multi-tasking virtual machine [Palacz et al, 1994; Soman et al, 2006, 2008; Wegiel and Krintz, 2008]. Care is clearly needed to ensure that different tasks cannot interfere with one another, either directly (by obtaining access to another’s data) or indirectly (through denying another task fair access to system resources such as memory, CPU time, and so on). It is particularly desirable to be able to unload all the resources of a task when it has completed without having to disturb other tasks (for example, without having to run the garbage collector). All these matters are simplified by segregating unshared data owned by different threads.
One reason for not wishing to touch objects that are accessible to other threads is to reduce synchronisation overheads. However, we may also wish to partition objects by their usage because their geography leads to different demographic behaviours. Xian et al [2007] observed that remotable objects instantiated as part of client requests in an application server tend to live longer than local objects; extending Sun’s HotSpot generational collector to recognise this allowed the server to handle larger workloads. More generally, in a system managed by distributed garbage collection, it will be desirable to manage local and remote objects and references with different policies and mechanisms, since the cost of accessing a remote object will be many orders of magnitude more expensive than accessing a local object.
Distribution is not the only reason why the cost of object access may not be uniform. Earlier we paid particular attention to how tracing collectors might minimise the cost of cache misses. The cost of a cache miss may be a few hundred cycles whereas accessing an object on a page that is swapped out will incur millions of cycles. Avoiding frequent page misses was a priority for an earlier generation of collectors whereas today a configuration that leads to heavy paging might be considered irredeemably broken.2 Physical page organisation (in memory or swapped-out) can be considered as another form of heap partitioning, and indeed one that can be exploited. The Bookmarking collector [Hertz et al, 2005] cooperates with the virtual memory system in order first of all to improve the choice (from the collector’s point of view) of the page to be swapped out and, second, to allow a collector’s trace to complete without access to objects on non-resident pages.
Similarly, non-uniform memory access machines have some banks of memory closer to particular processors than others. Sun’s HotSpot collector recognises this property and will preferentially allocate objects in ‘near’ memory to minimise latency on large servers where access times to memory vary significantly.
Finally, we might wish to partition objects according to their mutability. Recently created objects tend to be modified more frequently (for example to initialise their fields) than longer lived objects [Wolczko and Williams, 1992; Bacon and Rajan, 2001; Blackburn and McKinley, 2003; Levanoni and Petrank, 2006]. Memory managers based on reference counting tend to incur a high per-update overhead and thus are less suitable for objects that are modified frequently. On the other hand, in very large heaps, only a comparatively small proportion of objects will be updated in any period but a tracing collector must nevertheless visit all objects that are candidates for garbage. Reference counting might be better suited to this scenario.
Doligez and Gonthier segregate ML objects by mutability (and by thread) in order to allow each thread to have its own space of immutable, unshared objects, as well as a single shared space [Doligez and Leroy, 1993; Doligez and Gonthier, 1994]. Their scheme requires a strong property from references: there must be no pointers to objects inside a thread’s local heap from objects outside that local heap (that is, from other threads’ local heaps or from the shared space). References into a thread’s private heap are prevented from escaping to the shared heap by a copy on write policy; this is semantically transparent since the target of the reference is immutable. Together, these properties allow each thread’s private heap to be collected asynchronously. A further advantage of this approach is that, unlike most schemes in which spaces are collected independently, it is not necessary to track pointers that cross spaces (though the mutator must still detect them).
Probably the most obvious, and the most common, way to partition the heap is by dividing it into non-overlapping ranges of addresses. At its simplest, each space occupies a contiguous chunk of heap memory so this mapping is one to one. It is more efficient to align chunks on power of two boundaries. In that case an object’s space is encoded into the highest bits of its address and can be found by a shift or mask operation. Once the space identity is known, the collector can decide how to process the object (for example, mark it, copy it, ignore it and so on). If the layout of the spaces is known at compile time, this test can be particularly efficient — a comparison against a constant. Otherwise, the space can be looked up, using these bits as an index into a table.
However, contiguous areas may not make efficient use of memory in 32-bit systems, as the range of virtual address space they may occupy must be reserved in advance. Although this does not commit physical memory pages, which can be mapped in and out of the contiguous space on demand, contiguous spaces are nevertheless inflexible and may lead to virtual memory exhaustion even though there are sufficient physical pages available. An additional difficulty in many cases is the tendency of the operating system to map code segments for libraries in unpredictable places — sometimes intentionally unpredictable in order to improve security. This makes it difficult to reserve large contiguous ranges of virtual address space. For the most part, these problems can be eliminated in a 64-bit address space.
The alternative is to implement spaces as discontiguous sets of chunks of address space. Typically, a discontiguous space comprises a list of fixed size frames of contiguous virtual address space. As before, operations on frames are more efficient if frames are aligned on 2n boundaries and are a multiple of the operating system’s page size. Again, the disadvantage is that an object’s space may need to be looked up in a table.
It is not always necessary to implement spaces by segregating objects physically. Instead, an object’s space may be indicated by some bits in its header [Domani et al, 2002]. Although this precludes determining its space through a fast address comparison, this approach nevertheless has some advantages. First, it allows objects to be partitioned according to properties that vary at run time, such as age or thread reachability, even in the case where the collector does not move objects. Second, it may facilitate handling objects that need to be pinned temporarily, for example because they are accessible to code that is not collector aware. Finally, run-time partitioning may be more accurate than choices made statically. For example, static escape analyses provide only a conservative estimate of whether an object may be shared. Static analyses do not yet scale to the very largest programs, and the presence of dynamic class loading commonly necessitates excessively conservative analysis, although Jones and King [2005] show how to obtain a more accurate static estimate of escapement in the context of thread-local allocation. If object escapement is tracked dynamically, then the distinction is between objects that are currently thread-local and those that are (or have been) accessible to more than one thread.3 The downside of dynamic segregation is that it imposes more work on the write barrier. Whenever a pointer update causes its referent to become potentially shared, then the referent and its transitive closure must be marked as shared.
Finally in this section, we note that collecting only a subset of the partitions of the heap necessarily leads to a collector that is incomplete: it cannot reclaim any garbage in partitions that are not collected. Even if the collector takes care to scavenge every partition at some time, say on a round-robin basis, garbage cycles that span partitions will not be collected. In order to ensure completeness, some discipline must be imposed on the order in which partitions are collected and the destination partition to which unreclaimed objects are moved. A simple, and widely used, solution is to collect the entire heap when other tactics fail. However, more sophisticated strategies are possible as we shall see when we consider Mature Object Spaces (also called the Train collector) [Hudson and Moss, 1992].
Partitioning decisions can be made statically (at compile time) or dynamically — when an object is allocated, at collection time or by the mutator as it accesses objects.
The best known partitioning scheme is generational, whereby objects are segregated by age, but this is just one form of age related partitioning. Age related collectors segregate objects by their age into a number of spaces. In this case, partitioning is performed dynamically, by the collector. As an object’s age increases beyond some threshold, it is promoted(moved physically or logically) into the next space.
Objects may also be segregated by the collector because of constraints on moving objects. For example, mostly-copying collectors may not be able to move some objects while they are considered pinned — accessible by code that is not aware that objects’ locations may change.
Partitioning decisions may be also made by the allocator. Most commonly, allocators determine from the size of an allocation request whether the object should be allocated in a large object space. In systems supporting explicit memory regions visible to the programmer or inferred by the compiler (such as scoped regions), the allocator or compiler can place objects in a particular region. Allocators in thread-local systems place objects in a heaplet of the executing thread unless they are directed that the object is shared. Some generational systems may attempt to co-locate a new object in the same region as one that will point to it, on the grounds that eventually it will be promoted there anyway [Guyer and McKinley, 2004].
An object’s space may also be decided statically, by its type, because it is code, or through some other analysis. If it is known a priori that all objects of a particular kind share a common property, such as immortality, then the compiler can determine the space in which these objects should be allocated, and generate the appropriate code sequence. Generational collectors normally allocate in a nursery region set aside for new objects; later, the collector may decide to promote some of these objects to an older generation. However, if the compiler ‘knows’ that certain objects (for instance, those allocated at a particular point in the code) will usually be promoted, then it can pretenure these objects by allocating them directly into that generation [Cheng et al, 1998; Blackburn et al, 2001, 2007; Marion et al, 2007].
Finally, objects may be repartitioned by the mutator as it runs if the heap is managed by a concurrent collector (Chapter 15). Mutator access to objects may be mediated by read or write barriers, each of which may cause one or more objects to be moved or marked. The colouring of objects (black, grey, white) and the old/new space holding the object may be thought of as a partitioning. The mutator can also dynamically discriminate objects according to other properties. As we saw above, the write barrier used by Domani et al [2002] logically segregates objects as they escape their allocating thread. Collaboration between the run-time system and the operating system can repartition objects as pages are swapped in and out [Hertz et al, 2005].
In the next two chapters, we investigate a variety of partitioned garbage collectors. Chapter 9 looks at generational collectors in detail, while Chapter 10 examines a wide variety of other schemes, including both those based on different ways of exploiting object’s ages and those based on non-temporal properties.
1An alternative to passing a direct object reference into the library is to pass an indirect reference (or handle), which can be registered with the collector for updating as necessary. This is the typical solution for the Java Native Interface.
2On the other hand, many current generation netbooks have limited memory and page thrashing is a concern.
3We do not know of any system that reverts objects that were once shared, but are no longer, back to local.