Developers are increasingly turning to managed languages and run-time systems for the many virtues they offer, from the increased security they bestow to code to the flexibility they provide by abstracting away from operating system and architecture. The benefits of managed code are widely accepted [Butters, 2007]. Because many services are provided by the virtual machine, programmers have less code to write. Code is safer if it is type-safe and if the run-time system verifies code as it is loaded, checks for resource access violations and the bounds of arrays and other collections, and manages memory automatically. Deployment costs are lower since it is easier to deploy applications to different platforms, even if the mantra ‘write once, run anywhere’ is over-optimistic. Consequently, programmers can spend a greater proportion of development time on the logic of their application.
Almost all modern programming languages make use of dynamic memory allocation. This allows objects to be allocated and deallocated even if their total size was not known at the time that the program was compiled, and if their lifetime may exceed that of the subroutine activation1 that allocated them. A dynamically allocated object is stored in a heap, rather than on the stack (in the activation record or stack frame of the procedure that allocated it) or statically (whereby the name of an object is bound to a storage location known at compile or link time). Heap allocation is particularly important because it allows the programmer:
• to choose dynamically the size of new objects (thus avoiding program failure through exceeding hard-coded limits on arrays);
• to define and use recursive data structures such as lists, trees and maps;
• to return newly created objects to the parent procedure (allowing, for example, factory methods);
• to return a function as the result of another function (for example, closures or suspensions in functional languages).
Heap allocated objects are accessed through references. Typically, a reference is a pointer to the object (that is, the address in memory of the object). However, a reference may alternatively refer to an object only indirectly, for instance through a handle which in turn points to the object. Handles offer the advantage of allowing an object to be relocated (updating its handle) without having to change every reference to that object/handle throughout the program.
Any non-trivial program, running in a finite amount of memory, will need from time to time to recover the storage used by objects that are no longer needed by the computation. Memory used by heap objects can be reclaimed using explicit deallocation (for example, with C’s free
or C++’s delete
operator) or automatically by the run-time system, using reference counting [Collins, 1960] or a tracing garbage collector [McCarthy, 1960]. Manual reclamation risks programming errors; these may arise in two ways.
Memory may be freed prematurely, while there are still references to it. Such a reference is called a dangling pointer (see Figure 1.1). If the program subsequently follows a dangling pointer, the result is unpredictable. The application programmer has no control over what happens to deallocated memory, so the run-time system may choose, among other options, to clear (fill with zeroes) the space used by the deleted object, to allocate a new object in that space or to return that memory to the operating system. The best that the programmer can hope for is that the program crashes immediately. However, it is more likely that it will continue for millions of cycles before crashing (making debugging difficult) or simply run to completion but produce incorrect results (which might not even be easy to detect). One way to detect dangling references is to use fat pointers. These can be used to hold the version number of their target as well as the pointer itself. Operations such as dereferencing must then check that the version number stored in the pointer matches that stored in the object. However, this approach is mostly restricted to use with debugging tools because of its overhead, and it is not completely reliable.2
The second kind of error is that the programmer may fail to free an object no longer required by the program, leading to a memory leak. In small programs, leaks may be benign but in large programs they are likely to lead either to substantial performance degradation (as the memory manager struggles to satisfy new allocation requests) or to failure (if the program runs out of memory). Often a single incorrect deallocation may lead to both dangling pointers and memory leaks (as in Figure 1.1).
Programming errors of this kind are particularly prevalent in the presence of sharing, when two or more subroutines may hold references to an object. This is even more problematic for concurrent programming when two or more threads may reference an object. With the increasing ubiquity of multicore processors, considerable effort has gone into the construction of libraries of data structures that are thread-safe. Algorithms that access these structures need to guard against a number of problems, including deadlock, livelock and ABA errors.3 Automatic memory management eases the construction of concurrent algorithms significantly (for example, by eliminating certain ABA problems). Without this, programming solutions are much more complicated [Herlihy and Shavit, 2008].
The issue is more fundamental than simply being a matter of programmers needing to take more care. Difficulties of correct memory management are often inherent to the programming problem in question.4 More generally, safe deallocation of an object is complex because, as Wilson [1994] points out, “liveness is a global property”, whereas the decision to call free
on a variable is a local one.
So how do programmers cope in languages not supported by automatic dynamic memory management? Considerable effort has been invested in resolving this dilemma. The key advice has been to be consistent in the way that they manage the ownership of objects [Belotsky, 2003; Cline and Lomow, 1995]. Belotsky [2003] and others offer several possible strategies for C++. First, programmers should avoid heap allocation altogether, wherever possible. For example, objects can be allocated on the stack instead. When the objects’ creating method returns, the popping of the stack will free these objects automatically. Secondly, programmers should pass and return objects by value, by copying the full contents of a parameter/result rather than by passing references. Clearly both of these approaches remove all allocation/deallocation errors but they do so at the cost of both increased memory pressure and the loss of sharing. In some circumstances it may be appropriate to use custom allocators, for example, that manage a pool of objects. At the end of a program phase, the entire pool can be freed as a whole.
C++ has seen several attempts to use special pointer classes and templates to improve memory management. These overload normal pointer operations in order to provide safe storage reclamation. However, such smart pointers have several limitations. The auto_ptr
class template cannot be used with the Standard Template Library and will be deprecated in the expected next edition of the C++ standard [Boehm and Spertus, 2009].5 It will be replaced by an improved unique_ptr
that provides strict ownership semantics that allow the target object to be deleted when the unique pointer is. The standard will also include a reference counted shared_ptr
,6 but these also have limitations. Reference counted pointers are unable to manage self-referential (cyclic) data structures. Most smart pointers are provided as libraries, which restricts their applicability if efficiency is a concern. Possibly, they are most appropriately used to manage very large blocks, references to which are rarely assigned or passed, in which case they might be significantly cheaper than tracing collection. On the other hand, without the cooperation of the compiler and run-time system, reference counted pointers are not an efficient, general purpose solution to the management of small objects, especially if pointer manipulation is to be thread-safe.
The plethora of strategies for safe manual memory management throws up yet another problem. If it is essential for the programmer to manage object ownership consistently, which approach should she adopt? This is particularly problematic when using library code. Which approach does the library take? Do all the libraries used by the program use the same approach?
1.2 Automatic dynamic memory management
Automatic dynamic memory management resolves many of these issues. Garbage collection (GC) prevents dangling pointers being created: an object is reclaimed only when there is no pointer to it from a reachable object. Conversely, in principle all garbage is guaranteed to be freed — any object that is unreachable will eventually be reclaimed by the collector — with two caveats. The first is that tracing collection uses a definition of ‘garbage’ that is decidable and may not include all objects that will never be accessed again. The second is that in practice, as we shall see in later chapters, garbage collector implementations may choose for efficiency reasons not to reclaim some objects. Only the collector releases objects so the double-freeing problem cannot arise. All reclamation decisions are deferred to the collector, which has global knowledge of the structure of objects in the heap and the threads that can access them. The problems of explicit deallocation were largely due to the difficulty of making a global decision in a local context. Automatic dynamic memory management simply finesses this problem.
Above all, memory management is a software engineering issue. Well-designed programs are built from components (in the loosest sense of the term) that are highly cohesive and loosely coupled. Increasing the cohesion of modules makes programs easier to maintain. Ideally, a programmer should be able to understand the behaviour of a module from the code of that module alone, or at worst a few closely related modules. Reducing the coupling between modules means that the behaviour of one module is not dependent on the implementation of another module. As far as correct memory management is concerned, this means that modules should not have to know the rules of the memory management game played by other modules. In contrast, explicit memory management goes against sound software engineering principles of minimal communication between components; it clutters interfaces, either explicitly through additional parameters to communicate ownership rights, or implicitly by requiring programmers to conform to particular idioms. Requiring code to understand the rules of engagement limits the reusability of components.
The key argument in favour of garbage collection is not just that it simplifies coding — which it does — but that it uncouples the problem of memory management from interfaces, rather than scattering it throughout the code. It improves reusability. This is why garbage collection, in one form or another, has been a requirement of almost all modern languages (see Table 1.1). It is even expected that the next C++ standard will require code to be written so as to allow a garbage-collected implementation [Boehm and Spertus, 2009]. There is substantial evidence that managed code, including automatic memory management, reduces development costs [Butters, 2007]. Unfortunately, most of this evidence is anecdotal or compares development in different languages and systems (hence comparing more than just memory management strategies), and few detailed comparative studies have been performed. Nevertheless, one author has suggested that memory management should be the prime concern in the design of software for complex systems [Nagle, 1995]. Rovner [1985] estimated that 40% of development time for Xerox’s Mesa system was spent on getting memory management correct. Possibly the strongest corroboration of the case for automatic dynamic memory management is an indirect, economic one: the continued existence of a wide variety of vendors and tools for detection of memory errors.
We do not claim that garbage collection is a silver bullet that will eradicate all memory-related programming errors or that it is applicable in all situations. Memory leaks are one of the most prevalent kinds of memory error. Although garbage collection tends to reduce the chance of memory leaks, it does not guarantee to eliminate them. If an object structure becomes unreachable to the rest of the program (for example, through any chain of pointers from the known roots), then the garbage collector will reclaim it. Since this is the only way that an object can be deleted, dangling pointers cannot arise. Furthermore, if deletion of an object causes its children to become unreachable, they too will be reclaimed. Thus, neither of the scenarios of Figure 1.1 are possible. However, garbage collection cannot guarantee the absence of space leaks. It has no answer to the problem of a data structure that is still reachable, but grows without limit (for example, if a programmer repeatedly adds data to a cache but never removes objects from that cache), or that is reachable and simply never accessed again.
Automatic dynamic memory management is designed to do just what it says. Some critics of garbage collection have complained that it is unable to provide general resource management, for example, to close files or windows promptly after their last use. However, this is unfair. Garbage collection is not a universal panacea. It attacks and solves a specific question: the management of memory resources. Nevertheless, the problem of general resource management in a garbage collected language is a substantial one. With explicitly-managed systems there is a straightforward and natural coupling between memory reclamation and the disposal of other resources. Automatic memory management introduces the problem of how to structure resource management in the absence of a natural coupling. However, it is interesting to observe that many resource release scenarios require something akin to a collector in order to detect whether the resource is still in use (reachable) from the rest of the program.
ActionScript (2000) |
Algol-68 (1965) |
APL (1964) |
AppleScript (1993) |
AspectJ (2001) |
Awk (1977) |
Beta (1983) |
C# (1999) |
Cyclone (2006) |
Managed C++ (2002) |
Cecil (1992) |
Cedar (1983) |
Clean (1984) |
CLU (1974) |
D (2007) |
Dart (2011) |
Dylan (1992) |
Dynace (1993) |
E (1997) |
Eiffel (1986) |
Elasti-C (1997) |
Emerald (1988) |
Erlang (1990) |
Euphoria (1993) |
F# (2005) |
Fortress (2006) |
Green (1998) |
Go (2010) |
Groovy (2004) |
Haskell (1990) |
Hope (1978) |
Icon (1977) |
Java (1994) |
JavaScript (1994) |
Liana (1991) |
Limbo (1996) |
Lingo (1991) |
Lisp (1958) |
LotusScript (1995) |
Lua (1994) |
Mathematica (1987) |
MATLAB (1970s) |
Mercury (1993) |
Miranda (1985) |
ML (1990) |
Modula-3 (1988) |
Oberon (1985) |
Objective-C (2007–) |
Obliq (1993) |
Perl (1986) |
Pike (1996) |
PHP (1995) |
Pliant (1999) |
POP-2 (1970) |
PostScript (1982) |
Prolog (1972) |
Python (1991) |
R (1993) |
Rexx (1979) |
Ruby (1993) |
Sather (1990) |
Scala (2003) |
Scheme (1975) |
Self (1986) |
SETL (1969) |
Simula (1964) |
SISAL (1983) |
Smalltalk (1972) |
SNOBOL (1962) |
Squeak (1996) |
Tcl (1990) |
Theta (1994) |
VB.NET (2001) |
VBScript (1996) |
Visual Basic (1991) |
VHDL (1987) |
X10 (2004) |
YAFL (1993) |
Online sources: Dictionary of Programming Languages, Wikipedia and Google.
1.3 Comparing garbage collection algorithms
In this book we discuss a wide range of collectors, each designed with different workloads, hardware contexts and performance requirements in mind. Unfortunately, it is never possible to identify a ‘best’ collector for all configurations. For example, Fitzgerald and Tarditi [2000] found in a study of 20 benchmarks and six collectors that for every collector there was at least one benchmark that would run at least 15% faster with a more appropriate collector. Singer et al [2007b] applied machine learning techniques to predict the best collector configuration for a particular program. Others have explored allowing Java virtual machines to switch collectors as they run if they believe that the characteristics of the workload being run would benefit from a different collector [Printezis, 2001; Soman et al, 2004]. In this section, we examine the metrics by which collectors can be compared. Nevertheless, such comparisons are difficult in both principle and practice. Details of implementation, locality and the practical significance of the constants in algorithmic complexity formulae make them less than perfect guides to practice. Moreover, the metrics are not independent variables. Not only does the performance of an algorithm depend on the topology and volume of objects in the heap, but also on the access patterns of the application. Worse, the tuning options in production virtual machines are inter-connected. Variation of one parameter to achieve a particular goal may lead to other, contradictory effects.
The prime consideration is that garbage collection should be safe: the collector must never reclaim the storage of live objects. However, safety comes with a cost, particularly for concurrent collectors (see Chapter 15). The safety of conservative collection, which receives no assistance from the compiler or run-time system, may in principle be vulnerable to certain compiler optimisations that disguise pointers [Jones, 1996, Chapter 9].
A common goal for end users is that their programs should run faster. However, there are several aspects to this. One is that the overall time spent in garbage collection should be as low as possible. This is commonly referred to in the literature as the mark/cons ratio, comparing the early Lisp activities of the collector (‘marking’ live objects) and the mutator (creating or ‘consing’ new list cells). However, the user is most likely to want the application as a whole (mutator plus collector) to execute in as little time as possible. In most well designed configurations, much more CPU time is spent in the mutator than the collector. Therefore it may be worthwhile trading some collector performance for increased mutator throughput. For example, systems managed by mark-sweep collection occasionally perform more expensive compacting phases in order to reduce fragmentation so as to improve mutator allocation performance (and possibly mutator performance more generally).
Ideally, garbage collection should be complete: eventually, all garbage in the heap should be reclaimed. However, this is not always possible nor even desirable. Pure reference counting collectors, for example, are unable to reclaim cyclic garbage (self-referential structures). For performance reasons, it may be desirable not to collect the whole heap at every collection cycle. For example, generational collectors segregate objects by their age into two or more regions called generations (we discuss generational garbage collection in Chapter 9). By concentrating effort on the youngest generation, generational collectors can both improve total collection time and reduce the average pause time for individual collections.
Concurrent collectors interleave the execution of mutators and collectors; the goal of such collectors is to avoid, or at least bound, interruptions to the user program. One consequence is that objects that become garbage after a collection cycle has started may not be reclaimed until the end of the next cycle; such objects are called floating garbage. Hence, in a concurrent setting it may be more appropriate to define completeness as eventual reclamation of all garbage, as opposed to reclamation within one cycle. Different collection algorithms may vary in their promptness of reclamation, again leading to time/space trade-offs.
On the other hand, an important requirement may be to minimise the collector’s intrusion on program execution. Many collectors introduce pauses into a program’s execution because they stop all mutator threads while collecting garbage. It is clearly desirable to make these pauses as short as possible. This might be particularly important for interactive applications or servers handling transactions (when failure to meet a deadline might lead to the transaction being retried, thus building up a backlog of work). However, mechanisms for limiting pause times may have side-effects, as we shall see in more detail in later chapters. For example, generational collectors address this goal by frequently and quickly collecting a small nursery region, and only occasionally collecting larger, older generations. Clearly, when tuning a generational collector, there is a balance to be struck between the sizes of the generations, and hence not only the pause times required to collect different generations but also the frequency of collections. However, because the sources of some inter-generational pointers must be recorded, generational collection imposes a small tax on pointer write operations by the mutator.
Parallel collectors stop the world to collect but reduce pause times by employing multiple threads. Concurrent and incremental collectors aim to reduce pause times still further by occasionally performing a small quantum of collection work interleaved or in parallel with mutator actions. This too requires taxation of the mutator in order to ensure correct synchronisation between mutators and collectors. As we shall see in Chapter 15, there are different ways to handle this synchronisation. The choice of mechanism affects both space and time costs. It also affects termination of a garbage collection cycle. The cost of the taxation on mutator time depends on how and which manipulations of the heap by the mutator (loads or stores) are recorded. The costs on space, and also collector termination, depends on how much floating garbage (see below) a system tolerates. Multiple mutator and collector threads add to the complexity. In any case, decreasing pause time will increase overall processing time (decrease processing rate).
Maximum or average pause times on their own are not adequate measures. It is also important that the mutator makes progress. The distribution of pause times is therefore also of interest. There are a number of ways that pause time distributions may be reported. The simplest might be a measure of variation such as standard deviation or a graphical representation of the distribution. More interesting measures include minimum mutator utilisation (MMU) and bounded mutator utilisation (BMU). Both the MMU [Cheng and Blelloch, 2001] and BMU [Sachindran et al, 2004] measures seek to display concisely the (minimum) fraction of time spent in the mutator, for any given time window. The x-axis of Figure 1.2 represents time, from 0 to total execution time, and its y-axis the fraction of CPU time spent in the mutator (utilisation). Thus, not only do MMU and BMU curves show total garbage collection time as a fraction of overall execution time (the y-intercept, at the top right of the curves is the mutators’ overall share of processor time), but they also show the maximum pause time (the longest window for which the mutator’s CPU utilisation is zero) as the x-intercept. In general, curves that are higher and more to the left are preferable since they tend towards a higher mutator utilisation for a smaller maximum pause. Note that the MMU is the minimum mutator utilisation (y) in any time window (x). As a consequence it is possible for a larger window to have a lower MMU than a smaller window, leading to dips in the curve. In contrast, BMU curves give the MMU in that time window or any larger one. Monotonically increasing BMU curves are perhaps more intuitive than MMU.
The goal of memory management is safe and efficient use of space. Different memory managers, both explicit and automatic, impose different space overheads. Some garbage collectors may impose per-object space costs (for example, to store reference counts); others may be able to smuggle these overheads into objects’ existing layouts (for example, a mark bit can often be hidden in a header word, or a forwarding pointer may be written over user data). Collectors may have a per-heap space overhead. For example, copying collectors divide the heap into two semispaces. Only one semispace is available to the mutator at any time; the other is held as a copy reserve into which the collector will evacuate live objects at collection time. Collectors may require auxiliary data structures. Tracing collectors need mark stacks to guide the traversal of the pointer graph in the heap; they may also store mark bits in separate bitmap tables rather than in the objects themselves. Concurrent collectors, or collectors that divide the heap into independently collected regions, require remembered sets that record where the mutator has changed the value of pointers, or the locations of pointers that span regions, respectively.
Optimisations for specific languages
Garbage collection algorithms can also be characterised by their applicability to different language paradigms. Functional languages in particular have offered a rich vein for optimisations related to memory management. Some languages, such as ML, distinguish mutable from immutable data. Pure functional languages, such as Haskell, go further and do not allow the user to modify any values (programs are referentially transparent). Internally, however, they typically update data structures at most once (from a ‘thunk’ to weak head normal form); this gives multi-generation collectors opportunities to promote fully evaluated data structures eagerly (see Chapter 9). Authors have also suggested complete mechanisms for handling cyclic data structures with reference counting. Declarative languages may also allow other mechanisms for efficient management of heap spaces. Any data created in a logic language after a ‘choice point’ becomes unreachable after the program backtracks to that point. With a memory manager that keeps objects laid out in the heap in their order of allocation, memory allocated after the choice point can be reclaimed in constant time. Conversely, different language definitions may make specific requirements of the collector. The most notable are the ability to deal with a variety of pointer strengths and the need for the collector to cause dead objects to be finalised.
The final metrics we identify here are scalability and portability. With the increasing prevalence of multicore hardware on the desktop and even laptop (rather than just in large servers), it is becoming increasingly important that garbage collection can take advantage of the parallel hardware on offer. Furthermore, we expect parallel hardware to increase in scale (number of cores and sockets) and for heterogeneous processors to become more common. The demands on servers are also increasing, as heap sizes move into the tens or hundreds of gigabytes scale and as transaction loads increase. A number of collection algorithms depend on support from the operating system or hardware (for instance, by protecting pages or by double mapping virtual memory space, or on the availability of certain atomic operations on the processor). Such techniques are not necessarily portable.
1.4 A performance disadvantage?
We conclude the discussion of the comparative merits of automatic and manual dynamic memory management by asking if automatic memory management must be at a performance disadvantage compared with manual techniques. In general, the cost of automatic dynamic memory management is highly dependent on application behaviour and even hardware, making it impossible to offer simple estimates of overhead. Nevertheless, a long running criticism of garbage collection has been that it is slow compared to explicit memory management and imposes unacceptable overheads, both in terms of overall throughput and in pauses for garbage collection. While it is true that automatic memory management does impose a performance penalty on the program, it is not as much as is commonly assumed. Furthermore, explicit operations like malloc
and free
also impose a significant cost. Hertz, Feng, and Berger [2005] measured the true cost of garbage collection for a variety of Java benchmarks and collection algorithms. They instrumented a Java virtual machine to discover precisely when objects became unreachable, and then used the reachability trace as an oracle to drive a simulator, measuring cycles and cache misses. They compared a wide variety of garbage collector configurations against different implementations of malloc/free
: the simulator invoked free
at the point where the trace indicated that an object had become garbage. Although, as expected, results varied between both collectors and explicit allocators, Hertz et al found garbage collectors could match the execution time performance of explicit allocation provided they were given a sufficiently large heap (five times the minimum required). For more typical heap sizes, the garbage collection overhead increased to 17% on average.
One of the most welcome changes over the past decade or so has been the improvement in experimental methodology reported in the literature on memory management. Nevertheless, it remains clear that reporting standards in computer science have some way to improve before they match the quality of the very best practice in the natural or social sciences. Mytkowicz et al [2008] find measurement bias to be ‘significant and commonplace’.
In a study of a large number of papers on garbage collection, Georges et al [2007] found the experimental methodology, even where reported, to be inadequately rigorous in many cases. Many reported performance improvements were sufficiently small, and the reports lacking in statistical analysis, to raise questions of whether any confidence could be placed in the results. Errors introduced may be systematic or random. Systematic errors are largely due to poor experimental practice and can often be reduced by more careful design of experiments. Random errors are typically due to non-determinism in the system under measurement. By their nature, these are unpredictable and often outside the experimenter’s control; they should be addressed statistically.
The use of synthetic or small scale, ‘toy’, benchmarks has long been criticised as inadequate [Zorn, 1989]. Such benchmarks risk introducing systematic errors because they do not reflect the interactions in memory allocation that occur in real programs, or because their working sets are sufficiently small that they exhibit locality effects that real programs would not. Wilson et al [1995a] provide an excellent critique of such practices. Fortunately, other than for stress testing, synthetic and toy benchmarks have been largely abandoned in favour of larger scale benchmark suites, consisting of widely used programs that are believed to represent a wide range of typical behaviour (for example, the DaCapo suite for Java [Blackburn et al, 2006b]).
Experiments with benchmark suites that contain a large number of realistic programs can introduce systematic bias. Managed run-times, in particular, offer several opportunities for the introduction of systematic errors. Experimenters need to take care to distinguish the context that they are trying to examine: are they interested in start-up costs (important, for example, for short-lived programs) or in the steady state? For the latter, it is important to exclude system warm-up effects such as class loading and dynamic code optimisation. In both cases, it is probably important to disregard cold-start effects such as latency caused by loading the necessary files into the disk cache: thus Georges et al [2007] advocate running several invocations of the virtual machine and benchmark and discarding the first.
Dynamic (or run-time) compilation is a major source of non-determinism, and is particularly difficult to deal with when comparing alternative algorithms. One solution is to remove it. Compiler replay [Blackburn et al, 2006b] allows the user to record which methods are optimised and to which level in a preparatory run of the benchmark. This record can then used by the virtual machine to ensure the same level of optimisation in subsequent, performance runs. However, a problem with this approach is that alternative implementations typically execute different methods, particularly in the component under test. It is not clear which compilation record should be used. Two separate ones? Their intersection?
Sound experimental practice requires that outcomes are valid even in the presence of bias (for example, random errors). This requires repetitions of the experiment and statistical comparison of the results. To be able to state with confidence that one approach is superior to another requires that, first, a confidence level is stated, and second, confidence intervals for each alternative are derived from the results and that these intervals are not found to overlap. Georges et al [2007] offer a statistically rigorous methodology to address non-deterministic and unpredictable errors (including the effects of dynamic compilation). They advocate invoking one instance of the virtual machine and executing a benchmark many times until it reaches a steady state (that is, when the coefficient of variation7 for the last k benchmark iterations falls below some preset threshold). These k iterations can then be used to compute a mean for the benchmark under steady state. By repeating this process, an overall mean and a confidence interval can be computed. Better, the whole distribution (or at least more than one or two moments of it) should be reported.
Garbage collection research needs thorough performance reports. A single ‘spot’ figure, even if decorated with a confidence interval, is not sufficient. The reason is that memory management involves space and time trade-offs. In most circumstances, one way to reduce collection times is to increase the size of the heap (up to a certain point — after that locality effects typically cause execution times to deteriorate). Thus, no experiment that reports a figure for just a single heap size can be taken seriously. It is vital, therefore, that environments allow the user to control the size of heaps (and spaces within those heaps) in order to understand fully the performance characteristics of a particular memory management algorithm. We firmly advocate this even for production virtual machines which may automatically adapt sizes for optimal performance; while automatic adaptation might be appropriate for end users, researchers and developers need more insight.
The chaotic nature of garbage collection reinforces this requirement. By calling garbage collection chaotic, we mean that small changes in configuration can, and commonly do, lead to large changes in behaviour. One example is the scheduling of collections. Even a small change to the point at which a garbage collection occurs may mean that a large data structure either remains reachable or becomes garbage. This can have large effects not only on the cost of the current collection but on how soon the next collection will occur, thus making such variation self-amplifying. By providing results for a range of heap sizes (often expressed in terms of multiples of the smallest heap size in which a program will run to completion), such ‘jitter’ is made readily apparent.
We conclude this chapter by explaining the notation used in the rest of the book. We also give more precise definitions of some of the terms used earlier.
First, a note about units of storage. We adopt the convention that a byte comprises eight bits. Similarly, we use kilobyte (KB), megabyte (MB), gigabyte (GB) and terabyte (TB) to mean a corresponding power of two multiple of the unit byte (210, 220, 230, 240, respectively), in flagrant disregard for the standard definitions of the SI decimal prefixes.
The heap is either a contiguous array of memory words or organised into a set of discontiguous blocks of contiguous words. A granule is the smallest unit of allocation, typically a word or double-word, depending on alignment requirements. A chunk is a large contiguous group of granules. A cell is a generally smaller contiguous group of granules and may be allocated or free, or even wasted or unusable for some reason.
An object is a cell allocated for use by the application. An object is usually assumed to be a contiguous array of addressable bytes or words, divided into slots or fields, as in Figure 1.3 (although some memory managers for real-time or embedded systems may construct an individual large object as a pointer structure, this structure is not revealed to the user program). A field may contain a reference or some other scalar non-reference value such as an integer. A reference is either a pointer to a heap object or the distinguished value null
. Usually, a reference will be the canonical pointer to the head of the object (that is, its first address), or it may point to some offset from the head. An object will sometimes also have a header field which stores metadata used by the run-time system, commonly (but not always) stored at the head of an object. A derived pointer is a pointer obtained by adding an offset to an object’s canonical pointer. An interior pointer is a derived pointer to an internal object field.
A block is an aligned chunk of a particular size, usually a power of two. For completeness we mention also that a frame (when not referring to a stack frame) means a large 2k sized portion of address space, and a space is a possibly discontiguous collection of chunks, or even objects, that receive similar treatment by the system. A page is as defined by the hardware and operating system’s virtual memory mechanism, and a cache line (or cache block) is as defined by its cache. A card is a 2k aligned chunk, smaller than a page, related to some schemes for remembering cross-space pointers (Section 11.8).
The heap is often characterised as an object graph, which is a directed graph whose nodes are heap objects and whose directed edges are the references to heap objects stored in their fields. An edge is a reference from a source node or a root (see below) to a destination node.
Following Dijkstra et al [1976, 1978], a garbage-collected program is divided into two semi-independent parts.
• The mutator executes application code, which allocates new objects and mutates the object graph by changing reference fields so that they refer to different destination objects. These reference fields may be contained in heap objects as well as other places known as roots, such as static variables, thread stacks, and so on. As a result of such reference updates, any object can end up disconnected from the roots, that is, unreachable by following any sequence of edges from the roots.
• The collector executes garbage collection code, which discovers unreachable objects and reclaims their storage.
A program may have more than one mutator thread, but the threads together can usually be thought of as a single actor over the heap. Equally, there may be one or more collector threads.
Separately from the heap memory, we assume some finite set of mutator roots, representing pointers held in storage that is directly accessible to the mutator without going through other objects. By extension, objects in the heap referred to directly by the roots are called root objects. The mutator visits objects in the graph by loading pointers from the current set of root objects (adding new roots as it goes). The mutator can also discard a root by overwriting the root pointer’s storage with some other reference (that is, null
or a pointer to another object). We denote the set of (addresses of) the roots by Roots
.
In practice, the roots usually comprise static/global storage and thread-local storage (such as thread stacks) containing pointers through which mutator threads can directly manipulate heap objects. As mutator threads execute over time, their state (and so their roots) will change.
In a type-safe programming language, once an object becomes unreachable in the heap, and the mutator has discarded all root pointers to that object, then there is no way for the mutator to reacquire a pointer to the object. The mutator cannot ‘rediscover’ the object arbitrarily (without interaction with the run-time system) — there is no pointer the mutator can traverse to it and arithmetic construction of new pointers is prohibited. A variety of languages support finalisation of at least some objects. These appear to the mutator to be ‘resurrected’ by the run-time system. Our point is that the mutator cannot gain access to any arbitrary unreachable object by its efforts alone.
References, fields and addresses
In general, we shall refer to a heap node N by using its memory address (though this need not necessarily be the initial word of an object, but may be to some appropriate standard point in the layout of the object’s data and metadata). Given an object (at address) N, we can refer to arbitrary fields of the object — which may or may not contain pointers — by treating the object as an array of fields: the ith field of an object N will be denoted N[i], counting fields from 0; the number of fields of N is written |N|. We write the usual C syntax for dereferencing a (non-null) pointer p as *
p. Similarly, we use &
to obtain the address of a field. Thus, we write &
N[i
] for the address of the ith field of N. Given an object (at address) N the set Pointers
(N) denotes the set of (addresses of) pointer fields of N. More formally:
Pointers
(N) = {a | a = &N[i], ∀i : 0 ≤ i < |N| where N[i] is a pointer
For convenience, we write Pointers
to denote the set of all pointer fields of all objects in the heap. Similarly, Nodes
denotes the set of all (allocated) objects in the heap. For convenience, we will also treat the set Roots
as a pseudo-object (separate from the heap), and define Pointers(Roots)=Roots
synonymously. By implication, this allows us to write Roots
[i] to refer to the ith root field.
Liveness, correctness and reachability
An object is said to be live if it will be accessed at some time in the future execution of the mutator. A garbage collector is correct only if it never reclaims live objects. Unfortunately, liveness is an undecidable property of programs: there is no way to decide for an arbitrary program whether it will ever access a particular heap object or not.8 Just because a program continues to hold a pointer to an object does not mean it will access it. Fortunately, we can approximate liveness by a property that is decidable: pointer reachability. An object N is reachable from an object M if N can be reached by following a chain of pointers, starting from some field f of M. By extension, an object is only usable by a mutator if there is a chain of pointers from one of the mutator’s roots to the object.
More formally (in the mathematical sense that allows reasoning about reachability), we can define the immediate ‘points-to’ relation →f as follows. For any two heap nodes M, N in Nodes
, M →f N if and only if there is some field location f=&
M[i] in Pointers
(M) such that *
f=N. Similarly, Roots
→f N if and only if there is some field f in Roots
such that *
f=N. We say that N is directly reachable from M, written M → N, if there is some field f in Pointers
(M) such that M →f N (that is, some field f of M points to N). Then, the set of reachable objects in the heap is the transitive referential closure from the set of Roots
under the → relation, that is, the least set
reachable = {N ∈ |
An object that is unreachable in the heap, and not pointed to by any mutator root, can never be accessed by a type-safe mutator. Conversely, any object reachable from the roots may be accessed by the mutator. Thus, liveness is more profitably defined for garbage collectors by reachability. Unreachable objects are certainly dead and can safely be reclaimed. But any reachable object may still be live and must be retained. Although we realise that doing so is not strictly accurate, we will tend to use live and dead interchangeably with reachable and unreachable, and garbage as synonymous with unreachable.
We use a common pseudo-code to describe garbage collection algorithms. We offer these algorithm fragments as illustrative rather than definitive, preferring to resolve ambiguities informally in the text rather than formally in the pseudocode. Our goal is a concise and representative description of each algorithm rather than a full-fleshed implementation.
Indentation denotes the extent of procedure bodies and the scope of control statements. The assignment operator is ← and the equality operator is =. Otherwise we use C-style symbols for the other logical and relational operators, such as ||
(conditional or), &&
(conditional and), ≤, ≥, ≠, %
(modulus) and so on.
The heap allocator, which can be thought of as functionally orthogonal to the collector, supports two operations: allocate
, which reserves the underlying memory storage for an object, and free
which returns that storage to the allocator for subsequent re-use. The size of the storage reserved by allocate
is passed as an optional parameter; when omitted the allocation is of a fixed-size object, or the size of the object is not necessary for understanding of the algorithm. Where necessary, we may pass further arguments to allocate
, for example to distinguish arrays from other objects, or arrays of pointers from those that do not contain pointers, or to include other information necessary to initialise object headers.
Mutator read and write operations
As they execute, mutator threads perform several operations of interest to the collector: New, Read
and Write
. We adopt the convention of naming mutator operations with a leading upper-case letter, as opposed to lower-case for collector operations. Generally, these operations have the expected behaviour: allocating a new object, reading an object field or writing an object field. Specific memory managers may augment these basic operations with additional functionality that turns the operation into a barrier: an action that results in synchronous or asynchronous communication with the collector. We distinguish read barriers and write barriers.
New()
. The New
operation obtains a new heap object from the heap allocator which returns the address of the first word of the newly-allocated object. The mechanism for actual allocation may vary from one heap implementation to another, but collectors usually need to be informed that a given object has been allocated in order to initialise metadata for that object, and before it can be manipulated by the mutator. The trivial default definition of New
simply allocates.
New():
return allocate()
Read(src,i)
. The Read
operation accesses an object field in memory (which may hold a scalar or a pointer) and returns the value stored at that location. Read
generalises memory loads and takes two arguments: (a pointer to) the object and the (index of its) field being accessed. We allow src=Roots
if the field src[i]
is a root (that is, &src[i]∈Roots
). The default, trivial definition of Read
simply returns the contents of the field.
Read(src, i):
return src[i]
Write(src,i,val)
. The Write
operation modifies a particular location in memory. It generalises memory stores and takes three arguments: (a pointer to) the source object and the (index of its) field to be modified, plus the (scalar or pointer) value to be stored. Again, if src=Roots
then the field src[i]
is a root (that is, &src[i]∈Roots
). The default, trivial definition of Write
simply updates the field.
Write(src, i, val):
src[i] ← val
In the face of concurrency between mutator threads, collector threads, and between the mutator and collector, all collector algorithms require that certain code sequences appear to execute atomically. For example, stopping mutator threads makes the task of garbage collection appear to occur atomically: the mutator threads will never access the heap in the middle of garbage collection. Moreover, when running the collector concurrently with the mutator, the New, Read
, and Write
operations may need to appear to execute atomically with respect to the collector and/or other mutator threads. To simplify the exposition of collector algorithms we will usually leave implicit the precise mechanism by which atomicity of operations is achieved, simply marking them with the keyword atomic
. The meaning is clear: all the steps of an atomic operation must appear to execute indivisibly and instantaneously with respect to other operations. That is, other operations will appear to execute either before or after the atomic operation, but never interleaved between any of the steps that constitute the atomic operation. For discussion of different techniques to achieve atomicity as desired see Chapter 11 and Chapter 13.
Sets, multisets, sequences and tuples
We use abstract data structures where this clarifies the discussion of an algorithm. We use mathematical notation where it is appropriate but does not obscure simpler concepts. For the most part, we will be interested in sets and tuples, and simple operations over them to add, remove or detect elements.
We use the usual definition of a set as a collection of distinct (that is, unique) elements. The cardinality of a set S, written |S|, is the number of its elements.
In addition to the standard set notation, we also make use of multisets. A multiset’s elements may have repeated membership in the multiset. The cardinality of a multiset is the total number of its elements, including repeated memberships. The number of times an element appears is its multiplicity. We adopt the following notation:
• [] denotes the empty multiset
• [a, a, b] denotes the multiset containing two as and one b
• [a, b] + [a] = [a, a, b] denotes multiset union
• [a, a, b] — [a] = [a, b] denotes multiset subtraction
A sequence is an ordered list of elements. Unlike a set (or multiset), order matters. Like a multiset, the same element can appear multiple times at different positions in the sequence. We adopt the following notation:
• () denotes the empty sequence
• (a, a, b) denotes the sequence containing two as followed by a b
• (a, b) · (a) = (a, b, a) denotes appending of the sequence (a) to (a, b)
While a tuple of length k can be thought of as being equivalent to a sequence of the same length, we sometimes find it convenient to use a different notation to emphasise the fixed length of a tuple as opposed to the variable length of a sequence, and so on. We adopt the notation below for tuples; we use tuples only of length two or more.
• ⟨a1,…,ak⟩ denotes the k-tuple whose ith member is ai, for 1 ≤ i ≤ k
1We shall tend to use the terms method, function, procedure and subroutine interchangeably.
2Tools such as the memcheck
leak detector used with the valgrind
open source instrumentation framework (see http://valgrind.org
) are more reliable, but even slower. There are also a number of commercially available programs for helping to debug memory issues.
3ABA error: a memory location is written (A), overwritten (B) and then overwritten again with the previous value A (see Chapter 13).
4“When C++ is your hammer, everything looks like a thumb,” Steven M. Haflich, Chair of the NCITS/J13 technical committee for ANSI standard for Common Lisp.
5The final committee draft for the next ISO C++ standard is currently referred to as C++0x.
7The coefficient of variation is the standard deviation divided by the mean.
8The undecidability of liveness is a corollary of the halting problem.