If the adage ‘To manage risk, one must measure it’ constitutes the compass for risk management, then risk models provide the roadmaps that guide the analyst throughout the journey of risk assessment. The process of risk assessment and management may be viewed through many lenses, depending on one’s perspective, vision, values, and circumstances. This chapter addresses the complex problem of coping with catastrophic risks by taking a systems engineering perspective. Systems engineering is a multidisciplinary approach distinguished by a practical philosophy that advocates holism in cognition and decision making. The ultimate purposes of systems engineering are to (1) build an understanding of the system’s nature, functional behaviour, and interaction with its environment, (2) improve the decision-making process (e.g., in planning, design, development, operation, and management), and (3) identify, quantify, and evaluate risks, uncertainties, and variability within the decision-making process.
Engineering systems are almost always designed, constructed, and operated under unavoidable conditions of risk and uncertainty and are often expected to achieve multiple and conflicting objectives. The overall process of identifying, quantifying, evaluating, and trading-off risks, benefits, and costs should be neither a separate, cosmetic afterthought nor a gratuitous add-on technical analysis. Rather, it should constitute an integral and explicit component of the overall managerial decision-making process. In risk assessment, the analyst often attempts to answer the following set of triplet questions (Kaplan and Garrick, 1981): ‘What can go wrong?’, ‘What is the likelihood that it would go wrong?’, and ‘What are the consequences?’ Answers to these questions help risk analysts identify, measure, quantify, and evaluate risks and their consequences and impacts.
Risk management builds on the risk assessment process by seeking answers to a second set of three questions (Haimes, 1991): ‘What can be done and what options are available?’, ‘What are their associated trade-offs in terms of all costs, benefits, and risks?’, and ‘What are the impacts of current management decisions on future options?’ Note that the last question is the most critical one for any managerial decision-making. This is so because unless the negative and positive impacts of current decisions on future options are assessed and evaluated (to the extent possible), these policy decisions cannot be deemed to be ‘optimal’ in any sense of the word. Indeed, the assessment and management of risk is essentially a synthesis and amalgamation of the empirical and normative, the quantitative and qualitative, and the objective and subjective efforts. Total risk management can be realized only when these questions are addressed in the broader context of management, where all options and their associated trade-offs are considered within the hierarchical organizational structure. Evaluating the total trade-offs among all important and related system objectives in terms of costs, benefits, and risks cannot be done seriously and meaningfully in isolation from the modelling of the system and from considering the prospective resource allocations of the overall organization.
Theory, methodology, and computational tools drawn primarily from systems engineering provide the technical foundations upon which the above two sets of triplet questions are addressed quantitatively. Good management must thus incorporate and address risk management within a holistic, systemic, and all-encompassing framework and address the following four sources of failure: hardware, software, organizational, and human. This set of sources is intended to be internally comprehensive, i.e., comprehensive within the system’s own internal environment; external sources of failure are not discussed here because they are commonly system-dependent. However, the above four failure elements are not necessarily independent of each other. The distinction between software and hardware is not always straightforward, and separating human from organizational failure is often not an easy task. Nevertheless, these four categories provide a meaningful foundation upon which to build a total risk management framework.
In many respects, systems engineering and risk analysis are intertwined, and only together do they make a complete process. To paraphrase Albert Einstein’s comment about the laws of mathematics and reality, we say: ‘To the extent to which risk analysis is real, it is not precise; to the extent to which risk analysis is precise, it is not real’. The same can be said of systems engineering, since modelling constitutes the foundation of both quantitative risk analysis and systems engineering, and the reality is that no single model can precisely represent large-scale and complex systems.
The myriad economic, organizational, and institutional sectors, among others, that characterize countries in the developed world can be viewed as a complex large-scale system of systems. (In a similar way, albeit on an entirely different scale, this may apply to the terrorist networks and to the global socio-economic and political environment.) Each system is composed of numerous interconnected and interdependent cyber, physical, social, and organizational infrastructures (subsystems), whose relationships are dynamic (i.e., ever changing with time), non-linear (defeating a simplistic modelling schema), probabilistic (fraught with uncertainty), and spatially distributed (agents and infrastructures with possible overlapping characteristics are spread all over the continent(s)). These systems are managed or coordinated by multiple government agencies, corporate divisions, and decision makers, with diverse missions, resources, timetables, and agendas that are often in competition and conflict. Because of the above characteristics, failures due to human and organizational errors are common. Risks of extreme and catastrophic events facing this complex enterprise cannot and should not be addressed using the conventional expected value of risk as the sole measure for risk. Indeed, assessing and managing the myriad sources of risks facing many countries around the world will be fraught with difficult challenges. However, systems modelling can greatly enhance the likelihood of successfully managing such risks. Although we recognize the difficulties in developing, sustaining, and applying the necessary models in our quest to capture the essence of the multiple perspectives and aspects of these complex systems, there is no other viable alternative but to meet this worthy challenge.
The literature of risk analysis is replete with misleading definitions of vulnerability. Of particular concern is the definition of risk as the product of impact, vulnerability, and threat. To define vulnerability properly in the parlance of systems engineering, we must rely on the building blocks of mathematical models, focusing on the use of state variables. For example, to control the production of steel, one must have an understanding of the states of the steel at any instant: its temperature and other physical and chemical properties. To know when to irrigate and fertilize a farm to maximize crop yield, a farmer must assess the soil moisture and the level of nutrients in the soil. To treat a patient, a physician must first know the temperature, blood pressure, and other states of the patient’s physical health.
State variables, which constitute the building blocks for representing and measuring risk to infrastructure and economic systems, are used to define the following terms (Haimes, 2004, 2006; Haimes et al., 2002):
1. Vulnerability is the manifestation of the inherent states of the system (e.g., physical, technical, organizational, cultural) that can be exploited to adversely affect (cause harm or damage to) that system.
2. Intent is the desire or motivation to attack a target and cause adverse effects.
3. Capability is the ability and capacity to attack a target and cause adverse effects.
4. Threat is the intent and capability to adversely affect (cause harm or damage to) the system by adversely changing its states.
5. Risk (when viewed from the perspective of terrorism) can be considered qualitatively as the result of a threat with adverse effects to a vulnerable system. Quantitatively, however, risk is a measure of the probability and severity of adverse effects.
Thus, it is clear that modelling risk as the probability and severity of adverse effects requires knowledge of the vulnerabilities, intents, capabilities, and threats to the infrastructure system. Threats to a vulnerable system include terrorist networks whose purposes are to change some of the fundamental states of a country: from a stable to an unstable government, from operable to inoperable infrastructures, and from a trustworthy to an untrustworthy cyber system. These terrorist networks that threaten a country have the same goals as those commissioned to protect its safety, albeit in opposite directions – both want to control the states of the systems in order to achieve their objectives.
Note that the vulnerability of a system is multidimensional; it is a vector in mathematical terms. For example, consider the risk of a hurricane to a major hospital. The states of the hospital, which represent vulnerabilities, are the functionality/availability of the electric power, water supply, telecommunications, and intensive care and other emergency units, all of which are critical to the overall functionality of the hospital. Furthermore, each of these state variables is not static in its operations and functionality: its level of functionality changes and evolves continuously. In addition, each is a system of its own, with its own sub-state variables. For example, the water supply system consists of the main pipes, distribution system, and pumps, among other elements, each with its own attributes. Therefore, to oversimplify the multidimensional vulnerability into a scalar quantity in representing risk could mask the underlying causes of risk and lead to results that are not useful.
The significance of understanding the systems-based nature of a system’s vulnerability through its essential state variables manifests itself in both the risk assessment process (the problem) and the risk management process (the remedy). As noted in Section 7.1, in risk assessment we ask: What can go wrong? What is the likelihood? What might be the consequences? (Kaplan and Garrick, 1981). In risk management we ask: What can be done and what options are available? What are the trade-offs in terms of all relevant costs, benefits, and risks? What are the impacts of current decisions on future options? (Haimes, 1991, 2004). This significance is also evident in the interplay between vulnerability and threat. (Recall that a threat to a vulnerable system, with adverse effects, yields risk.) Indeed, to answer the triplet questions in risk assessment, it is imperative to have knowledge of those states that represent the essence of the system under study and of their levels of functionality and security.
Hierarchical holographic modelling (HHM) is a holistic philosophy/methodology aimed at capturing and representing the essence of the inherent diverse characteristics and attributes of a system – its multiple aspects, perspectives, facets, views, dimensions, and hierarchies. Central to the mathematical and systems basis of holographic modelling is the overlapping among various holographic models with respect to the objective functions, constraints, decision variables, and input-output relationships of the basic system. The term holographic refers to the desire to have a multi-view image of a system when identifying vulnerabilities (as opposed to a single view, or a flat image of the system). Views of risk can include but are not limited to (1) economic, (2) health, (3) technical, (4) political, and (5) social systems. In addition, risks can be geography related and time related. In order to capture a holographic outcome, the team that performs the analysis must provide a broad array of experience and knowledge.
The term hierarchical refers to the desire to understand what can go wrong at many different levels of the system hierarchy. For a risk assessment to be complete, HHM recognizes that the macroscopic risks understood at the upper management level of an organization are very different from the microscopic risks observed at lower levels. In a particular situation, a microscopic risk can become a critical factor in making things go wrong. To carry out a complete HHM analysis, the team that performs the analysis must include people who bring knowledge up and down the hierarchy.
HHM has turned out to be particularly useful in modelling large-scale, complex, and hierarchical systems, such as defence and civilian infrastructure systems. The multiple visions and perspectives of HHM add strength to risk analysis. It has been extensively and successfully deployed to study risks for government agencies such as the President’s Commission on Critical Infrastructure Protection (PCCIP), the FBI, NASA, the Virginia Department of Transportation (VDOT), and the National Ground Intelligence Center, among others. (These cases are discussed as examples in Haimes [2004].) The HHM methodology/philosophy is grounded on the premise that in the process of modelling large-scale and complex systems, more than one mathematical or conceptual model is likely to emerge. Each of these models may adopt a specific point of view, yet all may be regarded as acceptable representations of the infrastructure system. Through HHM, multiple models can be developed and coordinated to capture the essence of many dimensions, visions, and perspectives of infrastructure systems.
In the first issue of Risk Analysis, Kaplan and Garrick (1981) set forth the following ‘set of triplets’ definition of risk, R:

R = {⟨Si, Li, Xi⟩},  i = 1, 2, . . ., N

where Si denotes the i-th ‘risk scenario’, Li denotes the likelihood of that scenario, and Xi the ‘damage vector’ or resulting consequences. This definition has served the field of risk analysis well since then, and much early debate has been thoroughly resolved about how to quantify the Li and Xi, and the meaning of ‘probability’, ‘frequency’, and ‘probability of frequency’ in this connection (Kaplan, 1993, 1996).
In Kaplan and Garrick (1981) the Si themselves were defined, somewhat informally, as answers to the question, ‘What can go wrong?’ with the system or process being analysed. Subsequently, a subscript ‘c’ was added to the set of triplets by Kaplan (1991, 1993):

R = {⟨Si, Li, Xi⟩}c

The subscript denotes that the set of scenarios {Si} should be ‘complete’, meaning it should include ‘all the possible scenarios, or at least all the important ones’.
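As a concrete illustration, the set-of-triplets definition lends itself naturally to a simple data structure. This is a minimal sketch; the class name, scenario labels, and all numbers below are invented for illustration, not drawn from the source.

```python
from dataclasses import dataclass

@dataclass
class RiskTriplet:
    scenario: str        # Si: an answer to 'What can go wrong?'
    likelihood: float    # Li: the likelihood of that scenario
    damages: tuple       # Xi: the damage vector (consequences)

# A hypothetical two-scenario risk description for a water system.
risk = [
    RiskTriplet("pump failure", 0.01, (5.0, 2.0)),
    RiskTriplet("pipe rupture", 0.001, (50.0, 30.0)),
]

# If the scenarios are disjoint, their likelihoods may be summed; the
# completeness subscript 'c' demands that no important scenario is missing.
total_likelihood = sum(t.likelihood for t in risk)
```

The completeness requirement is a property of the whole list, not of any single triplet: adding a forgotten scenario changes the risk description even if every existing triplet is unchanged.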
At about the same time that Kaplan and Garrick’s (1981) definition of risk was published, so too was the first article on HHM (Haimes, 1981, 2004). Central to the HHM method is a particular diagram (see, for example, Fig. 7.1). This is particularly useful for analysing systems with multiple, interacting (perhaps overlapping) subsystems, such as a regional transportation or water supply system. The different columns in the diagram reflect different ‘perspectives’ on the overall system.
HHM can be seen as part of the theory of scenario structuring (TSS) and vice versa, that is, TSS as part of HHM. Under the sweeping generalization of the HHM method, the different methods of scenario structuring can lead to seemingly different sets of scenarios for the same underlying problem. This fact is a bit awkward from the standpoint of the ‘set of triplets’ definition of risk (Kaplan and Garrick, 1981).
Fig. 7.1 Hierarchical holographic modelling (HHM) framework for identification of sources of risk.
The HHM approach divides the continuum of risk scenarios but does not necessarily partition it. In other words, it allows the set of subsets to be overlapping, that is, non-disjoint. It argues that disjointedness is required only when we are going to quantify the likelihood of the scenarios, and even then, only if we are going to add up these likelihoods (in which case the overlapping areas would end up counted twice). Thus, if the risk analysis seeks mainly to identify scenarios rather than to quantify their likelihood, the disjointedness requirement can be relaxed somewhat, so that it becomes a preference rather than a necessity.
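The point about disjointedness can be sketched numerically: summing likelihoods over overlapping scenario sets counts the overlap twice, whereas summing over their union does not. All scenario labels and numbers here are illustrative.

```python
# Two HHM 'boxes' yield overlapping scenario sets: s3 appears in both.
box_a = {"s1", "s2", "s3"}
box_b = {"s3", "s4"}
likelihood = {"s1": 0.02, "s2": 0.01, "s3": 0.005, "s4": 0.001}

# Naive per-box summation counts the overlapping scenario s3 twice...
naive_total = (sum(likelihood[s] for s in box_a)
               + sum(likelihood[s] for s in box_b))

# ...while summing over the union counts each scenario exactly once.
union_total = sum(likelihood[s] for s in box_a | box_b)

# The inflation of the naive sum equals the likelihood of the overlap.
assert naive_total > union_total
assert abs(naive_total - union_total - likelihood["s3"]) < 1e-12
```

For pure scenario identification the overlap is harmless, which is exactly why HHM can relax disjointedness; the double counting only matters once likelihoods are aggregated.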
To see how HHM and TSS fit within each other (Kaplan et al., 2001), one key idea is to view the HHM diagram as a depiction of the success scenario S0. Each box in Fig. 7.1 may then be viewed as defining a set of actions or results required of the system, as part of the definition of ‘success’. Conversely, then, each box also defines a set of risk scenarios in which there is failure to accomplish one or more of the actions or results defined by that box. The union of all these sets contains all possible risk scenarios and is then ‘complete’.
This completeness is, of course, a very desirable feature. However, the intersection of two of our risk scenario sets, corresponding to two different HHM boxes, may not be empty. In other words, our scenario sets may not be ‘disjoint’.
No single model can capture all the dimensions necessary to adequately evaluate the efficacy of risk assessment and management activities of emergent multi-scale systems. This is because it is impossible to identify all relevant state variables and their sub-states that adequately represent large and multi-scale systems (Haimes, 1977, 1981, 2004). Indeed, there is a need for theory and methodology that will enable analysts to appropriately rationalize risk management decisions through a process that
1. identifies existing and potential emergent risks systemically,
2. evaluates, prioritizes, and filters these risks based on justifiable selection criteria,
3. collects, integrates, and develops appropriate metrics and a collection of models to understand the critical aspects of regions,
4. recognizes emergent risks that produce large impacts and risk management strategies that potentially reduce those impacts for various time frames,
5. optimally learns from implementing risk management strategies, and
6. adheres to an adaptive risk management process that is responsive to dynamic, internal, and external forced changes. To do so effectively, models must be developed to periodically quantify, to the extent possible, the efficacy of risk management options in terms of their costs, benefits, and remaining risks.
A risk-based, multi-model, systems-driven approach can effectively address these emergent challenges. Such an approach must be capable of maximally utilizing what is known now and of optimally learning, updating, and adapting through time as decisions are made and more information becomes available at various regional levels. The methodology must quantify risks as well as measure the extent of learning in order to quantify adaptability. This learn-as-you-go tactic results in re-evaluation and an evolving, learning risk management process over time.
Phantom system models (PSMs) (Haimes, 2007) enable research teams to effectively analyse major forced (contextual) changes on the characteristics and performance of emergent multi-scale systems, such as cyber and physical infrastructure systems, or major socio-economic systems. The PSM is aimed at providing a reasoned virtual-to-real experimental modelling framework with which to explore and thus understand the relationships that characterize the nature of emergent multi-scale systems. The PSM philosophy rejects a dogmatic approach to problem-solving that relies on a modelling approach structured exclusively on a single school of thinking. Rather, PSM attempts to draw on a pluri-modelling schema that builds on the multiple perspectives gained through generating multiple models. This leads to the construction of appropriate complementary models on which to deduce logical conclusions for future actions in risk management and systems engineering. Thus, we shift from only deciding what is optimal, given what we know, to answering questions such as (1) What do we need to know? (2) What are the impacts of having more precise and updated knowledge about complex systems from a risk reduction standpoint? and (3) What knowledge is needed for acceptable risk management decision-making? Answering this mandates seeking the ‘truth’ about the unknowable complex nature of emergent systems; it requires intellectually bias-free modellers and thinkers who are empowered to experiment with a multitude of modelling and simulation approaches and to collaborate for appropriate solutions.
The PSM has three important functions: (1) identify the states that would characterize the system, (2) enable modellers and analysts to explore cause-and-effect relationships in virtual-to-real laboratory settings, and (3) develop modelling and analyses capabilities to assess irreversible extreme risks; anticipate and understand the likelihoods and consequences of the forced changes around and within these risks; and design and build reconfigured systems to be sufficiently resilient under forced changes, and at acceptable recovery time and costs.
The PSM builds on and incorporates input from HHM, and by doing so seeks to develop causal relationships through various modelling and simulation tools; it imbues life and realism into phantom ideas for emergent systems that otherwise would never have been realized. In other words, with different modelling and simulation tools, PSM legitimizes the exploration and experimentation of out-of-the-box and seemingly ‘crazy’ ideas and ultimately discovers insightful implications that otherwise would have been completely missed and dismissed.
One of the most dominant steps in the risk assessment process is the quantification of risk, yet the validity of the approach most commonly used to quantify risk – its expected value – has received neither the broad professional scrutiny it deserves nor the hoped-for wider mathematical challenge that it mandates. One of the few exceptions is the conditional expected value of the risk of extreme events (among other conditional expected values of risks) generated by the partitioned multi-objective risk method (PMRM) (Asbeck and Haimes, 1984; Haimes, 2004).
Let pX(x) denote the probability density function of the random variable X, where, for example, X is the concentration of the contaminant trichloroethylene (TCE) in a groundwater system, measured in parts per billion (ppb). The expected value of the concentration (the risk of the groundwater being contaminated by an average concentration of TCE) is E(X) ppb. If the probability density function is discretized to n regions over the entire universe of contaminant concentrations, then E(X) equals the sum of the products pi xi, where pi is the probability that the i-th segment of the probability regime has a TCE concentration of xi. Integration (instead of summation) is used for the continuous case. Note, however, that the expected-value operation commensurates contaminations (events) of low concentration and high frequency with contaminations of high concentration and low frequency. For example, events x1 = 2 ppb and x2 = 20,000 ppb with the probabilities p1 = 0.1 and p2 = 0.00001, respectively, yield the same contribution to the overall expected value: (0.1)(2) + (0.00001)(20,000) = 0.2 + 0.2. However, to the decision maker in charge, the relatively low likelihood of a disastrous contamination of the groundwater system with 20,000 ppb of TCE cannot be equivalent to a low-concentration contamination of 2 ppb, even with the very high likelihood of the latter. Owing to the nature of mathematical smoothing, the averaging function of the contaminant concentration in this example does not lend itself to prudent management decisions. This is because the expected value of risk does not accentuate catastrophic events and their consequences, thus misrepresenting what would be perceived as an unacceptable risk.
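The arithmetic of the TCE example can be replayed in a few lines, showing that the expected-value operation assigns the routine event and the catastrophic event identical weight:

```python
# Numbers taken from the TCE example in the text: a frequent low
# concentration and a rare catastrophic one.
events = [
    (0.1,     2.0),       # p1, x1: high likelihood, low concentration (ppb)
    (0.00001, 20_000.0),  # p2, x2: low likelihood, catastrophic concentration
]

contributions = [p * x for p, x in events]
expected_value = sum(contributions)

# Both events contribute 0.2 ppb to E(X), so the expected value cannot
# distinguish the routine contamination from the disastrous one.
```

This is precisely the commensuration the text objects to: the scalar E(X) = 0.4 ppb carries no trace of the 20,000 ppb tail event.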
Before the partitioned multi-objective risk method (PMRM) was developed, problems with at least one random variable were solved by computing and minimizing the unconditional expectation of the random variable representing damage. In contrast, the PMRM isolates a number of damage ranges (by specifying so-called partitioning probabilities) and generates conditional expectations of damage, given that the damage falls within a particular range. A conditional expectation is defined as the expected value of a random variable, given that this value lies within some pre-specified probability range. Clearly, the values of conditional expectations depend on where the probability axis is partitioned. The analyst subjectively chooses where to partition in response to the extreme characteristics of the decision-making problem. For example, if the decision-maker is concerned about the once-in-a-million-years catastrophe, the partitioning should be such that the expected catastrophic risk is emphasized.
The ultimate aim of good risk assessment and management is to suggest some theoretically sound and defensible foundations for regulatory agency guidelines for the selection of probability distributions. Such guidelines should help incorporate meaningful decision criteria, accurate assessments of risk in regulatory problems, and reproducible and persuasive analyses. Since these risk evaluations are often tied to highly infrequent or low-probability catastrophic events, it is imperative that these guidelines consider and build on the statistics of extreme events. Selecting probability distributions to characterize the risk of extreme events is a subject of emerging studies in risk management (Bier and Abhichandani, 2003; Haimes, 2004; Lambert et al., 1994; Leemis, 1995).
There is abundant literature that reviews the methods of approximating probability distributions from empirical data. Goodness-of-fit tests determine whether hypothesized distributions should be rejected as representations of empirical data. Approaches such as the method of moments and maximum likelihood are used to estimate distribution parameters. The caveat in directly applying accepted methods to natural hazards and environmental scenarios is that most deal with selecting the best matches for the ‘entire’ distribution. The problem is that these assessments and decisions typically address worst-case scenarios on the tails of distributions. The differences in distribution tails can be very significant even if the parameters that characterize the central tendency of the distribution are similar. A normal and a uniform distribution that have similar expected values can markedly differ on the tails. The possibility of significantly misrepresenting the tails, which are potentially the most relevant portion of the distribution, highlights the importance of considering extreme events when selecting probability distributions (see also Chapter 8, this volume).
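The claim about tails can be made concrete with a minimal sketch: a normal and a uniform distribution matched in mean and variance agree near the centre yet disagree completely in the extreme-event region. The matching of the uniform's half-width to √3·σ is a standard moment calculation, not from the source.

```python
import math

mu, sigma = 0.0, 1.0

def normal_exceedance(x):
    """P(X > x) for N(mu, sigma^2), via the complementary error function."""
    return 0.5 * math.erfc((x - mu) / (sigma * math.sqrt(2.0)))

# A uniform on [mu - sqrt(3)*sigma, mu + sqrt(3)*sigma] has the same mean
# and standard deviation, but zero probability beyond its endpoints.
half_width = math.sqrt(3.0) * sigma

def uniform_exceedance(x):
    lo, hi = mu - half_width, mu + half_width
    if x >= hi:
        return 0.0
    if x <= lo:
        return 1.0
    return (hi - x) / (hi - lo)

# At mu + 4*sigma the normal still assigns roughly 3e-5 probability of
# exceedance; the matched uniform assigns exactly none.
tail = mu + 4.0 * sigma
```

Despite identical first two moments, the two models give irreconcilable answers in the worst-case region, which is exactly the region that drives decisions about extreme events.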
More time and effort should be spent to characterize the tails of distributions when modelling the entire distribution. Improved matching between extreme events and distribution tails provides policymakers with more accurate and relevant information. Major factors to consider when developing distributions that account for tail behaviours include (1) the availability of data, (2) the characteristics of the distribution tail, such as shape and rate of decay, and (3) the value of additional information in assessment.
The conditional expectations of a problem are found by partitioning the problem’s probability axis and mapping these partitions onto the damage axis. Consequently, the damage axis is partitioned into corresponding ranges. A conditional expectation is defined as the expected value of a random variable given that this value lies within some pre-specified probability range. Clearly, the values of conditional expectations depend on where the probability axis is partitioned. The choice of where to partition is made subjectively by the analyst in response to the extreme characteristics of the problem. If, for example, the analyst is concerned about the once-in-a-million-years catastrophe, the partitioning should be such that the expected catastrophic risk is emphasized. Although no general rule exists to guide the partitioning, Asbeck and Haimes (1984) suggest that if three damage ranges are considered for a normal distribution, then the +1σ and +4σ partitioning values provide an effective rule of thumb. These values correspond to partitioning the probability axis at 0.84 and 0.999968; that is, the low-damage range would contain about 84% of the damage events, the intermediate range just under 16%, and the catastrophic range about 0.0032% (probability of 0.000032). In the literature, catastrophic events are generally said to be those with a probability of exceedance of 10^-5 (see, for instance, the National Research Council report on dam safety [National Research Council, 1985]). This probability corresponds approximately to events exceeding +4σ.
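The +1σ/+4σ rule of thumb can be checked directly from the standard normal cumulative distribution function, computed here with only the error function from the standard library:

```python
import math

def phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Partitioning the probability axis at +1 sigma and +4 sigma:
low_cut = phi(1.0)    # ~0.8413: the low-damage range holds ~84% of events
high_cut = phi(4.0)   # ~0.999968: lower boundary of the catastrophic range

intermediate = high_cut - low_cut   # just under 16% of events
catastrophic = 1.0 - high_cut       # ~3.2e-5, i.e. ~0.0032% of events
```

The computed exceedance probability beyond +4σ (about 3.2 × 10⁻⁵) sits close to the 10⁻⁵ threshold commonly cited for catastrophic events, which is why +4σ serves as a workable partitioning point.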
A continuous random variable X of damages has a cumulative distribution function (cdf) P(x) and a probability density function (pdf) p(x), which are defined by the following relationships:

P(x) = Prob[X ≤ x]  and  p(x) = dP(x)/dx
The cdf represents the non-exceedance probability of x. The exceedance probability of x is defined as the probability that X is observed to be greater than x and is equal to one minus the cdf evaluated at x. The expected value, average, or mean value of the random variable X is defined as

E[X] = ∫ x p(x) dx

where the integral is taken over the entire sample space of X.
For the discrete case, where the universe of events (sample space) of the random variable X is discretized into I segments, the expected value of damage E[X] can be written as

E[X] = Σ pi xi,  i = 1, 2, . . ., I

where xi is the damage associated with the i-th segment and pi its probability.
In the PMRM, the concept of the expected value of damage is extended to generate multiple conditional expected-value functions, each associated with a particular range of exceedance probabilities or their corresponding range of damage severities. The resulting conditional expected-value functions, in conjunction with the traditional expected value, provide a family of risk measures associated with a particular policy.
Let 1 – α1 and 1 – α2, where 0 < α1 < α2 < 1, denote exceedance probabilities that partition the domain of X into three ranges, as follows. On a plot of exceedance probability, there is a unique damage β1 on the damage axis that corresponds to the exceedance probability 1 – α1 on the probability axis. Similarly, there is a unique damage β2 that corresponds to the exceedance probability 1 – α2. Damages less than β1 are considered to be of low severity, and damages greater than β2 of high severity. Similarly, damages of a magnitude between β1 and β2 are considered to be of moderate severity. The partitioning of risk into three severity ranges is illustrated in Fig. 7.2. For example, if the partitioning probability α1 is specified to be 0.05, then β1 is the damage level whose exceedance probability is 0.95 (the 5th percentile of damage). Similarly, if α2 is 0.95 (i.e., 1 – α2 is equal to 0.05), then β2 is the damage level whose exceedance probability is 0.05 (the 95th percentile of damage).
For each of the three ranges, the conditional expected damage (given that the damage is within that particular range) provides a measure of the risk associated with the range. These measures are obtained by defining the conditional expected value. Consequently, the new measures of risk are f2(·), of high exceedance probability and low severity; f3(·), of medium exceedance probability and moderate severity; and f4(·), of low exceedance probability and high severity. The function f2(·) is the conditional expected value of X, given that x is less than or equal to β1:

f2(·) = E[X | X ≤ β1] = (1/α1) ∫ x p(x) dx, where the integral is taken over x ≤ β1 and α1 = P(X ≤ β1)

The functions f3(·) and f4(·) are defined analogously over the moderate- and high-severity ranges.
Fig. 7.2 PDF of failure rate distributions for four designs.
Thus, for a particular policy option, there are three measures of risk, f2(·), f3(·), and f4(·), in addition to the traditional expected value denoted by f5(·). The function f1(·) is reserved for the cost associated with the management of risk. Note that

f5(·) = α1 f2(·) + (α2 – α1) f3(·) + (1 – α2) f4(·),

since the total probability of the sample space of X is necessarily equal to one. In the PMRM, all of these five measures, or some subset of them, are balanced in a multi-objective formulation. The details are made more explicit in the next two sections.
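In the PMRM the conditional measures recombine, with the partitioning probabilities as weights, into the unconditional expected value f5(·), and this can be checked numerically. The sketch below assumes an exponential damage distribution and partitioning probabilities chosen purely for illustration:

```python
import random

# Monte Carlo check of the PMRM recombination: the probability-weighted
# conditional expected values recover the unconditional expected value.
# The exponential damage distribution and the partitioning probabilities
# are illustrative assumptions, not choices made in the chapter.
random.seed(7)
damages = sorted(random.expovariate(0.1) for _ in range(200_000))

alpha1, alpha2 = 0.90, 0.99
beta1 = damages[int(alpha1 * len(damages))]
beta2 = damages[int(alpha2 * len(damages))]

def mean(xs):
    return sum(xs) / len(xs)

f2 = mean([x for x in damages if x <= beta1])          # low severity
f3 = mean([x for x in damages if beta1 < x <= beta2])  # moderate severity
f4 = mean([x for x in damages if x > beta2])           # high severity
f5 = mean(damages)                                     # unconditional

recombined = alpha1 * f2 + (alpha2 - alpha1) * f3 + (1 - alpha2) * f4
print(f"f5={f5:.3f}  recombined={recombined:.3f}")
```

The two printed values agree up to sampling error, while f4(·) alone isolates the rare, high-severity events that the plain expected value dilutes.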
Over time, most, if not all, man-made products and structures ultimately fail. Reliability is commonly used to quantify this time-dependent failure of a system. Indeed, the concept of reliability plays a major role in engineering planning, design, development, construction, operation, maintenance, and replacement.
The distinction between reliability and risk is not merely a semantic issue; rather, it is a major element in resource allocation throughout the life cycle of a product (whether in design, construction, operation, maintenance, or replacement). The distinction between risk and safety, well articulated over two decades ago by Lowrance (1976), is vital when addressing the design, construction, and maintenance of physical systems, since by their nature such systems are built of materials that are susceptible to failure. The probability of such a failure and its associated consequences constitute the measure of risk. Safety manifests itself in the level of risk that is acceptable to those in charge of the system. For instance, the strength of the chosen materials, and their resistance to the loads and demands placed on them, is a manifestation of the acceptable level of safety. The ability of materials to sustain loads and avoid failure is best viewed as a random process – a process characterized by two random variables: (1) the load (demand) and (2) the resistance (supply or capacity).
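The two-random-variable view of failure lends itself to a short simulation: failure occurs whenever the realized load exceeds the realized resistance. The normal distributions and their parameters below are illustrative assumptions only:

```python
import random

# Sketch of the load (demand) vs. resistance (capacity) model:
# the system fails when load > resistance, so the failure probability
# is P(load > resistance). Distribution choices and parameters are
# illustrative assumptions.
random.seed(42)
N = 200_000
failures = 0
for _ in range(N):
    load = random.gauss(50.0, 10.0)        # demand placed on the system
    resistance = random.gauss(80.0, 12.0)  # capacity of the system
    if load > resistance:
        failures += 1

p_failure = failures / N
reliability = 1 - p_failure
print(f"estimated failure probability: {p_failure:.4f}")
```

Note that this computes only the probability of failure; turning it into a risk measure requires pairing it with the consequences of that failure, which is exactly the distinction the text draws.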
Unreliability, as a measure of the probability that the system does not meet its intended functions, does not include the consequences of failures. On the other hand, as a measure of the probability (i.e., unreliability) and severity (consequences) of the adverse effects, risk is inclusive and thus more representative. Clearly, not all failures can justifiably be prevented at all costs. Thus, system reliability cannot constitute a viable metric for resource allocation unless an a priori level of reliability has been determined. This brings us to the duality between risk and reliability on the one hand, and multiple objectives and a single objective optimization on the other.
In the multiple-objective model, the level of acceptable reliability is associated with the corresponding consequences (i.e., constituting a risk measure) and is thus traded off against the associated cost that would reduce the risk (i.e., improve the reliability). In the single-objective model, on the other hand, the level of acceptable reliability is not explicitly associated with the corresponding consequences; rather, it is predetermined (or parametrically evaluated) and is thus treated as a constraint in the model.
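The difference between the two formulations can be made concrete with a toy set of design options; every name and number below is hypothetical:

```python
# Toy illustration of the two formulations. Each hypothetical design
# option carries a cost, a failure probability, and a consequence of
# failure; all values are invented for illustration.
options = [
    # (name, cost, failure probability, consequence of failure)
    ("A", 1.0, 0.20, 100.0),
    ("B", 2.0, 0.05, 100.0),
    ("C", 4.0, 0.01, 100.0),
    ("D", 8.0, 0.001, 100.0),
]

# Single-objective view: reliability is a predetermined constraint
# (here, failure probability at most 0.05), and cost is minimized
# subject to it; consequences never enter the model.
feasible = [o for o in options if o[2] <= 0.05]
single_objective_choice = min(feasible, key=lambda o: o[1])

# Multiple-objective view: cost and risk (probability x consequence)
# are traded off explicitly; list the non-dominated (Pareto) options.
def risk(o):
    return o[2] * o[3]

pareto = [o for o in options
          if not any(p[1] <= o[1] and risk(p) < risk(o) or
                     p[1] < o[1] and risk(p) <= risk(o) for p in options)]

print(single_objective_choice[0], [o[0] for o in pareto])
```

The constrained model returns a single answer; the multi-objective model returns the whole trade-off frontier, leaving the choice among cost and risk levels to the decision maker.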
There are, of course, both historical and evolutionary reasons for the more common use of reliability analysis rather than risk analysis, as well as substantive and functional justifications. Historically, engineers have always been concerned with the strength of materials, the durability of products, and the safety, surety, and operability of various systems. The concept of risk as a quantitative measure of both the probability and the consequences of an adverse effect (failure) has evolved relatively recently. From the substantive-functional perspective, however, many engineers and decision-makers cannot relate to the amalgamation of two diverse concepts with different units – probabilities and consequences – into one concept termed risk. Nor do they accept the metric with which risk is commonly measured. The common metric for risk – the expected value of adverse outcomes – essentially commensurates events of low probability and high consequences with events of high probability and low consequences. In this sense, one may find basic philosophical justifications for engineers to avoid the risk metric and instead work with reliability. Furthermore, and most important, dealing with reliability does not require the engineer to make explicit trade-offs between cost and the outcome resulting from product failure. Thus, design engineers isolate themselves from the social consequences that are by-products of the trade-offs between reliability and cost. The design of levees for flood protection may clarify this point.
Designating a ‘one-hundred-year return period’ means that the engineer will design a flood protection levee for a predetermined water level that, on average, is not expected to be exceeded more than once every hundred years. By ignoring the socio-economic consequences – such as loss of lives and property damage due to a water level that exceeds the one-hundred-year design level – the design engineers shield themselves from the broader issues of consequences, that is, the risk to the population’s social well-being. On the other hand, addressing the multi-objective dimension that the risk metric brings requires much closer interaction and coordination between the design engineers and the decision makers. In this case, an interactive process is required to reach acceptable levels of risks, costs, and benefits. In a nutshell, complex issues, especially those involving public policy with health and socio-economic dimensions, should not be addressed through overly simplified models and tools. As the demarcation line between hardware and software slowly but surely fades away, and with the ever-evolving and increasing role of design engineers and systems analysts in technology-based decision-making, a new paradigm shift is emerging. This shift is characterized by a strong overlapping of the responsibilities of engineers, executives, and less technically trained managers.
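The return-period designation has a concrete probabilistic reading: with an annual exceedance probability of 1/T, the chance of at least one exceedance over an n-year horizon is 1 – (1 – 1/T)^n. A short sketch of this standard calculation:

```python
# Probability of at least one exceedance of the design water level
# during an n-year horizon, for a T-year return period:
#   P(at least one exceedance) = 1 - (1 - 1/T)**n
def exceedance_probability(return_period_years: float, horizon_years: int) -> float:
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years

# A '100-year' levee over increasingly long horizons:
for n in (1, 30, 100):
    print(n, round(exceedance_probability(100, n), 3))
```

Over a 30-year horizon the chance of at least one exceedance is about 26%, and over the hundred years themselves it is about 63% – a useful corrective to reading ‘one-hundred-year’ as near-certainty of safety.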
The likelihood of multiple or compound failure modes in infrastructure systems (as well as in other physical systems) adds another dimension to the limitations of a single reliability metric for such infrastructures (Park et al., 1998; Schneiter et al., 1996). Indeed, because the multiple reliabilities of a system must be addressed, the need for explicit trade-offs among risks and costs becomes more critical. Compound failure modes are defined as two or more paths to failure with consequences that depend on the occurrence of combinations of failure paths. Consider the following examples: (1) a water distribution system, which can fail to provide adequate pressure, flow volume, water quality, and other needs; (2) the navigation channel of an inland waterway, which can fail by exceeding the dredge capacity and by closure to barge traffic; and (3) highway bridges, where failure can occur from deterioration of the bridge deck, corrosion or fatigue of structural elements, or an external loading such as floodwater. None of these failure modes is independent of the others in probability or consequence. For example, in the case of the bridge, deck cracking can contribute to structural corrosion, and structural deterioration in turn can increase the vulnerability of the bridge to floods. Nevertheless, the individual failure modes of bridges are typically analysed independently of one another. Acknowledging the need for multiple metrics of reliability of an infrastructure could markedly improve decisions regarding maintenance and rehabilitation, especially when these multiple reliabilities are augmented with risk metrics.
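The bridge example can be sketched as a small simulation with dependent failure modes. All probabilities and the dependence structure below are hypothetical; the point is only that analysing the modes as if they were independent yields a different system failure probability than the dependent model does:

```python
import random

# Hypothetical sketch of the bridge example: deck cracking raises the
# probability of structural corrosion, and corrosion raises the
# probability of failure under flood loading. All probabilities are
# invented for illustration.
random.seed(3)
N = 500_000
union_failures = 0
corrosion_failures = 0
flood_failures = 0

for _ in range(N):
    deck_cracked = random.random() < 0.10
    corroded = random.random() < (0.15 if deck_cracked else 0.04)
    flood_failure = random.random() < (0.08 if corroded else 0.015)
    corrosion_failures += corroded
    flood_failures += flood_failure
    union_failures += corroded or flood_failure

p_union = union_failures / N
# What an analyst would compute by treating the two modes as independent:
p_independent = 1 - (1 - corrosion_failures / N) * (1 - flood_failures / N)
print(f"dependent: {p_union:.4f}  independence assumption: {p_independent:.4f}")
```

With this (positively dependent) structure, the independence assumption overstates the probability that at least one mode fails while understating the probability of joint failures – precisely the scenarios with the most severe consequences.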
Apgar, D. (2006). Risk Intelligence: Learning to Manage What We Don’t Know (Boston, MA: Harvard Business School Press). This book aims to help business managers deal more effectively with risk.
Levitt, S.D. and Dubner, S.J. (2005). Freakonomics: A Rogue Economist Explores the Hidden Side of Everything (New York: HarperCollins). A popular book that adapts insights from economics to understand a variety of everyday phenomena.
Taleb, N.N. (2007). The Black Swan: The Impact of the Highly Improbable (New York: Random House). An engagingly written book, from the perspective of an investor, about risk – especially long-shot risk – and how people fail to take it into account.
Asbeck, E.L. and Haimes, Y.Y. (1984). The partitioned multiobjective risk method (PMRM). Large Scale Syst., 6(1), 13–38.
Bier, V.M. and Abhichandani, V. (2003). Optimal allocation of resources for defense of simple series and parallel systems from determined adversaries. In Haimes, Y.Y., Moser, D.A., and Stakhiv, E.Z. (eds.), Risk-based Decision Making in Water Resources X (Reston, VA: ASCE).
Haimes, Y.Y. (1977). Hierarchical Analyses of Water Resources Systems: Modeling and Optimization of Large-Scale Systems (New York: McGraw-Hill).
Haimes, Y.Y. (1981). Hierarchical holographic modeling. IEEE Trans. Syst. Man Cybernet., 11(9), 606–617.
Haimes, Y.Y. (1991). Total risk management. Risk Anal., 11(2), 169–171.
Haimes, Y.Y. (2004). Risk Modeling, Assessment, and Management. 2nd edition (New York: John Wiley).
Haimes, Y.Y. (2006). On the definition of vulnerabilities in measuring risks to infrastructures. Risk Anal., 26(2), 293–296.
Haimes, Y.Y. (2007). Phantom system models for emergent multiscale systems. J. Infrastruct. Syst., 13, 81–87.
Haimes, Y.Y., Kaplan, S., and Lambert, J.H. (2002). Risk filtering, ranking, and management framework using hierarchical holographic modeling. Risk Anal., 22(2), 383–397.
Kaplan, S. (1991). The general theory of quantitative risk assessment. In Haimes, Y., Moser, D., and Stakhiv, E. (eds.), Risk-based Decision Making in Water Resources V, pp. 11–39 (New York: American Society of Civil Engineers).
Kaplan, S. (1993). The general theory of quantitative risk assessment – its role in the regulation of agricultural pests. Proc. APHIS/NAPPO Int. Workshop Ident. Assess. Manag. Risks Exotic Agric. Pests, 11(1), 123–126.
Kaplan, S. (1996). An Introduction to TRIZ, The Russian Theory of Inventive Problem Solving (Southfield, MI: Ideation International).
Kaplan, S. and Garrick, B.J. (1981). On the quantitative definition of risk. Risk Anal., 1(1), 11–27.
Kaplan, S., Haimes, Y.Y., and Garrick, B.J. (2001). Fitting hierarchical holographic modeling (HHM) into the theory of scenario structuring, and a refinement to the quantitative definition of risk. Risk Anal., 21(5), 807–819.
Kaplan, S., Zlotin, B., Zussman, A., and Vishnipolski, S. (1999). New Tools for Failure and Risk Analysis – Anticipatory Failure Determination and the Theory of Scenario Structuring (Southfield, MI: Ideation).
Lambert, J.H., Matalas, N.C., Ling, C.W., Haimes, Y.Y., and Li, D. (1994). Selection of probability distributions in characterizing risk of extreme events. Risk Anal., 14(5), 731–742.
Leemis, M.L. (1995). Reliability: Probabilistic Models and Statistical Methods (Englewood Cliffs, NJ: Prentice-Hall).
Lowrance, W.W. (1976). Of Acceptable Risk (Los Altos, CA: William Kaufmann).
National Research Council (NRC), Committee on Safety Criteria for Dams. (1985). Safety of Dams – Flood and Earthquake Criteria (Washington, DC: National Academy Press).
Park, J.I., Lambert, J.H., and Haimes, Y.Y. (1998). Hydraulic power capacity of water distribution networks in uncertain conditions of deterioration. Water Resources Res., 34(12), 3605–3614.
Schneiter, C.D., Li, D., Haimes, Y.Y., and Lambert, J.H. (1996). Capacity reliability and optimum rehabilitation decision making for water distribution networks. Water Resources Res., 32(7), 2271–2278.