Software metrics and analysis provide the data and information that allows an organization’s quality management system to be based on a solid foundation of facts. The objective is to drive continual improvement in all quality parameters through a goal-oriented measurement and analysis system.
Software metrics programs should be designed to provide the specific information necessary to manage software projects and improve software engineering processes, products, and services. The foundation of this approach is to make practitioners ask not “what should I measure?” but “why am I measuring?” or “what business needs does the organization wish its measurement initiative to address?” (Goodman 1993).
Measuring is a powerful way to track progress toward project, process, and product goals. As Grady (1992) states, “Without such measures for managing software, it is difficult for any organization to understand whether it is successful, and it is difficult to resist frequent changes of strategy.”
According to Humphrey (1989), there are four major roles (reasons) for collecting data and implementing software metrics:
Effective software metrics provide the objective data and information necessary to help an organization, its management, its teams, and its individuals:
The bottom line is that effective metrics help improve software projects, products, processes, and services.
1. TERMINOLOGY
Define and describe metrics and measurement terms such as reliability, internal and external validity, explicit and derived measures, and variation. (Understand)
BODY OF KNOWLEDGE V.A.1
Metrics Defined
The term metrics means different things to different people. When someone buys a book or picks up an article on software metrics, the topic can vary from project cost and effort prediction and modeling, to defect tracking and root cause analysis, to a specific test coverage metric, to computer performance modeling, or even to the application of statistical process control charts to software.
Goodman (2004) defines software metrics as “The continuous application of measurement-based techniques to the software development process and its products to supply meaningful and timely management information, together with the use of those techniques to improve that process and its products.”
As illustrated in Figure 18.1 , Goodman’s definition can be expanded to include software projects and services. Examples of software services include:
Goodman’s definition can also be expanded to include engineering, as well as management information. In fact, measurement is one of the key required elements to move the software from a craft to an engineering discipline.
Figure 18.1 Metrics defined.
Figure 18.2 Converting measurement data into information and knowledge.
Software metrics are standardized ways of measuring the attributes of software processes, products, and services in order to provide the information needed to improve those processes, products, and services. The same metrics can then be used to monitor the impacts of those improvements, thus providing the feedback loop required for continual improvement.
Measurement Defined
The use of measurement is common. People use measurements in everyday life to weigh themselves in the morning, or when they glance at the clock or at the odometer in their car. Measurements are used extensively in most areas of operations and manufacturing to estimate costs, calibrate equipment, and monitor inventories. Science and engineering disciplines depend on the rigor that measurements provide. What does measurement really mean?
According to Fenton (1997), “measurement is the process by which numbers or symbols are assigned to attributes of entities in the real world in such a way as to describe them according to clearly defined rules.” The left hand side of Figure 18.2 illustrates this definition of measurement.
Entities are nouns, for example, a person, place, thing, event, or time period. An attribute is a feature or property of an entity. To measure, the entity being measured must first be determined. For example, a car could be selected as the entity. Once the entity is selected, the attributes of that entity that need to be described must be chosen. According to IEEE (1998d), an attribute is a measurable physical or abstract property of an entity. Attributes for a car include its speed, its fuel efficiency, and the air pressure in its tires.
Finally, a mapping system, also called the measurement method or counting criteria, must be defined and accepted. It is meaningless to say that the car’s speed is 65 or its tire pressure is 32 unless people know that they are talking about miles per hour or pounds per square inch. So what is a mapping system?
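Fenton’s definition can be sketched as a small data structure: a measurement only carries meaning when the entity, the attribute, and the mapping system’s unit travel together. The type and field names below are illustrative, not part of any standard.

```python
from typing import NamedTuple

class Measurement(NamedTuple):
    """An attribute of an entity mapped to a number plus a unit.

    The unit is part of the mapping system: "65" is meaningless
    until everyone agrees it means miles per hour.
    """
    entity: str
    attribute: str
    value: float
    unit: str

speed = Measurement("car", "speed", 65, "miles per hour")
pressure = Measurement("car", "tire pressure", 32, "pounds per square inch")

for m in (speed, pressure):
    print(f"{m.entity} {m.attribute}: {m.value} {m.unit}")
```

Carrying the unit alongside the value is one simple way to make the agreed-upon mapping system explicit in collected data.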
In ancient times there were no real standard measurements. This caused a certain level of havoc with commerce. Was it better to buy cloth from merchant A or merchant B? What were their prices per length? In England they solved this problem by standardizing the “yard” as the distance between King Henry I’s nose and his fingertips. The “inch” was the distance between the first and second knuckle of the king’s finger and the “foot” was literally the length of his foot.
To a certain extent, the software industry is still in those ancient times. As software practitioners try to implement software metrics, they quickly discover that very few standardized mapping systems exist for their measurements. Even for a seemingly simple metric such as the severity of a software defect, no standard mapping system has been widely accepted. Examples from different organizations include:
An important element of a successful metrics program is the selection, definition, and consistent use of mapping systems for selected metrics. The software industry as a whole may not be able to solve this problem, but each organization must solve it to have a successful metrics program.
Data to Information to Knowledge
Once the measurement process is performed, the result is one or more numbers or symbols. These are data items. Data items are simply “facts” that have been collected in some storable, transferable, or expressible format.
However, if the data are going to be useful, they must be transformed into information that people can interpret and then convert into knowledge, which can be used to make better, fact-based decisions. According to the Guide to Data Management Body of Knowledge (DAMA 2009), “information is data in context.” The raw material of information is data. By adding the context, collected data starts to have meaning. For example, a data item stored as the number 14 does not by itself provide any usable information. If the data item’s context is known, that data item is converted to information. For example, when the data item has:
Once one or more data items are converted to information, the resulting information can also be stored as additional data items.
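The “14” example can be sketched as a plain record that attaches context to a raw data item. The field names here are illustrative, not drawn from any measurement standard.

```python
from dataclasses import dataclass

@dataclass
class DataItem:
    """A raw measurement plus the context that turns it into information."""
    value: int       # the raw data item, e.g., 14
    attribute: str   # what was measured
    entity: str      # what it was measured on
    period: str      # when it was measured

    def as_information(self) -> str:
        # "Information is data in context": the bare number becomes
        # usable only once these context fields are attached.
        return f"{self.value} {self.attribute} during {self.period} of {self.entity}"

item = DataItem(14, "newly detected defects", "software product ABC",
                "last week’s system testing")
print(item.as_information())
```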
Information in and of itself is not useful until human intelligence is applied to convert it to knowledge through the identification of patterns and trends, relationships, assumptions, and relevance. Going back to the data item example, if the information regarding the 14 newly-detected defects found last week during the system testing of software product ABC is simply put in a report that no one reads or uses, then the information is never converted to knowledge. However, if the project manager determines that this is a higher defect arrival rate (trend) than was found during the previous three weeks (relationship) and determines that corrective action is needed (assumption), resulting in the shifting of an additional software engineer onto problem correction (relevance), that information becomes knowledge. Of course, the project manager can also decide that everything is under control and that no action is necessary. If this is the case, the information has still been converted to knowledge. The right hand side of Figure 18.2 illustrates this transformation of measurement data into information and then knowledge.
The moral is that the goal should never be “To put metrics (or measurements) in place.” That goal can result in an organization becoming DRIP (Data Rich—Information Poor). The goal should be to provide people with the knowledge they need to make better, fact-based decisions. Metrics are simply the tools to make certain that standards exist for measuring in a consistent, repeatable manner in order to create reliable and valid data. They do this by establishing “what” is being measured, through well-defined and understood entities and attributes, and “how” it is being measured, through standardizing the mapping system and the conditions under which measures are taken. Metrics provide a standardized definition of how to turn collected data into information, and how to interpret that information to create knowledge.
Reliability and Validity
Metric reliability is a function of the consistency or repeatability of the measure. A metric is reliable if different people (or the same person multiple times) can use that metric over and over to measure an identical entity and obtain the same results each time. For example, if two people count the number of lines of code in a source code module, they should both get the same, consistent count (within an acceptable level of measurement error, as defined later in this chapter). Or, if one person measures the cyclomatic complexity of a detailed design element today and then measures the same design element again tomorrow, that person should get the same, consistent measure (within an acceptable level of measurement error).
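Reliability hinges on a fixed, explicit counting rule. As a sketch, the lines-of-code example can be made repeatable by writing the rule down as code; the particular criteria here (skip blank lines and comment-only lines) are a hypothetical choice, not an industry standard.

```python
def count_logical_lines(source: str) -> int:
    """Count lines of code under one fixed rule: skip blank lines and
    lines containing only a '#' comment. Because the rule is explicit,
    any two people (or runs) get the same count for the same module."""
    count = 0
    for line in source.splitlines():
        stripped = line.strip()
        if stripped and not stripped.startswith("#"):
            count += 1
    return count

module = """# utility module
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
"""

# Repeatability: the same module always yields the same count.
assert count_logical_lines(module) == count_logical_lines(module)
print(count_logical_lines(module))  # → 4
```

Two people counting by eye with an unwritten rule (does a comment count? a blank line?) could disagree; encoding the criteria removes that source of unreliability.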
A metric is valid if it accurately measures what it is expected to measure, that is, if it captures the real value of the attribute it is intended to describe. IEEE (1998d) describes validity in terms of the following criteria:
This definition of validity actually refers to the internal validity of the metric or its validity in a narrow sense—does it measure the actual attribute? External validity, also called predictive validity, refers to the ability to generalize or transfer the metric results to other populations or conditions. For a metric to be externally valid, it must be internally valid. It must also be able to be used as part of a prediction system, estimation process, or planning model. For example, can the mean time to fix a defect, measured for one project, be generalized to predict the mean time to fix for other projects? If so, then the mean-time-to-fix metric is considered externally valid. External validity is verified empirically through comparison of the predicted results to the subsequently observed actual results.
A metric that is reliable but not valid can be consistently measured, but that measurement does not reflect the actual attribute. An example of a metric that is reliable but not valid is using cyclomatic complexity for measuring reliability. Cyclomatic complexity can be consistently measured because it is well-defined and has a consistent mapping system. However, it is internally invalid as a measure of reliability because it is a measure of complexity, not of reliability.
A metric is internally valid but not reliable if it measures what it is supposed to measure but cannot be measured consistently. For example, if criteria are not clearly defined on how to assign numbers to the severity of a reported problem, resulting in different people assigning different severities to the same problem, the severity metric may be valid but not reliable. As illustrated in Figure 18.3, the goal is to have metrics that are both reliable and internally valid. In fact, IEEE (1998d) actually includes reliability as one of the criteria for validity. For predictive-type metrics, external validity is also required.
To have metrics that are both reliable and valid, there must be agreement on standard definitions for the entity and its attributes that are being measured. Software practitioners may use different terms to mean the same thing. For example, the terms defect report, problem report, incident report, fault report, anomaly report, issues report or call report may be used by various organizations or teams for the same item. But unfortunately, they may also refer to different items. One organization may use “user call reports” for a user complaint and “problem reports” as the description of a problem in the software. Their customer may use “problem reports” for the initial complaint and “defect reports” for the problem in the software.
Different interpretations of terminology can be a major barrier to correct interpretation and understanding of metrics. For example, a metric was created to report the “trend of open software problems” for a software development manager. The manager was very pleased with the report because she could quickly pull information that had previously been difficult to get from the antiquated problem-tracking tool. One day the manager brought this report to a product review meeting, so she could discuss the progress her team had made in resolving the problem backlog. The trend showed a significant decrease, from over 50 open problems six weeks ago, to only three open problems currently. When she put the graph up on the overhead, the customer service manager exploded. “What is going on here? Those numbers are completely wrong! I know for a fact that my customers are calling me every day to complain about the over 20 open field problems!” The problem was not with the numbers, but with the interpretation of the word “open.” To the software development manager, the problem was no longer open when they had fixed it, checked the source into the configuration management library, and handed it off for system testing. But to the customer service manager, the problem was still open until his customers in the field had their fix.
Figure 18.3 Reliability and validity.
As the above examples illustrate, the software industry has very few standardized definitions for software entities and their attributes. Everyone has an opinion, and the debate will probably continue for many years. An organization’s metrics program cannot wait that long. The suggested approach is to adopt standard definitions within an organization and then apply them consistently. Glossaries such as the ISO/IEC/IEEE Systems and Software Engineering—Vocabulary (ISO/IEC/IEEE 2010) or the glossary in this book can be used as a starting point. An organization can then pick and choose the definitions that correspond with its objectives, or use them as a basis for tailoring its own definitions. It can also be extremely beneficial to include these standard definitions as an appendix to each metrics report (or add definition pop-ups to online reports) so that everyone who receives the report understands what definitions are being used.
Explicit Measures
Data items can be either explicit or derived measures. Explicit measures, also called base measures, metrics primitives, or direct metrics, are measured directly. The Capability Maturity Model Integration (CMMI) for Development (SEI 2010) defines a base measure as a “measure defined in terms of an attribute and the method for quantifying it.” For example, explicit measures for a code inspection would include the number of:
For explicit measures, the mapping system used to collect the data for each measure must be defined and understood. Some mapping systems are established by using standardized units of measure (for example, dollars, hours, days). Other mapping systems define the counting criteria used to determine what does and does not get counted when performing the measurement. For example, if the metric is the “problem report arrival rate per month,” the counting criteria could be as simple as counting all of the problem reports in the problem-reporting database that had an open date during each month. However, if the measure was “defect counts” instead, the counting criteria might exclude all the problem reports in the database that did not result from a product defect (for example, those defined as works-as-designed, operator error, or withdrawn).
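The arrival-rate versus defect-count distinction can be sketched with hypothetical problem-report records; the resolution labels mirror the text’s examples of reports excluded from a defect count.

```python
# Hypothetical problem-report records (ids and dates invented).
reports = [
    {"id": 101, "opened": "2024-05-03", "resolution": "fixed"},
    {"id": 102, "opened": "2024-05-10", "resolution": "works-as-designed"},
    {"id": 103, "opened": "2024-05-17", "resolution": "operator error"},
    {"id": 104, "opened": "2024-05-21", "resolution": "fixed"},
    {"id": 105, "opened": "2024-05-28", "resolution": "withdrawn"},
]

# Counting criterion 1: arrival rate -- every report opened in the month counts.
arrivals_in_may = [r for r in reports if r["opened"].startswith("2024-05")]

# Counting criterion 2: defect count -- exclude reports that did not
# result from a product defect.
NON_DEFECTS = {"works-as-designed", "operator error", "withdrawn"}
defects_in_may = [r for r in arrivals_in_may
                  if r["resolution"] not in NON_DEFECTS]

print(len(arrivals_in_may), len(defects_in_may))  # → 5 2
```

The same database yields 5 or 2 depending on the counting criteria, which is why the criteria must be defined and communicated before the numbers are compared.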
Some counting criteria are very complex, like the counting criteria for measuring function points, or the criteria for counting effort on a project. In the latter example, the units of effort might be defined in terms of staff hours, months, or years, depending on the size of the project. Other counting criteria decisions would include:
Of course, many organizations solve this problem by simply stating that if the time is charged to the project account number, then it counts. But they still must have counting criteria established that define the rules for what to charge to those account numbers.
Having a clearly defined and communicated mapping system helps everyone interpret the measures the same way. The metrics mapping system and, if applicable, data needed based on the associated counting criteria, define the first level of data that needs to be collected in order to implement the metric.
Derived Measures
According to the CMMI for Development (SEI 2010), a derived measure, also called a complex metric, is a “measure that is defined as a function of two or more values of explicit measures,” that is, a mathematical combination of explicit measures or other derived measures. For a code inspection, examples of derived metrics would include:
Most measurement functions include an element of simplification. When creating a function, an organization needs to be pragmatic. If an attempt is made to include all of the elements that affect the attribute or characterize the entity, the functions can become so complicated that the metric is useless. Being pragmatic means not trying to create the perfect function. Pick the explicit measures that are the most important. The ideal measurement function is simple enough to be easy to use, and at the same time provides enough information to help people make better, more informed decisions. Remember that the function can always be modified in the future to include additional levels of detail as the organization gets closer to its goal. The people designing a function should ask themselves:
Measurement functions should be selected and tailored to the organization’s information needs. To demonstrate the selection of a tailored function, consider a metric that calculates the duration of unplanned system outages, as illustrated in Figure 18.4.
Figure 18.4 Different functions for the same metric—examples.
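The idea that the same metric can be computed with different functions can be sketched with a hypothetical outage log; both functions below are illustrative choices, not the figure’s actual examples.

```python
# Hypothetical unplanned-outage log: (start_minute, end_minute) pairs.
outages = [(10, 25), (100, 130), (500, 505)]

# Function 1: total unplanned outage minutes in the period --
# answers "how much downtime did we suffer?"
total_minutes = sum(end - start for start, end in outages)

# Function 2: average duration per outage -- same explicit measures,
# different derived measure, answering "how long is a typical outage?"
average_minutes = total_minutes / len(outages)

print(total_minutes)    # → 50
print(average_minutes)
```

Which function is “right” depends entirely on the question the metric’s customer needs answered.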
Measurement Scales
There are four different measurement scales that define the type of data that is being collected. The measurement scale is important because it defines the mathematics, or kinds of statistics, that can be done using the collected data.
The nominal scale is the simplest form of measurement scale. In nominal scale measurements, items are assigned to a classification (one-to-one mapping), where that classification categorizes the attribute of the entity. Examples of nominal scale measurements include:
The categories in a nominal scale measurement must be jointly exhaustive and cover all possibilities. This means that every measurement must be assigned a classification. Many nominal scale measures include a category of “other,” so everything fits somewhere. The categories must also be mutually exclusive: each item must fit one and only one category. If an attribute is classified in one category, it cannot be classified in any of the other categories.
The nominal scale does not make any assumptions about order or sequence. The only math that can be done on nominal scale measures is to count the number of items in each category and look at their distributions.
The ordinal scale classifies the attribute of the entity by order. However, there are no assumptions made about the magnitude of the differences between categories (a defect with a critical severity is not twice as bad as one with a major severity). Examples of ordinal scale measurements include:
Since there is order, the transitive property is valid for ordinal scale measures; that is, if data item A > data item B and data item B > data item C, then data item A > data item C. However, without an assumption of magnitude, the mathematical operations of addition, subtraction, multiplication, and division cannot be used on ordinal scale measurement values.
For interval scale measurements, the exact distance between the data items is known. This allows the mathematical operations of addition and subtraction to be applied to interval scale measurement values. However, there is no absolute or non-arbitrary zero point in the interval scale, so multiplication and division do not apply. The classic example of an interval scale measure is calendar time. For example, it is valid to say, “May 1st plus 10 days is May 11th,” but “May 1st times May 11th” is invalid. Interval scale measurements require well-established units of measure that can be agreed upon by everyone using the measurement. The mean, median, or mode can all be used as central tendency statistics for interval scale metrics.
In the ratio scale, not only is the exact distance between the scales known, but there is an absolute or non-arbitrary zero point. All mathematical operations can be applied to ratio scale measurement values, including multiplication and division. Examples of ratio scale measurement include:
Note that since derived measures are mathematical combinations of two or more explicit measures or other derived measures, those explicit and derived measures must be at a minimum interval scale measurements (if only addition and subtraction are used in the measurement function) and they are typically ratio scale measurements.
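The scale rules above can be sketched with the standard library; the defect types, severity ranks, and fix times below are invented for illustration.

```python
from collections import Counter
from statistics import mean, median

# Nominal: only counting per category is meaningful.
defect_types = ["logic", "interface", "logic", "data", "logic"]
print(Counter(defect_types).most_common(1))  # → [('logic', 3)]

# Ordinal: order matters, so the median is meaningful; the hypothetical
# rank numbers carry order but not magnitude, so no arithmetic on them.
SEVERITY_RANK = {"minor": 1, "major": 2, "critical": 3}
severities = ["minor", "critical", "major", "minor", "major"]
ranks = sorted(SEVERITY_RANK[s] for s in severities)
print(median(ranks))  # → 2, i.e., "major"

# Ratio: a true zero point, so means and ratios both apply.
fix_hours = [2.0, 4.0, 6.0]
print(mean(fix_hours))                 # → 4.0
print(fix_hours[1] / fix_hours[0])     # → 2.0 ("twice as long" is meaningful)
```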
Variation
To start the discussion of variation, consider an exercise from Shewhart’s book, Economic Control of Quality of Manufactured Product (Shewhart 1980). This exercise begins by having a person write the letter “a” on a piece of paper. The person is then asked to write another “a” identical to the first one; then another just like it, then another, until the person has 10 a’s on the paper. When all of the letters are compared, are they all perfectly identical, or are there variations between them? The instructions were to make all of the a’s identical, but as hard as the person may try, that person cannot make them all identical. Why can a person not write ten identical a’s? Most people could probably think of multiple reasons why there are variations between the letters. Examples might include:
People accept the fact that there are many reasons for the variation, and that it is impractical, and most likely impossible, to try to remove them all. If two people were asked to do this same exercise, there would probably be even more variation between one person’s a’s and the other person’s a’s, because different people doing the same task adds another cause of variation.
Quality practitioners talk about two distinct sources for variation. The first of these sources is called common causes, also known as random causes or chance causes of variation. Common causes of variation are the normal or usual causes of variation that are inherent in a consistent, repeatable process. A process is said to be in statistical control when it only exhibits common causes of variation. As long as the process does not change, these common causes of variation will not change. Therefore, based on historic data or on a sample set of data from the current process, practitioners can quantify common causes of variation in order to predict the amount of variation that will occur in the future.
Common cause variation comes from normal, expected fluctuations in various factors, typically classified as influences from:
The only way to change or reduce the common causes of variation is to change the process itself (for example, through process improvement). It should be noted that there is always some level of variation in every process, and that not all common causes of variation can be or should be eliminated. There are engineering trade-offs to be made. For example, eliminating common causes of variation in one part of the process may cause problems in another part of the process, or it may not be economically feasible to eliminate a source of common cause variation.
The second source of variation is special causes, also called assignable causes, which are outside the normal “noise” of the process and result in abnormal or unexpected variation. Special causes of variation include:
If special causes of variation exist in a process, that process is not performing at its best and is considered out of control. In this case, the process is not sufficiently stable to use historic data or samples from the current process to predict the amount of variation that will occur in the future. That means that statistical quality control cannot be applied to processes that are out of control. Special causes of variation can usually be detected, and actions to eliminate these causes can typically be economically justified.
Statistics and Statistical Process Control
Statistics is defined as the science of the collection, organization, analysis, and interpretation of data. Statistical quality control refers to the use of measurement and statistical methods in monitoring and maintaining the quality of products and services. Faced with a large amount of data, a practitioner may find it daunting to turn that data into information that can be used to create knowledge. However, there are some basic descriptive statistics that can be used to characterize a set of data items and provide needed information about a data set.
The location of a data set refers to its typical value or central tendency: the way in which the data items tend to cluster around one or more values in that data set. Figure 18.5 shows examples of three data sets, each with the same variance and shape, but with three different locations.
Three statistics are typically used to represent the location of a data set:
Figure 18.5 Data sets with different locations—examples.
In most data sets, the individual data items tend to cluster around the location and then spread or scatter out from there toward the extreme or extremes of the data set. The extent of this scattering is the variance, also called the spread or dispersion, of the data set. The variance is the amount by which smaller values in the data set differ from larger values. Figure 18.6 shows an example of three data sets, each with the same location and shape, but with three different variances.
The simplest measure of variance is the range of the data set. The range is the difference between the maximum value and the minimum value in the data set. The more variance there is in the data set, the larger the value of the range will be. For example, if the data items were 3, 5, 10, 15, 19, 21, 25, the range would be equal to 25 − 3 = 22.
The most important measure of variance from the perspective of statistical quality control is standard deviation. Standard deviation is a measure of how much variation there is from the mean of the data set. A low value for standard deviation indicates that the data values are tightly clustered around the mean. The larger the standard deviation, the more spread out the data values are from the mean. The Greek letter sigma (σ) is used to represent the standard deviation of the entire population, and the letter “s” is used to represent the standard deviation for a sample from that population.
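These statistics can be sketched directly on the data set from the range example above, using the standard library’s distinction between the population standard deviation (σ) and the sample standard deviation (s).

```python
from statistics import mean, pstdev, stdev

data = [3, 5, 10, 15, 19, 21, 25]

# Range: the simplest measure of variance.
data_range = max(data) - min(data)
assert data_range == 22  # matches the worked example in the text

# Standard deviation: sigma for a whole population, s for a sample
# (s uses an n - 1 denominator, so it is always slightly larger).
sigma = pstdev(data)  # population standard deviation
s = stdev(data)       # sample standard deviation

print(mean(data))  # → 14
print(sigma, s)
```

For this data set the sample variance works out to exactly 414 / 6 = 69, so s = √69 ≈ 8.31, a touch above σ ≈ 7.69.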
Figure 18.6 Data sets with different variances—examples.
In statistics, the distribution describes the shape of a set of data. More specifically, the probability distribution describes the range of possible values for the data in the set and the probability that any given random value selected from that population falls within a measurable subset of that range. Figure 18.7 illustrates a normal distribution, also called a Gaussian distribution or bell curve. A normal distribution is a continuous distribution with a concentration of observations around the mean and a symmetrical, bell-like shape. In a normal distribution, the three location statistics of mean, median, and mode are all equal. Figure 18.7 also illustrates the percentage of data items under a normal distribution curve at plus and minus various standard deviations. For example, 68.26 percent of all data items in a normal distribution fall within ±1 standard deviation of the mean, and at ±3 standard deviations, 99.73 percent of all data items fall under the curve. Calculations for the standard deviation are dependent on the distribution of the data set. The normal distribution is the best-known and most widely used probability distribution. Data sets for many quality characteristics can be estimated and/or described using a normal distribution. As discussed in Chapter 19, normal distributions are also used as the basis for creating statistical process control charts.
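The percentages quoted for Figure 18.7 can be reproduced from the error function, since the fraction of a normal distribution within ±k standard deviations is erf(k/√2).

```python
from math import erf, sqrt

def within_k_sigma(k: float) -> float:
    """Fraction of a normal distribution within +/- k standard deviations."""
    return erf(k / sqrt(2))

# The values in the text (68.26 percent is the truncated form of 68.27):
print(f"+/-1 sigma: {within_k_sigma(1):.2%}")  # → +/-1 sigma: 68.27%
print(f"+/-3 sigma: {within_k_sigma(3):.2%}")  # → +/-3 sigma: 99.73%
```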
A normal distribution is just one type of distribution or shape that a data set can have. There are also other types of distributions, grouped into two major classifications:
Figure 18.7 Normal distribution curve with standard deviations.
Figure 18.8 Common cause and special cause impacts on statistics—examples.
When only common cause variation exists in a process, the location, distribution, and variance statistics of the data collected from that process will be stable and predictable over time, as illustrated on the left side of Figure 18.8 . This means that statistics for a data set collected from the process today will be the same as for the data set collected next week, and the week after that, and the week after that and so on.
Whenever special causes of variation are introduced into the process, the location, distribution, and variance statistics of data sets collected from that process will vary over time, and will no longer be predictable, as illustrated on the right side of Figure 18.8 .
2. SOFTWARE PRODUCT METRICS
Choose appropriate metrics to assess various software attributes (e.g., size, complexity, the amount of test coverage needed, requirements volatility, and overall system performance). (Apply)
BODY OF KNOWLEDGE V.A.2
Metric Customers
With all of the possible software entities and attributes, it is easy to see that a huge number of possible metrics could be implemented. So how does an organization or team decide which metrics to use? The first step is to identify the customer. The customer of the metrics is the person or team who will be making decisions or taking action based on the metrics. The customer is the person who needs the information supplied by the metrics.
If a metric does not have a customer—someone who will make a decision based on that metric (even if the decision is “everything is fine—no action is necessary”)—then stop producing that metric. Remember that collecting data and generating metrics is expensive, and if the metrics are not being used, it is a waste of people’s time and the organization’s money.
There are many different types of customers for a metrics program. This adds complexity because each customer may have different information requirements. It should be remembered that metrics do not solve problems—people solve problems. Metrics can only provide information so that those people can make informed decisions based on facts rather than “gut feelings.” Customers of metrics may include:
Selecting Metrics
Basili and his colleagues defined a goal/question/metric paradigm, which provides an excellent mechanism for defining a goal-based measurement program (Grady 1992). The goal/question/metric paradigm is illustrated in Figure 18.9 . The first step to implementing the goal/question/metric paradigm is to select one or more measurable goals for the customer of the metrics:
Figure 18.9 Goal/question/metric paradigm.
The second step is to determine the questions that need to be answered in order to determine whether each goal is being met or if progress is being made in the right direction. For example, if the goal is to maintain a high level of customer satisfaction, questions might include:
Finally, metrics are selected that provide the information needed to answer each question. When selecting metrics, be practical, realistic, and pragmatic. Metrics customers are turned off by metrics that they see as too theoretical. They need information that they can interpret and utilize easily. Avoid the “ivory tower” perspective that is completely removed from the existing software engineering environment (currently available data, processes being used, and tools). Customers will also be turned off by metrics that require a great deal of work to collect new or additional data. Start with what is possible within the current process. Once a few successes are achieved, the metrics customers will be open to more radical ideas—and may even come up with a few metrics ideas of their own.
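The goal/question/metric chain described above can be sketched as a simple mapping from one goal, through its questions, to candidate metrics. The goal, questions, and metric names below are hypothetical illustrations, not taken from the text:

```python
# Hypothetical goal/question/metric (GQM) tree -- the goal, questions,
# and metric names are illustrative examples only.
gqm = {
    "goal": "Maintain a high level of customer satisfaction",
    "questions": {
        "Is satisfaction improving over time?": [
            "quarterly mean satisfaction score",
        ],
        "Which product attributes drive dissatisfaction?": [
            "satisfaction score by attribute",
            "importance score by attribute",
        ],
    },
}

def metrics_for(gqm_tree):
    """Flatten the GQM tree into the list of metrics to collect."""
    return [m for ms in gqm_tree["questions"].values() for m in ms]

print(metrics_for(gqm))
```

Walking the tree top-down keeps every metric traceable to a question, and every question traceable to the customer's goal, so any metric with no path back to a goal is a candidate for elimination.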
Again, with all of the possible software entities and attributes, it is easy to see that a huge number of possible metrics could be implemented. This chapter touches on only a few of those metrics as examples of the types of metrics used in the software industry. The recommended method is to use the goal/question/metric paradigm, or some other mechanism, to select appropriate metrics that meet the information needs of an organization and its teams, managers, and engineers. It should also be noted that these information needs will change over time. The metrics that are needed during requirements activities will be different from the metrics needed during testing, or once the software is being used in operations. Some metrics are collected and reported on a periodic basis (daily, weekly, monthly), others are needed on an event-driven basis (when an activity or phase is started or stopped), and others are used only once (during a specific study or investigation). The Capability Maturity Model Integration (CMMI) for Development, Services, and Acquisition (SEI 2010; SEI 2010a; SEI 2010b) includes a Measurement and Analysis process area, which provides a road map for developing and sustaining a measurement program.
As discussed previously, metrics must define the entity and the attributes of that entity that are being measured. Software product entities are the outputs of software development, operations, and maintenance processes. These include all of the artifacts, deliverables, and documents that are produced. Examples of software product entities include:
Examples of attributes associated with software product entities include size, complexity, number of defects, test coverage, volatility, reliability, availability, performance, usability, and maintainability.
Size—Lines of Code
Lines of code (LOC) counts are one of the most used and most often misused of all the software metrics. Some estimation methods are based on KLOC (thousands of lines of code). The LOC metric may also be used in other derived metrics to normalize measures so that releases, projects, or products of different sizes can be compared (for example, defect density or productivity). The problems, variations, and anomalies of using lines of code are well documented (Jones 1986). Some of these include:
No industry-accepted standards exist for counting LOC. Therefore, it is critical that specific criteria for counting LOC be adopted for the organization. Considerations include:
The Software Engineering Institute (SEI) has created a technical report specifically to present guidelines for defining, recording, and reporting software size in terms of physical and logical source statements (CMU/SEI 1992). This report includes check sheets for documenting criteria selected for inclusion or exclusion in LOC counting for both physical and logical LOC.
Size—Function Points
Function points are a size metric intended to measure the size of the software without regard to the language in which it is written. The function point counting procedure has five steps:
Step 1: Decide what to count (total number of function points, or just new or changed function points).
Step 2: Define the boundaries of the software being measured.
Step 3: Count the raw function points, which are determined by adding together the weighted counts for each of the following five function types, as illustrated in Figure 18.10 :
Within each function type, the counts are weighted based on complexity and other contribution factors. Raw function point counts have clearly defined measurement methods, as established by the International Function Point Users Group (IFPUG).
Step 4: Assign a degree of influence to each of the 14 general system characteristics defined by IFPUG. These factors are used to measure the complexity of the software. Each factor is measured on a scale of zero to five (five being the highest). These adjustment factors include:
  1. Data communications
  2. Distributed data or processing
  3. Performance objectives
  4. Heavily used configuration
  5. Transaction rate
  6. Online data entry
  7. End user efficiency
  8. Online update
  9. Complex processing
  10. Reusability
  11. Conversion and installation ease
  12. Operational ease
  13. Multiple site use
  14. Facilitate change
Figure 18.10 Function points.
Step 5: Adjust the raw function point count. To do this, the sum of all 14 degrees of influence is multiplied by 0.01 and added to 0.65 to produce the value adjustment factor. This results in a value from 65 percent to 135 percent. That percentage is then multiplied by the raw function point count to calculate the adjusted function point count. For example, if the sum of the degrees of influence is 42 and the raw function point count is 450, then the adjusted function point count is calculated as:
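The adjustment step can be sketched directly from the formula above; the function name is ours, but the arithmetic follows the IFPUG value adjustment factor as described in the text:

```python
def adjusted_function_points(raw_fp, degrees_of_influence):
    """Apply the value adjustment factor to a raw function point count.

    degrees_of_influence: the 14 general system characteristic ratings,
    each on a 0-5 scale. VAF = 0.65 + 0.01 * sum(DI), so it ranges
    from 0.65 (all zeros) to 1.35 (all fives).
    """
    if len(degrees_of_influence) != 14:
        raise ValueError("expected 14 general system characteristics")
    vaf = 0.65 + 0.01 * sum(degrees_of_influence)
    return raw_fp * vaf

# Worked example from the text: degrees of influence summing to 42,
# raw count of 450, giving 450 * (0.65 + 0.42) = 450 * 1.07 = 481.5.
print(adjusted_function_points(450, [3] * 14))
```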
As with lines of code, function points are used as the input into many project estimation tools, and may also be used in derived metrics to normalize those metrics, so that releases, projects, or products of different sizes can be compared (for example, defect density or productivity).
Other Size Metrics
There are many other metrics that may be used to measure the size of different software products:
Complexity—Cyclomatic Complexity
McCabe’s cyclomatic complexity is a measure of the number of linearly independent paths through a module or detailed design element. Cyclomatic complexity can therefore be used in structural testing to determine the minimum number of path tests that must be executed for complete coverage. Empirical data indicates that source code modules with a cyclomatic complexity of 10 or greater are more defect prone and harder to maintain.
Cyclomatic complexity is calculated from a control flow graph by subtracting the number of nodes from the number of edges and adding two times the number of unconnected parts of the graph (edges − nodes + 2p). In well-structured code, with one entry point and one exit point, there is only a single part to the graph, so p = 1. Cyclomatic complexity is a measure that is rarely calculated manually. This is a measurement where static analysis tools are extremely useful and efficient for calculating the cyclomatic complexity of the code. This explanation is included here so people understand what the tool is doing when it measures cyclomatic complexity.
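The formula itself is a one-liner; this sketch applies it to the edge, node, and part counts used in the examples that follow:

```python
def cyclomatic_complexity(edges, nodes, parts=1):
    """v(G) = edges - nodes + 2p for a control flow graph.

    parts (p) is the number of unconnected subgraphs; well-structured
    code with one entry and one exit has p = 1.
    """
    return edges - nodes + 2 * parts

# A connected graph with 12 edges and 9 nodes: 12 - 9 + 2 = 5
print(cyclomatic_complexity(12, 9))
# Two disconnected parts, 25 edges and 19 nodes: 25 - 19 + 4 = 10
print(cyclomatic_complexity(25, 19, parts=2))
```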
As illustrated in Figure 18.11 :
Another way of measuring the cyclomatic complexity for well-structured software (no two edges cross each other in the control flow graph, and there is one part to the graph) is to count the number of regions in the graph. For example:
Figure 18.11 Cyclomatic complexity—examples.
Just to illustrate an example of a software component where the number of parts is greater than one, assume an old piece of poorly structured legacy code has two entry points and two exit points. For example, assume control flow graphs C and D in Figure 18.11 are both part of the same source code module. That module would have 25 edges (12 from C and 13 from D), 19 nodes (9 from C and 10 from D) and 2 parts (part C and part D that are not connected). The cyclomatic complexity would be 25 − 19 + (2*2) = 10. This makes sense because both control flow graphs C and D each had a cyclomatic complexity of 5, so together they should have a cyclomatic complexity of 10.
Complexity—Structural Complexity
While cyclomatic complexity is looking at the internal complexity of an individual design element or source code module, structural complexity is looking at the complexity of the interactions between the modules in a calling structure (or in the case of object-oriented development, between the classes in an inheritance tree). Figure 18.12 illustrates four structural complexity metrics. The depth and width metrics focus on the complexity of the entire structure. The fan-in and fan-out metrics focus on the individual elements within that structure.
The depth metric is the count of the number of levels of control in the overall structure or of an individual branch in the structure. For example, the branch in Figure 18.12 that starts with the element labeled “Main” and goes to the element labeled “A” has a depth of three. Each of the four branches that start with the element labeled “Main” and go to the element labeled “D” has a depth of five. The depth for the entire structure is measured by taking the maximum depth of the individual branches. For example, the structure in Figure 18.12 has a depth of five.
The width metric is the count of the span of control in the overall software system, or of an individual level within the structure. For example, the level that includes the element labeled “Main” in Figure 18.12 has a width of one. The level that includes the elements labeled “A” and “B” has a width of five. The level that includes the element labeled “C” has a width of seven. The width for the entire structure is equal to the maximum width of the individual levels. For example, the structure in Figure 18.12 has a width of seven.
Depth and width metrics can help provide information for making decisions about the integration and integration testing of the product and the amount of effort required.
Fan-out is a measure of the number of modules that are directly called by another module (or that inherit from a class). For example, the fan-out of module Main in Figure 18.12 is three. The fan-out of module A is zero, the fan-out of module B is four, and the fan-out of module C is one.
Figure 18.12 Structural complexity—examples.
Fan-in is the count of the number of modules that directly call a module (or that the class inherits from). For example, in Figure 18.12 , the fan-in of module A is one, the fan-in of module D is four, and the fan-in of module Main is zero.
Fan-in and fan-out metrics provide information for making decisions about the integration and integration testing of the product. These metrics can also be useful in evaluating the impact of a change and the amount of regression testing needed after the implementation of a change.
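Given a call graph, both counts fall out of a simple traversal. The small graph below is a hypothetical example (it is not the structure in Figure 18.12):

```python
def fan_out(calls, module):
    """Number of distinct modules directly called by `module`."""
    return len(set(calls.get(module, [])))

def fan_in(calls, module):
    """Number of distinct modules that directly call `module`."""
    return sum(1 for callees in calls.values() if module in callees)

# Hypothetical call graph: each key calls the modules in its list.
calls = {
    "Main": ["A", "B", "C"],
    "A": [],
    "B": ["D"],
    "C": ["D"],
    "D": [],
}

print(fan_out(calls, "Main"))  # Main directly calls A, B, and C: 3
print(fan_in(calls, "D"))      # D is called by both B and C: 2
```

A module with high fan-in (like D here) is a strong candidate for extra regression testing whenever it changes, since every caller is affected.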
Quality—Defect Density
Defect density is a measure of the total known defects divided by the size of the software entity being measured (Number of known defects/Size). The number of known defects is the count of total defects identified against a particular software entity during a particular time period. Examples include:
For defect density, size is used as a normalizer to allow comparisons between different software entities (for example, source code modules, releases, products) of different sizes. To demonstrate the calculation of defect density, Table 18.1 illustrates the size of the three major subsystems that make up the ABC software system and the number of prerelease and post-release defects discovered in each subsystem.
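The calculation can be sketched as follows, using hypothetical subsystem values rather than the actual Table 18.1 data:

```python
def defect_density(known_defects, size_kloc):
    """Known defects normalized by size, here in defects per KLOC."""
    return known_defects / size_kloc

# Hypothetical subsystems: (known defects, size in KLOC).
subsystems = {
    "Subsystem A": (120, 40.0),
    "Subsystem B": (45, 10.0),
    "Subsystem C": (30, 25.0),
}

for name, (defects, kloc) in subsystems.items():
    print(name, defect_density(defects, kloc))
```

Even though Subsystem A has the most defects in absolute terms, normalizing by size shows Subsystem B is the most defect prone, which is exactly the comparison defect density is meant to enable.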
Defect density is used to compare the relative number of defects in various software components. This helps identify candidates for additional inspection or testing, or for possible reengineering or replacement. Identifying defect-prone components allows the concentration of limited resources into areas with the highest potential return on the investment. Typically this is done using a Pareto diagram as illustrated in Figure 19.16 .
Another use for defect density is to compare subsequent releases of a product to track the impact of defect reduction and quality improvement activities as illustrated in Figure 18.13 . Normalizing defect arrival rates by size allows releases of varying size to be compared. Differences between products or product lines can also be compared in this manner.
Table 18.1 Defect density inputs—example.
Figure 18.13 Post-release defect density—example.
Quality—Arrival Rates
Arrival rates graph the trends over time of problems newly opened against a product. Note that this metric looks at problems, not defects. Testers, customers, technical support personnel, or other originators report problems because they think there is something wrong with the software. The software developers must then debug the software to determine if there is actually a defect. In some cases, the problem report may be closed after this analysis as “operator error,” “works as designed,” “cannot duplicate,” or some other non-defect-related disposition.
Prior to release, the objective of evaluating the arrival rate trends is to determine if the product is stable or moving toward stability. As illustrated in Figure 18.14 , when testing is first gearing up, arrival rates may be low. During the middle of testing, arrival rates are typically higher with some level of variation. However, as shown for product A, the goal is for the arrival rates to trend downward toward zero or stabilize at very low levels (the time between failures should be far apart) prior to the completion of testing and release of the software. By stacking the arrival rates by severity, additional information can be analyzed. For example, it is one thing to have a few minor problems still being found near the end of testing, but if critical problems are still being found, it might be appropriate to continue testing. Product B in Figure 18.14 does not exhibit this stabilization and therefore indicates that continued testing is appropriate. However, arrival rate trends by themselves are not sufficient to signal the end of testing. This metric could be made to “look good” simply by slowing the level of effort being expended on testing. Therefore, arrival rates should be evaluated in conjunction with other metrics, including effort rates, when evaluating test sufficiency.
Post-release arrival rate trends can also be evaluated. In this case the goal is to determine the effectiveness of the defect detection and removal processes, and process improvement initiatives. As illustrated in Figure 18.13 , post-release defect arrival rates can be normalized by release size in order to compare them over subsequent releases. Note that this metric must wait until after the problem report has been debugged, and counts only defects.
Figure 18.14 Problem report arrival rate—examples.
Figure 18.15 Cumulative problem reports by status—examples.
Quality—Problem Report Backlog
Arrival rates only track the number of problems being identified in the software. Those problems must also be debugged and defects corrected to increase the software product’s quality before that software is ready for release. Therefore, tracking the problem report backlog over time provides additional information. The cumulative problem reports by status metric illustrated in Figure 18.15 combines arrival rate information with problem report backlog information. In this example, four problem report statuses are used:
The objective for this metric, as testing nears completion, is for the arrival rate trend (the shape of the cumulative curve) to flatten, indicating that the software has stabilized, and for all of the known problems to have a status of closed.
Quality—Data Quality
The ISO/IEC Systems and Software Engineering—Systems and Software Quality Requirements and Evaluation (SQuaRE)—Measurement of Data Quality standard (ISO/IEC 2015) lists the following quality measures for data quality and provides associated measures:
Amount of Test Coverage Needed
There are a number of metrics that can help predict the amount of test coverage needed. For example, for unit-level white-box tests, metrics for measuring the needed amount of test coverage might include:
For integration testing, metrics for measuring the needed amount of test coverage might include:
For functional (black box) testing, metrics for measuring the needed amount of test coverage might include the number of requirements and the historic number of test cases per requirement. By multiplying these two metrics (Number of requirements × Historic number of test cases per requirement), the number of functional test cases for the project can be estimated. A forward-traceability metric can be used to track test case coverage during test design. For example, dividing the number of functional requirements that trace forward to test cases by the total number of functional requirements provides an estimate of the completeness of functional test case coverage. A graph such as the one depicted in Figure 18.16 can then be used to track the completeness of test coverage against the number of planned test cases during test execution.
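The two functional-test calculations described above, the test case estimate and the forward-traceability coverage, can be sketched as follows; the specific counts are hypothetical:

```python
def estimated_test_cases(num_requirements, historic_cases_per_requirement):
    """Estimate of functional test cases needed for the project."""
    return num_requirements * historic_cases_per_requirement

def functional_coverage(traced_requirements, total_requirements):
    """Fraction of functional requirements that trace forward to
    at least one test case (completeness of test case coverage)."""
    return traced_requirements / total_requirements

# Hypothetical project: 200 functional requirements, history averaging
# 2.5 test cases per requirement, 180 requirements traced to tests so far.
print(estimated_test_cases(200, 2.5))   # planned test case count
print(functional_coverage(180, 200))    # completeness of coverage
```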
Figure 18.16 Completeness of test coverage—examples.
Requirements Volatility
Requirements volatility, also called requirements churn or scope creep, is a measure of the amount of change in the requirements once they are baselined. Jones (2008) reports that in the United States, requirements volatility averages range from:
It is not a question of “if the requirements will change” during a project, but “how much will the requirements change?” Since requirements change is inevitable, it must be managed, and to appropriately manage requirements volatility, it must be measured. Figure 18.17 is an example of a graph that tracks requirements volatility as the number of changes to requirements over time. The line on this graph reports the current requirements size (number of requirements or weighted requirements). The data table includes details about the number of requirements added, deleted, and modified. This detail is necessary to understand the true requirements volatility. For example, if five requirements are modified, two new requirements are added, and two other requirements are deleted, the number of requirements remains unchanged even though a significant amount of change has occurred.
Figure 18.17 Requirements volatility: change to requirements size—example.
Another example of a requirements-volatility metric is illustrated in Figure 18.18 . Instead of tracking the number of changes, this graph looks at the percentage of baselined requirements that have changed over time. A good project manager understands that the requirements will change and takes this into consideration when planning project schedules, budgets, and resource needs. The risk is not that requirements will change— that is a given. The risk is that more requirements change will occur than the project manager estimated when planning the project. In this example, the project manager estimated 10 percent requirements volatility, and then tracked the actual volatility to that threshold. In this example, the number of changed requirements is the cumulative count of all requirements added, deleted, or modified after the initial requirements are baselined and before the time of data extraction. If the same requirement is changed more than once, the number of changed requirements is incremented once for each change. These metrics act as triggers for contingency plans based on this risk.
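The counting rule described above, where every add, delete, and modify increments the changed-requirements count, can be sketched as follows with hypothetical numbers:

```python
def requirements_volatility(added, deleted, modified, baselined):
    """Cumulative changed requirements as a percentage of the baseline.

    Per the counting rule in the text, adds, deletes, and modifies all
    count, and a requirement changed twice counts twice.
    """
    return 100.0 * (added + deleted + modified) / baselined

# Hypothetical: 200 baselined requirements; since baselining, 2 were
# added, 2 deleted, and 5 modified. Note the requirements count itself
# is unchanged (200 + 2 - 2), yet volatility is nonzero.
print(requirements_volatility(2, 2, 5, 200))  # 4.5 percent
```

This is why the detail table matters: the net requirements count alone would show zero change while 4.5 percent of the baseline has actually churned.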
Other metrics used to measure requirements volatility include:
Reliability
Reliability is a quality attribute describing the extent to which the software can perform its functions without failure for a specified period of time under specified conditions. The actual reliability of a software product is typically measured in terms of the number of failures in a specific time interval (for example, the number of failures per month), or the time between failures (mean time to failure). Software reliability models are utilized to predict the future reliability or the latent defect count of the software. There are two types of reliability models: static and dynamic.
Static reliability models use other project or software product attributes (for example, size, complexity, programmer capability) to predict the defect rate or number of defects. Typically, information from previous products and projects is used in these models, and the current project or product is viewed as an additional data point in the same population.
Figure 18.18 Requirements volatility: percentage requirements change—examples.
Dynamic reliability models are based on collecting multiple data points from the current product or project. Some dynamic models look at defects gathered over the entire life cycle, while other models concentrate on defects found during the formal testing phases at the end of the life cycle. Examples of dynamic reliability models include:
Appropriately implementing a software reliability model requires an understanding of the assumptions underlying that model. For instance, the J-M model’s five assumptions are (Kan 2003):
Various other models make different assumptions than the J-M model. When selecting a model, consideration must be given to the likelihood that the model’s assumptions will be met by the project’s software environment.
Availability
Availability is a quality attribute that describes the extent to which the software or a service is available for use when needed. Availability is closely related to reliability because unreliable software that fails frequently is unavailable for use because of those failures. Availability is also dependent on the ability to restore the software product to a working state. Therefore, availability is also closely related to maintainability in cases where a software defect caused the failure and that defect must be corrected before the operations can continue.
In the example of an availability metric illustrated in Figure 18.19 , post-release availability is calculated each month by subtracting from one the sum of all of the minutes of outage that month, divided by the sum of all the minutes of operations that month, and multiplying by 100 percent. Tracking availability helps relate software failures and reliability issues to their impact on the user community.
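The monthly calculation described above can be sketched directly; the outage and operation minutes below are hypothetical:

```python
def monthly_availability(outage_minutes, operation_minutes):
    """Availability = (1 - outage / operations) * 100%, per the text."""
    return (1.0 - outage_minutes / operation_minutes) * 100.0

# Hypothetical month: 43,200 minutes of scheduled operation
# (30 days * 24 hours * 60 minutes) with 90 minutes of outage.
print(monthly_availability(90, 43_200))
```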
When defining reliability and availability metrics for software, care should be taken not to use formulas intended for calculating hardware reliability and availability to evaluate software. This is because:
Figure 18.19 Availability—example.
System Performance
System performance metrics are used to evaluate a number of execution characteristics of the software. These can be measured during testing to evaluate the future performance of the product, or during actual execution. Examples of performance metrics include:
Maintainability
Maintainability is a quality attribute describing the ease with which a software product or one of its components can be modified after it has been released. Several metrics can be used to measure maintainability. Examples include:
Usability
Usability is a quality attribute describing the amount of effort that the users must expend to learn and use the software. Usability is measured in terms of five main factors:
There are typically engineering trade-offs between these various factors, and between these factors and other attributes of the software product. For example, the speed of data entry might be a trade-off with efficiency, that is, the faster the user enters data, the more mistakes they make. The speed of accessing the software or its data might also require an engineering trade-off against the security of the software or the integrity of the data. The goal is to optimize as many of these usability factors as possible, balanced against optimizing other product attributes based on the priorities of the various software product attributes.
3. SOFTWARE PROCESS METRICS
Measure the effectiveness and efficiency of software processes (e.g., functional verification tests (FVT), cost, yield, customer impact, defect detection, defect containment, total defect containment effectiveness (TDCE), defect removal efficiency (DRE), process capability). (Apply)
BODY OF KNOWLEDGE V.A.3
Software process entities can be major development, maintenance, or acquisition activities, such as the entire software development process from requirements through delivery to operations, or small individual activities such as the inspection of a single document. Software process entities can also be time intervals, which may not necessarily correspond to specific activities. Examples of time intervals include the month of January or the first six months of operations after delivery.
Examples of attributes associated with process entities include cycle time, cost, the number of incidents that occurred during that process (for example, the number of defects found, the number of pages inspected, the number of tasks completed), controllability, productivity, efficiency, effectiveness, stability, and capability.
Cost
The cost of a process is typically measured as either the amount of money spent, or the amount of effort expended, to implement an occurrence of that process, for example:
Collecting measurements on average cost of a process, and/or cost distributions over multiple implementations of that process, can help identify areas of inefficiencies and low productivity that provide opportunities for improvement to that process. After those improvements are implemented, the same cost metrics can be used to measure the impacts of the improvements on process costs in order to evaluate the success of the improvement initiatives.
Cost metrics can be used to estimate the costs of a project and project activities as part of project planning, and to track actual costs against budgeted estimates as part of project tracking. The collection and evaluation of process cost metrics can help an organization understand where, when, and how they spend money and effort on their projects. This understanding can help improve the cost estimation models for predicting costs on future projects.
First-Pass Yield
First-pass yield evaluates the effectiveness of an organization’s defect prevention techniques by looking at the percentage of products that do not require rework after the completion of a process. First-pass yield is calculated as a percentage of the number of items not requiring rework because of defects after the completion of the process, divided by the total number of items produced by that process. For example, if the coding process resulted in the creation and baselining of 200 source code modules, and 35 of those modules were later reworked because defects were found in those 35 modules, then the first-pass yield for the coding process is ([200 − 35]/200) x 100% = 82.5%. As another example, if 345 requirements are documented and baselined as part of the requirements development process, and 69 of those requirements were modified after baselining because of defects, then the first-pass yield of requirements development is ([345 − 69]/345) x 100% ≈ 80%.
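Both worked examples above follow the same formula, sketched here:

```python
def first_pass_yield(items_produced, items_reworked):
    """Percentage of items needing no rework after process completion."""
    return 100.0 * (items_produced - items_reworked) / items_produced

# Examples from the text:
print(first_pass_yield(200, 35))  # coding: ([200 - 35]/200) x 100% = 82.5%
print(first_pass_yield(345, 69))  # requirements: ([345 - 69]/345) x 100% = 80%
```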
Cycle Time
Cycle time is a measurement of the amount of calendar time it takes to perform a process from start to completion. Knowing the cycle times for the processes in an organization’s software development life cycle allows better estimates to be done for schedules and required resources. It also enables the organization to monitor the impacts of process improvement activities on the cycle time for those processes. Cycle time can be measured as either static or dynamic cycle time.
Static cycle time looks at the average actual time it takes to perform the process. Cycle time helps answer questions such as “how long, on average, does it take to code a source code module, to correct a software defect, or to execute a test case?” For example, if four source code modules were programmed this week and they took five days, ten days, seven days, and eight days, respectively, to program, the static cycle time = (5 + 10 + 7 + 8)/4 = 7.5 days per module.
Dynamic cycle time is calculated by dividing the number of items in progress (items that have only partially completed the process) by one-half of the number of new starts plus new completions during the period. For example, if 52 source code modules were started this month, 68 source code modules were completed this month, and 16 source code modules were in progress at the end of the month, the dynamic cycle time = (16/[(52 + 68)/2]) x 23 days (the number of working days this month to convert months to days) ≈ 6.1 days per module.
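The two cycle time calculations above can be sketched as follows, using the same numbers as the text's examples:

```python
def static_cycle_time(durations):
    """Average actual time to complete the process (same units as input)."""
    return sum(durations) / len(durations)

def dynamic_cycle_time(in_progress, started, completed, working_days):
    """Work in progress divided by half of (new starts + completions)
    for the period, scaled by working days to convert months to days."""
    return in_progress / ((started + completed) / 2) * working_days

# Static example from the text: four modules took 5, 10, 7, and 8 days.
print(static_cycle_time([5, 10, 7, 8]))       # 7.5 days per module
# Dynamic example: 16 in progress, 52 started, 68 completed, 23 working days.
print(dynamic_cycle_time(16, 52, 68, 23))     # about 6.1 days per module
```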
Customer Impact—Customer Satisfaction
Customer satisfaction is an essential element to staying in business in this modern world of global competition. An organization must satisfy and even delight its customers and other stakeholders with the value of its software products and services to gain their loyalty and repeat business. So how satisfied are an organization’s customers and other stakeholders? The best way to find out is to ask them by using customer satisfaction surveys. Several metrics can result from the data collected during these surveys. These metrics can provide management with the information they need to determine their customers’ level of satisfaction with their software products, and with the services and processes associated with those products. Software engineers and other members of the technical staff can use these metrics to identify opportunities for ongoing process improvements and to monitor the impact of those improvements.
Figure 18.20 illustrates an example of a metrics report that summarizes the customer satisfaction survey results and indicates the current customer satisfaction level. For each quality attribute polled on the survey, the average satisfaction and importance values are plotted as a numbered bubble on an x–y graph. It should be remembered that to make calculation of average satisfaction level valid, a ratio scale measure should be used (for example, a range of zero to five, with five being very satisfied). If an ordinal scale metric is used, the median should be used as the measure of central tendency. The darker shaded area on this graph indicates the long-term goal of having an average satisfaction score of better than four for all quality attributes. The lighter shaded area indicates a shorter-term goal of having an average satisfaction score better than three. From this summary report it is possible to quickly identify “initial software reliability” (bubble 2) and “documentation” (bubble 7) as primary opportunities to improve customer satisfaction. By polling importance as well as satisfaction level in the survey, the person analyzing this metric can see that even though documentation has a poorer satisfaction level, initial software reliability is much more important to the customers and therefore should probably be given a higher priority.
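The scale-dependent choice of central tendency noted above, the mean for ratio-scale scores and the median for ordinal scores, can be sketched as follows; the survey responses are hypothetical:

```python
from statistics import mean, median

def central_tendency(scores, scale="ratio"):
    """Mean for ratio-scale satisfaction scores; median for ordinal
    scores, where averaging is not a valid operation."""
    return mean(scores) if scale == "ratio" else median(scores)

# Hypothetical responses for one quality attribute on a 0-5 scale.
scores = [4, 5, 3, 4, 4, 2, 5]
print(central_tendency(scores, "ratio"))    # mean satisfaction
print(central_tendency(scores, "ordinal"))  # median satisfaction
```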
Figure 18.21 illustrates an example of a metrics report that shows the distribution of satisfaction scores for three questions. Graphs where the scores are tightly clustered around the mean (question A) indicate a high level of agreement among the customers on their satisfaction level. Distributions that are widely spread (question B), or particularly bimodal distributions (question C), are candidates for further detailed analysis.
Figure 18.20 Customer satisfaction summary report—example.
Figure 18.21 Customer satisfaction detailed report—example.
Figure 18.22 Customer satisfaction trending report—example.
Another way to summarize the results of a satisfaction survey is to look at trends over time. Figure 18.22 illustrates an example of a metrics report that trends the initial software reliability based on quarterly surveys conducted over a period of 18 months. Again, the dark and light shaded areas on this graph indicate the long- and short-term satisfaction level goals. One note of caution is that to trend the results over time the survey must remain unchanged in the area being trended. Any rewording of the questions on the survey can have major impacts on the survey results. Therefore, historic responses from before wording changes should not be used in future trends.
The primary purpose of trend analysis is to determine if the improvements made to the products, services, or processes had an impact on the satisfaction level of the customers. It should be remembered, however, that satisfaction is a lagging indicator because it provides information only about what has already occurred. Customers have long memories; the dismal initial quality of a software version, from three releases ago, may still impact their perception of the product even if the last two versions have been superior.
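A quarterly trend report like Figure 18.22 could be computed along these lines; the quarters, scores, and goal values below are illustrative assumptions:

```python
from statistics import mean

# Hypothetical quarterly "initial software reliability" satisfaction
# scores (0-5 ratio scale) from six consecutive surveys.
quarterly_scores = {
    "2023Q1": [2, 3, 2, 3],
    "2023Q2": [3, 3, 2, 3],
    "2023Q3": [3, 4, 3, 3],
    "2023Q4": [3, 4, 4, 3],
    "2024Q1": [4, 4, 3, 4],
    "2024Q2": [4, 4, 4, 4],
}

SHORT_TERM_GOAL, LONG_TERM_GOAL = 3.0, 4.0

def trend(scores_by_quarter):
    """Return (quarter, average, band) tuples for plotting against the goals."""
    rows = []
    for quarter in sorted(scores_by_quarter):
        avg = mean(scores_by_quarter[quarter])
        if avg > LONG_TERM_GOAL:
            band = "long-term goal met"
        elif avg > SHORT_TERM_GOAL:
            band = "short-term goal met"
        else:
            band = "below goals"
        rows.append((quarter, avg, band))
    return rows
```

Because satisfaction is a lagging indicator, a flat or rising trend here reflects improvements made one or more quarters earlier.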
Customer satisfaction can be considered both a product and a process metric. It measures the stakeholder’s satisfaction with the current product and services as well as the supporting processes like installation and customer support. It can also be used as a process measure to measure the impacts of process improvements to those processes.
Customer satisfaction is a subjective measure. It is a measure of perception, not reality, although when it comes to a happy customer, perception is more important than reality. One phenomenon that often occurs is that as the quality of software improves, the expectations of the customers also increase. The customers continue to demand bigger, better, faster software. This can result in a flat trend even though quality is continuously improving. Worse still, it can cause a declining graph if improvements to quality are not keeping up with the increases in the customer’s expectations. Even though this impact can be discouraging, it is valuable information that the organization needs to know in the very competitive world of software.
Customer Impact—Responsiveness to Reported Problems
When a customer has a problem with software, the developer must respond quickly to resolve and close the reported problem. Service level agreements may define problem report response time goals based on the severity of the incident (for example, critical problems within 24 hours, major problems within 30 days, and minor problems within 120 days). The graph in Figure 18.23 illustrates an example of a metric to track actual performance against these service level agreements. This graph trends the percentage of problem reports closed within the service level agreement time frames each month. This metric is a lagging indicator, a view of the past. Using lagging indicator metrics is like painting the front windshield of a car black and monitoring the quality of the driving by counting the dead bodies that can be seen in the rearview mirror. If an organization must wait until a problem report is closed before tracking this response-time metric, it cannot take proactive action to control its responsiveness.
The graph in Figure 18.24 shows all non-closed, major problem reports distributed by their age, the number of days since they were opened. Analysis of this graph can quickly identify problem areas, including problem reports that are past their service level agreement goal and reports approaching their goal. This information allows a proactive approach to controlling responsiveness to customer-reported problems. This metric is a leading indicator because it helps proactively identify issues. Similar graphs can also be created for minor problems (since critical problems must be corrected within 24 hours it is probably not cost-effective to track them with this type of metric).
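A minimal sketch of the aging analysis behind a report like Figure 18.24 follows; the report IDs, dates, and the 80 percent warning threshold are hypothetical:

```python
from datetime import date

# Hypothetical open (non-closed) major problem reports: id -> date opened.
open_major_reports = {
    "PR-101": date(2024, 1, 5),
    "PR-114": date(2024, 2, 12),
    "PR-120": date(2024, 3, 1),
}

SLA_DAYS_MAJOR = 30       # service level agreement goal for major problems
WARNING_THRESHOLD = 0.8   # flag reports at 80% of the SLA window (assumption)

def age_report(reports, today, sla_days=SLA_DAYS_MAJOR):
    """Bucket open reports into within-goal, approaching-goal, and past-goal."""
    buckets = {"within_goal": [], "approaching_goal": [], "past_goal": []}
    for report_id, opened in reports.items():
        age = (today - opened).days
        if age > sla_days:
            buckets["past_goal"].append(report_id)
        elif age > sla_days * WARNING_THRESHOLD:
            buckets["approaching_goal"].append(report_id)
        else:
            buckets["within_goal"].append(report_id)
    return buckets
```

Run daily, the "approaching_goal" bucket gives the proactive (leading) signal the closed-report trend cannot.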
Figure 18.23 Responsiveness to customer problems—example.
Figure 18.24 Defect backlog aging—example.
Efficiency of Defect Detection—Escapes
One way to measure the effectiveness of a process is to measure the number of escapes from that process. An escape is any issue, defect, or nonconformance that exists during the process but is not discovered by the process. To explain the concept of escapes, let’s first talk about escapes from the various defect detection processes. An analogy for these escapes is to think of each defect detection process as a screen door that is catching bugs (defects). Escapes are the bugs that make it through the screen door and get farther into the house. A person cannot determine how effective the screen door is by looking at the bugs it caught. The person must go inside and count the number of bugs that got past the screen door.
In software there is no way of counting how many defects actually escaped a defect detection process. The number and origin of defects detected by subsequent defect detection processes must be examined to approximate the number of actual escapes. For example, as illustrated in Figure 18.25a, at the end of the first defect detection process (requirement defect detection—typically a requirement peer review) there are no known escapes.
However, as illustrated in Figure 18.25b, after analyzing the defects found by the second design defect detection process, there are not only design-type defects (seven dark gray bugs) but also requirement-type defects (three light gray bugs), so three requirement escapes have been identified.
As illustrated in Figure 18.25c, after analyzing the defects found by the third coding defect detection process, there are not only coding-type defects (seven black bugs) but also requirement-type defects (one light gray bug) and design-type defects (three dark gray bugs). So four requirement escapes and three design escapes have now been identified.
As illustrated in Figure 18.25d, everything found in the testing defect detection processes is an escape, because no new defects are introduced through the testing process (new defects introduced during testing are the result of requirement, design, or code rework efforts). Analysis of defects found in testing shows requirement-type defects (two light gray bugs), design-type defects (five dark gray bugs), and coding-type defects (three black bugs). There are now six requirement escapes, eight design escapes, and three coding escapes.
Finally, as illustrated in Figure 18.25e, everything found in operations is also an escape. Analysis of defects found in operations shows requirement-type defects (one light gray bug), design-type defects (three dark gray bugs), and coding-type defects (four black bugs). There are now a total of seven requirement escapes, eleven design escapes, and seven coding escapes. Of course, these counts will continue to change if additional defects are identified in operations.
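The escape tallies in this walkthrough can be reproduced from a defect log of (phase introduced, phase detected) pairs; the log below simply encodes the bug counts from the narrative:

```python
from collections import Counter

# Defect log: (phase introduced, phase detected) pairs, built from the
# bug counts in the Figure 18.25 walkthrough (design through operations).
defect_log = (
    [("design", "design")] * 7 + [("requirements", "design")] * 3 +
    [("code", "code")] * 7 + [("requirements", "code")] * 1 +
    [("design", "code")] * 3 +
    [("requirements", "testing")] * 2 + [("design", "testing")] * 5 +
    [("code", "testing")] * 3 +
    [("requirements", "operations")] * 1 + [("design", "operations")] * 3 +
    [("code", "operations")] * 4
)

def count_escapes(log):
    """An escape is any defect detected in a later phase than it was introduced."""
    return Counter(origin for origin, detected in log if origin != detected)

escapes = count_escapes(defect_log)
# Matches the running totals in the text: 7 requirements, 11 design, 7 coding.
```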
However, escapes are not just limited to defect detection processes. Escapes can be examined to evaluate the effectiveness of other processes as well. For example:
Figure 18.25a Measuring escapes—requirements example.
Figure 18.25b Measuring escapes—design example.
Figure 18.25c Measuring escapes—coding example.
Figure 18.25d Measuring escapes—testing example.
Figure 18.25e Measuring escapes—operations example.
Defect Containment and Total Defect Containment Effectiveness (TDCE)
Defect containment in general looks at the effectiveness of defect detection processes in keeping defects from escaping into later phases or into operations. Phase containment effectiveness is a defect containment metric used to measure the effectiveness of defect detection processes in identifying defects in the same phase in which they were introduced. Many studies have demonstrated that defects that are not detected until later in the software development life cycle are much more costly to correct. By understanding which processes are allowing phase escapes, organizations can better target their defect detection improvement efforts. Phase containment is calculated by:

Phase containment = (defects of a given type found in their phase of origin) ÷ (total defects of that type found in all phases)
Figure 18.26 illustrates an example of calculating phase containment. For the requirements phase, 15 requirement-type defects were found and fixed during that phase, and 10 requirement-type defects were found in later phases for a total of 25 requirement-type defects. The requirements phase containment is 15/25 = 60%. For the design phase, 29 design-type defects were found and fixed during that phase, and 12 design-type defects were found in later phases for a total of 41 design-type defects. The design phase containment is 29/41 = 71%. To continue this example, for the code phase, 86 code-type defects were found and fixed during that phase, and 26 code-type defects were found in later phases for a total of 112 code-type defects. The code phase containment is 86/112 = 77%. Since the requirements phase containment percentage is the lowest, the requirements phase would be considered the best target for process improvement.
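The phase containment calculations from this example can be sketched as follows; the counts are taken from the text and the data structure is illustrative:

```python
# Defect counts from the Figure 18.26 example: per phase, defects of that
# type found in-phase versus found in later phases.
defect_counts = {
    "requirements": {"found_in_phase": 15, "found_later": 10},
    "design":       {"found_in_phase": 29, "found_later": 12},
    "code":         {"found_in_phase": 86, "found_later": 26},
}

def phase_containment(counts):
    """PCE = defects found in their phase of origin / total defects of that origin."""
    return {
        phase: c["found_in_phase"] / (c["found_in_phase"] + c["found_later"])
        for phase, c in counts.items()
    }

pce = phase_containment(defect_counts)
# requirements: 15/25 = 60%, design: 29/41 ~ 71%, code: 86/112 ~ 77%
```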
Figure 18.26 Defect containment effectiveness—example.
While this example shows the phase containment metrics being calculated after the software has been in operations for some period of time, in real use the phase containment metrics should be calculated after each phase. At the end of the requirements phase, the requirements phase containment is 15/15 = 100%. At the end of the design phase, requirements phase containment is 15/20 = 75%. These ongoing calculations can be compared with average values baselined from other projects to determine if corrective action is necessary at any time. For example, assume that instead of five requirement-type defects being found in the design phase, 30 requirement-type defects were found. In this case, the requirements phase containment would be calculated as 15/45 ≈ 33%. This much lower value might indicate that corrective action, in the form of additional requirements defect detection activities, should be performed before proceeding into the code phase.
The total defect containment effectiveness (TDCE) metric shows the effectiveness of defect detection processes in identifying defects before the product is released into operation. TDCE is calculated as:

TDCE = (total defects found prior to release) ÷ (total defects found, including those found after release)
For the example in Figure 18.26, 24 requirement-type defects, 39 design-type defects, and 105 code-type defects, for a total of (24 + 39 + 105) = 168 defects, were found prior to release. Total defects found (including those found after release) is (25 + 41 + 112) = 178. TDCE for this example is 168/178 ≈ 94%. The TDCE for this project could then be compared with the TDCE for previous projects to determine if it is at an appropriate level. If process improvements have been implemented, this comparison can be used to determine if the improvements had a positive impact on the TDCE metric value.
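A sketch of the TDCE calculation using the same counts:

```python
# Totals from the Figure 18.26 example: defects found before release
# versus total defects found (including those found in operations).
found_before_release = {"requirements": 24, "design": 39, "code": 105}
total_found = {"requirements": 25, "design": 41, "code": 112}

def tdce(before, total):
    """TDCE = all defects found prior to release / all defects found overall."""
    return sum(before.values()) / sum(total.values())

effectiveness = tdce(found_before_release, total_found)  # 168/178, about 94%
```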
Defect Removal Efficiency
Defect removal efficiency (DRE), also called defect detection efficiency or defect detection effectiveness, is a measure of the effectiveness of a defect detection/removal process. DRE is the percentage of all defects present in the software that were found and removed when a detection/rework process was executed. Unlike phase containment, this metric can be calculated for each defect detection process (instead of by phase). For example, if both code reviews and unit testing were done in the coding phase, each process would have its own DRE percentage. DRE also includes all defects that could have been found by that process, not just the ones introduced during that phase. DRE is calculated by:

DRE = (defects found by the process) ÷ (defects found by the process + defects that escaped the process and were found later)
Figure 18.27a illustrates an example of the calculation of DRE. Note that the DRE for the requirements review process is 15/25 = 60%. Since this is the first defect detection/removal process and the only types of defects available to find are requirement-type defects, the phase containment and DRE numbers are the same. However, in the design review process, not only can design-type defects be found, but the defects that escaped from the requirements review can also be found. As illustrated in Figure 18.27a, there were five requirement-type defects and 29 design-type defects found during the design review. However, an additional five requirement-type defects and 12 design-type defects escaped the design review process and were found later. The DRE for the design review is (5 + 29)/(5 + 29 + 5 + 12) ≈ 67%.
As illustrated in Figure 18.27b, the DRE for the code review is (1 + 3 + 54)/(1 + 3 + 54 + 4 + 9 + 58) ≈ 45%. To complete this example, the DRE for:
Note that of the peer reviews, the coding peer review had the lowest DRE in this example at 45 percent. Of the testing activities, system testing had the lowest DRE. Therefore, these two activities would be candidates for process improvement.
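The DRE calculations for the three review processes can be sketched as follows; the counts come from the text and the data structure is illustrative:

```python
# Counts from the Figure 18.27 examples: for each defect detection process,
# the defects it found and the defects available to it that escaped and
# were found later.
processes = {
    "requirements review": {"found": 15, "escaped": 10},
    "design review":       {"found": 5 + 29, "escaped": 5 + 12},
    "code review":         {"found": 1 + 3 + 54, "escaped": 4 + 9 + 58},
}

def dre(counts):
    """DRE = defects removed by the process / defects present when it ran."""
    return {
        name: c["found"] / (c["found"] + c["escaped"])
        for name, c in counts.items()
    }

efficiency = dre(processes)
# requirements review: 15/25 = 60%, design review: 34/51 ~ 67%,
# code review: 58/129 ~ 45%
```

Ranking the resulting percentages identifies the weakest detection processes, here the code review, as the improvement candidates.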
As with phase containment effectiveness, while these examples show the defect removal efficiency metrics being calculated after the software has been in operations for some period of time, in real use these metrics should be calculated after each major defect detection/removal process and compared with baselined values to identify potential issues that need corrective action.
Figure 18.27a Defect containment effectiveness—design review example.
Figure 18.27b Defect containment effectiveness—code review example.
Process Capability
When measuring process capability, the software industry offers guidance in the form of the CMMI Continuous Representation (SEI 2010; SEI 2010a; SEI 2010b) and the ISO/IEC 15504 set of standards. Based on these documents, each process area is assigned a capability level from zero to five, based on whether the implementation of that process within the organization meets defined criteria for good industry practice.
Many of the metrics already discussed can also be used to understand the process capabilities of each individual process, help an organization recognize the amount of variation in its processes, and act as predictors of future process performance. Examples of these metrics include first-pass yield, phase containment, defect detection efficiency, cycle time, productivity, cost, and the product measures of the products being produced by the process. For example, the higher the defect density of requirements specification, the lower the capability of the requirements development processes.
Project and Risk Metrics
In Chapter 16, several software project metrics were discussed for tracking projects. In Chapter 17, several software risk metrics were discussed for analyzing, planning for, and tracking risks.
Agile Metrics
One of the primary practices of eXtreme programming (XP) is the informative workspace, which uses the workspace to communicate important, active information. For example, story cards (sorted by “done,” “to be done this week,” “to be done this release”) or big, visible charts for tracking progress over time of important issues might be posted on the wall. Beck (2005) says “An interested observer should be able to walk into the team space and get a general idea of how the project is going in 15 seconds. He should be able to get more information about real or potential problems by looking more closely.”
Another agile methodology, Crystal, has a similar strategy called information radiators. Information radiators are forms of project documentation displayed or placed in areas where people can easily see them. Examples of information radiators include posting (Cockburn 2005):
Information radiators show status such as the:
Agile also uses metrics called burn charts. Burn charts are intended to be posted in a visible location, as part of the informative workspace or information radiator. Burn charts should show how the project’s estimations (predictions) compare to its actual accomplishments. Figure 16.9 shows a burn-up chart example. A burn-up chart starts with zero and tracks progress by graphing the number of items completed up to the ceiling (goal). A burn-down chart starts with the total number of items that need to be done and tracks progress by graphing the number of items left to do, down to zero items left to do.
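The two chart types can be sketched with simple arithmetic; the 20-story iteration and daily completion counts below are hypothetical:

```python
# Hypothetical iteration data: total stories planned and the cumulative
# number of stories completed at the end of each day.
TOTAL_STORIES = 20
completed_by_day = [0, 2, 5, 7, 11, 14, 17, 20]

def burn_up(completed):
    """Burn-up: cumulative items done, climbing from zero toward the goal."""
    return list(completed)

def burn_down(completed, total):
    """Burn-down: items left to do, falling from the total toward zero."""
    return [total - done for done in completed]

# burn_down(completed_by_day, TOTAL_STORIES) -> [20, 18, 15, 13, 9, 6, 3, 0]
```

Plotting either series against an estimated (predicted) series gives the comparison of predictions to actual accomplishments described above.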
4. DATA INTEGRITY
Describe the importance of data integrity from planning through collection and analysis, and apply various techniques to ensure data quality, accuracy, completeness, and timeliness. (Apply)
BODY OF KNOWLEDGE V.A.4
The old saying “garbage in, garbage out” applies to software metrics. A software quality engineer (SQE) should be able to analyze the integrity of the data used as inputs into creating information products. An SQE should be able to analyze the metrics design and implementation processes and the data collection and analysis processes to evaluate the quality, accuracy, completeness, and timeliness of the data, metrics, and information products.
If the right data items are not collected, then the objectives of the measurement program cannot be accomplished. Data analysis is pointless without good data. Therefore, establishing a good data collection plan is the cornerstone of any successful metrics program. The data collection plans may be part of standardized metrics definitions at the organizational level, or included in (and tailored by) project plans or subplans (for example, measurement plans, communication plans, quality assurance plans). A data collection plan should include who is responsible for collection and/or validation of the data, how the data should be collected (for example, the measurement units used and the conditions under which the measurement should be collected), when the data is collected, and where and how the data is stored. Data collection plans may also include information about archiving or retiring data.
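One way to capture such a plan in a reviewable form is sketched below; the field names and values are illustrative assumptions, not from any standard:

```python
# A minimal sketch of one entry in a data collection plan, covering the
# who/how/when/where questions listed above.
problem_report_plan = {
    "data_item": "problem report",
    "owner": "help desk staff",  # who collects and validates the data
    "collection_method": "customer call report database entry",
    "units_and_conditions": "one record per reported incident, severity "
                            "selected from a standard pick list",
    "collection_time": "at the time the customer call is taken",
    "storage": "customer call report database",
    "retention": "archive records two years after problem closure",
}

# A reviewer can mechanically check that each plan entry answers the
# who/how/when/where questions before approving it.
required_fields = {"owner", "collection_method", "collection_time", "storage"}
assert required_fields <= problem_report_plan.keys()
```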
Who Should Collect the Data?
Deciding who should collect the data is an important part of making sure that good data is collected. In most cases the best choice is the owner of the data. The data owner is the person or tool with direct access to the source of the data and in many cases is actually responsible for generating the data. Table 18.2 includes a list of example data owners and the types of data that they own. For example, when a user calls in to the help desk, the help desk staff has direct access to the problem identification data that is being reported by the user, and collects that data into a customer call report database. On the other hand, in some cases the users themselves have access to a problem-reporting database. In that case the users are the owners of the problem-identification data and collect that data. Whenever possible, tools should be used to collect the data and relieve people of the burden of data collection.
The benefits of having the data owner collect the data include:
Table 18.2 Data ownership—examples
Data owner
Examples of data owned
Management
  • Schedule
  • Budget
Software developers
  • Time spent per task
  • Inspection data, including defects found
  • Root cause of defects
Testers
  • Test cases planned/executed/passed
  • Problem identification data
  • Test coverage
Configuration management
  • Defects corrected per build
  • Modules changed per build
Users
  • Problem-identification data
  • Operational hours
Tools
  • Line of code
  • Cyclomatic complexity
Once the people who need to gather the data are identified, they must agree to do the work. They must be convinced of the importance and usefulness of collecting the data. Management has to support the program by giving the data owners the time and resources required to perform data collection activities. To quote Watts Humphrey (1989), “The actual work of collecting data is tedious and must generally be done by software professionals who are busy with other tasks. Unless they and their immediate managers are convinced that the data is important, they either will not gather it or will not be very careful when they do.”
So what can the metrics providers do to help make certain that the data owners collect good data? First, the metrics providers can design metrics that are as objective and unambiguous as possible. A data item is objective if it is collected the same way each time. Subjective data can also be valuable. For example, customer satisfaction metrics are typically subjective. However, the goal is to make the data as objective as possible. A data item is unambiguous if two different people collecting the same measure for the same item will collect the same data. This requires standardized definitions and well-defined measurement methods.
Second, the metrics provider can design the metrics and establish data collection mechanisms that are as unobtrusive and convenient as possible. Data collection must be an integral part of the software development process and not some outside step that detracts from the “real work.” Data collection must be simple enough not to disrupt the working patterns of the individual collecting the data any more than absolutely necessary. In some cases this means automating all or part of the data collection. For example, having a pull-down pick list is much more convenient than making the data owner type in “critical,” “major,” or “minor,” and it can also contribute to more accurate data by eliminating misspellings or abbreviations. As another example, do not make the data owner type in the current date. The computer knows the date, so have the computer supply it by default.
Third, the data owners must be trained in how to collect the data so that they understand what to do and when to do it. For simple collection mechanisms, training can be short (≤ one hour). Hands-on, interactive training, where the group works with actual data collection examples, often provides the best results. Without this training, hours of support staff time can be wasted answering the same questions over and over again. An additional benefit of training is promoting a common understanding about when and how to collect the data. This reduces the risk of invalid and inconsistent data being collected.
Fourth, the metrics providers must feed the metrics information back to the data owners so that they can see that the data items are used and not just dropped into a black hole. Better yet, the metrics providers can use the data to create metrics that are directly useful to the data owners. There must also be people assigned to support the data collection effort so that the data collectors can get their questions answered and so that data issues and collection problems are handled in a timely manner.
How Should the Data Be Collected?
The answer to the question how should the data be collected, is automate, automate, automate, and automate:
There is widespread agreement that as much of the data-gathering process as possible should be automated. At a minimum, standardized forms should be used for data collection, but at some point the data from these forms must be entered into a metrics database if it is to have any long-term usefulness. Information that stays on forms can quickly get buried in file drawers, never to see the light of day again. In order for data to be useful and used, easy access to the data is required. The people who need the data have to be able to get to them easily. The easier the data items are to access, the easier it is to generate timely metrics reports.
Dumping raw data and hand-tallying or calculating measurements is another way to introduce human error into the measured values. Even if the data are recorded in a simple spreadsheet, automatic sorting, extracting, and calculations are available and should be used. They also increase the speed of producing the metrics and therefore can help increase timeliness.
Automating metrics reporting and delivery eliminates hours spent standing in front of copy machines. It also increases usability because the metrics are readily available on the computer. Remember, metrics are expensive. Automation can reduce the expense while making the metrics available in a timely manner.
Quality Data and Measurement Error
In order to have high-quality data, the measurements taken need to be as free from error as possible. A measurement error occurs when the measured value or data item collected (the assigned number or symbol) differs from the actual value (that would be mapped to the attribute of the entity in a perfect world). For example, a person might be measured as 6 foot 2 inches tall, while in reality that person might be 6 foot 2.1385632 inches tall. The 0.1385632 inches is measurement error. According to Arthur (1985), measurement error can happen for a number of reasons, including:
Data Accuracy
What if data has measurement error and is inaccurate? Can inaccurate data be accepted and used anyway? There are two reasons why the answer may be yes. First, the data may be accurate enough to meet the required objectives. Go back to the time sheet example. When reconstructing the data, the engineer may overstate one project by a few hours one week and understate it the next. The result is still a fairly good approximation of total time spent over the life of the project. When estimating time requirements for future projects, the time card data may be accurate enough for its intended use.
The second reason for reporting inaccurate data is to make it accurate. For example, a metrics analyst created metrics reports to show the trend over time for non-closed problems. The project managers then complained that the reports were inaccurate. They said that many of the problems were actually closed, but the database had not been updated. However, upper management utilized these reports to continue the focus on reducing the backlog of uncorrected problems. This provided the leverage needed to update the data and increase its accuracy.
If metrics providers wait for 100 percent data accuracy, they may never produce any metrics. Remember—good enough is good enough. On the other hand, metrics providers and users need to be aware of data inaccuracies, and consider those inaccuracies when determining how reliable and valid the metrics are. This awareness can help determine the level of confidence that can be placed in the measurement results.
Data Completeness
What if the data are incomplete? Again there are similar circumstances where the metrics provider can still use the data. In the time sheet example, the data might be considered incomplete because they do not include overtime. In this case, the metrics provider can change the metrics algorithm for estimating the next project to take this into consideration. If they know engineers are working about 10 hours a week unpaid overtime, they can use a 1.25 multiplication factor in the model to estimate total engineering effort for the next project.
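The overtime adjustment described above is simple arithmetic; as a sketch (the 40-hour week and 10 overtime hours are the text's example values):

```python
# If time sheets omit roughly 10 hours of unpaid overtime per 40-hour
# week, scale recorded effort by (40 + 10) / 40 = 1.25.
RECORDED_HOURS_PER_WEEK = 40
UNRECORDED_OVERTIME_HOURS = 10

def adjusted_effort(recorded_hours):
    """Estimate total engineering effort from recorded (incomplete) hours."""
    factor = (RECORDED_HOURS_PER_WEEK + UNRECORDED_OVERTIME_HOURS) / RECORDED_HOURS_PER_WEEK
    return recorded_hours * factor

# adjusted_effort(4000) -> 5000.0 hours of estimated total effort
```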
The reporting of incomplete data can be used to make the data more complete. When one company first started reporting code review metrics, the initial reports indicated that only a small percentage of the code was being reviewed. Actually, a much higher percentage was reviewed, but the data were not being recorded in the database. By reporting the incomplete data, emphasis was placed on the need to record accurate and complete data so that the measurement reflected the actual accomplishments. This is especially true when projects recording data use that data to demonstrate both cost and time savings to upper management.
However, as with accuracy, metrics providers and users need to be aware of data incompleteness as a consideration when determining how reliable and valid the metrics are, so they can determine the level of confidence that can be placed in the measurement results.
Data Timeliness
Data collection and reporting must be timely. For example, if a supplier is trying to make a decision on whether or not to ship a software product, having defect status data from the previous week is of little value. Two time frames need to be considered when talking about timeliness and data collection:
  1. Data collection time: Is the data being collected in a timely manner? The longer the time period between the time an event happens and the time when the data about that event are collected, the less integrity the data probably has. For example, if an engineer records the time spent on each task (effort) as that engineer is working on those tasks, the effort data will be more accurate than they would be if the engineer waited until the end of the week to complete the data on a time sheet.
  2. Data availability time: Is the collected data being made available in a timely manner? If there is a large gap of time between when the data are collected and when they are available in the metrics database, the integrity of the reported metrics can be impacted. For example, even if inspection data are collected on forms during the actual inspection, those forms may sit for some period of time before the data are entered into the inspection database.
There are also two time frames to consider when talking about timeliness and data reporting:
  1. Data extraction time: Looks at the timeliness of extracting the data from the metrics database. This timing may be an issue in information warehouse environments where data are extracted from operational systems on a periodic basis and made available for metrics reporting. For example, data extracted the previous midnight might be considered untimely for an immediate ship/no ship decision, while being perfectly timely for monthly summary reports.
  2. Data reporting time: Looks at the timeliness of report generation and distribution of the extracted data. This timing may especially be an issue if the metrics are hand-calculated and distributed in hard copy.
The timing of data synchronization is also important. If the data for a metrics report are pulled while data are being input or refreshed, part of the data extraction could be old and out of sync with another part that reflects newer data. Data synchronization may also be a problem if data collection, availability, or extraction times vary for different data sets used in the same metrics report. If these data time frames are not synchronized for the measurements being analyzed, the analysis will be severely limited or even invalid.
Measurements Affect People
Measurements affect people, and people affect measures. The simple act of measuring affects the behavior of the individuals who are performing the processes and producing the products and services being measured. When something is measured, it is automatically assumed to be important. People want to look good, so they want the measures to look good, and they will modify their behavior to focus on the areas being measured. For example, most people study harder when there is going to be a test than they would simply from taking a class or wanting to learn something new.
This is known as the Hawthorne effect. The Hawthorne effect was first noticed in the Hawthorne Works plant where production processes were being studied to determine the impact of various conditions (for example, lighting levels and rest breaks) on productivity. However, each change in these conditions resulted in overall increases in productivity, including the return to the original conditions. Measurements like this were done across multiple aspects of the workers’ behavior. It was concluded that productivity increases, and other positive improvements, were not a consequence of actual changes in working conditions, but happened because of the simple act of measurement. Measurement gave attention (demonstrated interest by management) to values being measured and therefore caused the workers to endeavor to make those measurements improve.
Whether a metric is ultimately useful to an organization or not depends on the attitudes of the people involved. Therefore, the organization must consider these human factors when selecting metrics and implementing their metrics program.
People Affect Metrics and Measuring
People also affect measures through their behaviors by:
When implementing a metric, the people doing the implementation must always decide what behaviors they want to encourage, and then take a long look at what other, negative behaviors might result from the use or misuse of the metric. For example, a team implemented a metric to monitor an initiative aimed at reducing the number of unfixed problem reports each development team had in its backlog. The desired behavior was for managers to dedicate resources to this initiative and for engineers to actively work to correct the problems and thus remove them from the backlog. However, on the negative side, some people tried to beat the metrics by:
To quote an unknown source, “Do not underestimate the intelligence of your engineers. For any one metric you can come up with, they will find at least two ways to beat it.” While the goal is to appropriately close out problems and improve product quality by decreasing the product defect backlog, the organization also does not want to have to rediscover “known” problems later in operations because they have been forgotten or inappropriately closed.
Metric and Measurement Do’s and Don’ts
There are ways to minimize the negative impacts of implementing software metrics. The following is a list of do’s and don’ts that can help increase the probability of implementing a successful metrics program. To minimize negative impacts:
Other Human Factors
A famous quote usually attributed to Mark Twain talks about the three kinds of lies: “lies, damn lies, and statistics.” Marketers use statistical tricks all the time to help sell their products. When an organization puts its metrics program together, it needs to be aware of these issues and make conscious decisions about how to display the metrics for maximum usefulness and minimum “deviousness.” If engineers and managers catch the metrics producer playing a marketing-type trick to make the metrics look good, they will stop believing the metrics.