5
Number-Crunching Tools
If you are a numbers person, this should be the golden age, as tasks that would have taken you months to complete a few decades ago can be done in a few seconds with the help of technology. The ease of access to numbers and tools has also made everyone a number cruncher, with mixed results; as we noted in the last chapter, numbers can be misused, manipulated, or just misread. In this chapter I break the number-crunching process into three parts, starting with the collection of data, moving on to the analysis of that data, and ending with presenting the data to others. At each stage, I look at practices that you can use to minimize bias and error when working with numbers or to detect bias and error when presented with numbers and models.
From Data to Information: The Sequence
Is this the data age or is it the information age? I am not sure, and the reason is that data and information are used interchangeably even though they represent very different concepts. The data is what we start with, the raw numbers, and defined as such, we are in the data age, when it is possible to collect and store massive quantities of these numbers. Data has to be processed and analyzed for it to become information, and it is here that we face a conundrum. The proliferation of data means not only that we have far more data to process but also that the data can offer contradictory signals, making it more difficult to convert into information. Thus, it is data overload that we face, not information overload.
There are three steps in the data-to-information process, and this chapter will be built around these steps, since at each step there is both promise and peril.
  1.  Data collection: The first step is collecting the data. In some cases, this can be as simple as accessing a computerized database. In others, it will require running experiments or surveys.
  2.  Data analysis: Once the data has been collected, not only does it have to be summarized and described, but you have to look for relationships in the data that you can use in your decision making. It is at this stage that statistical analysis comes into play.
  3.  Presentation: Having analyzed the data, you have to present it, not only so others can see and use the information you have gleaned from the data, but so you yourself have a sense of what that information is.
At each stage in the process it is easy to get lost in the details as you get drawn into data-collection debates, statistical arguments, and discussions about whether you should use bar graphs or pie charts. You have to remember that your endgame is to use the information to make better decisions and that anything that helps you do that is good and anything that leads you away from it is a distraction.
Collecting Data
The first step in the process is collecting data, a time-consuming and manual process for much of mankind’s history. Broadly speaking, that data can either come from records maintained by an organizing entity (government, security exchange, regulator, private business), from surveys, or from experiments. As more and more of our transactions are done on or with computers, the data is generally recorded online, making the job of creating and maintaining databases simpler.
The Choices in Data Collection
In using data, a fundamental question is how much data is enough. At the risk of oversimplifying this choice, you will often have to decide between a smaller sample of meticulously collected and curated data and a larger sample with noisy and potentially erroneous data. In making that decision, you should be guided by the law of large numbers, one of the building blocks of statistics. Put simply, the law of large numbers states that as sample sizes get larger, statistics computed from that sample (such as the average) become more precise estimates of the true values. If that sounds too convenient, the intuition is straightforward: as sample sizes get larger, the errors you may have on individual data points get averaged out.
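A minimal Python sketch of that intuition, using simulated data rather than any real sample: as the number of noisy observations grows, their average settles down around the true value.

```python
import random

random.seed(42)

def noisy_sample_mean(n, true_mean=100.0, noise_sd=20.0):
    """Average n observations drawn around a true mean with a lot of noise."""
    draws = [random.gauss(true_mean, noise_sd) for _ in range(n)]
    return sum(draws) / n

for n in (10, 100, 10_000):
    print(f"n = {n:>6}: sample mean = {noisy_sample_mean(n):.2f}")
# As n grows, the sample mean settles ever closer to the true mean of 100,
# even though each individual observation is noisy.
```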
Assuming that you are getting a sampling of the process you want to understand, you then have to make decisions on what will constitute the sample. In the context of financial data, for instance, here are some choices you will face:
  1.  Public company versus private company data: Publicly traded companies in most of the world face information disclosure requirements; they have to make their financial statements available to the public. As a consequence, it is far easier to access data on public companies than their privately owned counterparts.
  2.  Accounting versus market data: With public companies, not only do you have access to financial statement data but you also can obtain data from financial markets on price movements and transactions data (bid-ask spread, trading volume).
  3.  Domestic versus global data: Many researchers, and especially those in the United States, tend to stay focused on U.S. data, partly because they trust and understand it more and partly because it is easier to access, for most of them. As both companies and investors globalize, that domestic focus may no longer be appropriate, especially if the decisions you are making have global consequences.
  4.  Quantitative versus qualitative data: Databases tend to be skewed heavily toward quantitative data, partly because that is the bulk of the data collected and partly because it is so much easier to store and retrieve than qualitative data. Consequently, it is easy to obtain data on the number of directors at each publicly traded company but much more difficult to get data on how much dissension there is at board meetings at companies. One of the outgrowths of the surge in social media sites is the development of more sophisticated techniques for reading, analyzing, and storing qualitative data.
Your choices in what types of data you collect can affect the results you obtain, because your choices can create bias in your samples, often implicitly.
Data-Collection Biases
For those who still hold onto the belief that data is objective, a closer look at the data-collection process is often all that is needed to dispel that belief. In particular, when sampling, there are at least two biases that represent clear and present danger if your objective is to be unbiased and great opportunity if you have an agenda that you want to advance.
SELECTION BIAS
As we are all taught in our introductory statistics classes, it is perfectly reasonable to sample from a larger population and draw conclusions about that population, but only if that sample is random. That may sound like a simple task, but it can be very difficult to accomplish in the context of business and investing.
•  In some cases, the sampling bias you introduce can be explicit when you pick and choose the observations in your sample to deliver the result you want. Thus, a researcher who starts off with the objective of showing that companies generally take good investments may decide to use only companies in the S&P 500 in his sample. Since these are the largest market capitalization companies in the United States and they reached that status because of their success in the past, it should not be surprising that they have a history of taking good investments, but that result cannot be generalized to the rest of the market.
•  In other cases, the bias can be implicit and embedded in what you may believe are innocuous choices that you had to make in what data you would collect. For instance, restricting your sample to just publicly traded companies may be a choice that is thrust upon you, because the databases you use contain only those companies. However, the results you get from this data may not be generalizable to all businesses, since privately owned businesses tend to be smaller and more localized than public companies.
As a general rule, I find it useful when I sample data to also take a look at the data I exclude from my sample, just to be aware of biases.
SURVIVOR BIAS
The other challenge in sampling is survivor bias, that is, the bias introduced by ignoring the portions of your universe that have been removed from your data for one reason or another. As a simple example of survivor bias, consider research done by Stephen Brown, my colleague at New York University, on hedge fund returns. While many studies looking at hedge fund returns over time had concluded that they earned “excess” returns (over and above expectations), Professor Brown argued that the mistake many analysts were making was that they were starting with the hedge funds in existence today and working backward to see what returns those funds earned over time. By doing so, the analysts were missing the harsh reality of the hedge fund business, which is that the worst-performing hedge funds go out of business and not counting the returns from those funds pushes up the computed return for the sample. His research concluded that survivor bias adds about 2–3 percent to the average hedge fund return. In general, survivor bias will be a bigger issue with groups that have high failure rates, thus posing a more significant problem for investors looking at tech start-ups than for those looking at established consumer product companies.
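To see the mechanics of survivor bias, here is a small Python sketch with made-up fund returns (not Professor Brown's data): averaging only the funds that are still alive flatters the result relative to the full universe.

```python
# Hypothetical fund records; the names, returns, and statuses are illustrative only.
funds = [
    ("Fund A", 12.0, "alive"),
    ("Fund B", 9.0, "alive"),
    ("Fund C", 15.0, "alive"),
    ("Fund D", -18.0, "defunct"),  # shut down and dropped from the database
    ("Fund E", -25.0, "defunct"),  # shut down and dropped from the database
]

full_average = sum(ret for _, ret, _ in funds) / len(funds)
survivor_returns = [ret for _, ret, status in funds if status == "alive"]
survivor_average = sum(survivor_returns) / len(survivor_returns)

print(f"Average return, full universe:  {full_average:.1f}%")      # includes the failed funds
print(f"Average return, survivors only: {survivor_average:.1f}%")  # what a backward-looking sample sees
```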
NOISE AND ERROR
In this age of computer data, when less of the data is entered by hand, we have learned to trust the data more, perhaps too much. Even in the most carefully maintained databases, there will be data input errors, some of which are large enough to alter the results of your study. Consequently, it behooves researchers to do at least a first pass at the data to catch the big errors.
The other problem is missing data, either because the data is not available or because it did not make it into the database. One solution is to eliminate observations with missing data, but not only will this reduce your sample size, it may introduce bias if missing data is more common in some subsets of the population than in others. It is a problem I face increasingly, as I have moved from U.S.-centric data to global data. To provide an example, I consider lease commitments to be debt and convert them into a debt equivalent when looking at how much a company owes; U.S. companies are required to disclose these commitments in their filings, but in many emerging markets, especially in Asia, there is no such requirement. I have two choices. One is to go back to a conventional debt definition, which does not include leases, but I will then be settling for a much poorer measure of financial leverage than I could be using for the half of my global sample that does report leases. The other is to eliminate all firms that do not disclose lease commitments from any sampling for financial leverage, and not only lose half of my sample but create significant bias. I settle for an intermediate choice: I use lease commitments as debt for U.S. firms, and for non-U.S. firms I approximate future commitments based on the current year's lease expense.
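A rough Python sketch of that kind of approximation; the eight-year commitment horizon and the 4 percent pre-tax cost of debt are illustrative assumptions, not prescribed values.

```python
def lease_debt_from_expense(current_lease_expense, pretax_cost_of_debt=0.04, assumed_years=8):
    """
    Approximate the debt value of leases when future commitments are not disclosed:
    treat the current year's lease expense as an annuity over an assumed number of
    years and discount it back at a pre-tax cost of debt.
    """
    r, n = pretax_cost_of_debt, assumed_years
    annuity_factor = (1 - (1 + r) ** -n) / r
    return current_lease_expense * annuity_factor

# A firm reporting $150 million in lease expense this year (illustrative number):
print(f"Approximate lease debt: ${lease_debt_from_expense(150):,.0f} million")
```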
Data Analysis
I enjoyed my statistics classes in college, but I did find them abstract, with few real-world examples that I could relate to. That is a pity, because if only I had known then how critical statistics is to making sense of data, I would have paid more attention.
Tools for Data Analysis
When faced with a large data set, you want to start by summarizing the data into summary statistics before you embark on more complex analysis. The first two statistics to compute are the mean and the standard deviation, with the mean representing the simple average of all of the data points and the standard deviation capturing how much variability there is around that average. If the numbers are not distributed evenly around the average, the mean may not be the most representative number for the sample, and you may instead estimate the median (the 50th percentile of the numbers in your sample) or the mode (the number that occurs most frequently in your sample). There are other summary statistics designed to capture the shape of the distribution, with skewness measuring how asymmetrically the numbers are spread around the average and kurtosis the frequency of numbers that are very different from the mean (the fatness of the tails).
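A short Python sketch of these summary statistics, run on a small, made-up sample of PE ratios rather than the full U.S. data:

```python
import statistics as st
from scipy.stats import kurtosis, skew

# A small, made-up sample of PE ratios, with one very large value at the end.
pe_ratios = [8.2, 10.5, 12.1, 13.4, 14.0, 15.2, 16.8, 18.9, 22.3, 25.0, 60.0]

print(f"Mean:     {st.mean(pe_ratios):.1f}")
print(f"Stdev:    {st.stdev(pe_ratios):.1f}")
print(f"Median:   {st.median(pe_ratios):.1f}")   # far less affected by the 60.0 than the mean
print(f"Skewness: {skew(pe_ratios):.2f}")        # positive: a long right tail
print(f"Kurtosis: {kurtosis(pe_ratios):.2f}")    # positive: fatter tails than a normal
```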
For those who prefer a more visual description of the data, you will often see the numbers graphed out in a distribution. If your data is discrete, that is, it can take only a finite number of values, you can count the number of times each value occurs and create a frequency table, which you can then graph as a frequency distribution. For example, in the frequency table and distribution in figure 5.1, I report on the bond ratings (S&P rating, a discrete measure that takes on alphabetical values) for U.S. companies at the start of 2016.
Figure 5.1
S&P bond ratings for U.S. companies, January 2016.
Source: S&P Capital IQ (for raw data).
If your data is continuous, that is, it can take any value between a minimum and maximum value, you can classify your numbers into smaller groupings, count the number in each group, and graph the results in a histogram. If the histogram that you have is close to a standardized probability distribution (normal, lognormal, exponential), you can then draw on the properties of these standardized distributions to make statistical judgments about your data. To illustrate, I have graphed out the distribution of price–earnings (PE) ratios for all companies in the United States for which a PE ratio could be computed at the end of 2015 (figure 5.2).
Figure 5.2
PE ratio for U.S. companies, January 2016. Source: Damodaran Online (http://www.damodaran.com).
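A sketch of how the two kinds of counts behind figures 5.1 and 5.2 can be built, using short, made-up samples rather than the full U.S. data:

```python
from collections import Counter
import numpy as np

# Discrete data: count each rating to build a frequency table (illustrative ratings).
ratings = ["AAA", "AA", "A", "A", "BBB", "BBB", "BBB", "BB", "B", "CCC"]
print(Counter(ratings))   # e.g., Counter({'BBB': 3, 'A': 2, 'AAA': 1, ...})

# Continuous data: bucket PE ratios into groups and count each bucket (a histogram).
pe_ratios = np.array([6.5, 9.1, 11.3, 12.8, 14.2, 15.0, 16.7, 19.4, 24.8, 41.0])
counts, bin_edges = np.histogram(pe_ratios, bins=[0, 8, 16, 24, 32, 40, 48])
for low, high, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"PE {low:>2.0f}-{high:<2.0f}: {count} firms")
```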
Finally, there are statistical measures and tools designed to measure how two or more variables move with each other. The simplest of these is the correlation coefficient, a number that is bounded between +1 (when two variables move in perfect harmony, in the same direction) and −1 (when two variables move in perfect harmony, in opposite directions). A close variant is the covariance, which also measures the comovement of two variables but is not bounded by −1 and +1. The easiest way to visualize the relationship between two variables is with a scatter plot, where the values of one variable are plotted against the other. In figure 5.3, for instance, I plot the PE ratios for U.S. companies against expected growth rates in earnings (as estimated by analysts) to see whether there is any truth to the conventional wisdom that higher-growth companies have higher PE ratios.
Figure 5.3
Trailing PE versus expected growth in earnings per share (EPS) for the next five years for U.S. companies, January 2016.
The good news for conventional wisdom is that it is true, in the aggregate, since the correlation between PE and growth is positive, but the bad news is that it is not that strong, since the correlation is only 20 percent. If the objective is to use one variable to predict another, the tool that fits well is a regression, with which you find a line that best fits the two variables. Graphically, a simple regression is most easily visualized in the scatter plot, and I report the results of a regression of PE against expected growth rates in figure 5.3. The numbers in brackets in the regression are t statistics, with t statistics above 2 indicating statistical significance. Based on the regression, every 1 percent increase in expected growth translates into an increase of 0.441 in the PE ratio, and you can use the regression to predict the PE ratio for a firm with an expected growth rate of 10 percent:
Predicted PE = 19.86 + 44.10 (0.10) = 24.27
Note that this predicted PE comes with a wide range, reflecting the low predictive power of the regression (captured in the R-squared of 21%). The biggest advantage of a regression is that it can be extended to multiple variables, with a single dependent variable (the one you are trying to explain) linked to many independent variables. Thus, if you wanted to examine how the PE ratios of companies are related to the risk, growth, and profitability of those companies, you could run a multiple regression of PE (dependent variable) against proxies for growth, risk, and profitability (independent variables).
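As a sketch of the mechanics (not the actual regression behind figure 5.3), here is a simple regression in Python; the growth-PE pairs are made up, so the coefficients will not match the ones reported above.

```python
import numpy as np
from scipy.stats import linregress

# Made-up (expected growth, PE) pairs; figure 5.3 uses the full U.S. sample instead.
growth = np.array([0.02, 0.04, 0.05, 0.07, 0.08, 0.10, 0.12, 0.15, 0.18, 0.20])
pe = np.array([11.0, 13.5, 12.0, 16.0, 18.5, 17.0, 22.0, 25.5, 24.0, 30.0])

fit = linregress(growth, pe)
print(f"PE = {fit.intercept:.2f} + {fit.slope:.2f} * expected growth")
print(f"R-squared = {fit.rvalue ** 2:.2f}")

# Use the fitted line to predict the PE for a firm with 10 percent expected growth.
print(f"Predicted PE at 10% growth: {fit.intercept + fit.slope * 0.10:.2f}")
```

A multiple regression works the same way, with several independent variables on the right-hand side; numpy's least-squares routines or the statsmodels package handle that case.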
Biases in Analysis
The fact that we have statistical tools at our disposal that will do all of what we described in the last section and more is a mixed blessing, since it has opened the door to what can be best classified as “garbage in, garbage out” analysis. Looking at the state of data analysis in business and finance, here are a few of my observations:
  1.  We put too much trust in the average: With all of the data and analytical tools at our disposal, you would not expect this, but a substantial proportion of business and investment decisions are still based on the average. I see investors and analysts contending that a stock is cheap because it trades at a PE that is lower than the sector average or that a company has too much debt because its debt ratio is higher than the average for the market. The average is not only a poor central measure on which to focus in distributions that are not symmetric, but it strikes me as a waste to not use the rest of the data. While an analyst in the 1960s could have countered with the argument that using all of the data was time-consuming and unwieldy, what conceivable excuse can be offered in today’s data environment?
  2.  Normality is not the norm: One of the shameful legacies of statistics classes is that the only distribution most of us remember is the normal distribution. It is an extremely elegant and convenient distribution, since it can not only be fully characterized by just two summary statistics, the mean and the standard deviation, but it lends itself to probability statements such as “that has only a 1 percent chance of happening since it is 3 standard deviations away from the mean.” Unfortunately, most real-world phenomena are not normally distributed, and that is especially true for data we look at in business and finance. In spite of that, analysts and researchers continue to use the normal distribution as their basis for making predictions and building models and are constantly surprised by outcomes that fall outside their ranges.1
  3.  The outlier problem: The problem with outliers is that they weaken your findings, muddying the relationships you are trying to establish and lowering the fit of your models. Not surprisingly, the way researchers respond to outliers is by ridding themselves of the source of the trouble. Removing outliers, though, is a dangerous game; it opens the door to bias, since outliers that don’t fit your priors are quick to be removed, while outliers that do fit your priors are retained (the sketch below makes the point). In fact, if you view your job in business and investing as dealing with crises, you can argue that it is the outliers you should be paying the most attention to, not the data that neatly fits your hypothesis.
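A minimal sketch of that danger, with made-up annual returns: drop the two crisis years as "outliers" and the average looks far healthier than the history actually was.

```python
import statistics as st

# Made-up annual returns (percent), including two crisis years.
returns = [8.0, 12.0, 6.0, 10.0, 9.0, 11.0, -35.0, 7.0, 13.0, -28.0]

trimmed = [r for r in returns if r > -20.0]   # "cleaning out" the crisis years

print(f"Mean with outliers:    {st.mean(returns):.1f}%")   # 1.3%
print(f"Mean without outliers: {st.mean(trimmed):.1f}%")   # 9.5%
# If crises are exactly what your decision has to survive, the trimmed
# number is the misleading one.
```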
Data Presentation
If you are collecting and analyzing data for your own decision making, you may be ready to make your best judgments after your data analysis. However, if you are crunching numbers for a decision maker or team or have to explain your decision to others, you will need to find ways to present the data to an audience that is neither as conversant with nor as interested in the data as you are.
Presentation Choices
The first way you can present the data is in tables, and there are two types of tables. The first is reference tables, which contain large amounts of data and allow people to look up specific data on individual segments. Thus, the tax rate data that I have, by sector, on my website is an example of a reference table. The second is demonstration tables, which are summary tables, whose objective is to show differences (or the lack thereof) between subgroups of the data.
The second way you can show data is in charts, and while there are a multitude of different chart types, the three most commonly used are listed below (a short plotting sketch follows figure 5.4):
  1.  Line charts: Line charts work best for showing trend lines in data across time and for comparing different series. In figure 5.4, I look at the equity risk premium for U.S. stocks and the U.S. Treasury bond rate for each year from 1960 to 2015. It allows me not only to describe how equity risk premiums have risen and fallen over different time periods but also to show how they have moved with risk-free rates.
  2.  Column and bar charts: Column and bar charts are most suited for comparing statistics across a few subgroups. For illustration, you can compare the PE ratios for companies in five different markets or five different sectors to see whether one or more of them is an outlier.
  3.  Pie charts: A pie chart is designed to illustrate the breakdown of a whole into component parts. Thus, I can use a pie chart to illustrate the parts of the world where a company gets its revenues or the businesses that a multibusiness company is in.
Figure 5.4
Equity risk premiums and Treasury bond rate, 1961–2015.
Source: Damodaran Online (http://pages.stern.nyu.edu/~adamodar).
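For readers who want to see the mechanics, here is a short matplotlib sketch that produces one chart of each type from made-up numbers; the data and labels are placeholders, not the series behind figure 5.4.

```python
import matplotlib.pyplot as plt

years = [2011, 2012, 2013, 2014, 2015]
erp = [6.0, 5.8, 5.0, 5.8, 6.1]              # illustrative equity risk premiums (%)
regions = ["Americas", "Europe", "Asia"]
revenue_share = [45, 30, 25]                 # illustrative revenue breakdown (%)

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].plot(years, erp, marker="o")         # line chart: a trend over time
axes[0].set_title("Trend over time")
axes[1].bar(regions, revenue_share)          # column chart: compare a few subgroups
axes[1].set_title("Comparison across groups")
axes[2].pie(revenue_share, labels=regions, autopct="%d%%")   # pie chart: parts of a whole
axes[2].set_title("Breakdown of a whole")
plt.tight_layout()
plt.show()
```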
I am a fan of Edward Tufte, a visionary when it comes to presenting data, and I agree with him that we need to go beyond the restrictive and dull bounds of spreadsheet programs to create pictures that better convey the story in the data. In fact, presenting data more creatively is a discipline in itself and has spawned research, new visualization tools (infographics), and new businesses to further the use of these tools.
Presentation Biases and Sins
All through this chapter we have noted how bias creeps into the process, either implicitly or explicitly. At the data-collection stage, it shows up in samples designed to deliver the results you want; at the data-analysis stage, in how you deal with outliers. Not surprisingly, it finds its way into the data-presentation stage as well, in small but still significant ways, ranging from rescaling axes to make changes look bigger than they are, to using infographics meant more to mislead than to inform.
If there is one message that you should heed at the data-presentation stage, it is that less is more and that your objective is to not drown decision makers in three-dimensional graphs with dubious content but to build up to better decisions. So do not use a table when mentioning two numbers in the text will do; do not insert a graph when a table will suffice; do not give a graph a third dimension when two dimensions are all you need. I have been guilty of violating all of these rules at some time in my life and I perhaps will do so later in this book; if I do, I hope that you will call me out on my transgressions.
CASE STUDY 5.1: THE PHARMACEUTICAL BUSINESS—R&D AND PROFITABILITY, NOVEMBER 2015
To understand the drug business, I started my analysis in 1991, toward the beginning of a surge in spending on health care in the United States. The pharmaceutical companies at the time were cash machines, built on a platform of substantial upfront investments in research and development (R&D). The drugs generated by R&D that made it through the Food and Drug Administration (FDA) approval process and into commercial production were used to cover the aggregated cost of R&D and to generate significant excess profits. The key to this process was the pricing power enjoyed by the drug companies, the result of a well-defended patent process, significant growth in health-care spending, splintered health insurance companies, and lack of accountability for costs at every level (from patients to hospitals to the government). In this model, not surprisingly, investors rewarded pharmaceutical companies based on the amounts they spent on R&D (secure in their belief that the costs could be passed on to customers) and the fullness and balance of their product pipelines.
So how has the story changed over the last decade? The growth rate in health-care costs seems to have slowed down and the pricing power of drug companies has waned for many reasons, with changes in health-care laws being only one of many drivers. First, we have seen more consolidation within the health insurance business, potentially increasing its bargaining power with the pharmaceutical companies on drug prices. Second, the government has used the buying clout of Medicaid to bargain for better prices on drugs, and while Medicare still works through insurance companies, it can put pressure on drug companies to negotiate lower costs. Third, the pharmacies that represent the distribution networks for many drugs have also been corporatized and consolidated and are gaining a voice in the pricing process. The net effect of all of these changes is that R&D has much more uncertain payoffs and has to be evaluated like any other large capital investment: it is good only when it creates value for a business.
To test the hypotheses that pharmaceutical companies have lost pricing power between 1991 and 2014 and that R&D no longer has the revenue punch it used to have, I started by looking at the average profit margins at pharmaceutical companies in figure 5.5, using different measures of profit (net income, operating income, earnings before interest, taxes, depreciation, and R&D) for each year:
Figure 5.5
Pharmaceutical companies: profit margins.
The evidence is only weakly supportive of the hypothesis that pricing power has waned over the period, since margins, while down slightly, have not changed much over the time period.
I followed up by examining whether the payoff to R&D, in terms of revenue growth, has slowed over time, comparing R&D spending as a percent of sales in each year with revenue growth in that same year, from 1991 to 2014 (table 5.1).
Table 5.1
Payoff to R&D in Revenue Growth
Year          R&D/sales    Revenue growth rate    Growth to R&D ratio
1991          10.17%       49.30%                  4.85
1992          10.64%        6.40%                  0.60
1993          10.97%        3.58%                  0.33
1994          10.30%       15.85%                  1.54
1995          10.37%       17.32%                  1.67
1996          10.44%       11.38%                  1.09
1997          10.61%       13.20%                  1.24
1998          11.15%       19.92%                  1.79
1999          11.08%       15.66%                  1.41
2000          11.41%        8.15%                  0.71
2001          13.74%       −8.17%                 −0.59
2002          13.95%        4.80%                  0.34
2003          14.72%       16.26%                  1.10
2004          14.79%        8.17%                  0.55
2005          15.40%        1.49%                  0.10
2006          16.08%        2.86%                  0.18
2007          16.21%        8.57%                  0.53
2008          15.94%        6.21%                  0.39
2009          15.58%       −4.87%                 −0.31
2010          15.17%       19.82%                  1.31
2011          14.30%        3.77%                  0.26
2012          14.48%       −2.99%                 −0.21
2013          14.28%        2.34%                  0.16
2014          14.36%        1.67%                  0.12
1991–1995     10.49%       18.49%                  1.80
1996–2000     10.94%       13.66%                  1.25
2001–2005     14.52%        4.51%                  0.30
2006–2010     15.80%        6.52%                  0.42
2011–2014     14.36%        1.20%                  0.08
I know there is a substantial lag between R&D spending and revenue growth, but as a simplistic measure of the contemporaneous payoff to R&D, I computed a growth to R&D ratio for each year:
Growth to R&D ratio = Revenue growth rate / R&D spending as a percent of sales
Notwithstanding its limitations, this ratio illustrates the declining payoff to R&D spending at pharmaceutical firms, dropping close to zero in the 2011–2014 period.
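The arithmetic is simple enough to check directly; a short Python verification against two of the annual entries in table 5.1:

```python
def growth_to_rd_ratio(revenue_growth_pct, rd_to_sales_pct):
    """Contemporaneous payoff measure: revenue growth per unit of R&D intensity."""
    return revenue_growth_pct / rd_to_sales_pct

# Recompute two of the annual ratios reported in table 5.1.
print(f"1991: {growth_to_rd_ratio(49.30, 10.17):.2f}")   # 4.85
print(f"2014: {growth_to_rd_ratio(1.67, 14.36):.2f}")    # 0.12
```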
What can we learn from this analysis? First, pharmaceutical companies remain profitable, notwithstanding the significant changes in the health-care business in the United States. Second, pharmaceutical companies have not cut back on internal R&D spending as much as some stories suggest they have. Third, the R&D table suggests that pharmaceutical companies should be spending less money on R&D, not more, as the growth payoff to R&D becomes lower and lower. Finally, it provides at least a partial explanation for why some pharmaceutical companies have embarked on the acquisition path, focusing on buying younger, smaller companies for the products in their research pipelines.
CASE STUDY 5.2: EXXONMOBIL’S OIL PRICE EXPOSURE, MARCH 2009
In chapter 13, I will describe a valuation of ExxonMobil in March 2009, in which the primary problem I faced was that oil prices had dropped significantly (to $45 a barrel) in the six months leading up to the valuation, but much of the financial data (including revenue and earnings) that ExxonMobil was reporting reflected a prior year, when oil prices had averaged almost $80 a barrel. While the obvious insight is that the trailing twelve-month earnings are too high, given the drop in oil prices, I still faced the challenge of trying to adjust the company’s earnings to the lower oil price.
To see how sensitive ExxonMobil’s earnings were to oil prices, I collected historical data on its operating income and the average oil price each year from 1985 to 2008. The numbers are in figure 5.6.
Figure 5.6
ExxonMobil operating income versus average oil price.
I also ran a regression of ExxonMobil’s operating income on the average oil price each year. Not only is the company’s operating income determined almost entirely by oil price levels (with an R-squared exceeding 90%), but you can use the regression to get an adjusted operating income for it at the prevailing oil price of $45 a barrel.
ExxonMobil oil-price-adjusted operating income = −$6,394.9 million + $911.32 million (45) = $34,615 million
While this operating income of about $34.6 billion was substantially lower than the reported income over the prior twelve months, this is the operating income that you will see me use in my Exxon valuation.
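Using the intercept and slope reported above, here is a short Python sketch of the oil-price adjustment; the $60 and $80 rows are simply the same regression evaluated at other prices, not forecasts.

```python
# Coefficients from the operating-income-versus-oil-price regression in the text
# (both in millions of dollars).
INTERCEPT = -6394.9
SLOPE_PER_DOLLAR = 911.32

def adjusted_operating_income(oil_price_per_barrel):
    """ExxonMobil operating income implied by the regression at a given oil price."""
    return INTERCEPT + SLOPE_PER_DOLLAR * oil_price_per_barrel

for price in (45, 60, 80):
    print(f"Oil at ${price}/barrel -> operating income of about "
          f"${adjusted_operating_income(price):,.1f} million")
```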
CASE STUDY 5.3: A PICTURE OF VALUE DESTRUCTION—PETROBRAS
I am not particularly creative when it comes to converting data analyses into presentations that tell a story, but I do remain proud of an analysis I did in May 2015 of how Petrobras, the Brazilian oil company, put itself into a cycle of value destruction that reduced its market capitalization by close to $100 billion (see figure 5.7).
Figure 5.7
A roadmap to destroying value: Petrobras, 2015.
I undoubtedly violated many data-visualization rules and tried to pack too much into one picture, but I was trying to convey not only the collection of actions that Petrobras took that led to the value destruction but how these actions logically led one to the other. Thus, the massive investments in new reserves without regard to profits had to be financed with new debt issues because the company wanted to continue paying high dividends. The net effect was a value-destruction cycle that occurred over and over and destroyed value at an exponential pace.
Conclusion
In this chapter I looked at the three steps in using data, starting with data collection, moving on to analysis of that data, and finishing with how best to present that data. At each step you have to fight the urge to mold the data to fit your preconceptions. If you are willing to keep an open mind and learn from the data, it will augment your storytelling skills and lead to better investment and business decisions.