This chapter provides the statistical concepts essential for understanding risk management. There are many good textbooks on the topic; see Carol Alexander (2008). Here, we have chosen a selective approach. Our goal is to provide enough mathematical background to understand the rest of the book. It is fair to say that if you do not find it here, it is not needed later. As mentioned in the preface, this book tells a story; the math here is part of the plot. Therefore, we will include the philosophy and principles of statistical thinking and other pertinent topics that contribute to the development of the story, and we will not sidetrack the reader with unneeded theorems and lemmas.
Two schools of thought have emerged from the history of statistics: the frequentist and the Bayesian. Bayesians and frequentists hold very different philosophical views on what defines probability. From a frequentist perspective, probability is objective and can be inferred from the frequency of observations in a large number of trials. All parameters and unknowns that characterize an assumed distribution or regression relationship can be backed out from the sample data. Frequentists base their interpretations on a limited sample; as we shall see, there is a limit to how much data they can collect without running into other practical difficulties. Frequentists assume the true value of their estimate lies within the confidence interval that they set (typically at 95%). To qualify their estimate, they perform hypothesis tests that will (or will not) reject a hypothesis, in which case they treat it as false (or true).
Bayesians, on the other hand, interpret the concept of probability as “a measure of a state of knowledge or personal belief” that can be updated on arrival of more information (i.e., incorporates learning). Bayesians embrace the universality of imperfect knowledge. Hence probability is subjective; beliefs and expert judgment are permissible inputs to the model and are also expressed in terms of probability distributions. As mentioned earlier, a frequentist hypothesis (or estimate) is either true or false, but in Bayesian statistics the hypothesis is also assigned a probability.
Value at risk (VaR) falls under the domain of frequentist statistics—inferences are backed out from data alone. The risk manager, by legacy of industry development, is a frequentist.1
A random variable or stochastic variable (often just called variable) is a variable that has an uncertain value in the future. Contrast this to a deterministic variable in physics; for example, the future position of a planet can be determined (calculated) to an exact value using Newton’s laws. But in financial markets, the price of a stock tomorrow is unknown and can only be estimated using statistics.
Let X be a random variable. The observation of X (a data point) obtained by the act of sampling is denoted by a lowercase letter $x_i$ by convention, where the subscript i = 1, 2, . . . is a running index over the observations. In general, X can be anything—price sequences, returns, heights of a group of people, a sample of dice tosses, income samples of a population, and so on. In finance, variables are usually prices (levels) or returns (changes in levels). We shall discuss the various types of returns later and their subtle differences. Unless mentioned otherwise, we shall talk about returns as daily percentage changes in prices. In VaR, the data set we will be working with is primarily distributions of sample returns and distributions of profit and loss (PL).
Figure 2.1 is a plot of the frequency distribution (or histogram) of S&P 500 index returns using 500 days of data (Jul 2007 to Jun 2009). One can think of this as a probability distribution of events—each day’s return being a single event. So as we obtain more and more data (trials), we get closer to the correct estimate of the “true” distribution.
FIGURE 2.1 S&P 500 Index Frequency Distribution
We posit that this distribution contains all available information about risks of a particular market and we can use this distribution for forecasting. In so doing, we have implicitly assumed that the past is an accurate guide to future risks, at least for the next immediate time step. This is a necessary (though arguable) assumption; otherwise without an intelligent structure, forecasting would be no different from fortune telling.
In risk management, we want to estimate four properties of the return distribution—the so-called first four moments—mean, variance, skewness, and kurtosis. To be sure, higher moments exist mathematically, but they are not intuitive and hence of lesser interest.
The mean of a random variable X is also called the expectation or expected value, written μ = E(X). The mean or average of a sample x1, . . . , xn is just the sum of all the data divided by the number of observations n. It is denoted by $\bar{x}$ or
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
The Excel function is AVERAGE(.). It measures the center location of a sample. A word on statistical notation—generally, when we consider the actual parameter in question μ (a theoretical idea), we want to measure this parameter using an estimator (a formula). The outcome of this measurement is called an estimate, also denoted $\hat{\mu}$ (a value). Note the use of the ^ symbol henceforth.
The kth moment of a sample x1, . . . , xn is defined and estimated as:
$$\hat{m}_k = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^k$$
The variance or second moment of a sample is defined as the average of the squared distances to the mean:
$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2 \qquad (2.3)$$
The Excel function is VAR(.). It represents the dispersion from the mean. The square-root of variance is called the standard deviation or sigma σ. In risk management, risk is usually defined as uncertainty in returns, and is measured in terms of sigma. The Excel function is STDEV(.).
The skewness or third moment (divided by σ³) measures the degree of asymmetry of the sample distribution about its mean:
$$\hat{s} = \frac{1}{n\hat{\sigma}^3}\sum_{i=1}^{n} (x_i - \bar{x})^3 \qquad (2.4)$$
A positive (negative) skew means the distribution has a longer tail to the right (left). The Excel function is SKEW(.).
The kurtosis or fourth moment (divided by σ⁴) measures the “peakedness” of the sample distribution and is given by:
$$\hat{\kappa} = \frac{1}{n\hat{\sigma}^4}\sum_{i=1}^{n} (x_i - \bar{x})^4 \qquad (2.5)$$
Since the total area under the probability distribution must sum to a total probability of 1, a very peaked distribution will naturally have fatter tails. Such behavior is called leptokurtic. Its Excel function is KURT(.). A normal distribution has a kurtosis of 3. For convenience, Excel shifts the KURT(.) function such that a normal distribution gives an excess kurtosis of 0. We will follow this convention and simply call it kurtosis for brevity.
Back to Figure 2.1, the S&P distribution is overlaid with a normal distribution (of the same variance) for comparison. Notice the sharp central peak above the normal line, and the more frequent than normal observations in the left and right tails. The sample period (Jul 2007 to Jun 2009) corresponds to the credit crisis—as expected the distribution is fat tailed. Interestingly, the distribution is not symmetric—it is positively skewed! (We shall see why in Section 7.1.)
This is a pillar assumption for most statistical modeling. A random sample (y1, . . . , yn) of size n is independent and identically distributed (or i.i.d.) if each observation in the sample belongs to the same probability distribution as all others, and all are mutually independent. Imagine yourself drawing random numbers from a distribution. Identical means each draw must come from the same distribution (it need not even be bell-shaped). Independent means you must not meddle with each draw, like making the next random draw a function of the previous draw. For example, a sample of coin tosses is i.i.d.
A time series is a sequence X1, . . . , Xt of random variables indexed by time. A time series is stationary if the distribution of (X1, . . . , Xt) is identical to that of (X1+k, . . . , Xt+k) for all t and all positive integer k. In other words, the distribution is invariant under time shift k. Since it is difficult to prove empirically that two distributions are identical (in every aspect), in financial modeling, we content ourselves with just showing that the first two moments—mean and variance—are invariant under time shift.2 This condition is called weakly stationary (often just called stationary) and is a common assumption.
A market price series is seldom stationary—trends and periodic components make the time series nonstationary. However, if we take the percentage change or the first difference, the resulting series of price changes can often be shown to be stationary. This process is called detrending (or differencing) a time series and is a common practice.
Figure 2.2 illustrates a dummy price series and its corresponding return series. We divide the 200-day period into two 100-day periods, and compute the first two moments. For the price series, the mean moved from 4,693 (first half) to 5,109 (second half). Likewise, the standard deviation changed from 50 to 212. Clearly the price series is nonstationary. The return series, on the other hand, is stationary—its mean and standard deviation remained roughly unchanged at 0% and 0.5% respectively in both periods. Visually, a stationary time series typically looks like white noise.
FIGURE 2.2 Dummy Price and Return Series
An i.i.d. process will be stationary for finite distributions.3 The benefit of the stationarity assumption is we can then invoke the Law of Large Numbers to estimate properties such as mean and variance in a tractable way.
Expected values can be estimated by sampling. Let X be a random variable and suppose we want to estimate the expected value of some function g(X), where the expected value is μg ≡ E(g(X)). We sample n observations xi of X where i = 1, . . . , n. The Law of Large Numbers states that if the sample is i.i.d. then:
$$\frac{1}{n}\sum_{i=1}^{n} g(x_i) \;\longrightarrow\; E(g(X)) \quad \text{as } n \to \infty$$
For example, in equations (2.3) to (2.5) used to estimate the moments, as we take larger and larger samples, precision improves and our estimate converges to the (“true”) expected value. We say our estimate is consistent. On the other hand, if the sample is not i.i.d., one cannot guarantee that the estimate will converge (it may or may not); the estimate is said to be inconsistent. Needless to say, a modeler would strive for a consistent estimate, as this means its conclusion can (like any good scientific result) be reproduced by other investigators.
Let’s look at some examples. We have a coin flip (head +1, tail −1), a return series from a standard normal N(0,1) process, and a return series from an autoregressive process called AR(1). The first two are known i.i.d. processes; the AR(1) is not i.i.d. (each value depends on the previous one) and is generated using:
AR(1) process:
$$X_t = k_0 + k_1 X_{t-1} + \varepsilon_t \qquad (2.7)$$
where k0 and k1 are constants and εt, t = 1, 2, . . . is a sequence of i.i.d. random variables with zero mean and a finite variance, also known as white noise. Under certain conditions (i.e., |k1| ≥ 1), the AR(1) process becomes nonstationary.
Figure 2.3 illustrates the estimation of the expected value of the three processes using 1,000 simulations. For AR(1), we set k0 = 0, k1 = 0.99. The coin flip and normal process both converge to zero (the expected value) as n increases, but the AR(1) does not. See Spreadsheet 2.1.
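As a rough illustration of the same idea outside Excel, the Python sketch below (ours, not Spreadsheet 2.1) simulates the three processes and tracks the running mean estimate; the AR(1) parameters mirror those in the text.

```python
import numpy as np

np.random.seed(42)
n = 1000

# Coin flip: +1 for heads, -1 for tails (i.i.d.)
coin = np.random.choice([1, -1], size=n)

# Standard normal returns (i.i.d.)
normal = np.random.normal(0, 1, size=n)

# AR(1) process: X_t = k0 + k1 * X_{t-1} + eps_t (not i.i.d.)
k0, k1 = 0.0, 0.99
eps = np.random.normal(0, 1, size=n)
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = k0 + k1 * ar1[t - 1] + eps[t]

# Running mean estimate after each additional observation
def running_mean(x):
    return np.cumsum(x) / np.arange(1, len(x) + 1)

for name, series in [("coin", coin), ("normal", normal), ("AR(1)", ar1)]:
    print(f"{name:>7}: mean after 100 obs = {running_mean(series)[99]: .3f}, "
          f"after 1000 obs = {running_mean(series)[-1]: .3f}")
```

The coin and normal running means settle near zero, while the near-unit-root AR(1) wanders, mirroring Figure 2.3.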
FIGURE 2.3 Behavior of Mean Estimates as Number of Trials Increase
Figure 2.4 plots the return series for both the normal process and the AR(1) process. Notice there is some nonstationary pattern in the AR(1) plot, whereas the normal process shows characteristic white noise.
FIGURE 2.4 Return Time Series for Normal and AR(1) Processes
To ensure stationarity, risk modelers usually detrend the price data and model changes instead. In contrast, technical analysis (TA) has always modeled prices. TA has been a well-accepted technique for speculative trading since the 1960s, after the ticker tape machine became obsolete. Dealing with non-i.i.d. data does not make TA any less effective. It does, however, mean that the method is less consistent in a statistical sense. Thus, TA has always been regarded as unscientific by academia.
In fact, it would seem that the popular and persistent use of TA (such as in program trading) by the global trading community has made its effectiveness self-fulfilling and the market returns more persistent (and less i.i.d.). Market momentum is a known fact. Ignoring it by detrending and assuming returns are i.i.d. does not make risk measurement more scientific. There is no compelling reason why risk management cannot borrow some of the modelling techniques of TA such as that pertaining to momentum and cycles. From an epistemology perspective, such debate can be seen as a tacit choice between intuitive knowledge (heuristics) and mathematical correctness.
From an academic standpoint, the first step in time series modelling is to find a certain aspect of the data set that has a repeatable pattern or is “invariant” across time in an i.i.d. way. This is mathematically necessary because if the variable under study is not repeatable, one cannot really say that a statistical result derived from a past sample is reflective of future occurrences—the notion of forecasting a number at a future horizon breaks down.
A simple graphical way to test for invariance is to plot a variable Xt with itself at a lagged time step (Xt−1). If the variable is i.i.d. this scatter plot will resemble a circular cloud. Figure 2.5 shows the scatter plots for the normal and AR(1) processes seen previously.
FIGURE 2.5 Scatter Plot of Xt versus Xt−1 for a Gaussian Process and an AR(1) Process
Clearly the AR(1) process is not i.i.d.; its cloud is not circular. For example, if the return of a particular stock, Xt, follows an AR(1) process, it is incorrect to calculate its moments and VaR using Xt. Instead, one should estimate the constant parameters in equation (2.7) from data and back out εt, then compute the moments and VaR from the εt component. The εt is the random (or stochastic) driver of risk, and the fact that it is i.i.d. allows us to project the randomness to the desired forecast horizon. This is the invariant we are after. The astute reader should refer to Meucci (2011).
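A minimal sketch of this two-step recipe, assuming the returns really do follow equation (2.7); the OLS formulas for k0 and k1 are the standard ones and the simulated series is purely illustrative.

```python
import numpy as np

def fit_ar1(x):
    """Estimate k0, k1 of X_t = k0 + k1*X_{t-1} + eps_t by OLS, and back out residuals."""
    y, x_lag = x[1:], x[:-1]
    k1 = np.cov(x_lag, y, ddof=1)[0, 1] / np.var(x_lag, ddof=1)
    k0 = y.mean() - k1 * x_lag.mean()
    eps = y - (k0 + k1 * x_lag)   # the i.i.d. invariant we are after
    return k0, k1, eps

# Example with simulated AR(1) returns (k0 = 0, k1 = 0.7)
np.random.seed(0)
n = 2000
x = np.zeros(n)
noise = np.random.normal(0, 0.01, n)
for t in range(1, n):
    x[t] = 0.7 * x[t - 1] + noise[t]

k0_hat, k1_hat, eps = fit_ar1(x)
print(k0_hat, k1_hat)   # close to 0 and 0.7
print(eps.std())        # moments and VaR should be computed on eps, not on x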
Needless to say, a price series is never i.i.d. because of the presence of trends (even if only small ones) in the series, hence, the need to transform the price series into returns.
In practice, moments are computed using discrete data from an observed frequency distribution (like the histogram in Figure 2.1). However, it is intuitive and often necessary to think in terms of an abstract continuous distribution. In the continuous world, the frequency distribution is replaced by the probability density function (PDF). If f(x) is the PDF of a random variable X, then the probability that X lies between some numbers a and b is the area under the graph of f(x) between a and b. In other words,
$$\Pr[a \le X \le b] = \int_{a}^{b} f(x)\,dx$$
where f(.) is understood to be a function of x here. The probability density function f(x) has the following intuitive properties: it is never negative, $f(x) \ge 0$ for all x, and the total area under it equals one, $\int_{-\infty}^{\infty} f(x)\,dx = 1$.
The cumulative distribution function (CDF) is defined as:
$$F(x) = \int_{-\infty}^{x} f(u)\,du$$
It is the probability of observing the variable at a value at or below x, written F(x) = Pr[X ≤ x]. As shown in Figure 2.6, F(x) is just the area under the graph of f(x) up to a particular value x.
FIGURE 2.6 Density Function and Distribution Function
The most important continuous distribution is the normal distribution also known as Gaussian distribution. Among many bell-shaped distributions, this famous one describes amazingly well the physical characteristics of natural phenomena such as the biological growth of plants and animals, the so-called Brownian motion of gas, the outcome of casino games, and so on. It seems logical to assume that a distribution that describes science so accurately should also be applicable in the human sphere of trading.
The normal distribution can be described fully by just two parameters—its mean μ and variance σ². Its PDF is given by:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \qquad (2.12)$$
The normal distribution is written in shorthand as X∼N(μ, σ2). The standard normal, defined as a normal distribution with mean zero and variance 1, is a convenient simplification for modeling purposes. We denote a random variable ε as following a standard normal distribution by ε∼N(0,1).
Figure 2.7 plots the standard normal distribution. Note that it is symmetric about the mean; it has skewness 0 and kurtosis 3 (or 0 excess kurtosis in Excel convention).
FIGURE 2.7 Standard Normal Probability Density Function
How do we interpret the idea of one standard deviation σ for N(μ, σ²)? It means 68% of the observations (area under f(x)) lie in the interval [μ−σ, μ+σ] and 95% of the observations lie within [μ−2σ, μ+2σ]. Under normal conditions, a stock’s daily return will fluctuate between −2σ and +2σ roughly 95% of the time, or roughly 237 days out of 250 trading days per year, as a rule of thumb.
The central limit theorem (CLT) is a fundamental result. It states that the distribution of the mean of a sample of i.i.d. random variables (regardless of the shape of their distribution, provided the variance is finite) converges to the normal distribution as the sample size becomes very large. Hence, the normal distribution is the limiting distribution for the sample mean. See Spreadsheet 2.2 for an illustration.
The CLT is a very useful result—it means that if you have a very large portfolio of assets, regardless of how each asset’s returns are distributed (some may be fat-tailed, others may be skewed, yet another may be uniform) the average portfolio return behaves like a normal distribution. The only catch is that all the assets must be independent of each other (even though at times they are not).
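In the same spirit as Spreadsheet 2.2, the sketch below (our own illustration) averages draws from a decidedly non-normal (uniform) distribution and checks that the averages look normal.

```python
import numpy as np

np.random.seed(1)
n_assets, n_days = 50, 10000

# Each "asset" has uniform (flat, non-normal) daily returns between -1% and +1%
returns = np.random.uniform(-0.01, 0.01, size=(n_days, n_assets))

# Equal-weighted portfolio return each day = mean across assets
portfolio = returns.mean(axis=1)

# Skewness and excess kurtosis of the portfolio should be close to 0 (normal-like),
# even though each individual asset is uniform (excess kurtosis about -1.2)
def skew(x):
    z = (x - x.mean()) / x.std()
    return (z**3).mean()

def excess_kurtosis(x):
    z = (x - x.mean()) / x.std()
    return (z**4).mean() - 3

print("portfolio skew            :", round(skew(portfolio), 3))
print("portfolio excess kurtosis :", round(excess_kurtosis(portfolio), 3))
print("single asset excess kurt  :", round(excess_kurtosis(returns[:, 0]), 3))
```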
We summarize the advantages of the normal distribution that make it widely used in financial models:
A quantile is a statistical way to divide ordered data into essentially equal-sized data subsets. For n data points and any q between [0,1], the qth quantile is the number x such that qn of the data points are less than x and (1 − q)n of the data is greater than x. The median is just the special case of a quantile where q = 0.5. Note that the median is not the same as the mean, especially for a skewed distribution.
To obtain the quantile from the CDF (lower panel of Figure 2.6) follow the horizontal line from the vertical axis. When you reach the CDF, the value at the horizontal axis is the quantile. In other words, the quantile is just the inverse function of the CDF, or F−1(x). The Excel function is PERCENTILE({x},q).
VaR is a statistical measure of risk based on loss quantiles. It measures market risk on a total portfolio basis, taking into account diversification and leverage. VaR of confidence level c is defined4 as:
$$\Pr[X \le -\mathrm{VaR}] = q$$
where q = 1 − c is the quantile and X is the return random variable.
There are three possible interpretations of VaR. Consider a 95% VaR of $1 million; it could mean:
Interestingly, all three mean the same thing mathematically but reflect the mental bias we have. See Figure 2.8. In light of the 2008 crisis and the criticism of VaR, it seems the more conservative interpretation (3) is appropriate. Regardless, VaR is certainly not the expected loss.
FIGURE 2.8 Different Interpretations of VaR
Since key decision-makers in banks are often not experts from the risk management department, knowledge gaps may exist. It has been argued that ignorance of what VaR truly represents encouraged a false sense of security that things were under control during the run-up to and subsequent crash of the 2008 credit crisis. See Section 13.1.
The target horizon of concern for banks is usually the next one day. That means we need to work with daily returns. Regulators normally stipulate a 10-day horizon for the purpose of capital computation, so the daily VaR is scaled up to 10 days. For limits monitoring and regulatory capital, VaR is often reported in dollar amounts by portfolios. For risk comparison between individual markets, talking in units of sigma (σ) or percentage loss is often more intuitive.
As a convention, the VaR confidence level c is defined on the left tail of the distribution. Hence, 97.5% VaR or c = 0.975 means the quantile of interest is q = (1 − 0.975) = 0.025.
Banks can choose the observation period (or window length) and confidence level c for their VaR. This is a balancing act. The window must be short enough to make VaR sensitive to market changes, yet long enough to be statistically meaningful. Likewise, if the confidence level is too high, there are too few observations in the left tail to give statistically meaningful inferences.
Figure 2.9 is an example of the return distribution of the S&P 500 taken over a 500-day window (Jan 05 to Dec 06). We overlay with a normal distribution line to show that the stock market behavior was reasonably Gaussian during noncrisis periods. From this data, we can calculate VaR in a few ways using Excel:
FIGURE 2.9 Return Distribution of S&P 500 Index (Jan 05 to Dec 06)
Note the 1.2% loss is the daily VaR. VaR, like volatility, is often quoted in annualized terms. Assuming a normal distribution and 250 business days for a typical trading year, the annual VaR is given by 1.2% * √250 = 19%. (Section 6.4 covers such time scaling.)
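For readers working outside Excel, the same historical-quantile calculation and square-root-of-time scaling can be sketched in Python; the return series here is simulated, not the actual S&P 500 data.

```python
import numpy as np

np.random.seed(7)
# Stand-in for 500 daily returns (the book uses actual S&P 500 data)
returns = np.random.normal(0.0, 0.007, 500)

c = 0.975                                      # confidence level
q = 1 - c                                      # left-tail quantile, 0.025
var_daily = -np.percentile(returns, q * 100)   # loss quantile, reported as a positive number
var_annual = var_daily * np.sqrt(250)          # square-root-of-time scaling (see Section 6.4)

print(f"daily  {c:.1%} VaR: {var_daily:.2%}")
print(f"annual {c:.1%} VaR: {var_annual:.2%}")
```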
From experience, a window length of 250 days and a 97.5% VaR can be a reasonable choice. Unless stated otherwise, we will use this choice throughout this book.
Note that VaR does not really model the behavior of the left tail of the distribution. It is a point estimate, a convenient single-number proxy to represent our knowledge of the probability distribution for the purpose of decision making. Because it is just a quantile cutoff, VaR is oblivious to observations to its left. These exceedances (nq = 12 of them in the above example) can be distributed in any number of ways without having an impact on the VaR number. As we shall learn in Chapter 5, there are commendable efforts to model the left tail, but the exceedances have so far remained elusive. This is the dangerous domain of extremistan.
Correlation measures the linear dependency between two variables. In financial applications, these are usually asset return time series. Understanding how one asset “co-varies” with another is key to understanding hedging between a pair of assets and risk diversification within a portfolio.
The covariance between two random variables X and Y is defined as:
$$\mathrm{Cov}(X, Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big]$$
Given two sets of observed samples {xi} and {yi} where i = 1, . . . , n, we can estimate their covariance using:
$$\hat{\sigma}_{XY} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
where n is the number of observations, and $\bar{x}$ and $\bar{y}$ are the sample means of X and Y respectively. The Excel function is COVAR(.).
Then correlation (sometimes called Pearson’s or linear correlation) is just covariance standardized so that it is unitless and falls in the interval [−1, 1]:
$$\hat{\rho}_{XY} = \frac{\hat{\sigma}_{XY}}{\hat{\sigma}_X\,\hat{\sigma}_Y} \qquad (2.15)$$
A correlation of +1 means two assets move perfectly in tandem, whereas a correlation of −1 means that they move perfectly inversely relative to each other. A correlation of 0 means there is no linear dependence between the two return variables (it does not by itself imply independence). Correlation in the data (i.e., association) does not imply causation, although causation will imply correlation. Correlation measures the sign (or direction) of asset movements but not the magnitudes. The Excel function is CORREL(.).
Linear correlation is a fundamental concept in Markowitz portfolio theory (see Section 2.8) and is widely used in portfolio risk management. But correlation is a minefield for the unaware. A very good paper by Embrechts, McNeil, and Straumann (1999) documented the pitfalls of correlation. Here we will go over the main points by looking at the bivariate case; the technical reader is recommended to read that paper.
For correlation to be problem-free, not only must X and Y be normally distributed, their joint distribution must also be normal. More generally, the joint distribution must be an elliptical distribution, of which the multivariate normal is a special case. In our bivariate case, this means the contour lines of its 2D plot trace out ellipses.5 See Figure 2.10. Both diagrams—one actual, one simulated—are generated using N(0,1) with ρ = 0.7.
FIGURE 2.10 An Elliptical Joint Distribution
When the joint distribution is not elliptical, linear correlation will be a bad measure of dependency. It helps to remember that correlation only measures clustering around a straight line. Figure 2.11 shows two obviously nonelliptical distributions even though they show the same correlation (ρ = 0.7) as Figure 2.10. Clearly correlation, as a scalar (single number) representation of dependency, tells us very little about the shape (and hence joint risk) of the joint distribution. Correlation is also very sensitive to extreme values or outliers. If we remove the few outliers (bottom right) in the right panel of Figure 2.11, the correlation increases drastically from 0.7 to 0.95. Furthermore, the estimation of correlation will be biased—there is no way to draw a best straight line through Figure 2.11 in an unbiased way (i.e., with points scattered evenly to the left and right of the line).
FIGURE 2.11 Nonelliptical Joint Distributions
Figure 2.12, left panel shows the actual joint distribution of Lehman 5-year CDS spread and Morgan Stanley 5-year CDS spread. During normal times (the shaded zone) the distribution is reasonably elliptical, but during times of stress the outliers show otherwise.
FIGURE 2.12 (Left panel) Joint Return Distribution of Credit Spreads. (Right panel) Perfect Dependency Shows 1.0 Rank Correlation.
Let’s summarize key weaknesses of linear correlation when used on nonelliptical distributions:
An alternative measure of dependency, rank correlation, can solve problems 2, 4, and 5. Unfortunately, rank correlations cannot be applied to the Markowitz portfolio framework. They are still useful, though, as stand-alone correlation measures.
We introduce two rank correlations: Kendall’s tau and Spearman’s rho. Suppose we have n pairs of observations {x1, y1}, . . . , {xn, yn} of the random variables X and Y. The ith pair and jth pair are said to be concordant if (xi − xj) (yi − yj) > 0, and discordant if (xi − xj) (yi − yj) < 0. The sample estimate of Kendall’s tau is computed by comparing all possible pairs of observations (where i ≠ j), and there are 0.5n(n−1) pairs.
Kendall’s tau:
$$\hat{\tau} = \frac{c - d}{0.5\,n(n-1)}$$
where c is the number of concordant pairs and d the number of discordant pairs.
The sample estimate of Spearman’s rho is calculated by applying Pearson’s (linear) correlation to the ranked paired data:
Spearman’s rho:
$$\hat{\rho}_S = \mathrm{corr}\big(\mathrm{rank}(x),\, \mathrm{rank}(y)\big)$$
where rank(x) is an n-vector containing the ranks6 of {xi} and similarly for rank(y). This can be written in Excel function as: CORREL(RANK(x, x),RANK(y, y)).
Figure 2.12, right panel shows a bivariate distribution that has perfect dependency since all pairs are concordant. Kendall’s tau is 1.0, Spearman’s rho is 1.0, but linear correlation gives 0.9. This illustrates weakness (2). One can easily illustrate weakness (4) as well, as shown in Spreadsheet 2.3.
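Both rank correlations are easy to compute directly from their definitions; the sketch below uses a made-up, strictly increasing (hence perfectly concordant) pair of series to reproduce the effect seen in the right panel of Figure 2.12.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau: (concordant - discordant) / (0.5 * n * (n - 1))."""
    n = len(x)
    c = d = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                c += 1
            elif s < 0:
                d += 1
    return (c - d) / (0.5 * n * (n - 1))

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation applied to the ranks of the data."""
    rank = lambda v: np.argsort(np.argsort(v)) + 1
    return np.corrcoef(rank(x), rank(y))[0, 1]

# Perfect monotonic (but nonlinear) dependence, as in Figure 2.12 (right panel)
x = np.linspace(0.1, 3.0, 50)
y = np.exp(2 * x)                 # strictly increasing in x
print(kendall_tau(x, y))          # 1.0
print(spearman_rho(x, y))         # 1.0
print(np.corrcoef(x, y)[0, 1])    # < 1.0: linear correlation understates the dependence
```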
A mathematically perfect but complicated measure of association is the copula function (not covered here). A copula is a function that links the marginal distributions (or stand-alone distributions) to form a joint distribution. If we can fully specify the functional form of the N-dimensional joint distribution of N assets in a portfolio, and specify the marginal distributions of each asset, then we can use the copula function to tell us useful information about the risk of this system. Clearly, this is an extremely difficult task because in practice there is seldom enough data to specify the distributions precisely, and the copula has to be assumed. Copulas are extensively used in the modeling of credit risk and the pricing of tranched credit derivatives such as collateralized debt obligations (CDOs). These models simplify the task of representing the relationships among large numbers of credit risk factors to a handful of latent variables. A copula is then used to link the marginal distributions of the latent variables to the joint distribution. Unfortunately, the choice of the copula function has a significant impact on the final tail risk measurement. This exposes users to considerable model risks as discussed in the article by Frey and colleagues (2001).
In the absence of such perfect knowledge as copulas, a risk manager has to rely on linear and rank correlations. He has to be mindful of the limitations of these imperfect tools.
Autocorrelation is a useful extension of the correlation idea. For a time series variable Yt the autocorrelation of lag k is defined as:
$$\rho_k = \mathrm{corr}(Y_t,\, Y_{t-k})$$
In other words, it is just the correlation of a time series with itself but with a lag of k, hence the synonymous term serial correlation.
This autocorrelation can be estimated using equation (2.15) but on the samples (y1+k, . . . , yt) and (y1, . . . , yt−k) instead. The plot of autocorrelation for various lags is called the autocorrelation function (ACF) or correlogram. Figure 2.13 is an ACF plot of an N(0,1) process and an AR(1) process described by equation (2.7) with k0 = 0, k1 = 0.7. We will discuss AR(p) processes next, but for now notice that the ACF plot shows significant autocorrelation for various k lags, which tapers off as k increases. This compares to the Gaussian process, which has no serial correlation—its ACF fluctuates near zero for all k. The ACF is a practical way to detect serial correlation for a time series. See Spreadsheet 2.4.
FIGURE 2.13 ACF Plot for N(0,1) Process and AR(1) Process, with Up to 25 Lags
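The sample ACF is just equation (2.15) applied to lagged copies of the series; the short sketch below (ours, not Spreadsheet 2.4) reproduces the qualitative pattern in Figure 2.13.

```python
import numpy as np

def acf(x, max_lag=25):
    """Sample autocorrelation for lags 1..max_lag (Pearson correlation of x_t with x_{t-k})."""
    x = np.asarray(x, dtype=float)
    return [np.corrcoef(x[k:], x[:-k])[0, 1] for k in range(1, max_lag + 1)]

np.random.seed(3)
n = 2000
eps = np.random.normal(0, 1, n)

# Gaussian white noise versus AR(1) with k0 = 0, k1 = 0.7
ar1 = np.zeros(n)
for t in range(1, n):
    ar1[t] = 0.7 * ar1[t - 1] + eps[t]

print("lag-1 ACF, N(0,1):", round(acf(eps, 1)[0], 3))   # near 0
print("lag-1 ACF, AR(1) :", round(acf(ar1, 1)[0], 3))   # near 0.7, tapering off at higher lags
print("lag-5 ACF, AR(1) :", round(acf(ar1, 5)[4], 3))   # roughly 0.7**5
```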
The autoregressive model AR(p) for a time series variable X is described by:
$$X_t = k_0 + k_1 X_{t-1} + k_2 X_{t-2} + \cdots + k_p X_{t-p} + \varepsilon_t$$
where p is a positive integer, k’s are constant coefficients and εt is an i.i.d. random sequence. Clearly AR(p) models are non-i.i.d. (they are past dependent).
It is instructive to use the AR(1), equation (2.7), to clarify some important concepts learned previously:
In short, i.i.d., stationarity and serial correlation are related yet distinct ideas. Only under certain conditions, does one lead to another.
Regression is a basic tool in time series modeling used to find and quantify relationships between variables. It has many applications in finance: a trader may form his trading view from the analysis of relationships between stocks and bonds, an economist may need to find relationships between the dollar and various macroeconomic data, and an interest rate hedger may need to find relationships between one section of the yield curve and another.
The modeler starts by specifying (guessing) some relationship between the variables based on his knowledge of the markets or by visual study of chart patterns. For example, our interest rate hedger may look at Figure 2.14 and conclude that there is an obvious relationship between the USD 5-year swap rate and 10-year swap rate. A reasonable guess is to assume a linear relationship of the form:
$$Y_t = \alpha + \beta X_t + \varepsilon_t \qquad (2.21)$$
where Yt and Xt are the two random variables representing the 5-year and 10-year swap rates respectively. εt is a residual error term that captures all other (unknown) effects not explained by the chosen variables. α and β are constant parameters that need to be estimated from data. The method of estimation is called ordinary least squares (OLS). If εt is a white noise (i.i.d.) series, then the OLS method produces consistent estimates. People often mistakenly model price levels rather than changes in prices. If one models the price, the residual error εt is often found to be serially correlated.7 This means the OLS method will produce inconsistent (biased) estimates. To illustrate, we will do both regressions—one where the data samples {yi}, {xi} are levels, the other where they represent changes.
Start by drawing a scatter plot as in Figure 2.15—the more the data concentrate along a straight line, the stronger the regression relationship. Conceptually, OLS works by estimating the parameters α and β that minimize the residual sum of squares (RSS) given by:
$$\mathrm{RSS} = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2$$
FIGURE 2.15 Scatter Plots of Prices and Price Changes
where xi, yi are the ith observation. Intuitively, OLS is a fitting method that draws the best straight line that represents all data points by minimizing the residual distances (squared and summed) between the estimated line and all the observed points yi. Linear regression can easily be performed by Excel. From the toolbar, select: Tools → Data Analysis → Regression.
The estimated parameters are shown in Table 2.1. The β that represents the slope of the line is close to 1 for both cases. The α is the intercept of the line with the vertical axis. The R-square that ranges from 0 to 1 measures the goodness of fit. R-square of 0.9 means 90% of the variability in Y is explained linearly by X. The Excel functions are SLOPE(.), INTERCEPT(.), and RSQ(.) respectively.
TABLE 2.1 Linear Regression Results
|          | Using Price | Using Price Change |
| Beta     | 1.079       | 0.987              |
| Alpha    | −0.858      | −0.001             |
| R-square | 0.966       | 0.866              |
Figure 2.16 shows that if we model using level prices, the residual series does not behave like white noise, unlike the second case where we model using price changes. We can use an ACF plot on the residuals to prove that the first has serial autocorrelation and the second is stationary.
FIGURE 2.16 Residual Time Series Plots
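The level-versus-change comparison is easy to reproduce; the sketch below uses simulated stand-ins for the two swap rates (not the book's data) and checks the lag-1 autocorrelation of the residuals in each case.

```python
import numpy as np

np.random.seed(11)
n = 500

# Simulated stand-ins for the 5y and 10y swap rates (random walks sharing a common driver)
common = np.cumsum(np.random.normal(0, 0.03, n))
y_level = 4.0 + common + np.cumsum(np.random.normal(0, 0.01, n))
x_level = 4.5 + common + np.cumsum(np.random.normal(0, 0.01, n))

def ols(y, x):
    """Simple OLS of y on x with intercept: returns alpha, beta, residuals, R-square."""
    beta = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    alpha = y.mean() - beta * x.mean()
    resid = y - (alpha + beta * x)
    r2 = 1 - resid.var() / y.var()
    return alpha, beta, resid, r2

def lag1_acf(e):
    return np.corrcoef(e[1:], e[:-1])[0, 1]

# Regression on levels: residuals are strongly serially correlated
_, _, resid_lvl, r2_lvl = ols(y_level, x_level)
# Regression on changes: residuals look like white noise
_, _, resid_chg, r2_chg = ols(np.diff(y_level), np.diff(x_level))

print("levels : R2 =", round(r2_lvl, 3), " lag-1 residual ACF =", round(lag1_acf(resid_lvl), 2))
print("changes: R2 =", round(r2_chg, 3), " lag-1 residual ACF =", round(lag1_acf(resid_chg), 2))
```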
It is worth noting:
We next look at how statisticians measure the precision of their models and express this in the form of statistical significance. When we say a hypothesis (or whatever we have defined to be measured) has a 95% confidence level, it means we have chosen an error range around the expected value such that only 5% of observations fall outside this range. Clearly, the tighter the range, the higher the precision.
The t-ratio is the basic metric of significance for OLS regression. Consider again equation (2.21)—a simple two-variable linear regression. First, we compute the standard error (SE) of the coefficient β, defined by:
$$SE(\hat{\beta}) = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / (n-2)}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$
where $\hat{y}_i$ is the estimate of y at the ith observation and $\bar{x}$ is the sample mean of x. Then the t-ratio is defined by:
$$t = \frac{\hat{\beta}}{SE(\hat{\beta})}$$
This measures how far away the slope is from showing no linear dependence (β = 0) in units of standard error. Clearly, the larger the t-ratio the more significant is the linear relationship. The t-ratios can be calculated by the Excel regression tool.
To determine how good the t-ratio is, we need to state the null hypothesis (H0) and an alternative hypothesis (H1): here H0 is β = 0 (no linear relationship) and H1 is β ≠ 0.
If the regression relationship is significant, the slope will not equal zero; that is, we hope to reject H0. We assume the t-ratio is distributed like the student-t distribution centered on the value under H0. The student-t has fatter tails than the normal distribution, but for large samples the student-t approaches the standard normal N(0,1) distribution, and we assume this case. We need to see if the estimated t-ratio is larger in magnitude than some critical value (CV) defined by a chosen tail probability p of the distribution. In Excel, the CV is given by NORMSINV(p).
For example, suppose we choose 95% confidence level8 for a two-tail test, then at the left side tail, the CV = NORMSINV(0.025) = −1.96. Suppose the t–ratio of our regression works out to be −7.0, that is, falls in the critical region (see Figure 2.17), then we can reject the null hypothesis H0, and our regression is significant. On the other hand, if the t-ratio is −1.3, then we cannot reject H0, and we have to think of another model.
FIGURE 2.17 Rejecting and Not Rejecting the Null Hypothesis
Note that statisticians will never say accept the null hypothesis. A null hypothesis is never proven by such methods, as the absence of evidence against the null hypothesis does not confirm it. As an example, consider a series of five coin flips. If the outcome turns out to be five heads, an observer will be tempted to form the opinion (the null hypothesis) that the coin is an unfair two-headed coin. Statistically, we can say we cannot reject the null hypothesis. We cannot accept the null hypothesis, since a series of five heads can occur from random chance alone. So likewise if we model financial time series using a chosen fat tail distribution (say the Laplace distribution), we cannot say we accept that distribution as the true distribution, even if the statistical test is highly significant. Perhaps there may be other distributions that fit the data better.
In the statistics literature, there are many significance tests of stationarity, also called unit root tests. Here, we will only discuss the augmented Dickey-Fuller test, or ADF(q) test, developed by Dickey and Fuller (1981). This test is based on the regression equation:
$$\Delta X_t = \alpha + \beta X_{t-1} + \sum_{s=1}^{q} \delta_s\,\Delta X_{t-s} + \varepsilon_t \qquad (2.25)$$
where ΔX is the first difference of X, q is the number of lags, εt is the residual, and α, β, δs are coefficients. The test statistic is the t-ratio for the coefficient β. The null hypothesis is for β = 0 versus a one-sided alternative hypothesis β < 0. If the null hypothesis is rejected, then X is stationary. The critical values depend on q and are shown in Table 2.2 for a sample size between 500 and 600.
TABLE 2.2 Critical Values of ADF Test
| Number of Lags, q | Significance Level 1% | Significance Level 5% |
| 1 | −3.43 | −2.86 |
| 2 | −3.90 | −3.34 |
| 3 | −4.30 | −3.74 |
| 4 | −4.65 | −4.10 |
Equation (2.25) can easily be implemented in Excel for up to 16 lags if needed. We shall see an example of ADF test in Section 2.10.
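Equation (2.25) is an ordinary OLS regression, so the test statistic needs only a few lines of linear algebra. The sketch below is our own illustration on simulated series; the comments reference the 5%, one-lag critical value of −2.86 from Table 2.2.

```python
import numpy as np

def adf_t_stat(x, q=1):
    """t-ratio of beta in the ADF(q) regression: dX_t = a + b*X_{t-1} + sum(d_s * dX_{t-s}) + e_t."""
    x = np.asarray(x, dtype=float)
    dx = np.diff(x)
    # Dependent variable and regressors, aligned so that q lagged differences are available
    y = dx[q:]
    cols = [np.ones(len(y)), x[q:-1]]                 # intercept and X_{t-1}
    cols += [dx[q - s:-s] for s in range(1, q + 1)]   # lagged differences dX_{t-s}
    X = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    s2 = resid @ resid / (len(y) - X.shape[1])        # residual variance
    cov = s2 * np.linalg.inv(X.T @ X)
    return coef[1] / np.sqrt(cov[1, 1])               # t-ratio of beta (on X_{t-1})

np.random.seed(5)
random_walk = np.cumsum(np.random.normal(0, 1, 600))     # nonstationary
white_noise = np.random.normal(0, 1, 600)                # stationary
print("random walk t-stat:", round(adf_t_stat(random_walk), 2))  # typically above -2.86: cannot reject unit root
print("white noise t-stat:", round(adf_t_stat(white_noise), 2))  # far below -2.86: reject, stationary
```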
In modern risk management the basic unit of risk is measured as volatility. Risk measures such as notional, sensitivity, and others (see Section 1.1) fall short because they cannot be compared consistently across all products. It is instructive to compare four well-known models of volatility—standard deviation, EWMA, ARCH, and GARCH.
Assuming we have a rolling window of n days with price data p1, . . . , pn, if volatility is constant or varying slowly, we can estimate the nth day volatility by:
$$\hat{\sigma}_n^2 = \frac{1}{n}\sum_{i=1}^{n} r_i^2 \qquad (2.26)$$
where ri is the observed percentage return (pi − pi−1)/pi−1 or log return ln(pi/pi−1). Note that equation (2.26) is just the variance equation (2.3) but with zero mean. Therefore, volatility (normally written without the subscript n) is simply the standard deviation of returns. In Excel, $\hat{\sigma}_n$ is given by STDEV(.).
But in financial markets, volatility is known to change with time, often rapidly. The first time-varying model, the Autoregressive Conditional Heteroscedasticity (ARCH) model, was pioneered by Engle (1982). A simple version takes the form:
$$\sigma_n^2 = \alpha\theta + (1-\alpha)\frac{1}{n}\sum_{i=1}^{n} r_i^2 \qquad (2.27)$$
In some sense, equation (2.27) is an extension of equation (2.26) which incorporates mean reversion about a constant long-run volatility θ where α is the weight assigned to this long-run volatility term. Unfortunately, these first two models have an undesirable “plateau” effect as a result of giving equal weights 1/n to each observation. This effect kicks in whenever a large return observation drops off the rolling window as the window moves forward in time, as shown in Figure 2.19.
FIGURE 2.19 Behavior of Four Volatility Models and the Plateau Effect
To overcome this problem, different weights should be assigned to each observation; logically, more recent observations (being more representative of the present) should be weighted more. Such models are called conditional variance models since volatility is now conditional on information at past times. The time ordering of the returns matters in the calculation. In contrast, for the standard deviation model, the return distribution is assumed to be static. Hence, its variance is constant or unconditional, and equal weights can be used for every observation. A common scheme is the so-called exponentially weighted moving average (EWMA) method promoted by J.P. Morgan’s RiskMetrics (1994):
$$\sigma_n^2 = (1-\lambda)\sum_{i=1}^{\infty} \lambda^{i-1} r_{n-i}^2 \qquad (2.28)$$
The decay factor λ (an input parameter) must be larger than 0 and less than 1. Equation (2.28) can be simplified to an iterative formula which can be easily implemented in Excel:
$$\sigma_n^2 = \lambda\,\sigma_{n-1}^2 + (1-\lambda)\,r_{n-1}^2 \qquad (2.29)$$
Figure 2.18 shows the EWMA weights for past observations. The larger the λ, the slower the weights decay. Gradually falling weights solve the plateau problem because any (large) return that is exiting the rolling window will have a very tiny weight assigned to it. For a one-day forecast horizon, RiskMetrics proposed a decay factor of 0.94.
FIGURE 2.18 EWMA Weights of Past Observations
Bollerslev (1986) proposed a useful extension of ARCH called Generalized Autoregressive Conditional Heteroscedasticity (GARCH). There is a whole class of models under GARCH, and one simple example of a GARCH model is:
Simple GARCH:
$$\sigma_n^2 = \alpha\theta + (1-\alpha)\big[\lambda\,\sigma_{n-1}^2 + (1-\lambda)\,r_{n-1}^2\big]$$
Compared to equation (2.29) it looks like an extension of the EWMA model to include mean reversion about a constant long-term mean volatility θ, which needs to be estimated separately. In fact, the EWMA is the simplest example of a GARCH model with only one parameter. When the weight α = 0, GARCH becomes the EWMA model.
Figure 2.19 graphs all four models for the same simulated time series. We set α = 0.4, λ = 0.99 and θ = 0.02. We simulated a 10% (large) return on a particular day, halfway in the time series. We can see this caused the artificial plateau effect for the standard deviation and ARCH models. There is no plateau effect for the EWMA and GARCH models. This implementation is in Spreadsheet 2.5.
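A compact Python sketch of the four estimators as written above; the simulated returns and parameter values are our own illustrative choices (in particular we treat θ as a long-run variance level), not the exact settings of Spreadsheet 2.5.

```python
import numpy as np

np.random.seed(9)
T, window = 500, 100
returns = np.random.normal(0, 0.01, T)
returns[250] = 0.10                      # a single large 10% return halfway through

alpha, lam, theta = 0.4, 0.94, 0.0001    # theta: assumed long-run variance (illustrative)

stdev = np.full(T, np.nan)
arch = np.full(T, np.nan)
ewma = np.zeros(T)
garch = np.zeros(T)
ewma[0] = garch[0] = returns[:window].var()

for t in range(1, T):
    if t >= window:
        mean_sq = (returns[t - window:t] ** 2).mean()
        stdev[t] = np.sqrt(mean_sq)                          # equal-weight rolling estimate
        arch[t] = np.sqrt(alpha * theta + (1 - alpha) * mean_sq)
    ewma[t] = lam * ewma[t - 1] + (1 - lam) * returns[t - 1] ** 2          # equation (2.29)
    garch[t] = alpha * theta + (1 - alpha) * (lam * garch[t - 1] + (1 - lam) * returns[t - 1] ** 2)

# The rolling-window estimates stay elevated ("plateau") for exactly `window` days after the
# shock, then drop abruptly; the EWMA/GARCH estimates jump and then decay smoothly.
print("stdev vol 100 days after shock:", round(stdev[350], 4))
print("stdev vol 101 days after shock:", round(stdev[351], 4))
print("EWMA  vol 100 days after shock:", round(np.sqrt(ewma[350]), 4))
```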
The main advantage of GARCH is it can account for volatility clustering observed in financial markets. This phenomenon refers to the observation, as noted by Mandelbrot (1963), that “large changes tend to be followed by large changes, of either sign, and small changes tend to be followed by small changes.” A quantitative manifestation of this is that absolute returns show a positive and slowly decaying ACF even though returns themselves are not autocorrelated. Figure 2.20 shows volatility clustering observed in the S&P 500 index during the credit crisis. The upper panel shows that the price return bunches together during late 2008. This is captured by GARCH and EWMA, which showed a peak in volatility that tapers off thereafter (lower panel). The standard deviation and ARCH models registered the rise in risk but not the clustering.
FIGURE 2.20 Modeling Volatility Clustering for the S&P 500 Index
Another good feature is that ARCH and GARCH can produce fatter tails than the normal distribution. Unfortunately, they cannot explain the leverage effect (asymmetry) observed in the market, where volatility tends to be higher during a sell-off compared to that during a rally. The exponential GARCH model (EGARCH) by Nelson (1991) allows for this asymmetric effect between positive and negative asset returns.
Note that what we have calculated so far is the volatility for today (the nth day). In risk management, we are actually interested in forecasting the next day’s volatility. It can be shown that for standard deviation and the EWMA model, the expected future variance is the same as today’s. Taking expectations of equation (2.29):
$$E\big[\sigma_{n+1}^2\big] = \lambda\,\sigma_n^2 + (1-\lambda)\,E\big[r_n^2\big] = \sigma_n^2 \qquad (2.31)$$
where $E[r_n^2] = \sigma_n^2$.
This is not strictly true for the ARCH and GARCH models because of the presence of mean-reversion toward the long-term volatility. Depending on whether the current volatility is above or below the long-term volatility, tomorrow’s volatility will fall or rise slightly. However, for a one-day forecast the effect is so tiny that equation (2.31) can still be applied.
The key lesson from this section is that a different choice of method and parameters will produce materially different risk measurements (see Figure 2.19). This is unnerving. What then is the “true” volatility (risk)? Does it really exist? Some believe the implied volatility backed out from option markets can represent real volatility but this runs into its own set of problems. Firstly, not all assets have tradable options, which naturally limit risk management coverage. Secondly, there is the volatility smile where one asset has different volatilities depending on the strike of its options (so which one?). Mathematically, this reflects the fact that asset returns are not normally distributed, which gives rise to the third problem—such implied volatility cannot be used in the Markowitz portfolio framework (explained in the next section).
The current best practice in banks is to use standard deviation or EWMA volatility because of its practical simplicity. These measures are well understood and commonly used in various trades and industries.
Modern portfolio theory was founded by Harry Markowitz (see Markowitz 1952). The Nobel Prize for Economics was eventually awarded in 1990 for this seminal work. The theory served as the foundation for the research and practice of portfolio optimization, diversification, and risk management.
Its basic assumption is that investors are risk-averse (assumption 1)—given two assets, A and B with the same returns, investors will prefer the asset with lower risk. Equally, investors will take on more risk only for higher return. So what is the portfolio composition that will give the best risk-return tradeoff? This is an optimization problem. To answer that question, Markowitz further assumed that investors will only consider the first two moments—mean (expected return) and variance—for their decision making (assumption 2), and the distributions of returns are jointly normal (assumption 3). Investors are indifferent to other characteristics of the distribution such as skew and kurtosis. The Markowitz mean-variance framework makes use of linear correlation to account for dependency, valid for normal distributions.
However, behavioral finance has found evidence that investors are not always rational, that is, not always risk-averse. Assumption 2 is challenged by the observation that skew is often priced into the market as evident in the so-called volatility smile of option markets. So investors do consider skewness. There is also empirical evidence that asset prices do not follow the normal distribution during stressful periods. Moreover, distributions are seldom normal for credit markets or when there are options in the portfolio. Despite these weak assumptions, the mean-variance framework is well-accepted because of its practical simplicity.
In the mean-variance framework, the expected return of a portfolio μp is given by the weighted sum of the expected returns of the individual assets μi:
$$\mu_p = \sum_{i} \omega_i\,\mu_i$$
where the weights ωi are the asset allocations such that Σωi = 1. The portfolio variance is given by:
$$\sigma_p^2 = \sum_{i}\sum_{j} \omega_i\,\omega_j\,\sigma_i\,\sigma_j\,\rho_{ij} \qquad (2.33)$$
where the correlation ρij = 1 when i = j. If there are n assets then i, j = 1, . . . , n.
The incorporation of correlation into equation (2.33) leads to the idea of diversification. An investor can reduce portfolio risk simply by holding combinations of assets that are not perfectly positively correlated, or better, negatively correlated. To see the effects of diversification, consider a portfolio with just two assets a, b. Equation (2.33) then becomes:
$$\sigma_p^2 = \omega_a^2\sigma_a^2 + \omega_b^2\sigma_b^2 + 2\,\omega_a\omega_b\,\sigma_a\sigma_b\,\rho_{ab}$$
As long as the assets are not perfectly correlated (ρab < 1) then:
$$\sigma_p < \omega_a\sigma_a + \omega_b\sigma_b$$
The risk (volatility) of the portfolio is always less than the sum of the risk of its components. Merging portfolios should not increase risks; conversely splitting portfolios should not decrease risk. This desirable property called subadditivity is generally true for standard deviation regardless of distribution.
For the purpose of mathematical manipulation, it is convenient to write the list of correlation pairs in the form of a matrix, the correlation matrix. Then basic matrix algebra can be applied to solve equations efficiently. Excel has a simple tool that generates the correlation matrix (Tools → Data analysis → Correlation) and some functions to do matrix algebra.
The problem of finding the optimal asset allocation is called the Markowitz problem. The optimization problem can be stated in matrix form:
$$\min_{\mathbf{w}}\; \mathbf{w}^{T}\boldsymbol{\sigma}\,\mathbf{w} \quad \text{subject to} \quad \mathbf{w}^{T}\boldsymbol{\mu} = r_T, \qquad \mathbf{w}^{T}\mathbf{1} = 1 \qquad (2.36)$$
where w is the column vector of weights ωi, μ is the vector of investor’s expected returns, μi, rT the targeted portfolio return, σ the covariance matrix derived using equation (2.15). We need to find the weights w that minimize the portfolio variance subject to constraints on expected returns and portfolio target return. Spreadsheet 2.6 is an example of portfolio optimization with four assets performed using Excel Solver.
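An equivalent optimization is easy to set up outside Excel; the sketch below uses scipy with made-up expected returns and a made-up covariance matrix for four assets (not the inputs of Spreadsheet 2.6).

```python
import numpy as np
from scipy.optimize import minimize

# Made-up annual expected returns and covariance matrix for four assets
mu = np.array([0.05, 0.07, 0.06, 0.10])
sigma = np.array([[0.0100, 0.0018, 0.0011, 0.0020],
                  [0.0018, 0.0109, 0.0026, 0.0070],
                  [0.0011, 0.0026, 0.0199, 0.0020],
                  [0.0020, 0.0070, 0.0020, 0.0400]])
r_target = 0.07

def portfolio_variance(w):
    return w @ sigma @ w            # w' * sigma * w, as in equation (2.36)

constraints = [{"type": "eq", "fun": lambda w: w @ mu - r_target},   # expected return = target
               {"type": "eq", "fun": lambda w: w.sum() - 1.0}]       # weights sum to 1
bounds = [(0.0, 1.0)] * 4           # optional extra constraint: long-only positions
w0 = np.full(4, 0.25)

result = minimize(portfolio_variance, w0, method="SLSQP", bounds=bounds, constraints=constraints)
print("optimal weights:", np.round(result.x, 3))
print("portfolio vol  :", round(np.sqrt(portfolio_variance(result.x)), 4))
```

Repeating the optimization for a range of target returns traces out the efficient frontier discussed next.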
The key weakness of the classical Markowitz problem is that the expected returns and target are all assumed inputs (guesswork)—they are not random variables that are amenable to statistical estimation. The only concrete input is the covariance matrix, which can be statistically estimated from time series. It is found that the optimization result (weights) is unstable and very sensitive to return assumptions. This could lead to slippage losses when the portfolio is rebalanced too frequently. The Markowitz theory has evolved over the years to handle weaknesses in the original version, but that is outside the scope of this book.
Imagine a universe of stocks available for investment and we have to select a portfolio of stocks by choosing different weights for each stock. Each combination of assets will have its own risk-return characteristics and can be represented as a point in the risk/return space (see Figure 2.21). It can be shown that if we use equation (2.36) to determine our choice of assets, these optimal portfolios will lie on a hyperbola. The upper half of this hyperbola is preferable to the rest of the curve and is called the efficient frontier. Along this frontier, risky portfolios have the lowest risk for a given level of return. The heart of portfolio management is to rebalance the weights dynamically so that a portfolio is always located on the efficient frontier.
FIGURE 2.21 The Efficient Frontier
It is also convenient to express the risk of an individual asset relative to its market benchmark index. This is called the beta approach, commonly used for equities. It is reasonable to assume a linear relationship given by the regression:
$$R_i(t) = \alpha_i + \beta_i\,R_M(t) + \varepsilon_i(t)$$
where Ri is the return variable for asset i and RM is the return of the market index at time t, αi is a constant drift for asset i. εi(t) is an i.i.d. residual random variable specific to the i th asset; this idiosyncratic risk is assumed to be uncorrelated to the index. The beta βi, also called the hedge ratio, is the sensitivity of the asset to a unit change in the market index. Thus, in the beta model, an asset return is broken down into a systematic component and an idiosyncratic component. The parameters can be estimated using the Excel regression tool.
Maximum likelihood estimation (MLE) is a popular statistical method used for fitting a statistical model to data and estimating the model’s parameters. Suppose we have a sample of observations x1, x2, . . . , xn which are assumed to be drawn from a known distribution with PDF given by fθ(x) with parameter θ (where θ can also be a vector). The question is: What is the model parameter θ such that we obtain the maximum likelihood (or probability) of drawing from the distribution, the same sample as the observed sample?
If we assume the draws are all i.i.d. then their probabilities fθ(.) are multiplicative. Thus, we need to find θ that maximizes the likelihood function:
$$L(\theta) = \prod_{i=1}^{n} f_{\theta}(x_i)$$
Since the log function is a monotonic function (i.e., perpetually increasing), maximizing L(θ) is equivalent to maximizing ln(L(θ)). In fact, it is easier to maximize the log-likelihood function:
$$\ln L(\theta) = \sum_{i=1}^{n} \ln f_{\theta}(x_i)$$
because it is easier to deal with a summation series than a multiplicative series.
As an example, we will use MLE to estimate the decay factor λ of the EWMA volatility model (see Section 2.7). We assume the PDF fθ(x) is normal as given by equation (2.12), with zero mean. Then, the likelihood function is given by:
$$L(\lambda) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\nu_i}}\exp\!\left(-\frac{x_i^2}{2\nu_i}\right)$$
where the variance νi = νi(λ) is calculated using the EWMA model. Taking logs, we can simplify to the following log-likelihood function (ignoring the constant term and multiplicative factor):
$$L^{*}(\lambda) = -\sum_{i=1}^{n}\left[\ln(\nu_i) + \frac{x_i^2}{\nu_i}\right]$$
The conditional variance νi(λ) = σi² on day i is estimated iteratively using equation (2.29). After that, the objective L*(λ) can be maximized using Excel Solver to solve for the parameter λ. Spreadsheet 2.7 illustrates how this can be done for Dow Jones index data. The optimal decay factor works out to be $\hat{\lambda}$ = 0.94, as was estimated and proposed by RiskMetrics. The weakness of the MLE method is that it assumes the functional form of the PDF is known. This is seldom true in financial markets.
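The same estimation can be sketched in Python; the returns below are simulated from a known decay factor (we do not have the Dow Jones data here), the likelihood assumes zero-mean returns, and a simple grid search stands in for Excel Solver.

```python
import numpy as np

def ewma_log_likelihood(returns, lam):
    """L*(lambda) = -sum[ ln(v_i) + r_i^2 / v_i ], with v_i from the EWMA recursion (2.29)."""
    r2 = returns ** 2
    v = np.empty(len(returns))
    v[0] = r2[:20].mean()                 # seed the recursion with an initial variance
    for i in range(1, len(returns)):
        v[i] = lam * v[i - 1] + (1 - lam) * r2[i - 1]
    return -np.sum(np.log(v[1:]) + r2[1:] / v[1:])

# Simulated stand-in for an equity index return series
np.random.seed(2)
true_lam, n = 0.94, 3000
r = np.empty(n)
v = 1e-4
for i in range(n):
    r[i] = np.random.normal(0, np.sqrt(v))
    v = true_lam * v + (1 - true_lam) * r[i] ** 2

# Maximize L*(lambda) by a simple grid search over lambda in (0, 1)
grid = np.arange(0.80, 0.999, 0.001)
best = max(grid, key=lambda lam: ewma_log_likelihood(r, lam))
print("MLE decay factor:", round(best, 3))   # should come out close to 0.94
```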
Cointegration analysis tests whether there is a long-run equilibrium relationship between two time series. Hence, if two stocks are cointegrated they have the tendency to gravitate toward a common equilibrium path in the long-run, that is, short-term deviation will be corrected. Hence, cointegration has found popular application in the area of pair trading (statistical arbitrage) where a trader will buy one security and short another security, betting that the securities will mean revert towards each other. This idea is also used by index-tracking funds where a portfolio of stocks is designed to track the performance of a chosen index.
Cointegration and correlation are very different ideas. Correlated stocks rise and fall in synchrony, whereas cointegrated stocks will not wander off from each other for very long without reverting back to some long-term spread. Figure 2.22 shows the difference between the two ideas.
FIGURE 2.22 Cointegration and Correlation
In the 1980s many economists simply used regression on price levels, which ran into the problem of serial correlation of residuals. This means the OLS estimation is not consistent and the estimated relationship may be spurious. Engle and Granger (1987) formalized the cointegration approach and spelled out the necessary conditions for OLS regression to be applied to price levels (nonstationary data).
A time series is said to be integrated of order p, written I(p), if we can obtain a stationary series by differencing p times (but no fewer). Most financial market time series are found to be I(1). A stationary time series is the special case I(0). Suppose X1, X2, . . . , Xn are integrated prices (or log prices); then these variables are cointegrated if a linear combination of them produces a stationary time series. This can be formalized using the Engle-Granger regression:
$$X_1(t) = a_1 + a_2 X_2(t) + \cdots + a_n X_n(t) + \varepsilon(t)$$
The Engle-Granger test is for the residuals ε(t) to be stationary, or I(0). If affirmative, then X1, X2, . . . , Xn are said to be cointegrated. Only in this situation can OLS regression be applied to these (nonstationary) random variables.
The coefficients a1, . . . , an can be estimated using the OLS method. The problem with the Engle-Granger regression is that the coefficients obviously depend on which variable you choose as the dependent variable (i.e., as X1). Hence, the estimates are not unique—there can be n−1 sets of coefficients. This is not a problem for n = 2, which is mostly the case for financial applications.
A more advanced cointegration test that does not have such weakness is the Johansen method (1988). Once a cointegration relationship has been established, various models can be used to determine the degree of deviation from equilibrium and the strength of subsequent corrections. For further reading, see Alexander (2008).
There are many examples of securities in the market that show highly visible correlation when plotted, but no cointegration when tested. Figure 2.23 shows the chart of Dow Jones index vs. Nasdaq index—both I(1) time series. The correlation of their log returns over the period Jan 1990 to Jun 2009 is 0.86. Using the Engle-Granger regression test, we can show that the two indices are not cointegrated. The worked-out example in Spreadsheet 2.8 also shows how the augmented Dickey-Fuller stationarity test can be performed in Excel.
FIGURE 2.23 Dow Jones versus Nasdaq (in log scale)
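The two-step Engle-Granger procedure (OLS on the levels, then a unit root check on the residuals) can be sketched as follows; the series are simulated pairs, not the actual index data, and the Dickey-Fuller style statistic here uses no lagged terms for brevity.

```python
import numpy as np

def df_t_stat(e):
    """Dickey-Fuller style t-ratio for residuals: regress de_t on e_{t-1} (no lags, no drift)."""
    de, lag = np.diff(e), e[:-1]
    beta = (lag @ de) / (lag @ lag)
    resid = de - beta * lag
    se = np.sqrt(resid @ resid / (len(de) - 1) / (lag @ lag))
    return beta / se

def engle_granger(y, x):
    """Step 1: OLS of y on x (levels). Step 2: test the residuals for stationarity."""
    b = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
    a = y.mean() - b * x.mean()
    return df_t_stat(y - (a + b * x))

np.random.seed(8)
n = 1000
common = np.cumsum(np.random.normal(0, 1, n))          # shared stochastic trend

# Cointegrated pair: both contain the same I(1) trend plus stationary noise
y1 = common + np.random.normal(0, 1, n)
x1 = 0.5 * common + np.random.normal(0, 1, n)

# Not cointegrated: two independent random walks (may still show high sample correlation)
y2 = np.cumsum(np.random.normal(0, 1, n))
x2 = np.cumsum(np.random.normal(0, 1, n))

print("cointegrated pair t-stat    :", round(engle_granger(y1, x1), 2))  # strongly negative: residuals stationary
print("independent walks t-stat    :", round(engle_granger(y2, x2), 2))  # small in magnitude: not cointegrated
```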
On the other hand, empirical research has found that different points on the same yield curve are often cointegrated. Pair trading, for example, thrives on searching for cointegrated pairs of stocks, typically from the same industry sector. Cointegrated systems are almost always tied together by common underlying dynamics.
Monte Carlo (MC) simulation is a numerical algorithm that is used extensively in finance for derivatives pricing and VaR calculation. The method involves the generation of large samples of random numbers repeatedly drawn from some known distribution. With the advent of powerful computers this method of brute force became popular because of its intuitive and flexible nature. Furthermore, the simulation method is a useful visualization device for the modeler to gain a deeper understanding of the behavior of the model. In actual implementation, the MC simulation is generally coded in more efficient languages such as C++ and R. But for reasons of pedagogy, examples here will be implemented in Excel.
In simulating a path for a time series Xt where t = 1, . . . , n, a random variable εt needs to be generated by the computer to represent the stochastic change element. This change is then used to evolve the price (or return) Xt to its value Xt+1 in the next time step. This is repeated n times until the whole path is generated. The stochastic element represents the price impact due to the arrival of new information to the market. If the market is efficient, it is reasonable to assume εt is an i.i.d. white noise.
Excel has a random generator RAND() that draws a real number between 0 and 1 with equal probability. This is called the uniform distribution. The i.i.d. random variable is typically modeled using the standard normal N(0, 1) for convenience. We can get this number easily from the inverse transformation NORMSINV(RAND()). This is the inverse function of the normal CDF. Essentially, Excel goes to the CDF (see Figure 2.6), references the probability chosen by RAND() on the vertical axis, and finds the corresponding value on the horizontal axis. Pressing the F9 (calculate) button in Excel will regenerate all the random numbers given by RAND().
In general, there are two fundamentally different classes of time series processes: stochastic trend processes and deterministic trend processes. A stochastic trend process is one where the trend itself is also random; in other words, (Xt − Xt−1) exhibits stationary randomness. One example is the so-called random walk with drift process given by:

Xt = Xt−1 + μ + εt

where the constant μ is the drift and εt is i.i.d. Clearly this is an I(1) process, as the first difference produces μ + εt, which is an I(0) stationary process. We call such a process a stochastic trend. Another example is the geometric Brownian motion (GBM) process:9

ΔXt/Xt−1 = μΔt + σ√Δt εt

where the constant μ is the annualized drift, σ the annualized volatility, εt ∼ N(0,1), and ΔXt = Xt − Xt−1. The time step is in units of years, so for a one-day step, Δt = 1/250. Hence, in GBM, we simulate the percentage changes in price; we can then construct the price series itself iteratively.
The GBM process is commonly used for derivatives pricing and VaR calculation because it describes the motion of stock prices quite well. In particular, negative prices are not allowed because the resulting prices are lognormally distributed. A random variable Xt is said to follow a lognormal distribution if its log return, ln(Xt/Xt−1), is normally distributed. Figure 2.24 shows simulated GBM paths using σ = 10%, μ = 5%. The larger μ is, the stronger the upward drift; the larger σ is, the wider the dispersion of paths.
FIGURE 2.24 Geometric Brownian Motion of 50 Simulated Paths
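A minimal Python sketch of this GBM simulation follows (assuming numpy, using the discretization ΔXt/Xt−1 = μΔt + σ√Δt εt described above, the parameters of Figure 2.24, and a hypothetical starting price of 100).

    import numpy as np

    mu, sigma = 0.05, 0.10                  # annualized drift and volatility
    dt = 1.0 / 250                          # one trading day in years
    n_days, n_paths = 250, 50
    x0 = 100.0                              # starting price level (illustrative)

    eps = np.random.standard_normal((n_days, n_paths))    # i.i.d. N(0, 1) shocks
    pct_change = mu * dt + sigma * np.sqrt(dt) * eps       # simulated percentage changes
    paths = x0 * np.cumprod(1.0 + pct_change, axis=0)      # build price paths iteratively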
The other class is the deterministic trend process, where the trend is fixed and only the fluctuations are random. It is an “I(0)+trend” process, and one simple form is:

Xt = α + μt + εt

where α is a constant, μ is the drift, and εt is i.i.d. Here (α + εt) is the I(0) component and μt is the trend.
Figure 2.25 shows a comparison of the three processes generated using −30% drift and 50% volatility. Although not obvious from the chart, the I(1) and I(0) + trend processes are fundamentally different, and this will determine the correct method used to detrend them (see next section). It is instructive to experiment with the processes in Excel (see Spreadsheet 2.9).
FIGURE 2.25 Stochastic Trend versus Deterministic Trend Processes
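To experiment with the two trend types outside Excel, here is a minimal Python sketch (assuming numpy, with drift and volatility converted to daily values; the exact construction in Spreadsheet 2.9 may differ).

    import numpy as np

    n = 250
    mu = -0.30 / 250                           # daily drift from -30% annualized
    sigma = 0.50 / np.sqrt(250)                # daily volatility from 50% annualized
    eps = sigma * np.random.standard_normal(n)
    t = np.arange(1, n + 1)

    # Stochastic trend: random walk with drift, Xt = Xt-1 + mu + eps_t
    stochastic_trend = np.cumsum(mu + eps)

    # Deterministic trend: "I(0)+trend", Xt = alpha + mu*t + eps_t
    deterministic_trend = 0.0 + mu * t + eps

Differencing the first series recovers μ + εt, which is stationary; for the second series, the fluctuations around the line μt are already stationary.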
As we shall see in Monte Carlo simulation VaR (Chapter 4), we will need to simulate returns for many assets that are correlated. The dependence structure of the portfolio of assets is contained in the correlation matrix. To generate random numbers that are correlated, the correlation matrix σ needs to be decomposed into the Cholesky matrix L. L is a lower triangular matrix (i.e., entries above the diagonal are all zero) with positive diagonal entries, such that LLᵀ = σ, where Lᵀ is the transpose of L.
We coded this function as cholesky(.) in Excel, that is, L = cholesky(σ). The code is explained in the reference book by Wilmott (2007).10 Let’s say we have n assets. First, we compute L (an n-by-n matrix). Second, an n-vector of random numbers {ε1(t), . . . , εn(t)} is sampled from an i.i.d. N(0, 1) distribution; this column vector is denoted ε(t). Finally, the n-vector of correlated random numbers r(t) is computed by simple matrix multiplication: r(t) = Lε(t). For example, to generate n correlated time series of T days, perform the above calculations for t = 1, 2, . . . , T. In Excel function notation, this is written as r = MMULT(L, ε). If the vectors are rows instead, the Excel function is written in transposed form: r = TRANSPOSE(MMULT(L, TRANSPOSE(ε))). Correlated random numbers are generated in Spreadsheet 2.10.
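The same three steps can be sketched in Python (a minimal illustration assuming numpy; the 3-by-3 correlation matrix below is an arbitrary example).

    import numpy as np

    corr = np.array([[1.0, 0.6, 0.3],
                     [0.6, 1.0, 0.5],
                     [0.3, 0.5, 1.0]])           # example correlation matrix (n = 3)

    L = np.linalg.cholesky(corr)                 # lower-triangular L with L @ L.T = corr

    T = 250                                      # number of days to simulate
    eps = np.random.standard_normal((3, T))      # i.i.d. N(0, 1), one column per day
    r = L @ eps                                  # correlated random numbers r(t) = L eps(t)

    print(np.corrcoef(r))                        # sample correlations, close to corr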
The study of financial market cycles is of paramount importance to policy makers, economists, and the trading community. Clearly, being able to forecast where we are in the business cycle can bring great fortune to a trader or allow the regulator to legislate preemptive policies to cool a bubbling economy. It is the million-dollar question. This area of research is far from settled; endeavors of such economic consequence are, by nature, often elusive.
Actual time series observed in markets are seldom stationary—they exhibit trends and cycles. The problem is that statistical estimation methods are mostly designed to handle stationary data. Hence, in time series analysis, we need to decompose the data so that we obtain something stationary to work with. Figure 2.26 illustrates a stylized time series decomposition. The classical decomposition breaks an observed price series xt (t = 1, 2, . . .) into three additive components:
FIGURE 2.26 Stylized Classical Decomposition
xt = lt + st + zt

where lt is the long-term trend, st the cycle (or seasonality), and zt the noise (or irregular) component. Of the three components, zt is assumed to be stationary and i.i.d. There is an abundance of research in this area of seasonal decomposition; for a reference book, read Ghysels and Osborn (2001).
A general approach to time series analysis follows these steps:
This is illustrated in Figure 2.27.
FIGURE 2.27 Time Series Forecast Approach
There are many ways to detrend the data in step 2. If the time series is a stochastic trend process, then the correct approach is to take differences. There is considerable evidence suggesting that most financial market prices are I(1) stochastic trend processes.
On the other hand, if the data follow a deterministic trend (“I(0)+trend”) process, then the correct approach is to take deviations from a fitted trend line (the decomposition approach). Note that if we do this to an I(1) process, the resulting deviation from the trend line may not be stationary. Nevertheless, this is common practice among forecasters, and the issue remains an open debate. Beveridge and Nelson (1981) found that if the trend is defined as the long-run forecast, then decomposition of an I(1) process can lead to a stationary deviation.
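Both detrending approaches can be sketched in a few lines of Python (assuming numpy; the random walk below is only a stand-in for an observed price series).

    import numpy as np

    x = 100 + np.cumsum(np.random.standard_normal(500))   # stand-in I(1) price series

    # Stochastic trend (I(1)): detrend by taking first differences
    diffs = np.diff(x)

    # Deterministic trend ("I(0)+trend"): detrend by deviation from a fitted trend line
    t = np.arange(len(x))
    slope, intercept = np.polyfit(t, x, 1)       # OLS fit of a linear trend
    deviation = x - (intercept + slope * t)      # may not be stationary if x is truly I(1)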
As a typical example, the Berlin procedure, developed by the Federal Statistical Office of Germany, models the long-term trend as a polynomial and the cycle as a Fourier series. It assumes that the stochastic trend follows a highly autocorrelated process, such that its realizations are smooth enough to be approximated by a low-order polynomial (of order p):

lt = α0 + α1t + α2t^2 + . . . + αpt^p

The cycle component is assumed to be weakly stationary and to change slowly with time, so that it can be approximated using a finite Fourier series:

st = β1 cos(λ1t) + γ1 sin(λ1t) + . . . + βk cos(λkt) + γk sin(λkt)
We are dealing with 250 daily observations per year, so λ1 = 2π/250 and λi = iλ1, with i = 1, 2, . . . , k, are the harmonics of λ1. We can use least squares minimization11 to estimate the coefficients αi, βi, and γi. Spreadsheet 2.11 shows the decomposition of the S&P 500 index daily closing prices using a cubic polynomial (p = 3) and a Fourier series (k = 12). The result is shown in Figure 2.28.
FIGURE 2.28 Classical Decomposition of S&P 500 Index
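A rough Python sketch of the same least-squares fit follows (assuming numpy; a random walk stands in for the actual index, p = 3 and k = 12 as in Spreadsheet 2.11, and the real Berlin procedure is considerably more elaborate).

    import numpy as np

    prices = 1000 + np.cumsum(np.random.standard_normal(2500))   # stand-in for index closes

    n, p, k = len(prices), 3, 12
    t = np.arange(1, n + 1)
    ts = t / n                                   # rescaled time, for numerical stability
    lam1 = 2 * np.pi / 250                       # fundamental frequency (250 obs per year)

    # Design matrix: polynomial terms, then cosine/sine harmonics
    cols = [ts**i for i in range(p + 1)]
    for i in range(1, k + 1):
        cols += [np.cos(i * lam1 * t), np.sin(i * lam1 * t)]
    A = np.column_stack(cols)

    coef, *_ = np.linalg.lstsq(A, prices, rcond=None)             # least-squares estimates
    trend = A[:, :p + 1] @ coef[:p + 1]                           # polynomial trend l_t
    cycle = A[:, p + 1:] @ coef[p + 1:]                           # Fourier cycle s_t
    noise = prices - trend - cycle                                # irregular component z_t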
When statisticians speak of a forecast, they mean estimating the most statistically likely distribution at a future horizon (typically the next day). A risk manager will be interested in the four moments of such a distribution and its quantile (the VaR), but such forecasts of uncertainty (or risk) are of lesser importance to traders, who are more concerned with the trend, cycle, and direction of the market. There is a fundamental difference between forecasting the distribution and forecasting the direction. Statistics is good for distributional forecasts but poor for directional forecasts. The problem is that estimation methods are mostly designed to handle stationary data; during the detrending process, valuable information on the trend (or anything other than noise) is lost.
Many analysts and researchers still perform analysis of direction, trends, and cycles using technical analysis and component models, but things become murky once we deal with nonstationary data. We no longer have the law of large numbers on our side, precision is lost, and we can no longer state our results at a high confidence level. Furthermore, the model is not identifiable; for example, there are many ways to decompose a time series into its three components.
Table 2.3 compares the two schools of forecasting. In Chapter 13 we will propose a new interpretation of decomposition that will incorporate directional elements into VaR forecasts.
TABLE 2.3 Distributional Forecast versus Directional Forecast
Type | Distributional Forecast | Directional Forecast |
Forecast | Distributions, moments, risks, quantiles | Trend, cycles, direction |
Nature of time series | Stationary | Nonstationary |
Results | High precision subject to model error | Nonunique solutions |
Users | Risk managers, statisticians | Traders, economists, analysts, policy makers |
Aside from OLS and MLE methods, another statistical estimation method that is increasingly useful in risk management is the quantile regression model (QRM) introduced by Koenker and Bassett (1978).
The beauty of QRM is that it is model free in the sense that it does not impose any distributional assumptions; the only requirement is that the observations are i.i.d. In contrast, OLS estimation relies on stronger assumptions (classically, normally distributed errors) for its estimates and inference to be reliable. Furthermore, OLS only produces an estimate of the mean, or location, of the regression line for the whole sample. It assumes that the regression model is appropriate for all the data, and thus ignores the possibility of different degrees of skewness and kurtosis at different levels of the independent variable (such as at extreme data points).
But what if there is internal structure in the population data? For example, the mean salary of the population (as a function of, say, age) conveys only limited information because structure in the data may be hidden, such as segmentation into certain salary quantiles driven by differences in industry type and income tax bracket. QRM can be used to study these internal structures.
The intuition of QRM is easy to understand if we view quantiles as solutions to optimization problems. Just as the mean can be defined as the solution to the problem of minimizing the sum of squared residuals (RSS), the median (or 0.5-quantile) is the solution to minimizing the sum of absolute residuals. Referring to the linear regression equation (2.21), taking its expectation and noting that E(εt) = 0 gives equation (2.49), the conditional mean (i.e., conditional on X):

Conditional mean: E(Y|X) = α + βX     (2.49)

Given a sample, the OLS solution (α̂, β̂) will provide an estimate Ê(Y|X), shown as the thin (middle) line in Figure 2.29. This best-fit line is drawn such that the residuals above and below the line offset. Likewise, to estimate the median, we draw a line such that the sums of absolute residuals above and below the line offset. We can extend this idea to estimate the q-quantile by imposing different weights on the residuals above and below the line. Suppose q = 0.05. We draw the line such that the 0.05-weighted absolute residuals above the line offset the 0.95-weighted absolute residuals below the line. These lines are shown in Figure 2.29 for q = 0.05 and 0.95.
FIGURE 2.29 Quantile Regression Lines
Mathematically, the quantile regression line (for a chosen q) can be found by estimating α and β to minimize the weighted sum in equation (2.50):

min over α, β of Σt [ q |yt − α − βxt| I(yt ≥ α + βxt) + (1 − q) |yt − α − βxt| I(yt < α + βxt) ]     (2.50)

where the indicator function I(.) is given by: I(A) = 1 if the condition A holds, and I(A) = 0 otherwise.

The newly estimated coefficients α̂ and β̂ depend on the choice of q (a different q leads to a line with a different slope β and intercept α). Note that they are not parameter estimates for the linear regression (2.21) but rather for the (linear) quantile regression model:
Conditional quantile: F−1(q|X) = α + βX + Fε−1(q)     (2.51)
where F−1(.) represents the quantile function; recall a quantile is just the inverse CDF (see Section 2.3). So the conditional q-quantile F−1(q|X) is derived by taking the inverse of F(Y|X); that is, you can get equation (2.51) simply by taking the quantile function of equation (2.21). The last term is nonzero because the i.i.d. error εt does not have zero quantile (even though it has zero mean).
Since VaR is nothing but a quantile, (2.51) basically gives the VaR of Y conditional on another variable X, which could be anything, such as GDP, a market index, or even a function of several variables. This conditional VaR is a powerful result and will be exploited in Sections 5.3 and 12.3. However, QRM is not entirely model free, in the sense that we still need to assume (2.51) is linear and to specify what X is.
Spreadsheet 2.12 is a worked-out example of QRM estimation using the Excel Solver. Here, we assume Y represents the returns of a portfolio and X the change in some risk sentiment index.12 Hence, the estimated conditional quantile F−1(q|X) is the c% confidence level VaR conditional on X (where q = 1 − c by convention). In other words, we assume there is some relationship between our portfolio’s VaR and the variable X (perhaps we have used X as a market-timing tool for the portfolio) and would like to study the tail loss behavior. We then use QRM to estimate the quantile loss (or VaR) at different q for different changes in X.
The result is shown in Figure 2.30. A negative return denotes a loss. Notice that, for the same line (i.e., the same change in the index), the higher the VaR confidence level, the larger the loss, as expected. But the plot reveals an interesting structure in the tail loss: losses tend to be even higher when the risk sentiment index shows large positive changes (large X). This shows that the index is doing what it is supposed to do: predicting risky sentiment.
FIGURE 2.30 Structure of Quantile Loss Conditional on X
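Outside Excel, the same estimation can be sketched by minimizing the weighted absolute residuals of equation (2.50) numerically. This is a minimal Python illustration assuming numpy and scipy; the portfolio returns y and index changes x below are synthetic stand-ins, not the data of Spreadsheet 2.12.

    import numpy as np
    from scipy.optimize import minimize

    # Synthetic stand-ins for portfolio returns y and risk-index changes x
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1000)
    y = -0.01 * x + 0.02 * rng.standard_normal(1000)

    def check_loss(params, q, x, y):
        # Weighted absolute residuals of equation (2.50)
        a, b = params
        u = y - a - b * x
        return np.sum(np.where(u >= 0, q * u, (q - 1) * u))

    q = 0.05                                     # the 5% quantile, i.e., the 95% VaR level
    res = minimize(check_loss, x0=[0.0, 0.0], args=(q, x, y), method="Nelder-Mead")
    a_q, b_q = res.x
    print("conditional 5%-quantile line: %.4f + %.4f x" % (a_q, b_q))

Repeating the minimization for different values of q traces out a family of quantile lines like those in Figure 2.29.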
1. Plight of the Fortune Tellers: Why We Need to Manage Financial Risk Differently by Riccardo Rebonato (2007) challenges the frequentist paradigm and advocates the Bayesian alternative. The book introduces (without using equations) potential use of subjective probability and decision theory in risk management.
2. To be precise, the mean does not depend on t, and the autocovariance COV(Xt, Xt + k) is constant for any time lag k.
3. A finite distribution is one whose variance does not go to infinity.
4. Unless mentioned otherwise, in this book we will retain the negative sign for the VaR number to denote loss (or negative P&L), to follow the convention used in most banks.
5. For an N-variate joint distribution, imagine an N-dimensional object that, when projected onto any two-dimensional plane, casts an elliptical shadow.
6. For example, rank (0.5, 0.2, 0.34, −0.23, 0) = (1, 3, 2, 5, 4).
7. Possible exceptions are if the price itself is highly stationary such as the VIX index and implied volatilities observed in option markets.
8. At higher levels, it will be easier to accept the null hypothesis, but there is a risk of accepting a false null hypothesis (Type II error). At lower levels, it will be easier to reject the null hypothesis (i.e., our tolerance is more stringent), but there is a risk of rejecting a true null hypothesis (Type I error). A 95% confidence level is commonly accepted as a choice that balances the two types of error. Note: Do not confuse this with the VaR’s confidence level, which is just one minus the quantile.
9. Brownian motion was originally used in physics to describe the random motion of particles suspended in a fluid.
10. The code for Cholesky decomposition is widely available from the Internet and should not distract us from the main discussion here.
11. The actual estimation in the Berlin procedure is more sophisticated than what is described here. For more information, the avid reader can refer to the procedural manual and free application software downloadable from its website www.destatis.de/.
12. Risk sentiment (or risk appetite) indices are created by institutions for market timing (for trading purposes) and are typically constructed as some average of the VIX, FX option implied volatilities, bond-swap spreads, Treasury yields, and so on. The indicator is supposed to gauge fear or risk perception. Although such indices have not entered mainstream risk management, they are slowly being applied in risk research.