This chapter is included for the sake of readers who are not familiar with econometrics and the modeling techniques that are commonly used in finance and financial economics. Needless to say, the following treatment is not meant as a substitute for one of the several excellent books on the subject. Our limited aims are:
Here we only deal with modeling; issues related to estimation are deferred to Chapter 4, which has similar objectives and limitations. In modeling, one typically works with parameters, expected values, variances, etc. In estimation, one typically works with their sample counterparts, like sample mean and sample variance. However, since it is often useful to illustrate ideas by simple but concrete examples, we will occasionally use sample counterparts of probabilistic concepts. In doing so, we will rely on readers’ intuition and, possibly, some background in elementary inferential statistics, as well as the availability of easy to use R functions; more advanced concepts and issues are dealt with in the next chapter.
Within this framework, the choice of topics is admittedly open to criticism. We cover some rather basic issues related to probability distributions, together with some more advanced ones, such as copulas and stochastic differential equations. However, the coverage of some standard tools in econometrics, such as multiple linear regression models, will leave much to be desired. Our choice is deliberate and related to what one needs in order to appreciate Monte Carlo techniques.
In Section 3.1 we use elementary examples to motivate our study and to further illustrate the differences between static, discrete-time, and continuous-time models. Even these deceptively simple examples raise a number of issues that may be addressed by rather sophisticated tools. Section 3.2 deals with rather basic topics, as we outline a few essential (univariate) probability distributions that are commonly used in finance. Probably, many readers are quite familiar with the content of this section, which may be skipped and referred back when needed. In order to make the section as useful as possible, we take advantage of it by introducing some R functions to deal with probability distributions, including random sample generation; the technicalities involved in the generation of pseudorandom samples from these distributions is deferred to Chapter 5. We move on to deal with multivariate distributions and dependence in Sections 3.3 and 3.4. In the former section we outline multivariate distributions, including a brief review of standard concepts related to covariance and Pearson’s correlation. We also cover multivariate extensions of normal and t distributions, but the main aim of the section is to point out the limitations of these standard approaches, paving the way for more advanced ideas revolving around copula theory, which is the subject of Section 3.4. This section is a bit more advanced and also covers alternative correlation measures, such as Spearman’s rho and Kendall’s tau; we also outline such concepts as tail dependence and independence, which are relevant for risk management. Section 3.5 is a bit of an interlude, where we cover linear regression models within a probabilistic framework. 
Usually, linear regression is associated with the statistical estimation of linear models;1 here we insist on the probabilistic view of regression as an orthogonal projection, laying down the foundations for variance reduction by control variates2 and model complexity reduction by factor models. Regression models are related and contrasted with time series models, which are described in Section 3.6, within the framework of discrete-time models. We should note that a great deal of work has been carried out on representing, estimating, and simulating time series models in R, which also includes a specific time series object. Continuous-time models, based on stochastic differential equations, are introduced in Section 3.7. These models are most useful in financial engineering applications, but they require some background in stochastic calculus and stochastic integration. Here we introduce the basic concepts, most notably Itô’s lemma, without paying too much attention to mathematical rigor. In Section 3.8 we outline two approaches that can be used to streamline a model, namely, principal component analysis and factor models. These multivariate analysis tools may be useful from both a model estimation and a simulation perspective, since they reduce data and computational requirements. Finally, in Section 3.9 we rely on Itô’s lemma to develop the powerful tools of risk-neutral option pricing. This section may obviously be skipped by readers already familiar with the topic, as well as by readers who are not interested in option pricing.
In Chapter 1 we considered the difference between Monte Carlo sampling and simulation. In the former case, we are essentially approximating an integral, which is related to an expectation; in the latter case, we have to simulate dynamics over time. We have also seen that, in principle, any dynamic simulation can be considered as the estimation of the integral of a possibly quite complicated function. Hence, the line between sampling and simulation is not so sharp at all, from a Monte Carlo viewpoint. From a modeling viewpoint, however, the issues involved may be quite different. In this section we outline three related financial modeling problems:
The best known static portfolio optimization model is certainly the mean–variance model of Markowitz. In the basic version of this model, time is disregarded and issues related to rebalancing are neglected. To build the model, we need to characterize the joint distribution of asset returns over the investment time horizon, which involves subtle issues, particularly from a risk management perspective.
More realistic models, possibly involving asset and liabilities, transaction costs, etc., require the introduction of time. We may accomplish the task within a discrete- or a continuous-time framework. The former approach typically requires the machinery of time series analysis, whereas the latter one relies on stochastic calculus concepts. When time enters the stage, we may face the task of representing intertemporal dependence. Actually, if we believe the efficient market hypothesis (EMH), we may avoid the task. EMH can be stated in different, more or less stringent forms. The bottom line is that it is impossible to beat the market by stock picking or buy/sell timing. From our point of view, the essential implication is that there is no intertemporal dependence in returns over different time periods. On the contrary, if we have a different view or we are modeling something else and the EMH is not relevant, we are faced with the task of modeling some form of dependence over time.
In econometrics, to categorize the issues a modeler faces when dealing with dependence, the following model classification is used:
One of the main tasks of this chapter is to illustrate a few of these models, pointing out the related issues.
To build a single-period portfolio optimization problem, we need the following ingredients:
In principle, the microeconomics of uncertainty offers a way to represent investor’s preferences, in the form of a utility function u(·) mapping a monetary amount into a real number. Utility theory posits that we should maximize the expected utility of wealth WT, a random variable, at the end of the holding period T. As we did in Example 1.1, we denote the initial asset price by Pi0 and the number of shares of asset i held in the portfolio by hi; for the sake of simplicity, here we assume that the number of shares is large enough to be approximated by a continuous variable, so that we avoid integrality restrictions. If we neglect transaction costs and tax issues, the problem can be written as
(3.1)
where PiT is the asset price at the end of the holding period. This model looks easy enough, but there are two serious difficulties:
Computing the expectation is a technical problem, which may be tackled by the methods described in the rest of this book. However, defining a utility function is a much more serious issue. In fact, despite their popularity in most economic literature, utility functions are not easily applicable to real-life financial optimization. On the one hand, even a sophisticated investor would find it hard to assess her utility function; on the other hand, which utility function should a professional fund manager use, when reporting to her many clients? Should she consider her own attitude toward risk or that of her wealthiest client? A common approach is to resort to a risk measure, related to uncertainty in future wealth. We will have more to say about risk measures later, but since they should capture uncertainty, a seemingly natural measure is standard deviation. Furthermore, it may also be convenient to get rid of wealth W0 and deal with a dimensionless model in terms of portfolio weights, i.e., the fractions of total wealth allocated to each asset, and asset returns.3 To be more specific, if the assets we are considering are stock shares, we should also consider any dividend that is paid within the planning horizon. If asset i pays a dividend Di during the holding period, its (rate of) return can be written as
Note that the return can be negative, but it cannot be less than −1, since stock shares are limited liability assets and the worst price that we may observe in the future is PiT = 0. To be precise, this definition does not account for the time value of money; exactly when the dividend is paid does make a difference if the time period is long enough and we consider the possibility of reinvesting the dividend. For the sake of simplicity, let us neglect this issue.
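As a quick numerical check of this definition (the price and dividend figures below are made up for illustration):

```r
# Hypothetical data for one asset over the holding period
P0 <- 100   # initial price Pi0
PT <- 104   # terminal price PiT
D  <- 2     # dividend paid during the period

# Rate of return, including the dividend
R <- (PT + D - P0) / P0
R   # 0.06, i.e., a 6% return
```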
Now, what about the joint distribution of returns? If we assume that the investor only cares about expected value and standard deviation of portfolio return, we may streamline our task considerably. Let:
If we denote the wealth fraction allocated to asset i by wi, so that Σi wi = 1, we immediately find that the portfolio return Rp is
with expected value
where w is the column vector collecting portfolio weights and the superscript T denotes transposition. The variance of portfolio return is given by
In the classical Markowitz portfolio optimization model, a mean–variance efficient portfolio is found by minimizing risk subject to a constraint on expected portfolio return:
where αT is a minimal target for expected return and e = [1, 1,…, 1]; non-negativity constraints on portfolio weights may be added if short sales are forbidden. Note that we minimize variance, rather than standard deviation, for the sake of computational convenience. Doing so, the solution of the optimization problem does not change, but we have to solve a (convex) quadratic programming problem, for which extremely efficient algorithms are widely available.
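As a minimal illustration in base R, we may compute the global minimum-variance portfolio, in which the expected return target is dropped and only the budget constraint is kept; in this special case the solution is available in closed form as Σ⁻¹e, suitably normalized, so no quadratic programming solver is needed (the covariance data below are made up):

```r
# Made-up covariance matrix for three assets
Sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), nrow=3, byrow=TRUE)
e <- rep(1, 3)

# Closed-form global minimum-variance weights: Sigma^{-1} e, normalized
w <- solve(Sigma, e) / as.numeric(t(e) %*% solve(Sigma, e))
sum(w)                                        # budget constraint: weights add up to 1
portVar <- as.numeric(t(w) %*% Sigma %*% w)   # resulting portfolio variance
```

Solving the full model, with the expected return constraint and possibly non-negativity restrictions, requires a quadratic programming solver, such as the one provided by the quadprog package.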
In fact, this new model looks remarkably simple, and there is no need for Monte Carlo methods for its solution. Furthermore, there is no need to specify a full-fledged joint return distribution, as first- and second-order moments are all we need. However, severe difficulties have just been swept under the rug. To begin with, how can we estimate the problem data? The covariance matrix consists of many entries, and there is no hope of estimating them all in a reliable way. To see why, assume that we have to deal with n = 1000 assets. Then, the covariance matrix consists of one million entries. Actually, given symmetry, “only”

n(n + 1)/2 = 500,500

entries need be specified. This is a formidable task, as we would need a very long history of returns to obtain reliable estimates. Unfortunately, even if we collect many years of data, assuming that they are available for all of the assets that we are considering for inclusion in our portfolio, few of them are actually relevant, since market conditions have changed over the years. This is why we need to resort to some data reduction strategies, as described in Section 3.8.2.
Even if we assume that we have perfect knowledge of the involved parameters, tough issues are to be considered:
We see that even specifying a single-period, static model is not trivial at all. No approach based on Monte Carlo sampling will yield useful results, if the underlying uncertainty model is not credible.
In many applications a static and single-period model is inappropriate, as we need to represent a sequence of decisions to be made in the face of possibly complex dynamics of stochastic risk factors. Here we illustrate a discrete-time framework for a stylized model extending the consumption–saving problem of Section 1.4.3. The task is to choose the fraction of current wealth that is consumed, and how the remaining amount saved is allocated between a risky and a risk-free asset.6 For strategic decision problems like this one, an appropriate time bucket could be one year; in other cases much higher frequencies are needed. Whatever time period we choose, in any discrete-time model it is essential to specify the sequence of events very clearly, as this defines the information that is actually available when making a decision. We consider time instants and time buckets (or periods):
The rules of the game in our problem are:
Note that we are considering two sources of uncertainty here:
The first source of uncertainty can be represented using the concepts from the previous section, if we do not want to consider any predictability in asset returns. However, it is likely that the labor income Lt is not quite independent of Lt−1; therefore, we must choose some way to represent both predictable and unpredictable (shock) components, as we show below.
Given a decision αt, the rate of return of the investment portfolio over the next time bucket is
where Rt+1 − rf is typically referred to as excess return. Therefore, wealth at time instant t + 1 is
(3.2)
We emphasize again that the risky asset return Rt+1 is not known when making the asset allocation decision, whereas the risk-free rate is known. Actually the risk-free return should also be modeled as a stochastic process, as the level of interest rates does change over time because of changing economic conditions and the corresponding monetary policy of central banks. Hence, we should denote this rate as rft. The main difference with respect to the risky return Rt+1 is that, at the beginning of a time bucket, we do know the risk-free rate for that time bucket; however, we do not know the interest rates for the subsequent time buckets. So, rft is a predictable stochastic process. For the sake of simplicity, let us assume a constant interest rate rf. We might also wish to model correlations between the two returns. After all, we know that when central banks change the level of interest rates, stock exchanges are indeed affected.
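To make the sequence of events concrete, the following sketch simulates a wealth path in the spirit of Eq. (3.2), under simplifying assumptions of our own: a constant allocation α, a constant risk-free rate, a fixed fraction of wealth consumed at each time instant, and i.i.d. normal risky returns (all numerical values are made up):

```r
set.seed(55555)
T <- 10                  # number of time buckets
rf <- 0.03               # constant risk-free rate
alpha <- 0.6             # fraction of savings allocated to the risky asset
consFrac <- 0.05         # fraction of wealth consumed at each time instant
W <- numeric(T + 1)
W[1] <- 100              # initial wealth
R <- rnorm(T, mean = 0.07, sd = 0.2)   # i.i.d. risky returns
for (t in 1:T) {
  S <- (1 - consFrac) * W[t]           # amount saved after consumption
  # the saved amount earns the portfolio return rf + alpha*(R - rf)
  W[t + 1] <- S * (1 + rf + alpha * (R[t] - rf))
}
W
```

Note that the returns are sampled before the loop only for convenience; in the model, R[t] is revealed after the decision at time t is made, and the code merely evaluates a fixed policy along one scenario.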
To model labor income, we really must be more specific and decompose Lt in components. A model proposed, e.g., in [4, Chapter 7], is
where
The difference between the two random components is that the transitory random term should have an impact in one year, and then it should leave no effect, whereas permanent shocks have to be cumulated, or integrated, if you prefer, over time. This can be represented as a random walk:
where, for instance, ηt is a normal random variable with expected value zero and variance ση². By a similar token, one could assume εt ~ N(0, σε²). Note that the expected value of a shock is zero, since any predictable component should be included in the deterministic component. In simple models, ηt and εt are assumed independent over time.
Now, we should see a problem with an additive decomposition like the one in Eq. (3.3). If shocks are normally distributed, in principle, income could become negative. Furthermore, using the same volatilities (standard deviations) for all levels of income is questionable. One way to overcome these issues is to resort to a multiplicative decomposition. We might just reinterpret Eq. (3.3) in terms of log-income, defined as lt ≡ log Lt.7 In other words, we posit
(3.4)
Since we take an exponential of a normal, a negative shock will be transformed into a multiplicative shock less than 1, i.e., a reduction in income. We will also see, in Section 3.2.3, that when we take the exponential of a normal random variable, we obtain a lognormal random variable.
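The following sketch, with made-up parameter values, simulates the multiplicative model: shocks are generated on the log scale, so that income, obtained by exponentiation, stays positive:

```r
set.seed(55555)
T <- 40
eta <- rnorm(T, mean = 0, sd = 0.05)   # permanent shocks
eps <- rnorm(T, mean = 0, sd = 0.10)   # transitory shocks
trend <- log(30000) + 0.02 * (1:T)     # deterministic component of log-income
z <- cumsum(eta)                       # random walk integrating permanent shocks
l <- trend + z + eps                   # log-income
L <- exp(l)                            # income: always positive
all(L > 0)
```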
Finally, if we associate a utility function u(·) with consumption, the overall objective could be
where β ∈ (0, 1) is a subjective discount factor. Thus, we face a dynamic, stochastic optimization problem, which cannot be solved analytically in general; Monte Carlo and related methods play a key role in the solution of these models, as well as in the simulation of decisions over time to analyze the resulting policy. In Section 3.6 we will analyze a few more discrete-time models that can be used to represent dependency over time, changes in volatility, etc.
Time discretization may look like an approximation of an intrinsically continuous time. This need not be the case, actually: If decisions are made periodically, then a discretized time bucket corresponding to decision periods is all we need. In other cases, higher frequencies must be accounted for. We may then take smaller and smaller time periods in order to approximate continuous time. Needless to say, this increases the computational burden considerably and a trade-off must be assessed between modeling needs and computational tractability. A possibly surprising fact is that if we take the process to the limit and consider continuous-time models, we may actually ease many issues and come up with tractable models. Furthermore, when we consider trading on exchanges and the actual flow of information, we have to accept the fact that a more faithful representation does require continuous time. This can help in building a theoretically sound framework, even if numerical methods, most notably Monte Carlo methods, are to be employed when models cannot be solved analytically. Indeed, continuous-time models rely on differential equations, which may look tougher than discrete-time difference equations, especially since we have to cope with stochastic differential equations. Stochastic models are in fact needed to represent uncertainty in stock prices, interest rates, foreign exchange rates, etc. Nevertheless, it seems wise to introduce the concept in the simplest deterministic case of some practical relevance, the investment in a safe bank account. We follow here the same path that we have followed in Section 1.3.2, providing some more motivation and details.
Consider a risk-free bank account. If we invest an initial amount B(0) at time t = 0, the annual interest rate is rf, and interest is earned once per year, our wealth after one year will be
However, if semiannual interest rf/2 is paid twice a year and is immediately reinvested, we get
This new amount, if the rate rf is the same, is a bit larger than the value in Eq. (3.5), since we earn interest on interest. In such a case, we speak of semiannual compounding. More generally, if interest is paid m times per year, we have
and it is easy to see that if m goes to infinity, we end up with the limit
Such a continuously compounded interest does not really exist in financial markets, even though daily compounding is essentially the same as continuous-time compounding. Nevertheless, it simplifies considerably several financial calculations and it paves the way for continuous-time models.
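The convergence of compound interest to the continuous limit is easy to check numerically:

```r
rf <- 0.05                             # annual interest rate
m <- c(1, 2, 4, 12, 52, 365)           # compounding frequencies per year
growth <- (1 + rf/m)^m                 # one-year growth factors
cbind(m, growth)
exp(rf)                                # continuous-compounding limit
```

With daily compounding (m = 365), the growth factor already matches exp(rf) to about five decimal digits.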
It is useful to describe all of the above using a dynamic model. So, let us assume that interest is continuously earned and instantaneously reinvested. Over a short time interval (t, t + δt), we have the following change in the bank account:
If we let δt → 0, we obtain the differential equation
Many readers may be more familiar with the form
There are a couple of reasons why we might prefer the differential form of Eq. (3.7). First, a slight manipulation shows a way to solve it. If we rewrite the equation as
and we recall the form of the derivative of a composite function, we may recognize that a logarithm is involved:
Now we may integrate the differentials over the time interval [0, T]:
which can be easily rewritten as
which is consistent with Eq. (3.6).
There is another reason why the form (3.7) is preferred to (3.8): It can be extended to cope with stochastic factors. Suppose that we are investing in a risky asset, rather than a risk-free bank account. We need a model for the asset price, accounting for both drift (related to expected return) and volatility (related to random variability in the price). The drift component is essentially the same as in Eq. (3.7), with rf replaced by some coefficient, say, μ. However, we need to add a suitable source of randomness. A common building block is the Wiener process, denoted by W(t). This is a continuous-time stochastic process, which essentially means that for every value of the continuous parameter t, we have a random variable W(t): In general we may define a stochastic process as a time-indexed collection of random variables X(t) or Xt, where time can be continuous or discrete. To include this noise term in the equation, we should first consider increments in the stochastic process over a small time interval (t, t + δt),
The Wiener process is a Gaussian process, in the sense that δW(t) ~ N(0, δt).8 Please note that the variance of δW(t) is δt, and thus its standard deviation is √δt. Then we take the limit for δt → 0, which yields the infinitesimal increment dW(t). Adding this term, multiplied by a volatility factor σ, to Eq. (3.7) yields the following stochastic differential equation:
Later, we will see that this defines an important process called geometric Brownian motion. Actually, we have skipped over some challenging issues, since “taking the limit” for δt → 0 is quite delicate when random variables are involved, as one must refer to some stochastic convergence concept. Furthermore, the Wiener process is a very “irregular” and noisy process, with sample paths that are not differentiable. We will deal with sample path generation later, but you may have a peek at Fig. 3.39 to see what we mean. Hence, taking the usual derivative dW(t)/dt is not possible, which justifies the use of this form. In Section 3.7 we will outline the fundamental tools of stochastic calculus, and we will learn that Eq. (3.10) is actually a shorthand for an integral equation:
FIGURE 3.39 A sample path of the standard Wiener process.
(3.11)
This looks superficially like the integral in Eq. (3.9), but it relies on the concept of Itô stochastic integral, to be discussed in Section 3.7. The differential form, however, is more intuitive and paves the way for the development of several interesting processes.
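Although a rigorous treatment is deferred to Section 3.7, a naive Euler discretization already allows us to generate an approximate sample path of Eq. (3.10): we replace dW(t) by a finite increment δW ~ N(0, δt) over a small step (the parameter values below are made up):

```r
set.seed(55555)
mu <- 0.1; sigma <- 0.3        # drift and volatility coefficients
S0 <- 100                      # initial asset price
nSteps <- 250; dt <- 1/nSteps  # one year, in small steps
dW <- rnorm(nSteps, mean = 0, sd = sqrt(dt))  # Wiener increments
S <- numeric(nSteps + 1)
S[1] <- S0
for (i in 1:nSteps)            # Euler step: dS = mu*S*dt + sigma*S*dW
  S[i + 1] <- S[i] + mu * S[i] * dt + sigma * S[i] * dW[i]
plot(S, type = 'l')
```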
For instance, let us consider a model for stochastic interest rates. Can we model an interest rate r(t) by geometric Brownian motion? The answer is no, and to see why, let us wonder about the solution of Eq. (3.10) when σ = 0. When there is no volatility, the equation boils down to Eq. (3.7), whose solution is an increasing exponential. If there is a volatility component, intuition suggests that the solution is an exponentially increasing trend, with some superimposed noise. Financial common sense suggests that interest rates do not grow without bound, and tend to oscillate around a long-term average level r̄. To model such a mean reversion, we may use the following equation:
(3.12)
which is an example of the Ornstein–Uhlenbeck process, to be discussed later in Section 3.7.5. The intuition is that the drift can be positive or negative, pulling back r(t) toward its long-term average level. It is worth mentioning that, in practice, continuous-time models are the rule in financial engineering.
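A rough Euler discretization of Eq. (3.12) shows mean reversion at work; the speed coefficient and the other values below are made up:

```r
set.seed(55555)
gamma <- 2                    # speed of mean reversion (made-up drift coefficient)
rbar <- 0.04                  # long-term average level
sigma <- 0.01                 # volatility
nSteps <- 500; dt <- 1/250
r <- numeric(nSteps + 1)
r[1] <- 0.08                  # start well above the long-term level
for (i in 1:nSteps)           # Euler step: dr = gamma*(rbar - r)*dt + sigma*dW
  r[i + 1] <- r[i] + gamma * (rbar - r[i]) * dt + sigma * sqrt(dt) * rnorm(1)
mean(tail(r, 100))            # hovers near rbar = 0.04
```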
Most readers are quite familiar with the distributions that we outline in the following and can safely skip the section, possibly referring back when needed. The treatment is limited and just meant to be a refresher, but in order to make the section as useful as possible to the widest readership, we also take the review as an excuse to introduce R functions to deal with probability distributions; we will also use functions to produce some simple plots, and R scripts to illustrate both R programming and some specific properties of distributions. As a general rule, four functions are associated with each standard distribution in R. Given a distribution distr, R provides us with:
Note how the prefixes p, d, q, and r characterize each function. The p is somewhat confusing, as “probability mass” could be associated with discrete distributions, whereas in R the p only refers to cumulative probabilities; the choice of the prefix d recalls the term “density” and could possibly reflect some bias toward continuous distributions.
We may illustrate these four functions with reference to a continuous uniform variable on the interval [a, b], which features a constant density function on its support:
A commonly used notation to state that a random variable X has this distribution is X ~ U[a, b], or X ~ U(a, b). We also recall that
The following R snapshots illustrate the related R functions:
> a=5
> b=15
> dunif(8,min=a,max=b)
[1] 0.1
> punif(8,min=a,max=b)
[1] 0.3
> dunif(3,min=a,max=b)
[1] 0
> punif(3,min=a,max=b)
[1] 0
> dunif(18,min=a,max=b)
[1] 0
> punif(18,min=a,max=b)
[1] 1
Note how qunif and punif are inverses of each other:
> punif(9,min=a,max=b)
[1] 0.4
> qunif(0.4,min=a,max=b)
[1] 9
The following snapshot will probably give different results on your computer, as random samples depend on the state of the (pseudo)random number generator:
> runif(5,min=a,max=b)
[1]  8.434857  5.190168 10.679197  5.220115 10.036124
This is why, throughout this book, we will use the command set.seed(55555) to set the state of the generator to a specific state, thus allowing replication of results. We defer to Chapter 5 the technicalities involved in the generation of pseudorandom numbers and variates. By the way, while the uniform distribution does not look too interesting in itself, it is fundamental in Monte Carlo simulation as all distributions are sampled by suitable transformations of U(0, 1) variables.
In the following sections we illustrate a few common univariate distributions, both discrete and continuous, without any claim of exhaustiveness. Some of the distributions that we describe here have a multivariate counterpart, to be discussed in Section 3.3.
The binomial distribution is a discrete distribution built on top of a Bernoulli random variable. The mechanism behind a Bernoulli variable is carrying out a random experiment, which can result in either a success, with probability p, or a failure, with probability 1 − p. If we assign to the random variable X the value 1 in case of a success, and 0 in case of a failure, we obtain the PMF:
For a Bernoulli variable, denoted by X ~ Ber(p), we can easily calculate the expected value,
and the variance,
The formula for variance makes intuitive sense: The variance is zero for p = 1 and p = 0 (there is no uncertainty on the outcome of the experiment), and it is maximized for p = 1/2, where we have the largest uncertainty about the outcome.
When needed, we may use any other value, rather than 0 and 1. A financial motivation for a Bernoulli variable, quite relevant in option pricing, is modeling random asset prices S(t) in simplified form, rather than using stochastic differential equations. According to this simplified model, given the current asset price S(0) = S0, the price after a time step δt, denoted by S(δt), can be either Su = uS0 or Sd = dS0, with probabilities pu and pd, respectively. The subscripts may be interpreted as “up” and “down,” if we choose multiplicative shocks d < u. Now, imagine that the process S(t) is observed up to time T = n · δt and that up and down jumps are independent over time. Since there are n steps, the largest value that S(T) can attain is S0u^n, and the smallest one is S0d^n. Intermediate values are of the form S0u^m d^(n−m), where m is the number of up jumps that occurred, and n − m is the number of down jumps. Note that the exact sequence of up and down jumps is not relevant to determine the terminal price, as multiplicative shocks commute: S0ud = S0du. A simple model like this is referred to as a binomial model and can be depicted as the binomial lattice in Fig. 3.1. Such a model is quite convenient for simple, low-dimensional options,9 as recombination keeps complexity low, whereas the size of a generic tree explodes exponentially.10 Now, we might wonder what is the probability that S(T) = S0u^m d^(n−m). Indeed, this is an example of a binomial variable: To build a binomial variable, we carry out a given number n of independent Bernoulli experiments and count the number m of successes. Hence, a binomial variable is just the sum of n independent Bernoulli variables. This distribution has finite support {0, 1, …, n} and depends on the two parameters p and n; thus, it may be denoted by X ~ Bin(n, p). By using properties of sums of independent random variables, it is easy to show that
FIGURE 3.1 A binomial lattice model for a stock price.
The PMF of a binomial variable is not completely trivial:
where we use the binomial coefficient11
To see why this coefficient is needed, consider that there are many ways in which we may have r successes and n − r failures; which specific experiments succeed or fail is irrelevant. In fact, we have n! permutations of the n individual experiments, (n − r)! permutations of failures, and r! permutations of successes. Since the specific order of each permutation is irrelevant, we divide n! by (n − r)! and r!.
Figure 3.2 shows the PMF for n = 30, and p = 0.2, p = 0.4. The plot has been obtained by the following commands:
FIGURE 3.2 Probability mass function of two binomial random variables.
> par(mfrow=c(2,1))
> plot(dbinom(0:30,size=30,p=0.2),type='h',lwd=5)
> plot(dbinom(0:30,size=30,p=0.4),type='h',lwd=5)
The first command sets mfrow in order to draw plots according to a layout consisting of two rows and one column, thus lining the plots vertically. The plot function uses the parameter lwd to set the line width. Also check the following calculations:
> dbinom(10,size=30,p=0.4)
[1] 0.1151854
> factorial(30)/factorial(20)/factorial(10)*0.4^10*0.6^20
[1] 0.1151854
The second way of computing binomial probabilities is a straightforward application of Eq. (3.13), but it is not quite smart, as taking factorials of large integers incurs a danger of overflow, i.e., of producing a number so large that it cannot be represented on a computer. When in need, it is better to leave the task to an appropriately coded function:
> choose(30,10)*0.4^10*0.6^20
[1] 0.1151854
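We may also check the binomial moments E[X] = np and Var(X) = np(1 − p) by straightforward Monte Carlo sampling:

```r
set.seed(55555)
n <- 30; p <- 0.4
X <- rbinom(100000, size = n, prob = p)
mean(X)   # close to n*p = 12
var(X)    # close to n*p*(1-p) = 7.2
```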
Another discrete distribution that can be built on top of the Bernoulli is the geometric. This is defined by repeating identical and independent Bernoulli trials with success probability p, until the first success is obtained and counting the number of experiments.12 The PMF of the geometric distribution is
The expected value E[X] can be easily found by conditioning on the result of the first experiment, rather than by applying the definition. If the first experiment is a success, which occurs with probability p, the conditional expected value of the number of experiments is just E[X | success] = 1; otherwise, since there is no memory of past outcomes, it is E[X | failure] = 1 + E[X], i.e., we count the first failed experiment and then we are back to square one. Thus:
E[X] = p · 1 + (1 − p) · (1 + E[X]),  which implies  E[X] = 1/p.    (3.14)
R offers functions for the geometric distribution, too, but there is a little peculiarity:
> set.seed(55555)
> rgeom(10,prob=0.5)
[1] 1 2 3 1 0 2 0 0 0 0
We notice that the random variable can take the value 0, because in R the convention for the geometric distribution is to count failures, rather than experiments.
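Given this convention, adding 1 to the sampled values yields the number of experiments, whose expected value is 1/p by Eq. (3.14); a quick Monte Carlo check:

```r
set.seed(55555)
p <- 0.25
X <- rgeom(100000, prob = p) + 1   # +1: count experiments, not failures
mean(X)    # close to 1/p = 4
```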
The exponential and the Poisson distributions, despite their different nature, are tightly intertwined: The former is continuous and may be used to model the time elapsing between two events of some kind, whereas the latter is discrete and represents the corresponding number of events occurred within a given time interval. Examples of relevant events in finance are jumps in stock prices or changes in the credit rating of a bond issuer.
An exponential random variable owes its name to the functional form of its PDF:

fX(x) = λe^(−λx),  x ≥ 0,
where λ > 0 is a given parameter. Note that the support of the exponential distribution is [0, +∞), which makes sense since it is often used to model the length of time intervals. The corresponding CDF is

FX(x) = P{X ≤ x} = 1 − e^(−λx),  x ≥ 0.  (3.15)
It is fairly easy to show that

E[X] = 1/λ,  Var(X) = 1/λ².
It is worth noting that the expected value of an exponential random variable is quite different from the mode, which is zero, and that this distribution is right-skewed. If we interpret X as the time elapsing between consecutive events, 1/λ is the average time between them, and λ is a rate, i.e., the average number of events occurring per unit time. The notation X ~ exp(λ) is typically used to denote an exponential random variable.
A weird but remarkable property of exponential random variables is lack of memory. From the cumulative distribution function (3.15) we see that

P{X > t} = 1 − FX(t) = e^(−λt).
Intuitively, this means that, if we are at time zero, the probability that we have to wait more than t for the next event to occur is a decreasing function of t, which makes sense. What is not so intuitive is that if a time period of length t has already elapsed since the last event, this has no influence on the time we still have to wait for the next one. This is clearly false, e.g., for a uniform random variable U(a, b): if a time b − ε has elapsed since the last event, where ε
is small, we should get ready for the next one. On the contrary, the exponential distribution is memoryless, which can be stated formally as follows:

P{X > t + s | X > t} = P{X > s},  for all t, s ≥ 0.
This characteristic is important when making modeling choices: If lack of memory is not compatible with the phenomenon we are representing, we should not use an exponential distribution.
R offers the exp family of functions, which we may use to illustrate lack of memory empirically, using Monte Carlo sampling as an R programming exercise. The function in Fig. 3.3 generates a sample of numRepl exponential random variables with rate lambda. Then, the probabilities P{X > t}, P{X > s}, and P{X > t + s | X > t} are estimated empirically. Note how we restrict the sample conditionally on the event {X > t} when extracting the subvector times_t. Here is the result, based on one million replications:
FIGURE 3.3 Checking the memoryless property of exponential distribution empirically.
> set.seed(55555)
> CheckMemory(1, 1, 0.5, 1000000)
P(T>t) = 0.367936
P(T>s) = 0.606819
P(T>t+s|T>t) = 0.6064832
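The listing of Fig. 3.3 is not reproduced here; a minimal sketch consistent with the call and output above, assuming the argument order CheckMemory(lambda, t, s, numRepl), might be:

```r
# hypothetical sketch of the function in Fig. 3.3
CheckMemory <- function(lambda, t, s, numRepl) {
  times <- rexp(numRepl, rate = lambda)    # sample of exponential variables
  cat("P(T>t) =", mean(times > t), "\n")   # estimate of P{X > t}
  cat("P(T>s) =", mean(times > s), "\n")   # estimate of P{X > s}
  times_t <- times[times > t]              # restrict to the event {X > t}
  cat("P(T>t+s|T>t) =", mean(times_t > t + s), "\n")
}
```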
We see that, despite a small discrepancy due to sampling errors, the estimate of P{X > t + s | X > t} is fairly close to P{X > s}. One could wonder whether one million replications are enough. The following exact evaluation is not too different from the above estimates, suggesting that the selected number of replications should be indeed large enough:
> 1-pexp(1,rate=1)
[1] 0.3678794
> 1-pexp(0.5,rate=1)
[1] 0.6065307
However, this could just be sheer luck, and in real life we cannot afford the luxury of a comparison with exact numbers (otherwise, we would not be using Monte Carlo methods in the first place). We will address similar issues in Chapter 7, when dealing with simulation output analysis.
The exponential distribution is the building block of other random variables. By summing n independent exponentials with common rate λ we obtain the Erlang distribution, denoted by Erl(n, λ). Erlang random variables may be used to model time elapsed as a sequence of exponential stages, in order to overcome the memoryless property of a single exponential. The PDF of a few Erlang variables, with a different number of stages, can be plotted using the R script in Fig. 3.4, which produces the plot of Fig. 3.5. From a mathematical point of view we immediately notice that by summing exponential variables we get a rather different distribution. In general, the sum of random variables has a distribution that is quite different from the single terms in the sum. The normal distribution is the most notable exception, as the sum of normals is still normal. We also notice that, in order to plot the PDF of an Erlang variable in R, we have to resort to the dgamma function. As we shall see in Section 3.2.5, the gamma distribution generalizes other distributions, including the Erlang.
FIGURE 3.4 Plotting the PDF of different Erlang distributions.
FIGURE 3.5 Erlang PDFs.
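A script along the lines of Fig. 3.4 can be sketched as follows (the specific stage counts shown here are an assumption); an Erl(n, λ) density is obtained from dgamma with integer shape n:

```r
x <- seq(0, 10, by = 0.01)
plot(x, dgamma(x, shape = 1, rate = 1), type = "l")        # n = 1: plain exponential
lines(x, dgamma(x, shape = 2, rate = 1), lty = "dashed")   # n = 2 stages
lines(x, dgamma(x, shape = 4, rate = 1), lty = "dotdash")  # n = 4 stages
```

Note how, for n > 1, the density vanishes at the origin, unlike the exponential whose mode is zero.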
We have already mentioned in Section 1.3.3 a peculiar stochastic process, the Poisson process, to motivate models based on discrete-event systems. Here we further investigate its properties and link it to a discrete distribution sharing the same name. The process is obtained by counting the number of events occurred in the time interval [0, t], denoted by N(t). Hence, this is a counting process, and Fig. 3.6 illustrates a sample path. Clearly, the sample path is discontinuous, but piecewise constant, as there is a jump corresponding to each event. Let Xk, k = 1, 2, 3, 4, …, be the time elapsed between jump k − 1 and jump k; by convention, X1 is the time at which the first jump occurs, after starting the system at time t = 0. We obtain a Poisson process if we assume that the variables Xk are mutually independent and all exponentially distributed with parameter λ. You might even stumble on esoteric jargon like a càdlàg function when dealing with certain stochastic processes in finance. This is just a French acronym for “continue à droite, limite à gauche,” since the sample path is continuous from the right and has a limit from the left at every point, even where the path jumps.
FIGURE 3.6 Sample path of the Poisson process.
If we count the number of events in the [0, 1] time interval, we obtain a discrete random variable, the Poisson random variable. The Poisson distribution is characterized by the same parameter λ as the exponential, and it may take values on the set {0, 1, 2, 3, …}. Its PMF is

P{X = k} = e^(−λ) λ^k / k!,  k = 0, 1, 2, …
Despite the apparent complexity of this function, it is straightforward to check that it meets the fundamental requirement of a PMF:

Σ(k=0..∞) e^(−λ) λ^k / k! = e^(−λ) Σ(k=0..∞) λ^k / k! = e^(−λ) e^λ = 1.
By a similar token, we may also find its expected value and variance:

E[X] = λ,  Var(X) = λ.
Note that λ is, in fact, the event rate, which characterizes this distribution, denoted by X ~ Pois(λ). If we want to count events occurring in the more general interval [0, t], we have just to consider X ~ Pois(λt). The R family for the Poisson distribution is pois. Figure 3.7 shows the plot of the PMF of a Poisson random variable for λ = 5, obtained by the following command:
FIGURE 3.7 Probability mass function of a Poisson random variable.
> plot(0:30, dpois(0:30,lambda=5), type="h", lwd=5)
We may state the connection between the Poisson random variable and the Poisson process in more general terms as follows. If we consider a time interval [t1, t2], with t1 < t2, then the number of events occurred in this interval, i.e., N(t2) − N(t1), has Poisson distribution with parameter λ(t2 − t1). Furthermore, if we consider another time interval [t3, t4], where t3 < t4, which is disjoint from the previous one, i.e., t2 ≤ t3, then the random variables N(t2) − N(t1) and N(t4) − N(t3) are independent. We say that the Poisson process has stationary and independent increments.
The Poisson process model can be generalized to better fit reality, when needed. If we introduce a time-varying rate λ(t), we get the so-called inhomogeneous Poisson process. Furthermore, when modeling price jumps, we would like to introduce a random variable modeling the height of the jump. The cumulative sum of jump heights in the time interval [0, t] is another stochastic process, which is known as compound Poisson process. We will see how sample paths for such processes can be generated in Chapter 6.
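For the basic homogeneous process, though, the connection between exponential interarrival times and the counting process can already be sketched in a few lines of R: we cumulate exponential gaps to obtain the jump epochs, and stepfun yields a right-continuous (càdlàg) sample path like the one in Fig. 3.6.

```r
set.seed(55555)
lambda <- 2
jumps <- cumsum(rexp(20, rate = lambda))  # jump epochs: cumulated exponential gaps
N <- stepfun(jumps, 0:20)                 # counting process N(t), right-continuous
N(jumps[5])                               # 5 events have occurred by the 5th jump epoch
```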
It is safe to say that the normal or Gaussian distribution, with its bell shape, is the best-known and most used (and misused) distribution. A normal random variable X ~ N(μ, σ2) is characterized by the following PDF:

fX(x) = 1/(σ√(2π)) exp(−(x − μ)²/(2σ²)),
which clearly suggests symmetry with respect to the point x = μ. The parameters μ and σ2 have a clear interpretation,13 as a few calculations show that

E[X] = μ,  Var(X) = σ².
R offers the norm family of functions. As an illustration, in Fig. 3.8 we show the PDF for two normal distributions with μ = 0 and σ = 1, 2, respectively; the plot is obtained as follows:
FIGURE 3.8 PDF of normal random variables with μ = 0 and σ = 1, σ = 2.
> x <- seq(-6.5, 6.5, by=0.01)
> plot(x, dnorm(x, mean=0, sd=1), type="l", lty="dashed")
> lines(x, dnorm(x, mean=0, sd=2), type="l", lty="dotdash")
From the plot, we also verify that, for a normal distribution, mode, median, and expected value are the same.
In the applications, a very special role is played by the standard normal distribution, characterized by parameters μ = 0 and σ = 1. The reason for its importance is that, if we are able to work with a standard normal in terms of quantiles and distribution function, then we are able to work with a generic normal variable. The CDF is the integral of the bell-shaped function above, for which no analytical formula is known; however, efficient numerical approximations are available for the standard normal case Z ~ N(0, 1):

Φ(z) = P{Z ≤ z}.
This function is calculated by the R function pnorm. We are also able to invert the function by qnorm, which yields the quantile zq defined by

P{Z ≤ zq} = Φ(zq) = q
for a probability level q ∈ (0, 1). Armed with these functions, we may easily compute the CDF and quantiles for a generic normal. If X ~ N(μ, σ2) and we need the probability P{X ≤ x},
we have just to standardize:

P{X ≤ x} = P{Z ≤ (x − μ)/σ} = Φ((x − μ)/σ),
since Z = (X − μ)/σ is standard normal. If we need a quantile xq, such that P{X ≤ xq} = q, we find the corresponding standard quantile zq and go the other way around by destandardizing:

xq = μ + σ zq.
In R, we can avoid all of this if we specify the parameters mean and sd.
The following examples show how to use pnorm and qnorm:
> pnorm(10,mean=5,sd=3)
[1] 0.9522096
> pnorm((10-5)/3)
[1] 0.9522096
> qnorm(0.95,mean=5,sd=3)
[1] 9.934561
> z=qnorm(0.95); z
[1] 1.644854
> 5+3*z
[1] 9.934561
We may also easily check the following well-known properties:
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997
> pnorm(3) - pnorm(-3)
[1] 0.9973002
The last property shows that the normal distribution has rather thin tails, since most realizations are within three standard deviations of the mean, which makes its use sometimes questionable. Indeed, the normal distribution is a sort of benchmark in terms of fat vs. thin tails, a fact that is measured by its kurtosis:

κ = E[(X − μ)⁴] / σ⁴ = 3.
Sometimes, kurtosis is defined as κ − 3, which should actually be called excess kurtosis. Distributions with positive excess kurtosis allow for more frequent extreme events than the normal. Another important feature of the normal distribution is symmetry, which implies a zero skewness:

E[(X − μ)³] / σ³ = 0.
The assumption that returns on an asset are normally distributed implies their symmetry, which is contradicted by empirical facts, especially when financial derivatives are involved. Hence, a normal model is sometimes too stylized to be credible. Nevertheless, the normal distribution has also a few quite interesting and convenient properties:
The normal distribution is said to be a stable distribution, since a linear combination of independent normals is still normal. The process can also go the other way around, since stable random variables are infinitely divisible, i.e., they can be expressed as sums of similar variables. We just mention that there is a family of stable distributions that includes the normal as a specific case.15
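For instance, the sum of independent N(2, 3²) and N(−1, 4²) variables is N(1, 5²), which is easy to check by sampling:

```r
set.seed(55555)
x <- rnorm(100000, mean = 2, sd = 3) + rnorm(100000, mean = -1, sd = 4)
mean(x)  # close to 2 + (-1) = 1
sd(x)    # close to sqrt(3^2 + 4^2) = 5
```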
In Section 3.3.4 we will also see some properties of the multivariate generalization of the normal distribution. The normal distribution is also the basis of other distributions obtained by transforming a normal variable or by combining independent standard normals.
A random variable X is said to have a lognormal distribution if log X is normal. In other words, if Y ~ N(μ, σ2), then X = e^Y is lognormal, denoted by X ~ LogN(μ, σ2). It can be shown that the PDF of the lognormal is
fX(x) = 1/(xσ√(2π)) exp(−(log x − μ)²/(2σ²)),  x > 0.  (3.16)
It is important to notice that in this case the two parameters μ and σ2 cannot be interpreted as in the normal case. In fact, using the moment generating function of the normal, it is easy to show that

E[X] = e^(μ + σ²/2),  Var(X) = e^(2μ + σ²) (e^(σ²) − 1).
Note how, due to convexity of the exponential function and Jensen’s inequality, if X ~ LogN(μ, σ2), then

E[X] = e^(μ + σ²/2) > e^(E[Y]) = e^μ.
Another noteworthy fact is that

X ~ LogN(−σ²/2, σ²)  ⟹  E[X] = 1.  (3.17)
This suggests using a lognormal variable with unit expected value to model random errors in a multiplicative model, whereas a normal variable with zero expected value would model random errors in an additive model.
Furthermore, since by summing normal variables we still get a normal variable, a similar property holds when we multiply lognormal random variables. Formally, if we consider n independent lognormals Xi ~ LogN(μi, σi2), then

X1 · X2 ··· Xn ~ LogN(μ1 + ··· + μn, σ1² + ··· + σn²).
This is useful when modeling financial returns, which should cumulate multiplicatively rather than additively. The R family for this distribution is lnorm. For instance, we may check Eq. (3.17) empirically:
> set.seed(55555)
> sigma <- 2
> X <- rlnorm(1000000, meanlog=-(sigma^2)/2, sdlog=sigma)
> mean(X)
[1] 0.9928838
Let Zi, i = 1, …, n, be standard and independent normal variables. The random variable X defined as

X = Z1² + Z2² + ··· + Zn²
is certainly not normal, as it cannot take negative values. This variable is called chi-square with n degrees of freedom and is often denoted by χn2. Recalling that the expected value of a squared standard normal is

E[Zi²] = Var(Zi) = 1,
we immediately see that

E[X] = n.
It can also be shown that

Var(X) = 2n.
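These moments are easy to check by sampling with rchisq:

```r
set.seed(55555)
n <- 8
x <- rchisq(1000000, df = n)
mean(x)  # close to n = 8
var(x)   # close to 2n = 16
```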
Figure 3.9 shows the PDF for a chi-square variable with 4 and 8 degrees of freedom, obtained as follows:
FIGURE 3.9 PDF of two chi-square random variables with 4 and 8 degrees of freedom, respectively; the variable with 4 degrees of freedom is less uncertain and has a higher mode.
> x <- seq(0,20,by=0.01)
> plot(x, dchisq(x,df=4), type="l", lty="dashed")
> lines(x, dchisq(x,df=8), type="l", lty="dotdash")
The chi-square distribution is relevant in inferential statistics, and there is a well-known nonparametric test of goodness of fit owing its name to it. It is worth noting that a chi-square distribution may be regarded as a specific case of a gamma random variable, discussed later. In finance, we also meet the noncentral chi-square distribution, which is obtained by the following construction. Let us consider

X = Σ(i=1..n) (Zi + δi)²,

where the Zi are independent standard normal variables and the δi are given numbers.
The corresponding variable is called noncentral chi-square with n degrees of freedom and noncentrality parameter λ = Σ(i=1..n) δi². It is possible to generalize this distribution to a noninteger ν, resulting in a χ′ν²(λ) variable. The following formulas for the relevant moments can be shown:
E[X] = ν + λ,  Var(X) = 2(ν + 2λ).  (3.18)
The R function family chisq can also be used for the noncentral case. For instance:
> set.seed(55555)
> rchisq(3, df=4.5, ncp=3.6)
[1] 17.775371 8.719674 15.453155
from which we see that the degrees of freedom df and the noncentrality parameter ncp may be noninteger. This distribution is relevant in the simulation of square-root diffusions.16
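A Monte Carlo check of the moments in Eq. (3.18) for these very parameters may run as follows:

```r
set.seed(55555)
nu <- 4.5; lambda <- 3.6
x <- rchisq(1000000, df = nu, ncp = lambda)
mean(x)  # close to nu + lambda = 8.1
var(x)   # close to 2*(nu + 2*lambda) = 23.4
```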
If Z and χn2 are a standard normal and a chi-square with n degrees of freedom, respectively, and they are also mutually independent, the random variable

T = Z / √(χn2 / n)
has a Student’s t distribution with n degrees of freedom, denoted by tn. This distribution is widely used in inferential statistics, for computing confidence intervals and testing hypotheses. A notable feature is that it has fatter tails than the normal. Indeed, it can be shown that

E[X] = 0 (for n > 1),  Var(X) = n/(n − 2) (for n > 2).
We observe that, for large n, the variance goes to 1, but it is not defined for n ≤ 2. Since this variance is larger than the corresponding variance of the standard normal, the t distribution is more dispersed. In fact, the excess kurtosis is

6/(n − 4)
for n > 4. We may check these fatter tails by plotting its density in Fig. 3.10:
FIGURE 3.10 PDF of Student’s t distribution, with n=1 (dash-dotted line) and n=5 (dashed line), compared with a standard normal (continuous line).
> x <- seq(-4.5,4.5,by=0.01)
> plot(x, dnorm(x), type="l", lty="solid")
> lines(x, dt(x,df=1), type="l", lty="dotdash")
> lines(x, dt(x,df=5), type="l", lty="dashed")
The plot shows the densities of t1 and t5 random variables, along with a standard normal. We observe that the PDF of the t distribution is bell-shaped much like a standard normal; the main differences lie in its heavier tails and in a lower mode. For increasing n, the tails of tn get thinner and the density tends to a standard normal. In fact, for large n the two distributions are virtually identical. The t distribution can be generalized to the multivariate case, as shown in Section 3.3.4.
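The fatter tails can also be quantified numerically, comparing the probability of a realization beyond three standard deviations:

```r
1 - pnorm(3)       # standard normal: about 0.00135
1 - pt(3, df = 5)  # t with 5 degrees of freedom: roughly ten times larger
1 - pt(3, df = 1)  # t with 1 degree of freedom: larger still
```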
Another relevant distribution can be derived from the normal and plays a fundamental role in analysis of variance and in testing regression models. If we combine two independent chi-square variables with n and m degrees of freedom, respectively, as follows:

F = (χn2 / n) / (χm2 / m),
we obtain the F distribution with n and m degrees of freedom. R offers functions like df(x, df1, df2); there is a noncentral version of the F distribution as well.
The beta distribution is continuous and has support on the unit interval [0, 1]. It depends on two parameters α1, α2 > 0, defining its shape through the PDF

fX(x) = x^(α1 − 1) (1 − x)^(α2 − 1) / B(α1, α2),  x ∈ (0, 1).  (3.19)
The denominator of the above ratio is just needed in order to normalize the PDF. A straightforward approach is defining the beta function:

B(α1, α2) = ∫(0..1) x^(α1 − 1) (1 − x)^(α2 − 1) dx.
Alternatively, we may use the following gamma function:
Γ(α) = ∫(0..∞) x^(α − 1) e^(−x) dx,  (3.20)

since B(α1, α2) = Γ(α1)Γ(α2) / Γ(α1 + α2).
Using integration by parts, it is easy to see that the gamma function satisfies the following recursion:

Γ(α + 1) = α Γ(α).  (3.21)
This shows that the gamma function is an extension of the factorial function outside the domain of integer numbers. In fact, using Eq. (3.21) for an integer argument and unfolding the recursion, we find

Γ(n + 1) = n Γ(n) = n(n − 1) ··· 1 · Γ(1) = n!,  since Γ(1) = 1.
The gamma function appears in the definition of different PDFs, including, needless to say, the gamma distribution to be discussed next. It also turns out that

Γ(1/2) = √π.
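Both facts are easy to verify with the built-in gamma function:

```r
gamma(6) == factorial(5)  # TRUE: Gamma(n + 1) = n!
gamma(0.5)                # equals sqrt(pi)
sqrt(pi)
```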
The PDFs displayed in Fig. 3.11 are obtained by the following R snapshot; this shows that the uniform and certain triangular distributions are special cases of the beta distribution.
FIGURE 3.11 PDFs of several beta distributions.
> x <- seq(0.001,0.999,by=0.001)
> par(mfrow = c(3,3))
> plot(x, dbeta(x,2,1), type="l")
> plot(x, dbeta(x,1,1), type="l")
> plot(x, dbeta(x,1,2), type="l")
> plot(x, dbeta(x,1,0.8), type="l")
> plot(x, dbeta(x,0.8,0.8), type="l")
> plot(x, dbeta(x,0.8,1), type="l")
> plot(x, dbeta(x,8,3), type="l")
> plot(x, dbeta(x,3,3), type="l")
> plot(x, dbeta(x,3,8), type="l")
Possible uses of a beta random variable are related to random percentages, like recovery rates after a default. If needed, the support of the beta distribution can be easily mapped to the interval (a, b). It is also interesting to compare the PDF of the beta distribution in Eq. (3.19) with the PMF of a binomial distribution in Eq. (3.13). Apart from the normalization factors, they look quite similar, at least if we choose integer parameters α1 and α2 in the PDF of a beta variable. In fact, these distributions are quite intertwined within the framework of Bayesian statistics, when we consider uncertainty in the parameter p of a Bernoulli random variable; the beta distribution can be used to represent such an uncertainty, both in the prior and the posterior.17
The gamma distribution is characterized by the PDF
fX(x) = λ^α x^(α − 1) e^(−λx) / Γ(α),  x ≥ 0,  (3.22)
which depends on parameters α and λ; we use notation X ~ Gamma(α, λ). A notable feature of the gamma distribution is that it includes other distributions as special cases:

For α = 1, we obtain the exponential distribution exp(λ).
For an integer α = n, we obtain the Erlang distribution Erl(n, λ).
For α = n/2 and λ = 1/2, we obtain the chi-square distribution χn2.
The fundamental moments of a gamma variable are

E[X] = α/λ,  Var(X) = α/λ².
This suggests that λ is an inverse scale (rate) parameter, whereas α is a shape parameter. This is in fact the case, in the sense that if X ~ Gamma(α, 1), then X/λ ~ Gamma(α, λ). Some authors prefer to parameterize the distribution by the rate λ, whereas others use the scale β = 1/λ. R offers both possibilities, as there are two alternatives in the gamma function family:
> dgamma(2, shape=3, rate=2)
[1] 0.2930502
> dgamma(2, shape=3, scale=1/2)
[1] 0.2930502
The following R snapshot produces the plot in Fig. 3.12, which illustrates the effect of the shape parameter for a normalized scale parameter:
FIGURE 3.12 PDFs of some gamma distributions.
> x <- seq(0,7,by=0.01)
> plot(x, dgamma(x,shape=0.5,rate=1), type="l", ylim=c(0,1.2), ylab="")
> lines(x, dgamma(x,shape=1,rate=1))
> lines(x, dgamma(x,shape=2,rate=1))
> lines(x, dgamma(x,shape=3,rate=1))
Sometimes, no textbook distribution will fit the available data. In such a case, one could consider resorting to an empirical distribution. In the discrete case, we have just to specify a set of values and the associated probabilities. In the continuous case, we must come up with a sensible way to devise a PDF or a CDF based on a discrete set of observations.
Piecewise linear interpolation is a straightforward way of building a CDF. Given n observations Xi, i = 1, …, n, of the random variable X, we first sort them and find the order statistics X(1) ≤ X(2) ≤ ··· ≤ X(n). Now we build a piecewise linear CDF as follows:
F̂(x) = 0 for x < X(1);
F̂(x) = (i − 1)/(n − 1) + (x − X(i)) / [(n − 1)(X(i+1) − X(i))] for X(i) ≤ x < X(i+1);
F̂(x) = 1 for x ≥ X(n).  (3.23)
In Fig. 3.13 we illustrate an example for n = 6. After defining the empirical CDF, it is easy to sample from it using, e.g., the inverse transform method of Section 5.3.
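A minimal sketch of the idea, using a small hypothetical sample of n = 6 observations and approxfun for the piecewise linear interpolation:

```r
X <- c(3.1, 1.2, 4.8, 2.5, 6.0, 5.2)   # hypothetical sample, n = 6
Xs <- sort(X)                          # order statistics
q <- (0:5) / 5                         # CDF levels at the order statistics
Fhat <- approxfun(Xs, q)               # piecewise linear empirical CDF
Finv <- approxfun(q, Xs)               # its inverse, for sampling
set.seed(55555)
Finv(runif(3))                         # inverse transform sampling
```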
FIGURE 3.13 Building an empirical CDF by piecewise linear interpolation.
Kernel-based estimation is an alternative, possibly more sophisticated way to build an empirical distribution. Unlike the piecewise-linear interpolation approach, it aims at building a PDF, and it can be interpreted as a way to “smooth” a histogram. Roughly speaking, the idea is to center a suitable elementary density around each observation, to add all of the densities, and to scale the result according to the number of observations in such a way as to obtain a legitimate overall PDF.
R offers a density function to this aim, receiving a vector of observations and producing an object of class density. The function behavior can be controlled by setting several parameters, among which we emphasize bw, the smoothing bandwidth, and kernel, the type of elementary density used (the default is a Gaussian kernel).
To illustrate the role of bandwidth, let us generate a sample of observations by using a mixture of two normals:
> set.seed(55555)
> data <- c(rnorm(40,10,10), rnorm(30,75,20))
> hist(data,nclass=15)
> windows()
> par(mfrow=c(1,2))
> kd1 <- density(data, bw=2)
> plot(kd1, main="")
> kd2 <- density(data, bw=10)
> plot(kd2, main="")
> class(kd1)
[1] "density"
The corresponding histogram is displayed in Fig. 3.14, and suggests a bimodal distribution. The kernel kd1 is indeed of class density, and it can easily be plotted, resulting in the two PDFs displayed in Fig. 3.15. We clearly see that a small bandwidth does not smooth data enough and produces a confusing PDF; a larger bandwidth yields a definitely more reasonable result.
FIGURE 3.14 Histogram of data generated by a mixture of normals.
FIGURE 3.15 Kernel densities for different bandwidths.
An extreme form of empirical distribution, in a sense, is the bootstrap approach sometimes employed in finance. The idea is to take a time series of prices or returns, and to generate future scenarios by sampling past outcomes. There are refinements of this idea, but it should be clear that if we take past observations and we just mix them randomly to come up with sample paths, we are implicitly assuming that longitudinal dependencies and correlations can be ruled out. Hence, we are relying on the efficient market hypothesis. On the other hand, a notable advantage of the bootstrap approach is that, when dealing with multidimensional time series, we automatically account for cross-sectional dependencies. In the next section, indeed, we move on to consider multivariate distributions, and we shall see that modeling such dependencies is far from trivial, and this explains why bootstrap is often advocated.
Finally, two general remarks are in order about empirical distributions:
In the previous sections we have considered univariate distributions, which provide us with the nuts and bolts for modeling uncertainty. However, the need for modeling dependence is pervasive in finance and economics, and this leads us to consider multivariate distributions. We are interested in two main dimensions of dependence: dependence over time, i.e., in the longitudinal sense, as in a stochastic process; and dependence across different random variables observed at the same time, i.e., in the cross-sectional sense.
Clearly, the decomposition of dependence in two orthogonal dimensions is an oversimplification. In some cases, we really need to consider both dimensions together, which leads to the analysis of panel data. In other cases, the decomposition may be a sensible modeling simplification. For instance, we might want to check the dependence of two bond defaults that do not really occur at the same time instant, but from a modeling point of view they could be considered as happening at the same time (unless we wish to model causal links between them).
In this section we start by considering the structure of multivariate distributions, emphasizing their complexity and the fact that a characterization of marginal distributions does not tell the whole story, unless we assume independence. Then, we deal with standard concepts like covariance and Pearson’s correlation. Showing their potential limitations paves the way for Section 3.4, where we illustrate more advanced ideas revolving around the copula concept. Finally, we recall a couple of multivariate distributions, namely, the normal and Student’s t, which generalize their univariate counterparts.
We know that univariate distributions may be described by the PMF in the discrete case or the PDF in the continuous case. Actually, the CDF is a more general concept, since it is valid in either setting. So, the starting point to get the picture about the dependence of two random variables is their joint CDF.
DEFINITION 3.1 (Joint cumulative distribution function) The joint CDF of random variables X and Y is defined as

FX,Y(x, y) = P{X ≤ x, Y ≤ y}.
The streamlined notation P(X ≤ x, Y ≤ y) is typically used to denote the probability of a joint event.
The joint CDF is a function of two variables, x and y, and it fully characterizes the joint distribution of the random variables X and Y. It is easy to see how the definition can be generalized to n random variables.
The joint CDF is the most general tool to characterize a multivariate distribution. Other concepts may be defined, depending on the character of the involved random variables. In the case of two discrete variables, where X may take values xi, i = 1, 2, 3, …, and Y may take values yj, j = 1, 2, 3, …, we define the joint PMF as

pX,Y(xi, yj) = P{X = xi, Y = yj}.
When dealing with continuous random variables, we may define a joint PDF fX,Y(x, y) such that
P{(X, Y) ∈ C} = ∫∫(C) fX,Y(x, y) dx dy,  for any (reasonable) region C in the plane.  (3.24)
In general, we cannot be sure that such a function exists. To be precise, we should say that the two random variables are jointly continuous when their joint PDF exists. Given the joint PDF, we may find the joint CDF by integration:
FX,Y(x, y) = ∫(−∞..x) ∫(−∞..y) fX,Y(u, v) dv du.  (3.25)
Also the joint PMF and PDF can be defined for n random variables. Needless to say, specifying a function of n variables requires a lot of effort unless one imposes some structure:
We can do without the above machinery in one easy case, i.e., when the random variables are independent. In such a case, the events {X ≤ x} and {Y ≤ y} are independent, and the CDF factors into the product of two terms:

FX,Y(x, y) = P{X ≤ x} · P{Y ≤ y} = FX(x) FY(y).
In general, whatever refers to a single variable, within the context of a multivariate distribution, is called marginal. In case of independence, the joint CDF is just the product of marginal CDFs. The same applies to joint PMFs and PDFs. A first question is: Given the joint CDF FX,Y(x, y), how can we find the marginal CDFs FX(x) and FY(y), pertaining to each individual variable, if we do not assume independence? In principle, the task for the CDF is fairly easy. We obtain the marginal CDFs for the two random variables as follows:

FX(x) = P{X ≤ x, Y < +∞} = FX,Y(x, +∞).
By the same token, FY(y) = FX,Y(+∞, y). In the discrete case, to obtain the marginal PMFs from the joint PMF, we just use the total probability theorem and the fact that the events {Y = yj} are disjoint:

pX(xi) = P{X = xi} = Σ(j) pX,Y(xi, yj).
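With a hypothetical joint PMF stored as a matrix, the marginals are just row and column sums:

```r
# hypothetical joint PMF of X (2 values, rows) and Y (3 values, columns)
pXY <- matrix(c(0.10, 0.20, 0.10,
                0.20, 0.30, 0.10), nrow = 2, byrow = TRUE)
pX <- rowSums(pXY)  # marginal PMF of X: 0.4 0.6
pY <- colSums(pXY)  # marginal PMF of Y: 0.3 0.5 0.2
sum(pXY)            # total probability must be 1
```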
If we want to obtain marginal PDFs from the joint PDF, we may work in much the same way as in the discrete case:

FX(x) = ∫(−∞..x) [∫(−∞..+∞) fX,Y(u, y) dy] du.
If we introduce the marginal density

fX(x) = ∫(−∞..+∞) fX,Y(x, y) dy,
we have

FX(x) = ∫(−∞..x) fX(u) du.
The marginal PDF for Y is obtained in the same way, integrating with respect to x.
We see that, given a joint distribution, we may find the marginals. It is tempting to think that, given the two marginals, we may recover the joint distribution in some way. This is not true, as the marginal distributions do not say anything about the link among random variables.
Consider the discrete-time stochastic process

Xt = ε t,  t = 0, 1, 2, …,  (3.26)
where ε ~ N(0, 1). We immediately see that each Xt is normal with expected value 0 and variance t2. This marginal distribution is the same for the process

Xt = εt t,  t = 0, 1, 2, …,  (3.27)
where εt is again standard normal, but we have one such variable for each time instant. Indeed, Eq. (3.26) describes a rather degenerate process, since uncertainty is linked to the realization of a single random variable. The following R snapshot produces a set of sample paths for the two processes, illustrated in Figs. 3.16 and 3.17:
FIGURE 3.16 A few sample paths of the stochastic process of Eq. (3.26).
FIGURE 3.17 A single sample path of the stochastic process of Eq. (3.27).
set.seed(55555)
len <- 50
times <- 0:len
eps <- rnorm(10)
lims <- c(len*min(eps), len*max(eps))
plot(x=times, y=eps[1]*times, type="l", ylim=lims)
points(x=times, y=eps[1]*times)
for (k in 2:10) {
  lines(x=times, y=eps[k]*times)
  points(x=times, y=eps[k]*times)
}
windows()
eps <- rnorm(300)
path <- (0:300)*c(0,eps)
plot(x=0:300, y=path, type="l")
Note the use of the variable lims, to set the limits of successive plots, and of the function windows(), to open a new window for plotting without overwriting the previous one.
It is easy to see that the two processes have little in common, despite the fact that their marginal PDFs are the same for every time instant.
The example above deals with dependence over time for a monodimensional stochastic process, i.e., in the longitudinal sense. Specifying marginals for each time instant falls short of capturing the structure of a stochastic process. We will make similar considerations later, when dealing with the Wiener process. The next example illustrates the case of cross-sectional dependence between two random variables: We may have quite different joint distributions, sharing the same marginals.
Consider the following joint PDFs (in a moment we will check that they are legitimate densities):
They look quite different, but it is not too difficult to see that they yield the same marginals. The first case is easy:
By symmetry, we immediately see that fY(y) = 1 as well. Hence, the two marginals are two uniform distributions on the unit interval [0, 1]. Now let us tackle the second case. When integrating with respect to y, we should just treat x as a constant:
As before, it is easy to see by symmetry that fY(y) = 1 as well. Again, the two marginals are two uniform distributions, but the links between the two random variables are quite different.
Before closing the example, it is easy to see that fX,Y(x, y) is a legitimate density, as it is never negative and

∫(0..1) ∫(0..1) fX,Y(x, y) dx dy = 1.
Actually, this is just the area of a unit square. Checking the legitimacy of gX,Y (x, y) is a bit more difficult. The easy part is checking that the integral over the unit square is 1. Given the marginals above, we may write
We should also check that the function is never negative; this is left as an exercise.
The two examples point out that there is a gap between two marginals and a joint distribution. The missing link is exactly what characterizes the dependence between the two random variables. This missing link is the subject of a whole branch of probability theory, which is called copula theory; a copula is a function capturing the essential nature of the dependence between random variables, separating it from the marginal distributions. Copulas are described in Section 3.4. In the next section we recall a more classical, simpler, but partial characterization of the links between random variables, based on covariance and correlation.
If two random variables are not independent, we need a way to measure their degree of dependence. This is implicit in the joint distribution, which may be difficult if not impossible to manage and estimate, in general. The simplest measure in this vein is covariance.
DEFINITION 3.2 (Covariance) The covariance between random variables X and Y is defined as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
Usually, the covariance between X and Y is denoted by σXY.
Covariance is the expected value of the product of two deviations from the mean, and its sign depends on the signs of the two factors. We have positive covariance when the events {X > E[X]} and {Y > E[Y]} tend to occur together, as well as the events {X < E[X]} and {Y < E[Y]}, because the signs of the two factors in the product tend to be the same. If the signs tend to be different, we have a negative covariance. What covariance tries to capture is whether the two variables tend to be on the same side of their respective means, both above or both below, in which case covariance is positive, or on opposite sides, in which case covariance is negative.
We recall a few properties of covariance, which are easy to check:
Equation (3.28) yields a convenient way of computing covariance. Equation (3.29) shows that covariance is, as the name suggests, a generalization of variance; this is also reflected in the notation σXX = σX2. Equation (3.30) points out an important issue: Covariance is a measure of association, but it has nothing to do with cause–effect relationships. Equation (3.31) shows that covariance, just like variance, is not affected by a deterministic shift. Putting Eqs. (3.32) and (3.33) together, we may prove the following theorem.
THEOREM 3.3 (Variance of a sum of random variables) Given a collection of random variables Xi, i = 1, …, n, we obtain
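As a quick numerical check, note that the identity holds exactly for sample variances and covariances as well, since they share the same bilinear structure; the following sketch (variable names are ours) verifies it on simulated data.

```r
# Check Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y) on sample counterparts
set.seed(42)
Z <- rnorm(100000)
X <- Z + rnorm(100000)   # X and Y share the common component Z,
Y <- Z + rnorm(100000)   # which induces positive covariance
lhs <- var(X + Y)
rhs <- var(X) + var(Y) + 2*cov(X, Y)
all.equal(lhs, rhs)      # TRUE: the identity is exact, not approximate
```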
Covariance is a generalization of variance. Hence, it is not surprising that it shares a relevant shortcoming: Its value depends on how we measure the underlying quantities. To obtain a measure that does not depend on units of measurement, we may consider the covariance between standardized versions of X and Y:
where μX, μY, σX, and σY are the expected values and standard deviations of X and Y, respectively, and we have used properties of covariance. This motivates the definition of the correlation coefficient
DEFINITION 3.4 (Correlation coefficient) The correlation coefficient between random variables X and Y is defined as
To be precise, the above definition corresponds to Pearson’s correlation; later we will see that there are alternative definitions of similar measures. The correlation coefficient is dimensionless and easily interpreted, as it can only take values in the interval [−1, 1]. While we usually do not prove theorems in this book, we make an exception here, as the proof is instructive.
THEOREM 3.5 The correlation coefficient ρXY may only take values in the range [−1, 1]. If ρXY = ±1, then X and Y are related by Y = a + bX, where the sign of b is the sign of the correlation coefficient.
PROOF Consider the following linear combination of X and Y:
We know that variance cannot be negative; hence
This inequality immediately yields ρX,Y ≥ −1. By the same token, consider a slightly different linear combination:
We also know that if Var(Z) = 0, then Z must be a constant. In the first case, variance will be zero if ρ = −1. Then, we may write
for some constant α. Rearranging the equality, we have
This can be rewritten as Y = a + bX, and since standard deviations are non-negative, we see that the slope is negative. Considering the second linear combination yields a similar relationship for the case ρ = 1, where the slope b is positive.
Given the theorem above, it is fairly easy to interpret a specific value of correlation:
A visual illustration of correlation is given in Fig. 3.18. Each scatterplot shows a sample of 100 joint observations of two random variables for different values of correlation.18 The effect of correlation is quite evident if we imagine drawing a line through each cloud of points; the slope of the line corresponds to the sign of the correlation. In the limit case ρ = ±1, the observations would lie exactly on a line.
FIGURE 3.18 Samples of jointly normal variables for different values of the correlation coefficient ρ.
We easily see that if two variables are independent, then their covariance is zero. In fact, independence implies E[XY] = E[X] · E[Y]; this can be shown, e.g., by taking advantage of the factorization of the joint PDF into the product of the two marginals. Then, using the property (3.28), we see that independence implies that covariance is zero. However, the converse is not true in general, as we may see from the following counterexample.
Let us consider a uniform random variable on the interval [−1, 1]; its expected value is zero, and on its support the density function is constant and given by fX(x) = 1/2. Now, define the random variable Y = √(1 − X2).
Clearly, there is a very strong interdependence between X and Y because, given the realization of X, Y is perfectly predictable. However, their covariance is zero! We know from Eq. (3.28) that
but the second term is clearly zero, as E[X] = 0. Furthermore,
because of the symmetry of the integrand function, which is an odd function, in the sense that f(−x) = −f(x). Hence, the first term is zero as well, and the two variables have zero covariance and are uncorrelated. One intuitive way to explain the weird finding of this example is the following. First note that points with coordinates (X, Y) lie on the upper half of the unit circumference X2 + Y2 = 1. But if Y < E[Y], we may have either X > E[X] or X < E[X]. In other words, if Y is above its mean, there is no way to tell whether X will be above or below its mean. This is illustrated in Fig. 3.19. A similar consideration applies when Y > E[Y].
FIGURE 3.19 A counterexample about covariance.
The example shows that covariance is not really a perfect measure of dependence, as it may be zero in cases in which there is a very strong dependence. In fact, covariance is rather a measure of concordance between a pair of random variables, which is related to the signs of the deviations with respect to the mean. Furthermore, covariance measures a linear association between random variables. A strong nonlinear link, as the one in Fig. 3.19, may not be detected at all, or only partially.
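The counterexample is easy to reproduce numerically; assuming Y = √(1 − X2) as above, the sample covariance is negligible despite the perfect functional link.

```r
# Strong dependence, zero covariance: Y is a deterministic function of X
set.seed(55555)
X <- runif(200000, min = -1, max = 1)   # uniform on [-1, 1], E[X] = 0
Y <- sqrt(1 - X^2)                      # points lie on the upper half-circle
cov(X, Y)   # close to zero, up to sampling error
cor(X, Y)
```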
In real life, covariances and correlations must be estimated on the basis of empirical data. We will consider estimation issues in the next chapter, but R offers two basic and easy-to-use functions to estimate covariance and correlation. Not quite surprisingly, they are called cov and cor. To illustrate them, let us generate two independent samples from the standard normal distribution, and combine the observations in order to induce correlation. It is easy to show, and is left as an exercise, that if variables Z1, Z2 ~ N(0, 1) are independent, then the variables X1 and X2 defined as
are still standard normal but have correlation ρ:
> set.seed(55555)
> rho <- 0.8
> Z1 <- rnorm(100000)
> Z2 <- rnorm(100000)
> X1 <- Z1
> X2 <- rho*Z1 + sqrt(1-rho^2)*Z2
> cor(Z1,Z2)
[1] -0.00249354
> cor(X1,X2)
[1] 0.8002211
> cov(X1,X2)
[1] 0.7978528
Note that, since the two variables are standardized, covariance and correlation should coincide; the small discrepancy is due to sampling errors. In the above snapshot, we have invoked cov and cor by passing two vectors as input arguments. By doing so, we obtain covariance and correlation between the corresponding variables. If we are interested in covariance and correlation matrices, we should collect the input data in a matrix:
> X <- as.matrix(cbind(2*X1,-3*X2))
> cov(X)
          [,1]      [,2]
[1,]  3.996719 -4.787117
[2,] -4.787117  8.954152
> cor(X)
           [,1]       [,2]
[1,]  1.0000000 -0.8002211
[2,] -0.8002211  1.0000000
> cov2cor(cov(X))
           [,1]       [,2]
[1,]  1.0000000 -0.8002211
[2,] -0.8002211  1.0000000
Note the use of the function cbind to collect two vectors side by side as columns, and as.matrix to transform the result into a matrix. The function cov2cor transforms a covariance matrix into the corresponding correlation matrix. Once again, we suggest comparing the sample estimates with the theoretical values.
If you try getting online help on cor, you will notice that there is an additional parameter, method = c("pearson", "kendall", "spearman"). By default, we obtain Pearson’s correlations, but there are alternative definitions of correlation, which we discuss later in Section 3.4. To understand the rationale behind them, let us consider the following snapshot:
> set.seed(55555)
> X <- runif(100000)
> cor(X,X)
[1] 1
> cor(X,X^2)
[1] 0.9683232
> cor(X,X^4)
[1] 0.8660849
As expected, the correlation between X and itself is 1, but if we raise X to increasing powers, we decrease correlation. This is due to the nonlinearity in X2 and X4, which is lost in a linear measure of correlation. The larger the power, the less linear the resulting relationship, which is reflected in a diminishing value of Pearson’s correlation. However, we may well argue that, at a deeper level, there is no substantial difference in the dependence between X and itself and between X and X4. We will see that Kendall’s and Spearman’s correlations avoid this issue.19
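We may check this claim directly with the method argument of cor; here is a small sketch, with the sample size reduced since Kendall’s tau is expensive to compute for large samples.

```r
# Rank correlations are invariant under strictly increasing transformations
set.seed(55555)
X <- runif(2000)
p4 <- cor(X, X^4)                        # Pearson: well below 1
s4 <- cor(X, X^4, method = "spearman")   # Spearman: exactly 1 (same ranks)
k4 <- cor(X, X^4, method = "kendall")    # Kendall: exactly 1 as well
c(pearson = p4, spearman = s4, kendall = k4)
```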
In this section we illustrate two commonly used multivariate distributions, namely, the multivariate counterparts of normal and t distributions. They are similar in the sense that:
The essential difference is that, just like the univariate case, the t distribution has fatter tails, which are related to the number of degrees of freedom.
An n-dimensional multivariate normal variable X ~ N(μ, Σ) has PDF
Strictly speaking, this applies to a nonsingular distribution, i.e., when Σ is positive definite. If the covariance matrix is only positive semidefinite, then there is a linear combination of variables, with coefficients w, such that
From a statistical point of view this means that there is some redundancy in the variables, since we may find a linear combination that is constant; from a mathematical point of view this implies that the covariance matrix cannot be inverted and we cannot write the PDF as in Eq. (3.35).
We immediately see that, when μ = 0 and Σ = I, the joint density boils down to
i.e., the product of n independent standard densities. We may obtain a generic multivariate normal from the standard Z ~ N(0, I), and the other way around, by the transformations
where A is a matrix such that Σ = AAᵀ. These formulas are straightforward generalizations of the usual standardization (also called whitening in this context) and destandardization. Possible choices of A are the lower triangular Cholesky factor L of Σ, or its symmetric square root Σ1/2, defined by Σ = Σ1/2Σ1/2; unlike the Cholesky factor, the square root is symmetric.
Let us consider two standard normal variables with correlation coefficient ρ. The covariance matrix and the corresponding lower triangular Cholesky factor are
The reader is urged to check that indeed Σ = LLᵀ, and that using this matrix in Eq. (3.36) generates correlated normals as in Eq. (3.34).
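For example, with ρ = 0.8 the lower triangular factor has first row (1, 0) and second row (ρ, √(1 − ρ2)); the suggested check is a one-liner in R.

```r
# Cholesky factorization of a 2x2 correlation matrix: Sigma = L %*% t(L)
rho <- 0.8
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
L <- matrix(c(1, rho, 0, sqrt(1 - rho^2)), nrow = 2)  # lower triangular
L %*% t(L)   # recovers Sigma
```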
The following snapshot shows two R functions, chol and sqrtm from package expm to carry out the above factorizations:
> Sigma <- matrix(c(10, 2, 2, 8), nrow=2, ncol=2)
> A <- chol(Sigma)
> t(A) %*% A
     [,1] [,2]
[1,]   10    2
[2,]    2    8
> library(expm)
> sqrtSigma <- sqrtm(Sigma)
> sqrtSigma
          [,1]      [,2]
[1,] 3.1443790 0.3359774
[2,] 0.3359774 2.8084015
> sqrtSigma %*% sqrtSigma
     [,1] [,2]
[1,]   10    2
[2,]    2    8
Note that chol actually returns the upper triangular factor Lᵀ, which is why we compute t(A) %*% A in the snapshot. As we shall see in Chapter 5, these transformations are also used to generate samples of correlated multivariate normals from a set of independent standard variates.
We may generalize the well-known property that a linear combination of (jointly) normals is normal as follows.
PROPERTY 3.6 Let X1, X2, …, Xr be mi-dimensional, independent normal variables, Xi ~ N(μi, Σi), i = 1, …, r. Consider a vector and matrices of suitable dimensions. Then
Later, we will appreciate the usefulness of the following theorem.20
THEOREM 3.7 (Conditional distribution for multivariate normals) Let us consider an n-dimensional multivariate normal variable X ~ N(μ, Σ) and the partition X = [X1, X2], where X1 and X2, have dimensions n1 and n2, respectively, with n1 + n2 = n. Let us define the corresponding partitions
where vectors and matrices have the appropriate dimensions. Then, the conditional distribution of X1, given that X2 = x, is again multivariate normal, where
Consider a bivariate normal distribution consisting of two standard normals Z1 and Z2 with correlation ρ. Then
and by Theorem 3.7 the distribution of Z1 conditional on Z2 = z is normal, with expected value ρz and variance 1 − ρ2.
The last theorem is useful for path generation by the Brownian bridge (Section 6.2.3) and variance reduction by stratification (Section 8.5).
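The bivariate example may also be checked by simulation: regressing Z1 on Z2 should recover a slope close to ρ and a residual variance close to 1 − ρ2. The following is a sketch under our own choice of ρ and sample size.

```r
# Conditional moments of a bivariate standard normal:
# E[Z1 | Z2 = z] = rho*z and Var(Z1 | Z2 = z) = 1 - rho^2
set.seed(55555)
rho <- 0.6
n <- 200000
Z2 <- rnorm(n)
Z1 <- rho*Z2 + sqrt(1 - rho^2)*rnorm(n)
fit <- lm(Z1 ~ Z2)
unname(coef(fit)[2])     # close to rho
var(residuals(fit))      # close to 1 - rho^2
```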
The multivariate t distribution is characterized by the PDF
With respect to the multivariate normal, there is an additional parameter, ν, corresponding to the degrees of freedom, and the notation tν(μ, Σ) is used. The essential moments are
We notice that when ν → ∞, the covariance matrix tends to Σ, just like in the univariate case. More generally, the PDF of the multidimensional t tends to the PDF of a multivariate normal when ν → ∞. The matrix Σ, in this case, is referred to as a scale parameter, whereas ν is a shape parameter. The most relevant difference between the multivariate normal and t distributions is seen in the tails, which are influenced by ν. If we set aside the complicated normalization factors involving the gamma function and look inside the big square brackets in Eq. (3.37), we notice that the level curves of the PDF are essentially determined by the same term as in the normal case:
If we fix the value of this quantity and draw the corresponding level curve, we obtain an ellipse, which explains why we speak of elliptical distributions. Both distributions are symmetrical with respect to the expected value.
The mvtnorm package offers functions to calculate the PDF and to generate random samples from multivariate normal and t distributions:
The corresponding r functions for random variate generation are rmvnorm and rmvt; there are also q and p functions, but they are of course more problematic than the corresponding univariate functions. The script in Fig. 3.20 shows how to use R functions to plot the logarithm of the density of these two distributions. The resulting contour plots are displayed in Fig. 3.21. We should not be surprised by the negative values, since we are displaying log-densities to make the plots more readable. We immediately observe the elliptic character of both densities and the effect of positive correlation. The most striking difference, as we have remarked, is in the fatter tails of the t distribution.
FIGURE 3.20 Script to illustrate the mvtnorm package.
FIGURE 3.21 Contour plots of the log-density of multivariate normal and t distributions.
The idea behind copulas is to factor an n-dimensional multivariate distribution into two components:
The idea is to express the n-dimensional CDF as
Indeed, since C(···) must map a set of marginal probabilities into another probability, it must be a function mapping [0, 1]n into [0, 1]. However, there are additional requirements that we detail below, referring to a two-dimensional copula for the sake of simplicity.
Consider a two-dimensional copula C(u1, u2). A reasonable requirement is that, for any u1, u2 ∈ [0, 1],
This amounts to saying that if u1 = P{X1 ≤ x1} = 0, then FX(x1, x2) = 0 for whatever value of x2. In fact, below the support of the distribution of X1, the CDF must be zero; or, if the support is unbounded below, the limit for x1 → −∞ must be 0. A similar consideration applies to the second variable x2, as well as to any variable in a multidimensional case. This leads to the following general definition.
DEFINITION 3.8 (Grounded function) Consider a function H(···) of n variables, with domain S1 × ··· × Sn, where each set Sk has a smallest element ak. We say that H is grounded if H(y1, …, yn) = 0, whenever at least one yk = ak.
Thus, we have to require that a copula is a grounded function. In the case of a copula, Sk = [0, 1] and ak = 0. After looking below the support, let us move to the opposite side and have a look above it. There, a CDF must take the value 1; or, if the support is unbounded above, the limit for xk → +∞ must be 1. More generally, the copula should have sensible margins.
DEFINITION 3.9 (Margins) Consider a function H(···) of n variables with domain S1 × ··· × Sn, where each set Sk has a largest element bk. The one-dimensional margin of H corresponding to the kth variable is the function
It is possible to define higher-dimensional margins, but we will simply refer to one-dimensional margins as margins.
On the basis of this definition, we see that a copula C(···) should have margins Ck(·) such that
As a last requirement, consider a two-dimensional CDF and the region illustrated in Fig. 3.22. The darkest rectangle in the figure is the set [a1, b1] × [a2, b2]. Let us denote this set by S; given the CDF FX(x1, x2), it is easy to see that
FIGURE 3.22 Illustrating a property of multidimensional CDFs.
Since this is a probability, the expression on the right must be positive. This leads to the following definition.
DEFINITION 3.10 (2-increasing function) Consider a function H(·, ·) of two variables, and let S = [a1, b1] × [a2, b2] be a rectangle with all vertices in the domain of H(·, ·). We define the H-volume of S as
We say that the function is 2-increasing if VH(S) ≥ 0 for any such rectangle.
The definition may be extended to n-dimensional functions, but we omit the technicalities involved.21 It should be noted that the 2-increasing property does not imply and is not implied by the fact that the function is nondecreasing in each of its arguments. This is well illustrated by the following counterexamples.22
Consider the function
defined on I2 = [0, 1]2. It is immediate to check that the function is nondecreasing in x1 and x2. However, it is not 2-increasing, as the H-volume of I2 is
On the contrary, let us consider
defined on I2 = [0, 1]2. Clearly, the function is decreasing in x1 for x2 < 0.5, and it is decreasing in x2 for x1 < 0.5. Nevertheless, it is 2-increasing. To see this, consider the rectangle S = [a1, b1] × [a2, b2], where b1 ≥ a1 and b2 ≥ a2, and compute its G-volume:
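The volume computation is mechanical and easy to code. In the sketch below, the first function mirrors the first counterexample; the second is a function with the same qualitative properties as the second counterexample (our own choice, G(x1, x2) = (2x1 − 1)(2x2 − 1), not necessarily the one in the text).

```r
# H-volume of a rectangle S = [a1,b1] x [a2,b2] for a bivariate function H
Hvol <- function(H, a1, b1, a2, b2) {
  H(b1, b2) - H(b1, a2) - H(a1, b2) + H(a1, a2)
}
Hmax <- function(x1, x2) pmax(x1, x2)  # nondecreasing, but not 2-increasing
Hvol(Hmax, 0, 1, 0, 1)                 # H-volume of the unit square is -1
G <- function(x1, x2) (2*x1 - 1)*(2*x2 - 1)  # not monotone, but 2-increasing
Hvol(G, 0, 1, 0, 1)                    # G-volume: 4*(b1-a1)*(b2-a2) >= 0
```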
All of the above considerations are summarized in the following definition.
DEFINITION 3.11 (Copula) An n-dimensional copula is a function C(···), defined on the domain In = [0, 1]n, such that:
We have stated requirements such that, given an n-dimensional distribution with marginal CDFs Fk, the function
is a sensible CDF. One might wonder whether we can also go the other way around: Given a joint CDF, is it possible to decompose it in the form of Eq. (3.38)? The answer is positive, thanks to the following fundamental theorem.
THEOREM 3.12 (Sklar’s theorem) Let FX be an n-dimensional CDF with continuous marginal CDFs F1(·), …, Fn(·). Then, there exists a unique copula C(···) such that
Note that the above theorem statement asserts uniqueness for continuous distributions. In fact, Sklar’s theorem also states the existence of a copula for discrete distributions, but there are technical issues concerning uniqueness that we want to avoid here. Since we are just dealing with continuous distributions, we assume invertibility of marginal CDFs and rewrite Eq. (3.38) as
(3.39)
When looking at this equation, it is important to realize that if we evaluate a marginal CDF Fk(x) for x = Xk, where Xk is a random variable characterized by that CDF Fk, then Fk(Xk) is a uniform random variable. To see this, note that, assuming invertibility of the margin and taking u ∈ [0, 1],
However, this is just the CDF of a uniform distribution. This fact plays a fundamental role in random variate generation.23 Indeed, having defined the general concept of a copula, the following tasks must be typically carried out:
We leave the second and the third task to Chapters 4 and 5, respectively. Here we describe a few common copula functions as examples.
The product copula is defined as
(3.41)
It turns out, not surprisingly, that the continuous random variables in the vector (X1, …, Xn) are independent if and only if their copula is the product copula.
The normal (or Gaussian) copula is obtained by using the CDF of standard normals in Eq. (3.40). Let Φ be the CDF of a standard normal and ΦRn be the joint CDF for an n-dimensional standard normal distribution with correlation matrix R. Then, we may define the normal copula function
(3.42)
Note that this copula cannot be expressed analytically; nevertheless, it is possible to sample from the normal copula.
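Although no analytical expression is available, sampling from the normal copula is easy: We generate correlated standard normals and apply Φ componentwise, which is just the probability integral transform mentioned above. A minimal sketch:

```r
# Sampling from a Gaussian copula: push correlated normals through pnorm;
# the resulting U1, U2 have uniform marginals and normal-copula dependence
set.seed(55555)
rho <- 0.7
n <- 100000
Z1 <- rnorm(n)
Z2 <- rho*Z1 + sqrt(1 - rho^2)*rnorm(n)
U1 <- pnorm(Z1)
U2 <- pnorm(Z2)
c(mean(U1), mean(U2))   # both close to 0.5, as for uniforms on [0,1]
```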
The t copula is built much like the normal copula, on the basis of the CDF of a t distribution with ν degrees of freedom:
(3.43)
where tν is the CDF of a t distribution with ν degrees of freedom and tν,R is the CDF of a multivariate t distribution with ν degrees of freedom and correlation matrix R.
We have already illustrated a few pitfalls of Pearson’s correlation:
This has led to the proposal of alternative correlation coefficients.
DEFINITION 3.13 (Kendall’s tau) Consider independent and identically distributed random vectors X = [X1, X2] and Y = [Y1, Y2]. Kendall’s tau is defined as
In plain words, Kendall’s tau measures the probability of concordance minus the probability of discordance. Note that in covariance, and as a consequence in defining Pearson’s correlation, we take an expected value, which is influenced by the numerical values attained by the random variables. Here we are using a probability in a way that captures some more essential features. This is essentially why tau is not sensitive to monotonic transformations.
DEFINITION 3.14 (Spearman’s rho) Consider a joint CDF Fx(x1, x2), with marginals F1(·) and F2 (·). Spearman’s rho is defined as the Pearson’s correlation between the transformed random variables F1(X1) and F2(X2):
(3.44)
To get an intuitive grasp of the definition, let us recall that F1 (X1) and F2(X2) are uniformly distributed. Hence, in the definition of ρS we pick the two random variables and map them into uniform variables through their CDF, which is a monotonic transformation into the unit interval [0, 1]. Hence, we are actually correlating the ranks of the realizations of the two variables. Indeed, Spearman’s rho is also known as rank correlation. We should mention that sample counterparts have been defined for both Kendall’s tau and Spearman’s rho, and that there is a deep connection between them and copulas, as well as alternative definitions.24
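The fact that Spearman’s rho is a rank correlation can be checked directly: Computing Pearson’s correlation on the ranks reproduces it exactly, in the absence of ties.

```r
# Spearman's rho = Pearson's correlation applied to ranks
set.seed(55555)
X <- rnorm(5000)
Y <- X + rnorm(5000)
rho_s <- cor(X, Y, method = "spearman")
rho_r <- cor(rank(X), rank(Y))   # Pearson on the ranks
all.equal(rho_s, rho_r)          # TRUE
```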
The script in Fig. 3.23 illustrates the fact that Spearman’s rho and Kendall’s tau are not influenced by certain transformations of the underlying random variables. For our illustration purposes, we are using their sample counterparts here. We generate a sample of standard normal variables X1 and X2, with correlation 0.8. Then we take the exponential of the first sample, F1 = eX1, and we apply the cor function to estimate different correlations. By the way, in R there is no need to specify the pearson method, as it is the default; hence cor(X1,X2,method="pearson") and cor(X1,X2) are equivalent. Running the script, we obtain
FIGURE 3.23 Illustrating invariance properties of Spearman’s rho and Kendall’s tau.
> set.seed(55555)
> tryCorrelation()
Correlations
Pearson:  (X1,X2)= 0.8091203   (F1,X2)= 0.6192021
Spearman: (X1,X2)= 0.7952385   (F1,X2)= 0.7952385
Kendall:  (X1,X2)= 0.6000901   (F1,X2)= 0.6000901
We observe that Pearson’s correlation for the transformed variable is smaller: This is due to the nonlinear nature of the exponential transformation, which in this case weakens the linear relationship between the two random variables. The other two correlation measures are not affected.
A relevant feature of multivariate distributions, especially for risk management, is tail dependence. The idea is focusing on the joint behavior of random variables when extreme events occur. For the sake of simplicity, we deal here with the upper tail dependence for two random variables X1 and X2; the case of lower tail dependence is similar.
DEFINITION 3.15 (Upper tail dependence) Let us consider two jointly continuous random variables X1 and X2 with marginal CDFs F1(·) and F2(·). The coefficient of upper tail dependence is defined as
provided that the limit exists. If λU = 0, then the two random variables are said to be asymptotically independent in the upper tail; otherwise, they are said to be asymptotically dependent in the upper tail.
Since u is reaching its upper limit in the definition, the coefficient λU captures the joint behavior of X1 and X2 in the limit of the upper-right (North–East) quadrant. It turns out that this property is related to the copula, which means that it is invariant for strictly increasing transformations of the random variables.
Let us recall the following formula from elementary probability:
(3.46)
where Ē is the complement of event E. It is easy to see that the probability in Eq. (3.45) can be rewritten as
(3.47)
Therefore, using the copula C(·, ·) between X1 and X2 and assuming that the limit exists, we may rewrite the above probability as
(3.48)
An interesting feature of the normal copula is that, when the correlation coefficient ρ is strictly less than 1, it can be shown that λU = 0; hence, we have tail independence.25 It is often argued that this fact points out another limitation of the multivariate normal distribution for risk management.
Linear regression models are traditionally introduced within an inferential statistics framework. We will see how R can be used to accomplish classical estimation by ordinary least squares in Section 4.4. However, a probabilistic view is also possible, where we consider the problem of approximating a random variable Y by an affine transformation a + bX of another random variable X. This is quite useful in order to better appreciate:
The key point is how to define a measure of distance between Y and a + bX. One possibility is to consider the variance of their difference. If this variance is close to zero, we may argue that the approximation tracks its target pretty well. If we do so and minimize tracking error variance, we formulate the following distance:
which is minimized with respect to b by setting
We immediately see one issue with the above choice: The coefficient a plays no role. A reasonable choice is to select a such that the approximation is unbiased:
The reader familiar with linear regression in inferential statistics will find the above formulas quite familiar; in fact, least-squares estimators of the coefficients are just sample counterparts of the expressions in Eqs. (3.49) and (3.50). We may also set the two coefficients together, in a more convincing way, by minimizing the mean square deviation, rather than variance:
Using the above calculation and a well-known property of variance, the objective function may be rewritten as follows:
Now we enforce the optimality conditions, taking partial derivatives with respect to a and b, respectively. The first condition,
yields the result in Eq. (3.50), whereas the second one is
which yields the coefficient in Eq. (3.49).
Within this framework, it is also easy to see that we obtain a kind of “orthogonal projection,” in the sense that the approximation a + bX and the residual Y − (a + bX) are uncorrelated:
Thus, the optimal approximation uses the maximum information about Y that can be included in bX; the residual variability is orthogonal.
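The whole construction is easy to verify with sample counterparts: Estimate b = Cov(X, Y)/Var(X) and a = E[Y] − bE[X], and check that the residual is (exactly) uncorrelated with the approximation. The data-generating process below is our own choice.

```r
# Best affine approximation of Y by a + b*X in the mean-square sense
set.seed(55555)
X <- rnorm(10000, mean = 2)
Y <- 3 + 1.5*X + rnorm(10000)     # true intercept 3, slope 1.5
b <- cov(X, Y)/var(X)             # sample counterpart of Eq. (3.49)
a <- mean(Y) - b*mean(X)          # sample counterpart of Eq. (3.50)
resid <- Y - (a + b*X)
cov(a + b*X, resid)               # zero up to floating-point error
```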
The previous section was mainly aimed at cross-sectional dependence, i.e., the relationship between different random variables observed at the same time. In this section we move on to longitudinal dependence and consider discrete-time models based on time series. Time series models are an important topic within econometrics, and here we can only afford to scratch the surface. In doing so, we have the following objectives:
In the next chapter we will see some more R functions to estimate time series models. Nevertheless, the tone in this section is a bit different from the rest of the chapter. The theory behind time series is far from trivial, and we will rely on simple simulations to get an intuitive feeling for them. Furthermore, some concepts we use here, such as the analysis of autocorrelation, are a bit halfway between probability and statistics. We include them here as they are part of the toolbox for model building. In fact, applying time series requires the following steps:
We cover here the first task, and estimation methods are deferred to the next chapter. After a few preliminary considerations, we will move on to consider two fundamental classes of time series models:
Then, in Section 3.6.3 we will see how these modeling approaches can be integrated in the more general class of ARIMA (autoregressive integrated moving average) processes, also known as Box–Jenkins models. The next step, in Section 3.6.4, is to generalize a bit, moving from scalar time series to multidimensional vector autoregressive models. Finally, in Section 3.6.5 we will present some more specific model classes, like ARCH and GARCH, to represent stochastic volatility. These models are quite relevant for financial applications. As we shall see, time series modeling offers many degrees of freedom, maybe too many; when observing a time series, it may be difficult to figure out which type of model is best-suited to capture the essence of the underlying process. This is why model identification plays a pivotal role.
In its basic form, time series theory deals with weakly stationary processes, i.e., time series Yt with the following properties, related to first- and second-order moments:
The second condition deserves some elaboration.
DEFINITION 3.16 (Autocovariance and autocorrelation) Given a weakly stationary stochastic process Yt, the function
is called autocovariance of the process with time lag k. The function
is called the autocorrelation function (ACF).
The definition of autocorrelation relies on the fact that variance is constant as well:
In practice, autocorrelation may be estimated by the sample autocorrelation function (SACF), given a sample path Yt, t = 1, …, T:
where Ȳ is the sample mean of Yt. The expression in Eq. (3.51) may not look quite convincing, since the numerator and the denominator are sums involving a different number of terms. In particular, the number of terms in the numerator is decreasing in the time lag k. Thus, the estimator looks biased and, for a large value of k, Rk will vanish. However, this is what one expects in real life. Furthermore, although we could account for the true number of terms involved in the numerator, for large k the sum involves very few terms and is not reliable. Indeed, the form of sample autocorrelation in Eq. (3.51) is what is commonly used in statistical software packages, even though alternatives have been proposed.26 If T is large enough, under the null hypothesis that the true autocorrelations ρk are zero for k ≥ 1, the statistic
is approximately standard normal. Since z0.975 = 1.96 ≈ 2, a commonly used approximate rule states that if
then the sample autocorrelation at lag k is statistically significant. For instance, if T = 100, autocorrelations outside the interval [−0.2, 0.2] are significant. We should keep in mind that this is an approximate result, holding for a large number T of observations. We may plot the sample autocorrelation function at different lags, obtaining an autocorrelogram that can be most useful in pointing out hidden patterns in data.
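The rule of thumb is easy to test on pure noise: With T = 100 observations of white noise, sample autocorrelations should fall outside the ±2/√T ≈ ±0.2 bands only about 5% of the time.

```r
# Approximate significance bands for the SACF of white noise
set.seed(55555)
T <- 100
Y <- rnorm(T)                                   # Gaussian white noise
r <- acf(Y, lag.max = 20, plot = FALSE)$acf[-1] # drop the trivial lag 0
band <- 2/sqrt(T)                               # approximate 95% bands
sum(abs(r) > band)   # typically 0, 1, or 2 spurious "significant" lags
```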
Many time series in economics display seasonality. Consider the time series displayed on the left in Fig. 3.24. A cursory look at the plot does not suggest much structure in the data, but we may try to get a clue by plotting the ACF on the right. However, the ACF does not look quite helpful either. We notice decreasing autocorrelations, many of which are significant; the plot displays two horizontal bands outside which autocorrelation is statistically significant. Notice that
FIGURE 3.24 A seasonal time series and its autocorrelogram.
and the positioning of the two horizontal bands is consistent with Eq. (3.52). In fact, the time series is seasonal and has been generated by simulating the model
which features an additive seasonal component St superimposed on a linear trend a + bt, plus a noise term εt. The following R code shows how such a model can be simulated, and how the acf function can be used to plot the SACF:
set.seed(55555)
t <- 1:40
factors <- c(15, 30, -10, -35)
Y <- 100 + 10*t + factors[t%%4 + 1] + rnorm(length(t), 0, 20)
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
The seasonal factors are collected in the vector (15, 30, −10, −35); note that the average factor, for additive models, is zero, and that the indexing t%%4+1 repeats the four factors in a cycle over the simulation horizon; we recall that the operator %% corresponds to the mathematical operator mod and yields the remainder of the integer division. The noise terms εt are normally distributed, which need not be the case in general. The problem actually lies in the trend. We should detrend the series in order to see more structure, which can be accomplished by linear regression:
mod <- lm(Y~t)
dY <- Y - mod$fitted
plot(t, dY, type="l")
points(t, dY, pch=15)
acf(dY)
par(mfrow=c(1,1))
The first statement above uses lm to estimate a linear regression model, relating variable Y with time t; the model is assigned to the variable mod, which is a rich data structure including fitted values, among other things. By subtracting the fitted trendline, we detrend the time series. The new ACF, displayed in Fig. 3.25, shows a quite different pattern: the autocorrelogram is indeed a useful tool to spot seasonality and other hidden patterns in data.
FIGURE 3.25 Detrended seasonal time series and its autocorrelogram.
In the above example, we have used an important building block in time series modeling, called white noise, which is a sequence of i.i.d. (independent and identically distributed) random variables. If they are normal, we have a Gaussian white noise. In other models, noise may be “colored,” i.e., there may be some autocorrelation between the driving random terms εt. As we have mentioned, the first step in modeling a time series is the identification of the model structure. In the following we outline a few basic ideas that are used to this aim, pointing out the role of autocorrelation in the analysis.
A finite-order moving-average process of order q, denoted by MA(q), can be expressed as
Yt = μ + εt − θ1εt−1 − θ2εt−2 − ⋯ − θqεt−q,
where the random variables εt are white noise, with E[εt] = 0 and Var(εt) = σ². These variables play the role of random shocks and drive the process. It is fairly easy to see that the process is weakly stationary. In fact, a first observation is that expected value and variance are constant:
The calculation of autocovariance is a bit more involved, but we may take advantage of the fact that white noise is serially uncorrelated:
As a consequence, the autocorrelation function is
Thus, the autocorrelation function depends only on the lag k. We also notice that the autocorrelation function cuts off for lags larger than the order of the process. This makes sense, since the process Yt is a moving average of the driving white noise εt. Hence, by checking whether the sample autocorrelation function cuts off after a time lag, we may figure out whether a time series can be modeled as a moving average, as well as its order q. Of course, the sample autocorrelation will not be exactly zero for k > q; nevertheless, by using the autocorrelogram and its significance bands, we may get some clue.
Let us consider the simple MA(1) process
Yt = 40 + εt + 0.8εt−1,
where εt is a sequence of uncorrelated standard normal variables (Gaussian white noise). In Fig. 3.26 we show a sample path and the corresponding sample autocorrelogram obtained by the following R code (note how vectorization avoids a for loop):
FIGURE 3.26 Sample path and corresponding sample ACF for the moving-average process Yt = 40 + εt + 0.8εt−1.
set.seed(55555)
t <- 1:100
epsVet <- rnorm(101)
Y <- 40 + epsVet[2:101] + 0.8*epsVet[1:100]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
Incidentally, in this section we are writing our own code to simulate ARMA models for illustrative purposes, but R offers a ready-to-use function arima.sim, which will be used later in Section 4.5. The sample autocorrelation looks significant at time lag 1, which is expected, given the nature of the process. Note that, by applying Eq. (3.53), we find that the autocorrelation function, for a MA(1) process Yt = μ + εt − θ1εt−1, is
ρ1 = −θ1 / (1 + θ1²),   (3.54)
ρk = 0,  k ≥ 2.   (3.55)
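These formulas are easily checked by simulation. The following sketch (ours, for illustration) uses a long sample path of the process Yt = 40 + εt + 0.8εt−1 considered above; with this + sign convention, the lag-1 autocorrelation is θ1/(1 + θ1²):

```r
theta1 <- 0.8
rho1 <- theta1 / (1 + theta1^2)   # theoretical lag-1 autocorrelation, about 0.488
set.seed(55555)
n <- 100000
eps <- rnorm(n + 1)
Y <- 40 + eps[2:(n + 1)] + theta1 * eps[1:n]
r <- acf(Y, lag.max = 2, plot = FALSE)$acf
c(r[2], rho1)   # sample vs. theoretical value at lag 1
r[3]            # lag 2: should be close to zero (cutoff)
```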
Figure 3.27 shows the sample path and autocorrelogram of the slightly different MA(1) process
Yt = 40 + εt − 0.8εt−1
(the R code is essentially the same as above).
FIGURE 3.27 Sample path and corresponding ACF for the moving-average process Yt = 40 + εt − 0.8εt−1.
The change in sign in θ1 has a noticeable effect on the sample path; an upswing tends to be followed by a downswing, and vice versa. The autocorrelogram shows a cutoff after time lag 1, and a negative autocorrelation.
If we increase the order of the process, we should expect more significant autocorrelations. In Fig. 3.28, we repeat the exercise for the MA(2) process
FIGURE 3.28 Sample path and corresponding ACF for the moving-average process Yt = 40 + εt + 0.9εt−1 + 0.5εt−2.
using the following R code:
set.seed(55555)
t <- 1:100
epsVet <- rnorm(102)
Y <- 40 + epsVet[3:102] + 0.8*epsVet[2:101] + 0.6*epsVet[1:100]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
We observe that, in this case, the autocorrelation function cuts off after time lag k = 2.
We should mention that sample autocorrelograms are a statistical tool. It may well be the case that, for the moving-average processes in the examples, we get different pictures when different sample paths are randomly generated. This is a useful experiment to carry out with the help of statistical software.
In finite-order moving-average processes, only a finite number of past realizations of white noise influence the value of Yt. This may be a limitation for those processes in which all of the previous realizations have an effect, even though this possibly fades in time. In principle, we could consider an infinite-order moving-average process, but having to deal with an infinite sequence of coefficients does not sound quite practical. Luckily, under some technical conditions, such a process may be rewritten in a compact form involving time-lagged realizations of the process Yt itself. This leads us to the definition of an autoregressive process of a given order. The simplest such process is the autoregressive process of order 1, AR(1):
Yt = δ + ϕYt−1 + εt.   (3.56)
One could wonder under which conditions this process is stationary, since here we cannot use the same arguments as in the moving-average case. A heuristic argument to find the expected value μ = E[Yt] is based on taking expectations and dropping the time subscript in Eq. (3.56):
μ = δ + ϕμ  ⇒  μ = δ/(1 − ϕ).
The argument is not quite correct, as it yields a sensible result only if the process is indeed stationary, which is the case if |ϕ| < 1. Otherwise, intuition suggests that the process will grow without bounds. The reasoning can be made precise by using the infinite-term representation of Yt, which is beyond the scope of this book. Using the correct line of reasoning, we may also prove that
In particular, we have
and we may also observe that, for a stationary AR(1) process,
ρk = ϕ^k,  k = 0, 1, 2, ….   (3.57)
We notice that the autocorrelation decays with increasing lag, but it fades away with no sharp cutoff.
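Again, we may check Eq. (3.57) numerically; the snippet below (our own sketch) simulates a long stationary AR(1) path with ϕ = 0.8 and compares the sample ACF with the geometric decay ϕ^k:

```r
set.seed(42)
phi <- 0.8
n <- 100000
eps <- rnorm(n)
Y <- numeric(n)
for (t in 2:n) Y[t] <- phi * Y[t-1] + eps[t]      # zero-mean stationary AR(1)
r <- acf(Y, lag.max = 5, plot = FALSE)$acf[-1]    # sample ACF at lags 1..5
theory <- phi^(1:5)                               # rho_k = phi^k: no sharp cutoff
rbind(sample = as.vector(r), theory = theory)
```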
In Figs. 3.29 and 3.30, we show a sample path and the corresponding sample autocorrelogram for the two AR(1) processes
FIGURE 3.29 Sample path and corresponding ACF for the autoregressive process Yt = 8 + 0.8Yt−1 + εt.
FIGURE 3.30 Sample path and corresponding ACF for the autoregressive process Yt = 8 − 0.8Yt−1 + εt.
respectively. The R code cannot avoid a for loop in this case:
set.seed(55555)
t <- 1:100
epsVet <- rnorm(100)
Y <- numeric(100)
Y[1] <- 40   # start from the process mean 8/(1 - 0.8) = 40
for (k in 2:100) Y[k] <- 8 + 0.8*Y[k-1] + epsVet[k]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
Notice that the change in sign in the ϕ coefficient has a significant effect on the sample path, as well as on autocorrelations. In the first case, autocorrelation goes to zero along a relatively smooth path. This need not be the case, as we are working with sample autocorrelations. Nevertheless, at least for significant values, we observe a monotonic behavior consistent with Eq. (3.57). The sample path of the second process features evident up- and downswings; we also notice an oscillatory pattern in the autocorrelation.
The autocorrelation behavior of AR processes does not feature the cutoff properties that help us determine the order of an MA process. A tool that has been developed for the identification of AR processes is the partial autocorrelation function (PACF). The rationale behind PACF is to measure the degree of association between Yt and Yt−k, removing the effects of intermediate lags, i.e., Yt−1, …, Yt−k+1. We cannot dwell too much on PACF, but we may at least get a better intuitive feeling as follows. Consider three random variables X, Y, and Z, and imagine regressing X and Y on Z:
Note that we are considering a probabilistic regression, not a sample-based regression. From Section 3.5 we know that
Furthermore, we have regression errors
which may be regarded as the random variables X and Y, after the effect of Z is removed. The correlation ρ(X, Y) may be large because of the common factor Z (which plays the role of a “lurking” variable). If we want to get rid of it, we may consider the partial correlation ρ(X*, Y*). On the basis of this intuition, we might consider estimating the partial autocorrelation between Yt and Yt−k by the following linear regression:
The inclusion of intermediate lagged variables Yt−1, …, Yt−k+1 allows us to capture their effect explicitly through the regression coefficients b1, …, bk−1. Then, we could use bk as an estimate of the partial autocorrelation. Actually, this need not be the soundest approach, but software packages provide us with ready-to-use functions to estimate the PACF by its sample counterpart. In Fig. 3.31 we show the SPACF for the two AR(1) processes of Example 3.12. The plot was obtained by the pacf function. The example also shows how to use expression to build a mathematical annotation for the plot, including subscripts, etc.:27
FIGURE 3.31 Sample partial autocorrelation function for the autoregressive processes of Example 3.12.
expr1 <- expression(Y[t] == 8 + 0.8 %*% Y[t-1] + epsilon[t])
pacf(Y, main = expr1)
We see that the SPACF cuts off after lag 1, even though, due to statistical sampling errors, it seems that there is a significant value at larger lags in the first case. This suggests how to use this function to identify the order of an AR model.
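The regression-based idea sketched above can be compared directly with the output of pacf. The following snippet (ours; pacf itself computes its values from the sample ACF, not by running regressions) estimates the lag-2 partial autocorrelation of an AR(1) process in both ways; both estimates should be close to zero:

```r
set.seed(55555)
n <- 5000
eps <- rnorm(n)
Y <- numeric(n)
Y[1] <- 40                          # start at the process mean 8/(1 - 0.8)
for (t in 2:n) Y[t] <- 8 + 0.8 * Y[t-1] + eps[t]
# regress Y_t on lags 1 and 2; the coefficient of lag 2 estimates the
# lag-2 partial autocorrelation, which vanishes for an AR(1) process
y0 <- Y[3:n]; y1 <- Y[2:(n-1)]; y2 <- Y[1:(n-2)]
b2 <- unname(coef(lm(y0 ~ y1 + y2))["y2"])
p2 <- pacf(Y, lag.max = 2, plot = FALSE)$acf[2]
c(b2, p2)   # both should be close to zero
```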
Autoregressive and moving-average processes may be merged into ARMA (autoregressive moving-average) processes like
The model above is referred to as ARMA(p, q) process, for self-explanatory reasons. Conditions ensuring stationarity have been developed for ARMA processes, as well as identification and estimation procedures. Clearly, the ARMA modeling framework provides us with plenty of opportunities to fit historical data with a model that can be used to create scenarios by Monte Carlo sampling. However, it applies only to stationary data. It is not too difficult to find real-life examples of data processes that are nonstationary. Just think of stock market indices; most investors really wish that the process is not stationary.
A quite common building block in many financial models is the random walk. An example of random walk is
Yt = Yt−1 + 0.05ηt,   (3.59)
where ηt is a sequence of independent and standard normal random variables. This is actually an AR process, but from Section 3.6.2 we know that it is nonstationary, as ϕ = 1. Three sample paths of this process are shown in Fig. 3.32. They look quite different, and a subjective comparison of the sample paths would not suggest that they are realizations of the same stochastic process. However, the ACFs show a common pattern: Autocorrelation fades out very slowly. Indeed, this is a common feature of nonstationary processes. The PACFs display a very strong partial autocorrelation at lag 1, which cuts off immediately in all of the three cases.
FIGURE 3.32 Three sample paths and the corresponding ACF and PACF for the random walk Yt = Yt−1 + 0.05ηt.
Since the theory of stationary MA and AR processes is well developed, it would be nice to find a way to apply it to nonstationary processes as well. A commonly used trick to remove nonstationarity in a time series is differencing, by which we consider the time series
Y′t = Yt − Yt−1.   (3.60)
In R, differencing is carried out by the diff function, which is applied in the following snapshot to the random walk of Eq. (3.59):
set.seed(5)
epsVet <- c(0, rnorm(100))   # first value is dummy
Y <- c(4, numeric(100))
for (t in 2:101) Y[t] <- Y[t-1] + 0.05*epsVet[t]
dY <- diff(Y)
par(mfrow=c(2,2))
plot(1:100, Y[-1], type="l")
points(1:100, Y[-1], pch=15)
acf(Y)
plot(1:100, dY, type="l")
points(1:100, dY, pch=15)
acf(dY)
In Fig. 3.33 we show the sample path and the differenced series, along with their ACFs. The shape of the ACF is not surprising, since the differenced process is just white noise.
FIGURE 3.33 The effect of differencing on the simple random walk of Eq. (3.59).
A time series with trend, such as
Yt = a + b·t + εt,   (3.61)
where εt is white noise, is clearly nonstationary, as it features a deterministic trend. However, nonstationarity is a subtler concept than that. A little digression is in order to clarify the nature of nonstationarity in a random walk like
Yt = Yt−1 + εt.   (3.62)
The sample paths that we have generated in Example 3.13 show that this random walk does not feature a deterministic trend. The recursive unfolding of Eq. (3.62) yields
Therefore, we find that
Hence, we must have a different kind of nonstationarity in the random walk of Eq. (3.62) than in the process described by Eq. (3.61). To investigate the matter, let us consider the expected value of the increment Y′t = Yt − Yt−1, conditional on Yt−1:
Therefore, given the last observation Yt−1, we cannot predict whether the time series will move up or down. This shows a connection between the random walk and the efficient market hypothesis. Now, let us consider a stationary AR(1) process
where ϕ ∈ (−1, 1). The increment in this case is
Since (ϕ − 1) < 0, we have
This suggests that a stationary AR(1) process is mean reverting, in the sense that the process tends to return to its expected value; the nonstationary random walk does not enjoy this property.
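The difference can be visualized by simulation. In the following sketch (ours, with unit-variance Gaussian white noise), the terminal values of many stationary AR(1) paths have variance close to 1/(1 − ϕ²), whereas the variance of a random walk keeps growing linearly with the number of steps:

```r
set.seed(1)
nrep <- 2000; nstep <- 200; phi <- 0.8
ar_end <- numeric(nrep); rw_end <- numeric(nrep)
for (i in 1:nrep) {
  eps <- rnorm(nstep)
  ar <- 0; rw <- 0
  for (t in 1:nstep) {
    ar <- phi * ar + eps[t]   # mean reverting: pulled back toward the mean
    rw <- rw + eps[t]         # random walk: shocks accumulate forever
  }
  ar_end[i] <- ar; rw_end[i] <- rw
}
c(var(ar_end), 1 / (1 - phi^2))   # close to 1/(1 - 0.64) = 2.78
c(var(rw_end), nstep)             # close to 200: variance grows linearly in time
```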
If we introduce the backshift operator B, defined by
B Yt = Yt−1,
we may express the first difference in Eq. (3.60) as
Y′t = (1 − B)Yt.
Sometimes, differencing must be repeated in order to obtain a stationary time series. We obtain second-order differencing by repeated application of (first-order) differencing:
Y″t = (1 − B)Y′t = (1 − B)²Yt = Yt − 2Yt−1 + Yt−2.
This suggests that we may formally apply the algebra of polynomials to the backshift operator, in order to find differences of arbitrary order. By introducing the operator polynomials
we can rewrite the ARMA model of Eq. (3.58) in the compact form:
Φ(B)Yt = Θ(B)εt.   (3.63)
We may also extend the class of stationary ARMA models, in order to allow for nonstationarity. Doing so, we find the more general class of ARIMA (autoregressive integrated moving average) processes, also known as Box–Jenkins models. An ARIMA(p, d, q) process can be represented as follows:28
Φ(B)(1 − B)^d Yt = Θ(B)εt,   (3.64)
where Φ(B) and Θ(B) are polynomials of order p and q, respectively, and d is a differencing order such that the differenced process is stationary, whereas, if we take differences of order (d − 1), the process is still nonstationary. The name “integrated” stems from the fact that we obtain the nonstationary process by integrating a stationary one, i.e., by undoing differencing. In most practical applications the order d is 0 or 1. Box–Jenkins models can be extended to cope with seasonality, but a full account of this class of models is beyond the scope of this book; we refer the reader to the references provided at the end of this chapter.
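The arima.sim function mentioned earlier can generate ARIMA paths directly. The following sketch (ours, with illustrative parameters) simulates an ARIMA(1,1,0) process and shows that, while the level series has the slowly decaying ACF typical of nonstationary data, the differenced series behaves like a stationary AR(1):

```r
set.seed(55555)
# ARIMA(1,1,0): nonstationary in levels, stationary after one difference
x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 500)
dx <- diff(x)
r1_level <- acf(x,  plot = FALSE)$acf[2]   # close to 1: very slow decay
r1_diff  <- acf(dx, plot = FALSE)$acf[2]   # close to the AR coefficient 0.5
c(r1_level, r1_diff)
```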
If you want to generalize the scalar time series models that we have seen so far, in principle, there is no limit to what one can conceive. Simultaneous equations can be built, resulting in quite rich structures. In practice, an overly complicated model need not be the smartest option. On the one hand, there may be issues with estimation and the assessment of properties such as stationarity. On the other hand, the predictive power of a seemingly rich model may be disappointing due to overfitting issues. This is why vector autoregressive models, VAR for short,29 have been proposed. Let Yt ∈ ℝ^k be a vector of random variables; a VAR model can be represented in reduced form as
(3.65)
where Φ(B) is a matrix polynomial in the backshift operator B, and c, εt ∈ ℝ^k. More explicitly, a VAR(p) model is
Yt = c + Φ1Yt−1 + Φ2Yt−2 + ⋯ + ΦpYt−p + εt,   (3.66)
where the Φi are k × k matrices of coefficients. Note that the error term εt need not be a vector of independent variables.
Let us consider the bidimensional model
where the error terms are standard normals with correlation ρ = 0.5. In matrix form, we may write the above equations as
where
and
The following R script produces the time series displayed in Fig. 3.34.
FIGURE 3.34 Bidimensional time series corresponding to a VAR(1) model.
require(MASS)
tPhi1 <- t(matrix(c(0.7, 0.2, 0.2, 0.7), 2, 2))
c.vec <- c(-0.7, 1.3)
mu.vec <- numeric(2)
cov.mat <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
set.seed(55555)
T <- 100
Ypath <- matrix(0, T, 2)
Ypath[1,] <- c(1, 1)   # initial condition
epsPath <- mvrnorm(n=T, mu.vec, cov.mat)
for (t in 2:T) Ypath[t,] <- c.vec + Ypath[t-1,] %*% tPhi1 + epsPath[t,]
plot.ts(Ypath)
Note that, in order to collect the time series in a matrix with rows corresponding to observations, we have to transpose Φ. We also need the MASS package to sample from a multivariate normal.
When observing financial time series of returns, the following stylized facts may be noticed:
The first feature is referred to as heteroskedasticity, in contrast to the homoskedasticity assumption, whereby volatility is constant. To model volatility, it is convenient to build a time series of returns. However, returns over consecutive periods cannot be added; rather, gross returns should be multiplied. This is why a time series of log-returns ut is usually considered:
where St is the price of the relevant financial asset. We want to capture the conditional variance of the time series, which can be expressed as:
It is often the case that E[ut] ≈ 0, if the time bucket is small enough.30 Hence, the expected return term is typically dropped. To build a model, we might relate volatility to the last shock:
σt² = α0 + α1ut−1².   (3.67)
A model like this is called ARCH(1), where the acronym stands for autoregressive conditionally heteroskedastic, and 1 means that only ut with lag 1 plays a role in Eq. (3.67). This is just a part of an overall model describing ut. We may build an ARCH(q) model as
σt² = α0 + α1ut−1² + ⋯ + αqut−q².   (3.68)
An ARCH model is, in a sense, an MA-style model for volatility. We may also consider an AR-style model like
σt² = α0 + α1ut−1² + β1σt−1²,   (3.69)
which is known as GARCH(1, 1), i.e., a generalized ARCH model involving two variables with lag 1.
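As a minimal sketch of how such a model can be simulated (the parameter values below are illustrative and are not taken from the text; the actual code is given in Fig. 3.35), we may generate the volatility and the shocks recursively:

```r
set.seed(55555)
n <- 1000
alpha0 <- 0.1; alpha1 <- 0.1; beta1 <- 0.8   # alpha1 + beta1 < 1 for stationarity
u <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- alpha0 / (1 - alpha1 - beta1)   # start at the unconditional variance
u[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- alpha0 + alpha1 * u[t-1]^2 + beta1 * sigma2[t-1]
  u[t] <- sqrt(sigma2[t]) * rnorm(1)         # shock with conditional variance sigma2[t]
}
plot(u, type = "l")                          # volatility clustering should be visible
```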
A GARCH model should be coupled with another equation describing the evolution of a time series of interest. Let us consider a time series of returns generated as an expected return plus a random shock ut:
where
ut = σt·εt,  εt ~ N(0, 1).
Note that σt is the standard deviation, i.e., the volatility of the return time series, and it follows Eq. (3.69). One may use a fatter-tailed distribution to generate the shocks εt. The code in Fig. 3.35 generates the two plots displayed in Fig. 3.36.31
FIGURE 3.35 Code to generate a sample path of a GARCH(1, 1) model.
FIGURE 3.36 Sample path of a GARCH(1, 1) model.
Many financial engineering models rely on continuous-time stochastic processes. One reason is that, in fact, time is continuous in the real world, but another one is that this modeling framework allows for the development of a rich theory. The theory is translated into practical models that allow for an analytical solution in some lucky cases, and even if we are not that lucky, we may still resort to numerical methods, including Monte Carlo. In this section we first illustrate a gentle transition from discrete- to continuous-time models, which leads to the introduction of a fundamental building block in continuous-time stochastic modeling, the Wiener process. Continuous-time models are typically expressed as stochastic differential equations, which not only generalize deterministic differential equations but also feature thorny technical issues that require the introduction of a new concept: the stochastic integral in the sense of Itô. This, in turn, leads to an essential result, the celebrated Itô's lemma, which is in some sense an extension of the chain rule for the derivative of composite functions in familiar calculus. In fact, it turns out that the familiar calculus rules do not work in this context and must be suitably modified. This is essential in tackling stochastic differential equations analytically, when possible, and numerically via sample path generation.32 Then, we introduce the family of Itô stochastic differential equations, which includes geometric Brownian motion, and some generalizations. In the following, we will mostly consider the price of a stock share as the relevant variable, but we can also model interest rates, exchange rates, macroeconomic variables, etc.
It is a good idea to start with a discrete-time model and then derive a continuous-time model heuristically. Consider a time interval [0, T], and imagine that we discretize it by a sequence of time periods of length δt, such that T = n · δt; discrete-time instants are indexed by t = 0, 1, 2, …, n. Let St be, for instance, the price of a stock share at time t. One possible and reasonable model for the evolution of this price over time is the multiplicative form
St+1 = ut·St,   (3.70)
where ut is a non-negative random variable and the initial price S0 is known. If we consider continuous random variables ut, the model is continuous-state. In the simplest model the variables ut are i.i.d., which may be justified by the efficient market hypothesis, if we believe in it (or by the lack of an alternative model we trust enough). The multiplicative model ensures that prices will stay non-negative, which is an obvious requirement for stock prices. If we used an additive model, such as St+1 = ut + St, we should admit negative values for the random variables ut, in order to represent price drops; therefore, we would not have the guarantee that St ≥ 0. With the multiplicative form, a price drops when ut < 1, but it remains positive. Furthermore, the actual impact of a price change depends on the present stock price; a $1 increase is a different matter if the current price is $100 rather than $5. This is easily accounted for by the multiplicative form, which essentially deals with percent changes in price.
In order to select a sensible probability distribution for the random variables ut, let us use a typical trick of the trade and consider the natural logarithm of the stock price:
log St = log St−1 + zt,  where zt ≡ log ut.
The random variable zt is the increment in the logarithm of price, and a common assumption is that it is normally distributed, which implies that ut is lognormal. Starting from the initial price S0 and unfolding (3.70) recursively, we find
St = S0·u1·u2 ⋯ ut,
which implies that
log St = log S0 + z1 + z2 + ⋯ + zt.
Since the sum of normal random variables is still a normal variable, log St is normally distributed, which in turn implies that stock prices are lognormally distributed, according to this model. To put it in another way, the product of lognormals is still lognormal. Using the notation
ν = E[zt],  σ² = Var(zt),
we see that
E[log St] = log S0 + ν·t,   (3.71)
Var[log St] = σ²·t,   (3.72)
where we take advantage of the intertemporal independence of the variables zt when calculating the variance. The important point to see here is that the expected value and the variance of the increment in the logarithm of the stock price scale linearly with time; this implies that standard deviation scales with the square root of time, whereas expected value scales linearly. This property is known as the square-root rule and implies that, over a very short time period, volatility (related to standard deviation) dominates drift (related to expected value).33
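The square-root rule is easy to verify by simulation; in the sketch below (ours), the per-period drift ν and volatility σ are illustrative values:

```r
set.seed(55555)
nu <- 0.001; sigma <- 0.02          # illustrative per-period drift and volatility
k <- 25; nrep <- 50000
z <- matrix(rnorm(nrep * k, mean = nu, sd = sigma), nrep, k)
incr <- rowSums(z)                  # increment of log-price over k periods
c(mean(incr), k * nu)               # drift scales linearly: both near 0.025
c(sd(incr), sigma * sqrt(k))        # volatility scales with sqrt(k): both near 0.1
```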
Our aim is to build a continuous-time model, so let us consider first what we typically do in a deterministic case to accomplish the task. The familiar approach is to take the limit of a difference equation, for δt → 0, and build a differential equation. Informally, if there were no uncertainty, we would recast what we have seen in discrete time as
Taking the limit as δt → 0, we find a relationship between differentials:
Integrating both differentials over the interval [0, t] yields
In the deterministic case, as we remarked earlier in this book, it is customary to write the differential equation as
or, equivalently, as
where we have used standard calculus, in particular the chain rule of derivatives, to rewrite the differential
To make the model stochastic (and more realistic), we have to include a noise term, which entails a few important changes. First, we should write the equation in the form
where dW(t) can be considered as the increment of a stochastic process over the interval [t, t + dt]. Equation (3.75) is an example of a rather tricky object, called a stochastic differential equation. It is reasonable to guess that the solution of a stochastic differential equation is a stochastic process, rather than a deterministic function of time. This topic is quite difficult to deal with rigorously, as it requires some background in measure theory and stochastic calculus, but we may grasp the essentials by relying on intuition and heuristic arguments.
The first thing we need is to investigate which type of continuous-time stochastic process W(t) we can use as a building block. In the next section we introduce such a process, called the Wiener process, which plays more or less the same role as the process zt above. It turns out that this process is not differentiable, whatever this may mean for a stochastic process; informally, we will see that its sample paths are quite jagged indeed. Hence, we cannot write the stochastic differential equation as
Actually, a stochastic differential equation must be interpreted as a shorthand for an integral equation much like (3.73), involving the increments of a stochastic process. This calls for the definition of a stochastic integral and the related stochastic calculus. A consequence of the definition of the stochastic integral is that working with differentials as in Eq. (3.74) is not possible. We need a way to generalize the chain rule for differentials from the deterministic to the stochastic case. This leads to a fundamental tool of stochastic calculus called Itô’s lemma.
In the discrete-time model, we have assumed normally distributed increments in logarithmic prices, and we have also seen that the expected value of the increment of the logarithm of price scales linearly with time, whereas standard deviation scales with the square root of time.
In discrete time, we could consider the following process as a building block:
Wt = Wt−1 + √δt·εt,   (3.76)
where εt is a sequence of independent standard normal variables. We see that, for k > j,
Wk − Wj = √δt·(εj+1 + εj+2 + ⋯ + εk),
which implies that
Wk − Wj ~ N(0, (k − j)·δt).
In order to move to continuous time, we may define the standard Wiener process as a continuous-time stochastic process characterized by the following properties:

1. W(0) = 0.
2. The increment W(t) − W(s), for s < t, is normally distributed: W(t) − W(s) ~ N(0, t − s).
3. Increments over disjoint time intervals are independent random variables.
4. Sample paths are continuous functions of t.
Note how the second condition makes sure that standard deviation scales with the square root of time, like in the discrete-time model. To see the importance of the independent increments assumption, let us consider the sample paths of a process defined as Q(t) = ε√t, with ε ~ N(0, 1), which are shown in Fig. 3.37. This is a “degenerate” stochastic process,34 since knowledge of just one point on a sample path yields the complete knowledge of the whole sample path, which makes the process quite predictable. However, if we only look at the marginal distributions of Q(t), this process just looks like the Wiener process, since Q(t) ~ N(0, t) for any t.
FIGURE 3.37 Sample paths of a “degenerate” stochastic process.
It is lack of independence that makes the difference, as we may also check empirically. We may recast Eq. (3.76) as
W(t + δt) = W(t) + εt+δt·√δt,   (3.77)
where the notation εt+δt, rather than εt, may be used to emphasize the fact that this standard normal shock conveys new information. This discretization immediately allows us to write a piece of R code to generate sample paths of the standard Wiener process, which is illustrated in Fig. 3.38 and produces the plot in Fig. 3.39, which is quite different from Fig. 3.37, even though the marginals are the same.
FIGURE 3.38 Simple R function and script to sample the standard Wiener process.
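For readers without access to the figure, a minimal sketch along the same lines is given below; the function name and its arguments are ours, not necessarily those used in Fig. 3.38:

```r
simWiener <- function(T, nsteps) {
  dt <- T / nsteps
  dW <- sqrt(dt) * rnorm(nsteps)   # independent increments, each N(0, dt)
  c(0, cumsum(dW))                 # W(0) = 0; cumulate the increments
}
set.seed(55555)
W <- simWiener(T = 1, nsteps = 1000)
plot(seq(0, 1, length.out = 1001), W, type = "l", xlab = "t", ylab = "W(t)")
```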
From Fig. 3.39, we observe that sample paths of the Wiener process look continuous, but not differentiable. This may be stated precisely, but not very easily. Introducing continuity and differentiability rigorously calls for specifying a precise concept of stochastic convergence, as we should say that the Wiener process is nowhere differentiable with probability 1. To get an intuitive feeling for this fact, let us consider the increment ratio
Given the defining properties of the Wiener process, it is easy to see that
If we take the limit for δt → 0, this variance goes to infinity. Strictly speaking, this is no proof of nondifferentiability of W(t), but it does suggest that there is some trouble in using an object like dW(t)/dt; indeed, you will never see a notation like this. We only use the differential dW(t) of the Wiener process. Informally, we may think of dW(t) as a random variable with distribution N(0, dt). Actually, we should think of this differential as an increment, which may be integrated as follows:
This looks reasonable, doesn’t it? We may even go further and use W(t) as the building block of stochastic differential equations. For instance, given real numbers a and b, we may imagine a stochastic process X(t) satisfying the equation
dX(t) = a dt + b dW(t).   (3.78)
This equation defines a generalized Wiener process and may be solved by straightforward integration over the time interval [0, t]:
X(t) − X(0) = a·t + b·[W(t) − W(0)],
which may be rewritten as
X(t) = X(0) + a·t + b·W(t).
Hence, for the generalized Wiener process we find
E[X(t)] = X(0) + a·t,  Var[X(t)] = b²·t.
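These two moments are easily checked by Monte Carlo sampling, exploiting the fact that W(t) ~ N(0, t); the parameter values in this sketch (ours) are arbitrary:

```r
set.seed(55555)
a <- 0.3; b <- 0.5; X0 <- 10; t <- 2; nrep <- 100000
X <- X0 + a * t + b * sqrt(t) * rnorm(nrep)   # samples of X(t), since W(t) ~ N(0, t)
c(mean(X), X0 + a * t)    # expected value: both near 10.6
c(var(X), b^2 * t)        # variance: both near 0.5
```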
However, if we consider something more complicated, like
dX(t) = a(X(t), t) dt + b(X(t), t) dW(t),   (3.79)
things are not that intuitive. A process satisfying an equation like (3.79) is called an Itô process. We could argue that the solution should be something like
X(t) = X(0) + ∫₀ᵗ a(X(s), s) ds + ∫₀ᵗ b(X(s), s) dW(s).   (3.80)
Here the first integral looks like a standard Riemann integral of a function over time, but what about the second one? We need to assign a precise meaning to it, and this leads to the definition of a stochastic integral.
In a stochastic differential equation defining a process X(t), where a Wiener process W(t) is the driving factor, we may assume that the value X(t) depends only on the history of W(t) over the time interval from 0 to t. Technically speaking, we say that process X(t) is adapted to process W(t). Now let us consider a stochastic integral like
∫₀ᵀ X(t) dW(t).   (3.81)
How can we assign a meaning to this expression? To begin with, it is reasonable to guess that a stochastic integral is a random variable. If we integrate a deterministic function of time we get a number; so, it is natural to guess that, by integrating a stochastic process over time, we should get a random variable. Furthermore, the stochastic integral above looks related to the sample paths of process W(t), and an approximation could be obtained by partitioning the integration interval into small subintervals defined by points 0 = t0 < t1 < t2 < ⋯ < tn = T and considering the sum
Σk=0,…,n−1 X(tk)·[W(tk+1) − W(tk)].   (3.82)
If you recall how the Riemann integral was introduced in your first calculus course, as the limit of sums over partitions, the above definition does not look too different. Yet, there is a subtle and crucial point. It is very important to notice how we have chosen the time instants in the expression above: X(tk) is taken at the left extreme of the interval (tk, tk+1). This is just one possible choice: Why not take the value of X at the midpoint? The choice has an important effect: X(tk) is a random variable that is independent of the increment W(tk+1) − W(tk) by which it is multiplied. This choice is what makes the stochastic integral in the sense of Itô a bit peculiar, and it may be motivated as follows.
Consider a set of n assets, whose prices are modeled by stochastic processes Si(t),i = 1, …, n, which are described by stochastic differential equations like (3.79), and assume that we have a portfolio strategy represented by functions hi(t). These functions represent the number of stock shares of each asset i that we hold at time t. But which functions make sense? An obvious requirement is that functions hi(·) should not be anticipative: hi(t) may depend on all the history so far, over the interval [0, t], but clairvoyance should be ruled out. Furthermore, we should think of hi(t) as the number of shares we hold over a time interval of the form [t, t + dt).
Now, assume that we have an initial wealth that we invest in the portfolio, whose initial value, depending on the portfolio strategy represented by the functions h_i(·), is

V(0) = Σ_{i=1}^n h_i(0) S_i(0) = hᵀ(0) S(0),
where we have grouped h_i(t) and S_i(t) in column vectors. What about the dynamics of the portfolio value? If the portfolio is self-financing, i.e., we can trade assets but we do not invest (nor withdraw) any more cash after t = 0, it can be shown that the portfolio value will satisfy the equation

dV(t) = Σ_{i=1}^n h_i(t) dS_i(t) = hᵀ(t) dS(t).
This looks fairly intuitive and convincing, but some careful analysis is needed to prove it (see, e.g., [1, Chapter 6]). Then, we may reasonably guess that the wealth at time t = T will be

V(T) = V(0) + ∫_0^T hᵀ(t) dS(t).
However, it is fundamental to interpret the stochastic integral as the limit of an approximation like (3.82), i.e.,

V(T) ≈ V(0) + Σ_{k=0}^{n−1} hᵀ(t_k)[S(t_{k+1}) − S(t_k)].
The number of stock shares we hold at time t_k does not depend on the future prices S(t_{k+1}). First we allocate wealth at time t_k, and then we observe the return over the time interval (t_k, t_{k+1}). This makes financial sense, and it is why Itô stochastic integrals are defined the way they are.
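The discrete self-financing recursion can be checked with a toy R script; the price path and the strategy below are made up for illustration, but they show that, for a non-anticipative buy-and-hold strategy in a single asset, accumulating the gains h(t_k)[S(t_{k+1}) − S(t_k)] recovers the terminal wealth exactly.

```r
# Toy sketch: discrete self-financing wealth for a single asset
# (price path and strategy are illustrative, not from any model in particular)
set.seed(5)
n <- 250
S <- 100 * cumprod(c(1, 1 + rnorm(n, sd = 0.01)))  # toy price path, n+1 points
h <- rep(1, n)               # buy and hold one share: clearly non-anticipative
V0 <- h[1] * S[1]            # initial wealth
VT <- V0 + sum(h * diff(S))  # V(0) plus accumulated trading gains
c(VT, S[n + 1])              # equal: holding one share, wealth tracks the price
```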
Now, if we take approximation (3.82) and consider finer and finer partitions of the interval [0, T], letting n → ∞, what do we obtain? The answer is technically involved: We must select a precise concept of stochastic convergence and check that everything makes sense. Using mean square convergence, it can be shown that the limit is indeed well defined and provides a rigorous definition of the stochastic integral in the sense of Itô.
The definition of the stochastic integral has some important consequences. To begin with, what is the expected value of the integral in Eq. (3.81)? This is a good starting question, because the stochastic integral is a random variable, and the first and foremost feature of a random variable is its expected value. We may get a clue by considering the approximation (3.82):

E[Σ_{k=0}^{n−1} X(t_k)(W(t_{k+1}) − W(t_k))] = Σ_{k=0}^{n−1} E[X(t_k)] · E[W(t_{k+1}) − W(t_k)] = 0,
where we have used the independence between X(t_k) and the increments of the Wiener process over the time interval (t_k, t_{k+1}), along with the fact that the expected value of these increments is zero. This shows that the integral of Eq. (3.81) is a random variable with expected value zero, but can we say something more? The definition of the stochastic integral does not yield a practical way to compute it. We may try, however, to consider a specific case to get some intuition. The following example illustrates one nasty consequence of the way in which we have defined the stochastic integral.
Say that we want to “compute” the stochastic integral

∫_0^T W(t) dW(t).
Analogy with ordinary calculus would suggest using the chain rule of differentiation of composite functions to obtain a differential which can be integrated directly. Specifically, we might guess that

d(W²(t)/2) = W(t) dW(t).
This in turn would suggest that

∫_0^T W(t) dW(t) = W²(T)/2.
Unfortunately, this cannot be the correct answer, as it contradicts our previous findings. We have just seen that the expected value of an integral of this kind is zero, but:

E[W²(T)/2] = Var[W(T)]/2 = T/2 ≠ 0.
We see that the two expected values do not match at all, and there must be something wrong somewhere.
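A quick Monte Carlo experiment in R makes the mismatch tangible; the following sketch averages the left-endpoint approximation of the integral over many replications and compares it with the average of the naive guess W²(T)/2.

```r
# Sketch: compare E of the Ito approximation of the integral of W dW
# with E[W(T)^2/2], for T = 1
set.seed(55555)
T <- 1; n <- 200; dt <- T / n; nRepl <- 5000
itoSums <- wHalfSq <- numeric(nRepl)
for (k in 1:nRepl) {
  dW <- rnorm(n, sd = sqrt(dt))    # Wiener increments
  W <- c(0, cumsum(dW))            # sampled path
  itoSums[k] <- sum(W[1:n] * dW)   # left-endpoint (Ito) sum
  wHalfSq[k] <- W[n + 1]^2 / 2     # naive chain-rule guess
}
c(mean(itoSums), mean(wHalfSq))    # near 0 and near T/2 = 0.5, respectively
```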
Example 3.17 shows that the chain rule does not work in Itô stochastic calculus. To proceed further, we need to find the right rule, and the answer is Itô’s lemma. As usual in this section, we cannot follow rigorous arguments, but we may give an informal, yet quite instructive argument (following [12, Chapter 13]). The argument is instructive as it provides us with valuable intuition explaining what went wrong in Example 3.17. Let us recall that an Itô process X(t) satisfies a stochastic differential equation such as

dX(t) = a(X(t), t) dt + b(X(t), t) dW(t),   (3.83)
which is in some sense the continuous limit of

δX = a(X, t) δt + b(X, t) ε(t) √δt,   (3.84)

where ε(t) ~ N(0, 1). What we actually need is a way to derive a stochastic differential equation for a function F(X, t) of X(t), as this plays the role of the derivative of a composite function, which is what the chain rule is used for.
Let us take a little step back in the realm of deterministic calculus, and consider what is the rationale behind Taylor’s expansion of a function G(x, y) of two variables. The key ingredient we need for our reasoning is the formula for the differential of such a function:

dG = ∂G/∂x dx + ∂G/∂y dy,
which indeed may be obtained from Taylor’s expansion:

G(x + δx, y + δy) − G(x, y) = ∂G/∂x δx + ∂G/∂y δy + higher-order terms,
when δx, δy → 0. Now we may apply this Taylor expansion to F(X, t), limiting it to the leading terms:

δF = ∂F/∂t δt + ∂F/∂X δX + ½ ∂²F/∂X² (δX)² + ⋯

In doing so, it is important to notice that the term √δt in Eq. (3.84) needs careful treatment when squared. In fact, we have something like

(δX)² = b²(X, t) ε²(t) δt + higher-order terms in δt,

which implies that the term in (δX)² cannot be neglected in the approximation. Since ε is a standard normal variable, we have E[ε²] = 1 and E[ε² δt] = δt. A delicate point is the following: It can be shown that, as δt tends to zero, the term ε² δt can be treated as nonstochastic, and it is equal to its expected value. An informal (far from rigorous) justification relies on the variance of this term:

Var[ε² δt] = δt² Var[ε²] = δt² (E[ε⁴] − E²[ε²]) = 2 δt²,

which, for δt → 0, can be neglected with respect to first-order terms. Here we have used the fact E[ε⁴] = 3, which can be checked by using moment generating functions35 or, cheating a bit, by recalling that the kurtosis of any normal, including a standard one, is 3. A useful way to remember this point is the formal rule

(dW)² = dt.   (3.85)
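The formal rule can be illustrated numerically: the sum of squared increments of a sampled Wiener path over [0, T], its quadratic variation, is close to T, with fluctuations that vanish as the grid is refined. The following R sketch is only an illustration of this fact.

```r
# Sketch: quadratic variation of a Wiener path over [0,T], illustrating
# the formal rule (dW)^2 = dt
set.seed(42)
T <- 2; n <- 1e5; dt <- T / n
dW <- rnorm(n, sd = sqrt(dt))   # increments, each with variance dt
qv <- sum(dW^2)                 # sum of (dW)^2 over the whole path
c(qv, T)                        # qv is close to T = 2
```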
Hence, when δt tends to zero, in the Taylor expansion we have

(δX)² → b²(X, t) δt.
Neglecting higher-order terms and taking the limit as both δX and δt tend to zero, we end up with

dF = ∂F/∂t dt + ∂F/∂X dX + ½ b² ∂²F/∂X² dt,
which, substituting for dX, becomes the celebrated Itô’s lemma:

dF = (∂F/∂t + a ∂F/∂X + ½ b² ∂²F/∂X²) dt + b ∂F/∂X dW.   (3.86)
Although this proof is far from rigorous, we see that all the trouble is due to the term of order √δt, which is introduced by the increment of the Wiener process.
It is instructive to see that if we set b(X, t) ≡ 0 in the Itô differential equation, i.e., if we eliminate randomness, then we step back to familiar ground. In such a case, the stochastic differential equation is actually deterministic and can be rewritten as

dx/dt = a(x, t),
where we use the notation x rather than X to point out the deterministic nature of function x(t). Then, Itô’s lemma boils down to

dF = (∂F/∂t + a ∂F/∂x) dt,   (3.87)
which can be rearranged to yield the following chain rule for a function F(x, t):

dF/dt = ∂F/∂t + ∂F/∂x · dx/dt.   (3.88)
In order to grasp Itô’s lemma, let us consider Example 3.17 once more.
In order to compute the stochastic integral of W(t) with respect to itself, we may apply Itô’s lemma to the case X(t) = W(t), by setting a(X, t) ≡ 0, b(X, t) ≡ 1, and F(X, t) = X²(t). Hence, we have

∂F/∂t = 0,   (3.89)
∂F/∂X = 2X,   (3.90)
∂²F/∂X² = 2.   (3.91)
It is important to point out that in Eq. (3.89) the partial derivative with respect to time is zero; it is true that F(X(t), t) depends on time through X(t), but here there is no direct dependence on t, so the partial derivative with respect to time vanishes. To put it another way, the dependence of X(t) on time plays no role when taking the partial derivative with respect to t, because X(t) is held fixed. You may also look back at Eq. (3.88) to see the relationship between the partial and the total derivative of F(X(t), t) with respect to t.
The application of Itô’s lemma yields

d(W²(t)) = dt + 2W(t) dW(t).   (3.92)
It is essential to notice that dt is exactly the term which we would not expect by applying the usual chain rule. But this is the term that allows us to get the correct expected value of W²(T). Indeed, by straightforward integration of Eq. (3.92) we now find

∫_0^T W(t) dW(t) = W²(T)/2 − T/2.
Taking the expected values yields

E[∫_0^T W(t) dW(t)] = E[W²(T)]/2 − T/2 = T/2 − T/2 = 0,
which is consistent with our findings in Example 3.17.
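The correction term can also be seen pathwise in a small R sketch: on any sampled grid, the algebraic identity W²(t_{k+1}) − W²(t_k) = 2W(t_k) δW + (δW)² makes W²(T) equal to twice the Itô sum plus the quadratic variation, and the latter is what converges to T.

```r
# Sketch: pathwise decomposition behind d(W^2) = 2W dW + dt
set.seed(2021)
T <- 1; n <- 1e4; dt <- T / n
dW <- rnorm(n, sd = sqrt(dt))
W <- c(0, cumsum(dW))
itoSum <- sum(W[1:n] * dW)   # left-endpoint approximation of the integral
quadVar <- sum(dW^2)         # quadratic variation, close to T
c(W[n + 1]^2, 2 * itoSum + quadVar)   # identical up to rounding
```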
Let us summarize our findings: With respect to Eq. (3.87), in Itô’s lemma we have an extra term in dW, which is expected given the form of the differential equation defining the stochastic process X(t), and an unexpected term

½ b²(X, t) ∂²F/∂X² dt.

In deterministic calculus, second-order derivatives occur in second-order terms linked to (δt)², which can be neglected; but here we have a term of order √δt which must be taken into account even when it is squared.
Itô’s lemma may be used to find the solution of stochastic differential equations, at least in relatively simple cases. A very important example is geometric Brownian motion, which is defined by a specific choice of the functions in the Itô stochastic differential equation:

dS(t) = μS(t) dt + σS(t) dW(t),
where μ and σ are constant parameters referred to as drift and volatility, respectively. Intuition would suggest rewriting the equation as

dS(t)/S(t) = μ dt + σ dW(t),
and then considering the differential d log S, which would be dS/S in deterministic calculus, to find the corresponding integral. However, we have seen that some extra care is needed when dealing with stochastic differential equations. Nevertheless, the above intuition deserves further investigation, so let us apply Itô’s lemma and find the stochastic differential equation for F(S, t) = log S(t). As a first step we compute the partial derivatives:

∂F/∂t = 0,   ∂F/∂S = 1/S,   ∂²F/∂S² = −1/S².
Then, putting all of them together we find

d log S(t) = (μ − σ²/2) dt + σ dW(t).
Now we see that our guess above was not that bad, as this equation may be easily integrated and yields

log S(t) = log S(0) + (μ − σ²/2) t + σ W(t).
Recalling that W(t) has a normal distribution and can be written as W(t) = √t ε, where ε ~ N(0, 1), we conclude that the logarithm of price is normally distributed:

log S(t) ~ N(log S(0) + (μ − σ²/2) t, σ² t).
We can rewrite the solution in terms of S(t):

S(t) = S(0) exp{(μ − σ²/2) t + σ W(t)},

or

S(t) = S(0) exp{(μ − σ²/2) t + σ √t ε}.
On the one hand, this shows that prices, according to the geometric Brownian motion model, are lognormally distributed. On the other hand, now we have a way to generate sample paths, based on a discretization of a prescribed time horizon in time periods of length δt, like we did with the standard Wiener process. A simple R function to do so is given in Fig. 3.40.36 This function, when given the drift and volatility selected in the script, yields the plot in Fig. 3.41.
FIGURE 3.40 Simple R function to sample geometric Brownian motion.
FIGURE 3.41 A sample path of geometric Brownian motion.
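For readers without access to the figure, a function along the lines of Fig. 3.40 may look as follows; the function name and arguments here are our own choice and need not match the book's, but the scheme exploits the exact lognormal solution, so there is no discretization error at the grid points.

```r
# Sketch in the spirit of Fig. 3.40 (function name and arguments are our own):
# exact sampling of geometric Brownian motion on a uniform time grid
simGBM <- function(S0, mu, sigma, T, nSteps) {
  dt <- T / nSteps
  nu <- mu - sigma^2 / 2                 # drift of the log-price
  logIncr <- nu * dt + sigma * sqrt(dt) * rnorm(nSteps)
  S0 * exp(c(0, cumsum(logIncr)))        # S(0), S(dt), ..., S(T)
}
set.seed(7)
path <- simGBM(S0 = 50, mu = 0.1, sigma = 0.3, T = 1, nSteps = 252)
plot(seq(0, 1, length.out = 253), path, type = "l",
     xlab = "t", ylab = "S(t)")          # a plot like Fig. 3.41
```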
By using properties of the lognormal distribution37 and letting ν = μ − σ²/2, so that log S(t) − log S(0) ~ N(νt, σ²t), we also find

E[S(t)] = S(0) e^{μt},   (3.94)
Var[S(t)] = S²(0) e^{2μt} (e^{σ²t} − 1),   (3.95)
from which we see that the drift parameter μ is linked to the continuously compounded return. The volatility parameter σ is related to the standard deviation of the increment of the logarithm of price. The roles of drift and volatility can also be grasped intuitively by considering the following approximation of the differential equation:

δS/S ≈ μ δt + σ ε √δt,
where δS/S is the return of the asset over a small time interval δt. According to this approximation, the return can be approximated by a normal variable with expected value μ δt and standard deviation σ√δt. Actually, this normal distribution is only a local approximation of the “true” (according to the model) lognormal distribution.
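Both the mean of the price and the standard deviation of the log-increment can be checked by sampling from the exact lognormal solution; the parameter values in the following R sketch are made up for illustration.

```r
# Sketch: Monte Carlo check of E[S(t)] = S(0)*exp(mu*t) and of the
# standard deviation of the log-price increment (illustrative parameters)
set.seed(123)
S0 <- 100; mu <- 0.08; sigma <- 0.25; t <- 1; nRepl <- 2e5
eps <- rnorm(nRepl)
St <- S0 * exp((mu - sigma^2 / 2) * t + sigma * sqrt(t) * eps)
c(mean(St), S0 * exp(mu * t))         # sample mean vs exact mean
c(sd(log(St / S0)), sigma * sqrt(t))  # log-increment sd vs sigma*sqrt(t)
```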
Geometric Brownian motion is not the only type of stochastic process relevant in finance, and the Wiener process is not the only relevant building block. One of the main features of these processes is the continuity of sample paths. However, discontinuities do occur sometimes, such as jumps in prices. In this case, different building blocks are used, such as the Poisson process, which counts the events occurring at a certain rate. We should also note that continuous sample paths do not make sense for certain state variables such as credit ratings. Another point is that the lognormal distribution, that we get from geometric Brownian motion, is a consequence of the normality associated with the Wiener process. Distributions with fatter tails are typically observed, questioning the validity of the simple models based on Gaussian diffusions.
Indeed, several alternative processes have been proposed and are commonly used in financial engineering. Here we just mention some possible generalizations, leaving further details to Chapter 6 on path generation and Chapter 11 on pricing derivatives.
Correlated Wiener processes. The need to model the dynamics of a set of asset prices arises when we deal with a portfolio, or when we want to price a rainbow option depending on multiple underlying assets. The simplest model that we may adopt is a direct generalization of geometric Brownian motion. According to this approach, the prices S_i(t) of assets i = 1, …, n satisfy the equation

dS_i(t) = μ_i S_i(t) dt + σ_i S_i(t) dW_i(t),
where the Wiener processes W_i(t) are not necessarily independent. They are characterized by a set of instantaneous correlation coefficients ρ_ij, whose meaning can be grasped by an extension of the formal rule of Eq. (3.85):

dW_i(t) dW_j(t) = ρ_ij dt.
From a Monte Carlo point of view, sample path generation of these processes requires the generation of increments

δW_i = ε_i √δt,

where ε_i and ε_j are standard normals with correlation ρ_ij.
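One standard way to obtain correlated standard normals from independent ones is through the Cholesky factor of the correlation matrix; the following R sketch illustrates the idea for two processes (the correlation value is arbitrary).

```r
# Sketch: correlated Wiener increments via the Cholesky factorization
set.seed(99)
rho <- 0.7; dt <- 1 / 252; n <- 1e5
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # correlation matrix
U <- chol(Sigma)                    # upper triangular, t(U) %*% U = Sigma
Z <- matrix(rnorm(2 * n), ncol = 2) %*% U     # rows: correlated N(0,1) pairs
dW <- sqrt(dt) * Z                  # correlated Wiener increments
rhoHat <- cor(dW[, 1], dW[, 2])     # close to rho
```

Note that `chol()` in base R returns the upper triangular factor U, so post-multiplying a matrix of independent standard normals by U yields rows with covariance matrix UᵀU = Sigma.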
Ornstein–Uhlenbeck processes. In geometric Brownian motion, we have an exponential trend on which a diffusion process is superimposed.38 This exponential growth makes no sense for several financial variables, which rather feature mean reversion. Then, we may resort to Ornstein–Uhlenbeck processes like

dX(t) = γ(X̄ − X(t)) dt + σ dW(t).

They are still Gaussian diffusions, but we observe that the drift can change sign, pulling the process back to a long-run average X̄, with speed γ. The application of this model to a short-term interest rate r(t) yields the Vasicek model:

dr(t) = γ(r̄ − r(t)) dt + σ dW(t).
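Mean reversion is easy to visualize by simulation; the following R sketch uses a simple Euler discretization of the Vasicek model, with made-up parameter values, starting the rate well above its long-run level.

```r
# Sketch: Euler discretization of the Vasicek model (illustrative parameters);
# the rate is pulled toward the long-run level rbar with speed gam
set.seed(11)
gam <- 2; rbar <- 0.04; sigma <- 0.01
T <- 10; n <- 1000; dt <- T / n
r <- numeric(n + 1); r[1] <- 0.10      # start well above the long-run level
for (k in 1:n)
  r[k + 1] <- r[k] + gam * (rbar - r[k]) * dt + sigma * sqrt(dt) * rnorm(1)
mean(tail(r, 200))                     # hovers around rbar = 0.04
```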
Square-root diffusions. One drawback of Ornstein–Uhlenbeck processes is that they do not prevent negative values, which make no sense for stock prices and interest rates. A possible adjustment is the following:

dr(t) = γ(r̄ − r(t)) dt + σ √r(t) dW(t).
This is an example of a square-root diffusion, which in the context of short-term interest rates is known as the Cox–Ingersoll–Ross model. For a suitable choice of parameters, it can be shown that a square-root diffusion stays non-negative. This changes the nature of the process, which is not Gaussian anymore, as we will see in Chapter 6. Similar considerations hold when modeling a stochastic and time-varying volatility σ(t). Geometric Brownian motion assumes constant volatility, whereas in practice we may observe time periods in which volatility is higher than usual. In a discrete-time setting, ARCH and GARCH models may be used. In continuous time, one possible model for stochastic volatility is the Heston model, which consists of a pair of stochastic differential equations:

dS(t) = μ S(t) dt + √V(t) S(t) dW₁(t),
dV(t) = γ(V̄ − V(t)) dt + σ_V √V(t) dW₂(t),
where V(t) = σ²(t), V̄ is a long-term value, and different assumptions can be made on the correlation of the two driving Wiener processes.
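As a concrete illustration, here is a simple Euler-type R sketch for a Heston-style model using "full truncation" (V is replaced by max(V, 0) inside drift and diffusion), a common fix since a plain Euler scheme may drive the variance negative; all parameter values are made up.

```r
# Sketch: full-truncation Euler scheme for a Heston-style model
# (all parameter values are illustrative)
set.seed(31)
S0 <- 100; V0 <- 0.04; mu <- 0.05
gamV <- 1.5; Vbar <- 0.04; sigV <- 0.3; rho <- -0.7
T <- 1; n <- 1000; dt <- T / n
S <- numeric(n + 1); V <- numeric(n + 1); S[1] <- S0; V[1] <- V0
for (k in 1:n) {
  z1 <- rnorm(1)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(1)  # correlated driving normals
  Vp <- max(V[k], 0)                           # full truncation
  S[k + 1] <- S[k] + mu * S[k] * dt + sqrt(Vp) * S[k] * sqrt(dt) * z1
  V[k + 1] <- V[k] + gamV * (Vbar - Vp) * dt + sigV * sqrt(Vp) * sqrt(dt) * z2
}
c(S[n + 1], V[n + 1])
```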
Jump–diffusions. In order to account for jumps, we may devise processes with both a diffusion and a jump component, such as

dX(t) = a(X(t), t) dt + b(X(t), t) dW(t) + dY(t),
where Y(t) is a compound Poisson process. From a formal point of view, we require the definition of the stochastic integral of a stochastic process Y(t) with respect to a Poisson process N(t):

∫_0^t Y(s) dN(s) = Σ_{i=1}^{N(t)} Y(t_i),

where N(t) is the number of jumps that have occurred up to time t in the Poisson process, and t_i, i = 1, …, N(t), are the time instants at which the jumps occur. We note that the Wiener and the Poisson process, which are the basic building blocks of jump–diffusions, are quite different, but they do have an important common feature: stationary and independent increments. In fact, the class of Lévy processes generalizes both of them by emphasizing this feature.
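A sample path of a simple jump–diffusion can be generated on a fixed grid by superimposing Poisson-distributed jump counts with normal jump sizes on a Gaussian diffusion; the following R sketch uses made-up parameter values purely for illustration.

```r
# Sketch: sample path of a simple jump-diffusion on a fixed grid;
# jump counts are Poisson, jump sizes normal (illustrative parameters)
set.seed(77)
T <- 1; n <- 1000; dt <- T / n
mu <- 0.1; sigma <- 0.2          # diffusion parameters
lambda <- 3                      # jump intensity (expected jumps per unit time)
muJ <- -0.05; sigmaJ <- 0.1      # jump size distribution
nJumps <- rpois(n, lambda * dt)  # jumps occurring in each small interval
jumpPart <- sapply(nJumps, function(m) sum(rnorm(m, muJ, sigmaJ)))
diffPart <- mu * dt + sigma * sqrt(dt) * rnorm(n)
X <- c(0, cumsum(diffPart + jumpPart))   # X(0) = 0
plot(seq(0, T, by = dt), X, type = "l", xlab = "t", ylab = "X(t)")
```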