This chapter is included for the sake of readers who are not familiar with econometrics and the modeling techniques that are commonly used in finance and financial economics. Needless to say, the following treatment is not meant as a substitute for one of the several excellent books on the subject. Our limited aims are:
Here we only deal with modeling; issues related to estimation are deferred to Chapter 4, which has similar objectives and limitations. In modeling, one typically works with parameters, expected values, variances, etc. In estimation, one typically works with their sample counterparts, like sample mean and sample variance. However, since it is often useful to illustrate ideas by simple but concrete examples, we will occasionally use sample counterparts of probabilistic concepts. In doing so, we will rely on readers’ intuition and, possibly, some background in elementary inferential statistics, as well as the availability of easy to use R functions; more advanced concepts and issues are dealt with in the next chapter.
Within this framework, the choice of topics is admittedly open to criticism. We cover some rather basic issues related to probability distributions, together with some more advanced ones, such as copulas and stochastic differential equations. However, the coverage of some standard tools in econometrics, such as multiple linear regression models, will leave much to be desired. Our choice is deliberate and related to what one needs in order to appreciate Monte Carlo techniques.
In Section 3.1 we use elementary examples to motivate our study and to further illustrate the differences between static, discrete-time, and continuous-time models. Even these deceptively simple examples raise a number of issues that may be addressed by rather sophisticated tools. Section 3.2 deals with rather basic topics, as we outline a few essential (univariate) probability distributions that are commonly used in finance. Probably, many readers are quite familiar with the content of this section, which may be skipped and referred back when needed. In order to make the section as useful as possible, we take advantage of it by introducing some R functions to deal with probability distributions, including random sample generation; the technicalities involved in the generation of pseudorandom samples from these distributions is deferred to Chapter 5. We move on to deal with multivariate distributions and dependence in Sections 3.3 and 3.4. In the former section we outline multivariate distributions, including a brief review of standard concepts related to covariance and Pearson’s correlation. We also cover multivariate extensions of normal and t distributions, but the main aim of the section is to point out the limitations of these standard approaches, paving the way for more advanced ideas revolving around copula theory, which is the subject of Section 3.4. This section is a bit more advanced and also covers alternative correlation measures, such as Spearman’s rho and Kendall’s tau; we also outline such concepts as tail dependence and independence, which are relevant for risk management. Section 3.5 is a bit of an interlude, where we cover linear regression models within a probabilistic framework. 
Usually, linear regression is associated with the statistical estimation of linear models;1 here we insist on the probabilistic view of regression as an orthogonal projection, laying down the foundations for variance reduction by control variates2 and model complexity reduction by factor models. Regression models are related and contrasted with time series models, which are described in Section 3.6, within the framework of discrete-time models. We should note that a great deal of work has been carried out on representing, estimating, and simulating time series models in R, which also includes a specific time series object. Continuous-time models, based on stochastic differential equations, are introduced in Section 3.7. These models are most useful in financial engineering applications, but they require some background in stochastic calculus and stochastic integration. Here we introduce the basic concepts, most notably Itô’s lemma, without paying too much attention to mathematical rigor. In Section 3.8 we outline two approaches that can be used to streamline a model, namely, principal component analysis and factor models. These multivariate analysis tools may be useful from both a model estimation and a simulation perspective, since they reduce data and computational requirements. Finally, in Section 3.9 we rely on Itô’s lemma to develop the powerful tools of risk-neutral option pricing. This section may obviously be skipped by readers already familiar with the topic, as well as by readers who are not interested in option pricing.
In Chapter 1 we considered the difference between Monte Carlo sampling and simulation. In the former case, we are essentially approximating an integral, which is related to an expectation; in the latter case, we have to simulate dynamics over time. We have also seen that, in principle, any dynamic simulation can be considered as the estimation of the integral of a possibly quite complicated function. Hence, the line between sampling and simulation is not so sharp at all, from a Monte Carlo viewpoint. From a modeling viewpoint, however, the issues involved may be quite different. In this section we outline three related financial modeling problems:
The best known static portfolio optimization model is certainly the mean–variance model of Markowitz. In the basic version of this model, time is disregarded and issues related to rebalancing are neglected. To build the model, we need to characterize the joint distribution of asset returns over the investment time horizon, which involves subtle issues, particularly from a risk management perspective.
More realistic models, possibly involving asset and liabilities, transaction costs, etc., require the introduction of time. We may accomplish the task within a discrete- or a continuous-time framework. The former approach typically requires the machinery of time series analysis, whereas the latter one relies on stochastic calculus concepts. When time enters the stage, we may face the task of representing intertemporal dependence. Actually, if we believe the efficient market hypothesis (EMH), we may avoid the task. EMH can be stated in different, more or less stringent forms. The bottom line is that it is impossible to beat the market by stock picking or buy/sell timing. From our point of view, the essential implication is that there is no intertemporal dependence in returns over different time periods. On the contrary, if we have a different view or we are modeling something else and the EMH is not relevant, we are faced with the task of modeling some form of dependence over time.
In econometrics, to categorize the issues a modeler faces when dealing with dependence, the following model classification is used:
One of the main tasks of this chapter is to illustrate a few of these models, pointing out the related issues.
To build a single-period portfolio optimization problem, we need the following ingredients:
In principle, the microeconomics of uncertainty offers a way to represent investor’s preferences, in the form of a utility function u(·) mapping a monetary amount into a real number. Utility theory posits that we should maximize the expected utility of wealth WT, a random variable, at the end of the holding period T. As we did in Example 1.1, we denote the initial asset price by Pi0 and the number of shares of asset i held in the portfolio by hi; for the sake of simplicity, here we assume that the number of shares is large enough to be approximated by a continuous variable, so that we avoid integrality restrictions. If we neglect transaction costs and tax issues, the problem can be written as
(3.1)
where PiT is the asset price at the end of the holding period. This model looks easy enough, but there are two serious difficulties:
Computing the expectation is a technical problem, which may be tackled by the methods described in the rest of this book. However, defining a utility function is a much more serious issue. In fact, despite their popularity in most economic literature, utility functions are not easily applicable to real-life financial optimization. On the one hand, even a sophisticated investor would find it hard to assess her utility function; on the other hand, which utility function should a professional fund manager use, when reporting to her many clients? Should she consider her own attitude toward risk or that of her wealthiest client? A common approach is to resort to a risk measure, related to uncertainty in future wealth. We will have more to say about risk measures later, but since they should capture uncertainty, a seemingly natural measure is standard deviation. Furthermore, it may also be convenient to get rid of wealth W0 and deal with a dimensionless model in terms of portfolio weights, i.e., the fractions of total wealth allocated to each asset, and asset returns.3 To be more specific, if the assets we are considering are stock shares, we should also consider any dividend that is paid within the planning horizon. If asset i pays a dividend Di during the holding period, its (rate of) return can be written as
Note that the return can be negative, but it cannot be less than −1, since stock shares are limited liability assets and the worst price that we may observe in the future is PiT = 0. To be precise, this definition does not account for the time value of money; exactly when the dividend is paid does make a difference if the time period is long enough and we consider the possibility of reinvesting the dividend. For the sake of simplicity, let us neglect this issue.
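As a quick numerical check of this definition (the price and dividend figures below are made up for illustration):

```r
# Hypothetical data for one asset over the holding period
P0 <- 100   # initial price Pi0
PT <- 104   # terminal price PiT
D  <- 2     # dividend paid during the period

# Rate of return, including the dividend
R <- (PT + D - P0) / P0
R   # 0.06, i.e., a 6% return
```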
Now, what about the joint distribution of returns? If we assume that the investor only cares about expected value and standard deviation of portfolio return, we may streamline our task considerably. Let:
If we denote the wealth fraction allocated to asset i by wi, so that Σi wi = 1, we immediately find that the portfolio return Rp is
with expected value
where w is the column vector collecting portfolio weights and the superscript T denotes transposition. The variance of portfolio return is given by
In the classical Markowitz portfolio optimization model, a mean–variance efficient portfolio is found by minimizing risk subject to a constraint on expected portfolio return:
where αT is a minimal target for expected return and e = [1, 1,…, 1]; non-negativity constraints on portfolio weights may be added if short sales are forbidden. Note that we minimize variance, rather than standard deviation, for the sake of computational convenience. Doing so, the solution of the optimization problem does not change, but we have to solve a (convex) quadratic programming problem, for which extremely efficient algorithms are widely available.
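As a minimal illustration in base R, we may compute the global minimum-variance portfolio, in which the expected return target is dropped and only the budget constraint is kept; in this special case the solution is available in closed form as Σ⁻¹e, suitably normalized, so no quadratic programming solver is needed (the covariance data below are made up):

```r
# Made-up covariance matrix for three assets
Sigma <- matrix(c(0.04, 0.01, 0.00,
                  0.01, 0.09, 0.02,
                  0.00, 0.02, 0.16), nrow=3, byrow=TRUE)
e <- rep(1, 3)

# Closed-form global minimum-variance weights: Sigma^{-1} e, normalized
w <- solve(Sigma, e) / as.numeric(t(e) %*% solve(Sigma, e))
sum(w)                                        # budget constraint: weights add up to 1
portVar <- as.numeric(t(w) %*% Sigma %*% w)   # resulting portfolio variance
```

Solving the full model, with the expected return constraint and possibly non-negativity restrictions, requires a quadratic programming solver, such as the one provided by the quadprog package.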
In fact, this new model looks remarkably simple, and there is no need for Monte Carlo methods for its solution. Furthermore, there is no need to specify a full-fledged joint return distribution, as first- and second-order moments are all we need. However, severe difficulties have just been swept under the rug. To begin with, how can we estimate the problem data? The covariance matrix consists of many entries, and there is no hope of estimating them all in a reliable way. To see why, assume that we have to deal with n = 1000 assets. Then, the covariance matrix consists of one million entries. Actually, given symmetry, “only”

n(n + 1)/2 = 500,500

entries need be specified. This is a formidable task, as we would need a very long history of returns to obtain reliable estimates. Unfortunately, even if we collect many years of data, assuming that they are available for all of the assets that we are considering for inclusion in our portfolio, few of them are actually relevant, since market conditions have changed over the years. This is why we need to resort to some data reduction strategies, as described in Section 3.8.2.
Even if we assume that we have perfect knowledge of the involved parameters, tough issues are to be considered:
We see that even specifying a single-period, static model is not trivial at all. No approach based on Monte Carlo sampling will yield useful results, if the underlying uncertainty model is not credible.
In many applications a static and single-period model is inappropriate, as we need to represent a sequence of decisions to be made in the face of possibly complex dynamics of stochastic risk factors. Here we illustrate a discrete-time framework for a stylized model extending the consumption–saving problem of Section 1.4.3. The task is to choose the fraction of current wealth that is consumed, and how the remaining amount saved is allocated between a risky and a risk-free asset.6 For strategic decision problems like this one, an appropriate time bucket could be one year; in other cases much higher frequencies are needed. Whatever time period we choose, in any discrete-time model it is essential to specify the sequence of events very clearly, as this defines the information that is actually available when making a decision. We consider time instants and time buckets (or periods):
The rules of the game in our problem are:
Note that we are considering two sources of uncertainty here:
The first source of uncertainty can be represented using the concepts from the previous section, if we do not want to consider any predictability in asset returns. However, it is likely that the labor income Lt is not quite independent of Lt−1; therefore, we must choose some way to represent both predictable and unpredictable (shock) components, as we show below.
Given a decision αt, the rate of return of the investment portfolio over the next time bucket is
where Rt+1 − rf is typically referred to as excess return. Therefore, wealth at time instant t + 1 is
(3.2)
We emphasize again that the risky asset return Rt+1 is not known when making the asset allocation decision, whereas the risk-free rate is known. Actually the risk-free return should also be modeled as a stochastic process, as the level of interest rates does change over time because of changing economic conditions and the corresponding monetary policy of central banks. Hence, we should denote this rate as rft. The main difference with respect to the risky return Rt+1 is that, at the beginning of a time bucket, we do know the risk-free rate for that time bucket; however, we do not know the interest rates for the subsequent time buckets. So, rft is a predictable stochastic process. For the sake of simplicity, let us assume a constant interest rate rf. We might also wish to model correlations between the two returns. After all, we know that when central banks change the level of interest rates, stock exchanges are indeed affected.
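To make the sequence of events concrete, the following sketch simulates a wealth path in the spirit of Eq. (3.2), under simplifying assumptions of our own: a constant allocation α, a constant risk-free rate, a fixed fraction of wealth consumed at each time instant, and i.i.d. normal risky returns (all numerical values are made up):

```r
set.seed(55555)
T <- 10                  # number of time buckets
rf <- 0.03               # constant risk-free rate
alpha <- 0.6             # fraction of savings allocated to the risky asset
consFrac <- 0.05         # fraction of wealth consumed at each time instant
W <- numeric(T + 1)
W[1] <- 100              # initial wealth
R <- rnorm(T, mean = 0.07, sd = 0.2)   # i.i.d. risky returns
for (t in 1:T) {
  S <- (1 - consFrac) * W[t]           # amount saved after consumption
  # the saved amount earns the portfolio return rf + alpha*(R - rf)
  W[t + 1] <- S * (1 + rf + alpha * (R[t] - rf))
}
W
```

Note that the returns are sampled before the loop only for convenience; in the model, R[t] is revealed after the decision at time t is made, and the code merely evaluates a fixed policy along one scenario.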
To model labor income, we really must be more specific and decompose Lt in components. A model proposed, e.g., in [4, Chapter 7], is
where
The difference between the two random components is that the transitory random term should have an impact in one year, and then it should leave no effect, whereas permanent shocks have to be cumulated, or integrated, if you prefer, over time. This can be represented as a random walk:
where, for instance, ηt is a normal random variable with expected value zero and variance ση². By a similar token, one could assume εt ~ N(0, σε²). Note that the expected value of a shock is zero, since any predictable component should be included in the deterministic component. In simple models, ηt and εt are assumed independent over time.
Now, we should see a problem with an additive decomposition like the one in Eq. (3.3). If shocks are normally distributed, in principle, income could become negative. Furthermore, using the same volatilities (standard deviations) for all levels of income is questionable. One way to overcome these issues is to resort to a multiplicative decomposition. We might just reinterpret Eq. (3.3) in terms of log-income, defined as lt ≡ log Lt.7 In other words, we posit
(3.4)
Since we take an exponential of a normal, a negative shock will be transformed into a multiplicative shock less than 1, i.e., a reduction in income. We will also see, in Section 3.2.3, that when we take the exponential of a normal random variable, we obtain a lognormal random variable.
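The following sketch, with made-up parameter values, simulates the multiplicative model: shocks are generated on the log scale, so that income, obtained by exponentiation, stays positive:

```r
set.seed(55555)
T <- 40
eta <- rnorm(T, mean = 0, sd = 0.05)   # permanent shocks
eps <- rnorm(T, mean = 0, sd = 0.10)   # transitory shocks
trend <- log(30000) + 0.02 * (1:T)     # deterministic component of log-income
z <- cumsum(eta)                       # random walk integrating permanent shocks
l <- trend + z + eps                   # log-income
L <- exp(l)                            # income: always positive
all(L > 0)
```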
Finally, if we associate a utility function u(·) with consumption, the overall objective could be
where β ∈ (0, 1) is a subjective discount factor. Thus, we face a dynamic, stochastic optimization problem, which cannot be solved analytically in general; Monte Carlo and related methods play a key role in the solution of these models, as well as in the simulation of decisions over time to analyze the resulting policy. In Section 3.6 we will analyze a few more discrete-time models that can be used to represent dependency over time, changes in volatility, etc.
Time discretization may look like an approximation of an intrinsically continuous time. This need not be the case, actually: If decisions are made periodically, then a discretized time bucket corresponding to decision periods is all we need. In other cases, higher frequencies must be accounted for. We may then take smaller and smaller time periods in order to approximate continuous time. Needless to say, this increases the computational burden considerably and a trade-off must be assessed between modeling needs and computational tractability. A possibly surprising fact is that if we take the process to the limit and consider continuous-time models, we may actually ease many issues and come up with tractable models. Furthermore, when we consider trading on exchanges and the actual flow of information, we have to accept the fact that a more faithful representation does require continuous time. This can help in building a theoretically sound framework, even if numerical methods, most notably Monte Carlo methods, are to be employed when models cannot be solved analytically. Indeed, continuous-time models rely on differential equations, which may look tougher than discrete-time difference equations, especially since we have to cope with stochastic differential equations. Stochastic models are in fact needed to represent uncertainty in stock prices, interest rates, foreign exchange rates, etc. Nevertheless, it seems wise to introduce the concept in the simplest deterministic case of some practical relevance, the investment in a safe bank account. We follow here the same path that we have followed in Section 1.3.2, providing some more motivation and details.
Consider a risk-free bank account. If we invest an initial amount B(0) at time t = 0, the annual interest rate is rf, and interest is earned once per year, our wealth after one year will be
However, if semiannual interest rf/2 is paid twice a year and is immediately reinvested, we get
This new amount, if the rate rf is the same, is a bit larger than the value in Eq. (3.5), since we earn interest on interest. In such a case, we speak of semiannual compounding. More generally, if interest is paid m times per year, we have
and it is easy to see that if m goes to infinity, we end up with the limit
Such a continuously compounded interest does not really exist in financial markets, even though daily compounding is essentially the same as continuous-time compounding. Nevertheless, it simplifies considerably several financial calculations and it paves the way for continuous-time models.
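The convergence of compound interest to the continuous limit is easy to check numerically:

```r
rf <- 0.05                             # annual interest rate
m <- c(1, 2, 4, 12, 52, 365)           # compounding frequencies per year
growth <- (1 + rf/m)^m                 # one-year growth factors
cbind(m, growth)
exp(rf)                                # continuous-compounding limit
```

With daily compounding (m = 365), the growth factor already matches exp(rf) to about five decimal digits.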
It is useful to describe all of the above using a dynamic model. So, let us assume that interest is continuously earned and instantaneously reinvested. Over a short time interval (t, t + δt), we have the following change in the bank account:
If we let δt → 0, we obtain the differential equation
Many readers may be more familiar with the form
There are a couple of reasons why we might prefer the differential form of Eq. (3.7). First, a slight manipulation shows a way to solve it. If we rewrite the equation as
and we recall the form of the derivative of a composite function, we may recognize that a logarithm is involved:
Now we may integrate the differentials over the time interval [0, T]:
which can be easily rewritten as
which is consistent with Eq. (3.6).
There is another reason why the form (3.7) is preferred to (3.8): It can be extended to cope with stochastic factors. Suppose that we are investing in a risky asset, rather than a risk-free bank account. We need a model for the asset price, accounting for both drift (related to expected return) and volatility (related to random variability in the price). The drift component is essentially the same as in Eq. (3.7), with rf replaced by some coefficient, say, μ. However, we need to add a suitable source of randomness. A common building block is the Wiener process, denoted by W(t). This is a continuous-time stochastic process, which essentially means that for every value of the continuous parameter t, we have a random variable W(t): In general we may define a stochastic process as a time-indexed collection of random variables X(t) or Xt, where time can be continuous or discrete. To include this noise term in the equation, we should first consider increments in the stochastic process over a small time interval (t, t + δt),
The Wiener process is a Gaussian process, in the sense that δW(t) ~ N(0, δt).8 Please note that the variance of δW(t) is δt, and thus its standard deviation is √δt. Then we take the limit for δt → 0, which yields the infinitesimal increment dW(t). Adding this term, multiplied by a volatility factor σ, to Eq. (3.7) yields the following stochastic differential equation:
Later, we will see that this defines an important process called geometric Brownian motion. Actually, we have skipped over some challenging issues, since “taking the limit” for δt → 0 is quite delicate when random variables are involved, as one must refer to some stochastic convergence concept. Furthermore, the Wiener process is a very “irregular” and noisy process, with sample paths that are not differentiable. We will deal with sample path generation later, but you may have a peek at Fig. 3.39 to see what we mean. Hence, taking the usual derivative dW(t)/dt is not possible, which justifies the use of this form. In Section 3.7 we will outline the fundamental tools of stochastic calculus, and we will learn that Eq. (3.10) is actually a shorthand for an integral equation:
FIGURE 3.39 A sample path of the standard Wiener process.
(3.11)
This looks superficially like the integral in Eq. (3.9), but it relies on the concept of Itô stochastic integral, to be discussed in Section 3.7. The differential form, however, is more intuitive and paves the way for the development of several interesting processes.
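Although a rigorous treatment is deferred to Section 3.7, a naive Euler discretization already allows us to generate an approximate sample path of Eq. (3.10): we replace dW(t) by a finite increment δW ~ N(0, δt) over a small step (the parameter values below are made up):

```r
set.seed(55555)
mu <- 0.1; sigma <- 0.3        # drift and volatility coefficients
S0 <- 100                      # initial asset price
nSteps <- 250; dt <- 1/nSteps  # one year, in small steps
dW <- rnorm(nSteps, mean = 0, sd = sqrt(dt))  # Wiener increments
S <- numeric(nSteps + 1)
S[1] <- S0
for (i in 1:nSteps)            # Euler step: dS = mu*S*dt + sigma*S*dW
  S[i + 1] <- S[i] + mu * S[i] * dt + sigma * S[i] * dW[i]
plot(S, type = 'l')
```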
For instance, let us consider a model for stochastic interest rates. Can we model an interest rate r(t) by geometric Brownian motion? The answer is no, and to see why, let us wonder about the solution of Eq. (3.10) when σ = 0. When there is no volatility, the equation boils down to Eq. (3.7), whose solution is an increasing exponential. If there is a volatility component, intuition suggests that the solution is an exponentially increasing trend, with some superimposed noise. Financial common sense suggests that interest rates do not grow without bound, and tend to oscillate around a long-term average level r̄. To model such a mean reversion, we may use the following equation:
(3.12)
which is an example of the Ornstein–Uhlenbeck process, to be discussed later in Section 3.7.5. The intuition is that the drift can be positive or negative, pulling back r(t) toward its long-term average level. It is worth mentioning that, in practice, continuous-time models are the rule in financial engineering.
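A rough Euler discretization of Eq. (3.12) shows mean reversion at work; the speed coefficient and the other values below are made up:

```r
set.seed(55555)
gamma <- 2                    # speed of mean reversion (made-up drift coefficient)
rbar <- 0.04                  # long-term average level
sigma <- 0.01                 # volatility
nSteps <- 500; dt <- 1/250
r <- numeric(nSteps + 1)
r[1] <- 0.08                  # start well above the long-term level
for (i in 1:nSteps)           # Euler step: dr = gamma*(rbar - r)*dt + sigma*dW
  r[i + 1] <- r[i] + gamma * (rbar - r[i]) * dt + sigma * sqrt(dt) * rnorm(1)
mean(tail(r, 100))            # hovers near rbar = 0.04
```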
Most readers are quite familiar with the distributions that we outline in the following and can safely skip the section, possibly referring back when needed. The treatment is limited and just meant to be a refresher, but in order to make the section as useful as possible to the widest readership, we also take the review as an excuse to introduce R functions to deal with probability distributions; we will also use functions to produce some simple plots, and R scripts to illustrate both R programming and some specific properties of distributions. As a general rule, four functions are associated with each standard distribution in R. Given a distribution distr, R provides us with:
Note how the prefixes p, d, q, and r characterize each function. The p is somewhat confusing, as “probability mass” could be associated with discrete distributions, whereas in R the p only refers to cumulative probabilities; the choice of the prefix d recalls the term “density” and could possibly reflect some bias toward continuous distributions.
We may illustrate these four functions with reference to a continuous uniform variable on the interval [a, b], which features a constant density function on its support:
A commonly used notation to state that a random variable X has this distribution is X ~ U[a, b], or X ~ U(a, b). We also recall that
The following R snapshots illustrate the related R functions:
> a=5
> b=15
> dunif(8,min=a,max=b)
[1] 0.1
> punif(8,min=a,max=b)
[1] 0.3
> dunif(3,min=a,max=b)
[1] 0
> punif(3,min=a,max=b)
[1] 0
> dunif(18,min=a,max=b)
[1] 0
> punif(18,min=a,max=b)
[1] 1
Note how qunif and punif are inverses of each other:
> punif(9,min=a,max=b)
[1] 0.4
> qunif(0.4,min=a,max=b)
[1] 9
The following snapshot will probably give different results on your computer, as random samples depend on the state of the (pseudo)random number generator:
> runif(5,min=a,max=b)
[1]  8.434857  5.190168 10.679197  5.220115 10.036124
This is why, throughout this book, we will use the command set.seed(55555) to set the state of the generator to a specific state, thus allowing replication of results. We defer to Chapter 5 the technicalities involved in the generation of pseudorandom numbers and variates. By the way, while the uniform distribution does not look too interesting in itself, it is fundamental in Monte Carlo simulation as all distributions are sampled by suitable transformations of U(0, 1) variables.
In the following sections we illustrate a few common univariate distributions, both discrete and continuous, without any claim of exhaustiveness. Some of the distributions that we describe here have a multivariate counterpart, to be discussed in Section 3.3.
The binomial distribution is a discrete distribution built on top of a Bernoulli random variable. The mechanism behind a Bernoulli variable is carrying out a random experiment, which can result in either a success, with probability p, or a failure, with probability 1 − p. If we assign to the random variable X the value 1 in case of a success, and 0 in case of a failure, we obtain the PMF:
For a Bernoulli variable, denoted by X ~ Ber(p), we can easily calculate the expected value,
and the variance,
The formula for variance makes intuitive sense: The variance is zero for p = 1 and p = 0 (there is no uncertainty on the outcome of the experiment), and it is maximized for p = 1/2, where we have the largest uncertainty about the outcome.
When needed, we may use any other value, rather than 0 and 1. A financial motivation for a Bernoulli variable, quite relevant in option pricing, is modeling random asset prices S(t) in simplified form, rather than using stochastic differential equations. According to this simplified model, given the current asset price S(0) = S0, the price after a time step δt, denoted by S(δt), can be either Su = uS0 or Sd = dS0, with probabilities pu and pd, respectively. The subscripts may be interpreted as “up” and “down,” if we choose multiplicative shocks d < u. Now, imagine that the process S(t) is observed up to time T = n · δt and that up and down jumps are independent over time. Since there are n steps, the largest value that S(T) can attain is S0u^n, and the smallest one is S0d^n. Intermediate values are of the form S0u^m d^(n−m), where m is the number of up jumps that occurred, and n − m is the number of down jumps. Note that the exact sequence of up and down jumps is not relevant to determine the terminal price, as multiplicative shocks commute: S0ud = S0du. A simple model like this is referred to as a binomial model and can be depicted as the binomial lattice in Fig. 3.1. Such a model is quite convenient for simple, low-dimensional options,9 as recombination keeps complexity low, whereas the size of a generic tree explodes exponentially.10 Now, we might wonder what is the probability that S(T) = S0u^m d^(n−m). Indeed, this is an example of a binomial variable: To build a binomial variable, we carry out a given number n of independent Bernoulli experiments and count the number m of successes. Hence, a binomial variable is just the sum of n independent Bernoulli variables. This distribution has finite support {0, 1, …, n} and depends on the two parameters p and n; thus, it may be denoted by X ~ Bin(n, p). By using properties of sums of independent random variables, it is easy to show that
FIGURE 3.1 A binomial lattice model for a stock price.
The PMF of a binomial variable is not completely trivial:
where we use the binomial coefficient11
To see why this coefficient is needed, consider that there are many ways in which we may have r successes and n − r failures; which specific experiments succeed or fail is irrelevant. In fact, we have n! permutations of the n individual experiments, (n − r)! permutations of failures, and r! permutations of successes. Since the specific order of each permutation is irrelevant, we divide n! by (n − r)! and r!.
Figure 3.2 shows the PMF for n = 30, and p = 0.2, p = 0.4. The plot has been obtained by the following commands:
FIGURE 3.2 Probability mass function of two binomial random variables.
> par(mfrow=c(2,1))
> plot(dbinom(0:30,size=30,p=0.2),type='h',lwd=5)
> plot(dbinom(0:30,size=30,p=0.4),type='h',lwd=5)
The first command sets mfrow in order to draw plots according to a layout consisting of two rows and one column, thus lining the plots vertically. The plot function uses the parameter lwd to set the line width. Also check the following calculations:
> dbinom(10,size=30,p=0.4)
[1] 0.1151854
> factorial(30)/factorial(20)/factorial(10)*0.4^10*0.6^20
[1] 0.1151854
The second way of computing binomial probabilities is a straightforward application of Eq. (3.13), but it is not quite smart, as taking factorials of large integers incurs a danger of overflow, i.e., of producing a number so large that it cannot be represented on a computer. When in need, it is better to leave the task to an appropriately coded function:
> choose(30,10)*0.4^10*0.6^20
[1] 0.1151854
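We may also check the binomial moments E[X] = np and Var(X) = np(1 − p) by straightforward Monte Carlo sampling:

```r
set.seed(55555)
n <- 30; p <- 0.4
X <- rbinom(100000, size = n, prob = p)
mean(X)   # close to n*p = 12
var(X)    # close to n*p*(1-p) = 7.2
```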
Another discrete distribution that can be built on top of the Bernoulli is the geometric. This is defined by repeating identical and independent Bernoulli trials with success probability p, until the first success is obtained and counting the number of experiments.12 The PMF of the geometric distribution is
The expected value E[X] can be easily found by conditioning on the result of the first experiment, rather than by applying the definition. If the first experiment is a success, which occurs with probability p, the conditional expected value of the number of experiments is just E[X | success] = 1; otherwise, since there is no memory of past outcomes, it is E[X | failure] = 1 + E[X], i.e., we count the first failed experiment and then we are back to square one. Thus:
E[X] = p · 1 + (1 − p) · (1 + E[X]),  which implies  E[X] = 1/p.    (3.14)
R offers functions for the geometric distribution, too, but there is a little peculiarity:
> set.seed(55555)
> rgeom(10,prob=0.5)
[1] 1 2 3 1 0 2 0 0 0 0
We notice that the random variable can take the value 0, because in R the convention for the geometric distribution is to count failures, rather than experiments.
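Given this convention, adding 1 to the sampled values yields the number of experiments, whose expected value is 1/p by Eq. (3.14); a quick Monte Carlo check:

```r
set.seed(55555)
p <- 0.25
X <- rgeom(100000, prob = p) + 1   # +1: count experiments, not failures
mean(X)    # close to 1/p = 4
```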
The exponential and the Poisson distributions, despite their different nature, are tightly intertwined: The former is continuous and may be used to model the time elapsing between two events of some kind, whereas the latter is discrete and represents the corresponding number of events occurred within a given time interval. Examples of relevant events in finance are jumps in stock prices or changes in the credit rating of a bond issuer.
An exponential random variable owes its name to the functional form of its PDF:

fX(x) = λe^(−λx),  x ≥ 0,
where λ > 0 is a given parameter. Note that the support of the exponential distribution is [0, +∞), which makes sense since it is often used to model the length of time intervals. The corresponding CDF is

FX(x) = P{X ≤ x} = 1 − e^(−λx),  x ≥ 0.  (3.15)
It is fairly easy to show that

E[X] = 1/λ,  Var(X) = 1/λ².
It is worth noting that the expected value of an exponential random variable is quite different from the mode, which is zero, and that this distribution is right-skewed. If we interpret X as the time elapsing between consecutive events, 1/λ is the average time between them, and λ is a rate, i.e., the average number of events occurring per unit time. The notation X ~ exp(λ) is typically used to denote an exponential random variable.
A weird but remarkable property of exponential random variables is lack of memory. From the cumulative distribution function (3.15) we see that

P{X > t} = 1 − FX(t) = e^(−λt).
Intuitively, this means that, if we are at time zero, the probability that we have to wait more than t for the next event to occur is a decreasing function of t, which makes sense. What is not so intuitive is that if a time period of length t has already elapsed since the last event, this has no influence on the time we still have to wait for the next one. This is clearly false, e.g., for a uniform random variable U(a, b): if a time b − ε has elapsed since the last event, where ε
is small, we should get ready for the next one. On the contrary, the exponential distribution is memoryless, which can be stated formally as follows:

P{X > t + s | X > t} = P{X > s},  for all t, s ≥ 0.
This characteristic is important when making modeling choices: If lack of memory is not compatible with the phenomenon we are representing, we should not use an exponential distribution.
R offers the exp family of functions, which we may use to illustrate lack of memory empirically, using Monte Carlo sampling as an R programming exercise. The function in Fig. 3.3 generates a sample of numRepl exponential random variables with rate lambda. Then, the probabilities P{X > t}, P{X > s}, and P{X > t + s | X > t} are estimated empirically. Note how we restrict the sample conditionally on the event {X > t} when extracting the subvector times_t. Here is the result, based on one million replications:
FIGURE 3.3 Checking the memoryless property of exponential distribution empirically.
> set.seed(55555)
> CheckMemory(1, 1, 0.5, 1000000)
P(T>t) = 0.367936
P(T>s) = 0.606819
P(T>t+s|T>t) = 0.6064832
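The listing of Fig. 3.3 is not reproduced here; a minimal sketch consistent with the call and output above, assuming the argument order CheckMemory(lambda, t, s, numRepl), might be:

```r
# hypothetical sketch of the function in Fig. 3.3
CheckMemory <- function(lambda, t, s, numRepl) {
  times <- rexp(numRepl, rate = lambda)    # sample of exponential variables
  cat("P(T>t) =", mean(times > t), "\n")   # estimate of P{X > t}
  cat("P(T>s) =", mean(times > s), "\n")   # estimate of P{X > s}
  times_t <- times[times > t]              # restrict to the event {X > t}
  cat("P(T>t+s|T>t) =", mean(times_t > t + s), "\n")
}
```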
We see that, despite a small discrepancy due to sampling errors, the estimate of P{X > t + s | X > t} is fairly close to P{X > s}. One could wonder whether one million replications are enough. The following exact evaluation is not too different from the above estimates, suggesting that the selected number of replications should be indeed large enough:
> 1-pexp(1,rate=1)
[1] 0.3678794
> 1-pexp(0.5,rate=1)
[1] 0.6065307
However, this could just be sheer luck, and in real life we cannot afford the luxury of a comparison with exact numbers (otherwise, we would not be using Monte Carlo methods in the first place). We will address similar issues in Chapter 7, when dealing with simulation output analysis.
The exponential distribution is the building block of other random variables. By summing n independent exponentials with common rate λ we obtain the Erlang distribution, denoted by Erl(n, λ). Erlang random variables may be used to model time elapsed as a sequence of exponential stages, in order to overcome the memoryless property of a single exponential. The PDF of a few Erlang variables, with a different number of stages, can be plotted using the R script in Fig. 3.4, which produces the plot of Fig. 3.5. From a mathematical point of view we immediately notice that by summing exponential variables we get a rather different distribution. In general, the sum of random variables has a distribution that is quite different from the single terms in the sum. The normal distribution is the most notable exception, as the sum of normals is still normal. We also notice that, in order to plot the PDF of an Erlang variable in R, we have to resort to the dgamma function. As we shall see in Section 3.2.5, the gamma distribution generalizes other distributions, including the Erlang.
FIGURE 3.4 Plotting the PDF of different Erlang distributions.
FIGURE 3.5 Erlang PDFs.
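A script along the lines of Fig. 3.4 can be sketched as follows (the specific stage counts shown here are an assumption); an Erl(n, λ) density is obtained from dgamma with integer shape n:

```r
x <- seq(0, 10, by = 0.01)
plot(x, dgamma(x, shape = 1, rate = 1), type = "l")        # n = 1: plain exponential
lines(x, dgamma(x, shape = 2, rate = 1), lty = "dashed")   # n = 2 stages
lines(x, dgamma(x, shape = 4, rate = 1), lty = "dotdash")  # n = 4 stages
```

Note how, for n > 1, the density vanishes at the origin, unlike the exponential whose mode is zero.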
We have already mentioned in Section 1.3.3 a peculiar stochastic process, the Poisson process, to motivate models based on discrete-event systems. Here we further investigate its properties and link it to a discrete distribution sharing the same name. The process is obtained by counting the number of events occurred in the time interval [0, t], denoted by N(t). Hence, this is a counting process, and Fig. 3.6 illustrates a sample path. Clearly, the sample path is discontinuous, but piecewise constant, as there is a jump corresponding to each event. Let Xk, k = 1, 2, 3, 4, …, be the time elapsed between jump k − 1 and jump k; by convention, X1 is the time at which the first jump occurs, after starting the system at time t = 0. We obtain a Poisson process if we assume that the variables Xk are mutually independent and all exponentially distributed with parameter λ. You might even stumble on esoteric jargon like a càdlàg function when dealing with certain stochastic processes in finance. This is just a French acronym for “continue à droite, limite à gauche,” since the sample path is continuous from the right and has a limit from the left at every point, even where the path jumps.
FIGURE 3.6 Sample path of the Poisson process.
If we count the number of events in the [0, 1] time interval, we obtain a discrete random variable, the Poisson random variable. The Poisson distribution is characterized by the same parameter λ as the exponential, and it may take values on the set {0, 1, 2, 3, …}. Its PMF is

P{X = k} = e^(−λ) λ^k / k!,  k = 0, 1, 2, …
Despite the apparent complexity of this function, it is straightforward to check that it meets the fundamental requirement of a PMF:

Σ(k=0..∞) e^(−λ) λ^k / k! = e^(−λ) Σ(k=0..∞) λ^k / k! = e^(−λ) e^λ = 1.
By a similar token, we may also find its expected value and variance:

E[X] = λ,  Var(X) = λ.
Note that λ is, in fact, the event rate, which characterizes this distribution, denoted by X ~ Pois(λ). If we want to count events occurring in the more general interval [0, t], we have just to consider X ~ Pois(λt). The R family for the Poisson distribution is pois. Figure 3.7 shows the plot of the PMF of a Poisson random variable for λ = 5, obtained by the following command:
FIGURE 3.7 Probability mass function of a Poisson random variable.
> plot(0:30, dpois(0:30,lambda=5), type="h", lwd=5)
We may state the connection between the Poisson random variable and the Poisson process in more general terms as follows. If we consider a time interval [t1, t2], with t1 < t2, then the number of events occurred in this interval, i.e., N(t2) − N(t1), has Poisson distribution with parameter λ(t2 − t1). Furthermore, if we consider another time interval [t3, t4], where t3 < t4, which is disjoint from the previous one, i.e., t2 ≤ t3, then the random variables N(t2) − N(t1) and N(t4) − N(t3) are independent. We say that the Poisson process has stationary and independent increments.
The Poisson process model can be generalized to better fit reality, when needed. If we introduce a time-varying rate λ(t), we get the so-called inhomogeneous Poisson process. Furthermore, when modeling price jumps, we would like to introduce a random variable modeling the height of the jump. The cumulative sum of jump heights in the time interval [0, t] is another stochastic process, which is known as compound Poisson process. We will see how sample paths for such processes can be generated in Chapter 6.
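For the basic homogeneous process, though, the connection between exponential interarrival times and the counting process can already be sketched in a few lines of R: we cumulate exponential gaps to obtain the jump epochs, and stepfun yields a right-continuous (càdlàg) sample path like the one in Fig. 3.6.

```r
set.seed(55555)
lambda <- 2
jumps <- cumsum(rexp(20, rate = lambda))  # jump epochs: cumulated exponential gaps
N <- stepfun(jumps, 0:20)                 # counting process N(t), right-continuous
N(jumps[5])                               # 5 events have occurred by the 5th jump epoch
```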
It is safe to say that the normal or Gaussian distribution, with its bell shape, is the best-known and most used (and misused) distribution. A normal random variable X ~ N(μ, σ2) is characterized by the following PDF:

fX(x) = 1/(σ√(2π)) exp(−(x − μ)²/(2σ²)),
which clearly suggests symmetry with respect to the point x = μ. The parameters μ and σ2 have a clear interpretation,13 as a few calculations show that

E[X] = μ,  Var(X) = σ².
R offers the norm family of functions. As an illustration, in Fig. 3.8 we show the PDF for two normal distributions with μ = 0 and σ = 1, 2, respectively; the plot is obtained as follows:
FIGURE 3.8 PDF of normal random variables with μ = 0 and σ = 1, σ = 2.
> x <- seq(-6.5, 6.5, by=0.01)
> plot(x, dnorm(x, mean=0, sd=1), type="l", lty="dashed")
> lines(x, dnorm(x, mean=0, sd=2), type="l", lty="dotdash")
From the plot, we also verify that, for a normal distribution, mode, median, and expected value are the same.
In the applications, a very special role is played by the standard normal distribution, characterized by parameters μ = 0 and σ = 1. The reason for its importance is that, if we are able to work with a standard normal in terms of quantiles and distribution function, then we are able to work with a generic normal variable. The CDF is the integral of the bell-shaped function above, for which no analytical formula is known; however, efficient numerical approximations are available for the standard normal case Z ~ N(0, 1):

Φ(z) = P{Z ≤ z}.
This function is calculated by the R function pnorm. We are also able to invert the function by qnorm, which yields the quantile zq defined by

P{Z ≤ zq} = Φ(zq) = q
for a probability level q ∈ (0, 1). Armed with these functions, we may easily compute the CDF and quantiles for a generic normal. If X ~ N(μ, σ2) and we need the probability P{X ≤ x},
we have just to standardize:

P{X ≤ x} = P{Z ≤ (x − μ)/σ} = Φ((x − μ)/σ),
since Z = (X − μ)/σ is standard normal. If we need a quantile xq, such that P{X ≤ xq} = q, we find the corresponding standard quantile zq and go the other way around by destandardizing:

xq = μ + σ zq.
In R, we can avoid all of this if we specify the parameters mean and sd.
The following examples show how to use pnorm and qnorm:
> pnorm(10,mean=5,sd=3)
[1] 0.9522096
> pnorm((10-5)/3)
[1] 0.9522096
> qnorm(0.95,mean=5,sd=3)
[1] 9.934561
> z=qnorm(0.95); z
[1] 1.644854
> 5+3*z
[1] 9.934561
We may also easily check the following well-known properties:
> pnorm(1) - pnorm(-1)
[1] 0.6826895
> pnorm(2) - pnorm(-2)
[1] 0.9544997
> pnorm(3) - pnorm(-3)
[1] 0.9973002
The last property shows that the normal distribution has rather thin tails, since most realizations are within three standard deviations of the mean, which makes its use sometimes questionable. Indeed, the normal distribution is a sort of benchmark in terms of fat vs. thin tails, a fact that is measured by its kurtosis:

κ = E[(X − μ)⁴] / σ⁴ = 3.
Sometimes, kurtosis is defined as κ − 3, which should actually be called excess kurtosis. Distributions with positive excess kurtosis allow for more frequent extreme events than the normal. Another important feature of the normal distribution is symmetry, which implies a zero skewness:

E[(X − μ)³] / σ³ = 0.
The assumption that returns on an asset are normally distributed implies their symmetry, which is contradicted by empirical facts, especially when financial derivatives are involved. Hence, a normal model is sometimes too stylized to be credible. Nevertheless, the normal distribution has also a few quite interesting and convenient properties:
The normal distribution is said to be a stable distribution, since a linear combination of independent normals is still normal. The process can also go the other way around, since stable random variables are infinitely divisible, i.e., they can be expressed as sums of similar variables. We just mention that there is a family of stable distributions that includes the normal as a specific case.15
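For instance, the sum of independent N(2, 3²) and N(−1, 4²) variables is N(1, 5²), which is easy to check by sampling:

```r
set.seed(55555)
x <- rnorm(100000, mean = 2, sd = 3) + rnorm(100000, mean = -1, sd = 4)
mean(x)  # close to 2 + (-1) = 1
sd(x)    # close to sqrt(3^2 + 4^2) = 5
```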
In Section 3.3.4 we will also see some properties of the multivariate generalization of the normal distribution. The normal distribution is also the basis of other distributions obtained by transforming a normal variable or by combining independent standard normals.
A random variable X is said to have a lognormal distribution if log X is normal. In other words, if Y ~ N(μ, σ2), then X = e^Y is lognormal, denoted by X ~ LogN(μ, σ2). It can be shown that the PDF of the lognormal is
fX(x) = 1/(xσ√(2π)) exp(−(log x − μ)²/(2σ²)),  x > 0.  (3.16)
It is important to notice that in this case the two parameters μ and σ2 cannot be interpreted as in the normal case. In fact, using the moment generating function of the normal, it is easy to show that

E[X] = e^(μ + σ²/2),  Var(X) = e^(2μ + σ²) (e^(σ²) − 1).
Note how, due to convexity of the exponential function and Jensen’s inequality, if X ~ LogN(μ, σ2), then

E[X] = e^(μ + σ²/2) > e^(E[Y]) = e^μ.
Another noteworthy fact is that

X ~ LogN(−σ²/2, σ²)  ⟹  E[X] = 1.  (3.17)
This suggests using a lognormal variable with unit expected value to model random errors in a multiplicative model, whereas a normal variable with zero expected value would model random errors in an additive model.
Furthermore, since by summing normal variables we still get a normal variable, a similar property holds when we multiply lognormal random variables. Formally, if we consider n independent lognormals Xi ~ LogN(μi, σi2), then

X1 · X2 ··· Xn ~ LogN(μ1 + ··· + μn, σ1² + ··· + σn²).
This is useful when modeling financial returns, which should cumulate multiplicatively rather than additively. The R family for this distribution is lnorm. For instance, we may check Eq. (3.17) empirically:
> set.seed(55555)
> sigma <- 2
> X <- rlnorm(1000000, meanlog=-(sigma^2)/2, sdlog=sigma)
> mean(X)
[1] 0.9928838
Let Zi, i = 1, …, n, be standard and independent normal variables. The random variable X defined as

X = Z1² + Z2² + ··· + Zn²
is certainly not normal, as it cannot take negative values. This variable is called chi-square with n degrees of freedom and is often denoted by χn2. Recalling that the expected value of a squared standard normal is

E[Zi²] = Var(Zi) = 1,
we immediately see that

E[X] = n.
It can also be shown that

Var(X) = 2n.
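These moments are easy to check by sampling with rchisq:

```r
set.seed(55555)
n <- 8
x <- rchisq(1000000, df = n)
mean(x)  # close to n = 8
var(x)   # close to 2n = 16
```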
Figure 3.9 shows the PDF for a chi-square variable with 4 and 8 degrees of freedom, obtained as follows:
FIGURE 3.9 PDF of two chi-square random variables with 4 and 8 degrees of freedom, respectively; the variable with 4 degrees of freedom is less uncertain and has a higher mode.
> x <- seq(0,20,by=0.01)
> plot(x, dchisq(x,df=4), type="l", lty="dashed")
> lines(x, dchisq(x,df=8), type="l", lty="dotdash")
The chi-square distribution is relevant in inferential statistics, and there is a well-known nonparametric test of goodness of fit owing its name to it. It is worth noting that a chi-square distribution may be regarded as a specific case of a gamma random variable, discussed later. In finance, we also meet the noncentral chi-square distribution, which is obtained by the following construction. Let us consider

X = Σ(i=1..n) (Zi + δi)²,

where the Zi are independent standard normal variables and the δi are given numbers.
The corresponding variable is called noncentral chi-square with n degrees of freedom and noncentrality parameter λ = Σ(i=1..n) δi². It is possible to generalize this distribution to a noninteger ν, resulting in a χ′ν²(λ) variable. The following formulas for the relevant moments can be shown:
E[X] = ν + λ,  Var(X) = 2(ν + 2λ).  (3.18)
The R function family chisq can also be used for the noncentral case. For instance:
> set.seed(55555)
> rchisq(3, df=4.5, ncp=3.6)
[1] 17.775371 8.719674 15.453155
from which we see that the degrees of freedom df and the noncentrality parameter ncp may be noninteger. This distribution is relevant in the simulation of square-root diffusions.16
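A Monte Carlo check of the moments in Eq. (3.18) for these very parameters may run as follows:

```r
set.seed(55555)
nu <- 4.5; lambda <- 3.6
x <- rchisq(1000000, df = nu, ncp = lambda)
mean(x)  # close to nu + lambda = 8.1
var(x)   # close to 2*(nu + 2*lambda) = 23.4
```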
If Z and χn2 are a standard normal and a chi-square with n degrees of freedom, respectively, and they are also mutually independent, the random variable

T = Z / √(χn2 / n)
has a Student’s t distribution with n degrees of freedom, denoted by tn. This distribution is widely used in inferential statistics, for computing confidence intervals and testing hypotheses. A notable feature is that it has fatter tails than the normal. Indeed, it can be shown that

E[X] = 0 (for n > 1),  Var(X) = n/(n − 2) (for n > 2).
We observe that, for large n, the variance goes to 1, but it is not defined for n ≤ 2. Since this variance is larger than the corresponding variance of the standard normal, the t distribution is more dispersed. In fact, the excess kurtosis is

6/(n − 4)
for n > 4. We may check these fatter tails by plotting its density in Fig. 3.10:
FIGURE 3.10 PDF of Student’s t distribution, with n=1 (dash-dotted line) and n=5 (dashed line), compared with a standard normal (continuous line).
> x <- seq(-4.5,4.5,by=0.01)
> plot(x, dnorm(x), type="l", lty="solid")
> lines(x, dt(x,df=1), type="l", lty="dotdash")
> lines(x, dt(x,df=5), type="l", lty="dashed")
The plot shows the densities of t1 and t5 random variables, along with a standard normal. We observe that the PDF of the t distribution is bell-shaped much like a standard normal; the main differences lie in its heavier tails and in a lower mode. For increasing n, the tails of tn get thinner and the density tends to a standard normal. In fact, for large n the two distributions are virtually identical. The t distribution can be generalized to the multivariate case, as shown in Section 3.3.4.
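The fatter tails can also be quantified numerically, comparing the probability of a realization beyond three standard deviations:

```r
1 - pnorm(3)       # standard normal: about 0.00135
1 - pt(3, df = 5)  # t with 5 degrees of freedom: roughly ten times larger
1 - pt(3, df = 1)  # t with 1 degree of freedom: larger still
```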
Another relevant distribution can be derived from the normal and plays a fundamental role in analysis of variance and in testing regression models. If we combine two independent chi-square variables with n and m degrees of freedom, respectively, as follows:

F = (χn2 / n) / (χm2 / m),
we obtain the F distribution with n and m degrees of freedom. R offers functions like df(x, df1, df2); there is a noncentral version of the F distribution as well.
The beta distribution is continuous and has support on the unit interval [0, 1]. It depends on two parameters α1, α2 > 0, defining its shape through the PDF

fX(x) = x^(α1 − 1) (1 − x)^(α2 − 1) / B(α1, α2),  x ∈ (0, 1).  (3.19)
The denominator of the above ratio is just needed in order to normalize the PDF. A straightforward approach is defining the beta function:

B(α1, α2) = ∫(0..1) x^(α1 − 1) (1 − x)^(α2 − 1) dx.
Alternatively, we may use the following gamma function:
Γ(α) = ∫(0..∞) x^(α − 1) e^(−x) dx,  (3.20)

since B(α1, α2) = Γ(α1)Γ(α2) / Γ(α1 + α2).
Using integration by parts, it is easy to see that the gamma function satisfies the following recursion:

Γ(α + 1) = α Γ(α).  (3.21)
This shows that the gamma function is an extension of the factorial function outside the domain of integer numbers. In fact, using Eq. (3.21) for an integer argument and unfolding the recursion, we find

Γ(n + 1) = n Γ(n) = n(n − 1) ··· 1 · Γ(1) = n!,  since Γ(1) = 1.
The gamma function appears in the definition of different PDFs, including, needless to say, the gamma distribution to be discussed next. It also turns out that

Γ(1/2) = √π.
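Both facts are easy to verify with the built-in gamma function:

```r
gamma(6) == factorial(5)  # TRUE: Gamma(n + 1) = n!
gamma(0.5)                # equals sqrt(pi)
sqrt(pi)
```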
The PDFs displayed in Fig. 3.11 are obtained by the following R snapshot; this shows that the uniform and certain triangular distributions are special cases of the beta distribution.
FIGURE 3.11 PDFs of several beta distributions.
> x <- seq(0.001,0.999,by=0.001)
> par(mfrow = c(3,3))
> plot(x, dbeta(x,2,1), type="l")
> plot(x, dbeta(x,1,1), type="l")
> plot(x, dbeta(x,1,2), type="l")
> plot(x, dbeta(x,1,0.8), type="l")
> plot(x, dbeta(x,0.8,0.8), type="l")
> plot(x, dbeta(x,0.8,1), type="l")
> plot(x, dbeta(x,8,3), type="l")
> plot(x, dbeta(x,3,3), type="l")
> plot(x, dbeta(x,3,8), type="l")
Possible uses of a beta random variable are related to random percentages, like recovery rates after a default. If needed, the support of the beta distribution can be easily mapped to the interval (a, b). It is also interesting to compare the PDF of the beta distribution in Eq. (3.19) with the PMF of a binomial distribution in Eq. (3.13). Apart from the normalization factors, they look quite similar, at least if we choose integer parameters α1 and α2 in the PDF of a beta variable. In fact, these distributions are quite intertwined within the framework of Bayesian statistics, when we consider uncertainty in the parameter p of a Bernoulli random variable; the beta distribution can be used to represent such an uncertainty, both in the prior and the posterior.17
The gamma distribution is characterized by the PDF
fX(x) = λ^α x^(α − 1) e^(−λx) / Γ(α),  x ≥ 0,  (3.22)
which depends on parameters α and λ; we use notation X ~ Gamma(α, λ). A notable feature of the gamma distribution is that it includes other distributions as special cases:

For α = 1, we obtain the exponential distribution exp(λ).
For an integer α = n, we obtain the Erlang distribution Erl(n, λ).
For α = n/2 and λ = 1/2, we obtain the chi-square distribution χn2.
The fundamental moments of a gamma variable are

E[X] = α/λ,  Var(X) = α/λ².
This suggests that λ is an inverse scale (rate) parameter, whereas α is a shape parameter. This is in fact the case, in the sense that if X ~ Gamma(α, 1), then X/λ ~ Gamma(α, λ). Some authors prefer to parameterize the distribution by the rate λ, whereas others use the scale β = 1/λ. R offers both possibilities, as there are two alternatives in the gamma function family:
> dgamma(2, shape=3, rate=2)
[1] 0.2930502
> dgamma(2, shape=3, scale=1/2)
[1] 0.2930502
The following R snapshot produces the plot in Fig. 3.12, which illustrates the effect of the shape parameter for a normalized scale parameter:
FIGURE 3.12 PDFs of some gamma distributions.
> x <- seq(0,7,by=0.01)
> plot(x, dgamma(x,shape=0.5,rate=1), type="l", ylim=c(0,1.2), ylab="")
> lines(x, dgamma(x,shape=1,rate=1))
> lines(x, dgamma(x,shape=2,rate=1))
> lines(x, dgamma(x,shape=3,rate=1))
Sometimes, no textbook distribution will fit the available data. In such a case, one could consider resorting to an empirical distribution. In the discrete case, we have just to specify a set of values and the associated probabilities. In the continuous case, we must come up with a sensible way to devise a PDF or a CDF based on a discrete set of observations.
Piecewise linear interpolation is a straightforward way of building a CDF. Given n observations Xi, i = 1, …, n, of the random variable X, we first sort them and find the order statistics X(1) ≤ X(2) ≤ ··· ≤ X(n). Now we build a piecewise linear CDF as follows:
F̂(x) = 0 for x < X(1);
F̂(x) = (i − 1)/(n − 1) + (x − X(i)) / [(n − 1)(X(i+1) − X(i))] for X(i) ≤ x < X(i+1);
F̂(x) = 1 for x ≥ X(n).  (3.23)
In Fig. 3.13 we illustrate an example for n = 6. After defining the empirical CDF, it is easy to sample from it using, e.g., the inverse transform method of Section 5.3.
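A minimal sketch of the idea, using a small hypothetical sample of n = 6 observations and approxfun for the piecewise linear interpolation:

```r
X <- c(3.1, 1.2, 4.8, 2.5, 6.0, 5.2)   # hypothetical sample, n = 6
Xs <- sort(X)                          # order statistics
q <- (0:5) / 5                         # CDF levels at the order statistics
Fhat <- approxfun(Xs, q)               # piecewise linear empirical CDF
Finv <- approxfun(q, Xs)               # its inverse, for sampling
set.seed(55555)
Finv(runif(3))                         # inverse transform sampling
```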
FIGURE 3.13 Building an empirical CDF by piecewise linear interpolation.
Kernel-based estimation is an alternative, possibly more sophisticated way to build an empirical distribution. Unlike the piecewise-linear interpolation approach, it aims at building a PDF, and it can be interpreted as a way to “smooth” a histogram. Roughly speaking, the idea is to center a suitable elementary density around each observation, to add all of the densities, and to scale the result according to the number of observations in such a way as to obtain a legitimate overall PDF.
R offers a density function to this aim, receiving a vector of observations and producing an object of class density. The function behavior can be controlled by setting several parameters, among which we emphasize bw, the smoothing bandwidth, and kernel, the type of elementary density used (the default is a Gaussian kernel).
To illustrate the role of bandwidth, let us generate a sample of observations by using a mixture of two normals:
> set.seed(55555)
> data <- c(rnorm(40,10,10), rnorm(30,75,20))
> hist(data,nclass=15)
> windows()
> par(mfrow=c(1,2))
> kd1 <- density(data, bw=2)
> plot(kd1, main="")
> kd2 <- density(data, bw=10)
> plot(kd2, main="")
> class(kd1)
[1] "density"
The corresponding histogram is displayed in Fig. 3.14, and suggests a bimodal distribution. The kernel kd1 is indeed of class density, and it can easily be plotted, resulting in the two PDFs displayed in Fig. 3.15. We clearly see that a small bandwidth does not smooth data enough and produces a confusing PDF; a larger bandwidth yields a definitely more reasonable result.
FIGURE 3.14 Histogram of data generated by a mixture of normals.
FIGURE 3.15 Kernel densities for different bandwidths.
An extreme form of empirical distribution, in a sense, is the bootstrap approach sometimes employed in finance. The idea is to take a time series of prices or returns, and to generate future scenarios by sampling past outcomes. There are refinements of this idea, but it should be clear that if we take past observations and we just mix them randomly to come up with sample paths, we are implicitly assuming that longitudinal dependencies and correlations can be ruled out. Hence, we are relying on the efficient market hypothesis. On the other hand, a notable advantage of the bootstrap approach is that, when dealing with multidimensional time series, we automatically account for cross-sectional dependencies. In the next section, indeed, we move on to consider multivariate distributions, and we shall see that modeling such dependencies is far from trivial, and this explains why bootstrap is often advocated.
Finally, two general remarks are in order about empirical distributions:
In the previous sections we have considered univariate distributions, which provide us with the nuts and bolts for modeling uncertainty. However, the need for modeling dependence is pervasive in finance and economics, and this leads us to consider multivariate distributions. We are interested in two main dimensions of dependence: dependence over time, i.e., in the longitudinal sense, as in a stochastic process; and dependence across different random variables observed at the same time, i.e., in the cross-sectional sense.
Clearly, the decomposition of dependence in two orthogonal dimensions is an oversimplification. In some cases, we really need to consider both dimensions together, which leads to the analysis of panel data. In other cases, the decomposition may be a sensible modeling simplification. For instance, we might want to check the dependence of two bond defaults that do not really occur at the same time instant, but from a modeling point of view they could be considered as happening at the same time (unless we wish to model causal links between them).
In this section we start by considering the structure of multivariate distributions, emphasizing their complexity and the fact that a characterization of marginal distributions does not tell the whole story, unless we assume independence. Then, we deal with standard concepts like covariance and Pearson’s correlation. Showing their potential limitations paves the way for Section 3.4, where we illustrate more advanced ideas revolving around the copula concept. Finally, we recall a couple of multivariate distributions, namely, the normal and Student’s t, which generalize their univariate counterparts.
We know that univariate distributions may be described by the PMF in the discrete case or the PDF in the continuous case. Actually, the CDF is a more general concept, since it is valid in either setting. So, the starting point to get the picture about the dependence of two random variables is their joint CDF.
DEFINITION 3.1 (Joint cumulative distribution function) The joint CDF of random variables X and Y is defined as

FX,Y(x, y) = P{X ≤ x, Y ≤ y}.
The streamlined notation P(X ≤ x, Y ≤ y) is typically used to denote the probability of a joint event.
The joint CDF is a function of two variables, x and y, and it fully characterizes the joint distribution of the random variables X and Y. It is easy to see how the definition can be generalized to n random variables.
The joint CDF is the most general tool to characterize a multivariate distribution. Other concepts may be defined, depending on the character of the involved random variables. In the case of two discrete variables, where X may take values xi, i = 1, 2, 3, …, and Y may take values yj, j = 1, 2, 3, …, we define the joint PMF as

pX,Y(xi, yj) = P{X = xi, Y = yj}.
When dealing with continuous random variables, we may define a joint PDF fX,Y(x, y) such that
P{(X, Y) ∈ C} = ∫∫(C) fX,Y(x, y) dx dy,  for any (reasonable) region C in the plane.  (3.24)
In general, we cannot be sure that such a function exists. To be precise, we should say that the two random variables are jointly continuous when their joint PDF exists. Given the joint PDF, we may find the joint CDF by integration:
FX,Y(x, y) = ∫(−∞..x) ∫(−∞..y) fX,Y(u, v) dv du.  (3.25)
Also the joint PMF and PDF can be defined for n random variables. Needless to say, specifying a function of n variables requires a lot of effort unless one imposes some structure:
We can do without the above machinery in one easy case, i.e., when the random variables are independent. In such a case, the events {X ≤ x} and {Y ≤ y} are independent, and the CDF factors into the product of two terms:

FX,Y(x, y) = P{X ≤ x} · P{Y ≤ y} = FX(x) FY(y).
In general, whatever refers to a single variable, within the context of a multivariate distribution, is called marginal. In case of independence, the joint CDF is just the product of marginal CDFs. The same applies to joint PMFs and PDFs. A first question is: Given the joint CDF FX,Y(x, y), how can we find the marginal CDFs FX(x) and FY(y), pertaining to each individual variable, if we do not assume independence? In principle, the task for the CDF is fairly easy. We obtain the marginal CDFs for the two random variables as follows:

FX(x) = P{X ≤ x, Y < +∞} = FX,Y(x, +∞).
By the same token, FY(y) = FX,Y(+∞, y). In the discrete case, to obtain the marginal PMFs from the joint PMF, we just use the total probability theorem and the fact that the events {Y = yj} are disjoint:

pX(xi) = P{X = xi} = Σ(j) pX,Y(xi, yj).
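With a hypothetical joint PMF stored as a matrix, the marginals are just row and column sums:

```r
# hypothetical joint PMF of X (2 values, rows) and Y (3 values, columns)
pXY <- matrix(c(0.10, 0.20, 0.10,
                0.20, 0.30, 0.10), nrow = 2, byrow = TRUE)
pX <- rowSums(pXY)  # marginal PMF of X: 0.4 0.6
pY <- colSums(pXY)  # marginal PMF of Y: 0.3 0.5 0.2
sum(pXY)            # total probability must be 1
```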
If we want to obtain marginal PDFs from the joint PDF, we may work in much the same way as in the discrete case:

FX(x) = ∫(−∞..x) [∫(−∞..+∞) fX,Y(u, y) dy] du.
If we introduce the marginal density

fX(x) = ∫(−∞..+∞) fX,Y(x, y) dy,
we have

FX(x) = ∫(−∞..x) fX(u) du.
The marginal PDF for Y is obtained in the same way, integrating with respect to x.
We see that, given a joint distribution, we may find the marginals. It is tempting to think that, given the two marginals, we may recover the joint distribution in some way. This is not true, as the marginal distributions do not say anything about the link among random variables.
Consider the discrete-time stochastic process

Xt = ε t,  t = 0, 1, 2, …,  (3.26)
where ε ~ N(0, 1). We immediately see that each Xt is normal with expected value 0 and variance t2. This marginal distribution is the same for the process

Xt = εt t,  t = 0, 1, 2, …,  (3.27)
where εt is again standard normal, but we have one such variable for each time instant. Indeed, Eq. (3.26) describes a rather degenerate process, since uncertainty is linked to the realization of a single random variable. The following R snapshot produces a set of sample paths for the two processes, illustrated in Figs. 3.16 and 3.17:
FIGURE 3.16 A few sample paths of the stochastic process of Eq. (3.26).
FIGURE 3.17 A single sample path of the stochastic process of Eq. (3.27).
set.seed(55555)
len <- 50
times <- 0:len
eps <- rnorm(10)
lims <- c(len*min(eps), len*max(eps))
plot(x=times, y=eps[1]*times, type="l", ylim=lims)
points(x=times, y=eps[1]*times)
for (k in 2:10) {
  lines(x=times, y=eps[k]*times)
  points(x=times, y=eps[k]*times)
}
windows()
eps <- rnorm(300)
path <- (0:300)*c(0,eps)
plot(x=0:300, y=path, type="l")
Note the use of the variable lims, to set the limits of successive plots, and of the function windows(), to open a new window for plotting without overwriting the previous one.
It is easy to see that the two processes have little in common, despite the fact that their marginal PDFs are the same for every time instant.
The example above deals with dependence over time for a monodimensional stochastic process, i.e., in the longitudinal sense. Specifying marginals for each time instant falls short of capturing the structure of a stochastic process. We will make similar considerations later, when dealing with the Wiener process. The next example illustrates the case of cross-sectional dependence between two random variables: We may have quite different joint distributions, sharing the same marginals.
Consider the following joint PDFs (in a moment we will check that they are legitimate densities):
They look quite different, but it is not too difficult to see that they yield the same marginals. The first case is easy:
By symmetry, we immediately see that fY(y) = 1 as well. Hence, the two marginals are two uniform distributions on the unit interval [0, 1]. Now let us tackle the second case. When integrating with respect to y, we should just treat x as a constant:
As before, it is easy to see by symmetry that fY(y) = 1 as well. Again, the two marginals are two uniform distributions, but the links between the two random variables are quite different.
Before closing the example, it is easy to see that fX,Y(x, y) is a legitimate density, as it is never negative and

∫(0..1) ∫(0..1) fX,Y(x, y) dx dy = 1.
Actually, this is just the area of a unit square. Checking the legitimacy of gX,Y (x, y) is a bit more difficult. The easy part is checking that the integral over the unit square is 1. Given the marginals above, we may write
We should also check that the function is never negative; this is left as an exercise.
The two examples point out that there is a gap between two marginals and a joint distribution. The missing link is exactly what characterizes the dependence between the two random variables. This missing link is the subject of a whole branch of probability theory, which is called copula theory; a copula is a function capturing the essential nature of the dependence between random variables, separating it from the marginal distributions. Copulas are described in Section 3.4. In the next section we recall a more classical, simpler, but partial characterization of the links between random variables, based on covariance and correlation.
If two random variables are not independent, we need a way to measure their degree of dependence. This is implicit in the joint distribution, which may be difficult if not impossible to manage and estimate, in general. The simplest measure in this vein is covariance.
DEFINITION 3.2 (Covariance) The covariance between random variables X and Y is defined as

Cov(X, Y) = E[(X − E[X])(Y − E[Y])].
Usually, the covariance between X and Y is denoted by σXY.
Covariance is the expected value of the product of two deviations from the mean, and its sign depends on the signs of the two factors. We have positive covariance when the events {X > E[X]} and {Y > E[Y]} tend to occur together, as well as the events {X < E[X]} and {Y < E[Y]}, because the signs of the two factors in the product tend to be the same. If the signs tend to be different, we have a negative covariance. What covariance tries to capture is whether the two variables tend to be on the same side of their respective means, both above or both below, in which case covariance is positive, or on opposite sides, in which case covariance is negative.
We recall a few properties of covariance, which are easy to check:
Equation (3.28) yields a convenient way of computing covariance. Equation (3.29) shows that covariance is, as the name suggests, a generalization of variance; this is also reflected in the notation σXX = σX2. Equation (3.30) points out an important issue: Covariance is a measure of association, but it has nothing to do with cause–effect relationships. Equation (3.31) shows that covariance, just like variance, is not affected by a deterministic shift. Putting Eqs. (3.32) and (3.33) together, we may prove the following theorem.
THEOREM 3.3 (Variance of a sum of random variables) Given a collection of random variables Xi, i = 1, …, n, we obtain
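As a quick numerical check, note that the identity holds exactly for sample variances and covariances as well, since they share the same bilinear structure; the following sketch (variable names are ours) verifies it on simulated data.

```r
# Check Var(X+Y) = Var(X) + Var(Y) + 2 Cov(X,Y) on sample counterparts
set.seed(42)
Z <- rnorm(100000)
X <- Z + rnorm(100000)   # X and Y share the common component Z,
Y <- Z + rnorm(100000)   # which induces positive covariance
lhs <- var(X + Y)
rhs <- var(X) + var(Y) + 2*cov(X, Y)
all.equal(lhs, rhs)      # TRUE: the identity is exact, not approximate
```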
Covariance is a generalization of variance. Hence, it is not surprising that it shares a relevant shortcoming: Its value depends on how we measure the underlying quantities. To obtain a measure that does not depend on units of measurement, we may consider the covariance between standardized versions of X and Y:
where μX, μY, σX, and σY are the expected values and standard deviations of X and Y, respectively, and we have used properties of covariance. This motivates the definition of the correlation coefficient
DEFINITION 3.4 (Correlation coefficient) The correlation coefficient between random variables X and Y is defined as
To be precise, the above definition corresponds to Pearson’s correlation; later we will see that there are alternative definitions of similar measures. The correlation coefficient is dimensionless and easily interpreted, as it can only take values in the interval [−1, 1]. While we usually do not prove theorems in this book, we make an exception here, as the proof is instructive.
THEOREM 3.5 The correlation coefficient ρXY may only take values in the range [−1, 1]. If ρXY = ±1, then X and Y are related by Y = a + bX, where the sign of b is the sign of the correlation coefficient.
PROOF Consider the following linear combination of X and Y:
We know that variance cannot be negative; hence
This inequality immediately yields ρX,Y ≥ −1. By the same token, consider a slightly different linear combination:
We also know that if Var(Z) = 0, then Z must be a constant. In the first case, variance will be zero if ρ = −1. Then, we may write
for some constant α. Rearranging the equality, we have
This can be rewritten as Y = a + bX, and since standard deviations are non-negative, we see that the slope is negative. Considering the second linear combination yields a similar relationship for the case ρ = 1, where the slope b is positive.
Given the theorem above, it is fairly easy to interpret a specific value of correlation:
A visual illustration of correlation is given in Fig. 3.18. Each scatterplot shows a sample of 100 joint observations of two random variables for different values of correlation.18 The effect of correlation is quite evident if we imagine drawing a line through each cloud of points; the slope of the line corresponds to the sign of the correlation. In the limit case ρ = ±1, the observations would lie exactly on a line.
FIGURE 3.18 Samples of jointly normal variables for different values of the correlation coefficient ρ.
We easily see that if two variables are independent, then their covariance is zero. In fact, independence implies E[XY] = E[X] · E[Y]; this can be shown, e.g., by taking advantage of the factorization of the joint PDF into the product of the two marginals. Then, using the property (3.28), we see that independence implies that covariance is zero. However, the converse is not true in general, as we may see from the following counterexample.
Let us consider a uniform random variable on the interval [−1, 1]; its expected value is zero, and on its support the density function is constant and given by fX(x) = 1/2. Now, define the random variable Y = √(1 − X2).
Clearly, there is a very strong interdependence between X and Y because, given the realization of X, Y is perfectly predictable. However, their covariance is zero! We know from Eq. (3.28) that
but the second term is clearly zero, as E[X] = 0. Furthermore,
because of the symmetry of the integrand function, which is an odd function, in the sense that f(−x) = −f(x). Hence, the first term is zero as well, and the two variables have zero covariance and are uncorrelated. One intuitive way to explain the weird finding of this example is the following. First note that points with coordinates (X, Y) lie on the upper half of the unit circumference X2 + Y2 = 1. But if Y < E[Y], we may have either X > E[X] or X < E[X]. In other words, if Y is above its mean, there is no way to tell whether X will be above or below its mean. This is illustrated in Fig. 3.19. A similar consideration applies when Y > E[Y].
FIGURE 3.19 A counterexample about covariance.
The example shows that covariance is not really a perfect measure of dependence, as it may be zero in cases in which there is a very strong dependence. In fact, covariance is rather a measure of concordance between a pair of random variables, which is related to the signs of the deviations with respect to the mean. Furthermore, covariance measures a linear association between random variables. A strong nonlinear link, as the one in Fig. 3.19, may not be detected at all, or only partially.
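The counterexample is easy to reproduce numerically; assuming Y = √(1 − X2) as above, the sample covariance is negligible despite the perfect functional link.

```r
# Strong dependence, zero covariance: Y is a deterministic function of X
set.seed(55555)
X <- runif(200000, min = -1, max = 1)   # uniform on [-1, 1], E[X] = 0
Y <- sqrt(1 - X^2)                      # points lie on the upper half-circle
cov(X, Y)   # close to zero, up to sampling error
cor(X, Y)
```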
In real life, covariances and correlations must be estimated on the basis of empirical data. We will consider estimation issues in the next chapter, but R offers two basic and easy-to-use functions to estimate covariance and correlation. Not quite surprisingly, they are called cov and cor. To illustrate them, let us generate two independent samples from the standard normal distribution, and combine the observations in order to induce correlation. It is easy to show, and is left as an exercise, that if variables Z1, Z2 ~ N(0, 1) are independent, then the variables X1 and X2 defined as
are still standard normal but have correlation ρ:
> set.seed(55555)
> rho <- 0.8
> Z1 <- rnorm(100000)
> Z2 <- rnorm(100000)
> X1 <- Z1
> X2 <- rho*Z1 + sqrt(1-rho^2)*Z2
> cor(Z1,Z2)
[1] -0.00249354
> cor(X1,X2)
[1] 0.8002211
> cov(X1,X2)
[1] 0.7978528
Note that, since the two variables are standardized, covariance and correlation should coincide; the small discrepancy is due to sampling errors. In the above snapshot, we have invoked cov and cor by passing two vectors as input arguments. By doing so, we obtain covariance and correlation between the corresponding variables. If we are interested in covariance and correlation matrices, we should collect the input data in a matrix:
> X <- as.matrix(cbind(2*X1,-3*X2))
> cov(X)
          [,1]      [,2]
[1,]  3.996719 -4.787117
[2,] -4.787117  8.954152
> cor(X)
           [,1]       [,2]
[1,]  1.0000000 -0.8002211
[2,] -0.8002211  1.0000000
> cov2cor(cov(X))
           [,1]       [,2]
[1,]  1.0000000 -0.8002211
[2,] -0.8002211  1.0000000
Note the use of the function cbind to collect two vectors side by side as columns, and as.matrix to transform the result into a matrix. The function cov2cor transforms a covariance matrix into the corresponding correlation matrix. Once again, we suggest comparing the sample estimates with the theoretical values.
If you try getting online help on cor, you will notice that there is an additional parameter, method = c("pearson", "kendall", "spearman"). By default, we obtain Pearson’s correlations, but there are alternative definitions of correlation, which we discuss later in Section 3.4. To understand the rationale behind them, let us consider the following snapshot:
> set.seed(55555)
> X <- runif(100000)
> cor(X,X)
[1] 1
> cor(X,X^2)
[1] 0.9683232
> cor(X,X^4)
[1] 0.8660849
As expected, the correlation between X and itself is 1, but if we raise X to increasing powers, we decrease correlation. This is due to the nonlinearity in X2 and X4, which is lost in a linear measure of correlation. The larger the power, the less linear the resulting relationship, which is reflected in a diminishing value of Pearson’s correlation. However, we may well argue that, at a deeper level, there is no substantial difference in the dependence between X and itself and between X and X4. We will see that Kendall’s and Spearman’s correlations avoid this issue.19
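We may check this claim directly with the method argument of cor; here is a small sketch, with the sample size reduced since Kendall’s tau is expensive to compute for large samples.

```r
# Rank correlations are invariant under strictly increasing transformations
set.seed(55555)
X <- runif(2000)
p4 <- cor(X, X^4)                        # Pearson: well below 1
s4 <- cor(X, X^4, method = "spearman")   # Spearman: exactly 1 (same ranks)
k4 <- cor(X, X^4, method = "kendall")    # Kendall: exactly 1 as well
c(pearson = p4, spearman = s4, kendall = k4)
```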
In this section we illustrate two commonly used multivariate distributions, namely, the multivariate counterparts of normal and t distributions. They are similar in the sense that:
The essential difference is that, just like the univariate case, the t distribution has fatter tails, which are related to the number of degrees of freedom.
An n-dimensional multivariate normal variable X ~ N(μ, Σ) has PDF
Strictly speaking, this applies to a nonsingular distribution, i.e., when Σ is positive definite. If the covariance matrix is only positive semidefinite, then there is a linear combination of variables, with coefficients w, such that
From a statistical point of view this means that there is some redundancy in the variables, since we may find a linear combination that is constant; from a mathematical point of view this implies that the covariance matrix cannot be inverted and we cannot write the PDF as in Eq. (3.35).
We immediately see that, when μ = 0 and Σ = I, the joint density boils down to
i.e., the product of n independent standard densities. We may obtain a generic multivariate normal from the standard Z ~ N(0, I), and the other way around, by the transformations
where A is a matrix such that Σ = AAᵀ. These formulas are straightforward generalizations of the usual standardization (also called whitening in this context) and destandardization. Possible choices of A are the lower triangular Cholesky factor L of Σ, or its symmetric square root Σ1/2, defined by Σ = Σ1/2Σ1/2; unlike the Cholesky factor, the square root is symmetric.
Let us consider two standard normal variables with correlation coefficient ρ. The covariance matrix and the corresponding lower triangular Cholesky factor are
The reader is urged to check that indeed Σ = LLᵀ, and that using this matrix in Eq. (3.36) generates correlated normals as in Eq. (3.34).
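For example, with ρ = 0.8 the lower triangular factor has first row (1, 0) and second row (ρ, √(1 − ρ2)); the suggested check is a one-liner in R.

```r
# Cholesky factorization of a 2x2 correlation matrix: Sigma = L %*% t(L)
rho <- 0.8
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)
L <- matrix(c(1, rho, 0, sqrt(1 - rho^2)), nrow = 2)  # lower triangular
L %*% t(L)   # recovers Sigma
```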
The following snapshot shows two R functions, chol and sqrtm from package expm to carry out the above factorizations:
> Sigma <- matrix(c(10, 2, 2, 8), nrow=2, ncol=2)
> A <- chol(Sigma)
> t(A) %*% A
     [,1] [,2]
[1,]   10    2
[2,]    2    8
> library(expm)
> sqrtSigma <- sqrtm(Sigma)
> sqrtSigma
          [,1]      [,2]
[1,] 3.1443790 0.3359774
[2,] 0.3359774 2.8084015
> sqrtSigma %*% sqrtSigma
     [,1] [,2]
[1,]   10    2
[2,]    2    8
Note that chol actually returns the upper triangular factor Lᵀ, which is why we compute t(A) %*% A in the snapshot. As we shall see in Chapter 5, these transformations are also used to generate samples of correlated multivariate normals from a set of independent standard variates.
We may generalize the well-known property that a linear combination of (jointly) normals is normal as follows.
PROPERTY 3.6 Let X1, X2, …, Xr be mi-dimensional, independent normal variables, Xi ~ N(μi, Σi), i = 1, …, r. Consider a vector and matrices of suitable dimensions. Then
Later, we will appreciate the usefulness of the following theorem.20
THEOREM 3.7 (Conditional distribution for multivariate normals) Let us consider an n-dimensional multivariate normal variable X ~ N(μ, Σ) and the partition X = [X1, X2], where X1 and X2, have dimensions n1 and n2, respectively, with n1 + n2 = n. Let us define the corresponding partitions
where vectors and matrices have the appropriate dimensions. Then, the conditional distribution of X1, given that X2 = x, is again multivariate normal, where
Consider a bivariate normal distribution consisting of two standard normals Z1 and Z2 with correlation ρ. Then
and by Theorem 3.7 the distribution of Z1 conditional on Z2 = z is normal, with expected value ρz and variance 1 − ρ2.
The last theorem is useful for path generation by the Brownian bridge (Section 6.2.3) and variance reduction by stratification (Section 8.5).
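The bivariate example may also be checked by simulation: regressing Z1 on Z2 should recover a slope close to ρ and a residual variance close to 1 − ρ2. The following is a sketch under our own choice of ρ and sample size.

```r
# Conditional moments of a bivariate standard normal:
# E[Z1 | Z2 = z] = rho*z and Var(Z1 | Z2 = z) = 1 - rho^2
set.seed(55555)
rho <- 0.6
n <- 200000
Z2 <- rnorm(n)
Z1 <- rho*Z2 + sqrt(1 - rho^2)*rnorm(n)
fit <- lm(Z1 ~ Z2)
unname(coef(fit)[2])     # close to rho
var(residuals(fit))      # close to 1 - rho^2
```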
The multivariate t distribution is characterized by the PDF
With respect to the multivariate normal, there is an additional parameter, ν, corresponding to the degrees of freedom, and the notation tν(μ, Σ) is used. The essential moments are
We notice that when ν → ∞, the covariance matrix tends to Σ, just like in the univariate case. More generally, the PDF of the multidimensional t tends to the PDF of a multivariate normal when ν → ∞. The matrix Σ, in this case, is referred to as a scale parameter, whereas ν is a shape parameter. The most relevant difference between the multivariate normal and t distributions is seen in the tails, which are influenced by ν. If we set aside the complicated normalization factors involving the gamma function and look inside the big square brackets in Eq. (3.37), we notice that the level curves of the PDF are essentially determined by the same term as in the normal case:
If we fix the value of this quantity and draw the corresponding level curve, we obtain an ellipse, which explains why we speak of elliptical distributions. Both distributions are symmetrical with respect to the expected value.
The mvtnorm package offers functions to calculate the PDF and to generate random samples from multivariate normal and t distributions:
The corresponding r functions for random variate generation are rmvnorm and rmvt; there are also q and p functions, but they are of course more problematic than the corresponding univariate functions. The script in Fig. 3.20 shows how to use R functions to plot the logarithm of the density of these two distributions. The resulting contour plots are displayed in Fig. 3.21. We should not be surprised by the negative values, since we are displaying log-densities to make the plots more readable. We immediately observe the elliptic character of both densities and the effect of positive correlation. The most striking difference, as we have remarked, is in the fatter tails of the t distribution.
FIGURE 3.20 Script to illustrate the mvtnorm package.
FIGURE 3.21 Contour plots of the log-density of multivariate normal and t distributions.
The idea behind copulas is to factor an n-dimensional multivariate distribution into two components:
The idea is to express the n-dimensional CDF as
Indeed, since C(···) must map a set of marginal probabilities into another probability, it must be a function mapping [0, 1]n into [0, 1]. However, there are additional requirements that we detail below, referring to a two-dimensional copula for the sake of simplicity.
Consider a two-dimensional copula C(u1, u2). A reasonable requirement is that, for any u1, u2 ∈ [0, 1],
This amounts to saying that if u1 = P{X1 ≤ x1} = 0, then FX(x1, x2) = 0 for whatever value of x2. In fact, below the support of the distribution of X1, the CDF must be zero; or, if the support is unbounded below, the limit for x1 → −∞ must be 0. A similar consideration applies to the second variable x2, as well as to any variable in a multidimensional case. This leads to the following general definition.
DEFINITION 3.8 (Grounded function) Consider a function H(···) of n variables, with domain S1 × ··· × Sn, where each set Sk has a smallest element ak. We say that H is grounded if H(y1, …, yn) = 0, whenever at least one yk = ak.
Thus, we have to require that a copula is a grounded function. In the case of a copula, Sk = [0, 1] and ak = 0. After looking below the support, let us move to the opposite side and have a look above it. There, a CDF must take the value 1; or, if the support is unbounded above, the limit for xk → +∞ must be 1. More generally, the copula should have sensible margins.
DEFINITION 3.9 (Margins) Consider a function H(···) of n variables with domain S1 × ··· × Sn, where each set Sk has a largest element bk. The one-dimensional margin of H corresponding to the kth variable is the function
It is possible to define higher-dimensional margins, but we will simply refer to one-dimensional margins as margins.
On the basis of this definition, we see that a copula C(···) should have margins Ck(·) such that
As a last requirement, consider a two-dimensional CDF and the region illustrated in Fig. 3.22. The darkest rectangle in the figure is the set [a1, b1] × [a2, b2]. Let us denote this set by S; given the CDF FX(x1, x2), it is easy to see that
FIGURE 3.22 Illustrating a property of multidimensional CDFs.
Since this is a probability, the expression on the right must be positive. This leads to the following definition.
DEFINITION 3.10 (2-increasing function) Consider a function H(·, ·) of two variables, and let S = [a1, b1] × [a2, b2] be a rectangle with all vertices in the domain of H(·, ·). We define the H-volume of S as
We say that the function is 2-increasing if VH(S) ≥ 0 for any such rectangle.
The definition may be extended to n-dimensional functions, but we omit the technicalities involved.21 It should be noted that the 2-increasing property does not imply and is not implied by the fact that the function is nondecreasing in each of its arguments. This is well illustrated by the following counterexamples.22
Consider the function
defined on I2 = [0, 1]2. It is immediate to check that the function is nondecreasing in x1 and x2. However, it is not 2-increasing, as the H-volume of I2 is
On the contrary, let us consider
defined on I2 = [0, 1]2. Clearly, the function is decreasing in x1 for x2 < 0.5, and it is decreasing in x2 for x1 < 0.5. Nevertheless, it is 2-increasing. To see this, consider the rectangle S = [a1, b1] × [a2, b2], where b1 ≥ a1 and b2 ≥ a2, and compute its G-volume:
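The volume computation is mechanical and easy to code. In the sketch below, the first function mirrors the first counterexample; the second is a function with the same qualitative properties as the second counterexample (our own choice, G(x1, x2) = (2x1 − 1)(2x2 − 1), not necessarily the one in the text).

```r
# H-volume of a rectangle S = [a1,b1] x [a2,b2] for a bivariate function H
Hvol <- function(H, a1, b1, a2, b2) {
  H(b1, b2) - H(b1, a2) - H(a1, b2) + H(a1, a2)
}
Hmax <- function(x1, x2) pmax(x1, x2)  # nondecreasing, but not 2-increasing
Hvol(Hmax, 0, 1, 0, 1)                 # H-volume of the unit square is -1
G <- function(x1, x2) (2*x1 - 1)*(2*x2 - 1)  # not monotone, but 2-increasing
Hvol(G, 0, 1, 0, 1)                    # G-volume: 4*(b1-a1)*(b2-a2) >= 0
```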
All of the above considerations are summarized in the following definition.
DEFINITION 3.11 (Copula) An n-dimensional copula is a function C(···), defined on the domain In = [0, 1]n, such that:
We have stated requirements such that, given an n-dimensional distribution with marginal CDFs Fk, the function
is a sensible CDF. One might wonder whether we can also go the other way around: Given a joint CDF, is it possible to decompose it in the form of Eq. (3.38)? The answer is positive, thanks to the following fundamental theorem.
THEOREM 3.12 (Sklar’s theorem) Let FX be an n-dimensional CDF with continuous marginal CDFs F1(·), …, Fn(·). Then, there exists a unique copula C(···) such that
Note that the above theorem statement asserts uniqueness for continuous distributions. In fact, Sklar’s theorem also states the existence of a copula for discrete distributions, but there are technical issues concerning uniqueness that we want to avoid here. Since we are just dealing with continuous distributions, we assume invertibility of marginal CDFs and rewrite Eq. (3.38) as
(3.39)
When looking at this equation, it is important to realize that if we evaluate a marginal CDF Fk(x) for x = Xk, where Xk is a random variable characterized by that CDF Fk, then Fk(Xk) is a uniform random variable. To see this, note that, assuming invertibility of the margin and taking u ∈ [0, 1],
However, this is just the CDF of a uniform distribution. This fact plays a fundamental role in random variate generation.23 Indeed, having defined the general concept of a copula, the following tasks must be typically carried out:
We leave the second and the third task to Chapters 4 and 5, respectively. Here we describe a few common copula functions as examples.
The product copula is defined as
(3.41)
It turns out, not surprisingly, that the continuous random variables in the vector (X1, …, Xn) are independent if and only if their copula is the product copula.
The normal (or Gaussian) copula is obtained by using the CDF of standard normals in Eq. (3.40). Let Φ be the CDF of a standard normal and ΦRn be the joint CDF for an n-dimensional standard normal distribution with correlation matrix R. Then, we may define the normal copula function
(3.42)
Note that this copula cannot be expressed analytically; nevertheless, it is possible to sample from the normal copula.
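Although no analytical expression is available, sampling from the normal copula is easy: We generate correlated standard normals and apply Φ componentwise, which is just the probability integral transform mentioned above. A minimal sketch:

```r
# Sampling from a Gaussian copula: push correlated normals through pnorm;
# the resulting U1, U2 have uniform marginals and normal-copula dependence
set.seed(55555)
rho <- 0.7
n <- 100000
Z1 <- rnorm(n)
Z2 <- rho*Z1 + sqrt(1 - rho^2)*rnorm(n)
U1 <- pnorm(Z1)
U2 <- pnorm(Z2)
c(mean(U1), mean(U2))   # both close to 0.5, as for uniforms on [0,1]
```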
The t copula is built much like the normal copula, on the basis of the CDF of a t distribution with ν degrees of freedom:
(3.43)
where tν is the CDF of a t distribution with ν degrees of freedom and tν,R is the CDF of a multivariate t distribution with ν degrees of freedom and correlation matrix R.
We have already illustrated a few pitfalls of Pearson’s correlation:
This has led to the proposal of alternative correlation coefficients.
DEFINITION 3.13 (Kendall’s tau) Consider independent and identically distributed random vectors X = [X1, X2] and Y = [Y1, Y2]. Kendall’s tau is defined as
In plain words, Kendall’s tau measures the probability of concordance minus the probability of discordance. Note that in covariance, and as a consequence in defining Pearson’s correlation, we take an expected value, which is influenced by the numerical values attained by the random variables. Here we are using a probability in a way that captures some more essential features. This is essentially why tau is not sensitive to monotonic transformations.
DEFINITION 3.14 (Spearman’s rho) Consider a joint CDF Fx(x1, x2), with marginals F1(·) and F2 (·). Spearman’s rho is defined as the Pearson’s correlation between the transformed random variables F1(X1) and F2(X2):
(3.44)
To get an intuitive grasp of the definition, let us recall that F1 (X1) and F2(X2) are uniformly distributed. Hence, in the definition of ρS we pick the two random variables and map them into uniform variables through their CDF, which is a monotonic transformation into the unit interval [0, 1]. Hence, we are actually correlating the ranks of the realizations of the two variables. Indeed, Spearman’s rho is also known as rank correlation. We should mention that sample counterparts have been defined for both Kendall’s tau and Spearman’s rho, and that there is a deep connection between them and copulas, as well as alternative definitions.24
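The fact that Spearman’s rho is a rank correlation can be checked directly: Computing Pearson’s correlation on the ranks reproduces it exactly, in the absence of ties.

```r
# Spearman's rho = Pearson's correlation applied to ranks
set.seed(55555)
X <- rnorm(5000)
Y <- X + rnorm(5000)
rho_s <- cor(X, Y, method = "spearman")
rho_r <- cor(rank(X), rank(Y))   # Pearson on the ranks
all.equal(rho_s, rho_r)          # TRUE
```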
The script in Fig. 3.23 illustrates the fact that Spearman’s rho and Kendall’s tau are not influenced by certain transformations of the underlying random variables. For our illustration purposes, we are using their sample counterparts here. We generate a sample of standard normal variables X1 and X2, with correlation 0.8. Then we take the exponential of the first sample, F1 = eX1, and we apply the cor function to estimate different correlations. By the way, in R there is no need to specify the pearson method, as it is the default; hence cor(X1,X2,method="pearson") and cor(X1,X2) are equivalent. Running the script, we obtain
FIGURE 3.23 Illustrating invariance properties of Spearman’s rho and Kendall’s tau.
> set.seed(55555)
> tryCorrelation()
Correlations
Pearson:  (X1,X2)= 0.8091203   (F1,X2)= 0.6192021
Spearman: (X1,X2)= 0.7952385   (F1,X2)= 0.7952385
Kendall:  (X1,X2)= 0.6000901   (F1,X2)= 0.6000901
We observe that Pearson’s correlation for the transformed variable is smaller: This is due to the nonlinear nature of the exponential transformation, which in this case weakens the linear relationship between the two random variables. The other two correlation measures are not affected.
A relevant feature of multivariate distributions, especially for risk management, is tail dependence. The idea is focusing on the joint behavior of random variables when extreme events occur. For the sake of simplicity, we deal here with the upper tail dependence for two random variables X1 and X2; the case of lower tail dependence is similar.
DEFINITION 3.15 (Upper tail dependence) Let us consider two jointly continuous random variables X1 and X2 with marginal CDFs F1(·) and F2(·). The coefficient of upper tail dependence is defined as
provided that the limit exists. If λU = 0, then the two random variables are said to be asymptotically independent in the upper tail; otherwise, they are said to be asymptotically dependent in the upper tail.
Since u is reaching its upper limit in the definition, the coefficient λU captures the joint behavior of X1 and X2 in the limit of the upper-right (North–East) quadrant. It turns out that this property is related to the copula, which means that it is invariant for strictly increasing transformations of the random variables.
Let us recall the following formula from elementary probability:
(3.46)
where Ē is the complement of event E. It is easy to see that the probability in Eq. (3.45) can be rewritten as
(3.47)
Therefore, using the copula C(·, ·) between X1 and X2 and assuming that the limit exists, we may rewrite the above probability as
(3.48)
An interesting feature of the normal copula is that, when the correlation coefficient ρ is strictly less than 1, it can be shown that λU = 0; hence, we have tail independence.25 It is often argued that this fact points out another limitation of the multivariate normal distribution for risk management.
Linear regression models are traditionally introduced within an inferential statistics framework. We will see how R can be used to accomplish classical estimation by ordinary least squares in Section 4.4. However, a probabilistic view is also possible, where we consider the problem of approximating a random variable Y by an affine transformation a + bX of another random variable X. This is quite useful in order to better appreciate:
The key point is how to define a measure of distance between Y and a + bX. One possibility is to consider the variance of their difference. If this variance is close to zero, we may argue that the approximation tracks its target pretty well. If we do so and minimize tracking error variance, we formulate the following distance:
which is minimized with respect to b by setting
We immediately see one issue with the above choice: The coefficient a plays no role. A reasonable choice is to select a such that the approximation is unbiased:
The reader familiar with linear regression in inferential statistics will find the above formulas quite familiar; in fact, least-squares estimators of the coefficients are just sample counterparts of the expressions in Eqs. (3.49) and (3.50). We may also set the two coefficients together, in a more convincing way, by minimizing the mean square deviation, rather than variance:
Using the above calculation and a well-known property of variance, the objective function may be rewritten as follows:
Now we enforce the optimality conditions, taking partial derivatives with respect to a and b, respectively. The first condition,
yields the result in Eq. (3.50), whereas the second one is
which yields the coefficient in Eq. (3.49).
Within this framework, it is also easy to see that we obtain a kind of “orthogonal projection,” in the sense that the approximation a + bX and the residual Y − (a + bX) are uncorrelated:
Thus, the optimal approximation uses the maximum information about Y that can be included in bX; the residual variability is orthogonal.
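The whole construction is easy to verify with sample counterparts: Estimate b = Cov(X, Y)/Var(X) and a = E[Y] − bE[X], and check that the residual is (exactly) uncorrelated with the approximation. The data-generating process below is our own choice.

```r
# Best affine approximation of Y by a + b*X in the mean-square sense
set.seed(55555)
X <- rnorm(10000, mean = 2)
Y <- 3 + 1.5*X + rnorm(10000)     # true intercept 3, slope 1.5
b <- cov(X, Y)/var(X)             # sample counterpart of Eq. (3.49)
a <- mean(Y) - b*mean(X)          # sample counterpart of Eq. (3.50)
resid <- Y - (a + b*X)
cov(a + b*X, resid)               # zero up to floating-point error
```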
The previous section was mainly aimed at cross-sectional dependence, i.e., the relationship between different random variables observed at the same time. In this section we move on to longitudinal dependence and consider discrete-time models based on time series. Time series models are an important topic within econometrics, and here we can only afford to scratch the surface. In doing so, we have the following objectives:
In the next chapter we will see some more R functions to estimate time series models. Nevertheless, the tone in this section is a bit different from the rest of the chapter. The theory behind time series is far from trivial, and we will rely on simple simulations to get an intuitive feeling for them. Furthermore, some concepts we use here, such as the analysis of autocorrelation, are a bit halfway between probability and statistics. We include them here as they are part of the toolbox for model building. In fact, applying time series requires the following steps:
We cover here the first task, and estimation methods are deferred to the next chapter. After a few preliminary considerations, we will move on to consider two fundamental classes of time series models:
Then, in Section 3.6.3 we will see how these modeling approaches can be integrated in the more general class of ARIMA (autoregressive integrated moving average) processes, also known as Box–Jenkins models. The next step, in Section 3.6.4, is to generalize a bit, moving from scalar time series to multidimensional vector autoregressive models. Finally, in Section 3.6.5 we will present some more specific model classes, like ARCH and GARCH, to represent stochastic volatility. These models are quite relevant for financial applications. As we shall see, time series modeling offers many degrees of freedom, maybe too many; when observing a time series, it may be difficult to figure out which type of model is best-suited to capture the essence of the underlying process. This is why model identification plays a pivotal role.
In its basic form, time series theory deals with weakly stationary processes, i.e., time series Yt with the following properties, related to first- and second-order moments:
The second condition deserves some elaboration.
DEFINITION 3.16 (Autocovariance and autocorrelation) Given a weakly stationary stochastic process Yt, the function
is called autocovariance of the process with time lag k. The function
is called the autocorrelation function (ACF).
The definition of autocorrelation relies on the fact that variance is constant as well:
In practice, autocorrelation may be estimated by the sample autocorrelation function (SACF), given a sample path Yt, t = 1, …, T:
where Ȳ is the sample mean of Yt. The expression in Eq. (3.51) may not look quite convincing, since the numerator and the denominator are sums involving a different number of terms. In particular, the number of terms in the numerator is decreasing in the time lag k. Thus, the estimator looks biased and, for a large value of k, Rk will vanish. However, this is what one expects in real life. Furthermore, although we could account for the true number of terms involved in the numerator, for large k the sum involves very few terms and is not reliable. Indeed, the form of sample autocorrelation in Eq. (3.51) is what is commonly used in statistical software packages, even though alternatives have been proposed.26 If T is large enough, under the null hypothesis that the true autocorrelations ρk are zero for k ≥ 1, the statistic
is approximately standard normal. Since z0.975 = 1.96 ≈ 2, a commonly used approximate rule states that if
then the sample autocorrelation at lag k is statistically significant. For instance, if T = 100, autocorrelations outside the interval [−0.2, 0.2] are significant. We should keep in mind that this is an approximate result, holding for a large number T of observations. We may plot the sample autocorrelation function at different lags, obtaining an autocorrelogram that can be most useful in pointing out hidden patterns in data.
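The rule of thumb is easy to test on pure noise: With T = 100 observations of white noise, sample autocorrelations should fall outside the ±2/√T ≈ ±0.2 bands only about 5% of the time.

```r
# Approximate significance bands for the SACF of white noise
set.seed(55555)
T <- 100
Y <- rnorm(T)                                   # Gaussian white noise
r <- acf(Y, lag.max = 20, plot = FALSE)$acf[-1] # drop the trivial lag 0
band <- 2/sqrt(T)                               # approximate 95% bands
sum(abs(r) > band)   # typically 0, 1, or 2 spurious "significant" lags
```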
Many time series in economics display seasonality. Consider the time series displayed on the left in Fig. 3.24. A cursory look at the plot does not suggest much structure in the data, but we may try to get a clue by plotting the ACF on the right. However, the ACF does not look quite helpful either. We notice decreasing autocorrelations, many of which are significant; the plot displays two horizontal bands outside which autocorrelation is statistically significant. Notice that
FIGURE 3.24 A seasonal time series and its autocorrelogram.
and the positioning of the two horizontal bands is consistent with Eq. (3.52). In fact, the time series is seasonal and has been generated by simulating the model
which features an additive seasonal component St superimposed on a linear trend a + bt, plus a noise term εt. The following R code shows how such a model can be simulated, and how the acf function can be used to plot the SACF:
set.seed(55555)
t <- 1:40
factors <- c(15, 30, -10, -35)
Y <- 100 + 10*t + factors[t%%4 + 1] + rnorm(length(t), 0, 20)
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
The seasonal factors are collected in the vector (15, 30, −10, −35); note that the average factor, for additive models, is zero, and that the indexing t%%4+1 repeats the four factors in a cycle over the simulation horizon; we recall that the operator %% corresponds to the mathematical operator mod and yields the remainder of the integer division. The noise terms εt are normally distributed, which need not be the case in general. The problem actually lies in the trend. We should detrend the series in order to see more structure, which can be accomplished by linear regression:
mod <- lm(Y~t)
dY <- Y - mod$fitted
plot(t, dY, type="l")
points(t, dY, pch=15)
acf(dY)
par(mfrow=c(1,1))
The first statement above uses lm to estimate a linear regression model, relating variable Y with time t; the model is assigned to the variable mod, which is a rich data structure including fitted values, among other things. By subtracting the fitted trendline, we detrend the time series. The new ACF, displayed in Fig. 3.25, shows a quite different pattern: the autocorrelogram is indeed a useful tool to spot seasonality and other hidden patterns in data.
FIGURE 3.25 Detrended seasonal time series and its autocorrelogram.
In the above example, we have used an important building block in time series modeling, called white noise, which is a sequence of i.i.d. (independent and identically distributed) random variables. If they are normal, we have a Gaussian white noise. In other models, noise may be “colored,” i.e., there may be some autocorrelation between the driving random terms εt. As we have mentioned, the first step in modeling a time series is the identification of the model structure. In the following we outline a few basic ideas that are used to this aim, pointing out the role of autocorrelation in the analysis.
A finite-order moving-average process of order q, denoted by MA(q), can be expressed as
Yt = μ + εt − θ1εt−1 − θ2εt−2 − ⋯ − θqεt−q,
where the random variables εt are white noise, with E[εt] = 0 and Var(εt) = σ². These variables play the role of random shocks and drive the process. It is fairly easy to see that the process is weakly stationary. In fact, a first observation is that expected value and variance are constant:
The calculation of autocovariance is a bit more involved, but we may take advantage of the fact that white noise is serially uncorrelated:
As a consequence, the autocorrelation function is
Thus, the autocorrelation function depends only on the lag k. We also notice that the autocorrelation function cuts off for lags larger than the order of the process. This makes sense, since the process Yt is a moving average of the driving white noise εt. Hence, by checking whether the sample autocorrelation function cuts off after a time lag, we may figure out whether a time series can be modeled as a moving average, as well as its order q. Of course, the sample autocorrelation will not be exactly zero for k > q; nevertheless, by using the autocorrelogram and its significance bands, we may get some clue.
Let us consider the simple MA(1) process
Yt = 40 + εt + 0.8εt−1,
where εt is a sequence of uncorrelated standard normal variables (Gaussian white noise). In Fig. 3.26 we show a sample path and the corresponding sample autocorrelogram obtained by the following R code (note how vectorization avoids a for loop):
FIGURE 3.26 Sample path and corresponding sample ACF for the moving-average process Yt = 40 + εt + 0.8εt−1.
set.seed(55555)
t <- 1:100
epsVet <- rnorm(101)
Y <- 40 + epsVet[2:101] + 0.8*epsVet[1:100]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
Incidentally, in this section we are writing our own code to simulate ARMA models for illustrative purposes, but R offers a ready-to-use function arima.sim, which will be used later in Section 4.5. The sample autocorrelation looks significant at time lag 1, which is expected, given the nature of the process. Note that, by applying Eq. (3.53), we find that the autocorrelation function, for a MA(1) process Yt = μ + εt − θ1εt−1, is
ρ1 = −θ1 / (1 + θ1²),   (3.54)
ρk = 0,  k ≥ 2.   (3.55)
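These formulas are easily checked by simulation. The following sketch (ours, for illustration) uses a long sample path of the process Yt = 40 + εt + 0.8εt−1 considered above; with this + sign convention, the lag-1 autocorrelation is θ1/(1 + θ1²):

```r
theta1 <- 0.8
rho1 <- theta1 / (1 + theta1^2)   # theoretical lag-1 autocorrelation, about 0.488
set.seed(55555)
n <- 100000
eps <- rnorm(n + 1)
Y <- 40 + eps[2:(n + 1)] + theta1 * eps[1:n]
r <- acf(Y, lag.max = 2, plot = FALSE)$acf
c(r[2], rho1)   # sample vs. theoretical value at lag 1
r[3]            # lag 2: should be close to zero (cutoff)
```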
Figure 3.27 shows the sample path and autocorrelogram of the slightly different MA(1) process
Yt = 40 + εt − 0.8εt−1
(the R code is essentially the same as above).
FIGURE 3.27 Sample path and corresponding ACF for the moving-average process Yt = 40 + εt − 0.8εt−1.
The change in sign in θ1 has a noticeable effect on the sample path; an upswing tends to be followed by a downswing, and vice versa. The autocorrelogram shows a cutoff after time lag 1, and a negative autocorrelation.
If we increase the order of the process, we should expect more significant autocorrelations. In Fig. 3.28, we repeat the exercise for the MA(2) process
FIGURE 3.28 Sample path and corresponding ACF for the moving-average process Yt = 40 + εt + 0.9εt−1 + 0.5εt−2.
using the following R code:
set.seed(55555)
t <- 1:100
epsVet <- rnorm(102)
Y <- 40 + epsVet[3:102] + 0.8*epsVet[2:101] + 0.6*epsVet[1:100]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
We observe that, in this case, the autocorrelation function cuts off after time lag k = 2.
We should mention that sample autocorrelograms are a statistical tool. It may well be the case that, for the moving-average processes in the examples, we get different pictures when different sample paths are randomly generated. This is a useful experiment to carry out with the help of statistical software.
In finite-order moving-average processes, only a finite number of past realizations of white noise influence the value of Yt. This may be a limitation for those processes in which all of the previous realizations have an effect, even though this possibly fades in time. In principle, we could consider an infinite-order moving-average process, but having to deal with an infinite sequence of coefficients does not sound quite practical. Luckily, under some technical conditions, such a process may be rewritten in a compact form involving time-lagged realizations of the process Yt itself. This leads us to the definition of an autoregressive process of a given order. The simplest such process is the autoregressive process of order 1, AR(1):
Yt = δ + ϕYt−1 + εt.   (3.56)
One could wonder under which conditions this process is stationary, since here we cannot use the same arguments as in the moving-average case. A heuristic argument to find the expected value μ = E[Yt] is based on taking expectations and dropping the time subscript in Eq. (3.56):
μ = δ + ϕμ  ⇒  μ = δ/(1 − ϕ).
The argument is not quite correct, as it yields a sensible result only if the process is indeed stationary, which is the case if |ϕ| < 1. Otherwise, intuition suggests that the process will grow without bounds. The reasoning can be made precise by using the infinite-term representation of Yt, which is beyond the scope of this book. Using the correct line of reasoning, we may also prove that
In particular, we have
and we may also observe that, for a stationary AR(1) process,
ρk = ϕ^k,  k = 0, 1, 2, ….   (3.57)
We notice that the autocorrelation decays with increasing lag, but it fades away with no sharp cutoff.
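Again, we may check Eq. (3.57) numerically; the snippet below (our own sketch) simulates a long stationary AR(1) path with ϕ = 0.8 and compares the sample ACF with the geometric decay ϕ^k:

```r
set.seed(42)
phi <- 0.8
n <- 100000
eps <- rnorm(n)
Y <- numeric(n)
for (t in 2:n) Y[t] <- phi * Y[t-1] + eps[t]      # zero-mean stationary AR(1)
r <- acf(Y, lag.max = 5, plot = FALSE)$acf[-1]    # sample ACF at lags 1..5
theory <- phi^(1:5)                               # rho_k = phi^k: no sharp cutoff
rbind(sample = as.vector(r), theory = theory)
```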
In Figs. 3.29 and 3.30, we show a sample path and the corresponding sample autocorrelogram for the two AR(1) processes
FIGURE 3.29 Sample path and corresponding ACF for the autoregressive process Yt = 8 + 0.8Yt−1 + εt.
FIGURE 3.30 Sample path and corresponding ACF for the autoregressive process Yt = 8 − 0.8Yt−1 + εt.
respectively. The R code cannot avoid a for loop in this case:
set.seed(55555)
t <- 1:100
epsVet <- rnorm(100)
Y <- numeric(100)
Y[1] <- 40   # start from the process mean 8/(1 - 0.8) = 40
for (k in 2:100) Y[k] <- 8 + 0.8*Y[k-1] + epsVet[k]
par(mfrow=c(1,2))
plot(t, Y, type="l")
points(t, Y, pch=15)
acf(Y)
Notice that the change in sign in the ϕ coefficient has a significant effect on the sample path, as well as on autocorrelations. In the first case, autocorrelation goes to zero along a relatively smooth path. This need not be the case, as we are working with sample autocorrelations. Nevertheless, at least for significant values, we observe a monotonic behavior consistent with Eq. (3.57). The sample path of the second process features evident up- and downswings; we also notice an oscillatory pattern in the autocorrelation.
The autocorrelation behavior of AR processes does not feature the cutoff properties that help us determine the order of an MA process. A tool that has been developed for the identification of AR processes is the partial autocorrelation function (PACF). The rationale behind PACF is to measure the degree of association between Yt and Yt−k, removing the effects of intermediate lags, i.e., Yt−1, …, Yt−k+1. We cannot dwell too much on PACF, but we may at least get a better intuitive feeling as follows. Consider three random variables X, Y, and Z, and imagine regressing X and Y on Z:
Note that we are considering a probabilistic regression, not a sample-based regression. From Section 3.5 we know that
Furthermore, we have regression errors
which may be regarded as the random variables X and Y, after the effect of Z is removed. The correlation ρ(X, Y) may be large because of the common factor Z (which plays the role of a “lurking” variable). If we want to get rid of it, we may consider the partial correlation ρ(X*, Y*). On the basis of this intuition, we might consider estimating the partial autocorrelation between Yt and Yt−k by the following linear regression:
The inclusion of intermediate lagged variables Yt−1, …, Yt−k+1 allows us to capture their effect explicitly through the regression coefficients b1, …, bk−1. Then, we could use bk as an estimate of the partial autocorrelation. Actually, this need not be the soundest approach, but software packages provide us with ready-to-use functions to estimate the PACF by its sample counterpart. In Fig. 3.31 we show the SPACF for the two AR(1) processes of Example 3.12. The plot was obtained by the pacf function. The example also shows how to use expression to build a mathematical annotation for the plot, including subscripts, etc.:27
FIGURE 3.31 Sample partial autocorrelation function for the autoregressive processes of Example 3.12.
expr1 <- expression(Y[t] == 8 + 0.8 %*% Y[t-1] + epsilon[t])
pacf(Y, main = expr1)
We see that the SPACF cuts off after lag 1, even though, due to statistical sampling errors, it seems that there is a significant value at larger lags in the first case. This suggests how to use this function to identify the order of an AR model.
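The regression-based idea sketched above can be compared directly with the output of pacf. The following snippet (ours; pacf itself computes its values from the sample ACF, not by running regressions) estimates the lag-2 partial autocorrelation of an AR(1) process in both ways; both estimates should be close to zero:

```r
set.seed(55555)
n <- 5000
eps <- rnorm(n)
Y <- numeric(n)
Y[1] <- 40                          # start at the process mean 8/(1 - 0.8)
for (t in 2:n) Y[t] <- 8 + 0.8 * Y[t-1] + eps[t]
# regress Y_t on lags 1 and 2; the coefficient of lag 2 estimates the
# lag-2 partial autocorrelation, which vanishes for an AR(1) process
y0 <- Y[3:n]; y1 <- Y[2:(n-1)]; y2 <- Y[1:(n-2)]
b2 <- unname(coef(lm(y0 ~ y1 + y2))["y2"])
p2 <- pacf(Y, lag.max = 2, plot = FALSE)$acf[2]
c(b2, p2)   # both should be close to zero
```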
Autoregressive and moving-average processes may be merged into ARMA (autoregressive moving-average) processes like
The model above is referred to as ARMA(p, q) process, for self-explanatory reasons. Conditions ensuring stationarity have been developed for ARMA processes, as well as identification and estimation procedures. Clearly, the ARMA modeling framework provides us with plenty of opportunities to fit historical data with a model that can be used to create scenarios by Monte Carlo sampling. However, it applies only to stationary data. It is not too difficult to find real-life examples of data processes that are nonstationary. Just think of stock market indices; most investors really wish that the process is not stationary.
A quite common building block in many financial models is the random walk. An example of random walk is
Yt = Yt−1 + 0.05ηt,   (3.59)
where ηt is a sequence of independent and standard normal random variables. This is actually an AR process, but from Section 3.6.2 we know that it is nonstationary, as ϕ = 1. Three sample paths of this process are shown in Fig. 3.32. They look quite different, and a subjective comparison of the sample paths would not suggest that they are realizations of the same stochastic process. However, the ACFs show a common pattern: Autocorrelation fades out very slowly. Indeed, this is a common feature of nonstationary processes. The PACFs display a very strong partial autocorrelation at lag 1, which cuts off immediately in all of the three cases.
FIGURE 3.32 Three sample paths and the corresponding ACF and PACF for the random walk Yt = Yt−1 + 0.05ηt.
Since the theory of stationary MA and AR processes is well developed, it would be nice to find a way to apply it to nonstationary processes as well. A commonly used trick to remove nonstationarity in a time series is differencing, by which we consider the time series
Y′t = Yt − Yt−1.   (3.60)
In R, differencing is carried out by the diff function, which is applied in the following snapshot to the random walk of Eq. (3.59):
set.seed(5)
epsVet <- c(0, rnorm(100))   # first value is dummy
Y <- c(4, numeric(100))
for (t in 2:101) Y[t] <- Y[t-1] + 0.05*epsVet[t]
dY <- diff(Y)
par(mfrow=c(2,2))
plot(1:100, Y[-1], type="l")
points(1:100, Y[-1], pch=15)
acf(Y)
plot(1:100, dY, type="l")
points(1:100, dY, pch=15)
acf(dY)
In Fig. 3.33 we show the sample path and the differenced series, along with their ACFs. The shape of the ACF is not surprising, since the differenced process is just white noise.
FIGURE 3.33 The effect of differencing on the simple random walk of Eq. (3.59).
A time series with trend, such as
Yt = a + b·t + εt,   (3.61)
where εt is white noise, is clearly nonstationary, as it features a deterministic trend. However, nonstationarity is a subtler concept than that. A little digression is in order to clarify the nature of nonstationarity in a random walk like
Yt = Yt−1 + εt.   (3.62)
The sample paths that we have generated in Example 3.13 show that this random walk does not feature a deterministic trend. The recursive unfolding of Eq. (3.62) yields
Therefore, we find that
Hence, we must have a different kind of nonstationarity in the random walk of Eq. (3.62) than in the process described by Eq. (3.61). To investigate the matter, let us consider the expected value of the increment Y′t = Yt − Yt−1, conditional on Yt−1:
Therefore, given the last observation Yt−1, we cannot predict whether the time series will move up or down. This shows a connection between the random walk and the efficient market hypothesis. Now, let us consider a stationary AR(1) process
where ϕ ∈ (−1, 1). The increment in this case is
Since (ϕ − 1) < 0, we have
This suggests that a stationary AR(1) process is mean reverting, in the sense that the process tends to return to its expected value; the nonstationary random walk does not enjoy this property.
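The difference can be visualized by simulation. In the following sketch (ours, with unit-variance Gaussian white noise), the terminal values of many stationary AR(1) paths have variance close to 1/(1 − ϕ²), whereas the variance of a random walk keeps growing linearly with the number of steps:

```r
set.seed(1)
nrep <- 2000; nstep <- 200; phi <- 0.8
ar_end <- numeric(nrep); rw_end <- numeric(nrep)
for (i in 1:nrep) {
  eps <- rnorm(nstep)
  ar <- 0; rw <- 0
  for (t in 1:nstep) {
    ar <- phi * ar + eps[t]   # mean reverting: pulled back toward the mean
    rw <- rw + eps[t]         # random walk: shocks accumulate forever
  }
  ar_end[i] <- ar; rw_end[i] <- rw
}
c(var(ar_end), 1 / (1 - phi^2))   # close to 1/(1 - 0.64) = 2.78
c(var(rw_end), nstep)             # close to 200: variance grows linearly in time
```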
If we introduce the backshift operator B, defined by
B Yt = Yt−1,
we may express the first difference in Eq. (3.60) as
Y′t = (1 − B)Yt.
Sometimes, differencing must be repeated in order to obtain a stationary time series. We obtain second-order differencing by repeated application of (first-order) differencing:
Y″t = (1 − B)Y′t = (1 − B)²Yt = Yt − 2Yt−1 + Yt−2.
This suggests that we may formally apply the algebra of polynomials to the backshift operator, in order to find differences of arbitrary order. By introducing the operator polynomials
we can rewrite the ARMA model of Eq. (3.58) in the compact form:
Φ(B)Yt = Θ(B)εt.   (3.63)
We may also extend the class of stationary ARMA models, in order to allow for nonstationarity. Doing so, we find the more general class of ARIMA (autoregressive integrated moving average) processes, also known as Box–Jenkins models. An ARIMA(p, d, q) process can be represented as follows:28
Φ(B)(1 − B)^d Yt = Θ(B)εt,   (3.64)
where Φ(B) and Θ(B) are polynomials of order p and q, respectively, and d is a differencing order such that the differenced process is stationary, whereas, if we take differences of order (d − 1), the process is still nonstationary. The name “integrated” stems from the fact that we obtain the nonstationary process by integrating a stationary one, i.e., by undoing differencing. In most practical applications the order d is 0 or 1. Box–Jenkins models can be extended to cope with seasonality, but a full account of this class of models is beyond the scope of this book; we refer the reader to the references provided at the end of this chapter.
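The arima.sim function mentioned earlier can generate ARIMA paths directly. The following sketch (ours, with illustrative parameters) simulates an ARIMA(1,1,0) process and shows that, while the level series has the slowly decaying ACF typical of nonstationary data, the differenced series behaves like a stationary AR(1):

```r
set.seed(55555)
# ARIMA(1,1,0): nonstationary in levels, stationary after one difference
x <- arima.sim(model = list(order = c(1, 1, 0), ar = 0.5), n = 500)
dx <- diff(x)
r1_level <- acf(x,  plot = FALSE)$acf[2]   # close to 1: very slow decay
r1_diff  <- acf(dx, plot = FALSE)$acf[2]   # close to the AR coefficient 0.5
c(r1_level, r1_diff)
```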
If you want to generalize the scalar time series models that we have seen so far, in principle, there is no limit to what one can conceive. Simultaneous equations can be built, resulting in quite rich structures. In practice, an overly complicated model need not be the smartest option. On the one hand, there may be issues with estimation and the assessment of properties such as stationarity. On the other hand, the predictive power of a seemingly rich model may be disappointing due to overfitting issues. This is why vector autoregressive models, VAR for short,29 have been proposed. Let Yt ∈ ℝ^k be a vector of random variables; a VAR model can be represented in reduced form as
(3.65)
where Φ(B) is a matrix polynomial in the backshift operator B, and c, εt ∈ ℝ^k. More explicitly, a VAR(p) model is
Yt = c + Φ1Yt−1 + Φ2Yt−2 + ⋯ + ΦpYt−p + εt,   (3.66)
where the Φi are k × k matrices of coefficients. Note that the error term εt need not be a vector of independent variables.
Let us consider the bidimensional model
where the error terms are standard normals with correlation ρ = 0.5. In matrix form, we may write the above equations as
where
and
The following R script produces the time series displayed in Fig. 3.34.
FIGURE 3.34 Bidimensional time series corresponding to a VAR(1) model.
require(MASS)
tPhi1 <- t(matrix(c(0.7, 0.2, 0.2, 0.7), 2, 2))
c.vec <- c(-0.7, 1.3)
mu.vec <- numeric(2)
cov.mat <- matrix(c(1, 0.5, 0.5, 1), 2, 2)
set.seed(55555)
T <- 100
Ypath <- matrix(0, T, 2)
Ypath[1,] <- c(1, 1)   # initial condition
epsPath <- mvrnorm(n=T, mu.vec, cov.mat)
for (t in 2:T) Ypath[t,] <- c.vec + Ypath[t-1,] %*% tPhi1 + epsPath[t,]
plot.ts(Ypath)
Note that, in order to collect the time series in a matrix with rows corresponding to observations, we have to transpose Φ. We also need the MASS package to sample from a multivariate normal.
When observing financial time series of returns, the following stylized facts may be noticed:
The first feature is referred to as heteroskedasticity, in contrast to the homoskedasticity assumption, whereby volatility is constant. To model volatility, it is convenient to build a time series of returns. However, returns over consecutive periods cannot be added; rather, gross returns should be multiplied. This is why a time series of log-returns ut is usually considered:
where St is the price of the relevant financial asset. We want to capture the conditional variance of the time series, which can be expressed as:
It is often the case that E[ut] ≈ 0, if the time bucket is small enough.30 Hence, the expected return term is typically dropped. To build a model, we might relate volatility to the last shock:
σt² = α0 + α1ut−1².   (3.67)
A model like this is called ARCH(1), where the acronym stands for autoregressive conditionally heteroskedastic, and 1 means that only ut with lag 1 plays a role in Eq. (3.67). This is just a part of an overall model describing ut. We may build an ARCH(q) model as
σt² = α0 + α1ut−1² + ⋯ + αqut−q².   (3.68)
An ARCH model is, in a sense, an MA-style model for volatility. We may also consider an AR-style model like
σt² = α0 + α1ut−1² + β1σt−1²,   (3.69)
which is known as GARCH(1, 1), i.e., a generalized ARCH model involving two variables with lag 1.
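As a minimal sketch of how such a model can be simulated (the parameter values below are illustrative and are not taken from the text; the actual code is given in Fig. 3.35), we may generate the volatility and the shocks recursively:

```r
set.seed(55555)
n <- 1000
alpha0 <- 0.1; alpha1 <- 0.1; beta1 <- 0.8   # alpha1 + beta1 < 1 for stationarity
u <- numeric(n); sigma2 <- numeric(n)
sigma2[1] <- alpha0 / (1 - alpha1 - beta1)   # start at the unconditional variance
u[1] <- sqrt(sigma2[1]) * rnorm(1)
for (t in 2:n) {
  sigma2[t] <- alpha0 + alpha1 * u[t-1]^2 + beta1 * sigma2[t-1]
  u[t] <- sqrt(sigma2[t]) * rnorm(1)         # shock with conditional variance sigma2[t]
}
plot(u, type = "l")                          # volatility clustering should be visible
```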
A GARCH model should be coupled with another equation describing the evolution of a time series of interest. Let us consider a time series of returns generated as an expected return plus a random shock ut:
where
ut = σt·εt,  εt ~ N(0, 1).
Note that σt is the standard deviation, i.e., the volatility of the return time series, and it follows Eq. (3.69). One may use a fatter-tailed distribution to generate the shocks εt. The code in Fig. 3.35 generates the two plots displayed in Fig. 3.36.31
FIGURE 3.35 Code to generate a sample path of a GARCH(1, 1) model.
FIGURE 3.36 Sample path of a GARCH(1, 1) model.
Many financial engineering models rely on continuous-time stochastic processes. One reason is that, in fact, time is continuous in the real world, but another one is that this modeling framework allows for the development of a rich theory. The theory is translated into practical models that allow for an analytical solution in some lucky cases, and even if we are not that lucky, we may still resort to numerical methods, including Monte Carlo. In this section we first illustrate a gentle transition from discrete- to continuous-time models, which leads to the introduction of a fundamental building block in continuous-time stochastic modeling, the Wiener process. Continuous-time models are typically expressed as stochastic differential equations, which not only generalize deterministic differential equations but also feature thorny technical issues that require the introduction of a new concept: the stochastic integral in the sense of Itô. This, in turn, leads to an essential result, the celebrated Itô's lemma, which is in some sense an extension of the chain rule for the derivative of composite functions in familiar calculus. In fact, it turns out that the familiar calculus rules do not work in this context and must be suitably modified. This is essential in tackling stochastic differential equations analytically, when possible, and numerically via sample path generation.32 Then, we introduce the family of Itô stochastic differential equations, which includes geometric Brownian motion, and some generalizations. In the following, we will mostly consider the price of a stock share as the relevant variable, but we can also model interest rates, exchange rates, macroeconomic variables, etc.
It is a good idea to start with a discrete-time model and then derive a continuous-time model heuristically. Consider a time interval [0, T], and imagine that we discretize it by a sequence of time periods of length δt, such that T = n · δt; discrete-time instants are indexed by t = 0, 1, 2, …, n. Let St be, for instance, the price of a stock share at time t. One possible and reasonable model for the evolution of this price over time is the multiplicative form
St+1 = ut·St,   (3.70)
where ut is a non-negative random variable and the initial price S0 is known. If we consider continuous random variables ut, the model is continuous-state. In the simplest model the variables ut are i.i.d., which may be justified by the efficient market hypothesis, if we believe in it (or by the lack of an alternative model we trust enough). The multiplicative model ensures that prices will stay non-negative, which is an obvious requirement for stock prices. If we used an additive model, such as St+1 = ut + St, we should admit negative values for the random variables ut, in order to represent price drops; therefore, we would not have the guarantee that St ≥ 0. With the multiplicative form, a price drops when ut < 1, but it remains positive. Furthermore, the actual impact of a price change depends on the present stock price; a $1 increase is a different matter if the current price is $100 rather than $5. This is easily accounted for by the multiplicative form, which essentially deals with percent changes in price.
In order to select a sensible probability distribution for the random variables ut, let us use a typical trick of the trade and consider the natural logarithm of the stock price:
log St = log St−1 + zt,  where zt ≡ log ut.
The random variable zt is the increment in the logarithm of price, and a common assumption is that it is normally distributed, which implies that ut is lognormal. Starting from the initial price S0 and unfolding (3.70) recursively, we find
St = S0·u1·u2 ⋯ ut,
which implies that
log St = log S0 + z1 + z2 + ⋯ + zt.
Since the sum of normal random variables is still a normal variable, log St is normally distributed, which in turn implies that stock prices are lognormally distributed, according to this model. To put it in another way, the product of lognormals is still lognormal. Using the notation
ν = E[zt],  σ² = Var(zt),
we see that
E[log St] = log S0 + ν·t,   (3.71)
Var[log St] = σ²·t,   (3.72)
where we take advantage of the intertemporal independence of the variables zt when calculating the variance. The important point to see here is that the expected value and the variance of the increment in the logarithm of the stock price scale linearly with time; this implies that standard deviation scales with the square root of time, whereas expected value scales linearly. This property is known as the square-root rule and implies that, over a very short time period, volatility (related to standard deviation) dominates drift (related to expected value).33
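The square-root rule is easy to verify by simulation; in the sketch below (ours), the per-period drift ν and volatility σ are illustrative values:

```r
set.seed(55555)
nu <- 0.001; sigma <- 0.02          # illustrative per-period drift and volatility
k <- 25; nrep <- 50000
z <- matrix(rnorm(nrep * k, mean = nu, sd = sigma), nrep, k)
incr <- rowSums(z)                  # increment of log-price over k periods
c(mean(incr), k * nu)               # drift scales linearly: both near 0.025
c(sd(incr), sigma * sqrt(k))        # volatility scales with sqrt(k): both near 0.1
```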
Our aim is to build a continuous-time model, so let us consider first what we typically do in a deterministic case to accomplish the task. The familiar approach is to take the limit of a difference equation, for δt → 0, and build a differential equation. Informally, if there were no uncertainty, we would recast what we have seen in discrete time as
Taking the limit as δt → 0, we find a relationship between differentials:
Integrating both differentials over the interval [0, t] yields
In the deterministic case, as we remarked earlier in this book, it is customary to write the differential equation as
or, equivalently, as
where we have used standard calculus, in particular the chain rule of derivatives, to rewrite the differential
To make the model stochastic (and more realistic), we have to include a noise term, which entails a few important changes. First, we should write the equation in the form
where dW(t) can be considered as the increment of a stochastic process over the interval [t, t + dt]. Equation (3.75) is an example of a rather tricky object, called a stochastic differential equation. It is reasonable to guess that the solution of a stochastic differential equation is a stochastic process, rather than a deterministic function of time. This topic is quite difficult to deal with rigorously, as it requires some background in measure theory and stochastic calculus, but we may grasp the essentials by relying on intuition and heuristic arguments.
The first thing we need is to investigate which type of continuous-time stochastic process W(t) we can use as a building block. In the next section we introduce such a process, called the Wiener process, which plays more or less the same role as the process zt above. It turns out that this process is not differentiable, whatever this may mean for a stochastic process; informally, we will see that its sample paths are quite jagged indeed. Hence, we cannot write the stochastic differential equation as
Actually, a stochastic differential equation must be interpreted as a shorthand for an integral equation much like (3.73), involving the increments of a stochastic process. This calls for the definition of a stochastic integral and the related stochastic calculus. A consequence of the definition of the stochastic integral is that working with differentials as in Eq. (3.74) is not possible. We need a way to generalize the chain rule for differentials from the deterministic to the stochastic case. This leads to a fundamental tool of stochastic calculus called Itô’s lemma.
In the discrete-time model, we have assumed normally distributed increments in logarithmic prices, and we have also seen that the expected value of the increment of the logarithm of price scales linearly with time, whereas standard deviation scales with the square root of time.
In discrete time, we could consider the following process as a building block:
Wt = Wt−1 + √δt·εt,   (3.76)
where εt is a sequence of independent standard normal variables. We see that, for k > j,
Wk − Wj = √δt·(εj+1 + εj+2 + ⋯ + εk),
which implies that
Wk − Wj ~ N(0, (k − j)·δt).
In order to move to continuous time, we may define the standard Wiener process as a continuous-time stochastic process characterized by the following properties:

1. W(0) = 0.
2. The increment W(t) − W(s), for s < t, is normally distributed: W(t) − W(s) ~ N(0, t − s).
3. Increments over disjoint time intervals are independent random variables.
4. Sample paths are continuous functions of t.
Note how the second condition makes sure that standard deviation scales with the square root of time, like in the discrete-time model. To see the importance of the independent increments assumption, let us consider the sample paths of a process defined as Q(t) = ε√t, with ε ~ N(0, 1), which are shown in Fig. 3.37. This is a “degenerate” stochastic process,34 since knowledge of just one point on a sample path yields the complete knowledge of the whole sample path, which makes the process quite predictable. However, if we only look at the marginal distributions of Q(t), this process just looks like the Wiener process, since Q(t) ~ N(0, t) for any t.
FIGURE 3.37 Sample paths of a “degenerate” stochastic process.
It is lack of independence that makes the difference, as we may also check empirically. We may recast Eq. (3.76) as
W(t + δt) = W(t) + εt+δt·√δt,   (3.77)
where the notation εt+δt, rather than εt, may be used to emphasize the fact that this standard normal shock conveys new information. This discretization immediately allows us to write a piece of R code to generate sample paths of the standard Wiener process, which is illustrated in Fig. 3.38 and produces the plot in Fig. 3.39, which is quite different from Fig. 3.37, even though the marginals are the same.
FIGURE 3.38 Simple R function and script to sample the standard Wiener process.
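For readers without access to the figure, a minimal sketch along the same lines is given below; the function name and its arguments are ours, not necessarily those used in Fig. 3.38:

```r
simWiener <- function(T, nsteps) {
  dt <- T / nsteps
  dW <- sqrt(dt) * rnorm(nsteps)   # independent increments, each N(0, dt)
  c(0, cumsum(dW))                 # W(0) = 0; cumulate the increments
}
set.seed(55555)
W <- simWiener(T = 1, nsteps = 1000)
plot(seq(0, 1, length.out = 1001), W, type = "l", xlab = "t", ylab = "W(t)")
```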
From Fig. 3.39, we observe that sample paths of the Wiener process look continuous, but not differentiable. This may be stated precisely, but not very easily. Introducing continuity and differentiability rigorously calls for specifying a precise concept of stochastic convergence, as we should say that the Wiener process is nowhere differentiable with probability 1. To get an intuitive feeling for this fact, let us consider the increment ratio
Given the defining properties of the Wiener process, it is easy to see that
If we take the limit for δt → 0, this variance goes to infinity. Strictly speaking, this is no proof of nondifferentiability of W(t), but it does suggest that there is some trouble in using an object like dW(t)/dt; indeed, you will never see a notation like this. We only use the differential dW(t) of the Wiener process. Informally, we may think of dW(t) as a random variable with distribution N(0, dt). Actually, we should think of this differential as an increment, which may be integrated as follows:
This looks reasonable, doesn’t it? We may even go further and use W(t) as the building block of stochastic differential equations. For instance, given real numbers a and b, we may imagine a stochastic process X(t) satisfying the equation
dX(t) = a dt + b dW(t).   (3.78)
This equation defines a generalized Wiener process and may be solved by straightforward integration over the time interval [0, t]:
X(t) − X(0) = a·t + b·[W(t) − W(0)],
which may be rewritten as
X(t) = X(0) + a·t + b·W(t).
Hence, for the generalized Wiener process we find
E[X(t)] = X(0) + a·t,  Var[X(t)] = b²·t.
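These two moments are easily checked by Monte Carlo sampling, exploiting the fact that W(t) ~ N(0, t); the parameter values in this sketch (ours) are arbitrary:

```r
set.seed(55555)
a <- 0.3; b <- 0.5; X0 <- 10; t <- 2; nrep <- 100000
X <- X0 + a * t + b * sqrt(t) * rnorm(nrep)   # samples of X(t), since W(t) ~ N(0, t)
c(mean(X), X0 + a * t)    # expected value: both near 10.6
c(var(X), b^2 * t)        # variance: both near 0.5
```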
However, if we consider something more complicated, like
dX(t) = a(X(t), t) dt + b(X(t), t) dW(t),   (3.79)
things are not that intuitive. A process satisfying an equation like (3.79) is called an Itô process. We could argue that the solution should be something like
X(t) = X(0) + ∫₀ᵗ a(X(s), s) ds + ∫₀ᵗ b(X(s), s) dW(s).   (3.80)
Here the first integral looks like a standard Riemann integral of a function over time, but what about the second one? We need to assign a precise meaning to it, and this leads to the definition of a stochastic integral.
In a stochastic differential equation defining a process X(t), where a Wiener process W(t) is the driving factor, we may assume that the value X(t) depends only on the history of W(t) over the time interval from 0 to t. Technically speaking, we say that process X(t) is adapted to process W(t). Now let us consider a stochastic integral like
∫₀ᵀ X(t) dW(t).   (3.81)
How can we assign a meaning to this expression? To begin with, it is reasonable to guess that a stochastic integral is a random variable. If we integrate a deterministic function of time we get a number; so, it is natural to guess that, by integrating a stochastic process over time, we should get a random variable. Furthermore, the stochastic integral above looks related to the sample paths of process W(t), and an approximation could be obtained by partitioning the integration interval into small subintervals defined by points 0 = t0 < t1 < t2 < ⋯ < tn = T and considering the sum
Σk=0,…,n−1 X(tk)·[W(tk+1) − W(tk)].   (3.82)
If you recall how the Riemann integral was introduced in your first calculus course, as the limit of sums over partitions, the above definition does not look too different. Yet, there is a subtle and crucial point. It is very important to notice how we have chosen the time instants in the expression above: X(tk) is taken at the left extreme of the interval (tk, tk+1). This is just one possible choice: Why not take the value of X at the midpoint? The choice has an important effect: X(tk) is a random variable that is independent of the increment W(tk+1) − W(tk) by which it is multiplied. This choice is what makes the stochastic integral in the sense of Itô a bit peculiar, and it may be motivated as follows.
Consider a set of n assets, whose prices are modeled by stochastic processes Si(t),i = 1, …, n, which are described by stochastic differential equations like (3.79), and assume that we have a portfolio strategy represented by functions hi(t). These functions represent the number of stock shares of each asset i that we hold at time t. But which functions make sense? An obvious requirement is that functions hi(·) should not be anticipative: hi(t) may depend on all the history so far, over the interval [0, t], but clairvoyance should be ruled out. Furthermore, we should think of hi(t) as the number of shares we hold over a time interval of the form [t, t + dt).
Now, assume that we have an initial wealth that we invest in the portfolio, whose initial value, depending on the portfolio strategy represented by the functions h_i(·), is

V(0) = Σ_{i=1}^n h_i(0) S_i(0) = hᵀ(0) S(0),
where we have grouped h_i(t) and S_i(t) in column vectors. What about the dynamics of the portfolio value? If the portfolio is self-financing, i.e., we can trade assets but we do not invest (nor withdraw) any more cash after t = 0, it can be shown that the portfolio value will satisfy the equation

dV(t) = Σ_{i=1}^n h_i(t) dS_i(t) = hᵀ(t) dS(t).
This looks fairly intuitive and convincing, but some careful analysis is needed to prove it (see, e.g., [1, Chapter 6]). Then, we may reasonably guess that the wealth at time t = T will be

V(T) = V(0) + ∫_0^T hᵀ(t) dS(t).
However, it is fundamental to interpret the stochastic integral as the limit of an approximation like (3.82), i.e.,

V(T) ≈ V(0) + Σ_{k=0}^{n−1} hᵀ(t_k)[S(t_{k+1}) − S(t_k)].
The number of stock shares we hold at time t_k does not depend on the future prices S(t_{k+1}). First we allocate wealth at time t_k, and then we observe the return over the time interval (t_k, t_{k+1}). This makes financial sense, and it is why Itô stochastic integrals are defined the way they are.
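The discrete self-financing recursion can be checked with a toy R script; the price path and the strategy below are made up for illustration, but they show that, for a non-anticipative buy-and-hold strategy in a single asset, accumulating the gains h(t_k)[S(t_{k+1}) − S(t_k)] recovers the terminal wealth exactly.

```r
# Toy sketch: discrete self-financing wealth for a single asset
# (price path and strategy are illustrative, not from any model in particular)
set.seed(5)
n <- 250
S <- 100 * cumprod(c(1, 1 + rnorm(n, sd = 0.01)))  # toy price path, n+1 points
h <- rep(1, n)               # buy and hold one share: clearly non-anticipative
V0 <- h[1] * S[1]            # initial wealth
VT <- V0 + sum(h * diff(S))  # V(0) plus accumulated trading gains
c(VT, S[n + 1])              # equal: holding one share, wealth tracks the price
```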
Now, if we take approximation (3.82) and consider finer and finer partitions of the interval [0, T], letting n → ∞, what do we obtain? The answer is technically involved: We must select a precise concept of stochastic convergence and check that everything makes sense. Using mean square convergence, it can be shown that the limit is indeed well defined and provides a rigorous definition of the stochastic integral in the sense of Itô.
The definition of the stochastic integral has some important consequences. To begin with, what is the expected value of the integral in Eq. (3.81)? This is a good starting question, because the stochastic integral is a random variable, and the first and foremost feature of a random variable is its expected value. We may get a clue by considering the approximation (3.82):

E[Σ_{k=0}^{n−1} X(t_k)(W(t_{k+1}) − W(t_k))] = Σ_{k=0}^{n−1} E[X(t_k)] · E[W(t_{k+1}) − W(t_k)] = 0,
where we have used the independence between X(t_k) and the increments of the Wiener process over the time interval (t_k, t_{k+1}), along with the fact that the expected value of these increments is zero. This shows that the integral of Eq. (3.81) is a random variable with expected value zero, but can we say something more? The definition of the stochastic integral does not yield a practical way to compute it. We may try, however, to consider a specific case to get some intuition. The following example illustrates one nasty consequence of the way in which we have defined the stochastic integral.
Say that we want to “compute” the stochastic integral

∫_0^T W(t) dW(t).
Analogy with ordinary calculus would suggest using the chain rule of differentiation of composite functions to obtain a differential which can be integrated directly. Specifically, we might guess that

d(W²(t)/2) = W(t) dW(t).
This in turn would suggest that

∫_0^T W(t) dW(t) = W²(T)/2.
Unfortunately, this cannot be the correct answer, as it contradicts our previous findings. We have just seen that the expected value of an integral of this kind is zero, but:

E[W²(T)/2] = Var[W(T)]/2 = T/2 ≠ 0.
We see that the two expected values do not match at all, and there must be something wrong somewhere.
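A quick Monte Carlo experiment in R makes the mismatch tangible; the following sketch averages the left-endpoint approximation of the integral over many replications and compares it with the average of the naive guess W²(T)/2.

```r
# Sketch: compare E of the Ito approximation of the integral of W dW
# with E[W(T)^2/2], for T = 1
set.seed(55555)
T <- 1; n <- 200; dt <- T / n; nRepl <- 5000
itoSums <- wHalfSq <- numeric(nRepl)
for (k in 1:nRepl) {
  dW <- rnorm(n, sd = sqrt(dt))    # Wiener increments
  W <- c(0, cumsum(dW))            # sampled path
  itoSums[k] <- sum(W[1:n] * dW)   # left-endpoint (Ito) sum
  wHalfSq[k] <- W[n + 1]^2 / 2     # naive chain-rule guess
}
c(mean(itoSums), mean(wHalfSq))    # near 0 and near T/2 = 0.5, respectively
```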
Example 3.17 shows that the chain rule does not work in Itô stochastic calculus. To proceed further, we need to find the right rule, and the answer is Itô’s lemma. As usual in this section, we cannot follow rigorous arguments, but we may give an informal, yet quite instructive argument (following [12, Chapter 13]). The argument is instructive as it provides us with valuable intuition explaining what went wrong in Example 3.17. Let us recall that an Itô process X(t) satisfies a stochastic differential equation such as

dX(t) = a(X(t), t) dt + b(X(t), t) dW(t),   (3.83)
which is in some sense the continuous limit of

δX = a(X, t) δt + b(X, t) ε(t) √δt,   (3.84)

where ε(t) ~ N(0, 1). What we actually need is a way to derive a stochastic differential equation for a function F(X, t) of X(t), as this plays the role of the derivative of a composite function, which is what the chain rule is used for.
Let us take a little step back in the realm of deterministic calculus, and consider what is the rationale behind Taylor’s expansion of a function G(x, y) of two variables. The key ingredient we need for our reasoning is the formula for the differential of such a function:

dG = ∂G/∂x dx + ∂G/∂y dy,
which indeed may be obtained from Taylor’s expansion:

G(x + δx, y + δy) − G(x, y) = ∂G/∂x δx + ∂G/∂y δy + higher-order terms,
when δx, δy → 0. Now we may apply this Taylor expansion to F(X, t), limiting it to the leading terms:

δF = ∂F/∂t δt + ∂F/∂X δX + ½ ∂²F/∂X² (δX)² + ⋯

In doing so, it is important to notice that the term √δt in Eq. (3.84) needs careful treatment when squared. In fact, we have something like

(δX)² = b²(X, t) ε²(t) δt + higher-order terms in δt,

which implies that the term in (δX)² cannot be neglected in the approximation. Since ε is a standard normal variable, we have E[ε²] = 1 and E[ε² δt] = δt. A delicate point is the following: It can be shown that, as δt tends to zero, the term ε² δt can be treated as nonstochastic, and it is equal to its expected value. An informal (far from rigorous) justification relies on the variance of this term:

Var[ε² δt] = δt² Var[ε²] = δt² (E[ε⁴] − E²[ε²]) = 2 δt²,

which, for δt → 0, can be neglected with respect to first-order terms. Here we have used the fact E[ε⁴] = 3, which can be checked by using moment generating functions35 or, cheating a bit, by recalling that the kurtosis of any normal, including a standard one, is 3. A useful way to remember this point is the formal rule

(dW)² = dt.   (3.85)
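The formal rule can be illustrated numerically: the sum of squared increments of a sampled Wiener path over [0, T], its quadratic variation, is close to T, with fluctuations that vanish as the grid is refined. The following R sketch is only an illustration of this fact.

```r
# Sketch: quadratic variation of a Wiener path over [0,T], illustrating
# the formal rule (dW)^2 = dt
set.seed(42)
T <- 2; n <- 1e5; dt <- T / n
dW <- rnorm(n, sd = sqrt(dt))   # increments, each with variance dt
qv <- sum(dW^2)                 # sum of (dW)^2 over the whole path
c(qv, T)                        # qv is close to T = 2
```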
Hence, when δt tends to zero, in the Taylor expansion we have

(δX)² → b²(X, t) δt.
Neglecting higher-order terms and taking the limit as both δX and δt tend to zero, we end up with

dF = ∂F/∂t dt + ∂F/∂X dX + ½ b² ∂²F/∂X² dt,
which, substituting for dX, becomes the celebrated Itô’s lemma:

dF = (∂F/∂t + a ∂F/∂X + ½ b² ∂²F/∂X²) dt + b ∂F/∂X dW.   (3.86)
Although this proof is far from rigorous, we see that all the trouble is due to the term of order √δt, which is introduced by the increment of the Wiener process.
It is instructive to see that if we set b(X, t) ≡ 0 in the Itô differential equation, i.e., if we eliminate randomness, then we step back to familiar ground. In such a case, the stochastic differential equation is actually deterministic and can be rewritten as

dx/dt = a(x, t),
where we use the notation x rather than X to point out the deterministic nature of function x(t). Then, Itô’s lemma boils down to

dF = (∂F/∂t + a ∂F/∂x) dt,   (3.87)
which can be rearranged to yield the following chain rule for a function F(x, t):

dF/dt = ∂F/∂t + ∂F/∂x · dx/dt.   (3.88)
In order to grasp Itô’s lemma, let us consider Example 3.17 once more.
In order to compute the stochastic integral of W(t) with respect to itself, we may apply Itô’s lemma to the case X(t) = W(t), by setting a(X, t) ≡ 0, b(X, t) ≡ 1, and F(X, t) = X²(t). Hence, we have

∂F/∂t = 0,   (3.89)
∂F/∂X = 2X,   (3.90)
∂²F/∂X² = 2.   (3.91)
It is important to point out that in Eq. (3.89) the partial derivative with respect to time is zero; it is true that F(X(t), t) depends on time through X(t), but here there is no direct dependence on t, so the partial derivative with respect to time vanishes. To put it another way, the dependence of X(t) on time plays no role when taking the partial derivative with respect to t, because X(t) is held fixed. You may also look back at Eq. (3.88) to see the relationship between the partial and the total derivative of F(X(t), t) with respect to t.
The application of Itô’s lemma yields

d(W²(t)) = dt + 2W(t) dW(t).   (3.92)
It is essential to notice that dt is exactly the term which we would not expect by applying the usual chain rule. But this is the term that allows us to get the correct expected value of W²(T). Indeed, by straightforward integration of Eq. (3.92) we now find

∫_0^T W(t) dW(t) = W²(T)/2 − T/2.
Taking the expected values yields

E[∫_0^T W(t) dW(t)] = E[W²(T)]/2 − T/2 = T/2 − T/2 = 0,
which is consistent with our findings in Example 3.17.
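The correction term can also be seen pathwise in a small R sketch: on any sampled grid, the algebraic identity W²(t_{k+1}) − W²(t_k) = 2W(t_k) δW + (δW)² makes W²(T) equal to twice the Itô sum plus the quadratic variation, and the latter is what converges to T.

```r
# Sketch: pathwise decomposition behind d(W^2) = 2W dW + dt
set.seed(2021)
T <- 1; n <- 1e4; dt <- T / n
dW <- rnorm(n, sd = sqrt(dt))
W <- c(0, cumsum(dW))
itoSum <- sum(W[1:n] * dW)   # left-endpoint approximation of the integral
quadVar <- sum(dW^2)         # quadratic variation, close to T
c(W[n + 1]^2, 2 * itoSum + quadVar)   # identical up to rounding
```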
Let us summarize our findings: With respect to Eq. (3.87), in Itô’s lemma we have an extra term in dW, which is expected given the form of the differential equation defining the stochastic process X(t), and an unexpected term

½ b²(X, t) ∂²F/∂X² dt.

In deterministic calculus, second-order derivatives occur in second-order terms linked to (δt)², which can be neglected; but here we have a term of order √δt which must be taken into account even when it is squared.
Itô’s lemma may be used to find the solution of stochastic differential equations, at least in relatively simple cases. A very important example is geometric Brownian motion, which is defined by a specific choice of the functions in the Itô stochastic differential equation:

dS(t) = μS(t) dt + σS(t) dW(t),
where μ and σ are constant parameters referred to as drift and volatility, respectively. Intuition would suggest rewriting the equation as

dS(t)/S(t) = μ dt + σ dW(t),
and then considering the differential d log S, which would be dS/S in deterministic calculus, to find the corresponding integral. However, we have seen that some extra care is needed when dealing with stochastic differential equations. Nevertheless, the above intuition deserves further investigation, so let us apply Itô’s lemma and find the stochastic differential equation for F(S, t) = log S(t). As a first step we compute the partial derivatives:

∂F/∂t = 0,   ∂F/∂S = 1/S,   ∂²F/∂S² = −1/S².
Then, putting all of them together we find

d log S(t) = (μ − σ²/2) dt + σ dW(t).
Now we see that our guess above was not that bad, as this equation may be easily integrated and yields

log S(t) = log S(0) + (μ − σ²/2) t + σ W(t).
Recalling that W(t) has a normal distribution and can be written as W(t) = √t ε, where ε ~ N(0, 1), we conclude that the logarithm of price is normally distributed:

log S(t) ~ N(log S(0) + (μ − σ²/2) t, σ² t).
We can rewrite the solution in terms of S(t):

S(t) = S(0) exp{(μ − σ²/2) t + σ W(t)},

or

S(t) = S(0) exp{(μ − σ²/2) t + σ √t ε}.
On the one hand, this shows that prices, according to the geometric Brownian motion model, are lognormally distributed. On the other hand, now we have a way to generate sample paths, based on a discretization of a prescribed time horizon in time periods of length δt, like we did with the standard Wiener process. A simple R function to do so is given in Fig. 3.40.36 This function, when given the drift and volatility selected in the script, yields the plot in Fig. 3.41.
FIGURE 3.40 Simple R function to sample geometric Brownian motion.
FIGURE 3.41 A sample path of geometric Brownian motion.
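For readers without access to the figure, a function along the lines of Fig. 3.40 may look as follows; the function name and arguments here are our own choice and need not match the book's, but the scheme exploits the exact lognormal solution, so there is no discretization error at the grid points.

```r
# Sketch in the spirit of Fig. 3.40 (function name and arguments are our own):
# exact sampling of geometric Brownian motion on a uniform time grid
simGBM <- function(S0, mu, sigma, T, nSteps) {
  dt <- T / nSteps
  nu <- mu - sigma^2 / 2                 # drift of the log-price
  logIncr <- nu * dt + sigma * sqrt(dt) * rnorm(nSteps)
  S0 * exp(c(0, cumsum(logIncr)))        # S(0), S(dt), ..., S(T)
}
set.seed(7)
path <- simGBM(S0 = 50, mu = 0.1, sigma = 0.3, T = 1, nSteps = 252)
plot(seq(0, 1, length.out = 253), path, type = "l",
     xlab = "t", ylab = "S(t)")          # a plot like Fig. 3.41
```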
By using properties of the lognormal distribution37 and letting ν = μ − σ²/2, so that log S(t) − log S(0) ~ N(νt, σ²t), we also find

E[S(t)] = S(0) e^{μt},   (3.94)
Var[S(t)] = S²(0) e^{2μt} (e^{σ²t} − 1),   (3.95)
from which we see that the drift parameter μ is linked to the continuously compounded return. The volatility parameter σ is related to the standard deviation of the increment of the logarithm of price. The roles of drift and volatility can also be grasped intuitively by considering the following approximation of the differential equation:

δS/S ≈ μ δt + σ ε √δt,
where δS/S is the return of the asset over a small time interval δt. According to this approximation, the return can be approximated by a normal variable with expected value μ δt and standard deviation σ√δt. Actually, this normal distribution is only a local approximation of the “true” (according to the model) lognormal distribution.
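Both the mean of the price and the standard deviation of the log-increment can be checked by sampling from the exact lognormal solution; the parameter values in the following R sketch are made up for illustration.

```r
# Sketch: Monte Carlo check of E[S(t)] = S(0)*exp(mu*t) and of the
# standard deviation of the log-price increment (illustrative parameters)
set.seed(123)
S0 <- 100; mu <- 0.08; sigma <- 0.25; t <- 1; nRepl <- 2e5
eps <- rnorm(nRepl)
St <- S0 * exp((mu - sigma^2 / 2) * t + sigma * sqrt(t) * eps)
c(mean(St), S0 * exp(mu * t))         # sample mean vs exact mean
c(sd(log(St / S0)), sigma * sqrt(t))  # log-increment sd vs sigma*sqrt(t)
```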
Geometric Brownian motion is not the only type of stochastic process relevant in finance, and the Wiener process is not the only relevant building block. One of the main features of these processes is the continuity of sample paths. However, discontinuities do occur sometimes, such as jumps in prices. In this case, different building blocks are used, such as the Poisson process, which counts the events occurring at a certain rate. We should also note that continuous sample paths do not make sense for certain state variables such as credit ratings. Another point is that the lognormal distribution, that we get from geometric Brownian motion, is a consequence of the normality associated with the Wiener process. Distributions with fatter tails are typically observed, questioning the validity of the simple models based on Gaussian diffusions.
Indeed, several alternative processes have been proposed and are commonly used in financial engineering. Here we just mention some possible generalizations, leaving further details to Chapter 6 on path generation and Chapter 11 on pricing derivatives.
Correlated Wiener processes. The need to model the dynamics of a set of asset prices arises when we deal with a portfolio, or when we want to price a rainbow option depending on multiple underlying assets. The simplest model that we may adopt is a direct generalization of geometric Brownian motion. According to this approach, the prices S_i(t) of assets i = 1, …, n satisfy the equation

dS_i(t) = μ_i S_i(t) dt + σ_i S_i(t) dW_i(t),
where the Wiener processes W_i(t) are not necessarily independent. They are characterized by a set of instantaneous correlation coefficients ρ_ij, whose meaning can be grasped by an extension of the formal rule of Eq. (3.85):

dW_i(t) dW_j(t) = ρ_ij dt.
From a Monte Carlo point of view, sample path generation of these processes requires the generation of increments

δW_i = ε_i √δt,

where ε_i and ε_j are standard normals with correlation ρ_ij.
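One standard way to obtain correlated standard normals from independent ones is through the Cholesky factor of the correlation matrix; the following R sketch illustrates the idea for two processes (the correlation value is arbitrary).

```r
# Sketch: correlated Wiener increments via the Cholesky factorization
set.seed(99)
rho <- 0.7; dt <- 1 / 252; n <- 1e5
Sigma <- matrix(c(1, rho, rho, 1), nrow = 2)  # correlation matrix
U <- chol(Sigma)                    # upper triangular, t(U) %*% U = Sigma
Z <- matrix(rnorm(2 * n), ncol = 2) %*% U     # rows: correlated N(0,1) pairs
dW <- sqrt(dt) * Z                  # correlated Wiener increments
rhoHat <- cor(dW[, 1], dW[, 2])     # close to rho
```

Note that `chol()` in base R returns the upper triangular factor U, so post-multiplying a matrix of independent standard normals by U yields rows with covariance matrix UᵀU = Sigma.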
Ornstein–Uhlenbeck processes. In geometric Brownian motion, we have an exponential trend on which a diffusion process is superimposed.38 This exponential growth makes no sense for several financial variables, which rather feature mean reversion. Then, we may resort to Ornstein–Uhlenbeck processes like

dX(t) = γ(X̄ − X(t)) dt + σ dW(t).

They are still Gaussian diffusions, but we observe that the drift can change sign, pulling the process back to a long-run average X̄, with speed γ. The application of this model to a short-term interest rate r(t) yields the Vasicek model:

dr(t) = γ(r̄ − r(t)) dt + σ dW(t).
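Mean reversion is easy to visualize by simulation; the following R sketch uses a simple Euler discretization of the Vasicek model, with made-up parameter values, starting the rate well above its long-run level.

```r
# Sketch: Euler discretization of the Vasicek model (illustrative parameters);
# the rate is pulled toward the long-run level rbar with speed gam
set.seed(11)
gam <- 2; rbar <- 0.04; sigma <- 0.01
T <- 10; n <- 1000; dt <- T / n
r <- numeric(n + 1); r[1] <- 0.10      # start well above the long-run level
for (k in 1:n)
  r[k + 1] <- r[k] + gam * (rbar - r[k]) * dt + sigma * sqrt(dt) * rnorm(1)
mean(tail(r, 200))                     # hovers around rbar = 0.04
```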
Square-root diffusions. One drawback of Ornstein–Uhlenbeck processes is that they do not prevent negative values, which make no sense for stock prices and interest rates. A possible adjustment is the following:

dr(t) = γ(r̄ − r(t)) dt + σ √r(t) dW(t).
This is an example of a square-root diffusion, which in the context of short-term interest rates is known as the Cox–Ingersoll–Ross model. For a suitable choice of parameters, it can be shown that a square-root diffusion stays non-negative. This changes the nature of the process, which is not Gaussian anymore, as we will see in Chapter 6. Similar considerations hold when modeling a stochastic and time-varying volatility σ(t). Geometric Brownian motion assumes constant volatility, whereas in practice we may observe time periods in which volatility is higher than usual. In a discrete-time setting, ARCH and GARCH models may be used. In continuous time, one possible model for stochastic volatility is the Heston model, which consists of a pair of stochastic differential equations:

dS(t) = μ S(t) dt + √V(t) S(t) dW₁(t),
dV(t) = γ(V̄ − V(t)) dt + σ_V √V(t) dW₂(t),
where V(t) = σ²(t), V̄ is a long-term value, and different assumptions can be made on the correlation of the two driving Wiener processes.
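As a concrete illustration, here is a simple Euler-type R sketch for a Heston-style model using "full truncation" (V is replaced by max(V, 0) inside drift and diffusion), a common fix since a plain Euler scheme may drive the variance negative; all parameter values are made up.

```r
# Sketch: full-truncation Euler scheme for a Heston-style model
# (all parameter values are illustrative)
set.seed(31)
S0 <- 100; V0 <- 0.04; mu <- 0.05
gamV <- 1.5; Vbar <- 0.04; sigV <- 0.3; rho <- -0.7
T <- 1; n <- 1000; dt <- T / n
S <- numeric(n + 1); V <- numeric(n + 1); S[1] <- S0; V[1] <- V0
for (k in 1:n) {
  z1 <- rnorm(1)
  z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(1)  # correlated driving normals
  Vp <- max(V[k], 0)                           # full truncation
  S[k + 1] <- S[k] + mu * S[k] * dt + sqrt(Vp) * S[k] * sqrt(dt) * z1
  V[k + 1] <- V[k] + gamV * (Vbar - Vp) * dt + sigV * sqrt(Vp) * sqrt(dt) * z2
}
c(S[n + 1], V[n + 1])
```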
Jump–diffusions. In order to account for jumps, we may devise processes with both a diffusion and a jump component, such as

dX(t) = a(X(t), t) dt + b(X(t), t) dW(t) + dY(t),
where Y(t) is a compound Poisson process. From a formal point of view, we require the definition of the stochastic integral of a stochastic process Y(t) with respect to a Poisson process N(t):

∫_0^t Y(s) dN(s) = Σ_{i=1}^{N(t)} Y(t_i),

where N(t) is the number of jumps that have occurred up to time t in the Poisson process, and t_i, i = 1, …, N(t), are the time instants at which the jumps occur. We note that the Wiener and the Poisson process, which are the basic building blocks of jump–diffusions, are quite different, but they do have an important common feature: stationary and independent increments. In fact, the class of Lévy processes generalizes both of them by emphasizing this feature.
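A sample path of a simple jump–diffusion can be generated on a fixed grid by superimposing Poisson-distributed jump counts with normal jump sizes on a Gaussian diffusion; the following R sketch uses made-up parameter values purely for illustration.

```r
# Sketch: sample path of a simple jump-diffusion on a fixed grid;
# jump counts are Poisson, jump sizes normal (illustrative parameters)
set.seed(77)
T <- 1; n <- 1000; dt <- T / n
mu <- 0.1; sigma <- 0.2          # diffusion parameters
lambda <- 3                      # jump intensity (expected jumps per unit time)
muJ <- -0.05; sigmaJ <- 0.1      # jump size distribution
nJumps <- rpois(n, lambda * dt)  # jumps occurring in each small interval
jumpPart <- sapply(nJumps, function(m) sum(rnorm(m, muJ, sigmaJ)))
diffPart <- mu * dt + sigma * sqrt(dt) * rnorm(n)
X <- c(0, cumsum(diffPart + jumpPart))   # X(0) = 0
plot(seq(0, T, by = dt), X, type = "l", xlab = "t", ylab = "X(t)")
```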