There is a multifaceted interaction between optimization and Monte Carlo simulation, which is illustrated by the rich connections between this and other chapters of this book. Monte Carlo methods may be used when dealing with stochastic optimization models, i.e., when some of the parameters characterizing a decision problem are uncertain. This is clearly relevant in finance, and there is an array of approaches depending on the problem structure and on our purpose:
A further use of Monte Carlo methods is to check the out-of-sample robustness of a solution. Within an optimization model, we may only afford a limited representation of uncertainty; the in-sample value of the objective function may be a misleading indicator of quality. Thus, it may be useful to verify the performance of the selected solution in a much wider set of scenarios.
Stochastic optimization in all of its forms does not tell the whole story about the interaction between optimization and Monte Carlo methods. Stochastic search strategies are also used to solve deterministic nonconvex optimization problems. There is an array of useful strategies, including some fancy ones mimicking biological systems, such as genetic algorithms or particle swarm optimization. Such methods are typically derivative-free, i.e., unlike classical nonlinear programming methods, they do not rely on information provided by the gradient or the Hessian matrix of the objective function. This makes it possible to use them when the objective is not guaranteed to be differentiable or even continuous. Furthermore, by blending exploitation with exploration behavior, they may be able to overcome the local optimality issues that make nonconvex optimization difficult. The optimization of nonconvex functions may arise, e.g., in model calibration or parameter estimation by maximum likelihood (see Chapter 4), when the underlying model is complicated enough.
As the reader can imagine, there is an enormous range of models and methods involved, which cannot be covered in a single chapter. Nevertheless, we offer here a set of basic examples in order to give a taste of this wide and quite interesting application field. In Section 10.1 we outline a classification of optimization models, for the sake of the uninitiated. Then, we provide a few concrete examples in Section 10.2, with due emphasis on economic and financial applications. Given our limited aim and scope, we will not go through traditional concepts like Lagrange multipliers in nonlinear programming or the simplex method for linear programming.1 When necessary, we will just use R functions from suitable packages implementing these methods. Emphasis is partly on modeling and partly on alternative methods, which are illustrated in Section 10.3 for global (nonconvex) optimization. These methods can also be applied to simulation-based optimization, which is the subject of Section 10.4. In Section 10.5 we illustrate model building in stochastic programming. We cannot cover specialized solution methods for this class of models, but we do emphasize scenario generation in Section 10.5.3, illustrating important concepts such as in-sample and out-of-sample stability. The last three sections are devoted to stochastic dynamic programming. The cornerstone principle of dynamic programming, the recursive functional equation due to Bellman, is introduced in Section 10.6, where we also compare this approach against stochastic programming with recourse. In Section 10.7 we describe “standard” numerical methods for dynamic programming, which are well-suited to problems of rather moderate size, whereas in Section 10.8 we describe ADP strategies that are aimed at larger scale problems and feature interesting links with machine learning.
A fairly general statement of an optimization problem is

min f(x)
s.t. x ∈ S ⊆ ℝⁿ,     (10.1)
which highlights the following three building blocks:
This rather dry and abstract statement encompasses an incredible variety of decision problems. Note that there is no loss of generality in considering minimization problems, as a maximization problem can be easily converted into a minimization one:

maxx∈S f(x) = − minx∈S [−f(x)].
Indeed, many software tools assume that the problem is either in minimization or in maximization form and, if necessary, leave the burden of changing the sign of the objective function to the user. In other cases, a flag is used to characterize the kind of problem.
In this book we only deal with finite-dimensional optimization problems, involving a finite number of decision variables. In continuous-time optimal control problems, where system dynamics is described by differential equations, decisions may be represented by functions u(t), where t ∈ [t0, T]; hence, such problems are infinite-dimensional. We do consider dynamic optimization problems in this chapter, but we assume that time has been discretized; this results in a finite-dimensional problem when the time horizon is finite. When the time horizon is infinite, in principle we do have an infinite-dimensional problem (though countably so); however, if the system is stationary, we may look for a stationary policy by dynamic programming. Dynamic programming relies on functional equations whose unknowns are, in fact, infinite-dimensional objects if the underlying state space is continuous. As we shall see, we always boil down the overall problem to a sequence of finite-dimensional subproblems by suitable discretizations. Hence, what we really deal with are mathematical programming problems, but we will use this term and the more general “optimization problem” interchangeably.
When S ≡ ℝⁿ we speak of an unconstrained optimization problem; otherwise we have a constrained optimization problem. The feasible region S, in concrete terms, is typically described by a set of equality and inequality constraints, resulting in the mathematical program:

min f(x)
s.t. hi(x) = 0,   i ∈ E,
     gi(x) ≤ 0,   i ∈ I,     (10.2)
     x ∈ X.
The condition x ∈ X may include additional restrictions, such as the integrality of some decision variables. As we shall see, the most common form of this restriction concerns binary decision variables, i.e., xj ∈ {0, 1}; this trick of the trade is useful to model logical decisions like “we do it” vs. “we do not.” Simple lower and upper bounds, describing box constraints like lj ≤ xj ≤ uj, are usually considered apart from general inequalities for the sake of computational efficiency. So, we might also have X = {x ∈ ℝⁿ | l ≤ x ≤ u}, where vectors l and u collect the lower and upper bounds on decision variables, respectively.
An optimization problem can be very simple or on the verge of intractability, depending on some essential features of its building blocks. Such features are often more important than the sheer size of the problem. We will consider concrete modeling examples later, shedding light on such features and showing the astonishing variety of problems that we may address, but an essential classification framework involves the following dimensions:
By far, the most important property of an optimization problem is convexity. Convexity is a property of sets, in particular of the feasible set S, which can be extended to a property of functions, in our case, the objective function.
DEFINITION 10.1 (Convex sets) A set S ⊆ ℝⁿ is convex if, for any x, y ∈ S and for any λ ∈ [0, 1], we have λx + (1 − λ)y ∈ S.
Convexity can be grasped intuitively by observing that points of the form λx + (1 − λ)y, where 0 ≤ λ ≤ 1, are simply the points on the straight-line segment joining x and y. So, a set S is convex if the line segment joining any pair of points x, y ∈ S is contained in S. This is illustrated in Fig. 10.1: S1 is convex, but S2 is not. S3 is a discrete set and it is not convex; this fact has important consequences for discrete optimization problems. It is easy to see that the intersection of convex sets is a convex set. This does not necessarily apply to set union.
FIGURE 10.1 An illustration of set convexity.
Polyhedral sets are essential in describing the feasible set of a linear programming problem. A hyperplane aiᵀx = bi, where bi ∈ ℝ and ai, x ∈ ℝⁿ, divides ℝⁿ into two half-spaces expressed by the linear inequalities aiᵀx ≤ bi and aiᵀx ≥ bi. Note that we always assume column vectors, so we use transposition to specify a row vector aiᵀ. A polyhedron P ⊆ ℝⁿ is a set of points satisfying a finite collection of linear inequalities, i.e.,

P = {x ∈ ℝⁿ | Ax ≥ b},

where matrix A ∈ ℝ^{m×n} collects the row vectors aiᵀ, i = 1, …, m, and column vector b ∈ ℝᵐ collects the right-hand sides bi. Note that a linear equality can be rewritten as two inequalities with opposite senses. A polyhedron is therefore the intersection of a finite collection of half-spaces and is convex. As we will discuss later (see Eqs. (10.4) and (10.5)), the feasible set of a linear programming problem can be described, e.g., as S = {x ∈ ℝⁿ | Ax = b, x ≥ 0} or S = {x ∈ ℝⁿ | Ax ≥ b}, and it is a polyhedron.
The closed intervals [2, 5] and [6, 10] are clearly convex sets. However, their union [2, 5] ∪ [6, 10] is not. As a more practically motivated example, consider an activity whose level is measured by the decision variable xj; concretely, xj may represent the weight of an asset in a financial portfolio. A possible restriction is

xj ∈ {0} ∪ [lj, uj],

meaning that we may not invest in that asset, i.e., xj = 0, but if we do, there are a lower bound lj and an upper bound uj on its weight. This restriction specifies a nonconvex set, and it should not be confused with the box constraint lj ≤ xj ≤ uj, which corresponds to a convex set. Later, we shall see how we can represent such a nonconvexity by the introduction of binary decision variables.
Set convexity can be extended to function convexity as follows.
DEFINITION 10.2 (Convex functions) A function f : ℝⁿ → ℝ, defined over a convex set S ⊆ ℝⁿ, is a convex function on S if, for any y and z in S, and for any λ ∈ [0, 1], we have

f(λy + (1 − λ)z) ≤ λf(y) + (1 − λ)f(z).     (10.3)
The definition can be interpreted by looking at Fig. 10.2. If we join any two points on the function graph with a line segment, all of the segment lies above the function graph. In other words, a function is convex if its epigraph, i.e., the region above the function graph, is a convex set. If the condition (10.3) is satisfied with strict inequality for all y ≠ z, the function is strictly convex. We illustrate examples of convex and nonconvex functions in Fig. 10.3. The first function is convex, whereas the second is not. The third function is a polyhedral convex function, and it has kinks. This example shows that a convex function need not be differentiable everywhere. Note that in the second case we have a local minimum that is not a global one. Indeed, convexity is so relevant in minimization problems because it rules out local minima.
FIGURE 10.2 An illustration of function convexity.
FIGURE 10.3 Convex and nonconvex functions.
Consider the following unconstrained problem:
The objective function is a polynomial, which can be dealt with in R by the polynom package, as illustrated by the script in Fig. 10.4. The script produces the following output:
FIGURE 10.4 Local minima in polynomial optimization.
> out1 <- optim(par=4, fn=f, method="BFGS")
> out1$par; out1$value
[1] 3.643656
[1] -0.6934635
> out2 <- optim(par=1, fn=f, method="BFGS")
> out2$par; out2$value
[1] 1.488273
[1] -1.875737
showing that we end up with different solutions, depending on the initial point. The script also produces the plot in Fig. 10.5: We observe that the polynomial features two local minimizers, one of which is also the global one, as well as one local maximizer. The function goes to infinity when x is large in absolute value, so there is no global maximizer. With a high-degree polynomial, there are potentially many local maxima and minima; in fact, multivariable polynomial optimization is, in general, a difficult nonconvex problem.
FIGURE 10.5 Global and local optima of a polynomial.
When solving a minimization problem, it is nice to have a convex objective function. When the objective has to be maximized, it is nice to deal with a concave function.
DEFINITION 10.3 (Concave function) A function f is concave if (−f) is convex.
Thus, a concave function is just a convex function turned upside down. We have observed that a convex function features a convex epigraph; hence, function convexity relies on set convexity. A further link between convex sets and convex functions is that the set S = {x ∈ ℝⁿ | g(x) ≤ 0} is convex if g is a convex function.
Consider the function g(x) = x1² + x2² − 1. The constraint g(x) ≤ 0 describes the disc of radius 1 centered at the origin, which is clearly a convex set. Indeed, g(x) is a convex function. However, the constraint g(x) ≥ 0 describes the exterior of the disc, which is clearly nonconvex. Indeed, the latter constraint is equivalent to −g(x) ≤ 0, where the function describing the set is concave. The equality constraint g(x) = 0 describes the boundary of the disc, i.e., the circumference, which again is nonconvex.
In general, equality constraints like
do not describe convex sets. We may understand why by rewriting the equality as two inequalities:
There is no way in which h(·) and −h(·) can be both convex, unless the involved function is affine:
It is pretty intuitive that convexity of the objective function makes an unconstrained problem relatively easy, since local minima are ruled out. But what about convexity of the feasible set of a constrained problem? Actually, convexity must be ensured in both the objective function and the feasible set, as shown by the next example.
Even if the objective function is convex, nonconvexities in the feasible set may make the problem difficult to solve. In Fig. 10.6 we observe two level curves of a convex objective function like f(x1, x2) = (x1 − α)² + (x2 − β)². The feasible set is a sort of bean-shaped region and is nonconvex. We may notice that B is the global minimizer, as there the feasible set is tangent to the lowest level curve. However, we also have a local optimum at point A: In its neighborhood there is no better feasible solution.
FIGURE 10.6 Local optima due to a nonconvex feasible set.
DEFINITION 10.4 (Convex optimization problem) The minimization problem minx∈S f(x) is convex if both S and f(·) are convex. The maximization problem maxx∈S f(x) is convex if S is convex and f is concave.
Note that we also speak of a convex problem when we are maximizing a concave objective. We speak of a concave problem when we minimize a concave objective on a convex set. Concave problems are not as nice as convex problems, as the following example illustrates.
Consider the following one-dimensional problem:
This is a concave problem, since the leading term in the quadratic polynomial defining the objective is negative, so that the second-order derivative is negative everywhere. In Fig. 10.7 we show the objective function and the feasible set. The stationary point x = 2 is of no use to us, since it is a maximizer. We see that local minimizers are located at the boundary of the feasible set. A local minimizer lies at the left boundary, x = 1, and the global minimizer is located at the right boundary, x = 4.
FIGURE 10.7 A concave problem may have local optima, but they lie on the boundary of the convex feasible set.
It turns out that the property illustrated in the example, i.e., there is an optimal solution on the boundary of the (convex) feasible set, does apply to concave problems in general. This kind of structure may be exploited in global optimization procedures.
An optimization model in the concrete form (10.2) can be linear or nonlinear. We have a linear programming (LP) problem if all of the involved functions are linear (affine), as in the following model:

min cᵀx
s.t. Ax = b,     (10.4)
     x ≥ 0,

where c ∈ ℝⁿ, A ∈ ℝ^{m×n}, and b ∈ ℝᵐ. Clearly, such a problem makes sense if m < n. The above form is an LP in standard form, involving only equality constraints and non-negative decision variables. Another way to state the problem is the canonical form

min cᵀx
s.t. Ax ≥ b.     (10.5)
Using tricks of the trade, it is possible to convert any LP into one of the two forms above. The important point is that LP is both a convex and a concave programming problem. This implies that any local optimum is also a global one, and that we need only consider points on the boundary of the feasible region. More precisely, there is an optimal solution corresponding to a vertex of the feasible region, which is a polyhedron. This is exploited in the celebrated simplex algorithm, which is available in plenty of commercial packages. There are also alternative strategies based on interior-point methods.
Say that we want to solve the following LP:

max 45x1 + 60x2
s.t. 15x1 + 35x2 ≤ 2400,
     25x1 + 15x2 ≤ 2400,
     0 ≤ x1 ≤ 100,   0 ≤ x2 ≤ 50.
We may resort to the linprog package, which provides us with the function solveLP to solve LPs by the simplex algorithm. The following script solves the problem:
library(linprog)
Amat <- rbind(c(15,35), c(25,15), c(1,0), c(0,1))
bvet <- c(2400, 2400, 100, 50)
cvet <- c(45, 60)
result <- solveLP(cvet, bvet, Amat, maximum=TRUE)
cat('optimal obj', result$opt, '\n')
cat('optimal x', result$solution, '\n')
Note that we have to include the upper bounds in the matrix of generic linear constraints; better solvers allow us to keep bounds apart. The resulting output is
optimal obj 5538.462 optimal x 73.84615 36.92308
If even one function involved in problem (10.2) is nonlinear, be it the objective or a function defining an equality or inequality constraint, we step into the domain of nonlinear programming. Contrary to what one may imagine, the real difficulty boundary is not marked by the linear/nonlinear dichotomy. The defining property is convexity.
For instance, we lose convexity in LP if we restrict a subset of variables to take only integer values: We get a mixed-integer LP model, MILP for short, which is nonconvex and requires much more complicated solution procedures. In principle, since there is no easy optimality condition to check for MILPs, enumeration of solutions is the only viable approach; partial enumeration strategies based on branch-and-bound methods are widely available, even though they may be quite expensive in terms of computational effort. On the contrary, there are nonlinear programming problems that are rather easy to solve. One such example is convex quadratic programming (QP):
In QPs, we have an objective involving a quadratic form, whereas all of the constraints are linear. The problem is easy if the quadratic form is convex, which is the case if the matrix Q is positive semidefinite. It is worth noting that a quite relevant example of positive semidefinite matrix in financial applications is a covariance matrix; see Section 10.2.1.
Say that we want to solve the following QP:

min x1² + x2²
s.t. x1 + x2 = 4,
with obvious solution x* = [2, 2]. We may resort to the quadprog package. The following script solves the problem:
library(quadprog)
Qmat <- rbind(c(2,0), c(0,2))
bvet <- c(4)
Amat <- matrix(c(1,1), nrow=2)
result <- solve.QP(Qmat, c(0,0), Amat, bvet, meq=1)
cat('optimal obj', result$value, '\n')
cat('optimal x', result$solution, '\n')
The resulting output is
optimal obj 8 optimal x 2 2
We note a few peculiarities of solve.QP:
As we show later, QPs have a long tradition in finance. Another class of relevant problems is quadratically constrained QP (QCQP), which involves convex quadratic forms in the constraints as well. There are other classes of nonlinear, yet convex problems for which quite efficient solution procedures are available. The more general nonlinear programming problems are typically harder but not intractable, provided that convexity applies. On the contrary, nonconvex problems require global optimization methods and are quite challenging. This is where Monte Carlo methods come into play.
In most interesting and realistic problems we have to cope with some form of uncertainty in the data. This is certainly the case in financial problems, where return on investments is uncertain. We may even have problems with an uncertain time horizon, such as pension planning. A rather abstract form of a stochastic optimization problem is

minx∈S E[f(x, ξ)],     (10.7)

which is an obvious generalization of the abstract problem (10.1). Here the objective function depends on decision variables x and random variables ξ. Taking the expectation eliminates ξ, yielding a function of x. Assuming that the random variables are continuous with joint density gξ(·), we have

E[f(x, ξ)] = ∫ f(x, ξ) gξ(ξ) dξ.
The random variables represent uncertainty that is realized after we make the decision x, as illustrated in the following simple example.
The newsvendor problem is a prototypical problem in supply chain management and concerns production or purchasing decisions for items subject to perishability or obsolescence risk, like fashion goods. There is a limited time window for sales, and we must decide the order quantity q before knowing the realization of a random demand D during the time window. Items are bought (or produced) at unit cost c and sold at the full retail price p > c within the sales time window; any unsold item after the time window must be marked down and sold at price pu < c. If the case q < D occurs, we have an opportunity cost corresponding to the lost profit margin m ≡ p − c for each unit of unsatisfied demand. On the other hand, when q > D we incur a cost cu ≡ c − pu for each unsold item. The newsvendor problem seems quite unrelated to financial problems. Nevertheless, to see a useful connection, imagine a retail bank that has to decide on the amount of cash required to meet withdrawal requests; ideally, we would like to set apart exactly the cash needed, and any surplus or shortfall is associated with a cost. Furthermore, the newsvendor problem is a simple decision problem with an analytical solution, which we use later to illustrate issues in scenario generation.
Let us use the notation X+ ≡ max{0, X} and express profit as:

Profit = mD − m(D − q)+ − cu(q − D)+.     (10.8)
Our problem is the maximization of expected profit,

max q≥0  E[mD − m(D − q)+ − cu(q − D)+],     (10.9)

which can also be recast in terms of cost minimization:

min q≥0  E[cu(q − D)+ + m(D − q)+].     (10.10)
To see the equivalence of models (10.9) and (10.10), observe that the expectation of Eq. (10.8) involves the expected demand, which is a given constant and does not play any role. In the second form (10.10), we associate a cost cu with each unit of surplus and a cost m with each unit of shortfall with respect to demand. In the case of a continuous demand distribution, it is easy to show that the optimal solution solves the equation

FD(q*) = m / (m + cu),     (10.11)
where FD(x) ≡ P{D ≤ x} is the cumulative distribution function (CDF) of demand. In other words, the profit margin m and the cost of unsold items cu define a quantile of the demand distribution. If demand is normal, D ~ N(μ, σ2), then q* = μ + zσ, where z is the standard normal quantile corresponding to a probability level given by the ratio in Eq. (10.11). Note that, in general, the optimal solution is not to order the expected value of demand. If uncertainty is significant, we should tilt our decision toward larger values if the critical ratio is larger than 0.5, and toward smaller values when it is smaller than 0.5. The important takeaway is that, when faced with an optimization problem involving uncertain data, we should not solve a deterministic problem based on their expected values.
The formulation (10.7) actually encompasses cases in which we are not really able to formulate the problem as a mathematical program explicitly. It may be the case that the form of the objective function is not known, but we can only estimate its value by a simulation model. At first, this may sound weird, but there are indeed optimization problems in which we are not quite able to evaluate the objective function exactly. In fact, the objective function could be related to the performance of a complex dynamical system, such that there is no closed-form expression of the objective function. Under uncertainty, the decisions themselves may not be explicit, but given in the form of a decision rule to be applied on the basis of the current state and the previous realization of uncertain parameters. In such a setting, we have to resort to simulation-based optimization approaches, where Monte Carlo methods play a prominent role.
While the formulation of the model (10.7) is general, it obscures two different sides of the coin:
Infeasibility is not an issue in the newsvendor model above, but it is in general. We will be more specific on how stochastic optimization can be addressed in Section 10.5.
The ability to solve an optimization problem is of no use without the ability to build a sensible optimization model. Model building is a difficult art to learn, possibly a bit beyond the scope of this book. Nevertheless, it is useful to illustrate a few relevant examples to get a glimpse of the resulting model classes:
Mean–variance optimization is the prototypical static portfolio optimization model. The decision concerns setting the weights wi, i = 1, …, n, of a set of n assets, in such a way as to strike a satisfactory trade-off between expected return and risk. Let
If we assume that risk can be measured by the standard deviation of portfolio return, one sensible model for the asset allocation decision is:

min Σi Σj σij wi wj
s.t. Σi μi wi ≥ μT,     (10.12)
     Σi wi = 1,
     wi ≥ 0.
Here we minimize variance, rather than standard deviation, since this yields an equivalent, but highly tractable convex quadratic programming problem. The constraint (10.12) enforces the achievement of the target expected return; then we also require that portfolio weights add up to 1, and the non-negativity requirements forbid short-selling.
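A tiny instance of this model can be solved with solve.QP from the quadprog package used earlier; the covariance matrix, expected returns, and target return below are made-up numbers:

library(quadprog)
Sigma <- rbind(c(0.04, 0.01, 0.00),
               c(0.01, 0.09, 0.02),
               c(0.00, 0.02, 0.16))    # illustrative covariance matrix
mu <- c(0.05, 0.08, 0.12)              # illustrative expected returns
target <- 0.07                         # target expected portfolio return
# columns of Amat: weights sum to 1 (equality), target return, non-negativity
Amat <- cbind(rep(1,3), mu, diag(3))
bvec <- c(1, target, rep(0,3))
out <- solve.QP(2*Sigma, rep(0,3), Amat, bvec, meq=1)
out$solution                           # optimal portfolio weights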
From a computational point of view, this is a trivial problem that can be solved quite efficiently. However, the model itself has a few weak points:
These variations on the theme lead to nonconvex and/or stochastic optimization models, which may be quite challenging to formulate and solve; Monte Carlo methods may play a pivotal role in tackling them, in one way or another.
One of the features that may make an optimization problem nonconvex and much harder to solve is the presence of decision variables restricted to integer values. By far, the most common case involves logical (or binary) decision variables, taking values in the set {0, 1}. We first outline a few standard modeling tricks involving such variables, and then we describe a practically relevant portfolio optimization model requiring their introduction.
Consider an economic activity whose level is measured by a continuous decision variable x ≥ 0. The total cost TC(x) related to the activity may consist of two terms:
In principle, we could introduce a step function such as
and express total cost as
Unfortunately, the step function is nonlinear and discontinuous at the origin, as it jumps from 0 to 1. To avoid the resulting technical difficulty, we may resort to an alternative representation by introducing the following binary decision variable: δ ∈ {0, 1}, with δ = 1 if the activity is undertaken (x > 0) and δ = 0 otherwise. Then, we link x and δ by the so-called big-M constraint:

x ≤ Mδ,     (10.13)
where M is a suitably large constant. To see how Eq. (10.13) works, imagine that δ = 0; then, the constraint reads x ≤ 0, whatever the value of M is. This constraint, together with the standard non-negativity restriction x ≥ 0, enforces x = 0. If δ = 1, the constraint is x ≤ M, which is nonbinding if M is large enough; in practice, M should be a sensible upper bound on the level of activity x.
A possible requirement on the level of an activity is that, if it is undertaken, its level should be in the interval [mi, Mi]. Note that this is not equivalent to requiring that mi ≤ xi ≤ Mi. Rather, we want something like

xi = 0   or   mi ≤ xi ≤ Mi,

which involves a nonconvex set (recall that the union of convex sets need not be convex). For instance, we may use a constraint like this to avoid the inclusion of assets with negligible weights in a portfolio. Using the big-M trick as above, we may write

mi δi ≤ xi ≤ Mi δi,   δi ∈ {0, 1}.
These constraints define a semicontinuous decision variable.
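As a sketch of how such logical constraints are passed to a solver, the following toy fixed-charge model uses the Rglpk package; the numbers, and the choice of package, are our own assumptions:

library(Rglpk)
# variables: x (continuous activity level) and delta (binary)
# cost: 2 per unit plus a fixed charge of 50 if the activity is undertaken;
# at least 7 units are required; M = 100 bounds the activity level
obj   <- c(2, 50)
mat   <- rbind(c(1, -100),    # big-M link: x - 100*delta <= 0
               c(1,    0))    # requirement: x >= 7
dir   <- c("<=", ">=")
rhs   <- c(0, 7)
types <- c("C", "B")
sol <- Rglpk_solve_LP(obj, mat, dir, rhs, types=types, max=FALSE)
sol$solution; sol$optimum     # x = 7, delta = 1, total cost 64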
Active portfolio managers aim at achieving superior return by taking advantage of skill or analytical capabilities. On the contrary, passive portfolio managers aim at replicating an index or tracking a benchmark portfolio as closely as possible. There is some theoretical support for the passive view, as the efficient market hypothesis, in its various forms, essentially states that there is no way to beat the market. The capital asset pricing model, too, supports the view that the market portfolio is the optimal risky component in an overall portfolio. The matter is highly controversial, but one thing can be taken for granted: A passive portfolio must have a very low management cost. One way to achieve this is to restrict the number of asset names included in the portfolio, i.e., to enforce a cardinality constraint.
To formulate the problem, let us consider a universe of n assets and a benchmark portfolio with given weights wib, i = 1, …, n. We want to find the weights wi ≥ 0 of a tracking portfolio, in such a way as to minimize a measure of distance between the target and the tracking portfolio. An additional restriction is that at most Cmax assets can be included, i.e., at most Cmax weights can be strictly positive (we assume that short-selling is forbidden). In order to express the cardinality constraint, we introduce a set of binary variables δi, one for each asset, modeling the inclusion of asset i in the tracking portfolio: δi = 1 if asset i is included, δi = 0 otherwise. The binary variables δi and the continuous variables wi are linked by the big-M constraint

wi ≤ M δi,   i = 1, …, n,     (10.14)

and the cardinality constraint reads as

Σi δi ≤ Cmax.

Since a weight cannot be larger than 1 when we rule out short-selling, we can set M = 1 in Eq. (10.14); the big-M constraints can be tightened by setting M = w̄i, where w̄i is some suitable upper bound on the weight of asset i.3
Now the problem is how to measure the portfolio tracking error. A trivial distance metric can be defined by an L1 norm:

Σi |wi − wib|.
By proper model formulation, this metric yields an MILP model. However, this distance metric does not take any return model into account. To see the point, let us consider the following weights for two assets:
According to the L1 norm, the distance between the target and the tracking portfolio, considering only these two assets, is
However, what is the distance if the two assets are perfectly correlated? Of course, the actual distance would be zero, in such a case. More generally, the actual distance depends on the link between the two returns. An alternative distance metric is TEV (tracking error variance), defined as

TEV ≡ Var(Rp − Rb),

where Rp and Rb are the return of the tracking portfolio and the benchmark, respectively. Then, TEV can be expressed as

TEV = Σi Σj (wi − wib)(wj − wjb) σij,

where σij is the covariance between the returns of assets i and j.
where σij is the covariance between the returns of assets i and j. Straightforward minimization of TEV, subject to a cardinality constraint, yields the following model:
min Σi Σj (wi − wib)(wj − wjb) σij
s.t. Σi wi = 1,
     wi ≤ δi,   i = 1, …, n,     (10.15)
     Σi δi ≤ Cmax,
     wi ≥ 0,   δi ∈ {0, 1}.
This model is a mixed-integer quadratic programming problem with a convex relaxation. Commercial solvers are able to tackle this model structure. Nevertheless, some stochastic search methods, such as the sampling-based algorithms described in Section 10.3, have been proposed as alternatives for large-scale problem instances.
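The objective evaluation needed by such search methods is straightforward; a minimal sketch, assuming a covariance matrix Sigma and benchmark weights wb are available, is:

tev <- function(w, wb, Sigma) {
  d <- w - wb                        # active weights
  as.numeric(t(d) %*% Sigma %*% d)   # tracking error variance
}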
We have introduced the newsvendor problem in Example 10.9. The basic version of the problem can be solved analytically, and the same applies to many interesting variants. Nevertheless, it is instructive, as well as useful for what follows, to consider a scenario-based model formulation. The idea is to approximate the probability distribution of demand by a set of S scenarios, characterized by a probability πs and a demand realization Ds, s = 1, …, S. The main decision variable is the ordered amount q ≥ 0 and, in order to express the objective in cost minimization form, we introduce two non-negative deviations with respect to realized demand:

y+s − y−s = q − Ds,   y+s, y−s ≥ 0,   s = 1, …, S,     (10.16)

where y+s is a surplus and y−s is a shortfall. Then, the model can be formulated as an LP model:

min Σs πs (cu y+s + m y−s)
s.t. y+s − y−s = q − Ds,   s = 1, …, S,
     q ≥ 0,   y+s, y−s ≥ 0.
As we shall discover later, this can be considered as a very simple example of two-stage stochastic LP model. From a practical point of view, it is a basis for more interesting models allowing for interactions among multiple products, e.g., by demand substitution.
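A small scenario-based instance can be solved with the linprog package used earlier; the margin, unsold-item cost, and demand scenarios below are made-up, and surplus/shortfall are modeled by inequality constraints that are tight at the optimum:

library(linprog)
m <- 4; cu <- 3                        # illustrative margin and unsold-item cost
D <- c(80, 95, 100, 110, 130)          # illustrative demand scenarios
prob <- rep(0.2, 5)                    # scenario probabilities
S <- length(D)
# variables: q, then y+ (surplus) per scenario, then y- (shortfall) per scenario
cvec <- c(0, cu*prob, m*prob)
A1 <- cbind(rep(1,S), -diag(S), matrix(0,S,S))   #  q - y+s <= Ds
A2 <- cbind(rep(-1,S), matrix(0,S,S), -diag(S))  # -q - y-s <= -Ds
Amat <- rbind(A1, A2)
bvec <- c(D, -D)
res <- solveLP(cvec, bvec, Amat, maximum=FALSE)
res$solution[1]   # optimal order; here it sits at the scenario matching the critical ratio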
In Section 10.2.1 we have considered a static, single-period asset allocation model. Later, in Section 10.5.2, we will consider a full-fledged multiperiod model, allowing for dynamic portfolio adjustments. An intermediate strategy can be devised by specifying a fixed-mix, i.e., by keeping portfolio weights wi constant over multiple periods. Wealth will change along the way, and to keep the relative weights constant, we will sell high-price assets and buy low-price assets. Thus, the strategy is intrinsically a contrarian one, rather than a trend-following one. Let us also assume that we only care about mean and variance, as we did before, and that our mean–risk objective is the expected value of terminal wealth penalized by its variance. Unlike the static case, expressing variance of the terminal wealth is not easily accomplished; hence, we resort to a set of S return scenarios. Each scenario is a history of total4 asset returns Rits, over a planning horizon consisting of time periods t = 1, …, T. Each scenario is associated with a probability πs, s = 1, …, S. The model we describe here is due to [32], to which we refer the reader for further information and computational experiments, and is basically an extension of the mean–variance framework.
Let W0 be the initial wealth. Then, wealth at the end of time period 1 in scenario s will be

W1s = W0 Σi wi Ri1s.

Note that wealth is scenario-dependent, but the asset allocation is not. In general, when we consider two consecutive time periods, we have

Wts = Wt−1,s Σi wi Rits.

Unfolding the recursion, we see that wealth at the end of the planning horizon is

WTs = W0 Πt=1,…,T (Σi wi Rits).

Within a mean–variance framework, we may build a quadratic utility function depending on the terminal wealth. Given a parameter λ related to risk aversion, the objective function is

E[WT] − λ Var(WT).
To express the objective function, we recall that Var(X) = E[X²] − E²[X] and write the model as
This looks like a very complex problem; however, while the objective function is a bit messy, the constraints are quite simple. The real difficulty is that this is a nonconvex problem. To see why, just note that the objective turns out to be a polynomial in the decision variables; since polynomials may have many minima and maxima (see Example 10.3), we face a nonlinear nonconvex problem. Once again, stochastic search algorithms, based on Monte Carlo sampling, can be used to cope with these problems.
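To make this concrete, a sketch of the scenario-based objective function is given below; it assumes a three-dimensional array R of gross returns indexed by asset, time period, and scenario, and could be handed to any of the stochastic search methods of Section 10.3 (the function and argument names are ours):

fixmix_obj <- function(w, R, W0=1, lambda=0.5, prob=NULL) {
  nS <- dim(R)[3]                       # number of scenarios
  if (is.null(prob)) prob <- rep(1/nS, nS)
  WT <- numeric(nS)
  for (s in 1:nS) {
    W <- W0
    for (t in 1:dim(R)[2]) W <- W * sum(w * R[, t, s])   # wealth recursion
    WT[s] <- W
  }
  EW <- sum(prob * WT); EW2 <- sum(prob * WT^2)
  EW - lambda * (EW2 - EW^2)            # E[W_T] - lambda*Var(W_T), to be maximized
}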
Asset pricing is a central problem in financial engineering, as well as in financial economics and corporate finance. We know from Section 3.9.2 that the price of a European-style option can be expressed as the discounted expected payoff, under the risk-neutral measure. Assuming a constant risk-free rate r, with continuous compounding, we have

f(S0) = e^{−rT} EQ[Φ(ST)],     (10.18)
where S0 is the underlying asset price now and Φ(ST) is the option payoff, depending on the underlying asset price ST at maturity. In principle, pricing a European-style option by Monte Carlo methods is easily accomplished, provided that we can generate sample paths efficiently; if we do not consider hedging issues, we have just to sit down, cross our fingers, and wait until maturity. However, there are assets whose value depends on our choices along the way. Generally speaking, the value of a resource depends on how the resource itself is used. American-style options, unlike their European-style counterparts, can be exercised at any date prior to expiration. One easy conclusion is that an American-style option cannot cost less than the corresponding European one, as it offers more opportunities. However, early exercise opportunities make pricing much harder, as we should consider an optimal exercise strategy, which in turn calls for the solution of a stochastic dynamic optimization problem. At each exercise opportunity, if the option is in-the-money, we have to compare the intrinsic value of the option, i.e., the immediate payoff that we may earn from exercising the option early, and the continuation value, i.e., the value of postponing exercise and waiting for better opportunities.
Formally, the price of an American option can be written as
f(S0) = sup τ EQ[e^{−rτ} Φ(Sτ)].     (10.19)
The difference with Eq. (10.18) is that now we have to solve an optimization problem with respect to a stopping time τ. The term “stopping time” has a very precise meaning in the theory of stochastic processes, but here we may simply interpret it as the time at which we exercise the option, according to some well-defined strategy.
A put option is in-the-money at time t if S(t) < K. Immediate exercise in this case yields a payoff K − S(t). Clearly, early exercise will not occur if the option is not in-the-money, but even if S(t) < K it may be better to keep the option alive and wait. This means that early exercise will occur only if the option is “enough” in-the-money. By how much depends, in general, on the time to expiration. When we are close to option expiration, there are fewer opportunities, and we are more inclined to exercise; but when we are far from expiration, we are more inclined to wait. Formally, we have to find a boundary separating the early exercise region, where the intrinsic value is larger than the continuation value, from the continuation region. Qualitatively, for an American put option we would expect an early exercise boundary like the one depicted in Fig. 10.8. This boundary specifies a stock price S*(t) such that if S(t) < S*(t), i.e., if the option is deeply enough in-the-money, then we exercise. If we are above the boundary, we are in the continuation region, and we keep the option alive. Finding the boundary is what makes the problem difficult; for a detailed treatment of the exercise boundary for American options see, e.g., [31, Chapter 4].
FIGURE 10.8 Qualitative sketch of the early exercise boundary for a vanilla American put. The option is exercised within the shaded area.
From the last example, we understand that the time at which the option will be exercised is indeed a random variable, in this case the time at which we cross the exercise boundary, if we cross it. More complicated strategies are conceivable and must be applied to exotic and multidimensional options, and this is what the stopping time captures. The stopping time is related to any conceivable strategy, provided that it is nonanticipative; we may only use information gathered from time 0 up to time t. From a practical perspective, in order to price an option with early exercise features, we have to solve a possibly nontrivial optimization problem. For low-dimensional options, numerical methods based on suitable discretizations of the state space, like binomial or trinomial lattices, or finite-difference methods to solve partial differential equations work very well. For high-dimensional or path-dependent options, we have to resort to Monte Carlo methods. In the past, this was considered impossible, but by framing the problem within a stochastic dynamic programming framework, we may tackle it, as we shall see in Section 11.3.
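For a low-dimensional taste of the comparison between intrinsic and continuation values, the following sketch prices a vanilla American put on a CRR binomial lattice (parameter values are illustrative):

amer_put_crr <- function(S0, K, r, sigma, Tmat, N=500) {
  dt <- Tmat/N
  u <- exp(sigma*sqrt(dt)); d <- 1/u
  p <- (exp(r*dt) - d)/(u - d)            # risk-neutral up probability
  disc <- exp(-r*dt)
  S <- S0 * u^(N:0) * d^(0:N)             # asset prices at maturity
  V <- pmax(K - S, 0)                     # payoff at maturity
  for (k in N:1) {
    S <- S0 * u^((k-1):0) * d^(0:(k-1))   # asset prices one step earlier
    cont <- disc * (p*V[1:k] + (1-p)*V[2:(k+1)])   # continuation value
    V <- pmax(cont, K - S)                # exercise early if intrinsic value is larger
  }
  V[1]
}
amer_put_crr(S0=36, K=40, r=0.06, sigma=0.2, Tmat=1)   # about 4.48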
We have introduced maximum-likelihood estimation in Section 4.2.3. Parameter values are chosen in such a way as to maximize a likelihood function related to a selected probability distribution. In very simple cases, the solution is easily found analytically. In nontrivial cases, numerical optimization methods are needed for the solution of the resulting problems, possibly because of the complexity of the likelihood function or the addition of constraints on parameter values, on the basis of some additional information we might have. Furthermore, the resulting problem may be nonconvex and call for global optimization strategies.
A similar kind of problem is encountered when we have to calibrate a pricing model. For instance:
At time t we observe a set of bond prices Pk(t), k = 1, …, n. Let us assume for the moment that the cash flows associated with these bonds, i.e., the payment of coupons along the way and the face value at bond maturity, occur at a set of common time instants Ti, i = 1, …, m. Let r(t, T) denote the continuously compounded interest rate applying over the time interval [t, T]. Based on observed bond prices, we want to estimate r(t, T) for a range of maturities. According to straightforward bond pricing formulas, the bond prices should be given by

P̂k(t) = Σi ck,i e^{−r(t,Ti)(Ti − t)},   k = 1, …, n,

where ck,i is the cash flow from bond k at time Ti. These predicted prices are compared against the observed prices Pko(t). If m = n, the number of unknown parameters and prices are the same, and we may solve a system of nonlinear equations to find the interest rates for maturities Ti, where we impose P̂k(t) = Pko(t). A textbook approach, known as bootstrapping, assumes that exactly one bond will mature at each time instant Ti; then, the system of equations has a “triangular” structure and we can solve one equation at a time, finding one rate per equation. In practice, it may be difficult to find a suitable set of bonds lending itself to such a simplified solution strategy. Liquidity and other issues may affect bond prices, introducing noise and some irregularity in the resulting yield curve.
To smooth those effects, it is preferable to have more bonds than unknown rates, n > m, and minimize an error distance between observed and predicted prices, with respect to rates:

min Σk [P̂k(t) − Pko(t)]².
This is a nonlinear least-squares problem that can be solved by numerical optimization methods. However, it is typically desirable not to consider a restricted set of cash flow times Ti, in order to widen the set of bonds we may use for calibration. Furthermore, we would like to fit a term structure with some reasonable degree of smoothness, in such a way that we can estimate rates for maturities different from the aforementioned cash flow times Ti. One way to accomplish this is by spline interpolation, but due to possible pricing anomalies exact interpolation may yield unreasonable results, and least-squares approximation may be preferred. More generally, we may consider a set of basis functions φj(T), j = 1, …, J, whose linear combination

r(t, T) = Σj αj φj(T)

gives the yield curve as a function of maturity T. Here, the decision variables of the optimization problem are the coefficients αj, and we might also consider estimating the discount curve directly, which is then recast as the yield curve. The use of basis functions to boil down an infinite-dimensional problem to a finite-dimensional one is very common, and we will take advantage of the idea later, when dealing with numerical dynamic programming.
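As an illustration, the following sketch fits a yield curve by least squares with a simple polynomial basis; the basis choice, the data layout (a matrix cashflows with one row per bond and one column per cash flow time), and all names are our own assumptions:

yc_error <- function(alpha, Ti, cashflows, Pobs) {
  r <- as.vector(outer(Ti, seq_along(alpha)-1, "^") %*% alpha)  # r(t,Ti) from the basis
  disc <- exp(-r * Ti)                       # discount factors
  Phat <- as.vector(cashflows %*% disc)      # predicted bond prices
  sum((Phat - Pobs)^2)                       # squared pricing errors
}
# fit <- optim(par=c(0.03, 0, 0), fn=yc_error, Ti=Ti,
#              cashflows=cashflows, Pobs=Pobs, method="BFGS")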
The calibration of the yield curve on the basis of quoted prices is an example of an inverse problem. Model calibration is also a cornerstone of option pricing in incomplete markets. As we have seen in Section 3.9.3, market incompleteness arises whenever we step out of the safe BSM domain based on a GBM process with very simple drift and volatility terms. If volatility is stochastic or the underlying variable is nontradable, we may need a more sophisticated approach. In Eq. (3.126) we have pointed out the existence of an unknown function m*(·, ·) serving as a drift in an Itô differential equation, under a pricing measure. We report the equation here for the sake of convenience, recasting it in terms of a generic state variable Xt:

dXt = m*(Xt, t) dt + s(Xt, t) dWt.
Once again, note that for pricing purposes the drift term is related to a risk-neutral measure, and it cannot be estimated on the basis of a time series of Xt. By a similar token, we may not want to fit the volatility function on the basis of historical values of volatility; rather, we want to extract the current market view, which is implicit in asset prices. Hence, we are faced with an inverse problem. Depending on the kind of dependence on time, the functions m*(·, ·) and s(·, ·) may depend on a finite or infinite set of parameters. The second case occurs when these terms explicitly depend on time, in order to improve their matching performance against quoted prices. Whatever the case, by a suitable discretization mechanism, possibly based on basis functions, interpolation, and whatnot, we end up with a pricing model depending on a vector of parameters α. The pricing model predicts a price P̂k(α) for a set of securities with quoted (observed) price Pko, and it is calibrated by solving the optimization problem

min α  Σk [P̂k(α) − Pko]².
One might wonder why we should calibrate a model by fitting a set of unobservable parameters against quoted prices of exchange-traded instruments. Clearly, the aim is not to recover quoted prices per se, but to fit a model in order to price over-the-counter assets and to assess option price sensitivity with respect to a set of risk factors.
Depending on the complexity of the pricing model, we may end up with relatively simple optimization models, amenable to classical nonlinear programming procedures (most notably, the Levenberg–Marquardt algorithm), or an extremely difficult global optimization problem. In the latter case, the Monte Carlo based global optimization methods that we describe in the next section might be the only viable solution method.
When dealing with a convex optimization problem featuring a well-defined objective function, we may take advantage of useful information provided by its gradient and Hessian matrix. Unfortunately, it may be the case that the objective function is badly behaved, i.e., nonconvex, possibly discontinuous, or even not known in analytical form. Then, we may resort to algorithms that rely only on function evaluations, possibly obtained by stochastic simulation, and a search mechanism to explore the solution space randomly. Monte Carlo methods may be used at both levels:
As the reader can imagine, there is no clear-cut line separating the two ideas, as random sampling may be employed at both levels.
For any specific decision problem it is possible to devise an ad hoc heuristic method providing us with a good, possibly near-optimal, solution. Here we only consider fairly general strategies, which may be customized to the specific problem at hand. Indeed, a wide class of such methods is known under the name of metaheuristics, a term that emphasizes their generality and flexibility: They are able to cope with continuous nonconvex problems as well as discrete optimization problems, like optimal portfolio tracking, where the nonconvexity arises from the feasible set, which includes binary decision variables. We first introduce local search and other metaheuristic principles in Section 10.3.1. Then, in Section 10.3.2 we describe simulated annealing, a simple stochastic search method inspired by an interesting physical analogy. Other methods, such as genetic algorithms (Section 10.3.3) and particle swarm optimization (Section 10.3.4), rely on analogies with biological systems and, unlike simulated annealing, work on a population of solutions, rather than a single one.
Local search is a wide class of simple exploration strategies based on the idea of generating a sequence of solutions exploring the neighborhood of the current one. Consider a generic optimization problem

minx∈S f(x),

defined over a feasible set S. The basic idea is to improve a given feasible solution by the application of a set of local perturbations. We associate a feasible solution x with a neighborhood N(x), which is defined by applying a set of simple perturbations to x. Different perturbations yield different neighborhood structures. The simplest local search algorithm is local improvement: Given a current (incumbent) solution x0, an alternative (candidate) solution x′ is searched for in the neighborhood of the incumbent:

x′ ∈ N(x0).

If the candidate solution improves the objective function, i.e., f(x′) < f(x0), then the incumbent solution is replaced by the candidate and the process is repeated. The search is terminated when we cannot find any improving solution, and the incumbent solution is returned. Clearly, there is no guarantee that an optimal solution is found, as the process may get trapped in a bad locally optimal solution. Furthermore, the solution is locally optimal with respect to the selected neighborhood structure, which must be simple enough to allow for an efficient search, but not so simplistic that the exploration process is ineffective. If the candidate solution is the optimal solution in the neighborhood, we speak of a best-improving local search; if we stop exploring the neighborhood as soon as an improving solution is found, we speak of a first-improving local search.
Neighborhood structures may be devised for both discrete and continuous optimization problems. If all the components of x ∈ ℝⁿ are restricted to binary values, a simple neighborhood structure is obtained by complementing each variable in turn, i.e., for each i = 1, …, n we set

x̃i = 1 − xi,   x̃j = xj for j ≠ i.

The neighborhood consists of n points and it is easy to explore, but possibly very poor. We may also consider complementing pairs of variables, as well as swapping the values of variables having different values.
The case of continuous variables adds a degree of difficulty, as it is not clear by how much we should perturb each component. We may define steps δi for each component and generate 2n alternative solutions by applying the perturbation in both directions:

x′ = x0 ± δi ei,   i = 1, …, n,

where ei is a unit vector with 1 in position i and 0 in the other positions. This is the simplest pattern for searching the neighborhood, but some mechanism should be devised to dynamically adjust the step lengths.
Alternatively, we may generate one random candidate at a time, by sampling randomly in the neighborhood of the incumbent solution using a suitable probability distribution. For instance, we may adopt a multivariate normal distribution with expected value x0 and a diagonal covariance matrix Σ. This approach lends itself to a first-improving strategy.
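A minimal sketch of such a first-improving local search with a Gaussian neighborhood (names and defaults are ours) is:

local_search <- function(fn, x0, sigma=1, maxit=10000) {
  x <- x0; fx <- fn(x)
  for (k in 1:maxit) {
    cand <- x + rnorm(length(x), sd=sigma)   # sample a random neighbor of the incumbent
    fc <- fn(cand)
    if (fc < fx) { x <- cand; fx <- fc }     # accept only improving candidates
  }
  list(par=x, value=fx)
}
# local_search(function(x) sum(x^2), x0=c(5, -3))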
Actually, devising a clever and effective neighborhood is not trivial at all. A further complication is that due attention must be paid to constraints. In some cases we might just discard solutions violating them; however, this is not easily accomplished when dealing with equality constraints, as staying inside the feasible region might be difficult. An alternative strategy is to relax constraints and augment the objective function by a corresponding penalty function. For instance, the constrained problem
may be transformed into the unconstrained problem
for suitably large penalty coefficients ωi, possibly expressing the relative importance of constraints. The penalty coefficients might also be dynamically adjusted, allowing the search process to occasionally leave the feasible region and get back into it at another point. This kind of approach, called strategic oscillation, may be used, along with other tricks of the trade, to balance exploitation (the need to stick to a good solution and explore its neighborhood) and exploration (the need to explore other subregions of the feasible set).
Local improvement is not suitable for global optimization. However, one may relax the requirement that the new solution in the search path is an improving one. By accepting nonimproving candidates, we may hope to escape from local optima. Two such strategies are:
Both simulated annealing and tabu search work with a single incumbent solution. In order to find a better trade-off between exploration and exploitation, one may work with a population of solutions. Examples of population-based approaches are genetic algorithms and particle swarm optimization.
Simulated annealing owes its name to an analogy between cost minimization in optimization and energy minimization in physical systems. The local improvement strategy behaves much like physical systems do according to classical mechanics. It is impossible for a system to have a certain energy level at a certain time instant and then to increase it without external input: If you place a ball in a hole, it will stay there. This is not true in statistical mechanics; according to these physical models, at a temperature above absolute zero, thermal noise makes an increase in the energy of a system possible. This applies to changes at a microscopic level, and an increase in energy is more likely to occur at high temperatures. The probability P of this upward jump depends on the increase in energy ΔE and the temperature T, according to the Boltzmann distribution

P = exp(−ΔE / (KT)),
where K is the Boltzmann constant. Annealing is a metallurgical process by which a melted material is slowly cooled in order to obtain a good (low-energy) solid-state configuration. If the temperature is decreased too fast, the system gets trapped in a local energy minimum, and a glass is produced. But if the process is slow enough, random kinetic fluctuations due to thermal noise allow the system to escape from local minima, reaching a point very close to the global optimum.
In simulated annealing we consider the change in the objective function when moving from the incumbent solution x0 to a candidate solution x′:

Δf = f(x′) − f(x0).

The candidate solution is accepted with probability

P(accept) = min{1, exp(−Δf / T)}.

An improving candidate yields Δf < 0 and is always accepted. A nonimproving candidate is randomly accepted, according to a probability depending on the amount of cost increase Δf and the control parameter T, as shown in Fig. 10.9. At very high temperatures, most candidate solutions are accepted, whereas at intermediate temperatures candidate solutions that increase cost are accepted with lower probability if the deterioration is significant. For T → 0 the acceptance probability collapses into a step function, and the method behaves like local improvement. In Section 14.4 we will interpret simulated annealing within the framework of Markov chain Monte Carlo methods, as a way to sample from a density that is concentrated on the set of globally optimal solutions. However, to reach that set without getting trapped in local optima, the temperature should be decreased according to a cooling schedule. The simplest cooling schedule is

Tk+1 = βTk,   0 < β < 1,

i.e., the temperature is multiplied at each step by a constant factor β slightly less than 1.
FIGURE 10.9 Acceptance probabilities as a function of cost increase for different temperatures in simulated annealing.
In practice, it is advisable to keep the temperature constant for a certain number of steps, in order to reach a thermodynamic equilibrium before changing the control parameter. More sophisticated adaptive cooling strategies have been proposed. Furthermore, we have to establish a termination criterion. The simplest idea is to stop the algorithm after a given number of iterations; alternatively, we may stop after a maximum number of iterations without accepting a candidate.
We show an extremely naive implementation of simulated annealing in Fig. 10.10, for illustrative purposes only. We may test it on the Rastrigin function5

f(x) = 10n + Σi [xi² − 10 cos(2πxi)],
FIGURE 10.10 A naive implementation of simulated annealing.
which has its global optimum at x = 0:
> Rastrigin <- function(x) sum(x^2-10*cos(2*pi*x)) + 10*length(x)
> dim <- 20
> lb <- rep(-100,dim)
> ub <- rep(100,dim)
> set.seed(55555)
> x0 <- runif(dim,min=lb,max=ub)
> out <- optim(par=x0,fn=Rastrigin,lower=lb,upper=ub,
+              method="L-BFGS-B")
> out[c("value","counts")]
$value
[1] 1361.003
$counts
function gradient 
      43       43 
> set.seed(55555)
> out <- NaiveSA(fn=Rastrigin,par=x0,temp0=100,
+                beta=0.9999,maxit=200000,sigma=1)
> out[c("value","counts")]
$value
[1] 125.437
$counts
[1] 2e+05
We use n = 20 and start the search from a randomly selected point within a box. For compatibility with other R functions that we want to use for comparison purposes, we constrain all decision variables to a box with lower bound −100 and upper bound 100. A classical nonlinear programming method, available as an option in the standard R function optim, quickly gets stuck in a bad solution, whereas the naive annealing does better. Actually, we may take advantage of a better implementation of simulated annealing offered by optim:
> set.seed(55555)
> out <- optim(par=x0,fn=Rastrigin,method="SANN",
+              control=list(temp=100,tmax=10,maxit=200000))
> out[c("value","counts")]
$value
[1] 81.33197
$counts
function gradient 
  200000       NA 
Here, tmax is the number of iterations (function evaluations) carried out at each temperature before it is decreased. Still better results are obtained by function GenSA offered in the package having the same name:
> library(GenSA)
> set.seed(55555)
> out <- GenSA(fn=Rastrigin, lower=lb, upper=ub, par=x0)
> out[c("value","par","counts")]
$value
[1] 0
$par
 [1]  1.149916e-11  8.491049e-12 -1.670109e-11 -3.363457e-11 -2.144013e-11
 [6] -4.907911e-13 -3.319979e-11 -3.287746e-11  6.624013e-12 -2.850014e-11
[11]  3.173777e-11 -1.580739e-11 -2.003915e-13 -2.192858e-09  2.288534e-11
[16]  3.180884e-11  1.039823e-11 -7.188424e-12  3.151424e-11  1.477490e-11
$counts
[1] 257528
This function implements a generalized simulated annealing algorithm; see [44] for more information.
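The NaiveSA routine of Fig. 10.10 is not reproduced here; a minimal sketch consistent with the call used above (Gaussian neighborhood with standard deviation sigma, geometric cooling with factor beta) could read as follows, although it will not reproduce the printed figures exactly:

NaiveSA <- function(fn, par, temp0, beta, maxit, sigma) {
  x <- par; fx <- fn(x)
  xbest <- x; fbest <- fx
  temp <- temp0
  for (k in 1:maxit) {
    cand <- x + rnorm(length(x), sd=sigma)              # random neighbor
    fc <- fn(cand)
    if (fc < fx || runif(1) < exp(-(fc - fx)/temp)) {   # Metropolis acceptance rule
      x <- cand; fx <- fc
      if (fx < fbest) { xbest <- x; fbest <- fx }
    }
    temp <- beta*temp                                   # geometric cooling
  }
  list(par=xbest, value=fbest, counts=maxit)
}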
Unlike simulated annealing, genetic algorithms work with a set of solutions rather than a single point. The idea is based on the survival-of-the-fittest mechanism of biological evolution, whereby a population of individuals evolves by a combination of the following elements:
There is considerable freedom in the way the above idea is translated into a concrete algorithm. A critical issue is how to balance exploration and exploitation. On the one hand, biasing the selection of the new population toward high-quality individuals may lead to premature convergence; on the other hand, unless a certain degree of elitism is used to keep good individuals in the population, the algorithm might wander around and miss the opportunity to explore promising subregions of the solution space.
A key ingredient is the encoding of a feasible solution as a string of features, corresponding to the genes of a chromosome. For continuous optimization in ℝ^n, the choice of representing each solution by the corresponding vector x is fairly natural.6 Let us consider the simplest form of crossover: Given two individuals x and y in the current pool, a "breakpoint" position k ∈ {1, 2, …, n} is randomly selected and two offspring are generated as follows:
x′ = (x1, …, xk, yk+1, …, yn),   y′ = (y1, …, yk, xk+1, …, xn).
Variations on the theme are possible; for instance, a double crossover may be exploited, in which two breakpoints are selected for the crossover. The two new solutions are clearly feasible for an unconstrained optimization problem in ℝ^n; when constraints are added, this guarantee is lost in general. As we have already mentioned, one can enforce hard constraints by eliminating noncomplying individuals. A usually better alternative is a relaxation by a suitable penalty function.
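As a concrete illustration of the single-point crossover just described (a sketch of the mechanism, not the code of any specific package), we might write:

crossover <- function(x, y) {
  # single-point crossover of two real-coded parents x and y:
  # pick a random breakpoint k and swap the tails
  n <- length(x)
  k <- sample(1:(n-1), 1)
  list(c(x[1:k], y[(k+1):n]), c(y[1:k], x[(k+1):n]))
}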
There are a few packages implementing genetic algorithms in R. Figure 10.11 shows how to use one of them, GA, to minimize the Rastrigin function. Note that we have to set the "real-valued" option to specify that we are dealing with a continuous optimization problem, and that there are other settings aimed at discrete optimization. Furthermore, genetic algorithms typically maximize a fitness function; hence, we have to change the sign of the Rastrigin function. We obtain the following (edited) output:
FIGURE 10.11 Using the GA package to minimize the Rastrigin function.
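A call along the lines of Fig. 10.11 might look as follows; the population size, the number of generations, and the seed are illustrative choices, default crossover and mutation settings are used, and, depending on the package version, the bounds may have to be passed as min/max rather than lower/upper:

library(GA)
GArun <- ga(type="real-valued",
            fitness=function(x) -Rastrigin(x),   # ga maximizes, so we flip the sign
            lower=lb, upper=ub,
            popSize=50, maxiter=1000, seed=55555)
summary(GArun)
-GArun@fitnessValue    # value of the Rastrigin function at the best solution found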
We notice that some options related to crossover and mutation probabilities, as well as the degree of elitism, have been set to default values. We may also provide the ga function with customized functions to carry out mutation, crossover, etc. Genetic algorithms offer plenty of customization options, which is a mixed blessing, as it may be awkward to find a robust and satisfactory setting. It is also possible to collect the best solutions found along the way and to refine them by a classical nonlinear programming method, finding a local minimizer nearby.
Particle swarm optimization (PSO) methods are an alternative class of stochastic search algorithms, based on a population of m particles exploring the space of solutions, which we assume is a subset of ℝ^n. The position of each particle j in the swarm is a vector
which changes in discrete time (t = 1, 2, 3, …) according to three factors:
The idea is to mimic the interactions of members of a swarm looking for food, and a classical statement of the algorithm is the following:
Equation (10.20) governs the evolution of each component i = 1,…, n of the velocity of each particle j = 1, …, m. The new velocity depends on:
Equation (10.21) simply changes each component of the current position according to the new velocity. At each iteration, the personal and global bests are updated when necessary, and other adjustments are used in order to keep velocity within a given range, as well as the positions, if bounds on the variables are specified.
PSO can be regarded as another Monte Carlo search approach, and several variants have been proposed (see the end of chapter references):
Some of these algorithms have a physical motivation, whereas some have a biological motivation; they differ in the number of coefficients to be set and the corresponding complexity of fine tuning, though some self-adaptive methods are available. As the reader can imagine, it is easy to get lost because of the sheer number of proposals and combinations. The key is to experiment with some of these variants and find the best combination for the specific problem at hand. A welcome tool is the pso R package, which implements some variants of the approach. The code in Fig. 10.12 shows how to use the psoptim function to minimize the Rastrigin function. The function has a similar interface to the standard optim function, plus several control parameters governing the maximum number of iterations (maxit), the tracing of execution (trace and REPORT), as well as the possibility of carrying out a local optimization for each particle using a standard nonlinear optimizer (if hybrid is set). The code produces the following output:
FIGURE 10.12 Using the pso package to minimize the Rastrigin function.
> out <- psoptim(par=x0,fn=Rastrigin,lower=lb,upper=ub,
+     control=list(maxit=10000,trace=1,REPORT=1000,
+     hybrid=F))
S=18, K=3, p=0.1576, w0=0.7213, w1=0.7213, c.p=1.193, c.g=1.193, v.max=NA, d=894.4, vectorize=FALSE, hybrid=off
It 1000: fitness=27.86
It 2000: fitness=27.86
It 3000: fitness=27.86
It 4000: fitness=27.86
It 5000: fitness=27.86
It 6000: fitness=27.86
It 7000: fitness=27.86
It 8000: fitness=27.86
It 9000: fitness=27.86
It 10000: fitness=27.86
Maximal number of iterations reached
> out <- psoptim(par=x0,fn=Rastrigin,lower=lb,upper=ub,
+     control=list(maxit=1000,trace=1,REPORT=100,
+     hybrid=T))
S=18, K=3, p=0.1576, w0=0.7213, w1=0.7213, c.p=1.193, c.g=1.193, v.max=NA, d=894.4, vectorize=FALSE, hybrid=on
It 100: fitness=35.82
It 200: fitness=28.85
It 300: fitness=24.87
It 400: fitness=21.89
It 500: fitness=18.9
It 600: fitness=14.92
It 700: fitness=14.92
It 800: fitness=14.92
It 900: fitness=8.955
It 1000: fitness=6.965
Maximal number of iterations reached
Indeed, we notice that the hybrid algorithm is more effective, at the cost of a possibly larger computational effort. Some experimentation with control parameters may be needed to fine tune the algorithm performance.
In the previous section we have considered Monte Carlo-based stochastic search approaches to optimize a deterministic objective function. The difficulty was in the nonconvexity of the objective function, but not in its evaluation. When dealing with optimization under uncertainty, as in the case
the function evaluation itself may be difficult. Thus, we may be forced to adopt Monte Carlo sampling to obtain noisy estimates of the objective function. In well-structured problems, we may build a formal optimization model, which can be tackled by stochastic programming with recourse or stochastic dynamic programming, to be discussed later. However, sometimes the system is so complicated that we may only estimate the objective function by some stochastic simulator acting as a black box: We feed the black box with x and we obtain an estimate of the expected performance. As a consequence, nice properties such as continuity and differentiability may not always be guaranteed. Population-based approaches, like genetic algorithms and particle swarm optimization, may help, since they do not place any restriction on the objective function; furthermore, since they work on a population, they tend to be less vulnerable to estimation errors. We may also keep a pool of the most promising solutions found, which can then be simulated with greater accuracy in order to select the final solution.
Here we discuss two approaches which may be used for simulation-based optimization:
We should note that these methods are not aimed at global optimization, unlike those discussed in the previous section. Nevertheless they may be more effective and efficient in specific settings.
The simplex search algorithm, also known as the Nelder–Mead method, is a deterministic search strategy relying on function evaluations only. Rather than working with a single point, it uses a simplex of n+1 points in ℝ^n. A simplex in ℝ^n is the convex hull of a set of n+1 affinely independent points x1, …, xn+1.7 The convex hull of a set of points is just the set of points that may be obtained by taking convex combinations (linear combinations whose weights are non-negative and add up to 1) of the elements of the set. For instance, in two dimensions a simplex is simply a triangle, whereas in three dimensions it is a tetrahedron.
The rationale behind the method is illustrated in Fig. 10.13 for a minimization problem in ℝ². In this case, the simplex search generates a sequence of sets consisting of three points, in which the worst one is discarded and replaced by a new one. For instance, let us assume that x3 is associated with the worst objective value; then, it seems reasonable to move away from x3 by reflecting it through the center of the face formed by the other two points, as shown in the figure. A new simplex is obtained and the process is repeated. The generation of the new point is easily accomplished algebraically. If xn+1 is the worst point, we compute the centroid of the other n points as
x̄ = (1/n) Σ_{i=1}^{n} xi,
FIGURE 10.13 Reflection of the worst value point in the Nelder–Mead simplex search procedure.
and we try a new point of the form
xr = x̄ + α (x̄ − xn+1).
Clearly, the key issue is finding the right reflection coefficient α > 0. If xr turns out to be even worse than xn+1, we may argue that the step was too long, and the simplex should be contracted. If xr turns out to be the new best point, we have found a good direction and the simplex should be expanded. Different strategies have been devised in order to improve the convergence of the method.8 Although the simplex search procedure has its merits, it does not overcome the possible difficulties due to the nonconvexity of the objective function or the discrete character of some decision variables. In fact, it was originally developed for experimental response surface optimization.
The Nelder–Mead algorithm is offered as an option in the basic optim function. Furthermore, there are some R packages implementing variants of this and other derivative-free optimization methods, such as neldermead and dfoptim. The package optimsimplex offers a framework to implement specific variants of simplex search.
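For instance, a direct attempt on the Rastrigin function defined above might look like the following; being a local search method, Nelder–Mead should not be expected to locate the global optimum here, and results will depend on the starting point:

out <- optim(par=x0, fn=Rastrigin, method="Nelder-Mead",
             control=list(maxit=5000))   # no bounds: they require L-BFGS-B
out[c("value","counts")]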
We have discussed several derivative-free algorithms. Avoiding reliance on information provided by the gradient and the Hessian matrix of the objective function is certainly a good idea when we are dealing with a badly behaved function. However, in other cases, we may miss an opportunity to better drive the search process. Metamodeling aims to build an approximation of the objective function
on the basis of simulation experiments. Hence, we use a simulation model as a black box, whose response surface is modeled by a suitably chosen approximating function. This kind of higher-level model explains the name of the overall approach.
If we want to find the gradient ∇h(xk) at some point xk, we may build a local linear approximator like
Such a model can be easily found by linear regression. Then, we just set
At first sight, the gradient does not depend on xk. However, the linear metamodel does depend on that point, as the simulation experiments that we use in the linear regression have to be carried out in its neighborhood. They should not be placed too close, otherwise the noise in the evaluation of f(·, ·) will be overwhelming. However, they should not be placed too far away either, as the linear approximation is likely to be valid only in a neighborhood of xk. Statistical methods for design of experiments can be put to good use to limit the number of simulation runs needed to feed the linear regression model. Variance reduction strategies, like common random numbers, may also help.
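A minimal sketch of gradient estimation by a local linear metamodel is the following, where sim is a hypothetical noisy simulator returning an estimate of the objective at a given point, and the sampling radius delta and the number of runs are illustrative choices:

estGradient <- function(sim, xk, delta=0.5, nRuns=50) {
  # sample design points uniformly in a box around xk, run the simulator,
  # fit a linear metamodel by least squares, and return the slopes
  n <- length(xk)
  X <- sweep(matrix(runif(nRuns*n, -delta, delta), nrow=nRuns), 2, xk, "+")
  y <- apply(X, 1, sim)
  fit <- lm(y ~ X)
  coef(fit)[-1]    # estimated gradient (the intercept is dropped)
}

The resulting estimate may then be plugged into a steepest descent step of the form xk − αk·estGradient(sim, xk), along the lines discussed next; proper design of experiments and common random numbers would improve on this brute-force sampling.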
The estimated gradient may be used within standard nonlinear programming algorithms to generate a sequence of points like
for a step length αk > 0. This is the steepest descent (or gradient) method, which suffers from poor convergence. One way to improve performance is to build a quadratic metamodel
If the square matrix Γ is positive semidefinite, the above quadratic form is convex and it is easily minimized by solving a system of linear equations, yielding a new approximation xk+1 of the optimal solution. A new quadratic approximation is built, centered on the new point, and the procedure is repeated until some convergence condition is met. This kind of approach is related to sequential quadratic programming methods for traditional nonlinear programming. It may be advisable to use a linear metamodel at the early stages of the process, since this may result in a faster movement toward the minimizer; then, when convergence of the gradient method becomes problematic, the procedure may switch to the quadratic metamodel.
A noteworthy feature of metamodeling is that it can cope with constraints on the decision variables, since we solve a sequence of nonlinear programming models, whereas alternative approaches have to resort to penalty functions. An obvious disadvantage is that it is likely to be quite expensive in computational terms, if many simulation experiments are needed to filter random noise and find a good metamodel. Alternative methods, such as perturbation analysis, have been proposed to estimate sensitivities with a single simulation run. We will consider a few simple application examples to financial derivatives in Chapter 12.
Earlier in the chapter we have introduced decision models under uncertainty. In this section we describe one possible modeling framework, stochastic programming with recourse. For the sake of simplicity, we will just deal with linear programming (LP) models under uncertainty, but the approach can be extended to nonlinear optimization models. We begin with two-stage models in Section 10.5.1. Then, we generalize the approach to multistage problems in Section 10.5.2. Finally, we stress important issues related to scenario generation in Section 10.5.3. An alternative approach to dynamic decision making under uncertainty, stochastic dynamic programming, will be the subject of the last sections of the chapter.
To get a clear picture on how uncertainty can be introduced into an optimization model, let us consider an uncertain LP problem:
where all problem data depend on random variables ξ with support Ξ. Stated as such, the model is not even posed in a sensible way, since the objective function is a random variable, i.e., a function of the underlying events, and the minimization is not well defined. When we select a decision, we only select the probability distribution of cost or profit, and in order to assign a proper meaning to the problem, we must specify how probability distributions are ranked. The most natural approach is to take the expected value of the objective, if we ignore risk aversion.9 However, even if we settle for some sensible optimality criterion, there are serious issues with feasibility. We should wonder whether we can require feasibility for any possible contingency, i.e.,
This is clearly not possible in general, as we get an inconsistent set of equations. The case of inequality constraints is a bit easier. Nevertheless, requiring
yields a fat solution that may be overly expensive, assuming that such a feasible solution exists. More often than not, we must accept a controlled violation of constraints. One possibility is to settle for a solution with a probabilistic guarantee:
for a suitably small α. This idea leads to chance-constrained models, which have a sensible interpretation in terms of solution reliability. Chance-constrained models are indeed useful in some cases, but they suffer from definite limitations:
An alternative modeling framework, to which one may resort to overcome these issues, is stochastic programming with recourse. Let us consider the stochastic constraints
As we have pointed out, it is impossible to satisfy these constraints with an assignment of x that is valid for any realization of the random variable ξ. However, we may think of “adjusting” the solution, after observing uncertainty, by some action represented by decision variables y(ξ):
These variables are called recourse variables, and they are random in the sense that they are adapted to the random variable ξ, as they are made after uncertainty is resolved. They are, therefore, second-stage decisions corresponding to contingency plans, whereas the x are first-stage, here-and-now, decisions. The matrix W is called the recourse matrix and represents the "technology" that we use to adjust first-stage decisions. Its structure has a remarkable impact on the difficulty of the problem, and it is worth noting that we are assuming deterministic recourse, as the matrix W is not random. Problems with stochastic recourse are generally more difficult and are subject to some pathologies.
In Section 10.2.3 we have formulated the newsvendor model within a scenario-based framework. The first-stage decision q, the amount we produce, should ideally match demand Ds in every scenario, which is clearly impossible. In the model of Eq. (10.17) we just penalize deviations from the target:
Thus, we immediately see that in this model the structure of the recourse matrix is
where I is the identity matrix of order S. This case is referred to as simple recourse, as the second-stage variables are just used for bookkeeping and cost evaluation, and are not really linked to decisions. However, we could apply more clever decisions, such as asking a third-party to produce or sell us additional items, possibly using express shipping, or trying to persuade the customer to accept a substitute item. When we include these recourse actions in the model, and we account for multiple items, we really introduce second-stage decisions.
Clearly, these adjustments are not for free and incur some cost. We should minimize the total expected cost of both our initial and recourse decisions, which yields the following optimization model:
(10.25)
This is a two-stage stochastic LP model with recourse and can be interpreted as follows:
We may also introduce the recourse function Q(x) and rewrite the model as the deterministic equivalent
where
and
This formulation shows that a stochastic linear program is, in general, a nonlinear programming problem. Nonlinear problems are usually less pleasing to solve than linear ones, but this is not the really bad news. In fact, the recourse function Q(x) looks like a “hopeless” function to deal with:
Luckily, in many cases of practical interest, we can prove interesting properties of the recourse function, most notably convexity, which implies continuity (on an open domain). In some cases, Q(x) is differentiable; in other cases it is polyhedral. This does not imply that it is easy to evaluate the recourse function, but we may resort to statistical sampling (scenario generation) and take advantage of both convexity and problem structure.
We may represent uncertainty by a discrete probability distribution, where S is the set of scenarios and πs is the probability of scenario s ∈ S. We obtain the scenario tree (fan) in Fig. 10.14, where ξs is the realization of random variables corresponding to scenario s. One obvious way to sample a scenario tree is by Monte Carlo, in which case the scenarios have the same probability πs = 1/|S|, but other scenario generation strategies are available. This yields the following LP:
FIGURE 10.14 A scenario fan.
Thus we are left with the task of solving a plain LP, even though a possibly large-scale one. A crucial point is the ability to generate a good and robust solution on the basis of a limited number of representative scenarios.
Multistage stochastic programming formulations arise naturally as a generalization of two-stage models. At each stage, we gather new information and we make decisions accordingly, taking into account immediate costs and expected future recourse cost. The resulting decision process may be summarized as follows:12
From the perspective of time period t = 0, the decisions x1, …, xH are random variables, as they will be adapted to the realization of the underlying stochastic process. However, the only information that we may use in making each decision consists of the observed history so far. The resulting structure of the dynamic decision process can be appreciated by the following recursive formulation of the multistage problem:
In this formulation, we observe that a decision xt depends directly only on the previous decision xt−1. In general, decisions may depend on all of the past history, leading to a slightly more complicated model. However, we may often introduce additional state variables, in such a way that the above formulation applies. What the above formulation hides is the dependence structure of the random variables ξt, t = 1, …, H, which are the underlying risk factors of the random matrices and vectors A, b, and c:
From the viewpoint of stochastic programming, all of the above cases may be tackled. In contrast, the stochastic dynamic programming approach that we consider later relies on a Markovian structure and is more limited. We shall also see that dynamic programming takes advantage of a recursive formulation, which is implicit in the above multistage model. On the other hand, a definite advantage of the dynamic programming framework is that decisions are given in feedback form, as a function of the current system state. Indeed, it should be noted that, in practice, the relevant output of a stochastic programming model is the set of immediate decisions x0. The remaining decision variables could be regarded as contingency plans, which should be implemented as time goes by; however, it is more likely that the model will be solved again and again according to a rolling-horizon logic.
While this formulation points out the dynamic optimization nature of multistage problems, we usually resort to deterministic equivalents based on discretized scenario trees, like the one depicted in Fig. 10.15. Each node nk in the tree corresponds to an event, where we should make a decision. We have an initial node n0 corresponding to time t = 0. Then, for each event node, we branch descendant nodes. In the figure, each node has two successors, but the branching factors are arbitrary. The terminal nodes (the leaves) have no successor and their number gives the number of scenarios in the tree. A scenario is a path from the root node to a leaf node. For instance, scenario 2 consists of the node sequence (n0, n1, n3, n8). Each branch is associated with a conditional probability of occurrence P(nk | ni), where ni = a(nk) is the immediate predecessor of node nk. If scenarios are sampled by plain Monte Carlo, this probability is related to the branching factor. For instance, we may assume that all conditional probabilities in Fig. 10.15 are 0.5. The probability of a scenario is the product of the conditional probabilities of the branches we meet on the path. Hence, in this example we have 8 scenarios with probability 1/8.
FIGURE 10.15 Scenario tree for a simple asset–liability management problem.
Decision variables are associated with each node of the tree, as we show in the next section. It is important to realize that the tree structure enforces nonanticipativity of decisions. For instance, at node n1 we do not know if we are on scenario 1, 2, 3, or 4. The decisions x(n1) at node n1 will be the same for each of those 4 scenarios, which cannot be distinguished at node n1. Alternatively, we could associate decisions x(s, t) with scenarios and time steps. Therefore, rather than second-stage decisions x(n1) and x(n2), we would include decisions x(s, 1), for s = 1, 2, …, 8. However, we should include nonanticipativity constraints explicitly:13
In the next section we illustrate a simple financial model where decisions are associated with nodes, rather than scenarios, resulting in a more compact model formulation. The choice between the two approaches, however, depends on the solution algorithm that we choose for solving a possibly large-scale optimization model. In fact, we realize that if the branching structure is not kept under control, the number of nodes will grow exponentially.
To illustrate how multistage stochastic programming can be used to tackle financial problems, let us consider a simple asset–liability management problem.
As we have already pointed out, a scenario in the tree is a path from the root node n0 to a terminal node s ∈ S; let πs be the corresponding probability. We introduce the following decision variables:
Variables zin, yin, and xin are only defined at nonterminal nodes, as we assume that no rebalancing occurs at terminal nodes, where the portfolio is liquidated, we pay the liability Ls, and we measure the terminal wealth Ws. Let u(w) be the utility of wealth w; this function is used to express the utility of terminal wealth.
On the basis of this notation, we may write the following model:
(10.32)
The objective (10.27) is the expected utility of terminal wealth. Equation (10.28) expresses the initial asset balance, taking the current holdings into account; the asset balance at intermediate trading dates, i.e., at all nodes with exception of root and leaves, is taken into account by Eq. (10.29). Equation (10.30) ensures that enough cash is generated by selling assets in order to meet the liabilities; we may also reinvest the proceeds of what we sell in new asset holdings. Note how proportional transaction costs are expressed for selling and purchasing. Equation (10.31) is used to evaluate terminal wealth at leaf nodes, after portfolio liquidation and payment of the last liability. In practice, we would repeatedly solve the model on a rolling-horizon basis, so the exact expression of the objective function is a bit debatable. The role of terminal utility is just to ensure that we are left in a good position at the end of the planning horizon, in order to avoid nasty end-of-horizon effects.
The key ingredient in any stochastic programming model is a scenario tree, as the quality of the solution depends critically on how well uncertainty is captured. In principle, given a model of the underlying stochastic process, straightforward Monte Carlo simulation could be used to generate a scenario tree, on the basis of the sampling and path generation methods of Chapters 5 and 6. Unfortunately, a suitably rich tree may be needed to capture uncertainty and, given the exponential growth of nodes, this is not practical in the multistage case. An additional difficulty in financial problems is that the tree should be arbitrage-free, which places additional requirements on the branching structure.
One possible way out is a radical change in the solution paradigm. Rather than generate the tree and then solve a large-scale optimization problem, we may interleave optimization and sampling and stop when a convergence criterion is met; this kind of approach relies on sophisticated decomposition strategies that are beyond the scope of this book,15 and it requires the development of specific software. If we want to stick to commercial solvers, we may resort to the following ideas for clever scenario generation:
In the next sections we outline these ideas. Whatever scenario generation strategy we select, it is important to check its effect on solution stability, as we discuss in Section 10.5.3.4. We will use the simple newsvendor model to illustrate these concepts in Section 10.5.3.5.
We also remark a couple of possibly useful guidelines:
We have introduced Gaussian quadrature in Section 2.2 as a tool for numerical integration. When the integral represents the expected value of a function g(x, ξ) depending on continuous random variables ξ with joint density fξ(·), a quadrature formula represents a discretization of the underlying continuous distribution:
∫ g(x, ξ) fξ(ξ) dξ ≈ Σ_{s=1}^{S} πs g(x, ξs),   (10.33)
where the weights πs correspond to scenario probabilities, and the integration nodes ξs are the realizations of the random variables in the discrete approximation. Note that, unlike Monte Carlo sampling, different probabilities are associated with different scenarios. Since Gaussian quadrature formulas ensure exact integration of polynomials up to a given order, this ensures moment matching. As we have remarked, this approach can be applied up to a limited dimensionality, however. Nevertheless, it may work very well for two-stage models.
Low-discrepancy sequences are another tool for numerical integration that can be used to generate scenarios; see [29]. There is also some room for variance reduction strategies:
See also [40, Chapter 4] to appreciate the role of the aforementioned techniques for statistical inference in stochastic programming.
Arguably, the most sophisticated approaches to scenario generation are those based on the solution of an optimization problem. We have already appreciated the role of moment matching in Section 2.5. The idea can be applied to scenario generation as well, since we aim at finding a good discretization of a continuous distribution.
To illustrate the approach, consider a variable distributed as a multivariate normal, X ~ N(μ, Σ). The expected values and covariances of the discrete approximation should match those of the continuous distribution. Furthermore, since we are dealing with a normal distribution, we know that skewness and kurtosis for each component should be 0 and 3, respectively.
Let us denote by xis the realization of Xi, i = 1, …, n, in scenario s, s = 1, …, S. Natural requirements are:
Approximate moment matching is obtained by minimizing the following squared error:
(10.34)
The objective function includes four weights wk which may be used to fine tune performance. It should be mentioned that the resulting scenario optimization problem need not be convex. However, if we manage to find any solution with a low value of the “error” objective function, this is arguably a satisfactory solution, even though it is not necessarily the globally optimal one. The idea can be generalized to “property matching,” since one can match other features that are not necessarily related to moments; see, e.g., [20].
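As a minimal univariate illustration of the idea, the following sketch looks for S equiprobable scenario values approximately matching the first four moments of a standard normal; the weights, the starting points, and the choice of optimizer are arbitrary:

momMatch <- function(S, w=c(1,1,1,1)) {
  # weighted squared error of the first four sample moments with respect
  # to the targets (0, 1, 0, 3) of a standard normal
  obj <- function(z) {
    m1 <- mean(z); m2 <- mean((z-m1)^2)
    m3 <- mean((z-m1)^3)/m2^1.5; m4 <- mean((z-m1)^4)/m2^2
    w[1]*m1^2 + w[2]*(m2-1)^2 + w[3]*m3^2 + w[4]*(m4-3)^2
  }
  out <- optim(par=rnorm(S), fn=obj, method="BFGS",
               control=list(maxit=1000))
  out$par   # scenario values, each with probability 1/S
}

Scenarios for N(μ, σ²) are then obtained by the transformation μ + σz; in the multivariate case, covariance terms would also be included in the squared error of Eq. (10.34).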
Moment (or property) matching has been criticized, since it is possible to build counterexamples, i.e., pairs of quite different distributions that share the first few moments. An alternative idea is to rely on metrics fully capturing the distance between probability distributions. Thus, given a scenario tree topology, we should find the assignment of values and probabilities minimizing some distance with respect to the "true" distribution. Alternatively, given a large scenario tree, we should try to reduce it to a more manageable size by optimal scenario reduction. From a formal point of view, there are different ways to define a distance between two probability measures P1 and P2. One possibility has its roots in the old Monge transportation problem, which consists of finding the optimal way of transporting mass (e.g., soil, when we are building a road). The problem has a natural probabilistic interpretation, which was pointed out by Kantorovich, when we interpret mass in a probabilistic sense (see [38] for more details). Thus, we may define a transportation functional:
Here c(·, ·) is a suitably chosen cost function; the problem calls for finding the minimum of the integral over all joint measures η defined on the Cartesian product Ξ × Ξ, where Ξ is the support of the random variable ξ, and whose marginals coincide with P1 and P2, respectively (π1 and π2 denote the corresponding projection operators).
As the reader can imagine, we are threading on difficult ground, requiring some knowledge of functional analysis. Nevertheless, the idea is not too difficult to grasp if we deal with discrete distributions, where mass is concentrated on points (atoms). The Kantorovich distance, in the discrete case, boils down to
where
Therefore, the evaluation of distance requires the solution of a transportation problem, which is a well-known instance of a linear programming problem. Given this distance, it is possible to build heuristics in which scenarios are successively eliminated, redistributing probability mass in an optimal way, after having prescribed the scenario cardinality or the quality of the approximation. See, e.g., [17] for more details.
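As an illustration in the simplest case of two discrete distributions on the real line, with cost c(ξ, ζ) = |ξ − ζ|, the distance may be computed by solving the underlying transportation problem; the following sketch uses lpSolve (one possible choice of solver, not necessarily the tool used in the cited references):

library(lpSolve)
kantorovich <- function(xi, p, zeta, q) {
  # move probability mass p, sitting at the points xi, onto the points zeta
  # with target mass q, at minimum total transportation cost
  costs <- abs(outer(xi, zeta, "-"))
  sol <- lp.transport(costs, direction="min",
                      row.signs=rep("=", length(p)), row.rhs=p,
                      col.signs=rep("=", length(q)), col.rhs=q,
                      integers=NULL)   # continuous, not integer, mass flows
  sol$objval
}
kantorovich(c(0,1,2), c(0.3,0.4,0.3), c(0.5,1.5), c(0.5,0.5))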
We should mention that a counterargument to proponents of the optimal approximation approach is that we are not really interested in the distance between the ideal distribution and the scenario tree. What really matters is the quality of the solution. For instance, if we are considering a mean–variance portfolio optimization problem along the lines of Section 10.2.1, it can be argued that matching the first two moments is all we need. See, e.g., [26, Chapter 4] for a discussion of these issues.
Whatever scenario reduction strategy we employ, it is important to check the stability of the resulting solution. There are quite sophisticated studies on the stability of stochastic optimization, relying on formal distance concepts. Here we just consider a down-to-earth approach along the lines of [26]. The exact problem is
which involves the expectation of the objective under the true measure . The actual problem we solve is based on an approximate scenario tree τ, possibly resulting from random sampling:
where the hat notation points out that we are just estimating the true expected value of the objective, given the generated scenario tree. Each scenario tree induces an optimal solution. Thus, if we sample trees τi and τj, we obtain solutions xi* and xj*, respectively. If the tree generation mechanism is reliable, we should see some stability in what we get. We may consider stability in the solution itself, but it is easier, and perhaps more relevant, to check stability in the value of the objective function. After all, if we want to minimize cost or maximize profit, this is what matters most.
There are two concepts of stability. In-sample stability means that if we sample two scenario trees, the values of the solutions that we obtain are not too different:
Êτi[ f(xi*, ξ) ] ≈ Êτj[ f(xj*, ξ) ].   (10.35)
This definition does not apply directly if the scenario tree is generated deterministically; in that case, we might compare trees with slightly different branching structures to see if the tree structure that we are using is rich enough to ensure solution robustness. This concept of stability is called in-sample, as we evaluate a solution using the same tree we have used to find it. But since the tree is only a limited representation of uncertainty, we should wonder what happens when we apply the solution in the real world, where a different scenario may unfold. This leads to out-of-sample stability, where we compare the objective value from the optimization model against the actual expected performance of the solution. If the trees are reliable, we should not notice a significant difference in performance:
E[ f(xi*, ξ) ] ≈ Êτi[ f(xi*, ξ) ],   (10.36)
where the expectation on the left-hand side is taken under the true probability measure.
This kind of check is not too hard in two-stage models, as we should just plug the first-stage solution into a large number of second-stage problems, one per sampled scenario. Typically, solving a large number of second-stage problems is not too demanding, especially if they are just linear programs; as a rule, it takes more time to solve one large stochastic LP than a large number of deterministic LPs. Unfortunately, this is not so easy in a multistage setting. To see why, consider the second-stage decisions. They are associated with a realization of random variables, which will be different from the out-of-sample realization, and there is no obvious way to adapt the solution.16 Some ideas can be devised, of course, but a realistic evaluation should rely on a rolling-horizon simulation whereby we solve a sequence of multistage stochastic programming problems and apply only the first-stage solution. This is technically feasible but quite expensive computationally. A possible (cheaper) alternative is to check solutions on different trees to see if
In the next section we apply these ideas on a very simple problem, for the purpose of illustration.
We have introduced the newsvendor model in Example 10.9, as a basic optimization model under uncertainty. Given its structure, we are able to find the optimal solution in explicit form. If we assume that demand is normally distributed, D ~ N(μ, σ²), then we know that the optimal decision is to make/buy a quantity
q* = μ + zβ σ
of items, where zβ is the quantile of the standard normal for the probability level
β = m/(m + cu),
defined by the profit margin m (underage cost) and by the cost of unsold items cu (overage cost). We have also seen that maximization of expected profit is equivalent to minimization of an expected deviation cost
In Section 10.2.3 we have formulated the problem as a scenario-based LP, and it is interesting to check the accuracy of scenario generation mechanisms on this prototypical example, as well as to investigate stability. To this end, we should also express the objective function analytically, which is not too difficult to accomplish for normally distributed demand.
Let us denote the normal PDF of demand by fD(x) and express the expected deviation cost in the case of normal demand:
Since we have assumed a normal distribution, the integration domain starts from −∞, which looks unreasonable for a non-negative demand. The assumption is plausible only if the probability of a negative demand is practically negligible. The last integral represents expected lost sales as a function of q and is related to the loss function. We digress a little to illustrate the essential properties of the loss function in the normal case.17
Technical note: The loss function. We define the standard loss function for a standard normal variable Z as
L(z) = E[max(Z − z, 0)] = ∫_z^{+∞} (x − z) φ(x) dx,
where φ(x) is the standard normal density. For a standard normal, the loss function can be expressed in terms of the PDF φ(x) and the corresponding CDF Φ(x), by splitting the above integral as
∫_z^{+∞} x φ(x) dx − z ∫_z^{+∞} φ(x) dx = ∫_z^{+∞} x φ(x) dx − z [1 − Φ(z)].
The first integral can be evaluated by substitution of variable:
Changing the integration limits accordingly, we find
∫_z^{+∞} x φ(x) dx = φ(z).
Putting all of it together, we end up with
L(z) = φ(z) − z [1 − Φ(z)].   (10.37)
Given the standard loss function, we may express lost sales in the case of a generic normal random variable:
E[max(D − q, 0)] = σ L((q − μ)/σ),
where we have used the substitution of variable t = (x − μ)/σ.
Using the loss function of a generic normal, we may express the deviation cost for a normal demand explicitly:
C(q) = cu (q − μ) + (m + cu) σ L((q − μ)/σ).
The R code of Fig. 10.16 takes advantage of this knowledge and yields the optimal decision and the related cost, given purchase cost, selling price, markdown price, as well as expected value and standard deviation of demand:
FIGURE 10.16 Solving the newsvendor problem optimally in the normal case.
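A sketch consistent with the formulas derived above, and with the interface used in the transcript below, is the following (the actual code of the figure may differ in its details):

newsOptimal <- function(pur, sell, mdown, mu, sigma) {
  m <- sell - pur          # profit margin (underage cost)
  cu <- pur - mdown        # cost of unsold items (overage cost)
  beta <- m/(m + cu)       # critical ratio
  q <- mu + sigma*qnorm(beta)
  z <- (q - mu)/sigma
  loss <- dnorm(z) - z*(1 - pnorm(z))         # standard normal loss function
  cost <- cu*(q - mu) + (m + cu)*sigma*loss   # expected deviation cost
  cat("optimal q", q, "\n")
  cat("optimal cost", cost, "\n")
  invisible(list(q=q, cost=cost))
}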
> pur <- 10
> sell <- 12
> mdown <- 5
> mu <- 100
> sigma <- 30
> newsOptimal(pur,sell,mdown,mu,sigma)
optimal q 83.02154 
optimal cost 71.38016 
In a practical setting, we would probably be forced to round the solution when demand is restricted to integer values, in which case the normal distribution would be an approximation. By the way, since μ > 3σ we may neglect the embarrassing possibility of negative demand.
Given this analytical benchmark, we may use the scenario-based LP model of Section 10.2.3 to check the accuracy of the solution and the estimated expected cost. If we use straightforward Monte Carlo for scenario generation, then the scenarios are equiprobable: πs = 1/S, s = 1, …, S. We may use the function solveLP of package linprog to solve the model, but this requires building the technology matrix explicitly, which is in this case quite simple, as the recourse matrix is just W = [I, I]. The resulting code is shown in Fig. 10.17. Demand scenarios and probabilities are an input to the procedure, so that we may check the performance of Monte Carlo scenario generation against other strategies. The figure also includes code to check in- and out-of-sample stability. Note that in scenario generation we clip demand to non-negative values, as we may not rule out the generation of a negative demand, especially with a large sample. In order to check in-sample stability, let us repeat the generation of 50 random scenarios five times:
FIGURE 10.17 Solving the newsvendor problem by stochastic LP and checking in- and out-of-sample stability.
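The functions newsLP and outScost used below are defined in the figure; a rough sketch of equivalent ones, here using lpSolve::lp rather than the solveLP call mentioned above (the variable ordering and other details are illustrative, and m and cu denote the profit margin and the cost of unsold items defined earlier), is:

library(lpSolve)
newsLP <- function(m, cu, demand, probs) {
  # scenario-based newsvendor LP; decision variables are (q, o_1..o_S, u_1..u_S),
  # where o_s is the unsold quantity and u_s the lost sales in scenario s
  S <- length(demand)
  obj <- c(0, cu*probs, m*probs)                   # expected overage + underage cost
  A <- rbind(cbind(1, -diag(S), matrix(0,S,S)),    # q - o_s <= d_s
             cbind(1, matrix(0,S,S), diag(S)))     # q + u_s >= d_s
  sol <- lp("min", obj, A, c(rep("<=",S), rep(">=",S)), c(demand, demand))
  list(q=sol$solution[1], cost=sol$objval)
}
outScost <- function(q, m, cu, demand) {
  # out-of-sample estimate of the expected deviation cost of an order quantity q
  mean(m*pmax(demand - q, 0) + cu*pmax(q - demand, 0))
}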
obj 66.801  q 82.7
obj 58.2748  q 79.6
obj 77.0129  q 82.3
obj 68.2175  q 80.9
obj 59.6849  q 85.3
We observe some variability, especially in the objective function. As this casts some doubts on the quality of the solution, we may repeat the procedure with 1,000 scenarios, with a corresponding increase in computational effort, which yields
obj 71.4185  q 81.3
obj 70.1544  q 82.4
obj 70.801  q 82.5
obj 71.3892  q 84.7
obj 72.8487  q 84.3
Now we see that the solution is more stable, and fairly close to the true optimum. As a further check, we may assess out-of-sample stability, by generating a large number of scenarios (one million) and estimating the cost of each solution generated by the stochastic LP model. A surprising fact is that the solution obtained by using 50 scenarios proves to be a rather good one:
true cost 71.4412
true cost 71.8959
true cost 71.4542
true cost 71.6127
true cost 71.6429
The same happens with the solutions obtained with 1,000 scenarios:
true cost 71.5528
true cost 71.4514
true cost 71.4477
true cost 71.5522
true cost 71.5014
This finding can be explained by observing that in some cases the true objective function is rather flat around the optimal solution, which makes the model rather robust. However, this need not be the case in other settings. For instance, in financial applications the model might take full advantage of the faulty information included in badly generated scenarios, yielding a very unstable solution. Furthermore, here we may afford a large number of scenarios because this is a very simple model, but a parsimonious approach must be pursued in general. Thus, it is also interesting to see what happens if we generate scenarios using Gaussian quadrature:18
> library(statmod)
> quad <- gauss.quad.prob(10,dist="normal",mu=mu,sigma=sigma)
> out <- newsLP(m,cu,pmax(quad$nodes,0),quad$weights)
> cat('obj ',round(out$cost,4),' q ',round(out$q,1),'\n')
obj 65.5008 q 85.5
> trueCost <- outScost(out$q,m,cu,outsDemand)
> cat('true cost ',round(trueCost,4),'\n')
true cost 71.6757
With only ten scenarios, we get a solution that proves to be good out-of-sample, even though the in-sample evaluation of the objective function does not look too satisfactory. A less pleasing feature of Gaussian quadrature for this example is that, in order to capture the tails of the distribution, it generates negative as well as rather extreme demand values:
> quad$nodes
 [1] -45.783885  -7.454705  25.470225  56.020327  85.451929
 [6] 114.548071 143.979673 174.529775 207.454705 245.783885
> quad$weights
 [1] 4.310653e-06 7.580709e-04 1.911158e-02 1.354837e-01 3.446423e-01
 [6] 3.446423e-01 1.354837e-01 1.911158e-02 7.580709e-04 4.310653e-06
Even though negative values have small probability, this shows a potential issue, which is actually related to the fact that we are using a normal distribution to model an intrinsically non-negative variable. A similar issue could arise when modeling rates of return of stock shares, which cannot be less than −100%.
In Gaussian quadrature we determine both nodes and weights together. We may also pursue a simple approach in which probabilities are fixed to a uniform value, and we look for nodes. One such approach can be devised by integrating random variate generation by the inverse transform method with a sort of "deterministic stratification." The generation of a standard normal random variable by the inverse transform method, as we have seen in Chapter 5, requires the generation of a uniform random variable U ~ U(0, 1) and the inversion of the CDF of the standard normal, Z = Φ−1(U). We may partition the unit interval uniformly, define a representative point of each subinterval, and generate the corresponding standard normal. For instance, if we use five intervals, their end points are
0, 0.2, 0.4, 0.6, 0.8, 1.
Each subinterval has probability 0.2, and we might consider their midpoints as their representative points. If so, the nodes would be
Φ−1(0.1), Φ−1(0.3), Φ−1(0.5), Φ−1(0.7), Φ−1(0.9),
i.e., approximately −1.2816, −0.5244, 0, 0.5244, and 1.2816.
We note that, in the stated form, this is at best a simple heuristic; in order to apply stratification we should find the conditional expectation of the standard normal on each interval. This may actually be done, but it is left as an exercise for the reader, and we just check the performance of this simple approach:
> numVals <- 50
> aux <- (0:numVals)/numVals
> U <- (aux[-1]+aux[-(numVals+1)])/2
> demand <- pmax(0, mu+sigma*qnorm(U))
> probs <- rep(1/numVals,numVals)
> out <- newsLP(m,cu,demand,probs)
> cat('obj',round(out$cost,4),' q ',round(out$q,1),'\n')
obj 70.8987 q 83.4
> trueCost <- outScost(out$q,m,cu,outsDemand)
> cat('true cost ',round(trueCost,4),'\n')
true cost 71.4444
Needless to say, the performance is arguably satisfactory because of the simple and well-behaved problem we are tackling. Still, the idea might be worth considering, and it can be applied to small-dimensional cases (where we build hyper-rectangular cells, rather than subintervals), possibly with some preliminary Monte Carlo runs to find the representative point of each cell. It is interesting to note that in this case we do not generate “extreme” node values:
> range(demand)
[1]  30.20956 169.79044
Whether this is a strength or a weakness depends on the application. As a final example, let us resort to low-discrepancy sequences and try scrambled Sobol sequences:19
> library(randtoolbox)
> numVals <- 50
> eps <- sobol(numVals,dim=1,normal=TRUE,scrambling=2)
> demand <- pmax(0, mu+sigma*eps)
> probs <- rep(1/numVals, numVals)
> out <- newsLP(m,cu,demand,probs)
> cat('obj',round(out$cost,4),' q', round(out$q,1),'\n')
obj 73.4044 q 82
> trueCost <- outScost(out$q,m,cu,outsDemand)
> cat('true cost ',round(trueCost,4),'\n')
true cost 71.4798
Once again, the result does not look bad, but the reader is warned against drawing any conclusion on the basis of this extremely limited experiment. Anyway, we observe that both Monte Carlo and quasi–Monte Carlo methods can be applied to relatively high-dimensional problems, whereas both Gaussian quadrature and stratification are more limited. Unfortunately, the limitations of R as an optimization tool do not allow us to cope with more interesting examples. However, we should mention that state-of-the-art commercial solvers, such as IBM CPLEX and Gurobi, do offer an interface to R.
Dynamic programming (DP) is, arguably, the single most powerful optimization principle that we have at our disposal. In fact, it can be used to tackle a wide variety of problems:
Actually, for some problem families, dynamic programming is the only viable approach. Since there is no free lunch, power and generality must come at some price. On the one hand, dynamic programming is a principle, rather than a well-defined optimization algorithm. Thus, unlike the simplex method for linear programming, there is no off-the-shelf software package, and a considerable degree of customization and implementation effort may be needed to make it work. On the other hand, dynamic programming is associated with the so-called curse of dimensionality. As we shall see, the curse actually takes different forms, but the essence is that, as a rule, conventional dynamic programming can only be applied to rather small-scale problems. These limitations can be overcome, at least in part, by resorting to numerical approximations where Monte Carlo methods play a fundamental role. Here we are interested in discrete-time decision models under uncertainty; hence, there is a considerable connection with stochastic programming with recourse.
To introduce the dynamic programming concept in the simplest way, we show how it can be used to solve a deterministic shortest path problem in Section 10.6.1. Then, in Section 10.6.2, we formalize the principle in the form of the recursive functional equation originally introduced by Bellman; this form allows us to cope with stochastic optimization problems. We outline the adaptation of the approach to infinite-horizon problems in Section 10.6.3. Finally, in Section 10.6.4, we briefly compare stochastic programming with recourse and stochastic dynamic programming; as expected, the two approaches have somewhat complementary strengths and weaknesses, as well as many intersections.
The easiest way to introduce dynamic programming is by considering one of its most natural applications, i.e., finding the shortest path in a network. Graph and network optimization are not customary topics in finance or economics, but a quick look at Fig. 10.18 is enough to understand what we are talking about. A network consists of a set of nodes (numbered from 0 to 7 in our toy example) and a set of arcs joining pairs of nodes. Arcs are labeled by a number which can be interpreted as the arc length (or cost). Our purpose is finding a path in the network, starting from node 0 and leading to node 7, such that the path has total minimal length. For instance, by summing the arc lengths that are followed on the path (0, 1, 4, 7), we see that its total length is 18, whereas path (0, 1, 3, 5, 7) has length 16. At each node, we must choose the next node to visit. We may immediately appreciate that this problem bears some resemblance to dynamic decision making; given some state we are in, we should decide what to do in order to optimize an outcome that depends on the whole path. A greedy decision need not be the optimal one; for instance, the closest node to the starting point 0 in our network is node 2, but there is no guarantee that this arc is on an optimal path.
FIGURE 10.18 A shortest path problem.
Of course, we could simply enumerate all of the possible paths to spot the optimal one; here we have just a finite set of alternatives and there is no uncertainty involved, so the idea is conceptually feasible. However, this approach quickly becomes infeasible in practice, as the network size increases. Furthermore, we need a better idea in view of an extension to a stochastic case: If arc costs are random, running a Monte Carlo simulation for each possible path to estimate its expected length is out of the question. So we must come up with some clever way to avoid exhaustive enumeration. Dynamic programming is one possible approach to accomplish this aim. It is worth noting that more efficient algorithms are available for the shortest path problem, but the idea we illustrate here can be extended to problems featuring infinite state spaces (countable or not) and uncertain data. Let 𝒩 = {0, 1, 2, …, N} be the node set and 𝒜 the arc set; let the start and final nodes be 0 and N, respectively. For simplicity, we assume that the network is acyclic and that the arc lengths cij, (i, j) ∈ 𝒜, are non-negative: If we had the possibility of getting trapped in a loop of negative length arcs, the optimal cost would be −∞, and we do not want to consider such pathological cases.
The starting point is to find a characterization of the optimal solution, which can be translated into a constructive algorithm. Let Vi be the length of the shortest path from node i to node N (denoted by i → N). Assume that, for a specific i ∈ 𝒩, node j lies on the optimal path i → N. Then the following property holds: the subpath j → N must itself be the shortest path from node j to node N.
In other words, the optimal solution for a problem is obtained by assembling optimal solutions for smaller subproblems, which suggests a useful decomposition strategy. To understand why, consider the decomposition of i → N into the subpaths i → j and j → N. The length of i → N is the sum of the lengths of the two subpaths:
(10.38)
Note that the second subpath is not affected by how we go from i to j. This is strongly related to the concept of state in Markovian dynamic systems: How we get to state j has no influence on the future. Now, assume that the subpath j → N is not the optimal path from j to N. Then we could improve the second term of Eq. (10.38) by assembling a path consisting of i → j followed by the shortest path from j to N. The length of this new path would be
which is a contradiction, as we have assumed that Vi is the optimal path length.
This observation leads to the following recursive equation for the shortest path from a generic node i to the terminal node N:
Vi = min { cij + Vj : (i, j) ∈ 𝒜 }.   (10.39)
In other words, to find the optimal path from node i to node N, we should consider the immediate cost cij of going from i to all of its immediate successors j, plus the optimal cost of going from j to the terminal node. Note that we do not only consider the immediate cost, as in a greedy decision rule. We also add the future cost of the optimal sequence of decisions starting from each state j that we can visit next; this is what makes the approach nonmyopic. The function Vi is called the cost-to-go or value function and is defined recursively by Eq. (10.39). The value function, for each point in the state space, tells us what the future optimal cost would be, if we reach that state and proceed with an optimal policy. This kind of recursive equation, whose exact form depends on the problem at hand, is the heart of dynamic programming and is an example of a functional equation. In the shortest path problem, we have a finite set of states, and the value function is a vector; in a continuous-state model, the value function is an infinite-dimensional object.
Solving the problem requires finding the value function V0 for the initial node, and to do that we should go backward from the terminal node.20 We can associate a terminal condition VN = 0 with our functional equation, as the cost for going from N to N is zero. Then we unfold the recursion by considering the immediate predecessors i of the terminal node N; for each of them, finding the optimal path length is trivial, as this is just ciN. Then we proceed backward, labeling each node with the corresponding value function. In this unstructured network, we may label a node only when all of its successors have been labeled; we can always find a correct node ordering in acyclic networks.
Let us find the shortest path for the network depicted in Fig. 10.18. We have the terminal condition V7 = 0 for the terminal node, and we look for its immediate predecessors 4 and 6 (we cannot label node 5 yet, because node 6 is one of its successors). We have
Now we may label node 5:
Then we consider node 3 and its immediate successors 4, 5, and 6:
By the same token we have
Apart from getting the optimal length, which is 16, we may find the optimal path by looking for the nodes optimizing each single decision, starting from node 0:
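A generic sketch of this backward recursion in R is the following; nodes are numbered 1, …, N (R vectors are 1-based, so node 0 of the text corresponds to node 1 here), and the small cost matrix at the end is a hypothetical example, not the network of Fig. 10.18:

shortestPathDP <- function(cmat) {
  # cmat[i,j] = arc cost from node i to node j (NA if there is no arc);
  # nodes are assumed topologically ordered (arcs go from lower to higher
  # indices), so a single backward sweep labels every node
  N <- nrow(cmat)
  V <- rep(Inf, N); V[N] <- 0         # terminal condition V_N = 0
  succ <- rep(NA, N)
  for (i in (N-1):1)
    for (j in which(!is.na(cmat[i,])))
      if (cmat[i,j] + V[j] < V[i]) { V[i] <- cmat[i,j] + V[j]; succ[i] <- j }
  path <- 1
  while (path[length(path)] != N) path <- c(path, succ[path[length(path)]])
  list(value=V[1], path=path)
}
cmat <- matrix(NA, 4, 4)     # hypothetical 4-node acyclic network
cmat[1,2] <- 2; cmat[1,3] <- 5; cmat[2,3] <- 1; cmat[2,4] <- 6; cmat[3,4] <- 2
shortestPathDP(cmat)         # optimal value 5, path 1 -> 2 -> 3 -> 4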
The shortest path problem is one of the prototypical examples of dynamic programming, yet it lacks the structure of most problems of interest in finance and economics, which are dynamic problems where uncertainty unfolds over time. We have introduced the three essential families of dynamic models in Chapter 3. Dynamic programming can be applied to continuous-time and discrete-event models, too, but here we only consider discrete-time models, which are relatively simple but do hide some potential traps. In fact, when discretizing time, it is necessary to be very clear about the meaning of time instants and time intervals:21
These definitions are illustrated in Fig. 10.19. Note that, with this timing convention, we emphasize the fact that noise is realized during time interval t, after making the decision at time instant t − 1.
FIGURE 10.19 Illustrating time conventions.
Let us introduce a bit of notation:
Then, the system dynamics is represented by a state equation like
st+1 = St+1(st, xt, ξt+1),   (10.40)
where St+1 is the state transition function over time period t + 1; the subscript may be omitted if the dynamics do not change over time, i.e., if the system is stationary. The latter is the common assumption when dealing with infinite-horizon problems.
By making the decision xt in state st we incur an immediate cost Ct(st, xt). We would like to minimize the expected cost incurred over a given planning horizon,
min E[ Σ_{t=0}^{T} γ^t Ct(st, xt) ],   (10.41)
where γ < 1 is a suitable discount factor. Clearly, when dealing with profits or utility functions, we rewrite the problem in terms of a maximization. However, in the statement above we are not stating precisely what kind of decision process we are dealing with. In a deterministic setting, we would minimize with respect to decision variables xt, and we would come up with a plan specifying all of the future decisions. However, we have already seen in multistage stochastic programming with recourse that decisions should be adapted to the occurred contingencies. As a consequence, apart from the first-stage ones, decisions are in fact random variables. In stochastic dynamic programming we make this even clearer, as decisions are obtained by the application of a feedback policy.22 We define a policy, denoted by π, whereby an action is selected on the basis of the current state:
xt = πt(st).   (10.42)
Hence, a more precise statement of the problem is
min_{π ∈ Π} E[ Σ_{t=0}^{T} γ^t Ct(st, xt) ],   (10.43)
where Π is the set of feasible policies. By feasible we mean that each decision xt is constrained to be in Xt(st), the set of feasible decisions at time t if we are in state st.
Stated as such, the problem does not look quite tractable. However, the shortest path example suggests a suitable decomposition strategy. When in state st at time t, we should select the decision xt ∈ Xt(st) that minimizes the sum of the immediate cost and the expected cost-to-go that we will incur from time period t + 1 onward:
Vt(st) = min_{xt ∈ Xt(st)} { Ct(st, xt) + γ E[ Vt+1(st+1) | st ] }.   (10.44)
More generally, we talk of a value function, as the term is better suited to both minimization and maximization problems. The conditioning with respect to st stresses the fact that we are taking an expectation conditional on the current state; it might even be the case that uncertainty depends on our decisions, and if so we should also condition with respect to xt. We will not analyze the conditions under which Eq. (10.44) can be used to solve problem (10.43). The intuition gained from the shortest path problem suggests that the stochastic process ξt must be Markovian. With more general path dependency, the decomposition argument behind dynamic programming fails. Sometimes, a model reformulation based on additional state variables can be used to transform a non-Markovian model into a Markovian one, at the price of an increase in complexity. The careful reader will immediately notice a similarity between this modeling framework and stochastic programming with recourse, where the recourse cost plays the role of the value function. The recursive structure of a multistage problem, which is pretty evident in Eq. (10.26), is also visible here. However, multistage stochastic programming with recourse does not require the Markov property. On the other hand, we should note that standard stochastic programming models with recourse assume exogenous uncertainty.
Equation (10.44) is the functional equation of dynamic programming, also called the optimality equation, and it requires finding the value function at each time instant. Typically, the functional equation is solved backward in time, taking advantage of the fact that the problem in the last time period is a myopic one. At the last decision time instant, $t = T$, we should solve the problem
$$ V_T(s_T) = \min_{x_T \in X_T(s_T)} \Bigl\{ C_T(s_T, x_T) + \gamma\, \mathrm{E}\bigl[ V_{T+1}(s_{T+1}) \,\big|\, s_T \bigr] \Bigr\} \tag{10.45} $$
for every possible state $s_T$. The value function $V_{T+1}(s_{T+1})$ is given by a boundary condition. For problem (10.41), the boundary condition is
$$ V_{T+1}(s_{T+1}) \equiv 0, $$
since the objective function includes only a trajectory cost. Sometimes, a cost (or reward) ΦT+1(sT+1) is associated with the terminal state, and the problem is
$$ \min \; \mathrm{E}\Biggl[ \sum_{t=0}^{T} \gamma^t C_t(s_t, x_t) + \gamma^{T+1} \Phi_{T+1}(s_{T+1}) \Biggr]. \tag{10.46} $$
For instance, in life-cycle consumption-saving models for pension economics, the objective function could include both a utility function depending on consumption and some utility from bequest. In this case, we impose the boundary condition
$$ V_{T+1}(s_{T+1}) = \Phi_{T+1}(s_{T+1}). $$
Whatever boundary condition we select, problem (10.45) is a static optimization model. By solving it for every state sT, we build the value function VT(sT) at time t = T. Then, unfolding the recursion backward in time, we find the value functions VT−1(sT−1), VT−2(sT−2), and so on, down to V1(s1). Finally, given the initial state s0, we find the next optimal decision by solving the static problem
$$ \min_{x_0 \in X_0(s_0)} \Bigl\{ C_0(s_0, x_0) + \gamma\, \mathrm{E}\bigl[ V_1(s_1) \,\big|\, s_0 \bigr] \Bigr\}. \tag{10.47} $$
Unlike stochastic programming with recourse, we also have a clear way to make the subsequent decisions, as the full set of value functions allows us to solve the corresponding static problem at each time instant, finding the decision $x_t$ as a function of the current state $s_t$. Thus, we find the optimal policy in an implicit feedback form, and the dynamic problem is decomposed into a sequence of static problems.
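To illustrate the backward recursion in concrete terms, the following R sketch solves, by brute force, a small finite-horizon inventory problem with a discrete state space (inventory on hand), a state-dependent set of feasible decisions (order quantities), and demand uncertainty with finite support; the problem data and cost parameters are hypothetical, and the conditional expectation in Eq. (10.44) reduces to a weighted sum over demand outcomes.

# Brute-force backward recursion for a hypothetical finite-horizon
# inventory problem. State s: units on hand (0..Smax); decision x: units
# ordered, feasible if s + x <= Smax; noise: random demand with finite support.
# Since holding/shortage costs depend on demand, the immediate cost is
# folded into the expectation together with the value of the next state.
Smax  <- 10                          # capacity, so states are 0..Smax
Tmax  <- 5                           # decisions are made at t = 0..Tmax
gamma <- 0.95                        # discount factor
cOrd <- 2; cHold <- 1; cShort <- 6   # unit ordering/holding/shortage costs
dem  <- 0:4                                   # demand support
pDem <- dnorm(dem, mean = 2, sd = 1.2)        # demand probabilities
pDem <- pDem / sum(pDem)

V    <- matrix(0, nrow = Smax + 1, ncol = Tmax + 2)   # V[s+1, t+1]; V_{T+1} = 0
Xopt <- matrix(NA, nrow = Smax + 1, ncol = Tmax + 1)  # optimal decisions

for (t in Tmax:0) {
  for (s in 0:Smax) {
    best <- Inf; bestX <- NA
    for (x in 0:(Smax - s)) {                 # feasible set X_t(s_t)
      sold  <- pmin(s + x, dem)               # units sold, per demand outcome
      sNext <- s + x - sold                   # next state, per demand outcome
      short <- dem - sold                     # unmet demand
      expCost <- cOrd * x +
        sum(pDem * (cHold * sNext + cShort * short))
      expVal  <- sum(pDem * V[sNext + 1, t + 2])  # E[V_{t+1}(s_{t+1}) | s, x]
      total   <- expCost + gamma * expVal
      if (total < best) { best <- total; bestX <- x }
    }
    V[s + 1, t + 1]    <- best
    Xopt[s + 1, t + 1] <- bestX
  }
}
V[, 1]      # value function at t = 0 for each initial inventory level
Xopt[, 1]   # corresponding optimal order quantities

The matrices V and Xopt store the value function and the minimizing decision for every state and time instant, i.e., the optimal policy in (tabular) feedback form.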
The skeptical reader will wonder what is the price to pay for this extremely powerful approach. The price is actually pretty evident in Eq. (10.44). We should solve an optimization problem for each possible value of the state variable $s_t$. If we are dealing with a continuous state space, this is impossible, unless we can take advantage of some specific problem structure (we shall see an example in Section 10.7.1). But even if the state space is discrete, when the dimensionality is large, i.e., when $s_t$ is a vector with several components, the sheer computational burden is prohibitive. This is called the curse of dimensionality. Actually, a close look at Eq. (10.44) shows that we face a three-fold curse of dimensionality:
1. The state space: the value function must be computed for every possible state $s_t$.
2. The outcome space: the conditional expectation must be computed, or approximated, for every candidate decision, which may be difficult when the noise is multidimensional.
3. The decision space: the minimization over $x_t \in X_t(s_t)$ may itself be a hard optimization problem.
The shortest path problem is so easy because it is deterministic, it features a discrete state space, and the set of possible decisions is discrete as well. In general, the application of DP is not trivial and calls for the integration of several numerical tricks of the trade, as we show in Section 10.7. Even so, DP in its standard form can be tackled only for moderately sized problems. However, recent trends in approximate dynamic programming, which we outline in Section 10.8, integrate statistical learning with optimization, paving the way for a much wider array of applications. Typically, these approaches rely on Monte Carlo simulation.
The recursive form of Eq. (10.44) needs some adjustment when coping with an infinite-horizon problem like
$$ \min \; \mathrm{E}\Biggl[ \sum_{t=0}^{\infty} \gamma^t C(s_t, x_t) \Biggr], \tag{10.48} $$
where we assume that immediate costs are bounded and $\gamma < 1$, so that the series converges to a finite value.23 In this case, we typically drop the subscript $t$ from the immediate cost, as well as from the state-transition equation
$$ s_{t+1} = S(s_t, x_t, \xi_{t+1}), $$
i.e., we assume a stationary model. An infinite-horizon model may be of interest per se, or it can be a trick to avoid the need to specify a terminal state value function. In this case, the functional equation boils down to
$$ V(s) = \min_{x \in X(s)} \Bigl\{ C(s, x) + \gamma\, \mathrm{E}\bigl[ V\bigl(S(s, x, \xi)\bigr) \bigr] \Bigr\}, \tag{10.49} $$
where X(s) is the set of feasible decisions when we are in state s. The good news is that we need only one value function, rather than a value function for each time instant. The bad news is that now we have a value function defined as the fixed point of a possibly complicated operator, as we have V(s) on both sides of the equation. In general, an iterative method is needed to solve Eq. (10.49).
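One such iterative method is value iteration, i.e., the repeated application of the right-hand side of Eq. (10.49) starting from an arbitrary guess, which converges since the corresponding operator is a contraction when γ < 1. The following R sketch applies it to a hypothetical problem with a small discrete state and decision space; the cost matrix and transition probabilities are generated at random, purely for illustration.

# Value iteration for a hypothetical infinite-horizon problem with nS
# discrete states and nX decisions; costs and transition probabilities
# are randomly generated placeholders.
set.seed(55555)
nS <- 6; nX <- 3; gamma <- 0.9
cost <- matrix(runif(nS * nX, 1, 10), nrow = nS)          # cost[s, x] = C(s, x)
P <- array(runif(nS * nS * nX), dim = c(nS, nS, nX))      # P[s, s', x]
for (x in 1:nX) P[, , x] <- P[, , x] / rowSums(P[, , x])  # rows sum to one

V <- rep(0, nS)                    # initial guess of the value function
repeat {
  # Q[s, x] = C(s, x) + gamma * E[ V(s') | s, x ]
  Q <- sapply(1:nX, function(x) cost[, x] + gamma * P[, , x] %*% V)
  Vnew <- apply(Q, 1, min)         # minimize over decisions for each state
  if (max(abs(Vnew - V)) < 1e-8) break
  V <- Vnew
}
policy <- apply(Q, 1, which.min)   # stationary policy in feedback form
V
policy

The resulting vector policy gives, for each state, the decision prescribed by the (approximately) optimal stationary policy in feedback form.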
Before we delve into numerical solution methods for stochastic dynamic programming, it is useful to pause and compare this approach with stochastic programming with recourse, since there is a clear connection between the two.
As a general consideration, stochastic programming models are mostly favored within the operations research community, whereas dynamic programming is favored by economists.24 One reason is that stochastic programming is intrinsically operational: What we care about is the first-stage decision, which is what we actually implement. The next stages are included to find a nonmyopic and robust solution, but they are not really contingency plans to be implemented. When uncertainty is progressively resolved, we typically recalibrate and re-solve the model within a rolling-horizon framework. On the contrary, the use of a model in economics is generally quite different. Consider a life-cycle, strategic asset allocation model.25 Usually, a very limited number of assets is included in the model, possibly only two, a risk-free and a risky one. Indeed, the aim there is not really to manage the portfolio of a pension fund. The aim is to investigate the behavior of individual workers in order to see what drives their decisions (precautionary savings, habit formation, income path over time, etc.). To simulate consumption–saving and asset allocation decisions over several years, dynamic programming is definitely more convenient: once we have found the sequence of value functions, we may simulate the sequence of optimal decisions by solving a sequence of static optimization models, without the need to solve a dynamic model repeatedly. The result is a pattern of consumption–saving decisions, as well as an asset allocation between the risky and risk-free assets. The pattern may be checked against empirical evidence in order to assess the validity of the model and the assumptions behind it.
There is also a clear connection between recourse functions in stochastic programming and value functions in dynamic programming. However, a recourse function is instrumental in finding a first-stage decision. Decisions at later stages are made after observing the new state, which need not correspond to any node of the scenario tree, and there is no obvious way to adjust a recourse decision when an unplanned scenario is realized. Value functions, on the contrary, yield decisions in feedback form; hence, they may be used for any scenario. This is a definite advantage of dynamic programming, which comes at a price. First, we have to rely on a Markovian structure, whereas stochastic programming may cope with more general path dependence. Second, stochastic programming may cope with more complicated, less stylized decision models.
Another feature of dynamic programs is that, as a rule, they deal with multistage models where the same kind of decision must be made over and over. For instance, in a consumption–saving problem, at each time step we have to decide how much to consume and how much to save. However, there are decision problems in which the decisions at different stages are completely different. For instance, consider the design of a logistic network. At one stage, we cope with the fixed costs of installing warehouses and other facilities; hence, the decision variables define the network structure, which must be designed under demand uncertainty. At later stages, we have to cope with variable transportation costs; hence, the second-stage decisions are related to the optimal routing of items, in order to satisfy a known demand, with constraints imposed by the selected network structure. Clearly, this latter kind of model is less common in finance or economics, but it helps explain why stochastic programming is more common in operational decision problems.
Another noteworthy point is that, so far, we have assumed that the important output of the optimization model is a set of decisions, possibly in feedback form. However, there are problems in which it is the value of the objective function that is relevant. One important class of such problems is the pricing of American-style options. As we have seen in Section 10.2.5, pricing options with early exercise features requires the solution of a stochastic optimization problem. Since our ability to represent uncertainty by scenario trees is limited, the validity of a price found by stochastic programming is questionable. Armed with a sequence of value functions, possibly approximated ones, we may simulate the application of an early exercise policy with a large number of scenarios. Since a policy based on approximate value functions is suboptimal, we will get a lower bound on the option price, but at least we may check its reliability, as we shall see in Chapter 11.
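As a rough sketch of this idea, the following R code estimates a lower bound on the price of a Bermudan put under geometric Brownian motion by simulating a given exercise policy: at each exercise date we exercise whenever the immediate payoff beats an approximate continuation value, which here, purely for illustration, is taken as the Black–Scholes value of the corresponding European put over the remaining life. All parameter values are assumptions of ours; any such (suboptimal) policy yields a valid lower bound, and better continuation-value approximations are discussed in Chapter 11.

# Lower bound on a Bermudan put price by simulating an exercise policy
# driven by a crude continuation-value approximation.
set.seed(55555)
S0 <- 100; K <- 100; r <- 0.05; sigma <- 0.3; Tmat <- 1
m <- 50                                 # number of exercise opportunities
dt <- Tmat / m
nrep <- 20000                           # number of simulated price paths

# Black-Scholes value of a European put, used as the continuation proxy
bsPut <- function(S, K, r, sigma, tau) {
  d1 <- (log(S / K) + (r + sigma^2 / 2) * tau) / (sigma * sqrt(tau))
  d2 <- d1 - sigma * sqrt(tau)
  K * exp(-r * tau) * pnorm(-d2) - S * pnorm(-d1)
}

# Simulate GBM paths at the exercise dates (one row per path)
Z <- matrix(rnorm(nrep * m), nrow = nrep)
logInc <- (r - sigma^2 / 2) * dt + sigma * sqrt(dt) * Z
S <- S0 * exp(t(apply(logInc, 1, cumsum)))

payoff <- numeric(nrep)                 # discounted payoff under the policy
alive  <- rep(TRUE, nrep)               # paths not yet exercised
for (t in 1:m) {
  St <- S[, t]
  exercise <- pmax(K - St, 0)
  contVal  <- if (t < m) bsPut(St, K, r, sigma, (m - t) * dt) else 0
  stopNow  <- alive & exercise > 0 & exercise >= contVal
  payoff[stopNow] <- exp(-r * t * dt) * exercise[stopNow]
  alive <- alive & !stopNow
}
mean(payoff)                            # estimated lower bound on the price
sd(payoff) / sqrt(nrep)                 # standard error of the estimate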
Finally, we observe that DP is well-suited to cope with infinite-horizon problems, whereas stochastic programming with recourse is not. However, the two approaches may be integrated, since the value function for an infinite-horizon program may be used to avoid end-of-horizon effects in a finite-horizon stochastic program. Indeed, there are important connections between the approaches, especially when it comes to devising sophisticated decomposition algorithms to cope with large-scale problems. Thus, both approaches should be included in our bag of tools to tackle a wide class of relevant problems.