7
Trial-and-Error Searching and Data Fitting

7.1 Problem 1: A Search for Quantum States in a Box

Many computer techniques are well-defined sets of procedures leading to definite outcomes. In contrast, some computational techniques are trial-and-error algorithms in which decisions on what path to follow are made based on the current values of variables, and the program quits only when it thinks it has solved the problem. (We already did some of this when we summed a power series until the terms became small.) Writing this type of program is usually interesting because we must foresee how to have the computer act intelligently in all possible situations, and running them is very much like an experiment in which it is hard to predict what the computer will come up with.

Problem Probably the most standard problem in quantum mechanics1) is to solve for the energies of a particle of mass m bound within a 1D square well of radius a:

(7.1)  V(x) = −V0,  for |x| ≤ a;   V(x) = 0,  for |x| > a.

As shown in quantum mechanics texts (Gottfried, 1966), the energies of the bound states EB within this well are solutions of the transcendental equations

(7.2)  √(10 − EB) tan(√(10 − EB)) = √EB   (even),
(7.3)  √(10 − EB) cot(√(10 − EB)) = −√EB   (odd),

where even and odd refer to the symmetry of the wave function. Here we have chosen units such that ℏ = 2m = a = 1, so that the strength of the well enters the equations as the dimensionless number 10. Your problem is to

  1. Find several bound-state energies EB for even wave functions, that is, the solution of (7.2).
  2. See if making the potential deeper, say, by changing the 10 to a 20 or a 30, produces a larger number of, or deeper, bound states.

7.2 Algorithm: Trial-and-Error Roots via Bisection

Trial-and-error root finding looks for a value of x for which

(7.4)  f(x) = 0,

where the 0 on the right-hand side is conventional (an equation such as 10 sin x = 3x³ can easily be written as f(x) = 10 sin x − 3x³ = 0). The search procedure starts with a guessed value for x, substitutes that guess into f(x) (the “trial”), and then sees how far the LHS is from zero (the “error”). The program then changes x based on the error and tries out the new guess in f(x). The procedure continues until f(x) ≃ 0 to some desired level of precision, or until the changes in x are insignificant, or if the search seems endless.

The most elementary trial-and-error technique is the bisection algorithm. It is reliable but slow. If you know some interval in which f(x) changes sign, then the bisection algorithm will always converge to the root by finding progressively smaller and smaller intervals within which the zero lies. Other techniques, such as the Newton–Raphson method we describe next, may converge more quickly, but if the initial guess is not close, it may become unstable and fail completely.


Figure 7.1 A graphical representation of the steps involved in solving for a zero of f(x) using the bisection algorithm. The bisection algorithm takes the midpoint of the interval as the new guess for x, and so each step reduces the interval size by one-half. Four steps are shown for the algorithm.

The basis of the bisection algorithm is shown in Figure 7.1. We start with two values of x between which we know that a zero occurs. (You can determine these by making a graph or by stepping through different x values and looking for a sign change.) To be specific, let us say that f(x) is negative at x− and positive at x+:

(7.5)  f(x−) < 0,   f(x+) > 0.

(Note that it may well be that x− > x+ if the function changes from positive to negative as x increases.) Thus, we start with the interval between x− and x+ within which we know a zero occurs. The algorithm (implemented as in Listing 7.1) then picks the new x as the bisection of the interval and selects as its new interval the half in which the sign change occurs:

x = (x+ + x−)/2;   if f(x) f(x+) > 0, set x+ = x; otherwise set x− = x.

This process continues until the value of f(x) is less than a predefined level of precision or until a predefined (large) number of subdivisions occurs.

Listing 7.1 Bisection.py is a simple implementation of the bisection algorithm for finding a zero of a function, in this case 2 cos x − x.

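The original listing is not reproduced here; as a stand-in, the following is a minimal sketch of such a bisection search. The bracketing interval, tolerance, and iteration limit are illustrative choices, not the book's exact code.

```python
# Sketch of a bisection search for a zero of f(x) = 2 cos x - x (not the book's Bisection.py).
from math import cos

def f(x):
    return 2.0 * cos(x) - x

def bisection(xminus, xplus, eps=1e-6, nmax=100):
    """Return x with |f(x)| < eps, given f(xminus) < 0 and f(xplus) > 0."""
    for it in range(nmax):
        x = (xplus + xminus) / 2.0          # bisect the current interval
        if f(xplus) * f(x) > 0.0:           # same sign as f(x+): zero lies toward x-
            xplus = x
        else:                               # opposite sign: zero lies toward x+
            xminus = x
        if abs(f(x)) < eps:                 # converged to the desired precision
            return x, it
    return x, nmax

# f(2) < 0 and f(0) > 0, so the zero of 2 cos x - x lies between them.
root, steps = bisection(2.0, 0.0)
print("root =", root, " f(root) =", f(root), " bisections =", steps)
```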

The example in Figure 7.1 shows the first interval extending from x− = x−1 to x+ = x+1. We bisect that interval at x, and because f(x) < 0 at the midpoint, we set x− ← x and label it x−2 to indicate the second step. We then use x+2 ≡ x+1 and x−2 as the next interval and continue the process. We see that only x− changes for the first three steps in this example, but for the fourth step x+ finally changes. The changes then become too small for us to show.

7.2.1 Implementation: Bisection Algorithm

  1. The first step in implementing any search algorithm is to get an idea of what your function looks like. For the present problem, you do this by making a plot of f(EB) = √(10 − EB) tan(√(10 − EB)) − √EB versus EB. Note from your plot some approximate values at which f(EB) = 0. Your program should be able to find more exact values for these zeros.
  2. Write a program that implements the bisection algorithm and use it to find some solutions of (7.2). (A sketch of such a search is given after this list.)
  3. Warning: Because the tan function has singularities, you have to be careful. In fact, your graphics program (or Maple) may not function accurately near these singularities. One cure is to use a different but equivalent form of the equation. Show that an equivalent form of (7.2) is √EB cot(√(10 − EB)) = √(10 − EB) (7.6).
  4. Make a second plot of (7.6), which also has singularities but at different places. Choose some approximate locations for zeros from this plot.
  5. Evaluate f(EB) and thus determine directly the precision of your solution.
  6. Compare the roots you find with those given by Maple or Mathematica.
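As promised in item 2, here is a minimal sketch of a bisection search applied to the even-state equation (7.2). The scan step, the cutoff used to skip the tan singularity, and the tolerance are illustrative choices.

```python
# Sketch: find the even bound-state energies from f(E) = sqrt(10-E) tan(sqrt(10-E)) - sqrt(E).
import numpy as np

def f(E):
    k = np.sqrt(10.0 - E)
    return k * np.tan(k) - np.sqrt(E)

def bisect(a, b, eps=1e-9):
    while abs(b - a) > eps:
        m = 0.5 * (a + b)
        if f(a) * f(m) > 0.0:    # same sign at a and m: zero lies in [m, b]
            a = m
        else:
            b = m
    return 0.5 * (a + b)

roots = []
Es = np.arange(0.001, 9.999, 0.01)          # scan the allowed range 0 < EB < 10
for a, b in zip(Es[:-1], Es[1:]):
    fa, fb = f(a), f(b)
    # Accept a sign change only when |f| is moderate at both ends, so the jump
    # across the tan singularity is not mistaken for a zero.
    if fa * fb < 0.0 and abs(fa) < 5.0 and abs(fb) < 5.0:
        roots.append(bisect(a, b))

print("even bound-state energies EB =", roots)
```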

7.3 Improved Algorithm: Newton–Raphson Searching

The Newton–Raphson algorithm finds approximate roots of the equation

(7.7)  f(x) = 0

more quickly than the bisection method. As we see graphically in Figure 7.2, this algorithm is the equivalent of drawing a straight line f(x) ≃ mx + b tangent to the curve at an x value close to where f(x) ≃ 0, and then using as an improved guess for the root the intercept of that line with the x axis, where mx + b = 0. If the “curve” were a straight line, the answer would be exact; otherwise, it is a good approximation if the guess is close enough to the root for f(x) to be nearly linear. The process continues until some set level of precision is reached. If a guess is in a region where f(x) is nearly linear (Figure 7.2), then the convergence is much more rapid than for the bisection algorithm.

The analytic formulation of the Newton–Raphson algorithm starts with an old guess x0 and expresses a new guess x as a correction Δx to the old guess:

(7.8)  x0 = old guess,   Δx = unknown correction,
(7.9)  x ≃ x0 + Δx   (new guess).

Figure 7.2 A graphical representation of the steps involved in solving for a zero of f(x) using the Newton–Raphson method. The Newton–Raphson method takes the new guess as the zero of the line tangent to f(x) at the old guess. Two guesses are shown.

We next expand the known function f(x) in a Taylor series around x0 and keep only the linear terms:

(7.10)  f(x) ≃ f(x0) + (df/dx)|x0 Δx.

We then determine the correction Δx by calculating the point at which this linear approximation to f(x) crosses the x-axis:

(7.11)  f(x0) + (df/dx)|x0 Δx ≃ 0,
(7.12)  Δx = −f(x0) / (df/dx)|x0.

The procedure is repeated starting at the improved x until some set level of precision is obtained.

The Newton–Raphson algorithm (7.12) requires evaluation of the derivative df / dx at each value of x0. In many cases, you may have an analytic expression for the derivative and can build it into the algorithm. However, especially for more complicated problems, it is simpler and less error-prone to use a numerical forward-difference approximation to the derivative2):

(7.13)  df/dx ≃ [f(x + δx) − f(x)] / δx,

where δx is some small change in x that you choose (different from the Δx used for searching in (7.12)). While a central-difference approximation for the derivative would be more accurate, it would require additional evaluations of the f’s, and, once you find a zero, it does not matter how you got there. In Listing 7.2, we give a program NewtonCD.py that implements the search with a central-difference derivative.

Listing 7.2 NewtonCD.py uses the Newton–Raphson method to search for a zero of the function f(x). A central-difference approximation is used to determine df/dx.

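The original listing is not reproduced here; the following is a minimal sketch of such a Newton–Raphson search with a central-difference derivative. The test function 2 cos x − x, the starting guess, and the tolerances are illustrative assumptions.

```python
# Sketch of a Newton-Raphson search with a central-difference df/dx (not the book's NewtonCD.py).
from math import cos

def f(x):
    return 2.0 * cos(x) - x

x = 2.0            # initial guess
dx = 1e-4          # step used for the numerical derivative
eps = 1e-8         # desired precision in |f(x)|

for it in range(100):
    fx = f(x)
    if abs(fx) <= eps:                                  # converged
        break
    df = (f(x + dx / 2) - f(x - dx / 2)) / dx           # central-difference derivative
    x = x - fx / df                                     # Newton-Raphson correction
print("root =", x, " f(root) =", f(x), " iterations =", it)
```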

7.3.1 Newton–Raphson with Backtracking

Two examples of possible problems with the Newton–Raphson algorithm are shown in Figure 7.3. In Figure 7.3a, we see a case where the search takes us to an x value where the function has a local minimum or maximum, that is, where df/dx = 0. Because Δx = −f/f′, a vanishing derivative (horizontal tangent) means division by zero, and so the next guess is x = ∞, from where it is hard to return. When this happens, you need to start your search with a different guess and pray that you do not fall into this trap again. In cases where the correction is very large, but maybe not infinite, you may want to try backtracking (described below) and hope that by taking a smaller step you will not get into as much trouble.

In Figure 7.3b, we see a case where a search falls into an infinite loop surrounding the zero without ever getting there. A solution to this problem is called backtracking. As the name implies, in cases where the new guess x0 + Δx leads to an increase in the magnitude of the function, |f(x0 + Δx)| > |f(x0)|, you can backtrack somewhat and try a smaller step, say, x0 + Δx/2. If the magnitude of f still increases, then you just need to backtrack some more, say, by trying x0 + Δx/4 as your next guess, and so forth. Because you know that the tangent line leads to a local decrease in |f|, eventually an acceptably small step should be found.

The problem in both these cases is that the initial guesses were not close enough to the regions where f(x) is approximately linear. So again, a good plot may help produce a good first guess. Alternatively, you may want to start your search with the bisection algorithm and then switch to the faster Newton–Raphson algorithm when you get closer to the zero.
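As a rough illustration of the backtracking idea just described, the following sketch halves the Newton–Raphson step whenever the full correction would increase |f|. The test function, starting guess, and limits are illustrative assumptions, not code from the text.

```python
# Sketch of Newton-Raphson with backtracking: the step is halved until |f| decreases.
from math import cos

def f(x):
    return 2.0 * cos(x) - x

def newton_backtrack(x, eps=1e-8, dx=1e-4, nmax=100):
    for _ in range(nmax):
        fx = f(x)
        if abs(fx) <= eps:
            break
        df = (f(x + dx / 2) - f(x - dx / 2)) / dx    # numerical derivative
        step = -fx / df                              # full Newton-Raphson correction
        while abs(f(x + step)) > abs(fx) and abs(step) > 1e-12:
            step *= 0.5                              # backtrack: try half the step
        x = x + step
    return x

print(newton_backtrack(2.0))
```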


Figure 7.3 Two examples of how the Newton–Raphson algorithm may fail if the initial guess is not in the region where f(x) can be approximated by a straight line. (a) A guess lands at a local minimum/maximum, that is, a place where the derivative vanishes, and so the next guess ends up at x = ∞. (b) The search has fallen into an infinite loop. The technique known as “backtracking” could eliminate this problem.

7.3.2 Implementation: Newton–Raphson Algorithm

  1. Use the Newton–Raphson algorithm to find some energies EB that are solutions of (7.2). Compare these solutions with the ones found with the bisection algorithm.
  2. Again, notice that the 10 in (7.2) is proportional to the strength of the potential that causes the binding. See if making the potential deeper, say, by changing the 10 to a 20 or a 30, produces more or deeper bound states. (Note that in contrast to the bisection algorithm, your initial guess must be closer to the answer for the Newton–Raphson algorithm to work.)
  3. Modify your algorithm to include backtracking and then try it out on some difficult cases.
  4. Evaluate f(EB) and thus determine directly the precision of your solution.

7.4 Problem 2: Temperature Dependence of Magnetization

Problem Determine M(T) the magnetization as a function of temperature for simple magnetic materials.

A collection of N spin-1/2 particles, each with magnetic moment µ, is at temperature T. The collection has an external magnetic field B applied to it and comes to equilibrium with NL particles in the lower energy state (spins aligned with the magnetic field) and NU particles in the upper energy state (spins opposed to the magnetic field). The Boltzmann distribution law tells us that the relative probability of a state with energy E is proportional to exp(−E/kBT), where kB is Boltzmann’s constant. For a dipole with moment µ, its energy in a magnetic field is given by the dot product E = −µ · B. Accordingly, spin-up particles have lower energy in a magnetic field than spin-down particles, and thus are more probable.

Applying the Boltzmann distribution to our spin problem, we have that the number of particles in the lower energy level (spin up) is

(7.14)  NL = N exp(µB/kBT) / [exp(µB/kBT) + exp(−µB/kBT)],

while the number of particles in the upper energy level (spin down) is

(7.15)  NU = N exp(−µB/kBT) / [exp(µB/kBT) + exp(−µB/kBT)].

As discussed by (Kittel, 2005), we now assume that the molecular magnetic field B = λ M is much larger than the applied magnetic field and so replace B by the molecular field. This permits us to eliminate B from the preceding equations. The magnetization M(T) is given by the individual magnetic moment µ times the net number of particles pointing in the direction of the magnetic field:

(7.16)  M(T) = µ(NL − NU) = Nµ tanh(µB/kBT),
(7.17)  M(T) = Nµ tanh(λµM(T)/kBT),

where (7.17) follows from the replacement B → λM.

Note that this expression appears to make sense because, as the temperature approaches zero, all spins will be aligned along the direction of B and so M(T = 0) = Nµ.

Solution via Searching Equation 7.17 relates the magnetization and the temperature. However, it is not really a solution to our problem because M appears on the LHS of the equation as well as within the hyperbolic function on the RHS. Generally, a transcendental equation of this sort does not have an analytic solution that would give M as simply a function of the temperature T. But by working backward, we can find a numerical solution. To do that we first express (7.17) in terms of the reduced magnetization m, the reduced temperature t, and the Curie temperature Tc:

(7.18)  m(t) = tanh(m(t)/t),
(7.19)  m = M/(Nµ),   t = T/Tc,   Tc = Nµ²λ/kB.

While it is no easier to find an analytic solution to (7.18) than it was to (7.17), the simpler form of (7.18) makes the programming easier as we search for values of t and m that satisfy (7.18).


Figure 7.4 A function of the reduced magnetization m at three reduced temperatures t. A zero of this function determines the value of the magnetization at a particular value of t.

One approach to a trial-and-error solution is to define a function

(7.20)  f(m, t) = m − tanh(m/t)

and then, for a variety of fixed t = ti values, search for those m values at which f(m, ti) = 0. (One could just as well fix the value of m to mj and search for the value of t for which f(mj, t) = 0; once you have a solution, you have a solution.) Each zero so found gives us a single value of m(ti). A plot or a table of these values for a range of ti values then provides the best we can do as the desired solution m(t).

Figure 7.4 shows three plots of f(m,t) as a function of the reduced magnetization m, each plot for a different value of the reduced temperature. As you can see, other than the uninteresting solution at m = 0, there is only one solution (a zero) near m = 1 for t = 0.5, and no solution at other temperatures.
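A minimal sketch of this search is given below: for each fixed reduced temperature t below the Curie point, the nonzero root of f(m, t) is bracketed and found by bisection. The t grid, the bracket, and the tolerance are illustrative choices.

```python
# Sketch: build m(t) by fixing t and bisecting f(m, t) = m - tanh(m/t) for its nonzero root.
import numpy as np

def f(m, t):
    return m - np.tanh(m / t)

def m_of_t(t, eps=1e-6):
    a, b = 1e-6, 1.0                 # for t < 1, f changes sign between these ends
    while b - a > eps:
        mid = 0.5 * (a + b)
        if f(a, t) * f(mid, t) > 0.0:
            a = mid
        else:
            b = mid
    return 0.5 * (a + b)

for t in np.arange(0.1, 1.0, 0.1):   # only t < 1 has a nonzero solution
    print(f"t = {t:.1f}   m(t) = {m_of_t(t):.6f}")
```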

7.4.1 Searching Exercise

  1. Find the root of (7.20) to six significant figures for t = 0.5 using the bisection algorithm.
  2. Find the root of (7.20) to six significant figures for t = 0.5 using the Newton–Raphson algorithm.
  3. Compare the time it takes to find the solutions for the bisection and Newton–Raphson algorithms.
  4. Construct a plot of the reduced magnetization m(t) as a function of the reduced temperature t.

7.5 Problem 3: Fitting An Experimental Spectrum

Problem The cross sections measured for the resonant scattering of neutrons from a nucleus are given in Table 7.1. Your problem is to determine values for the cross sections at energy values lying between those in the table.

You can solve this problem in a number of ways. The simplest is to numerically interpolate between the values of the experimental f(Ei) given in Table 7.1. This is direct and easy but does not account for there being experimental noise in the data. A more appropriate solution (discussed in Section 7.7) is to find the best fit of a theoretical function to the data. We start with what we believe to be the “correct” theoretical description of the data,

Table 7.1 Experimental values for a scattering cross section (f(E) in the theory), each with absolute error ±σi, as a function of energy (xi in the theory).

i = 1 2 3 4 5 6 7 8 9
Ei (MeV) 0 25 50 75 100 125 150 175 200
g(Ei) (mb) 10.6 16.0 45.0 83.5 52.8 19.9 10.8 8.25 4.7
Error (mb) 9.34 17.9 41.5 85.5 51.5 21.5 10.8 6.29 4.14
(7.21)  f(E) = fr / [(E − Er)² + Γ²/4],

where fr, Er, and Γ are unknown parameters. We then adjust the parameters to obtain the best fit. This is a best fit in a statistical sense but in fact may not pass through all (or any) of the data points. For an easy, yet effective, introduction to statistical data analysis, we recommend (Bevington and Robinson, 2002).

These two techniques of interpolation and least-squares fitting are powerful tools that let you treat tables of numbers as if they were analytic functions and sometimes let you deduce statistically meaningful constants or conclusions from measurements. In general, you can view data fitting as global or local. In global fits, a single function in x is used to represent the entire set of numbers in a table like Table 7.1. While it may be spiritually satisfying to find a single function that passes through all the data points, if that function is not the correct function for describing the data, the fit may show nonphysical behavior (such as large oscillations) between the data points. The rule of thumb is that if you must interpolate, keep it local and view global interpolations with a critical eye.

Consider Table 7.1 as ordered data that we wish to interpolate. We call the independent variable x and its tabulated values xi (i = 1, 2, …), and assume that the dependent variable is the function g(x), with the tabulated values gi = g(xi). We assume that g(x) can be approximated as an (n − 1)-degree polynomial in each interval i:

(7.22)  gi(x) ≃ a0 + a1x + a2x² + ⋯ + an−1 x^(n−1),   x ≃ xi.
Because our fit is local, we do not assume that one g(x) can fit all the data in the table but instead use a different polynomial, that is, a different set of ai values, for each interval. While each polynomial is of low degree, multiple polynomials are needed to span the entire table. If some care is taken, the set of polynomials so obtained will behave well enough to be used in further calculations without introducing much unwanted noise or discontinuities.

The classic interpolation formula was created by Lagrange. He figured out a closed-form expression that directly fits the (n – 1)-order polynomial (7.22) to n values of the function g(x) evaluated at the points xi . The formula for each interval is written as the sum of polynomials:

(7.23)  g(x) ≃ g1λ1(x) + g2λ2(x) + ⋯ + gnλn(x),
(7.24)  λi(x) = ∏ over j ≠ i of (x − xj)/(xi − xj).

For three points, (7.23) provides a second-degree polynomial, while for eight points it gives a seventh-degree polynomial. For example, assume that we are given the points and function values

(7.25)  xi = (0, 1, 2, 4),   gi = (−12, −12, −24, −60).

With four points, the Lagrange formula determines a third-order polynomial that reproduces each of the tabulated values:

(7.26)  g(x) = (−12)λ1(x) + (−12)λ2(x) + (−24)λ3(x) + (−60)λ4(x) = x³ − 9x² + 8x − 12.

As a check, we see that

(7.27)  g(4) = 4³ − 9(4)² + 8(4) − 12 = −60,   g(0.5) = −10.125.

If the data contain little noise, this polynomial can be used with some confidence within the range of the data, but with risk beyond the range of the data.

Notice that Lagrange interpolation has no restriction that the points xi be evenly spaced. Usually, the Lagrange fit is made to only a small region of the table with a small value of n, despite the fact that the formula works perfectly well for fitting a high-degree polynomial to the entire table. The difference between the value of the polynomial evaluated at some x and that of the actual function can be shown to be the remainder

(7.28)  Rn ≃ [(x − x1)(x − x2)⋯(x − xn) / n!] g^(n)(ζ),

where ζ lies somewhere in the interpolation interval. What is significant here is that if g(x) has significant high derivatives, then it cannot be approximated well by a polynomial. For example, a table of noisy data would have significant high derivatives.
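As a concrete illustration of (7.23) and (7.24), here is a minimal sketch of an n-point Lagrange interpolation. The use of the first three entries of Table 7.1 as test data is an illustrative choice, not part of the text.

```python
# Sketch of n-point Lagrange interpolation, (7.23)-(7.24).
def lagrange(x, xs, gs):
    """Evaluate the Lagrange polynomial through the points (xs[i], gs[i]) at x."""
    total = 0.0
    n = len(xs)
    for i in range(n):
        li = 1.0
        for j in range(n):
            if j != i:
                li *= (x - xs[j]) / (xs[i] - xs[j])   # build lambda_i(x)
        total += gs[i] * li
    return total

E   = [0.0, 25.0, 50.0]         # first three energies of Table 7.1 (MeV)
sig = [10.6, 16.0, 45.0]        # corresponding cross sections (mb)
print(lagrange(25.0, E, sig))   # reproduces the tabulated value 16.0
print(lagrange(35.0, E, sig))   # interpolated value between table entries
```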

7.5.1 Lagrange Implementation, Assessment

Consider the experimental neutron scattering data in Table 7.1. The expected theoretical functional form that describes these data is (7.21), and our empirical fits to these data are shown in Figure 7.5.

  1. Write a subroutine to perform an n-point Lagrange interpolation using (7.23). Treat n as an arbitrary input parameter. (You may also perform this exercise with the spline fits discussed in Section 7.5.2.)
  2. Use the Lagrange interpolation formula to fit the entire experimental spectrum with one polynomial. (This means that you must fit all nine data points with an eighth-degree polynomial.) Then use this fit to plot the cross section in steps of 5 MeV.
  3. Use your graph to deduce the resonance energy Er (your peak position) and Γ (the full-width at half-maximum). Compare your results with those predicted by a theorist friend, (Er, Γ) = (78, 55) MeV.
  4. A more realistic use of Lagrange interpolation is for local interpolation with a small number of points, such as three. Interpolate the preceding cross-sectional data in 5-MeV steps using three-point Lagrange interpolation for each interval. (Note that the end intervals may be special cases.)
  5. We deliberately have not discussed extrapolation of data because it can lead to serious systematic errors; the answer you get may well depend more on the function you assume than on the data you input. Add some adventure to your life and use the programs you have written to extrapolate to values outside Table 7.1. Compare your results to the theoretical Breit–Wigner shape (7.21).

This example shows how easy it is to go wrong with a high-degree-polynomial fit to data with errors. Although the polynomial is guaranteed to pass through all the data points, the representation of the function away from these points can be quite unrealistic. Using a low-order interpolation formula, say, n = 2 or 3, in each interval usually eliminates the wild oscillations, but may not have any theoretical justification. If these local fits are matched together, as we discuss in the next section, a rather continuous curve results. Nonetheless, you must recall that if the data contain errors, a curve that actually passes through them may lead you astray. We discuss how to do this properly with least-squares fitting in Section 7.7.

7.5.2 Cubic Spline Interpolation (Method)

If you tried to interpolate the resonant cross section with Lagrange interpolation, then you saw that fitting parabolas (three-point interpolation) within a table may avoid the erroneous and possibly catastrophic deviations of a high-order formula. (A two-point interpolation, which connects the points with straight lines, may not lead you far astray, but it is rarely pleasing to the eye or precise.) A sophisticated variation of an n = 4 interpolation, known as cubic splines, often leads to surprisingly eye-pleasing fits. In this approach (Figure 7.5), cubic polynomials are fit to the function in each interval, with the additional constraint that the first and second derivatives of the polynomials be continuous from one interval to the next. This continuity of slope and curvature is what makes the spline fit particularly eye-pleasing. It is analogous to what happens when you use the flexible spline drafting tool (a lead wire within a rubber sheath) from which the method draws its name.


Figure 7.5 Three fits to data. Dashed: Lagrange interpolation using an eighth-degree polynomial; short dashes: cubic spline fit; long dashes: least-squares parabola fit.

The series of cubic polynomials obtained by spline-fitting a table of data can be integrated and differentiated and is guaranteed to have well-behaved derivatives. The existence of meaningful derivatives is an important consideration. As a case in point, if the interpolated function is a potential, you can take the derivative to obtain the force. The complexity of simultaneously matching polynomials and their derivatives over all the interpolation points leads to many simultaneous linear equations to be solved. This makes splines unattractive for hand calculation, yet easy for computers and, not surprisingly, popular in both calculations and computer drawing programs. To illustrate, the smooth solid curve in Figure 7.5 is a spline fit.

The basic approximation of splines is the representation of the function g(x) in the subinterval [xi, xi+1] with a cubic polynomial:

(7.29)  g(x) ≃ gi(x),   xi ≤ x ≤ xi+1,
(7.30)  gi(x) = gi + g′i (x − xi) + (1/2) g″i (x − xi)² + (1/6) g‴i (x − xi)³.

This representation makes it clear that the coefficients in the polynomial equal the values of g(x) and its first, second, and third derivatives at the tabulated points xi. Derivatives beyond the third vanish for a cubic. The computational chore is to determine these derivatives in terms of the N tabulated gi values. The matching of gi at the nodes that connect one interval to the next provides the equations

(7.31)  gi−1(xi) = gi(xi) = gi,   i = 2, …, N − 1.

The matching of the first and second derivatives at each interval’s boundaries provides the equations

(7.32)  g′i−1(xi) = g′i(xi),   g″i−1(xi) = g″i(xi).

The additional equations needed to determine all constants are obtained by matching the third derivatives at adjacent nodes. Values for the third derivatives are found by approximating them in terms of the second derivatives:

(7.33)  g‴i ≃ (g″i+1 − g″i) / (xi+1 − xi).
As discussed in Chapter 5, a central-difference approximation would be more accurate than a forward-difference approximation, yet (7.33) keeps the equations simpler.

It is straightforward yet complicated to solve for all the parameters in (7.30). We leave that to the references (Thompson, 1992; Press et al., 1994). We can see, however, that matching at the boundaries of the intervals results in only (N – 2) linear equations for N unknowns. Further input is required. It usually is taken to be the boundary conditions at the endpoints a = x1 and b = xN, specifically, the second derivatives g″(a) and g″(b). There are several ways to determine these second derivatives:

Natural spline: Set g″(a) = g″(b) = 0; that is, permit the function to have a slope at the endpoints but no curvature there. This is “natural” because the second derivative vanishes for the flexible spline drafting tool (its ends being free).

Input values for g′ at the boundaries: The computer uses g′(a) to approximate g″(a). If you do not know the first derivatives, you can calculate them numerically from the table of gi values.

Input values for g″ at the boundaries: Knowing values is of course better than approximating values, but it requires the user to input information. If the values of g″ are not known, they can be approximated by applying a forward-difference approximation to the tabulated values:

(7.34)  g″(x1) ≃ (g3 − 2g2 + g1)/h²   (uniform spacing h), and similarly at the other endpoint.

7.5.2.1 Cubic Spline Quadrature (Exploration)

A powerful integration scheme is to fit an integrand with splines and then integrate the cubic polynomials analytically. If the integrand g(x) is known only at its tabulated values, then this is about as good an integration scheme as is possible; if you have the ability to calculate the function directly for arbitrary x, Gaussian quadrature may be preferable. We know that the spline fit to g in each interval is the cubic (7.30)

(7.35)  gi(x) = gi + g′i (x − xi) + (1/2) g″i (x − xi)² + (1/6) g‴i (x − xi)³,   xi ≤ x ≤ xi+1.

It is easy to integrate this to obtain the integral of g for this interval and then to sum over all intervals:

(7.36)  ∫ from xi to xi+1 of g(x) dx ≃ gi h + (1/2) g′i h² + (1/6) g″i h³ + (1/24) g‴i h⁴,   h = xi+1 − xi,
(7.37)  ∫ from x1 to xN of g(x) dx = Σ over i of ∫ from xi to xi+1 of g(x) dx.

Making the intervals smaller does not necessarily increase precision, as subtractive cancelations in (7.36) may get large.

Spline Fit of Cross Section (Implementation) Fitting a series of cubics to data is a little complicated to program yourself, so we recommend using a library routine. While we have found quite a few Java-based spline applications available on the Internet, none seemed appropriate for interpolating a simple set of numbers. That being the case, we have adapted the splint.c and the spline.c functions from (Press et al., 1994) to produce the SplineInteract.py program shown in Listing 7.3 (there is also an applet). Your problem for this section is to carry out the assessment in Section 7.5.1 using cubic spline interpolation rather than Lagrange interpolation.

7.6 Problem 4: Fitting Exponential Decay

Problem Figure 7.6 presents actual experimental data on the number of decays ΔN of the π meson as a function of time (Stetz et al., 1973). Notice that the time has been “binned” into Δt = 10 ns intervals and that the smooth curve is the theoretical exponential decay expected for very large numbers of pions (which is not the case here). Your problem is to deduce the lifetime τ of the π meson from these data (the tabulated lifetime of the pion is 2.6 × 10⁻⁸ s).

Theory Assume that we start with N0 particles at time t = 0 that can decay to other particles.3) If we wait a short time Δt, then a small number ΔN of the particles will decay spontaneously, that is, with no external influences. This decay is a stochastic process, which means that there is an element of chance involved in just when a decay will occur, and so no two experiments are expected to give exactly the same results. The basic law of nature for spontaneous decay is that the number of decays ΔN in a time interval Δt is proportional to the number of particles N(t) present at that time and to the time interval

(7.38)  ΔN(t) = −N(t) Δt/τ,   ΔN(t)/Δt = −N(t)/τ.

Listing 7.3 SplineInteract.py performs a cubic spline fit to data and permits interactive control. The arrays x[] and y[] are the data to fit, and the values of the fit at Nfit points are output.

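The listing itself is not reproduced here. As a stand-in for the adapted spline/splint routines, the following sketch uses SciPy's CubicSpline (an assumption; the book's program does not rely on SciPy) with natural boundary conditions to interpolate the Table 7.1 cross sections at Nfit points.

```python
# Sketch of a natural cubic-spline fit to the Table 7.1 cross sections (not the book's SplineInteract.py).
import numpy as np
from scipy.interpolate import CubicSpline

x = np.array([0., 25., 50., 75., 100., 125., 150., 175., 200.])       # energies (MeV)
y = np.array([10.6, 16.0, 45.0, 83.5, 52.8, 19.9, 10.8, 8.25, 4.7])   # cross sections (mb)

cs = CubicSpline(x, y, bc_type='natural')   # natural spline: g''(a) = g''(b) = 0

Nfit = 41
xfit = np.linspace(x[0], x[-1], Nfit)       # evaluate the spline at Nfit points
for xi, gi in zip(xfit, cs(xfit)):
    print(f"{xi:7.1f}  {gi:8.3f}")

# The spline can also be integrated analytically over the table (cf. Section 7.5.2.1).
print("integral over the table:", cs.integrate(x[0], x[-1]))
```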

Figure 7.6 A reproduction of the experimental measurement by Stetz et al. (1973) giving the number of decays of a π meson as a function of time since its creation. Measurements were made during time intervals (box sizes) of 10-ns width. The dashed curve is the result of a linear least-squares fit to the log N(t).

Here τ = 1/λ is the lifetime of the particle, with λ the rate parameter. The actual decay rate is given by the second equation in (7.38). If the number of decays ΔN is very small compared to the number of particles N, and if we look at vanishingly small time intervals, then the difference equation (7.38) becomes the differential equation

(7.39)  dN(t)/dt = −N(t)/τ.

This differential equation has an exponential solution for the number as well as for the decay rate:

(7.40)  N(t) = N(0) e^(−t/τ),   dN(t)/dt = −(N(0)/τ) e^(−t/τ) = (dN/dt)(0) e^(−t/τ).

Equation 7.40 is the theoretical formula we wish to “fit” to the data in Figure 7.6. The output of such a fit is a “best value” for the lifetime τ.

7.7 Least-Squares Fitting (Theory)

Books have been written and careers have been spent discussing what is meant by a “good fit” to experimental data. We cannot do justice to the subject here and refer the reader to Bevington and Robinson (2002); Press et al. (1994); Thompson (1992). However, we will emphasize three points:

  1. If the data being fit contain errors, then the “best fit” in a statistical sense should not pass through all the data points.
  2. If the theory is not an appropriate one for the data (e.g., the parabola in Figure 7.5), then its best fit to the data may not be a good fit at all. This is good, for this is how we know that the theory is not right.
  3. Only for the simplest case of a linear least-squares fit can we write down a closed-form solution to evaluate and obtain the fit. More realistic problems are usually solved by trial-and-error search procedures, sometimes using sophisticated subroutine libraries. However, in Section 7.8.2 we show how to conduct such a nonlinear search using familiar tools.

Imagine that you have measured ND data values of the dependent variable y as a function of the independent variable x:

(7.41)  (xi, yi ± σi),   i = 1, …, ND,

where ±σi is the experimental uncertainty in the ith value of y. (For simplicity we assume that all the errors σi occur in the dependent variable, although this is hardly ever true (Thompson, 1992).) For our problem, y is the number of decays as a function of time, and xi are the times. Our goal is to determine how well a mathematical function y = g(x) (also called a theory or a model) can describe these data. Additionally, if the theory contains some parameters or constants, our goal can be viewed as determining the best values for these parameters. We assume that the theory function g(x) contains, in addition to the functional dependence on x, an additional dependence upon MP parameters {a1, a2, …, aMP}. Notice that the parameters {am} are not variables, in the sense of numbers read from a meter, but rather are parts of the theoretical model, such as the size of a box, the mass of a particle, or the depth of a potential well. For the exponential decay function (7.40), the parameters are the lifetime τ and the initial decay rate dN(0)/dt. We include the parameters as

(7.42)  g(x) = g(x; {a1, a2, …, aMP}) = g(x; {am}),

where the ai’s are parameters and x the independent variable. We use the chi-square (χ²) measure (Bevington and Robinson, 2002) as a gauge of how well a theoretical function g reproduces data:

(7.43)  χ² = Σ from i = 1 to ND of [(yi − g(xi; {am})) / σi]²,

where the sum is over the ND experimental points (xi, yi ± σi). The definition (7.43) is such that smaller values of χ² are better fits, with χ² = 0 occurring if the theoretical curve went through the center of every data point. Notice also that the 1/σi² weighting means that measurements with larger errors4) contribute less to χ².

Least-squares fitting refers to adjusting the parameters in the theory until a minimum in χ² is found, that is, finding a curve that produces the least value for the summed squares of the deviations of the data from the function g(x). In general, this is the best fit possible and the best way to determine the parameters in a theory. The MP parameters {am, m = 1, …, MP} that make χ² an extremum are found by solving the MP equations:

(7.44)  ∂χ²/∂am = 0   ⇒   Σ from i = 1 to ND of [(yi − g(xi)) / σi²] ∂g(xi)/∂am = 0,   m = 1, …, MP.

Often, the function g(x; {am}) has a sufficiently complicated dependence on the am values for (7.44) to produce MP simultaneous nonlinear equations in the am values. In these cases, solutions are found by a trial-and-error search through the MP-dimensional parameter space, as we do in Section 7.8.2. To be safe, when such a search is completed, you need to check that the minimum χ2 you found is global and not local. One way to do that is to repeat the search for a whole grid of starting values, and if different minima are found, to pick the one with the lowest χ2.

7.7.1 Least-Squares Fitting: Theory and Implementation

When the deviations from theory are a result of random errors described by a Gaussian distribution, there are some useful rules of thumb to remember (Bevington and Robinson, 2002). You know that your fit is good if the value of χ² calculated via the definition (7.43) is approximately equal to the number of degrees of freedom, χ² ≈ ND − MP, where ND is the number of data points and MP is the number of parameters in the theoretical function. If your χ² is much less than ND − MP, it does not mean that you have a “great” theory or a really precise measurement; instead, you probably have too many parameters or have assigned errors (σi values) that are too large. In fact, a too-small χ² may indicate that you are fitting the random scatter in the data rather than missing approximately one-third of the error bars, as expected for random errors. If your χ² is significantly greater than ND − MP, the theory may not be good, you may have significantly underestimated your errors, or you may have errors that are not random.

The MP simultaneous equations (7.44) can be simplified considerably if the functions g(x; {am}) depend linearly on the parameter values ai, for example,

(7.45)  g(x; {a1, a2}) = a1 + a2x.

Figure 7.7 A linear least-squares best fit of a straight line to data. The deviation of theory from experiment is greater than would be expected from statistics, which means that a straight line is not a good theory to describe these data.

In this case (also known as linear regression or a straight-line fit), as shown in Figure 7.7, there are MP = 2 parameters, the slope a2 and the y intercept a1. Notice that while there are only two parameters to determine, there still may be an arbitrary number ND of data points to fit. Remember, a unique solution is not possible unless the number of data points is equal to or greater than the number of parameters. For this linear case, there are just two derivatives,

(7.46)  ∂g(xi)/∂a1 = 1,   ∂g(xi)/∂a2 = xi,

and after substitution, the χ2 minimization equations (7.44) can be solved (Press et al., 1994):

(7.47)  a1 = (Sxx Sy − Sx Sxy)/Δ,   a2 = (S Sxy − Sx Sy)/Δ,
(7.48)  S = Σ 1/σi²,   Sx = Σ xi/σi²,   Sy = Σ yi/σi²,
(7.49)  Sxx = Σ xi²/σi²,   Sxy = Σ xi yi/σi²,   Δ = S Sxx − Sx².

Statistics also gives you an expression for the variance or uncertainty in the deduced parameters:

(7.50)  σ²a1 = Sxx/Δ,   σ²a2 = S/Δ.

This is a measure of the uncertainties in the values of the fitted parameters arising from the uncertainties σi in the measured yi values. A measure of the dependence of the parameters on each other is given by the correlation coefficient:

(7.51)  ρ(a1, a2) = cov(a1, a2)/(σa1 σa2),   cov(a1, a2) = −Sx/Δ.

Here cov(a1, a2) is the covariance of a1 and a2 and vanishes if a1 and a2 are independent. The correlation coefficient ρ(a1, a2) lies in the range −1 ≤ ρ ≤ 1, with a positive ρ indicating that the errors in a1 and a2 are likely to have the same sign, and a negative ρ indicating opposite signs.

The preceding analytic solutions for the parameters are of the form found in statistics books but are not optimal for numerical calculations because subtractive cancelation can make the answers unstable. As discussed in Chapter 3, a rearrangement of the equations can decrease this type of error. For example, Thompson (1992) gives improved expressions that measure the data relative to their averages:

(7.52)  a2 = [Σ (xi − ⟨x⟩)(yi − ⟨y⟩)/σi²] / [Σ (xi − ⟨x⟩)²/σi²],   a1 = ⟨y⟩ − a2⟨x⟩,   ⟨x⟩ = Sx/S,   ⟨y⟩ = Sy/S.
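Before moving on, here is a minimal sketch of the closed-form straight-line fit (7.47)–(7.50) in code form; the small synthetic data set is purely illustrative.

```python
# Sketch of the closed-form weighted straight-line fit, (7.47)-(7.50). Data are illustrative only.
import numpy as np

x   = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y   = np.array([2.1, 2.9, 4.2, 4.8, 6.1])
sig = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

w = 1.0 / sig**2
S, Sx, Sy = w.sum(), (w * x).sum(), (w * y).sum()
Sxx, Sxy  = (w * x * x).sum(), (w * x * y).sum()
Delta     = S * Sxx - Sx**2

a1 = (Sxx * Sy - Sx * Sxy) / Delta          # intercept
a2 = (S * Sxy - Sx * Sy) / Delta            # slope
var_a1, var_a2 = Sxx / Delta, S / Delta     # variances of the fitted parameters

chi2 = (((y - a1 - a2 * x) / sig)**2).sum()
print(f"a1 = {a1:.3f} +- {np.sqrt(var_a1):.3f}, a2 = {a2:.3f} +- {np.sqrt(var_a2):.3f}, chi2 = {chi2:.2f}")
```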

In Fit.py in Listing 7.4 we give a program that fits a parabola to some data. You can use it as a model for fitting a line to data, although you can use our closed-form expressions for a straight-line fit. In Fit.py on the instructor’s site, we give a program for fitting to the decay data.

7.8 Exercises: Fitting Exponential Decay, Heat Flow and Hubble’s Law

  1. Fit the exponential decay law (7.40) to the data in Figure 7.6. This means finding values for τ and ΔN(0)/Δt that provide a best fit to the data, and then judging how good the fit is.
    1. a) Construct a table of approximate values for (ΔNi/Δt, ti), for i = 1, …, ND, as read from Figure 7.6. Because time was measured in bins, ti should correspond to the middle of a bin.
    2. b) Add an estimate of the error σi to obtain a table of the form (ΔNi/Δt ± σi, ti). You can estimate the errors by eye, say, by estimating how much the histogram values appear to fluctuate about a smooth curve, or you can take σi ≃ √(ΔNi). (This last approximation is reasonable for large numbers, which this is not.)
    3. c) In the limit of very large numbers, we would expect a plot of ln |dN/dt| vs. t to be a straight line:
      (7.53)  ln |ΔN(t)/Δt| ≃ ln |ΔN(0)/Δt| − t/τ.

      This means that if we treat ln |ΔN(t)/Δt| as the dependent variable and the time t as the independent variable, we can use our linear-fit results. Plot ln |ΔN/Δt| vs. t.

    4. d) Make a least-squares fit of a straight line to your data and use it to determine the lifetime τ of the π meson. Compare your deduction to the tabulated lifetime of 2.6 × 10−8 s and comment on the difference.
    5. e) Plot your best fit on the same graph as the data and comment on the agreement.
    6. f) Deduce the goodness of fit of your straight line and the approximate error in your deduced lifetime. Do these agree with what your “eye” tells you?
    7. g) Now that you have a fit, look at the data again and estimate what a better value for the errors in the ordinates might be.
  2. Table 7.2 gives the temperature T along a metal rod whose ends are kept at a fixed constant temperature. The temperature is a function of the distance x along the rod.
    1. a) Plot the data in Table 7.2 to verify the appropriateness of a linear relation
      (7.54)  T(x) ≃ a + bx.
    2. b) Because you are not given the errors for each measurement, assume that the least significant figure has been rounded off and so σ ≥ 0.05.
    3. c) Use that to compute a least-squares straight-line fit to these data.
    4. d) Plot your best a + bx on the curve with the data.
    5. e) After fitting the data, compute the variance and compare it to the deviation of your fit from the data. Verify that about one-third of the points miss the σ error band (that is what is expected for a normal distribution of errors).
    6. f) Use your computed variance to determine the χ2 of the fit. Comment on the value obtained.
    7. g) Determine the variances σa and σb and check whether it makes sense to use them as the errors in the deduced values for a and b.
  3. In 1929, Edwin Hubble examined the data relating the radial velocity v of 24 extra galactic nebulae to their distance r from our galaxy (Hubble, 1929). Although there was considerable scatter in the data, he fit them with a straight line:
    (7.55)  v = Hr,
    where H is now called the Hubble constant. Table 7.3 contains the distances and velocities used by Hubble.
    1. a) Plot the data to verify the appropriateness of a linear relation
      (7.56)  v ≃ a + Hr.
    2. b) Because you are not given the errors for each measurement, you may assume that the least significant figure has been rounded off and so σ ≥ 1. Or, you may assume that astronomical measurements are hard to make and that there are at least 10% errors in the data.

      Table 7.2 Temperature vs. distance as measured along a metal rod.

      xi (cm) 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
      Ti (C) 14.6 18.5 36.6 30.8 59.2 60.1 62.2 79.4 99.9

      Table 7.3 Distance vs. radial velocity for 24 extragalactic nebulae.

      Object r (Mpc) v (km/s) Object r (Mpc) v (km/s)
      S. Mag. 0.032 170 3627 0.9 650
      L. Mag. 0.034 290 4826 0.9 150
      6822 0.214 –130 5236 0.9 500
      598 0.263 –70 1068 1.0 920
      221 0.275 –185 5055 1.1 450
      224 0.275 –220 7331 1.1 500
      5457 0.45 200 4258 1.4 500
      4736 0.5 290 4141 1.7 960
      5194 0.5 270 4382 2.0 500
      4449 0.63 200 4472 2.0 850
      4214 0.8 300 4486 2.0 800
      3031 0.9 –30 4649 2.0 1090
    3. c) Compute a least-squares straight-line fit to these data.
    4. d) Plot your best a + Hr on the curve with the data.
    5. e) After fitting the data, compute the variance and compare it to the deviation of your fit from the data. Verify that about one-third of the points miss the σ error band (that is what is expected for a normal distribution of errors).
    6. f) Use your computed variance to determine the χ2 of the fit. Comment on the value obtained.
    7. g) Determine the variances σa and σb and check whether it makes sense to use them as the errors in the deduced values for a and b.
    8. h) Now that you have a fit, look at the data again and estimate what a better value for the errors in the ordinates might be.

7.8.1 Linear Quadratic Fit

As indicated earlier, as long as the function being fitted depends linearly on the unknown parameters ai, the condition of minimum χ² leads to a set of simultaneous linear equations for the a’s that can be solved by hand or on the computer using matrix techniques. To illustrate, suppose we want to fit the quadratic polynomial

(7.57)  g(x) = a1 + a2x + a3x²

to the experimental measurements (xi, yi), i = 1, …, ND (Figure 7.8). Because this g(x) is linear in all the parameters ai, we can still make a linear fit although x is raised to the second power. (However, if we tried to fit a function of the form g(x) = (a1 + a2x) exp(−a3x) to the data, then we would not be able to make a linear fit because there is not a linear dependence on a3.)


Figure 7.8 A linear least-squares best fit of a parabola to data. Here we see that the fit misses approximately one-third of the points, as expected from the statistics for a good fit.

The best fit of this quadratic to the data is obtained by applying the minimum χ2 condition (7.44) for Mp = 3 parameters and ND (still arbitrary) data points. A solution represents the maximum likelihood that the deduced parameters provide a correct description of the data for the theoretical function g(x). Equation 7.44 leads to the three simultaneous equations for a1, a2,and a3:

(7.58)  Σ [yi − a1 − a2xi − a3xi²]/σi² = 0,
(7.59)  Σ [yi − a1 − a2xi − a3xi²] xi/σi² = 0,
(7.60)  Σ [yi − a1 − a2xi − a3xi²] xi²/σi² = 0.

Note: Because the derivatives are independent of the parameters (the a’s), the a dependence arises only from the term in square brackets in the sums, and because that term has only a linear dependence on the a’s, these equations are linear in the a’s.

Exercise Show that after some rearrangement, (7.58)–(7.60) can be written as

(7.61)  S a1 + Sx a2 + Sxx a3 = Sy,
        Sx a1 + Sxx a2 + Sxxx a3 = Sxy,
        Sxx a1 + Sxxx a2 + Sxxxx a3 = Sxxy.

Here the definitions of the S’s are simple extensions of those used in (7.47)–(7.49) and are programmed in Fit.py shown in Listing 7.4. After placing the three unknown parameters into a vector a and the known three RHS terms in (7.61) into a vector s, these equations assume the matrix form

(7.62)  [ S    Sx    Sxx   ] [a1]   [ Sy  ]
        [ Sx   Sxx   Sxxx  ] [a2] = [ Sxy ]
        [ Sxx  Sxxx  Sxxxx ] [a3]   [ Sxxy]

The solution for the parameter vector a is obtained by solving the matrix equation. Although for 3 × 3 matrices we can write out the solution in closed form, for larger problems the numerical solution requires matrix methods.

Listing 7.4 Fit.py performs a least-squares fit of a parabola to data using the NumPy linalg package to solve the set of linear equations Sa = s.

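The listing itself is not reproduced here; the following is a minimal sketch of such a parabola fit, applied to the Table 7.1 data. Variable names and the output format are illustrative, not the book's exact Fit.py.

```python
# Sketch of a least-squares parabola fit: build and solve S a = s, (7.61)-(7.62), with NumPy.
import numpy as np

x   = np.array([0., 25., 50., 75., 100., 125., 150., 175., 200.])
y   = np.array([10.6, 16.0, 45.0, 83.5, 52.8, 19.9, 10.8, 8.25, 4.7])
sig = np.array([9.34, 17.9, 41.5, 85.5, 51.5, 21.5, 10.8, 6.29, 4.14])

w = 1.0 / sig**2
# S[m][n] = sum of w * x^(m+n) gives the matrix of (7.62); s[m] = sum of w * y * x^m is its RHS.
S = np.array([[(w * x**(m + n)).sum() for n in range(3)] for m in range(3)])
s = np.array([(w * y * x**m).sum() for m in range(3)])

a = np.linalg.solve(S, s)                                   # parameters a1, a2, a3
chi2 = (((y - (a[0] + a[1] * x + a[2] * x**2)) / sig)**2).sum()
print("a =", a, " chi2 =", chi2)
```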

Linear Quadratic Fit Assessment

  1. Fit the quadratic (7.57) to the following data sets [given as (x1, y1), (x2, y2), …]. In each case, indicate the values found for the a’s, the number of degrees of freedom, and the value of χ².
    1. a) (0,1)
    2. b) (0,1), (1,3)
    3. c) (0,1), (1,3), (2, 7)
    4. d) (0,1), (1,3), (2, 7), (3,15)
  2. Find a fit to the last set of data to the function
    (7.63)  g(x) = b1 e^(b2 x).
    Hint: A judicious change of variables will permit you to convert this to a linear fit. Does a minimum χ2 still have meaning here?

7.8.2 Problem 5: Nonlinear Fit to a Breit–Wigner

Problem Remember how earlier in this chapter we interpolated the values in Table 7.1 in order to obtain the experimental cross section as a function of energy. Although we did not use it, we also gave the theory describing these data, namely, the Breit–Wigner resonance formula (7.21):

(7.64)  f(E) = fr / [(E − Er)² + Γ²/4].

Your problem is to determine what values for the parameters (Er, fr, Γ) in (7.64) provide the best fit to the data in Table 7.1.

Because (7.64) is not a linear function of the parameters (Er, fr, Γ), the three equations that result from minimizing χ² are not linear equations and so cannot be solved by the techniques of linear algebra (matrix methods). However, in our study of the masses-on-a-string problem, we showed how to use the Newton–Raphson algorithm to search for solutions of simultaneous nonlinear equations. That technique involved expansion of the equations about the previous guess to obtain a set of linear equations and then solving the linear equations with the matrix libraries. We now use this same combination of fitting, trial-and-error searching, and matrix algebra to conduct a nonlinear least-squares fit of (7.64) to the data in Table 7.1.

Recall that the condition for a best fit is to find values of the MP parameters am in the theory g(x, am) that minimize χ² = Σ [(yi − g(xi))/σi]². This leads to the MP equations (7.44) to solve

(7.65)  Σ from i = 1 to ND of [(yi − g(xi)) / σi²] ∂g(xi)/∂am = 0,   m = 1, …, MP.

To find the form of these equations appropriate to our problem, we rewrite our theory function (7.64) in the notation of (7.65):

(7.66)  g(x) = g(x; {a1, a2, a3}) = a1 / [(x − a2)² + a3],
(7.67)  a1 = fr,   a2 = Er,   a3 = Γ²/4.

The three derivatives required in (7.65) are then

(7.68)  ∂g/∂a1 = 1/[(x − a2)² + a3],   ∂g/∂a2 = 2a1(x − a2)/[(x − a2)² + a3]²,   ∂g/∂a3 = −a1/[(x − a2)² + a3]².

Substitution of these derivatives into the best-fit condition (7.65) yields three simultaneous equations in a1, a2, and a3 that we need to solve in order to fit the ND = 9 data points (xi, yi) in Table 7.1:

(7.69)  Σ from i = 1 to 9 of [(yi − g(xi; a)) / σi²] ∂g(xi)/∂am = 0,   m = 1, 2, 3.
Even without the substitution of (7.64) for g(x, a), it is clear that these three equations depend on the a’s in a nonlinear fashion. That is okay because in Section 6.1.2 we derived the N-dimensional Newton–Raphson search for the roots of

(7.70)  fi(a1, a2, …, aN) = 0,   i = 1, …, N,

where we have made the change of variable yi → ai for the present problem. We use that same formalism here for the N = 3 equations (7.69) by writing them as

(7.71)  f1(a1, a2, a3) ≡ Σ [(yi − g(xi)) / σi²] ∂g(xi)/∂a1 = 0,
(7.72)  f2(a1, a2, a3) ≡ Σ [(yi − g(xi)) / σi²] ∂g(xi)/∂a2 = 0,
(7.73)  f3(a1, a2, a3) ≡ Σ [(yi − g(xi)) / σi²] ∂g(xi)/∂a3 = 0.

Because a1 = fr determines the height of the peak in the cross section, a2 = Er is the energy at which the peak occurs, and 2√a3 = Γ is the full width of the peak at half-maximum, good guesses for the a’s can be extracted from a graph of the data. To obtain the nine derivatives of the three f’s with respect to the three unknown a’s, we use two nested loops over i and j, along with the forward-difference approximation for the derivative

(7.74)  ∂fi/∂aj ≃ [fi(aj + Δaj) − fi(aj)] / Δaj,

where Δaj corresponds to a small, say ≤1%, change in the parameter value.

Nonlinear Fit Implementation Use the Newton–Raphson algorithm as outlined in Section 7.8.2 to conduct a nonlinear search for the best-fit parameters of the Breit–Wigner theory (7.64) to the data in Table 7.1. Compare the deduced values of (fr, Er, Γ) to those obtained by inspection of the graph. A sketch of such a search follows.
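The following is a minimal sketch of such a search, using the Table 7.1 data. The starting guesses (read off the graph), the 1% derivative steps, and the iteration limit are illustrative assumptions and may need tuning for the search to converge.

```python
# Sketch of a Newton-Raphson search for the Breit-Wigner parameters a = (a1, a2, a3) of (7.66),
# solving the equations (7.69) with a forward-difference Jacobian as in (7.74).
import numpy as np

x   = np.array([0., 25., 50., 75., 100., 125., 150., 175., 200.])
y   = np.array([10.6, 16.0, 45.0, 83.5, 52.8, 19.9, 10.8, 8.25, 4.7])
sig = np.array([9.34, 17.9, 41.5, 85.5, 51.5, 21.5, 10.8, 6.29, 4.14])

def g(xv, a):
    return a[0] / ((xv - a[1])**2 + a[2])

def dgda(xv, a):                      # analytic derivatives of (7.68)
    D = (xv - a[1])**2 + a[2]
    return np.array([1.0 / D, 2.0 * a[0] * (xv - a[1]) / D**2, -a[0] / D**2])

def F(a):                             # the three best-fit equations (7.69)
    resid = (y - g(x, a)) / sig**2
    return np.array([(resid * dgda(x, a)[m]).sum() for m in range(3)])

# Guesses from the graph: peak near 78 MeV, Gamma ~ 55 MeV, peak height ~ 85 mb (illustrative).
a = np.array([85.0 * (55.0 / 2)**2, 78.0, (55.0 / 2)**2])

for it in range(100):
    Fa = F(a)
    if np.max(np.abs(Fa)) < 1e-8:     # all three equations satisfied
        break
    J = np.empty((3, 3))
    for j in range(3):                # forward-difference Jacobian with ~1% parameter steps
        da = 0.01 * a[j]
        ap = a.copy()
        ap[j] += da
        J[:, j] = (F(ap) - Fa) / da
    a = a + np.linalg.solve(J, -Fa)   # Newton-Raphson update of the parameter vector

fr, Er, Gamma = a[0], a[1], 2.0 * np.sqrt(a[2])
chi2 = (((y - g(x, a)) / sig)**2).sum()
print("fr =", fr, " Er =", Er, " Gamma =", Gamma, " chi2 =", chi2)
```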