CHAPTER 19
MATHEMATICAL STATISTICS
19.1. Introduction to Statistical Methods
19.1-3. Relation between Probability Model and Reality: Estimation and Testing
19.2. Statistical Description. Definition and Computation of Random-sample Statistics
19.2-1. Statistical Relative Frequencies
(a) Definition and Basic Properties
19.2-2. The Distribution of the Sample. Grouped Data
(a) The Empirical Cumulative Distribution Function
(b) Class Intervals and Grouped Data
19.2-3. Sample Averages
(a) The Sample Average of x
(b) The Sample Average of y(x)
19.2-4. Sample Variances and Moments
(c) Measures of Skewness and Excess
19.2-5. Simplified Numerical Computation of Sample Averages and Variances. Corrections for Grouping
19.3. General-purpose Probability Distributions
19.3-2. Edgeworth-Kapteyn Representation of Theoretical Distributions
19.3-3. Gram-Charlier and Edgeworth Series Approximations
19.3-4. Truncated Normal Distributions and Pareto's Distribution
19.3-5. Pearson's General-purpose Distributions
19.4. Classical Parameter Estimation
19.4-1. Properties of Estimates
19.4-2. Some Properties of Statistics Used as Estimates
19.4-3. Derivation of Estimates: The Method of Moments
19.4-4. The Method of Maximum Likelihood
19.4-5. Other Methods of Estimation
19.5. Sampling Distributions
19.5-1. Introduction
19.5-2. Asymptotically Normal Sampling Distributions
19.5-3. Samples Drawn from Normal Populations. The χ2, t, and ν2 Distributions
19.5-4. The Distribution of the Sample Range
19.5-5. Sampling from Finite Populations
19.6. Classical Statistical Tests
19.6-1. Statistical Hypotheses
19.6-2. Fixed-sample Tests: Definitions
19.6-3. Level of Significance. Neyman-Pearson Criteria for Choosing Tests of Simple Hypotheses
19.6-6. Tests for Comparing Normal Populations. Analysis of Variance
(b) Comparison of Normal Populations
19.6-7. The χ2 Test for Goodness of Fit
19.6-8. Parameter-free Comparison of Two Populations: The Sign Test
19.7. Some Statistics, Sampling Distributions, and Tests for Multivariate Distributions
19.7-2. Statistics Derived from Multivariate Samples
19.7-3. Estimation of Parameters
19.7-4. Sampling Distributions for Normal Populations
(a) Distribution of the Sample Correlation Coefficient
(b) The r Distribution. Test for Uncorrelated Variables
(c) Test for Hypothetical Values of the Regression Coefficient
19.7-6. Spearman's Rank Correlation. A Nonparametric Test of Statistical Dependence
19.8. Random-process Statistics and Measurements
19.8-1. Simple Finite-time Averages
(b) Measurement of Mean Square
(c) Measurement of Correlation Functions
19.9. Testing and Estimation with Random Parameters
19.9-2. Bayes Estimation and Tests
19.9-3. Binary State and Decision Variables: Statistical Tests or Detection
19.9-4. Estimation (Signal Extraction, Regression)
19.10. Related Topics, References, and Bibliography
19.10-2. References and Bibliography
19.1. INTRODUCTION TO STATISTICAL METHODS
19.1-1. Statistics. In the most general sense of the word, statistics is the art of using quantitative empirical data to describe experience and to infer and test propositions (numerical estimates, hypothetical correspondences, predicted results, decisions). More specifically, statistics deals (1) with the statistical description of processes or experiments, and (2) with the induction and testing of corresponding mathematical models involving the probability concept. The relevant portions of probability theory constitute the field of mathematical statistics. These techniques extend the possibility of scientific prediction and rational decisions to many situations where deterministic prediction fails because essential parameters cannot be known or controlled with sufficient accuracy.
Statistical description and probability models apply to physical processes exhibiting the following empirical phenomenon. Even though individual measurements of a physical quantity x cannot be predicted with sufficient accuracy, a suitably determined function y = y(x1, x2, . . .) of a set (sample) of repeated measurements x1, x2, ... of x can often be predicted with substantially better accuracy, and the prediction of y may still yield useful decisions. Such a function y of a set of sample values is called a statistic, and the incidence of increased predictability is known as statistical regularity. Statistical regularity, in each individual situation, is an empirical physical law which, like the law of gravity or the induction law, is ultimately derived from experience and not from mathematics. Frequently a statistic can be predicted with increasing accuracy as the size n of the sample (x1, x2, ... , xn) increases (physical laws of large numbers). The best-known statistics are statistical relative frequencies (Sec. 19.2-1) and sample averages (Sec. 19.2-3).
19.1-2. The Classical Probability Model: Random-sample Statistics. Concept of a Population (Universe). (a) In an important class of applications, a continuously variable physical quantity (observable) x is regarded as a one-dimensional random variable with the inferred or estimated probability density φ(x). Each sample (x1, x2, . . . , xn) of measurements of x is postulated to be the result of n repeated independent measurements (Sec. 18.2-4). Hence x1, x2, . . . , xn are statistically independent random variables with identical probability density. A sample (x1, x2, . . . , xn) defined in this manner is called a random sample of size n and constitutes an n-dimensional random variable. The probability density in the n-dimensional sample space of “sample points” (x1, x2, . . . , xn) is the likelihood function

L(x1, x2, . . . , xn) = φ(x1)φ(x2) · · · φ(xn)    (19.1-1)
Every random-sample statistic defined as a measurable function y = y(x1, x2, . . . , xn) of the sample values is a random variable whose probability distribution (the sampling distribution of y) is uniquely determined by the likelihood function, and hence by the distribution of x. Each sampling distribution will, in general, depend on the sample size n.
While the assumptions made in this section do not apply to every physical situation, the model is capable of considerable generalization. The distribution of x need not be continuous (games, quality control). The sample may be of infinite size and may not even be a countable set; the xk may not have the same probability distribution and may not be statistically independent (random-process theory, Sec. 18.9-2). Finally, x, and thus each xk, can be replaced by a multidimensional variable (Secs. 19.7-1 to 19.7-6).
(b) As the size n of a random sample increases, many sample statistics converge in probability to corresponding parameters of the theoretical distribution of x; in particular, statistical relative frequencies converge in mean to the corresponding probabilities (Sec. 19.2-1). Thus one considers each sample drawn from an infinite (theoretical) population (universe, ensemble) whose sample distribution (Sec. 19.2-2) is identical with the theoretical probability distribution of x. The probability distribution is then referred to as the population distribution, and its parameters are population parameters. In many applications, the theoretical population is an idealization of an actual population from which samples are drawn.
19.1-3. Relation between Probability Model and Reality: Estimation and Testing (a) Estimation of Parameters. Statistical methods use empirical data (sample values) to infer specifications of a probability model, e.g., to estimate the probability density φ(x) of a random variable x. An important application of such inferred models is to make decisions based on inferred probabilities of future events. In most applications, statistical relative frequencies (Sec. 19.2-1) are used directly only for rough qualitative (graphical) estimates of the population distribution. Instead, one infers (postulates) the general form of the theoretical distribution, say
where η1, η2, . . . are unknown population parameters to be estimated on the basis of the given random sample (x1, x2, . . . , xn). Sections 19.3-1 to 19.3-5 list a number of “general-purpose” frequency functions (2) to be chosen in accordance with the physical background, the form of the sample distribution, and convenience in computations.
The parameters η1, η2, . . . usually measure specific properties of the theoretical distribution of x (e.g., population mean, population variance, skewness; see also Table 18.3-1). In general, one attempts to estimate values of the parameters η1, η2, . . . “fitting” a given sample (x1, x2, . . . , xn) by the empirical values of corresponding sample statistics y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . which measure analogous properties of the sample (e.g., sample average, sample variance, Secs. 19.2-3 to 19.2-6). “Fitting” is interpreted subjectively and not necessarily uniquely; in particular, one prefers estimates y(x1, x2, . . . , xn) which converge in probability to η as n → ∞ (consistent estimates), whose expected value equals η (unbiased estimates), whose sampling distribution has a small variance, and/or which are easy to compute (Secs. 19.4-1 to 19.4-5).
(b) Testing Statistical Hypotheses. Tests of a statistical hypothesis specifying some property of a theoretical distribution (say, an inferred set of parameter values η1, η2, . . .) are based on the likelihood (1) of a test sample (x1, x2, . . . , xn) when the hypothetical probability density (2) is used to compute L(x1, x2, . . . , xn). Generally speaking, the test will reject the hypothesis if the test sample (x1, x2, . . . , xn) falls into a region of small likelihood, or, equivalently, if the corresponding value of a test statistic y(x1, x2, . . . , xn) is improbable on the basis of the hypothetical likelihood function. The choice of specific conditions of rejection is again subjective and is ultimately based on the penalties of false rejection and/or acceptance and, to some extent, on the cost of obtaining test samples of various sizes (Secs. 19.6-1 to 19.6-9).
Nonparametric tests test hypothetical distribution properties other than parameter values (e.g., identity of two distributions, statistical independence of two random variables, Secs. 19.6-8 and 19.7-6) and are particularly convenient in suitable applications, since no specific form (2) of the population distribution need be inferred.
NOTE: Incorrect use of statistical methods can lead to grave errors and seriously wrong conclusions. All (possibly tacit) assumptions regarding a theoretical distribution must be checked. Never use the same sample for estimation and testing. Finally, remember that statistical tests cannot prove any hypothesis; they can only demonstrate a “lack of disproof.”
19.2. STATISTICAL DESCRIPTION. DEFINITION AND COMPUTATION OF RANDOM-SAMPLE STATISTICS
19.2-1. Statistical Relative Frequencies. (a) Definition and Basic Properties. Consider an event E which occurs if and only if a measurement of the random variable x yields a value in some measurable set SE (usually a class interval, Sec. 19.2-2b). Given a random sample (x1, x2, . . . , xn) of x, let nE denote the number of times a sample value xk implies the occurrence of the event E. The statistical relative frequency of the event E obtained from the given random sample is

h[E] = nE/n    (19.2-1)

where n is the size of the sample.
The definition of statistical relative frequencies implies the existence of an event algebra (Sec. 18.2-1) for the experiment or observation in question. The defining properties of mathematical probabilities (Sec. 18.2-2) are abstracted from corresponding properties of statistical relative frequencies. Thus the relative frequencies of mutually exclusive events add, the relative frequency of a certain event is 1, etc.
(b) Mean and Variance. Since the random sample may be regarded as a set of n Bernoulli trials (Sec. 18.7-3) which yield or do not yield E, the random variable nE has a binomial distribution (Table 18.8-3), where p = P[E] is the probability associated with the event E, and

E{h[E]} = p        Var{h[E]} = p(1 − p)/n    (19.2-2)
The statistical relative frequency h[E] is an unbiased, consistent estimate of the corresponding probability P[E]; as n → ∞, h[E] is asymptotically normal with the parameters (2) (Secs. 18.6-4 and 18.6-5a).
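A brief Python sketch (not part of the handbook; names are illustrative) of the estimate h[E] and its binomial sampling variability:

```python
import math
import random

def relative_frequency(sample, event):
    """Statistical relative frequency h[E] = n_E / n for a predicate `event`."""
    n_e = sum(1 for x in sample if event(x))
    return n_e / len(sample)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Event E: x falls in the class interval (-1, 1); P[E] ~ 0.6827 for N(0, 1).
h = relative_frequency(sample, lambda x: -1.0 < x < 1.0)
p = 0.6827
std_err = math.sqrt(p * (1.0 - p) / len(sample))   # sqrt of Var{h[E]} = p(1-p)/n

assert 0.66 < h < 0.70      # h[E] concentrates near P[E] as n grows
assert std_err < 0.005
```

The standard error p(1 − p)/n shrinking as 1/n is the quantitative content of the consistency statement above.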
19.2-2. The Distribution of the Sample. Grouped Data. (a) The Empirical Cumulative Distribution Function. For a given random sample (x1, x2, . . . , xn), the empirical cumulative distribution function

F(X) = h[x ≤ X] = (number of sample values xk ≤ X)/n

is a nondecreasing step function, with F(— ∞) = 0, F(∞) = 1. F(X) is an unbiased, consistent estimate of the cumulative distribution function Φ(X) = P[x ≤ X] (Sec. 18.2-9) and defines the distribution (frequency distribution) of the sample (empirical distribution based on the given sample).
(b) Class Intervals and Grouped Data. Let the range of the random variable x be partitioned into a finite or infinite number of conveniently chosen class intervals (cells) (j = 1, 2, . . .) respectively of length ΔX1, ΔX2, . . . and centered at x = X1 < X2 < . . . . For a given random sample, the class frequency (occupation number) nj is the number of times an xk falls into the jth class interval (description of the sample in terms of grouped data). The statistical relative frequencies hj = nj/n (relative frequencies of observations in the jth class interval) must add up to unity and are consistent, unbiased estimates of the corresponding probabilities
The cumulative frequencies Nj and the cumulative relative frequencies Fj are defined by

Nj = n1 + n2 + · · · + nj        Fj = Nj/n = h1 + h2 + · · · + hj
Sample statistics can be calculated from the statistical relative frequencies hj = nj/n just as corresponding population parameters are calculated from probabilities. The roundoff implicit in numerical computation of statistics groups data with a class-interval width equal to one least-significant digit. Grouping into much larger class intervals may be economically advantageous, since “quantization errors” due to grouping very often average out or are easily corrected (Sec. 19.2-5 and Ref. 19.25). The statistics F(X), nj, hj, Nj, and Fj also yield various graphical representations of sample distributions and hence of estimated population distributions (bar charts, histograms, frequency polygons, probability graph paper, etc.; see Refs. 19.1 and 19.8).
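The empirical distribution function and the class frequencies can be sketched directly (Python; function names are illustrative, not from the handbook):

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F(X) = (number of sample values <= X) / n as a step function."""
    xs = sorted(sample)
    n = len(xs)
    return lambda X: bisect_right(xs, X) / n

def class_frequencies(sample, centers, width):
    """Occupation numbers n_j for equal class intervals of length `width`
    centered at the given class-interval mid-points X_j."""
    counts = [0] * len(centers)
    for x in sample:
        for j, c in enumerate(centers):
            if c - width / 2 <= x < c + width / 2:
                counts[j] += 1
                break
    return counts

sample = [1.2, 0.4, 2.7, 1.9, 0.8, 1.5]
F = empirical_cdf(sample)
assert F(-10) == 0.0 and F(10) == 1.0   # F(-infinity) = 0, F(infinity) = 1
assert F(1.2) == 0.5                    # three of the six values are <= 1.2
nj = class_frequencies(sample, centers=[0.5, 1.5, 2.5], width=1.0)
assert nj == [2, 3, 1] and sum(nj) == len(sample)
```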
(c) Sample Fractiles (see also Table 18.3-1 and Sec. 19.5-2b). The sample P fractiles (sample quantiles) Xp are defined by
Equation (5) does not define Xp uniquely but brackets it by two adjacent sample values xk. X1/2 is the sample median, and X1/4, X1/2, X3/4 are sample quartiles, with analogous definitions for sample deciles and sample percentiles.
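Because the defining inequalities only bracket Xp, any implementation must pick a convention; the following sketch (illustrative, stdlib Python) returns one order-statistic choice:

```python
import math

def sample_fractile(xs, p):
    """A sample p fractile X_p: a value with F(X_p - 0) <= p <= F(X_p).
    The definition only brackets X_p between adjacent sample values;
    this sketch returns one conventional order-statistic choice."""
    xs = sorted(xs)
    k = max(math.ceil(p * len(xs)) - 1, 0)
    return xs[k]

xs = [7, 1, 5, 3, 9]
assert sample_fractile(xs, 0.5) == 5    # sample median
assert sample_fractile(xs, 0.25) == 3   # lower sample quartile (one convention)
```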
19.2-3. Sample Averages (see also Secs. 18.3-3, 19.2-5, and 19.5-3). (a) The Sample Average of x. Given a random sample (x1, x2, . . . , xn), the sample average of x is

x̄ = (x1 + x2 + · · · + xn)/n

In terms of the sample distribution over a set of class intervals centered at x = X1, X2, . . . , Xm (Sec. 19.2-2), x̄ is approximated by

x̄G = (n1X1 + n2X2 + · · · + nmXm)/n = h1X1 + h2X2 + · · · + hmXm

x̄ is a measure of location of the sample distribution. Note

E{x̄} = E{x} = ξ

whenever the quantity on the right exists. x̄ is an unbiased, consistent estimate of the population mean ξ = E{x}; if σ2 exists,

E{x̄} = ξ        Var{x̄} = σ2/n    (19.2-8)

x̄ is asymptotically normal with the parameters (8) as n → ∞ (Secs. 18.6-4 and 19.5-2).
(b) The Sample Average of y(x). The sample average of a function y(x) of the random variable x is

ȳ = [y(x1) + y(x2) + · · · + y(xn)]/n ≈ h1y(X1) + h2y(X2) + · · · + hmy(Xm)    (19.2-11)
Estimates based on Eq. (11) are sometimes improved by correction terms (Sec. 19.2-5).
19.2-4. Sample Variances and Moments (see also Secs. 18.3-3, 18.3-7, 19.2-5, 19.4-2, and 19.5-2). (a) The Sample Variances. The sample variances

s2 = (1/n)[(x1 − x̄)2 + · · · + (xn − x̄)2]        S2 = n s2/(n − 1)    (19.2-12)
are measures of dispersion of the sample distribution; s is called the sample standard deviation or sample dispersion. Note

E{s2} = (n − 1)σ2/n        E{S2} = σ2

whenever the quantity on the right exists. S2 is an unbiased, consistent estimate of the population variance σ2 = Var{x} and is thus often more useful than s2.
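The two divisor conventions can be checked with a small numerical sketch (Python; names are illustrative):

```python
def sample_variances(xs):
    """Return (s2, S2): the divide-by-n and divide-by-(n - 1) sample variances."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / n, ss / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2, S2 = sample_variances(xs)
assert s2 == 4.0                            # sum of squares 32 over n = 8
assert S2 == s2 * len(xs) / (len(xs) - 1)   # S2 = n s2 / (n - 1)
```

Only the divide-by-(n − 1) version is unbiased; both are consistent.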
(b) Sample Moments. The sample moments ar and sample central moments mr of order r are defined by

ar = (1/n)(x1^r + x2^r + · · · + xn^r)        mr = (1/n)[(x1 − x̄)^r + · · · + (xn − x̄)^r]    (19.2-15)

Note

E{ar} = αr        Var{ar} = (α2r − αr^2)/n    (19.2-16)

whenever the quantity on the right exists. ar is an unbiased, consistent estimate of the corresponding population moment αr = E{x^r}. If α2r exists, ar is asymptotically normal with the parameters (16) as n → ∞. mr is a consistent (but not unbiased) estimate of μr.
The random variables
are, respectively, consistent, unbiased estimates of μ3 and μ4;
(c) Measures of Skewness and Excess (see also Table 18.3-1 and Sec. 19.3-5).
The statistics

g1 = m3/s3        g2 = m4/s4 − 3

respectively measure the skewness and the excess (kurtosis, flatness) of the sample and are consistent estimates of the corresponding population parameters γ1 and γ2 (see also Sec. 19.4-2). Roughly speaking, g1 > 0 indicates a longer “tail” to the right. Some authors introduce g1^2 and g2 + 3 or (g2 + 3)/2 instead of g1 and g2. Several other measures of skewness have been used (Ref. 19.1).
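These two shape statistics can be computed directly from the central moments (Python sketch; names are illustrative):

```python
def skewness_excess(xs):
    """Sample skewness g1 = m3 / s^3 and excess (kurtosis) g2 = m4 / s^4 - 3,
    with central moments m_r = (1/n) * sum((x_k - mean)^r) and s^2 = m_2."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

g1, g2 = skewness_excess([-1.0, 0.0, 1.0])
assert abs(g1) < 1e-12          # symmetric sample: no skewness
assert abs(g2 + 1.5) < 1e-9     # flat three-point sample: negative excess
```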
19.2-5. Simplified Numerical Computation of Sample Averages and Variances. Corrections for Grouping (see Refs. 19.3 and 19.8 for calculation-sheet layouts). (a) For numerical computations, it is convenient to choose a computing origin X0 near the center of the sample distribution (“guessed mean”) and to compute
or, for grouped data,
(b) The sample variances s2 and S2 may be computed from
which is approximated for grouped data by
(c) Computations with grouped data are simplified if all class intervals are of equal length ΔX, and if one of the class-interval mid-points Xj = X0 is taken to be the computing origin, so that
where the Yj are “coded” class-interval centers which take integral values 0, ±1, ±2, .... One has then
One may check computations by introducing a different computing origin X0, say X0 = 0.
(d) Sheppard's Corrections for Grouping. Let all class intervals be of equal length ΔX. Then, if the characteristic function (Sec. 18.3-8) χx(q) and its derivatives are small for |q| ≥ 2π/ΔX, one may improve the grouped-data approximation sG2 to the true sample variance s2 by adding Sheppard's correction — (ΔX)2/12. Analogous corrections for the grouped-data sample moments
yield the improved estimates
More generally,
where the Bk are the Bernoulli numbers (Sec. 21.5-2). These corrections become exact if χx(q) = 0 for |q| ≥ 2π/ΔX − ε (ε > 0). For normal variables with ΔX ≤ 2σ, a1′ = a1 within 2.3 × 10−3 ΔX (ξ ≠ 0), and —(ΔX)2/12 is within 3.1 × 10−2 σ2 of the exact correction if ξ = 0.
NOTE: Sheppard's corrections often yield useful estimates of errors due to the use of rounded-off sample values in the exact formulas (12) and (15). Thus, if Sheppard's correction applies, a mean round-off error ΔX/2 in the xk affects s2 only as (ΔX)2/12.
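The effect of the correction can be seen in a small simulation (Python sketch under the stated assumptions; a normal sample is grouped into equal intervals and the grouped variance is corrected by −(ΔX)²/12):

```python
import random

def grouped_variance(values, width):
    """Group the sample into equal class intervals of length `width` (by
    rounding each value to the nearest interval center), then return the
    grouped variance and its Sheppard-corrected value."""
    grouped = [round(v / width) * width for v in values]
    n = len(grouped)
    mean = sum(grouped) / n
    s2_g = sum((x - mean) ** 2 for x in grouped) / n
    return s2_g, s2_g - width ** 2 / 12.0   # Sheppard's correction -(dX)^2/12

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(20_000)]
mean = sum(xs) / len(xs)
s2_exact = sum((x - mean) ** 2 for x in xs) / len(xs)
s2_g, s2_corr = grouped_variance(xs, width=0.5)

# Grouping inflates the variance by roughly (dX)^2/12 ~ 0.0208; the corrected
# value should land closer to the exact (ungrouped) sample variance.
assert abs(s2_corr - s2_exact) < abs(s2_g - s2_exact)
```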
19.2-6. The Sample Range (see also Sec. 19.5-4). The sample range w for a random sample (x1, x2, . . . , xn) is the difference of the largest and the smallest sample value xk. The sample range has physical significance (quality control) and serves as a rough but conveniently calculated estimate of population parameters for specific theoretical distributions (Sec. 19.5-4). The sample range and the smallest and largest sample values are examples of rank (order) statistics.
19.3. GENERAL-PURPOSE PROBABILITY DISTRIBUTIONS
19.3-1. Introduction. The probability distributions described in Secs. 18.8-1 to 18.8-9 and, in particular, the normal, binomial, hypergeometric, and Poisson distributions can often serve as theoretical population distributions in statistical models. The applicability of a particular type of probability distribution (with suitably fitted parameters) may be inferred either theoretically or from the graph of an empirical distribution.
Normal distributions are particularly convenient. Each normal distribution is completely defined by its first- and second-order moments; moreover, the use of normal populations facilitates the computation of exact sampling distributions for use in statistical tests (Secs. 19.6-4 and 19.6-6). The use of a normal distribution is frequently justified by the central limit theorem (Sec. 18.6-5); in particular, errors in measurements are often regarded as normally distributed sums of many independent “elementary errors.”
19.3-2. Edgeworth-Kapteyn Representation of Theoretical Distributions. It is often desirable to fit the empirical distribution of a random variable x with a probability distribution described by
where g(x) is a function of x selected so as to be normally distributed with parameters (μ, σg2). Once g(x) has been chosen (e.g., from theoretical considerations), only two parameters, μ and σg, remain to be estimated, and much of the theory of normal populations is applicable.
Any random variable x described by Eq. (1) may be regarded as the limit of a sequence of random variables y1 = y0 + z1h(y0), y2 = y1 + z2h(y1), . . . , where z1, z2, . . . are random variables satisfying the conditions of Sec. 18.6-5c, and

z1 + z2 + · · · + zn = Σ (yk − yk−1)/h(yk−1) ≈ ∫ dy/h(y) = g(x)    (taken from y0 to x)

The z1, z2, . . . may be considered as “reaction intensities” in a physical process which successively generates y1, y2, . . . .
EXAMPLES: (1) h(y) = constant yields a normal distribution for x. (2) h(y) = y — a results in a logarithmic normal distribution defined by
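Example (2) can be sketched numerically: if g(x) = ln (x − a) is normal, the normal-population theory applies to the transformed sample values g(xk). (Python; the parameter names are illustrative.)

```python
import math
import random

# Assumed setup: g(x) = ln(x - a) is normal with parameters (mu, sigma_g),
# so x has a logarithmic normal distribution; mu and sigma_g can then be
# estimated from the transformed sample values g(x_k).
random.seed(3)
a, mu, sigma_g = 0.0, 1.0, 0.5
xs = [a + math.exp(random.gauss(mu, sigma_g)) for _ in range(20_000)]

gs = [math.log(x - a) for x in xs]
mu_hat = sum(gs) / len(gs)
sigma_hat = math.sqrt(sum((g - mu_hat) ** 2 for g in gs) / len(gs))

assert abs(mu_hat - mu) < 0.02        # standard error ~ sigma_g/sqrt(n) ~ 0.0035
assert abs(sigma_hat - sigma_g) < 0.02
```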
19.3-3. Gram-Charlier and Edgeworth Series Approximations.
It is frequently convenient to approximate the frequency function of a standardized random variable (Sec. 18.5-5) in the form
where the parameters μ3, μ4, γ1, γ2 refer to the theoretical distribution of z (Sec. 18.3-7b and Table 18.3-1). An analogous expression for the distribution function Φz(x) is obtained by substitution of Φu(k)(x) for φu(k)(x) in Eq. (3). Note that Eq. (3) expresses φz(x) in terms of the widely tabulated derivatives of the normal frequency function φu(x) (Sec. 18.8-3), and that the parameters (coefficients) can be estimated as functions of moments (Sec. 19.4-3); but the computation of sampling distributions is not easy. See also Table 18.5-1.
For a rather restricted class of distributions, the approximation (3) comprises the first terms of the orthogonal expansion (Sec. 15.2-6)
Φz(x) = Φu(x) + Σ (k ≥ 3) ck φu(k−1)(x)    (19.3-4a)
φz(x) = Σ (k ≥ 0) ck φu(k)(x)    (19.3-4b)
ck = [(−1)^k/k!] ∫ Hk(ξ)φz(ξ) dξ    (integral over −∞ < ξ < ∞)    (19.3-4c)
(Gram-Charlier Type A series), where Hk(x) is the kth Hermite polynomial (Sec. 21.7-1).
The series (4a) converges to Φz(x) if the moments μ1, μ2, . . . exist and the series converges; the series (4b) will then converge to φz(x) at all points of continuity if φz(x) is of bounded variation (Sec. 4.4-8b) in (— ∞, ∞). For a much larger class of distributions, the approximation (3) can be based on Edgeworth's asymptotic series (Ref. 19.4).
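Assuming the standard first terms of the Type A series, φu(x)[1 + (γ1/6)H3(x) + (γ2/24)H4(x)], the approximation can be evaluated directly (Python sketch; the test case uses the known moments of a standardized sum of twelve uniform variables):

```python
import math

def gram_charlier_pdf(x, g1, g2):
    """First terms of the Gram-Charlier Type A series for a standardized
    variable: phi_u(x) * (1 + g1/6 * H3(x) + g2/24 * H4(x)), with the
    (probabilists') Hermite polynomials H3, H4."""
    phi_u = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    h3 = x ** 3 - 3.0 * x
    h4 = x ** 4 - 6.0 * x ** 2 + 3.0
    return phi_u * (1.0 + g1 / 6.0 * h3 + g2 / 24.0 * h4)

# z = standardized sum of 12 independent uniform variables:
# gamma1 = 0, gamma2 = (-6/5)/12 = -0.1 (cumulants of independent terms add).
approx = gram_charlier_pdf(0.0, 0.0, -0.1)
exact_normal = 1.0 / math.sqrt(2.0 * math.pi)
assert approx < exact_normal     # negative excess flattens the peak
```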
19.3-4. Truncated Normal Distributions and Pareto's Distribution. (a) Truncated Normal Distributions. If all events [x ≤ ξa] are removed from a normal population with mean μ and variance σ2 (Sec. 19.1-2), the remaining population has a one-sided truncated normal distribution such that φ(x) = Φ(x) = 0 for x ≤ ξa, and
where the degree of truncation is the fraction of the original population removed by the truncation. If ξa is known, one may use
to estimate μ and σ by the moment method (Sec. 19.4-3) with the aid of published tables (Ref. 19.15) expressing μ and σ in terms of α1 and α2.
(b) Pareto's distribution is defined by φ(x) = Φ(x) = 0 for x ≤ ξa and
19.3-5. Pearson's General-purpose Distributions. The frequency functions φ(x) of many continuous probability distributions can be described as solutions of the differential equation

dφ/dx = [(x − η1)/(η2 + η3x + η4x2)] φ    (19.3-9)

whose four parameters define each distribution completely. Each parameter ηk can be estimated as a function of the first four moments αr or μr (Sec. 19.4-3), but the computation of sampling distributions is, at best, difficult. The distributions defined by Eq. (9) can be classified according to the nature of the roots of η2 + η3x + η4x2 = 0 and include most of the continuous distributions described in Table 18.8-8, Sec. 18.8-3, and Sec. 19.5-3 as special cases (see also Ref. 19.7).
For Pearson's distributions, Pearson's measure of skewness (Table 18.3-1) is

(ξ − ξmode)/σ = γ1(γ2 + 6)/[2(5γ2 − 6γ1^2 + 6)]

so that for small γ1, γ2 one has ξmode ≈ ξ − γ1σ/2.
19.4. CLASSICAL PARAMETER ESTIMATION
19.4-1. Properties of Estimates (see also Sec. 19.1-3). (a) A sample statistic y(x1, x2, . . . , xn) is a consistent estimate of the theoretical population parameter η if and only if y converges to η in probability as the sample size n increases, i.e., if and only if the probability of realizing any finite deviation |y — η| converges to zero as n → ∞ (Sec. 18.6-1).
(b) The bias of an estimate y for the parameter η is the difference b(η) ≡ E{y} — η. y is an unbiased estimate of η if and only if E{y} = η for all values of η.
(c) Asymptotically Efficient Estimates and Efficient Estimates. It is desirable to employ estimates whose sampling distributions cluster as densely as possible about the desired parameter value, i.e., estimates with small variance Var {y}, or small standard deviation (standard error of the estimate) √Var {y}. For the important special class of estimates y(x1, x2, . . . , xn) whose sampling distributions are asymptotically normal with mean η and variance constant/n = λ/n (see also Sec. 19.5-2), λ has the lower bound

λmin = 1/E{[∂ ln φ(x; η)/∂η]2}    (19.4-1)
The asymptotic efficiency e∞{y} = λmin/λ of such an estimate y measures the concentration of the asymptotic sampling distribution about the parameter η; y is an asymptotically efficient estimate of the parameter η if and only if λ = λmin. Every asymptotically efficient estimate is consistent.
More generally, the “relative efficiency” of the estimate y(x1, x2, . . . , xn) of η for a given sample size n is measured by the reciprocal of the mean-square deviation E{(y — η)2}. Under quite general conditions (Ref. 19.4, Chap. 32), the mean-square deviations E{(y — η)2} of the various possible estimates y of a given parameter η have a lower bound given by

E{(y — η)2} ≥ b2(η) + [1 + b′(η)]2 λmin/n    (19.4-2)
(Cramér-Rao Inequality). For unbiased estimates y, Eq. (2) reduces to

Var {y} ≥ λmin/n    (19.4-3)
The efficiency e{y} = λmin/[n Var {y}] of an unbiased estimate satisfying Eq. (3) measures the concentration of the sampling distribution about η; its limit as n → ∞ is again called the asymptotic efficiency, if this quantity exists. A (necessarily unbiased and consistent) estimate y is an efficient estimate of η if and only if Var {y} exists and equals the lower bound λmin/n.
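For a normal population the bound is attained by the sample average, which a short simulation confirms (Python sketch; parameter values are illustrative):

```python
import random
import statistics

# For a normal population, the Fisher information per observation for the
# mean xi is 1/sigma^2, so the Cramér-Rao bound for unbiased estimates of xi
# is lambda_min/n = sigma^2/n. The sample average attains it (efficiency 1).
random.seed(2)
sigma, n, trials = 2.0, 25, 4000
means = [sum(random.gauss(10.0, sigma) for _ in range(n)) / n
         for _ in range(trials)]
bound = sigma ** 2 / n                       # = 0.16
observed = statistics.pvariance(means)
assert abs(observed - bound) < 0.3 * bound   # Var{x_bar} sits at the bound
```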
(d) Sufficient Estimates. An estimate y(x1, x2, . . . , xn) of the parameter η is a sufficient estimate if and only if the likelihood function for one sample (Sec. 19.1-2) can be written in the form

L(x1, x2, . . . , xn; η) = L1[y(x1, x2, . . . , xn); η]L2(x1, x2, . . . , xn)    (19.4-4)

where L2 is functionally independent (Sec. 4.5-6) of η. In this case, the conditional probability distribution of (x1, x2, . . . , xn) on the hypothesis [y = Y] is independent of η, so that a sufficient estimate y of η embodies all the information about η which the given sample can supply. Efficient estimates are necessarily sufficient.
(e) Generalizations. Equations (1) to (3) apply to discrete probability distributions if the probability p(x1, x2, . . . , xn) is substituted for the probability density φ(x1, x2, . . . , xn). The theory of Secs. 19.4-1a to d applies without change to populations described by multidimensional random variables.
A set of m suitable unbiased estimates y1, y2, . . . , ym are joint efficient estimates of m corresponding population parameters η1, η2, . . . , ηm if the concentration ellipsoid (Sec. 18.4-8c) of the joint sampling distribution coincides with a “maximum concentration ellipsoid” analogous to the minimum variance defined by Eq. (3). Joint asymptotically efficient estimates are similarly defined in terms of the asymptotic sampling distribution. The reciprocal of the generalized variance (Sec. 18.4-8c) associated with the joint sampling distribution of y1, y2, . . . , ym is a measure of the relative joint efficiency of the set of estimates.
To define a set of m joint sufficient estimates y1, y2, . . . , ym, it is only necessary to replace the random variable y in Eq. (4) by the m-dimensional variable (y1, y2, . . . , ym).
19.4-2. Some Properties of Statistics Used as Estimates (see also Sec. 19.2-4). (a) Functions of Moments. Every statistic expressible as a power of a rational function of the sample moments ar is a consistent estimate of the same function of the corresponding population moments αr, provided that the αr's in question exist and yield a finite function value (see also Sec. 18.3-7). Multiplication of a biased consistent estimate by a suitable function of n will often yield an unbiased consistent estimate.
An entirely analogous theorem applies to functions of the sample moments ar1r2... of multivariate samples. In particular, g1, g2, lik, rik, and r12·34 . . . m (Sec. 19.7-2) are consistent estimates of the corresponding population parameters γ1, γ2, λik, ρik, and ρ12·34 . . . m.
(b) For samples taken from a normal population:
1. x̄ is an efficient estimate of ξ.
2. x̄ and s2 are joint asymptotically efficient estimates of ξ and σ2, but s2 is biased; x̄ and S2 are joint sufficient and asymptotically efficient estimates of ξ and σ2. S2 has the efficiency (n − 1)/n.
3. If ξ is known, (1/n)[(x1 − ξ)2 + (x2 − ξ)2 + · · · + (xn − ξ)2] is an efficient estimate of σ2.
4. The sample median x1/2 has the asymptotic efficiency 2/π ≈ 0.64.
For samples taken from a binomial distribution (Table 18.8-3), x̄ is an efficient estimate of ξ.
For samples taken from a two-dimensional normal distribution (Sec. 18.8-6) with known center of gravity, the sample central moments l11, l12, and l22 (Sec. 19.7-2) are joint asymptotically efficient estimates of λ11, λ12, and λ22.
19.4-3. Derivation of Estimates: The Method of Moments. If the population distribution is described by a given function Φ(x; η1, η2, . . .), φ(x; η1, η2, . . .) or p(x; η1, η2, . . .) where η1, η2, . . . are parameters to be determined, each population characteristic like E{x}, Var {x}, αr, etc., is a function of the parameters η1, η2, ... . In particular
if these quantities exist. The method of moments defines (joint) estimates y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . , ym(x1, x2, . . . , xn) of m corresponding population parameters η1, η2, . . . , ηm by the m equations

ar = αr(y1, y2, . . . , ym)        (r = 1, 2, . . . , m)    (19.4-5)

obtained on equating the first m sample moments ar to the corresponding population moments αr. The resulting estimates yk are necessarily functions of the sample moments (see also Sec. 19.4-2a).
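A minimal worked instance of the method (Python sketch; the uniform-distribution example is illustrative, not from the handbook):

```python
import math

def moments_fit_uniform(xs):
    """Method-of-moments estimates (a, b) for a uniform distribution on [a, b]:
    equate the sample mean and central moment m2 to the population values
    (a + b)/2 and (b - a)^2/12."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    half_width = math.sqrt(3.0 * m2)        # (b - a)/2 = sqrt(3 m2)
    return mean - half_width, mean + half_width

a, b = moments_fit_uniform([0.1, 0.5, 0.9])
assert abs((a + b) / 2 - 0.5) < 1e-12   # fitted mean reproduces the sample mean
assert a < 0.1 and b > 0.9              # fitted interval covers the sample here
```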
19.4-4. The Method of Maximum Likelihood. For any given sample (x1, x2, . . . , xn), the value of the likelihood function L(x1, x2, . . . , xn) (Sec. 19.1-2) is a function of the unknown parameters η1, η2, . . . . The method of maximum likelihood estimates each parameter ηk by a corresponding trial function yk(x1, x2, . . . , xn) chosen so that L(x1, x2, . . . , xn; y1, y2, . . .) is as large as possible for each sample (x1, x2, . . . , xn). One attempts to obtain a set of m (joint) maximum-likelihood estimates y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . , ym(x1, x2, . . . , xn) as nontrivial solutions of the m likelihood equations

∂ ln L/∂yk = 0        (k = 1, 2, . . . , m)    (19.4-6)

which constitute necessary conditions for a maximum of the likelihood function if the latter is suitably differentiable (Sec. 11.3-3).
Although the maximum-likelihood method often involves more complicated computations than the moment method (Sec. 19.4-3), maximum-likelihood estimates may be preferable, particularly in the case of small samples, because
1. If an efficient estimate (or a set of joint efficient estimates) exists, it will appear as a unique solution of the likelihood equation or equations (6).
2. If a sufficient estimate (or a set of sufficient estimates) exists, every solution of the likelihood equation or equations (6) will be a function of this estimate or estimates.
In addition, under quite general conditions (Ref. 19.4, Chap. 33), the likelihood equations (6) have a solution yielding consistent, asymptotically normal, and asymptotically efficient estimates.
EXAMPLE: If x is normally distributed, the maximum-likelihood estimate X = x̄ for ξ is an efficient estimate and minimizes (x1 − X)2 + (x2 − X)2 + · · · + (xn − X)2 (method of least squares in the theory of errors).
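The least-squares property of the normal maximum-likelihood estimate can be checked directly (Python sketch; names are illustrative):

```python
def neg_log_likelihood(xi, xs, sigma=1.0):
    """Negative log-likelihood of a normal sample as a function of the mean xi,
    up to an additive constant: sum((x_k - xi)^2) / (2 sigma^2)."""
    return sum((x - xi) ** 2 for x in xs) / (2.0 * sigma ** 2)

xs = [1.0, 2.0, 6.0]
x_bar = sum(xs) / len(xs)   # = 3.0

# The sample average minimizes the sum of squares, hence maximizes likelihood.
assert neg_log_likelihood(x_bar, xs) < neg_log_likelihood(x_bar + 0.1, xs)
assert neg_log_likelihood(x_bar, xs) < neg_log_likelihood(x_bar - 0.1, xs)
```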
Note that the maximum-likelihood method applies also to multidimensional populations and applies even in the case of nonrandom samples.
19.4-5. Other Methods of Estimation. A number of methods usually employed to test the goodness of fit of an estimate or estimates can be modified to infer parameter values (Sec. 19.6-7; see also Secs. 19.9-1 to 19.9-5).
19.5. SAMPLING DISTRIBUTIONS
19.5-1. Introduction (see also Secs. 19.1-2 and 19.1-3). This section lists properties of a class of statistics frequently used as consistent estimates of corresponding population parameters. Section 19.5-2 deals with the approximate computation of sampling distributions for large samples, while Secs. 19.5-3 and 19.5-4b are concerned with the distributions of statistics derived from normal populations.
19.5-2. Asymptotically Normal Sampling Distributions (see also Sec. 18.6-4). The following theorems are derived from the limit theorems of Sec. 18.6-5 and permit one to approximate the sampling distributions of many statistics by normal distributions if the sample size is sufficiently large.
(a) Let y(x1, x2, . . . , xn) be any statistic expressible as a function of the sample moments mk such that y = f(m1, m2, . . .) exists and is twice continuously differentiable throughout a neighborhood of m1 = μ1, m2 = μ2, . . . . Then the sampling distribution of y is asymptotically normal as n → ∞, with mean f(μ1, μ2, . . .) and a variance of order 1/n obtained from the first-order expansion of f about (μ1, μ2, . . .). The theorem applies, in particular, to the sample mean x̄, to the sample variances s2 and S2, and to the sample moments ar and mr (Sec. 19.2-4). An analogous theorem applies to multidimensional populations.
(b) The distribution of each sample fractile Xp is asymptotically normal with mean xp and variance p(1 − p)/{n[φ(xp)]2}, provided that (1) the population fractile xp is unique, and (2) the probability density φ(x) = Φ′(x) exists and is continuous throughout a neighborhood of x = xp. This theorem applies, in particular, to the sample median X1/2. Under analogous conditions, the joint distribution of any set of sample fractiles (and hence, for example, the distribution of the sample interquartile range) is also asymptotically normal.
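For the sample median of a standardized normal population (p = ½, density 1/√(2π) at the median) the asymptotic variance is π/(2n); a sketch with a seeded Monte Carlo check (sample size and replication count are illustrative):

```python
import math
import random
import statistics

def asymptotic_median_variance(n, density_at_median):
    """Asymptotic variance p(1 - p)/(n * phi(x_p)**2) with p = 1/2."""
    return 0.25 / (n * density_at_median**2)

# Standard normal population: median 0, density phi(0) = 1/sqrt(2*pi),
# so the asymptotic variance of the sample median is pi/(2n).
n = 25
phi0 = 1 / math.sqrt(2 * math.pi)
theory = asymptotic_median_variance(n, phi0)

# Monte Carlo check (seeded for reproducibility)
random.seed(1)
medians = [statistics.median(random.gauss(0, 1) for _ in range(n))
           for _ in range(4000)]
empirical = statistics.variance(medians)
```

The empirical variance agrees with π/(2n) up to Monte Carlo noise and the finite-sample correction, which is small already for n = 25.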
19.5-3. Samples Drawn from Normal Populations. The χ2, t, and ν2 Distributions. (a) In the case of samples drawn from a normal population (normal samples), all sample values are normal variables, and many sampling distributions can be calculated explicitly with the aid of Secs. 18.5-7 and 18.8-9. The assumption of a normal population can often be justified by the central-limit theorem (Sec. 18.6-5, e.g., errors in measurements), or the method of Sec. 19.3-2 can be used.
For any sample of size n drawn from a normal population with mean ξ and variance σ2:
4. (x̄ − ξ)√n/σ has a standardized normal distribution.
5. (x̄ − ξ)√n/S has a t distribution with n − 1 degrees of freedom.
Table 19.5-1. The χ2 Distribution with m Degrees of Freedom
(Fig. 19.5-1; see also Secs. 18.8-7, 19.5-3, and 19.6-7)
(c) Typical Interpretation. Given m statistically independent standardized normal variables x1, x2, . . . , xm, the sum
y = x1² + x2² + · · · + xm²
has a χ2 distribution with m degrees of freedom.
(d) The fractiles yP of y will be denoted by χ2P or χ2P(m); published tables frequently show χ21−α(m) as a function of the level of significance α.
(e) Approximations. As m → ∞,
y is asymptotically normal with mean m and variance 2m;
y/m is asymptotically normal with mean 1 and variance 2/m;
√(2y) is asymptotically normal with mean √(2m − 1) and variance 1.
A useful approximation based on the last item listed above is
χ2P(m) ≈ ½(zP + √(2m − 1))²
where zP is the corresponding fractile of the standardized normal distribution. This approximation is worst if P is either small or large. A better approximation is given by the Wilson-Hilferty formula
χ2P(m) ≈ m[1 − 2/(9m) + zP√(2/(9m))]³
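Both approximations can be checked against a tabulated value; the sketch below assumes the "better approximation" is the Wilson-Hilferty cube formula and compares both with the table value χ2 at P = 0.95, m = 10, which is approximately 18.307:

```python
import math
from statistics import NormalDist

def chi2_fractile_simple(p, m):
    """chi2_P(m) ~ (z_P + sqrt(2m - 1))**2 / 2."""
    z = NormalDist().inv_cdf(p)
    return 0.5 * (z + math.sqrt(2 * m - 1))**2

def chi2_fractile_wh(p, m):
    """Wilson-Hilferty: chi2_P(m) ~ m*(1 - 2/(9m) + z_P*sqrt(2/(9m)))**3."""
    z = NormalDist().inv_cdf(p)
    return m * (1 - 2 / (9 * m) + z * math.sqrt(2 / (9 * m)))**3

table_value = 18.307      # chi2_0.95 with 10 degrees of freedom (from tables)
```

For P = 0.95, m = 10 the simple formula is off by about 0.3, the Wilson-Hilferty formula by less than 0.02.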
6. has an r distribution with m = n — 2 (Sec. 19.7-4).
7. Σk(xk − ξ)²/σ² has a χ2 distribution with n degrees of freedom.
Table 19.5-2. Student's t Distribution with m Degrees of Freedom
(Fig. 19.5-2; see also Secs. 19.5-3 and 19.6-6)
(c) Typical Interpretation. y is distributed like the ratio
t = x0/√[(x1² + x2² + · · · + xm²)/m]
where x0, x1, x2, . . . , xm are m + 1 statistically independent normal variables, each having the mean 0 and the variance σ². Note that t is independent of σ².
(d) The fractiles yP of y will be denoted by tP; note tP = −t1−P. The distribution of |y| = |t| is related to the t distribution by
P[|t| ≤ T] = 2P[y ≤ T] − 1    (T ≥ 0)
The fractiles |t|1−α (α values of t) are often tabulated for use in statistical tests; note that some published tables denote |t|1−α by tα.
(e) Approximations. As m → ∞, y is asymptotically normal with mean 0 and variance 1, so that (Sec. 18.8-4b) the fractiles tP may be replaced by the corresponding standardized normal fractiles for m > 30.
Table 19.5-3. The Variance-ratio Distribution (ν2 Distribution, Snedecor F Distribution, ω2 Distribution) and Related Distributions
(c) Typical Interpretation. y is distributed like the ratio v2 of two random variables having χ2 distributions with m and m′ degrees of freedom (Table 19.5-1), each divided by its number of degrees of freedom, or
v² = [(x1² + x2² + · · · + xm²)/m]/[(x1′² + x2′² + · · · + xm′²)/m′]
where the xi and xk′ are m + m′ statistically independent normal variables each having the mean 0 and the variance σ². Note that v2 is independent of σ².
(d) Change of Variables. The distribution of the variance ratio v2 is also described by other random variables, viz.
(e) Published tables usually present the fractiles v21−α(m, m′) or z1−α(m, m′) as functions of α for various values of m, m′.
Table 19.6-1 lists additional formulas. Tables 19.5-1 to 19.5-3 detail the properties of the χ2, t, and ν2 distributions; fractiles of these distributions are available in tabular form.
(b) The sample mean x̄ and the sample variance s2 are statistically independent if and only if the sample in question is drawn from a normal population.
For every normal sample, x̄, s2, and mrm2−r/2 are statistically independent for all r, and g1 = m3m2−3/2, g2 = m4m2−2 − 3 yield measures of the sample skewness and excess (Sec. 19.2-4).
19.5-4. The Distribution of the Sample Range (see also Secs. 19.2-6 and 19.7-6). (a) For every continuous distribution, the frequency function of the range w for a random sample of size n is
This function has been tabulated for a number of population distributions (Ref. 19.8).
(b) For normal populations, both the mean and the dispersion of w are multiples of the population dispersion σ:
kn, cn and cn/kn have been tabulated as functions of n (Ref. 19.3); w/kn is an unbiased estimate of σ. The average range
obtained from a set of m random samples of size n is asymptotically normal with mean knσ and variance cn2σ2/m as m → ∞; w̄/kn is an unbiased, consistent estimate of σ.
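A sketch of the range-based estimate w̄/kn on simulated data; the kn values below are the standard control-chart constants (usually tabulated as d2), and the sample sizes are illustrative:

```python
import random
import statistics

# k_n = E{w}/sigma for normal samples of size n (standard control-chart d2 values)
K_N = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}

def sigma_from_ranges(samples, n):
    """Unbiased range-based estimate of sigma: (average range) / k_n."""
    wbar = statistics.mean(max(s) - min(s) for s in samples)
    return wbar / K_N[n]

random.seed(2)
n, m, sigma = 5, 2000, 3.0
samples = [[random.gauss(10, sigma) for _ in range(n)] for _ in range(m)]
est = sigma_from_ranges(samples, n)    # should be close to sigma = 3.0
```

With m = 2000 samples the estimate lies within a few per cent of the true σ, in line with the stated asymptotic variance cn²σ²/m of the average range.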
(c) For a uniform (rectangular) population distribution in the interval (a,b)
and w(n + 1)/(n − 1) is a consistent, unbiased estimate of b − a. Note also that the arithmetic mean of the smallest and the largest sample value is a consistent, unbiased estimate of E{x} (Ref. 19.4).
(d) For any continuous population distribution, the probability that at least a fraction q of the population lies between the extreme values xmin, xmax of a given random sample of size n is
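A sketch assuming the standard distribution-free form of this probability, P = 1 − nq^(n−1) + (n − 1)q^n:

```python
def coverage_probability(n, q):
    """P{at least a fraction q of the population lies between x_min and x_max}
    for a random sample of size n from any continuous distribution
    (assumed standard form: 1 - n*q**(n-1) + (n-1)*q**n)."""
    return 1 - n * q**(n - 1) + (n - 1) * q**n

def required_sample_size(q, p):
    """Smallest n whose sample extremes cover a fraction q of the population
    with probability at least p."""
    n = 2
    while coverage_probability(n, q) < p:
        n += 1
    return n
```

For example, required_sample_size(0.95, 0.95) gives 93: the familiar distribution-free "95/95" sample size.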
19.5-5. Sampling from Finite Populations. Given a finite population of N elements (events, results of observations) E1, E2, . . . , EN, each labeled with one of the M ≤ N spectral values X(1), X(2), . . . , X(M) of a (necessarily discrete) random variable x, let p(X(i)) = Ni/N, where Ni is the number of elements Ek labeled by X(i). For a random sample of size n ≤ N (without replacement), n/N is called the sampling ratio, and one has
These formulas reduce to those given in Secs. 19.2-3 and 19.2-4 for N → ∞.
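For a small population these relations can be verified by enumerating every possible sample; the sketch below assumes the standard finite-population correction Var{x̄} = (σ²/n)(N − n)/(N − 1) and checks it exactly on a toy population:

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5]
N, n = len(population), 2
mean = Fraction(sum(population), N)
var = sum((x - mean)**2 for x in population) / N     # population variance sigma^2

# Enumerate every equally likely sample of size n drawn without replacement.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
m = sum(sample_means) / len(sample_means)
v = sum((x - m)**2 for x in sample_means) / len(sample_means)

# Finite-population correction (assumed standard form)
v_theory = (var / n) * Fraction(N - n, N - 1)
```

Exact rational arithmetic shows E{x̄} equals the population mean and Var{x̄} equals (σ²/n)(N − n)/(N − 1), here 3/4.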
19.6. CLASSICAL STATISTICAL TESTS
19.6-1. Statistical Hypotheses. Consider a “space” of samples (sample points) (x1, x2, . . . , xn), where x1, x2, . . . , xn are numerical random variables. Every self-consistent set of assumptions involving the joint distribution of x1, x2, . . . , xn is a statistical hypothesis. A hypothesis H is a simple statistical hypothesis if it defines the probability distribution uniquely; otherwise it is a composite statistical hypothesis.
More specifically, let the joint distribution of x1, x2, . . . , xn be defined by Φ(x1, x2, . . . , xn; η1, η2, . . . ), φ(x1, x2, . . . , xn; η1, η2, . . .),or p(x1, x2, . . . xn; η1, η2,. . .), where η1, η2, . . . are parameters (see also Sec. 19.1-3). Then a simple statistical hypothesis assigns definite values η10, η20, ... to the respective parameters η1, η2, . . . (“point” in parameter space), whereas a composite statistical hypothesis confines the “points” (η1, η2, . . .) to a set or region in parameter space. The class of admissible statistical hypotheses (admissible parameter combinations) is restricted by the context of the problem in question.
If (x1, x2, . . . , xn) is a random sample drawn from a single theoretical population (i.e., if the xk are statistically independent with identical marginal distributions), statistical hypotheses refer to values of, or relations between, population parameters. Note, however, that the theory of Secs. 19.6-1 to 19.6-2 is not limited to random samples (x1, x2, . . . , xn) but applies to samples from any random process.
19.6-2. Fixed-sample Tests: Definitions. Given a fixed sample size n, a test of the statistical hypothesis H is a rule rejecting or accepting the hypothesis H on the basis of a test sample (X1, X2, . . . , Xn). Each test specifies a critical set (critical region, rejection region) Sc of “points” (x1, x2, . . . , xn) such that H will be rejected if the test sample (X1, X2, . . . , Xn) belongs to the critical set; otherwise H is accepted.
Such rejection or acceptance does not constitute a logical disproof or proof, even if the sample is infinitely large. Four possible events arise:
1. H is true and is accepted by the test.
2. H is false and is rejected by the test.
3. H is true but is rejected by the test (error of the first kind).
4. H is false but is accepted by the test (error of the second kind).
For any set of true (actual) parameter values η1,η2, . . . , the probability that a critical region Sc will reject the hypothesis tested is
Whenever the hypothesis H1 ≡ [η1 = η11, η2 = η21, . . .] contradicts the hypothesis tested, the rejection probability πSc(η11, η21, . . .) is called the power (power function) of the test defined by Sc with respect to the alternative (simple) hypothesis H1 (see also Fig. 19.6-1a). A graph of the correct-acceptance probability 1 − β against the false-rejection probability α is called the operating characteristic of the test (Fig. 19.6-1b; see also Sec. 19.6-3).
19.6-3. Level of Significance. Neyman-Pearson Criteria for Choosing Tests of Simple Hypotheses. (a) It is, generally speaking, desirable to use a critical region Sc such that πSc(η1, η2, . . .) is small for parameter combinations η1, η2, . . . admitted by the hypothesis tested, and as large as possible for other parameter combinations. Given a critical region Sc used to test the simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] (“null hypothesis”), let H0 be true. Then the probability of falsely rejecting H0 (error of the first kind) is πSc(η10, η20, . . .) = α. α is called the level of significance of the test: the critical region Sc tests the simple hypothesis H0 at the level of significance α.
In the case of discrete random variables x1, x2, . . . , xn, one cannot, in general, specify α at will, but an upper bound for α may be given. A critical region used to test a composite hypothesis H ≡ [(η1, η2, . . .) ∈ D] will, in general, yield different levels of significance for different simple hypotheses (parameter combinations η1, η2, . . .) admitted by H; one may specify the least upper bound of these levels of significance.
(b) For each given sample size n and level of significance α
1. A most powerful test of the simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] relative to the simple alternative H1 ≡ [η1 = η11, η2 = η21, . . .] is defined by the critical region Sc which yields the largest value of 1 − β = πSc(η11, η21, . . .).
2. A uniformly most powerful test is most powerful relative to every admissible alternative hypothesis H1; such a test does not always exist.
3. A test is unbiased if πSc(η11, η21, . . .) ≥ α for every alternative simple hypothesis H1; otherwise the test is biased. A most powerful unbiased test relative to a given alternative H1 and a uniformly most powerful unbiased test may be defined as above.
To construct the critical region Sc for a most powerful test, use all sample points (x1, x2, . . . , xn) such that the likelihood ratio φ(x1, x2, . . . , xn; η10, η20, . . .)/φ(x1, x2, . . . , xn; η11, η21, . . .) or p(x1, x2, . . . , xn; η10, η20, . . .)/p(x1, x2, . . . , xn; η11, η21, . . .) is less than some fixed constant c; different values of c will yield “best” critical regions at different levels of significance α. Uniformly most powerful tests are of particular interest if one desires to test H0 against a composite alternative hypothesis. A uniformly most powerful unbiased test may exist even though no uniformly most powerful test exists (see also Ref. 19.4). In practice, ease of computation may be the deciding factor in a choice among several possible tests; one can usually increase the power of each test by increasing the sample size n (see also Sec. 19.6-9).
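For a normal population with known σ, the likelihood ratio is a monotone function of the sample mean, so the Neyman-Pearson critical region reduces to a one-sided region on x̄. A sketch with illustrative data and parameter values:

```python
import math

def likelihood_ratio(sample, xi0, xi1, sigma=1.0):
    """phi(sample; xi0) / phi(sample; xi1) for a normal sample with known sigma.
    For xi1 > xi0 this ratio is a decreasing function of the sample sum,
    so 'ratio < c' is equivalent to 'sample mean > some constant'."""
    def log_phi(xi):
        return -sum((x - xi)**2 for x in sample) / (2 * sigma**2)
    return math.exp(log_phi(xi0) - log_phi(xi1))

# Shifting the whole sample upward lowers the likelihood ratio monotonically,
# illustrating that the best critical region is one-sided in the sample mean.
base = [0.1, -0.3, 0.4, 0.0]
ratios = [likelihood_ratio([x + shift for x in base], 0.0, 1.0)
          for shift in (0.0, 0.5, 1.0, 1.5)]
```

The monotone relation between the likelihood ratio and the sample mean is why the threshold c translates into a fractile of the sampling distribution of x̄.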
FIG. 19.6-1a. Test of the null hypothesis H0 against a simple alternative hypothesis H1 in terms of a test statistic y = y(x1, x2, . . . , xn).
FIG. 19.6-1b. Operating characteristic of a test (see also Fig. 19.6-1a).
Table 19.6-1. Some Tests of Significance Relating to the Parameters ξ, σ2 of a Normal Population (see also Secs. 19.5-3a and 19.6-4). Tests are based on random samples. Obtain fractiles from published tables; the approximations given in Tables 19.5-1 and 19.5-2 apply if the sample size n is large*
* Note that published tables often tabulate χ21−α rather than χ2α; check carefully on the notation used in each case.
19.6-4. Tests of Significance. Many applications require one to test a hypothetical population property specified in terms of a set of parameter values η1= η10, η2= η20, . . . against a corresponding sample property described by the respective estimates y1(x1, x2 . . . ,xn), y2(x1, x2, . . . , xn), . . . of η1, η2, . . . . One attempts to construct a “test statistic”
whose values measure a deviation or ratio comparing the sample property to the hypothetical population property for each test sample (x1, x2, . . . , xn). The simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] is then rejected at a given level of significance α (the “deviation” is significant) whenever the sample value of y falls outside an acceptance interval yP1 ≤ y ≤ yP2 such that
Tests defined in this manner are often called tests of significance. Equation (3) specifies yp1 = yp1(η10, η20, . . .) and yP2 = yP2(η10, η20, . . .) as fractiles of the sampling distribution of y(x1, x2, . . . , xn;η10, η20, . . .). It is frequently possible to choose a test statistic y such that its fractiles yp are independent of η10, η20, . . . . Table 19.6-1 and Secs. 19.6-6 and 19.6-7 list a number of important examples (see also Fig. 19.6-1).
In quality-control applications, the acceptance limits yP1 and yP2 defined by Eq. (3) are called tolerance limits at the tolerance level α, and the interval [yP1, yp2] is called a tolerance interval (see also Fig. 19.6-1).
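A minimal sketch of such a test of significance for the mean of a normal population with known σ (illustrative data; the acceptance limits are the normal fractiles at α/2 and 1 − α/2):

```python
import math
from statistics import NormalDist

def z_test(sample, xi0, sigma, alpha=0.05):
    """Two-sided test of H0: xi = xi0 for a normal population with known sigma.
    The statistic (xbar - xi0)*sqrt(n)/sigma is standardized normal under H0;
    H0 is rejected when it falls outside the acceptance interval."""
    n = len(sample)
    xbar = sum(sample) / n
    y = (xbar - xi0) * math.sqrt(n) / sigma
    z = NormalDist().inv_cdf(1 - alpha / 2)     # acceptance limit
    return abs(y) > z                           # True -> deviation is significant

sample = [10.8, 11.2, 10.9, 11.4, 11.1, 10.7, 11.3, 11.0]   # illustrative data
```

Here the fractiles of the test statistic are independent of the hypothetical parameter values, as the text notes is often achievable.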
19.6-5. Confidence Regions. (a) Assume that one has constructed a family of critical regions Sα(η1, η2, . . .) capable of testing a corresponding set of simple hypotheses (admissible parameter combinations) (η1, η2, . . .) at some given level of significance α.† Then for any fixed test sample (x1 = X1, x2 = X2, . . . , xn = Xn), the set Dα(X1, X2, . . . ,
† In the case of discrete distributions, the critical regions Sα(η1, η2, . . .) will, in general, be defined for a level of significance less than or equal to α.
Xn) of parameter combinations (η1, η2 . . .) (“points” in parameter space) accepted on the basis of the given sample is a confidence region at the confidence level α; 1 — α is called the confidence coefficient. The confidence region comprises all admissible parameter combinations whose acceptance on the basis of the given sample is associated with a probability P[(X1, X2, . . . , Xn) not in Sα(η1, η2 . . .)] at least equal to the confidence coefficient 1 — α.
The method of confidence regions relates a given sample to a corresponding set of “probable” parameter combinations without the necessity of assigning “inverse probabilities” to essentially nonrandom parameter combinations (see also Sec. 19.6-9).
(b) Confidence Intervals Based on Tests of Significance (see also Sec. 19.6-4). To find confidence regions relating values of one of the unknown parameters, say η1, to the given sample value Y = y(X1, X2,. . . , Xn) of a suitable test statistic y, refer to Fig. 19.6-1. Plot lower
and upper acceptance limits (tolerance limits) yP1(η1) and yP2(η1) against η1 for a given level of significance α.* The intersections of these acceptance-limit curves with each line y = Y define upper and lower confidence limits (fiducial limits) η1 = γ2(Y), η1 = γ1(Y) bounding a confidence interval Dα ≡ [γ1, γ2] at the confidence level α. The confidence interval comprises those values of η1 whose acceptance on the basis of the sample value y = Y is associated with a probability P[yP1(η1) ≤ Y ≤ yP2(η1)] at least equal to the confidence coefficient 1 − α.
Table 19.6-2. Confidence Intervals for Normal Populations
Use the approximations given in Tables 19.5-1 to 19.5-3 for large n
19.6-6. Tests for Comparing Normal Populations. Analysis of Variance. (a) Pooled-sample Statistics. Consider r statistically independent random samples (xi1, xi2, . . . , xini) drawn, respectively, from r theoretical populations with means ξi and variances σi2 (i = 1, 2, . . . , r). The ith sample is of size ni and has the mean x̄i and variance si2
The r samples may be regarded as a “pooled sample” whose size, mean, and variance are given by
The statistics S02 (pooled variance) and SA2 are measures of dispersion within samples and between samples, respectively.
(b) Comparison of Normal Populations (see also Tables 19.5-1 to 19.5-3). If the r independent samples are drawn from normal populations with means ξi and identical variances σi2 = σ2, then S2, S02, and SA2 are all consistent and unbiased estimates (Sec. 19.4-1) of the (usually unknown) population variance σ2. (n — r)S02/σ2 and (r — 1)SA2/σ2 are statistically independent random variables respectively distributed like χ2 with n — r and r — 1 degrees of freedom. Note the sampling distributions of the following test statistics:
Table 19.6-3 shows the use of these test statistics in tests of significance (Sec. 19.6-4) comparing normal populations. It is also possible to calculate confidence limits (Sec. 19.6-5) for the difference ξi — ξk from the t distribution. The case of normal populations with different variances is discussed in Ref. 19.8.
(c) Analysis of Variance. The third test of Table 19.6-3 compares the mean values ξi by partitioning the over-all variance S2 into components S02 and SA2 respectively due to statistical fluctuation within samples and to differences between samples. This technique is known as analysis of variance; the particular case in question involves a one-way classification of samples corresponding to values of the index i. Many similar tests are used to analyze the effects of different medications, soil treatments, etc. (Refs. 19.3, 19.4, and 19.8).
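The one-way partition just described can be sketched in a few lines of Python (illustrative data; S0² and SA² follow the definitions of Sec. 19.6-6a, and their ratio is referred to the ν2 distribution with m = r − 1, m′ = n − r):

```python
def one_way_anova(groups):
    """Partition pooled-sample dispersion into a within-sample component (S0^2)
    and a between-sample component (SA^2); return (S0^2, SA^2, their ratio)."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_within = sum(sum((x - mi)**2 for x in g)
                    for g, mi in zip(groups, means))
    ss_between = sum(len(g) * (mi - grand)**2
                     for g, mi in zip(groups, means))
    s0_sq = ss_within / (n - r)        # pooled variance, n - r d.f.
    sa_sq = ss_between / (r - 1)       # between-sample variance, r - 1 d.f.
    return s0_sq, sa_sq, sa_sq / s0_sq # ratio ~ v^2(r - 1, n - r) under H0

groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]   # illustrative data
```

For the data shown, S0² = 1 and SA² = 21; a variance ratio this large signals a significant difference between the group means.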
Analysis of Variance for Two-way Classification (Randomized Blocks). Consider rq sample values xik arranged in an array
and introduce row averages x̄i., column averages x̄.k, and the over-all average x̄ by
x̄i. = (1/q)Σk xik    x̄.k = (1/r)Σi xik    x̄ = (1/rq)ΣiΣk xik
The over-all variance is partitioned as follows:
If the random variables xik are normal with identical variances, S02, S2Row, and S2Col are statistically independent; (r − 1)(q − 1)S02/σ2, (r − 1)S2Row/σ2, and (q − 1)S2Col/σ2 are respectively distributed like χ2 with (r − 1)(q − 1), r − 1, and q − 1 degrees of freedom. The test statistic S2Row/S02 is, then, distributed like ν2 with m = r − 1, m′ = (r − 1)(q − 1) and serves to test the equality of the row means x̄i. in the manner of Table 19.6-3. Similarly, S2Col/S02 is distributed like ν2 with m = q − 1, m′ = (r − 1)(q − 1) and serves to test the equality of the column means x̄.k.
Table 19.6-3. Significance Tests for Comparing Normal Populations
(refer to Sec. 19.6-6; see also Tables 19.5-2 and 19.5-3)
19.6-7. The χ2 Test for Goodness of Fit (see also Table 19.5-1). (a) The χ2 test checks the “fit” of the hypothetical probabilities pk = p[Ek] associated with r simple events E1, E2, . . . , Er to their relative frequencies hk = h[Ek] = nk/n in a sample of n independent observations. In many applications, each Ek is the event that some random variable x falls within a class interval (Sec. 19.2-2), so that the test compares the hypothetical theoretical distribution of x with its empirical distribution.
The goodness of fit is measured by the test statistic
y converges in probability to χ2 with m = r − 1 degrees of freedom as n → ∞. If all npk > 10 (pool some class intervals if necessary), the resulting test rejects the hypothetical probabilities p1, p2, . . . , pr at the level of significance α whenever the test sample yields y > χ21−α(m); for m > 30 one may replace the χ2 distribution by a normal distribution with the indicated mean and variance.
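A sketch of the test for illustrative die-throw counts; 11.07 is the commonly tabulated fractile χ2 at P = 0.95 with 5 degrees of freedom:

```python
def chi2_statistic(counts, probs):
    """Goodness-of-fit statistic y = sum (n_k - n p_k)^2 / (n p_k)."""
    n = sum(counts)
    return sum((nk - n * pk)**2 / (n * pk)
               for nk, pk in zip(counts, probs))

# 60 throws of a die tested against the hypothesis p_k = 1/6 (n p_k = 10):
counts = [5, 8, 9, 8, 10, 20]       # illustrative observed frequencies
probs = [1 / 6] * 6
y = chi2_statistic(counts, probs)
# Table value chi2_0.95(5) = 11.07; here y exceeds it, so the hypothetical
# probabilities are rejected at alpha = 0.05.
```

Here m = r − 1 = 5 because no parameters were estimated from the sample; with q estimated parameters m would drop to r − q − 1, as described in (b) below.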
(b) The χ2 Test with Estimated Parameters. If the hypothetical probabilities pk depend on a set of q unknown population parameters η1, η2, . . . , ηq, obtain their joint maximum-likelihood estimates from the given sample (Sec. 19.4-4) and insert the resulting values of pk = pk(η1, η2, . . . , ηq) into Eq. (9). The test statistic y will then converge in probability to χ2 with m = r — q — 1 degrees of freedom under very general conditions (see below), and the χ2 test applies with m = r — q — 1. Tests of this type check the applicability of a normal distribution, Poisson distribution, etc., with unspecified parameters.
The theorem applies whenever the given functions pk(η1, η2, . . . , ηq) satisfy the following conditions throughout a neighborhood of the joint maximum-likelihood “point” (η1, η2 . . . , ηq):
The pk(η1, η2, . . . , ηq) have a common positive lower bound, are twice continuously differentiable, and the matrix [∂pk/∂ηi] is of rank q.
19.6-8. Parameter-free Comparison of Two Populations: The Sign Test. One desires to test the hypothesis that two random variables x and y have identical probability distributions, or
assuming that it is known that P[x = y] = 0. Consider a random sample of n pairs (x1, y1; x2, y2; . . . ; xn, yn); neglect any pairs with xi = yi (ties) in computing the sample size n. The probability that more than m differences xi − yi are positive is
Now let mα be the smallest value of m such that p(m) ≤ α and
Reject the hypothesis (1) at the level of significance ≤ α whenever the number of positive differences xi − yi exceeds mα (One-tailed Sign Test), or
Reject the hypothesis at the level of significance 2α if the number of positive or negative differences exceeds mα (Two-tailed Sign Test).
mα has been tabulated against α and n (Ref. 19.15). The sign test can also be used to test (1) the symmetry of a probability distribution; (2) the hypothesis that x = X is the median of a distribution (Ref. 19.15).
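The one-tailed critical value mα can be computed directly from the binomial tail with p = ½ (a sketch; math.comb requires Python 3.8+):

```python
from math import comb

def p_more_than(m, n):
    """p(m) = P{more than m of the n differences are positive}
    when P[x > y] = P[x < y] = 1/2."""
    return sum(comb(n, k) for k in range(m + 1, n + 1)) / 2**n

def m_alpha(n, alpha):
    """Smallest m with p(m) <= alpha (one-tailed critical value)."""
    m = 0
    while p_more_than(m, n) > alpha:
        m += 1
    return m
```

For example, with n = 10 and α = 0.05 the critical value is mα = 8: the hypothesis is rejected one-tailed only if 9 or more of the 10 differences are positive.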
19.6-9. Generalizations (Ref. 19.17). The fixed-sample tests of Secs. 19.6-1 to 19.6-7 admit only two alternatives, viz., acceptance or rejection of a given hypothesis on the basis of a test sample. Sequential tests permit an increase in the sample size (additional observations) as a third possible decision; it is then possible to specify both the level of significance and the power of the test (relative to some alternative hypothesis) when the test hypothesis will finally be accepted or rejected.
Schemes of fixed-sample and sequential tests are special examples of statistical decision functions or rules of behavior associating a decision (set of parameters η1, η2, . . . , “point” in parameter space) with each given sample (x1, x2, . . .) of some observed quantities. In practice (operations research, detection theory), the decision function is designed so as to maximize the expected value of some measure of effectiveness (payoff function) involving the gains due to correct decisions, the losses due to incorrect decisions, and the cost of testing samples of various sizes.
19.7. SOME STATISTICS, SAMPLING DISTRIBUTIONS, AND TESTS FOR MULTIVARIATE DISTRIBUTIONS
19.7-1. Introduction. Sections 19.7-2 to 19.7-7 are in no sense an exhaustive survey of multivariate statistics but present a number of frequently used definitions, formulas, and tests for convenient reference. Note that multivariate statistics often serve to estimate and test stochastic relationships between two or more random variables.
19.7-2. Statistics Derived from Multivariate Samples. (a) Given a multidimensional random variable x ≡ (x1, x2, . . . , xν) (Sec. 18.2-9), one proceeds by analogy with Secs. 19.1-2 and 19.2-2 to 19.2-4 to introduce a random sample of size n, (x1, x2, . . . , xn) ≡ (x11, x21, . . . , xν1; x12, x22, . . . , xν2; . . . ; x1n, x2n, . . . , xνn) and the statistics
The “point” (x̄1, x̄2, . . . , x̄ν) is the sample center of gravity, and the matrix L ≡ [lij] is the sample moment matrix; det [lij] is the generalized variance of the sample.
(b) Given a two-dimensional random sample (x11, x21; x12, x22; . . . ; x1n, x2n) of a bivariate random variable (x1, x2), one defines the sample regression coefficients
The empirical linear mean-square regression of x2 on x1
is that linear function ax1 + b which minimizes the sample mean-square deviation
(see also Secs. 18.4-6b and 19.7-4c).
For ν-dimensional populations, empirical multiple and partial correlation coefficients and regression coefficients are derived from the sample moments lij by analogy with Eqs. (18.4-35) to (18.4-38) and serve as estimates of the corresponding population parameters. See Refs. 19.4 and 19.8 for the complete theory.
(c) The statistics (1) to (4) can be approximated by grouped-data estimates in the manner of Secs. 19.2-3 to 19.2-5. In particular, Sheppard's corrections for grouped-data estimates likG of the second-order central moments λik are given by
where the ΔXi are the constant class-interval lengths. These formulas often render errors due to grouping negligible if ΔXi is less than one-eighth the range of xi. Practical computation schemes are given in Refs. 19.3 and 19.8.
19.7-3. Estimation of Parameters (see also Sec. 19.4-1). Since the theorem of Sec. 19.4-2a applies to multivariate distributions, the statistics (3) to (5), as well as the sample averages x̄i, are consistent estimates of the corresponding population parameters. The mean value and variance of each sample average x̄i and sample variance lii = si2 are given by Eqs. (19.2-8), (19.2-13), and (19.2-14); in addition, note
Var {rij} is of the order of 1/n as n increases (Ref. 19.4).
For multidimensional normal distributions (see also Sec. 19.7-4) only,
19.7-4. Sampling Distributions for Normal Populations (see also Secs. 18.8-6 and 18.8-8). For random samples drawn from multivariate normal populations, sampling distributions and tests involving only the sample averages x̄i and the sample variances lii = si2 are obtained simply from Sec. 19.5-3. It remains to investigate statistics which describe stochastic relationships between different random variables xi, in particular sample correlation and regression coefficients.
(a) Distribution of the Sample Correlation Coefficient. For simplicity, consider a random sample (x11, x21; x12, x22; ... ; x1n, x2n) drawn from a two-dimensional normal population described by
(Sec. 18.8-6). The probability density of the sample correlation coefficient r12 = r (Sec. 19.7-2) is
Note that φr(n)(r) is independent of ξ1, ξ2, σ1, σ2. For n = 2, one has φr(2)(r) = 0 (−1 < r < 1), since r is either 1 or −1.
It is useful to introduce the new random variable
y = ½ ln[(1 + r)/(1 − r)] = artanh r
which for n ≥ 10 may be regarded as approximately normal with the approximate mean ½ ln[(1 + ρ)/(1 − ρ)] + ρ/[2(n − 1)] and variance 1/(n − 3).
Figure 19.7-1 illustrates the behavior of the statistics r and y for different values of ρ and n.
If y and y' are values of the statistic (17) calculated from two independent random samples of respective sizes n, n' from the same normal population, then y — y' is approximately normal with mean 0 and variance l/(n — 3) + l/(n' — 3).
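A sketch of the resulting approximate confidence interval for ρ, assuming the transformation introduced above is Fisher's y = artanh r = ½ ln[(1 + r)/(1 − r)] with large-sample variance 1/(n − 3); the sample values are illustrative:

```python
import math
from statistics import NormalDist

def rho_confidence_interval(r, n, alpha=0.05):
    """Approximate confidence interval for rho from the sample correlation r:
    treat y = artanh(r) as normal with variance 1/(n - 3), then map the
    interval for E{y} back through tanh."""
    y = math.atanh(r)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z / math.sqrt(n - 3)
    return math.tanh(y - half), math.tanh(y + half)

lo, hi = rho_confidence_interval(0.6, 28)   # illustrative r and n
```

The interval is asymmetric about r, reflecting the skewness of the distribution of r itself; the small bias term ρ/[2(n − 1)] is neglected in this sketch.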
(b) The r Distribution. Test for Uncorrelated Variables. In the important special case ρ = 0 (test for uncorrelated variables!), the frequency function (15) reduces to
In this case, r√(n − 2)/√(1 − r²) has a t distribution with n − 2 degrees of freedom (Table 19.5-2). The statistic
is said to have an r distribution with n — 2 degrees of freedom. The r distribution has been tabulated (Ref. 19.3) and is asymptotically normal with mean 0 and variance 1 as n → ∞. Either the t distribution or the r distribution yields tests for the hypothesis ρ = 0.
FIG. 19.7-1. Probability densities of the statistics r and y used to estimate the correlation coefficient ρ of a multidimensional normal distribution. (From Burington and May, Handbook of Probability and Statistics, McGraw-Hill, New York, 1953.)
(c) Test for Hypothetical Values of the Regression Coefficient (see also Sec. 19.7-2). Given a random sample drawn from a bivariate normal population described by Eq. (14), one tests hypothetical values of the population regression coefficient β21 = λ12/λ11 = ρ12σ2/σ1 (Sec. 18.4-6b) by means of the test statistic
which is distributed like t with n − 2 degrees of freedom. This test is far more convenient than one using the sampling distribution of the sample regression coefficient b21 (Ref. 19.8).
(d) ν-dimensional Samples. For random samples drawn from a ν-dimensional normal population described by the probability density (18.8-26), the joint distribution of the sample averages x̄1, x̄2, . . . , x̄ν is normal with mean values ξ1, ξ2, . . . , ξν and moment matrix Λ/n. The joint distribution of the x̄i is statistically independent of the joint distribution of the ν(ν + 1)/2 sample moments lij (Generalized Fisher's Theorem; see also Sec. 19.5-3b).
19.7-5. The Sample Mean-square Contingency. Contingency-table Test for Statistical Independence of Two Random Variables (see also Sec. 19.6-7). (a) Given a two-dimensional random sample (x1, y1; x2, y2; ... ; xn, yn) of a pair of random variables x, y, a contingency table arranges the n sample-value pairs (xk, yk) in a matrix of s x class intervals and r y class intervals. Let there be
ni. pairs (xk, yk) in the ith x class interval
n.j pairs (xk, yk) in the jth y class interval
nij pairs (xk, yk) both in the ith x class interval and in the jth y class interval
The statistic
measures the “degree of association” (statistical dependence) between x and y. f2 ranges between 0 and min (r, s) − 1 and reaches the latter value if and only if each row (r ≥ s) or each column (r ≤ s) contains only one element different from zero.
(b) If x and y are statistically independent, then the test statistic nf2 converges in probability to χ2 with m = (r − 1)(s − 1) degrees of freedom as n → ∞ (Table 19.5-1). If all nij > 10 (pool some class intervals if necessary), the hypothesis of statistical independence is rejected at the level of significance α by the critical region nf2 > χ21−α(m) (Sec. 19.6-3).
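The statistic nf² is computed from observed and expected cell counts; a minimal sketch with an illustrative two-by-two table:

```python
def contingency_chi2(table):
    """n*f^2 = sum over cells of (n_ij - n_i. n_.j / n)^2 / (n_i. n_.j / n),
    referred to chi2 with (r - 1)(s - 1) degrees of freedom."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            e = row_sums[i] * col_sums[j] / n   # expected count under independence
            stat += (nij - e)**2 / e
    return stat

table = [[10, 20], [20, 10]]   # illustrative two-by-two contingency table
```

For this table nf² = 20/3 ≈ 6.67 with one degree of freedom, so the hypothesis of independence would be rejected at the usual levels.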
(c) The special case r = s = 2 (success or failure, two-by-two contingency table) is of special interest; in this case (see also Ref. 19.4).
19.7-6. Spearman's Rank Correlation. A Nonparametric Test of Statistical Dependence. Suppose that a random sample of n observation pairs (x1, y1; x2, y2; . . . ; xn, yn) yields only the information that xk is the Akth largest value of x in the sample, and yk is the Bkth largest value of y (k = 1, 2, . . . , n). If x and y are statistically independent, the test statistic
is asymptotically normal with mean 0 and variance 1/(n − 1) as n → ∞. For any value of n > 1, the hypothesis of statistical independence is rejected at the level of significance ≤ α (Sec. 19.6-3) if
NOTE: If x and y have a normal joint distribution, then 2 sin (πR/6) is a consistent estimate of their correlation coefficient ρxy.
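A sketch assuming the usual rank-difference form of Spearman's statistic, R = 1 − 6 Σk Dk²/[n(n² − 1)] with Dk = Ak − Bk and no ties:

```python
def spearman_R(x, y):
    """Spearman rank correlation R = 1 - 6*sum(d_k^2)/(n*(n^2 - 1)),
    where d_k is the difference between the ranks of x_k and y_k
    (assumes no tied values)."""
    n = len(x)
    rank = lambda v: [sorted(v).index(t) + 1 for t in v]
    a, b = rank(x), rank(y)
    d2 = sum((ai - bi)**2 for ai, bi in zip(a, b))
    return 1 - 6 * d2 / (n * (n**2 - 1))

x = [3.1, 1.2, 5.4, 2.2, 4.0]      # illustrative data
y = [30, 11, 52, 25, 41]           # same ordering as x, so R = 1
```

R reaches +1 for identical rankings and -1 for exactly reversed rankings; under independence, R/√(1/(n − 1)) is referred to the standardized normal distribution for large n.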
19.8. RANDOM-PROCESS STATISTICS AND MEASUREMENTS
19.8-1. Simple Finite-time Averages. Let
be a given measurable function of the sample values x(ti), y(ti), . . . generated by the one-dimensional or multidimensional random process x(t), y(t), . . . (Sec. 18.9-1). The finite-time averages
can be obtained, respectively, through sampled-data and continuous averaging of finite real data. [f]n and (f)T are random variables whose distributions are determined by the given random process. If x(t) represents a stationary (but not necessarily ergodic) random process, then
The finite-time averages [f]n and (f)T are, then, unbiased estimates of the unknown expected value E{f} and will be useful for estimating E{f} if their random fluctuations about their expected value (4) are reasonably small. More specifically, the mean-square error associated with estimation of E{f} by a measured value of [f]n is
If one introduces k Δt = λ and lets Δt → 0, n = T/Δt → ∞, then the sampled-data average (2) converges to the continuous average (f)T if the latter exists. A similar limiting process applied to Eq. (3) yields
Depending on the nature of the autocovariance function
the mean-square error (4) or (5) may or may not decrease to an acceptably small value with increasing sample size n or integration time T.
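The dependence of the estimate variance on the autocovariance can be made concrete. The sketch below (names and the exponential autocovariance are illustrative assumptions) evaluates the standard expression for the variance of the sampled-data average of a stationary sequence, Var{[f]n} = (1/n) Σ|k|<n (1 − |k|/n) K(k), which reduces to K(0)/n for uncorrelated samples:

```python
def mean_estimate_variance(K, n):
    """Variance of the sampled-data average [f]_n = (1/n)*sum f(t_k)
    for a stationary sequence with autocovariance K(k):
        Var{[f]_n} = (1/n) * sum_{|k|<n} (1 - |k|/n) * K(k).
    Whether this decreases acceptably with n depends on how fast
    K(k) decays."""
    return sum((1 - abs(k) / n) * K(k) for k in range(-n + 1, n)) / n

# Illustrative exponentially decaying autocovariance K(k) = rho^|k|:
rho = 0.9
K = lambda k: rho ** abs(k)
v10 = mean_estimate_variance(K, 10)
v1000 = mean_estimate_variance(K, 1000)
```

Strongly correlated samples (rho near 1) require a much larger n than uncorrelated ones for the same estimate variance.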
19.8-2. Averaging Filters. Time averaging is often accomplished by low-pass filters implemented as electrical networks, by electromechanical measuring devices with inertia and damping, or by digital sampled-data operations. Consider, in particular, a general time-invariant averaging filter with stationary input f(t) ≡ f(t1 + t, t2 + t, . . . , tn + t), bounded weighting function h(ζ), and frequency-response function H(iω) (Sec. 9.4-7). If the filter input is applied between t = 0 and t = T (averaging time), the filter output is
so that z(T)/a(T) is an unbiased estimate of E{f}.
The estimate variance is given by
As T → ∞, Var {z(T)} will not in general go to zero but, rather, approaches the stationary output variance
where φ(ω) is the power spectral density (Sec. 18.10-3) of f(t) − E{f}, and BEQ is the bandwidth of an equivalent “rectangular” low-pass filter having the frequency response
BEQ is a useful measure of the variance-reducing properties of a given averaging filter. Table 19.8-1 lists H(iw) and BEQ for some practical filters (flat-spectrum input).
Table 19.8-1. Averaging Filters (Ref. 19.25)
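BEQ can also be estimated numerically for a filter given only as a frequency response. The sketch below (function names are ours) uses one common convention for the equivalent noise bandwidth, BEQ = (1/|H(0)|²) ∫0^∞ |H(i2πf)|² df in cps, and checks it against the first-order low-pass filter, for which BEQ = 1/(4T0) analytically:

```python
import math

def b_eq(H_mag2, f_max=1.0e4, steps=100000):
    """Numerically estimate the equivalent 'rectangular' bandwidth
        B_EQ = (1/|H(0)|^2) * integral_0^inf |H(i*2*pi*f)|^2 df
    by trapezoidal integration up to f_max (assumed large enough
    that the truncated tail is negligible)."""
    df = f_max / steps
    total = 0.0
    for i in range(steps + 1):
        w = H_mag2(i * df)
        total += w * (0.5 if i in (0, steps) else 1.0)
    return total * df / H_mag2(0.0)

# First-order low-pass filter: |H(i*2*pi*f)|^2 = 1/(1 + (2*pi*f*T0)^2);
# analytically B_EQ = 1/(4*T0).
T0 = 1.0e-2
H2 = lambda f: 1.0 / (1.0 + (2 * math.pi * f * T0) ** 2)
```

For T0 = 0.01 sec the numerical result is close to the analytic value 25 cps.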
19.8-3. Examples (Ref. 19.25). (a) Measurement of Mean Value. It is desired to measure the mean value E{x} = ξ of a stationary random voltage x(t) with
(white noise passed through a simple filter with 3-db bandwidth α/2π cps, or random telegraph wave with counting rate α/2, Sec. 18.11-5). In this case,
and for the first-order averaging filter of Table 19.8-1 with T >> T0 (T > 4T0 for most practical purposes),
(b) Measurement of Mean Square. It is desired to measure E{f} = E{x2} for Gaussian noise x(t) satisfying Eq. (14). In this case,
and for the first-order averaging filter with T >> T0 and f ≡ x2,
(c) Measurement of Correlation Functions (see also Sec. 18.9-3b). The variances of correlation-function estimates for jointly stationary x(t), y(t) are given by Eqs. (5), (6), and (11) with f(t) ≡ x(t)y(t + τ). Unfortunately, each variance depends on
and this fourth-order moment of the joint distribution of x(t) and y(t) is hardly ever known. In the special case of jointly Gaussian and stationary signals x(t), y(t) with zero mean values,
but even this involves the unknown correlation function Rxy(τ) itself and hence yields useful information only in simple special cases.
For stationary Gaussian x(t),
To illustrate the dependence of the autocorrelation-function estimate
on signal bandwidth and delay, consider again Gaussian noise x(t) satisfying Eq. (14). In this case,
For τ = 0, this agrees with Eq. (18). For observation times T large compared to the reciprocal signal bandwidth,
(within 1 per cent for αT ≥ 104).
For the more general case of a stationary Gaussian signal x(t) with
one similarly obtains
19.8-4. Sample Averages. Different sample functions generated by the same random process will be denoted by 1x(t), 2x(t), . . . (Fig. 18.9-1) and are regarded as statistically independent; i.e., every finite set of samples ix(t1), ix(t2), . . . is statistically independent of every set of samples from a different sample function kx(t). If one can realize a set of sample functions 1x(t), 2x(t), . . . , nx(t) in independent repeated experiments, then the sample values 1x(t1), 2x(t1), . . . , nx(t1) constitute a classical random sample of size n; i.e., the kx(t1) are statistically independent random variables with identical probability distributions. Similarly, 1x(t1), 1x(t2); 2x(t1), 2x(t2); . . . ; nx(t1), nx(t2), or 1x(t1), 1y(t2); 2x(t1), 2y(t2); . . . ; nx(t1), ny(t2) constitute bivariate random samples.
Sample averages, like
are, then, random-sample statistics in the sense of classical statistical theory. Sample averages must be obtained from repeated or multiple experiments, but it is usually much simpler to derive variances and probability distributions for sample averages than it is for time averages. In particular,
just as in Sec. 19.2-3 (see also Fig. 19.8-1).
FIG. 19.8-1. Four sample functions x(t) = kx(t) from a continuous random process represented by x(t).
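The ensemble-averaging idea of Sec. 19.8-4 can be sketched as follows (the process model and all names are illustrative assumptions, not the handbook's): each realization is an independent sample function, and the sample average over realizations at a fixed instant t1 is a classical random-sample mean, unbiased for E{x(t1)} with variance Var{x(t1)}/n.

```python
import random

def sample_average(realizations, t_index):
    """Ensemble (sample) average over n independent sample functions
    evaluated at the same instant t1: a classical random-sample mean,
    unbiased for E{x(t1)} with variance Var{x(t1)}/n."""
    vals = [r[t_index] for r in realizations]
    return sum(vals) / len(vals)

# Hypothetical stationary process x(t) = xi + unit-variance Gaussian
# noise, xi = 2; each call yields an independent sample function.
random.seed(1)
def make_realization(length=10, xi=2.0):
    return [xi + random.gauss(0.0, 1.0) for _ in range(length)]

realizations = [make_realization() for _ in range(2000)]
xbar = sample_average(realizations, t_index=0)
```

With n = 2000 realizations the standard deviation of the sample average is about 1/√2000 ≈ 0.022, so xbar lies close to the true mean 2.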
19.9. TESTING AND ESTIMATION WITH RANDOM PARAMETERS
19.9-1. Problem Statement. A practically important class of decision situations can be represented by the model of Fig. 19.9-1. The cost C (risk) associated with one operation of the system shown is a function of the state of the environment represented by the m-dimensional random variable (s1, s2, . . . , sm) ≡ s and by a decision (response) represented by the r-dimensional variable (y1, y2, ... , yr) ≡ y, i.e.,
The decision maker (man, machine, or system) arrives at the decision y on the basis of received, observed, or measured data represented by the
n-dimensional random variable (x1, x2, . . . , xn) ≡ x, which is related to the state of the environment through the given joint distribution of s and x. The decision maker forms y as a deterministic function (decision function, see also Sec. 19.6-9) of the received data:*
Given the joint distribution of s and x together with the cost function (1) representing the system performance for each combination of environment state and decision, one desires to minimize the expected risk
through an optimal choice of the decision function y(x). s, x, and y may be discrete or continuous variables.
19.9-2. Bayes Estimation and Tests. If the environment-state parameters s1, s2, . . . , sm are regarded as parameters of the unknown probability distribution of the observed sample (x1, x2, . . . , xn), the problem is somewhat similar to the classical problems of testing and estimation; the essential difference is that the parameters s1, s2, . . . , sm are now random variables.
For continuous variables s, x, the decision maker's knowledge of the environment state on the basis of a received sample x is, then, expressed by the conditional probability density φ(s|x). Minimization of the expected risk (3), or
then requires minimization of the conditional risk
for each sample x through a proper choice of the decision function y(x). If one is given the “a priori” distribution of s and the conditional probability density φ(x|s), then the “a posteriori” probability distribution required for Eq. (5) is obtained with the aid of Bayes's theorem (Secs. 18.2-6 and 18.4-5) as
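For a discrete state variable, Bayes's theorem takes the elementary form p(s|x) = p(s)p(x|s)/Σs′ p(s′)p(x|s′); a minimal sketch (names are ours):

```python
def posterior(prior, likelihood, x):
    """Discrete Bayes's theorem: p(s|x) = p(s)*p(x|s) / sum over s'
    of p(s')*p(x|s').  'prior' maps each state s to p(s);
    'likelihood' maps an observation x and state s to p(x|s)."""
    joint = {s: prior[s] * likelihood(x, s) for s in prior}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}
```

For instance, with equal priors on two states and p(x|s) = 0.8 when the observation matches the state (0.2 otherwise), observing x = 1 gives posterior probability 0.8 for s = 1.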
Decision processes based on such minimization of the expected risk are known as Bayes estimation, and as Bayes tests if y is a discrete variable. Note that not all the unknown parameters si need affect C[s, y] explicitly (e.g., signal-carrier phase in amplitude-modulated radio transmission, Ref. 19.23).
If, as is often the case, one has no reliable knowledge of the cost function C(s, y) and/or of the “a priori” distribution of s, then Bayes estimation and testing becomes impossible. One may assume “worst-case” C(s, y) and φ(s) (minimax tests, Refs. 19.24 and 19.26), or one returns to the “classical” procedures of maximum-likelihood estimation (Sec. 19.4-4) and Neyman-Pearson tests (Sec. 19.6-3).
19.9-3. Binary State and Decision Variables: Statistical Tests (Detection) (see also Secs. 19.6-1 to 19.6-4). Assume that there exist only two environment states, s = 0 (null hypothesis) and s = 1 (alternative hypothesis, corresponding, say, to the absence and presence of a target to be detected). Assume two possible decisions y = 0, y = 1, which correspond to acceptance or rejection of the null hypothesis on the basis of the observed data sample (x1, x2, . . . , xn). The problem amounts to the choice of a critical region (rejection region) Sc of sample points (x1, x2, . . . , xn) which will minimize the expected risk
E{C(s,y)}
where p0 = P[s = 0], and the complement S̄c of Sc is the acceptance region. E{C(s, y)} will be minimized if one rejects the null hypothesis whenever the likelihood ratio
(see also Sec. 19.6-3) exceeds the critical value
Note that (1) any increasing or decreasing function of the likelihood ratio (8) can replace the latter as a test statistic, and (2) the likelihood ratio (8) is itself a monotonic function of the “a posteriori” conditional probability p(s|x1, x2, . . . , xn), which may be regarded as a basic test statistic.
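A sketch of the binary Bayes test follows. Since the elided displays are not available, the code assumes the common special case C(0, 0) = C(1, 1) = 0, in which the critical likelihood-ratio value reduces to p0·Cfa/[(1 − p0)·Cmiss], where Cfa is the cost of a false alarm and Cmiss the cost of a miss (names are ours):

```python
def bayes_threshold(p0, c_fa, c_miss):
    """Critical likelihood-ratio value for the binary Bayes test in
    the special case C(0,0) = C(1,1) = 0: reject the null hypothesis
    (decide y = 1) when Lambda(x) > p0*c_fa / ((1 - p0)*c_miss)."""
    return p0 * c_fa / ((1 - p0) * c_miss)

def bayes_test(likelihood_ratio, p0, c_fa=1.0, c_miss=1.0):
    """Return the decision y (0 or 1) for an observed likelihood
    ratio, given the prior p0 = P[s = 0] and the two error costs."""
    return 1 if likelihood_ratio > bayes_threshold(p0, c_fa, c_miss) else 0
```

With equal priors and equal error costs the threshold is 1, i.e., the test reduces to choosing the more likely hypothesis.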
EXAMPLE: Signal Detection with Additive Flat-spectrum Gaussian Noise. The objective is to decide whether a received signal x(t) of bandwidth B is due to noise alone [s = 0, or x(t) = n(t)] or to signal and additive noise [s = 1, or x(t) = s(t) + n(t)]; in view of the finite bandwidth, one can describe signals and noise in terms of sample values
where T is the observation time (Sec. 18.11-2). One is given
so that
Since
is an increasing function of the likelihood ratio Λ(x1, x2, . . . , x2BT), z rather than Λ may be used as a test statistic. The receiver forms z(x1, x2, . . . , x2BT), either by discrete summation or by continuous integration, and compares it with a threshold value zc determined by p0 and C(s, y); z > zc corresponds to the decision y = 1 (crosscorrelation detector or matched-filter detector). See Refs. 19.22, 19.23, and 19.26 to 19.28 for additional examples and applications.
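The discrete-summation form of the crosscorrelation detector can be sketched as follows (the statistic z = Σk sk·xk is one standard form for known-signal detection in white Gaussian noise; names are ours, and the threshold z_c is taken as given):

```python
def crosscorrelation_statistic(x, s):
    """Test statistic z = sum_k s_k * x_k formed by discrete summation
    of the received samples x_k against the known signal samples s_k."""
    return sum(sk * xk for sk, xk in zip(s, x))

def detect(x, s, z_c):
    """Compare z with the threshold z_c (determined by p0 and the
    cost function); z > z_c corresponds to the decision y = 1."""
    return 1 if crosscorrelation_statistic(x, s) > z_c else 0
```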
19.9-4. Estimation (Signal Extraction, Regression). In a typical measurement situation, the environment has a continuously distributed set of states represented by values of s ≡ (s1, s2, . . . , sm), and the m components yk of the m-dimensional decision function y ≡ (y1, y2, . . . , ym) must be chosen so as to approximate the corresponding sk as closely as possible in some sense specified by the cost function C(s, y). In the practically important case of least-square estimation, one assumes a cost function of the form
In this case, E{C(s,y)} is minimized if each yk equals the conditional expected value of sk for the measured sample (x1, x2, . . . , xn):
(see also Secs. 18.4-5, 18.4-6, and 18.4-9), where φp(sk|x1, x2, . . . , xn) must be obtained from φ(x1, x2, . . . , xn|s1, s2, . . . , sm) with the aid of Bayes's formula
EXAMPLE: D-c Measurement with Additive Gaussian Noise. The objective is to measure a single random quantity s from a sample (x1, x2, . . . , xn) of measured values xk = s + nk, given
which corresponds to additive-noise samples nk statistically independent of each other and of s. Bayes's formula yields
E{(y — s)2} is then minimized by the least-squares estimate
which depends on the sample values xk only by way of the statistic (sample average) and is biased by a priori knowledge of φ(s); the bias increases with increasing PN and decreasing sample size n, both of which reduce the “information” in the measured sample (x1, x2, . . . , xn). The resulting expected risk is
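The biased pull toward the prior can be sketched with the standard conjugate-Gaussian result (assumed here for illustration, since the displays are elided): with prior s ~ N(μs, σs²) and independent noise nk ~ N(0, PN), the posterior mean is a weighted combination of the sample average and the prior mean.

```python
def bayes_mean_estimate(x, mu_s, var_s, PN):
    """Least-squares (posterior-mean) estimate of s from samples
    x_k = s + n_k, assuming a Gaussian prior s ~ N(mu_s, var_s) and
    independent Gaussian noise n_k ~ N(0, PN):
        y = (var_s*xbar + (PN/n)*mu_s) / (var_s + PN/n).
    Increasing PN or decreasing n pulls y toward the prior mean mu_s,
    reflecting the reduced 'information' in the sample."""
    n = len(x)
    xbar = sum(x) / n
    return (var_s * xbar + (PN / n) * mu_s) / (var_s + PN / n)
```

In the limits PN → 0 or n → ∞ the estimate reduces to the sample average; for an uninformative sample it reduces to the prior mean.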
References 19.22, 19.23, and 19.26 to 19.28 describe additional applications.
19.10. RELATED TOPICS, REFERENCES, AND BIBLIOGRAPHY
19.10-1. Related Topics. The following topics related to the study of mathematical statistics are treated in other chapters of this handbook:
Probability theory, special distributions, limit theorems, random processes Chap. 18
Numerical calculations Chap. 20
Linear programming, game theory Chap. 11
Combinatorial analysis Appendix C
19.10-2. References and Bibliography (see also Sec. 18.13-2).
19.1. Brownlee, K. A.: Statistical Theory and Methodology in Science and Engineering, Wiley, New York, 1961.
19.2. Brunk, H. D.: An Introduction to Mathematical Statistics, 2d ed., Blaisdell, New York, 1964.
19.3. Burington, R. S., and D. C. May: Handbook of Probability and Statistics with Tables, 2d ed., McGraw-Hill, New York, 1967.
19.4. Cramér, H.: Mathematical Methods of Statistics, Princeton, Princeton, N.J., 1951.
19.5. Dixon, W. J., and F. J. Massey, Jr.: An Introduction to Statistical Analysis, 2d ed., McGraw-Hill, New York, 1957.
19.6. Eisenhart, C., M. W. Hastay, and W. A. Wallis: Techniques of Statistical Analysis, McGraw-Hill, New York, 1947.
19.7. Elderton, W. P.: Frequency Curves and Correlation, 3d ed., Cambridge, New York, 1938.
19.8. Hald, A.: Statistical Theory with Engineering Applications, Wiley, New York, 1952.
19.9. Hoel, P. G.: Elementary Statistics, 2d ed., Wiley, New York, 1966.
19.10. Hogg, R. V., and A. T. Craig: Introduction to Mathematical Statistics, Macmillan, New York, 1959.
19.11. Lehmann, E. L.: Testing Statistical Hypotheses, Wiley, New York, 1959.
19.12. Mood, A. M., and F. A. Graybill: Introduction to the Theory of Statistics, 2d ed., McGraw-Hill, New York, 1963.
19.13. Neyman, J.: First Course in Probability and Statistics, Holt, New York, 1950.
19.14. Scheffe, H.: The Analysis of Variance, Wiley, New York, 1959.
19.15. Van der Waerden, B. L.: Mathematische Statistik, 2d ed., Springer, Berlin, 1965.
19.16. Walsh, J. E.: Handbook of Nonparametric Statistics (2 vols.), Van Nostrand, Princeton, N.J., 1960/65.
19.17. Weiss, L.: Statistical Decision Theory, McGraw-Hill, New York, 1961.
19.18. Wilks, S. S.: Mathematical Statistics, 2d ed., Wiley, New York, 1962.
Random-process Statistics and Decision Theory
19.19. Bendat, J. S.: Principles and Applications of Random Noise Theory, Wiley, New York, 1958.
19.20. —— and A. G. Piersol: Measurement and Analysis of Random Data, Wiley, New York, 1966.
19.21. Blackman, R. B., and J. W. Tukey: The Measurement of Power Spectra, Dover, New York, 1958.
19.22. Davenport, W. B., and W. L. Root: Introduction to Random Signals and Noise, McGraw-Hill, New York, 1958.
19.23. Hancock, J. C.: Signal Detection, McGraw-Hill, New York, 1966.
19.24. Helstrom, C. W.: Statistical Theory of Signal Detection, Pergamon Press, New York, 1960.
19.25. Korn, G. A.: Random-process Simulation and Measurements, McGraw-Hill, New York, 1966.
19.26. Middleton, D.: An Introduction to Statistical Communication Theory, McGraw-Hill, New York, 1960.
19.27. Wainstein, L. A., and V. D. Zubakov: Extraction of Signals from Noise, Prentice-Hall, Englewood Cliffs, N.J., 1962.
19.28. Wozencraft, J. M., and I. M. Jacobs: Principles of Communication Engineering, Wiley, New York, 1965.
* See footnote to Sec. 18.3-4.
* See footnote to Sec. 19.6-5a.
* Random or partially random selection of decisions (as in games with mixed strategies, Sec. 11.4-4b) will not be considered here.