CHAPTER 19
MATHEMATICAL STATISTICS
19.1. Introduction to Statistical Methods
19.1-3. Relation between Probability Model and Reality: Estimation and Testing
19.2. Statistical Description. Definition and Computation of Random-sample Statistics
19.2-1. Statistical Relative Frequencies
(a) Definition and Basic Properties
19.2-2. The Distribution of the Sample. Grouped Data
(a) The Empirical Cumulative Distribution Function
(b) Class Intervals and Grouped Data
19.2-3. Sample Averages
(a) The Sample Average of x
(b) The Sample Average of y(x)
19.2-4. Sample Variances and Moments
(c) Measures of Skewness and Excess
19.2-5. Simplified Numerical Computation of Sample Averages and Variances. Corrections for Grouping
19.3. General-purpose Probability Distributions
19.3-2. Edgeworth-Kapteyn Representation of Theoretical Distributions
19.3-3. Gram-Charlier and Edgeworth Series Approximations
19.3-4. Truncated Normal Distributions and Pareto's Distribution
19.3-5. Pearson's General-purpose Distributions
19.4. Classical Parameter Estimation
19.4-1. Properties of Estimates
19.4-2. Some Properties of Statistics Used as Estimates
19.4-3. Derivation of Estimates: The Method of Moments
19.4-4. The Method of Maximum Likelihood
19.4-5. Other Methods of Estimation
19.5. Sampling Distributions
19.5-1. Introduction
19.5-2. Asymptotically Normal Sampling Distributions
19.5-3. Samples Drawn from Normal Populations. The χ2, t, and ν2 Distributions
19.5-4. The Distribution of the Sample Range
19.5-5. Sampling from Finite Populations
19.6. Classical Statistical Tests
19.6-1. Statistical Hypotheses
19.6-2. Fixed-sample Tests: Definitions
19.6-3. Level of Significance. Neyman-Pearson Criteria for Choosing Tests of Simple Hypotheses
19.6-6. Tests for Comparing Normal Populations. Analysis of Variance
(b) Comparison of Normal Populations
19.6-7. The χ2 Test for Goodness of Fit
19.6-8. Parameter-free Comparison of Two Populations: The Sign Test
19.7. Some Statistics, Sampling Distributions, and Tests for Multivariate Distributions
19.7-2. Statistics Derived from Multivariate Samples
19.7-3. Estimation of Parameters
19.7-4. Sampling Distributions for Normal Populations
(a) Distribution of the Sample Correlation Coefficient
(b) The r Distribution. Test for Uncorrelated Variables
(c) Test for Hypothetical Values of the Regression Coefficient
19.7-6. Spearman's Rank Correlation. A Nonparametric Test of Statistical Dependence
19.8. Random-process Statistics and Measurements
19.8-1. Simple Finite-time Averages
(b) Measurement of Mean Square
(c) Measurement of Correlation Functions
19.9. Testing and Estimation with Random Parameters
19.9-2. Bayes Estimation and Tests
19.9-3. Binary State and Decision Variables: Statistical Tests or Detection
19.9-4. Estimation (Signal Extraction, Regression)
19.10. Related Topics, References, and Bibliography
19.10-2. References and Bibliography
19.1. INTRODUCTION TO STATISTICAL METHODS
19.1-1. Statistics. In the most general sense of the word, statistics is the art of using quantitative empirical data to describe experience and to infer and test propositions (numerical estimates, hypothetical correspondences, predicted results, decisions). More specifically, statistics deals (1) with the statistical description of processes or experiments, and (2) with the induction and testing of corresponding mathematical models involving the probability concept. The relevant portions of probability theory constitute the field of mathematical statistics. These techniques extend the possibility of scientific prediction and rational decisions to many situations where deterministic prediction fails because essential parameters cannot be known or controlled with sufficient accuracy.
Statistical description and probability models apply to physical processes exhibiting the following empirical phenomenon. Even though individual measurements of a physical quantity x cannot be predicted with sufficient accuracy, a suitably determined function y = y(x1, x2, . . .) of a set (sample) of repeated measurements x1, x2, ... of x can often be predicted with substantially better accuracy, and the prediction of y may still yield useful decisions. Such a function y of a set of sample values is called a statistic, and the incidence of increased predictability is known as statistical regularity. Statistical regularity, in each individual situation, is an empirical physical law which, like the law of gravity or the induction law, is ultimately derived from experience and not from mathematics. Frequently a statistic can be predicted with increasing accuracy as the size n of the sample (x1, x2, ... , xn) increases (physical laws of large numbers). The best-known statistics are statistical relative frequencies (Sec. 19.2-1) and sample averages (Sec. 19.2-3).
19.1-2. The Classical Probability Model: Random-sample Statistics. Concept of a Population (Universe). (a) In an important class of applications, a continuously variable physical quantity (observable) x is regarded as a one-dimensional random variable with the inferred or estimated probability density φ(x). Each sample (x1, x2, . . . , xn) of measurements of x is postulated to be the result of n repeated independent measurements (Sec. 18.2-4). Hence x1, x2, . . . , xn are statistically independent random variables with identical probability density. A sample (x1, x2, . . . , xn) defined in this manner is called a random sample of size n and constitutes an n-dimensional random variable. The probability density in the n-dimensional sample space of “sample points” (x1, x2, . . . , xn) is the likelihood function

L(x1, x2, . . . , xn) = φ(x1)φ(x2) · · · φ(xn)    (19.1-1)
Every random-sample statistic defined as a measurable function y = y(x1, x2, . . . , xn) of the sample values is a random variable whose probability distribution (the sampling distribution of y) is uniquely determined by the likelihood function, and hence by the distribution of x. Each sampling distribution will, in general, depend on the sample size n.
While the assumptions made in this section do not apply to every physical situation, the model is capable of considerable generalization. The distribution of x need not be continuous (games, quality control). The sample may be of infinite size and may not even be a countable set; the xk may not have the same probability distribution and may not be statistically independent (random-process theory, Sec. 18.9-2). Finally, x, and thus each xk, can be replaced by a multidimensional variable (Secs. 19.7-1 to 19.7-6).
(b) As the size n of a random sample increases, many sample statistics converge in probability to corresponding parameters of the theoretical distribution of x; in particular, statistical relative frequencies converge in mean to the corresponding probabilities (Sec. 19.2-1). Thus one considers each sample drawn from an infinite (theoretical) population (universe, ensemble) whose sample distribution (Sec. 19.2-2) is identical with the theoretical probability distribution of x. The probability distribution is then referred to as the population distribution, and its parameters are population parameters. In many applications, the theoretical population is an idealization of an actual population from which samples are drawn.
19.1-3. Relation between Probability Model and Reality: Estimation and Testing (a) Estimation of Parameters. Statistical methods use empirical data (sample values) to infer specifications of a probability model, e.g., to estimate the probability density φ(x) of a random variable x. An important application of such inferred models is to make decisions based on inferred probabilities of future events. In most applications, statistical relative frequencies (Sec. 19.2-1) are used directly only for rough qualitative (graphical) estimates of the population distribution. Instead, one infers (postulates) the general form of the theoretical distribution, say
where η1, η2, . . . are unknown population parameters to be estimated on the basis of the given random sample (x1, x2, . . . , xn). Sections 19.3-1 to 19.3-5 list a number of “general-purpose” frequency functions (2) to be chosen in accordance with the physical background, the form of the sample distribution, and convenience in computations.
The parameters η1, η2, . . . usually measure specific properties of the theoretical distribution of x (e.g., population mean, population variance, skewness; see also Table 18.3-1). In general, one attempts to estimate values of the parameters η1, η2, . . . “fitting” a given sample (x1, x2, . . . , xn) by the empirical values of corresponding sample statistics y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . which measure analogous properties of the sample (e.g., sample average, sample variance, Secs. 19.2-3 to 19.2-6). “Fitting” is interpreted subjectively and not necessarily uniquely; in particular, one prefers estimates y(x1, x2, . . . , xn) which converge in probability to η as n → ∞ (consistent estimates), whose expected value equals η (unbiased estimates), whose sampling distribution has a small variance, and/or which are easy to compute (Secs. 19.4-1 to 19.4-5).
(b) Testing Statistical Hypotheses. Tests of a statistical hypothesis specifying some property of a theoretical distribution (say, an inferred set of parameter values η1, η2, . . .) are based on the likelihood (1) of a test sample (x1, x2, . . . , xn) when the hypothetical probability density (2) is used to compute L(x1, x2, . . . , xn). Generally speaking, the test will reject the hypothesis if the test sample (x1, x2, . . . , xn) falls into a region of small likelihood, or, equivalently, if the corresponding value of a test statistic y(x1, x2, . . . , xn) is improbable on the basis of the hypothetical likelihood function. The choice of specific conditions of rejection is again subjective and is ultimately based on the penalties of false rejection and/or acceptance and, to some extent, on the cost of obtaining test samples of various sizes (Secs. 19.6-1 to 19.6-9).
Nonparametric tests test hypothetical distribution properties other than parameter values (e.g., identity of two distributions, statistical independence of two random variables, Secs. 19.6-8 and 19.7-6) and are particularly convenient in suitable applications, since no specific form (2) of the population distribution need be inferred.
NOTE: Incorrect use of statistical methods can lead to grave errors and seriously wrong conclusions. All (possibly tacit) assumptions regarding a theoretical distribution must be checked. Never use the same sample for estimation and testing. Finally, remember that statistical tests cannot prove any hypothesis; they can only demonstrate a “lack of disproof.”
19.2. STATISTICAL DESCRIPTION. DEFINITION AND COMPUTATION OF RANDOM-SAMPLE STATISTICS
19.2-1. Statistical Relative Frequencies. (a) Definition and Basic Properties. Consider an event E which occurs if and only if a measurement of the random variable x yields a value in some measurable set SE (usually a class interval, Sec. 19.2-2b). Given a random sample (x1, x2, . . . , xn) of x, let nE denote the number of times a sample value xk implies the occurrence of the event E. The statistical relative frequency of the event E obtained from the given random sample is

h[E] = nE/n    (19.2-1)

where n is the size of the sample.
The definition of statistical relative frequencies implies the existence of an event algebra (Sec. 18.2-1) for the experiment or observation in question. The defining properties of mathematical probabilities (Sec. 18.2-2) are abstracted from corresponding properties of statistical relative frequencies. Thus the relative frequencies of mutually exclusive events add, the relative frequency of a certain event is 1, etc.
(b) Mean and Variance. Since the random sample may be regarded as a set of n Bernoulli trials (Sec. 18.7-3) which yield or do not yield E, the random variable nE has a binomial distribution (Table 18.8-3), where p = P[E] is the probability associated with the event E, and

E{h[E]} = p        Var{h[E]} = p(1 − p)/n    (19.2-2)
The statistical relative frequency h[E] is an unbiased, consistent estimate of the corresponding probability P[E]; as n → ∞, h[E] is asymptotically normal with the parameters (2) (Secs. 18.6-4 and 18.6-5a).
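A brief Python sketch (not part of the handbook; names are illustrative) of the estimate h[E] and its binomial sampling variability:

```python
import math
import random

def relative_frequency(sample, event):
    """Statistical relative frequency h[E] = n_E / n for a predicate `event`."""
    n_e = sum(1 for x in sample if event(x))
    return n_e / len(sample)

random.seed(0)
sample = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Event E: x falls in the class interval (-1, 1); P[E] ~ 0.6827 for N(0, 1).
h = relative_frequency(sample, lambda x: -1.0 < x < 1.0)
p = 0.6827
std_err = math.sqrt(p * (1.0 - p) / len(sample))   # sqrt of Var{h[E]} = p(1-p)/n

assert 0.66 < h < 0.70      # h[E] concentrates near P[E] as n grows
assert std_err < 0.005
```

The standard error p(1 − p)/n shrinking as 1/n is the quantitative content of the consistency statement above.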
19.2-2. The Distribution of the Sample. Grouped Data. (a) The Empirical Cumulative Distribution Function. For a given random sample (x1, x2, . . . , xn), the empirical cumulative distribution function

F(X) = h[x ≤ X] = (number of sample values xk ≤ X)/n

is a nondecreasing step function, with F(— ∞) = 0, F(∞) = 1. F(X) is an unbiased, consistent estimate of the cumulative distribution function Φ(X) = P[x ≤ X] (Sec. 18.2-9) and defines the distribution (frequency distribution) of the sample (empirical distribution based on the given sample).
(b) Class Intervals and Grouped Data. Let the range of the random variable x be partitioned into a finite or infinite number of conveniently chosen class intervals (cells) (j = 1, 2, . . .) respectively of length ΔX1, ΔX2, . . . and centered at x = X1 < X2 < . . . . For a given random sample, the class frequency (occupation number) nj is the number of times an xk falls into the jth class interval (description of the sample in terms of grouped data). The statistical relative frequencies hj = nj/n (relative frequencies of observations in the jth class interval) must add up to unity and are consistent, unbiased estimates of the corresponding probabilities
The cumulative frequencies Nj and the cumulative relative frequencies Fj are defined by

Nj = n1 + n2 + · · · + nj        Fj = Nj/n = h1 + h2 + · · · + hj
Sample statistics can be calculated from the statistical relative frequencies hj = nj/n just as corresponding population parameters are calculated from probabilities. The roundoff implicit in numerical computation of statistics groups data with a class-interval width equal to one least-significant digit. Grouping into much larger class intervals may be economically advantageous, since “quantization errors” due to grouping very often average out or are easily corrected (Sec. 19.2-5 and Ref. 19.25). The statistics F(X), nj, hj, Nj, and Fj also yield various graphical representations of sample distributions and hence of estimated population distributions (bar charts, histograms, frequency polygons, probability graph paper, etc.; see Refs. 19.1 and 19.8).
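The empirical distribution function and the class frequencies can be sketched directly (Python; function names are illustrative, not from the handbook):

```python
from bisect import bisect_right

def empirical_cdf(sample):
    """Return F(X) = (number of sample values <= X) / n as a step function."""
    xs = sorted(sample)
    n = len(xs)
    return lambda X: bisect_right(xs, X) / n

def class_frequencies(sample, centers, width):
    """Occupation numbers n_j for equal class intervals of length `width`
    centered at the given class-interval mid-points X_j."""
    counts = [0] * len(centers)
    for x in sample:
        for j, c in enumerate(centers):
            if c - width / 2 <= x < c + width / 2:
                counts[j] += 1
                break
    return counts

sample = [1.2, 0.4, 2.7, 1.9, 0.8, 1.5]
F = empirical_cdf(sample)
assert F(-10) == 0.0 and F(10) == 1.0   # F(-infinity) = 0, F(infinity) = 1
assert F(1.2) == 0.5                    # three of the six values are <= 1.2
nj = class_frequencies(sample, centers=[0.5, 1.5, 2.5], width=1.0)
assert nj == [2, 3, 1] and sum(nj) == len(sample)
```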
(c) Sample Fractiles (see also Table 18.3-1 and Sec. 19.5-2b). The sample P fractiles (sample quantiles) Xp are defined by
Equation (5) does not define Xp uniquely but brackets it by two adjacent sample values xk. X1/2 is the sample median, and X1/4, X1/2, X3/4 are sample quartiles, with analogous definitions for sample deciles and sample percentiles.
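Because the defining inequalities only bracket Xp, any implementation must pick a convention; the following sketch (illustrative, stdlib Python) returns one order-statistic choice:

```python
import math

def sample_fractile(xs, p):
    """A sample p fractile X_p: a value with F(X_p - 0) <= p <= F(X_p).
    The definition only brackets X_p between adjacent sample values;
    this sketch returns one conventional order-statistic choice."""
    xs = sorted(xs)
    k = max(math.ceil(p * len(xs)) - 1, 0)
    return xs[k]

xs = [7, 1, 5, 3, 9]
assert sample_fractile(xs, 0.5) == 5    # sample median
assert sample_fractile(xs, 0.25) == 3   # lower sample quartile (one convention)
```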
19.2-3. Sample Averages (see also Secs. 18.3-3, 19.2-5, and 19.5-3). (a) The Sample Average of x. Given a random sample (x1, x2, . . . , xn), the sample average of x is

x̄ = (x1 + x2 + · · · + xn)/n

In terms of the sample distribution over a set of class intervals centered at x = X1, X2, . . . , Xm (Sec. 19.2-2), x̄ is approximated by

x̄G = (n1X1 + n2X2 + · · · + nmXm)/n = h1X1 + h2X2 + · · · + hmXm

x̄ is a measure of location of the sample distribution. Note

E{x̄} = E{x} = ξ

whenever the quantity on the right exists. x̄ is an unbiased, consistent estimate of the population mean ξ = E{x}; if σ2 exists,

E{x̄} = ξ        Var{x̄} = σ2/n    (19.2-8)

x̄ is asymptotically normal with the parameters (8) as n → ∞ (Secs. 18.6-4 and 19.5-2).
(b) The Sample Average of y(x). The sample average of a function y(x) of the random variable x is

ȳ = [y(x1) + y(x2) + · · · + y(xn)]/n ≈ h1y(X1) + h2y(X2) + · · · + hmy(Xm)    (19.2-11)
Estimates based on Eq. (11) are sometimes improved by correction terms (Sec. 19.2-5).
19.2-4. Sample Variances and Moments (see also Secs. 18.3-3, 18.3-7, 19.2-5, 19.4-2, and 19.5-2). (a) The Sample Variances. The sample variances

s2 = (1/n)[(x1 − x̄)2 + · · · + (xn − x̄)2]        S2 = n s2/(n − 1)    (19.2-12)
are measures of dispersion of the sample distribution; s is called the sample standard deviation or sample dispersion. Note

E{s2} = (n − 1)σ2/n        E{S2} = σ2

whenever the quantity on the right exists. S2 is an unbiased, consistent estimate of the population variance σ2 = Var{x} and is thus often more useful than s2.
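The two divisor conventions can be checked with a small numerical sketch (Python; names are illustrative):

```python
def sample_variances(xs):
    """Return (s2, S2): the divide-by-n and divide-by-(n - 1) sample variances."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / n, ss / (n - 1)

xs = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
s2, S2 = sample_variances(xs)
assert s2 == 4.0                            # sum of squares 32 over n = 8
assert S2 == s2 * len(xs) / (len(xs) - 1)   # S2 = n s2 / (n - 1)
```

Only the divide-by-(n − 1) version is unbiased; both are consistent.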
(b) Sample Moments. The sample moments ar and sample central moments mr of order r are defined by

ar = (1/n)(x1^r + x2^r + · · · + xn^r)        mr = (1/n)[(x1 − x̄)^r + · · · + (xn − x̄)^r]    (19.2-15)

Note

E{ar} = αr        Var{ar} = (α2r − αr^2)/n    (19.2-16)

whenever the quantity on the right exists. ar is an unbiased, consistent estimate of the corresponding population moment αr = E{x^r}. If α2r exists, ar is asymptotically normal with the parameters (16) as n → ∞. mr is a consistent (but not unbiased) estimate of μr.
The random variables
are, respectively, consistent, unbiased estimates of μ3 and μ4;
(c) Measures of Skewness and Excess (see also Table 18.3-1 and Sec. 19.3-5).
The statistics

g1 = m3/s3        g2 = m4/s4 − 3

respectively measure the skewness and the excess (kurtosis, flatness) of the sample and are consistent estimates of the corresponding population parameters γ1 and γ2 (see also Sec. 19.4-2). Roughly speaking, g1 > 0 indicates a longer “tail” to the right. Some authors introduce g1^2 and g2 + 3 or (g2 + 3)/2 instead of g1 and g2. Several other measures of skewness have been used (Ref. 19.1).
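These two shape statistics can be computed directly from the central moments (Python sketch; names are illustrative):

```python
def skewness_excess(xs):
    """Sample skewness g1 = m3 / s^3 and excess (kurtosis) g2 = m4 / s^4 - 3,
    with central moments m_r = (1/n) * sum((x_k - mean)^r) and s^2 = m_2."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m3 = sum((x - mean) ** 3 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0

g1, g2 = skewness_excess([-1.0, 0.0, 1.0])
assert abs(g1) < 1e-12          # symmetric sample: no skewness
assert abs(g2 + 1.5) < 1e-9     # flat three-point sample: negative excess
```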
19.2-5. Simplified Numerical Computation of Sample Averages and Variances. Corrections for Grouping (see Refs. 19.3 and 19.8 for calculation-sheet layouts). (a) For numerical computations, it is convenient to choose a computing origin X0 near the center of the sample distribution (“guessed mean”) and to compute
or, for grouped data,
(b) The sample variances s2 and S2 may be computed from
which is approximated for grouped data by
(c) Computations with grouped data are simplified if all class intervals are of equal length ΔX, and if one of the class-interval mid-points Xj = X0 is taken to be the computing origin, so that
where the Yj are “coded” class-interval centers which take integral values 0, ±1, ±2, .... One has then
One may check computations by introducing a different computing origin X0, say X0 = 0.
(d) Sheppard's Corrections for Grouping. Let all class intervals be of equal length ΔX. Then, if the characteristic function (Sec. 18.3-8) χx(q) and its derivatives are small for |q| ≥ 2π/ΔX, one may improve the grouped-data approximation sG2 to the true sample variance s2 by adding Sheppard's correction — (ΔX)2/12. Analogous corrections for the grouped-data sample moments
yield the improved estimates
More generally,
where the Bk are the Bernoulli numbers (Sec. 21.5-2). These corrections become exact if χx(q) = 0 for |q| ≥ 2π/ΔX − ε (ε > 0). For normal variables with ΔX ≤ 2σ, a1′ = a1 within 2.3 × 10−3 ΔX (ξ ≠ 0), and —(ΔX)2/12 is within 3.1 × 10−2 σ2 of the exact correction if ξ = 0.
NOTE: Sheppard's corrections often yield useful estimates of errors due to the use of rounded-off sample values in the exact formulas (12) and (15). Thus, if Sheppard's correction applies, a mean round-off error ΔX/2 in the xk affects s2 only as (ΔX)2/12.
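The effect of the correction can be seen in a small simulation (Python sketch under the stated assumptions; a normal sample is grouped into equal intervals and the grouped variance is corrected by −(ΔX)²/12):

```python
import random

def grouped_variance(values, width):
    """Group the sample into equal class intervals of length `width` (by
    rounding each value to the nearest interval center), then return the
    grouped variance and its Sheppard-corrected value."""
    grouped = [round(v / width) * width for v in values]
    n = len(grouped)
    mean = sum(grouped) / n
    s2_g = sum((x - mean) ** 2 for x in grouped) / n
    return s2_g, s2_g - width ** 2 / 12.0   # Sheppard's correction -(dX)^2/12

random.seed(1)
xs = [random.gauss(0.0, 1.0) for _ in range(20_000)]
mean = sum(xs) / len(xs)
s2_exact = sum((x - mean) ** 2 for x in xs) / len(xs)
s2_g, s2_corr = grouped_variance(xs, width=0.5)

# Grouping inflates the variance by roughly (dX)^2/12 ~ 0.0208; the corrected
# value should land closer to the exact (ungrouped) sample variance.
assert abs(s2_corr - s2_exact) < abs(s2_g - s2_exact)
```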
19.2-6. The Sample Range (see also Sec. 19.5-4). The sample range w for a random sample (x1, x2, . . . , xn) is the difference of the largest and the smallest sample value xk. The sample range has physical significance (quality control) and serves as a rough but conveniently calculated estimate of population parameters for specific theoretical distributions (Sec. 19.5-4). The sample range and the smallest and largest sample values are examples of rank (order) statistics.
19.3. GENERAL-PURPOSE PROBABILITY DISTRIBUTIONS
19.3-1. Introduction. The probability distributions described in Secs. 18.8-1 to 18.8-9 and, in particular, the normal, binomial, hypergeometric, and Poisson distributions can often serve as theoretical population distributions in statistical models. The applicability of a particular type of probability distribution (with suitably fitted parameters) may be inferred either theoretically or from the graph of an empirical distribution.
Normal distributions are particularly convenient. Each normal distribution is completely defined by its first- and second-order moments; moreover, the use of normal populations facilitates the computation of exact sampling distributions for use in statistical tests (Secs. 19.6-4 and 19.6-6). The use of a normal distribution is frequently justified by the central limit theorem (Sec. 18.6-5); in particular, errors in measurements are often regarded as normally distributed sums of many independent “elementary errors.”
19.3-2. Edgeworth-Kapteyn Representation of Theoretical Distributions. It is often desirable to fit the empirical distribution of a random variable x with a probability distribution described by
where g(x) is a function of x selected so as to be normally distributed with parameters (μ, σg2). Once g(x) has been chosen (e.g., from theoretical considerations), only two parameters, μ and σg, remain to be estimated, and much of the theory of normal populations is applicable.
Any random variable x described by Eq. (1) may be regarded as the limit of a sequence of random variables y1 = y0 + z1h(y0), y2 = y1 + z2h(y1), . . . , where z1, z2, . . . are random variables satisfying the conditions of Sec. 18.6-5c, and

z1 + z2 + · · · + zn = Σ (yk − yk−1)/h(yk−1) ≈ ∫ dy/h(y) = g(x)    (taken from y0 to x)

The z1, z2, . . . may be considered as “reaction intensities” in a physical process which successively generates y1, y2, . . . .
EXAMPLES: (1) h(y) = constant yields a normal distribution for x. (2) h(y) = y — a results in a logarithmic normal distribution defined by
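Example (2) can be sketched numerically: if g(x) = ln (x − a) is normal, the normal-population theory applies to the transformed sample values g(xk). (Python; the parameter names are illustrative.)

```python
import math
import random

# Assumed setup: g(x) = ln(x - a) is normal with parameters (mu, sigma_g),
# so x has a logarithmic normal distribution; mu and sigma_g can then be
# estimated from the transformed sample values g(x_k).
random.seed(3)
a, mu, sigma_g = 0.0, 1.0, 0.5
xs = [a + math.exp(random.gauss(mu, sigma_g)) for _ in range(20_000)]

gs = [math.log(x - a) for x in xs]
mu_hat = sum(gs) / len(gs)
sigma_hat = math.sqrt(sum((g - mu_hat) ** 2 for g in gs) / len(gs))

assert abs(mu_hat - mu) < 0.02        # standard error ~ sigma_g/sqrt(n) ~ 0.0035
assert abs(sigma_hat - sigma_g) < 0.02
```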
19.3-3. Gram-Charlier and Edgeworth Series Approximations.
It is frequently convenient to approximate the frequency function of a standardized random variable (Sec. 18.5-5) in the form
where the parameters μ3, μ4, γ1, γ2 refer to the theoretical distribution of z (Sec. 18.3-7b and Table 18.3-1). An analogous expression for the distribution function Φz(x) is obtained by substitution of Φu(k)(x) for φu(k)(x) in Eq. (3). Note that Eq. (3) expresses φz(x) in terms of the widely tabulated derivatives of the normal frequency function φu(x) (Sec. 18.8-3), and that the parameters (coefficients) can be estimated as functions of moments (Sec. 19.4-3); but the computation of sampling distributions is not easy. See also Table 18.5-1.
For a rather restricted class of distributions, the approximation (3) comprises the first terms of the orthogonal expansion (Sec. 15.2-6)
Φz(x) = Φu(x) + Σ (k ≥ 3) ck φu(k−1)(x)    (19.3-4a)
φz(x) = Σ (k ≥ 0) ck φu(k)(x)    (19.3-4b)
ck = [(−1)^k/k!] ∫ Hk(ξ)φz(ξ) dξ    (integral over −∞ < ξ < ∞)    (19.3-4c)
(Gram-Charlier Type A series), where Hk(x) is the kth Hermite polynomial (Sec. 21.7-1).
The series (4a) converges to Φz(x) if the moments μ1, μ2, . . . exist and the series converges; the series (4b) will then converge to φz(x) at all points of continuity if φz(x) is of bounded variation (Sec. 4.4-8b) in (— ∞, ∞). For a much larger class of distributions, the approximation (3) can be based on Edgeworth's asymptotic series (Ref. 19.4).
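Assuming the standard first terms of the Type A series, φu(x)[1 + (γ1/6)H3(x) + (γ2/24)H4(x)], the approximation can be evaluated directly (Python sketch; the test case uses the known moments of a standardized sum of twelve uniform variables):

```python
import math

def gram_charlier_pdf(x, g1, g2):
    """First terms of the Gram-Charlier Type A series for a standardized
    variable: phi_u(x) * (1 + g1/6 * H3(x) + g2/24 * H4(x)), with the
    (probabilists') Hermite polynomials H3, H4."""
    phi_u = math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)
    h3 = x ** 3 - 3.0 * x
    h4 = x ** 4 - 6.0 * x ** 2 + 3.0
    return phi_u * (1.0 + g1 / 6.0 * h3 + g2 / 24.0 * h4)

# z = standardized sum of 12 independent uniform variables:
# gamma1 = 0, gamma2 = (-6/5)/12 = -0.1 (cumulants of independent terms add).
approx = gram_charlier_pdf(0.0, 0.0, -0.1)
exact_normal = 1.0 / math.sqrt(2.0 * math.pi)
assert approx < exact_normal     # negative excess flattens the peak
```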
19.3-4. Truncated Normal Distributions and Pareto's Distribution. (a) Truncated Normal Distributions. If all events [x ≤ ξa] are removed from a normal population with mean μ and variance σ2 (Sec. 19.1-2), the remaining population has a one-sided truncated normal distribution such that φ(x) = Φ(x) = 0 for x ≤ ξa, and
where the degree of truncation is the fraction of the original population removed by the truncation. If ξa is known, one may use
to estimate μ and σ by the moment method (Sec. 19.4-3) with the aid of published tables (Ref. 19.15) expressing μ and σ in terms of α1 and α2.
(b) Pareto's distribution is defined by φ(x) = Φ(x) = 0 for x ≤ ξa and
19.3-5. Pearson's General-purpose Distributions. The frequency functions φ(x) of many continuous probability distributions can be described as solutions of the differential equation

dφ/dx = [(x − η1)/(η2 + η3x + η4x2)] φ    (19.3-9)

whose four parameters define each distribution completely. Each parameter ηk can be estimated as a function of the first four moments αr or μr (Sec. 19.4-3), but the computation of sampling distributions is, at best, difficult. The distributions defined by Eq. (9) can be classified according to the nature of the roots of η2 + η3x + η4x2 = 0 and include most of the continuous distributions described in Table 18.8-8, Sec. 18.8-3, and Sec. 19.5-3 as special cases (see also Ref. 19.7).
For Pearson's distributions, Pearson's measure of skewness (Table 18.3-1) is

(ξ − ξmode)/σ = γ1(γ2 + 6)/[2(5γ2 − 6γ1^2 + 6)]

so that for small γ1, γ2 one has ξmode ≈ ξ − γ1σ/2.
19.4. CLASSICAL PARAMETER ESTIMATION
19.4-1. Properties of Estimates (see also Sec. 19.1-3). (a) A sample statistic y(x1, x2, . . . , xn) is a consistent estimate of the theoretical population parameter η if and only if y converges to η in probability as the sample size n increases, i.e., if and only if the probability of realizing any finite deviation |y — η| converges to zero as n → ∞ (Sec. 18.6-1).
(b) The bias of an estimate y for the parameter η is the difference b(η) ≡ E{y} — η. y is an unbiased estimate of η if and only if E{y} = η for all values of η.
(c) Asymptotically Efficient Estimates and Efficient Estimates. It is desirable to employ estimates whose sampling distributions cluster as densely as possible about the desired parameter value, i.e., estimates with small variance Var {y}, or small standard deviation (standard error of the estimate) √Var {y}. For the important special class of estimates y(x1, x2, . . . , xn) whose sampling distributions are asymptotically normal with mean η and variance constant/n = λ/n (see also Sec. 19.5-2), λ has the lower bound

λmin = 1/E{[∂ ln φ(x; η)/∂η]2}    (19.4-1)
The asymptotic efficiency e∞{y} = λmin/λ of such an estimate y measures the concentration of the asymptotic sampling distribution about the parameter η; y is an asymptotically efficient estimate of the parameter η if and only if λ = λmin. Every asymptotically efficient estimate is consistent.
More generally, the “relative efficiency” of the estimate y(x1, x2, . . . , xn) of η for a given sample size n is measured by the reciprocal of the mean-square deviation E{(y — η)2}. Under quite general conditions (Ref. 19.4, Chap. 32), the mean-square deviations E{(y — η)2} of the various possible estimates y of a given parameter η have a lower bound given by

E{(y — η)2} ≥ b2(η) + [1 + b′(η)]2 λmin/n    (19.4-2)
(Cramér-Rao Inequality). For unbiased estimates y, Eq. (2) reduces to

Var {y} ≥ λmin/n    (19.4-3)
The efficiency e{y} = λmin/[n Var {y}] of an unbiased estimate satisfying Eq. (3) measures the concentration of the sampling distribution about η; its limit as n → ∞ is again called the asymptotic efficiency, if this quantity exists. A (necessarily unbiased and consistent) estimate y is an efficient estimate of η if and only if Var {y} exists and equals the lower bound λmin/n.
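For a normal population the bound is attained by the sample average, which a short simulation confirms (Python sketch; parameter values are illustrative):

```python
import random
import statistics

# For a normal population, the Fisher information per observation for the
# mean xi is 1/sigma^2, so the Cramér-Rao bound for unbiased estimates of xi
# is lambda_min/n = sigma^2/n. The sample average attains it (efficiency 1).
random.seed(2)
sigma, n, trials = 2.0, 25, 4000
means = [sum(random.gauss(10.0, sigma) for _ in range(n)) / n
         for _ in range(trials)]
bound = sigma ** 2 / n                       # = 0.16
observed = statistics.pvariance(means)
assert abs(observed - bound) < 0.3 * bound   # Var{x_bar} sits at the bound
```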
(d) Sufficient Estimates. An estimate y(x1, x2, . . . , xn) of the parameter η is a sufficient estimate if and only if the likelihood function for one sample (Sec. 19.1-2) can be written in the form

L(x1, x2, . . . , xn; η) = L1[y(x1, x2, . . . , xn); η]L2(x1, x2, . . . , xn)    (19.4-4)

where L2 is functionally independent (Sec. 4.5-6) of η. In this case, the conditional probability distribution of (x1, x2, . . . , xn) on the hypothesis [y = Y] is independent of η, so that a sufficient estimate y of η embodies all the information about η which the given sample can supply. Efficient estimates are necessarily sufficient.
(e) Generalizations. Equations (1) to (3) apply to discrete probability distributions if the probability p(x1, x2, . . . , xn) is substituted for the probability density φ(x1, x2, . . . , xn). The theory of Secs. 19.4-1a to d applies without change to populations described by multidimensional random variables.
A set of m suitable unbiased estimates y1, y2, . . . , ym are joint efficient estimates of m corresponding population parameters η1, η2, . . . , ηm if the concentration ellipsoid (Sec. 18.4-8c) of the joint sampling distribution coincides with a “maximum concentration ellipsoid” analogous to the minimum variance defined by Eq. (3). Joint asymptotically efficient estimates are similarly defined in terms of the asymptotic sampling distribution. The reciprocal of the generalized variance (Sec. 18.4-8c) associated with the joint sampling distribution of y1, y2, . . . , ym is a measure of the relative joint efficiency of the set of estimates.
To define a set of m joint sufficient estimates y1, y2, . . . , ym, it is only necessary to replace the random variable y in Eq. (4) by the m-dimensional variable (y1, y2, . . . , ym).
19.4-2. Some Properties of Statistics Used as Estimates (see also Sec. 19.2-4). (a) Functions of Moments. Every statistic expressible as a power of a rational function of the sample moments ar is a consistent estimate of the same function of the corresponding population moments αr, provided that the αr's in question exist and yield a finite function value (see also Sec. 18.3-7). Multiplication of a biased consistent estimate by a suitable function of n will often yield an unbiased consistent estimate.
An entirely analogous theorem applies to functions of the sample moments ar1r2... of multivariate samples. In particular, g1, g2, lik, rik, and r12·34 . . . m (Sec. 19.7-2) are consistent estimates of the corresponding population parameters γ1, γ2, λik, ρik, and ρ12·34 . . . m.
(b) For samples taken from a normal population:
1. x̄ is an efficient estimate of ξ.
2. x̄ and s2 are joint asymptotically efficient estimates of ξ and σ2, but s2 is biased; x̄ and S2 are joint sufficient and asymptotically efficient estimates of ξ and σ2. S2 has the efficiency (n − 1)/n.
3. If ξ is known, (1/n)[(x1 − ξ)2 + (x2 − ξ)2 + · · · + (xn − ξ)2] is an efficient estimate of σ2.
4. The sample median x1/2 has the asymptotic efficiency 2/π ≈ 0.64.
For samples taken from a binomial distribution (Table 18.8-3), x̄ is an efficient estimate of ξ.
For samples taken from a two-dimensional normal distribution (Sec. 18.8-6) with known center of gravity, the sample central moments l11, l12, and l22 (Sec. 19.7-2) are joint asymptotically efficient estimates of λ11, λ12, and λ22.
19.4-3. Derivation of Estimates: The Method of Moments. If the population distribution is described by a given function Φ(x; η1, η2, . . .), φ(x; η1, η2, . . .) or p(x; η1, η2, . . .) where η1, η2, . . . are parameters to be determined, each population characteristic like E{x}, Var {x}, αr, etc., is a function of the parameters η1, η2, ... . In particular
if these quantities exist. The method of moments defines (joint) estimates y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . , ym(x1, x2, . . . , xn) of m corresponding population parameters η1, η2, . . . , ηm by the m equations

ar = αr(y1, y2, . . . , ym)        (r = 1, 2, . . . , m)    (19.4-5)

obtained on equating the first m sample moments ar to the corresponding population moments αr. The resulting estimates yk are necessarily functions of the sample moments (see also Sec. 19.4-2a).
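A minimal worked instance of the method (Python sketch; the uniform-distribution example is illustrative, not from the handbook):

```python
import math

def moments_fit_uniform(xs):
    """Method-of-moments estimates (a, b) for a uniform distribution on [a, b]:
    equate the sample mean and central moment m2 to the population values
    (a + b)/2 and (b - a)^2/12."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    half_width = math.sqrt(3.0 * m2)        # (b - a)/2 = sqrt(3 m2)
    return mean - half_width, mean + half_width

a, b = moments_fit_uniform([0.1, 0.5, 0.9])
assert abs((a + b) / 2 - 0.5) < 1e-12   # fitted mean reproduces the sample mean
assert a < 0.1 and b > 0.9              # fitted interval covers the sample here
```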
19.4-4. The Method of Maximum Likelihood. For any given sample (x1, x2, . . . , xn), the value of the likelihood function L(x1, x2, . . . , xn) (Sec. 19.1-2) is a function of the unknown parameters η1, η2, . . . . The method of maximum likelihood estimates each parameter ηk by a corresponding trial function yk(x1, x2, . . . , xn) chosen so that L(x1, x2, . . . , xn; y1, y2, . . .) is as large as possible for each sample (x1, x2, . . . , xn). One attempts to obtain a set of m (joint) maximum-likelihood estimates y1(x1, x2, . . . , xn), y2(x1, x2, . . . , xn), . . . , ym(x1, x2, . . . , xn) as nontrivial solutions of the m likelihood equations

∂ ln L/∂yk = 0        (k = 1, 2, . . . , m)    (19.4-6)

which constitute necessary conditions for a maximum of the likelihood function if the latter is suitably differentiable (Sec. 11.3-3).
Although the maximum-likelihood method often involves more complicated computations than the moment method (Sec. 19.4-3), maximum-likelihood estimates may be preferable, particularly in the case of small samples, because
1. If an efficient estimate (or a set of joint efficient estimates) exists, it will appear as a unique solution of the likelihood equation or equations (6).
2. If a sufficient estimate (or a set of sufficient estimates) exists, every solution of the likelihood equation or equations (6) will be a function of this estimate or estimates.
In addition, under quite general conditions (Ref. 19.4, Chap. 33), the likelihood equations (6) have a solution yielding consistent, asymptotically normal, and asymptotically efficient estimates.
EXAMPLE: If x is normally distributed, the maximum-likelihood estimate X = x̄ for ξ is an efficient estimate and minimizes (x1 − X)2 + (x2 − X)2 + · · · + (xn − X)2 (method of least squares in the theory of errors).
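The least-squares property of the normal maximum-likelihood estimate can be checked directly (Python sketch; names are illustrative):

```python
def neg_log_likelihood(xi, xs, sigma=1.0):
    """Negative log-likelihood of a normal sample as a function of the mean xi,
    up to an additive constant: sum((x_k - xi)^2) / (2 sigma^2)."""
    return sum((x - xi) ** 2 for x in xs) / (2.0 * sigma ** 2)

xs = [1.0, 2.0, 6.0]
x_bar = sum(xs) / len(xs)   # = 3.0

# The sample average minimizes the sum of squares, hence maximizes likelihood.
assert neg_log_likelihood(x_bar, xs) < neg_log_likelihood(x_bar + 0.1, xs)
assert neg_log_likelihood(x_bar, xs) < neg_log_likelihood(x_bar - 0.1, xs)
```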
Note that the maximum-likelihood method applies also to multidimensional populations and applies even in the case of nonrandom samples.
19.4-5. Other Methods of Estimation. A number of methods usually employed to test the goodness of fit of an estimate or estimates can be modified to infer parameter values (Sec. 19.6-7; see also Secs. 19.9-1 to 19.9-5).
19.5. SAMPLING DISTRIBUTIONS
19.5-1. Introduction (see also Secs. 19.1-2 and 19.1-3). This section lists properties of a class of statistics frequently used as consistent estimates of corresponding population parameters. Section 19.5-2 deals with the approximate computation of sampling distributions for large samples, while Secs. 19.5-3 and 19.5-4b are concerned with the distributions of statistics derived from normal populations.
19.5-2. Asymptotically Normal Sampling Distributions (see also Sec. 18.6-4). The following theorems are derived from the limit theorems of Sec. 18.6-5 and permit one to approximate the sampling distributions of many statistics by normal distributions if the sample size is sufficiently large.
(a) Let y(x1, x2, . . . , xn) be any statistic expressible as a function of the sample moments mk such that y = f(m1, m2, . . .) exists and is twice continuously differentiable throughout a neighborhood of m1 = μ1, m2 = μ2, . . . . Then the sampling distribution of y is asymptotically normal as n → ∞, with mean f(μ1, μ2, . . .) and a variance of order 1/n obtained from the first-order expansion of f about (μ1, μ2, . . .). The theorem applies, in particular, to the sample mean x̄, to the sample variances s2 and S2, and to the sample moments ar and mr (Sec. 19.2-4). An analogous theorem applies to multidimensional populations.
(b) The distribution of each sample fractile Xp is asymptotically normal with mean xp and variance p(1 − p)/{n[φ(xp)]2}, provided that (1) the population fractile xp is unique, and (2) the probability density φ(x) = Φ′(x) exists and is continuous throughout a neighborhood of x = xp. This theorem applies, in particular, to the sample median X1/2. Under analogous conditions, the joint distribution of any set of sample fractiles (and hence, for example, the distribution of the sample interquartile range) is also asymptotically normal.
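For the sample median of a standardized normal population (p = ½, density 1/√(2π) at the median) the asymptotic variance is π/(2n); a sketch with a seeded Monte Carlo check (sample size and replication count are illustrative):

```python
import math
import random
import statistics

def asymptotic_median_variance(n, density_at_median):
    """Asymptotic variance p(1 - p)/(n * phi(x_p)**2) with p = 1/2."""
    return 0.25 / (n * density_at_median**2)

# Standard normal population: median 0, density phi(0) = 1/sqrt(2*pi),
# so the asymptotic variance of the sample median is pi/(2n).
n = 25
phi0 = 1 / math.sqrt(2 * math.pi)
theory = asymptotic_median_variance(n, phi0)

# Monte Carlo check (seeded for reproducibility)
random.seed(1)
medians = [statistics.median(random.gauss(0, 1) for _ in range(n))
           for _ in range(4000)]
empirical = statistics.variance(medians)
```

The empirical variance agrees with π/(2n) up to Monte Carlo noise and the finite-sample correction, which is small already for n = 25.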
19.5-3. Samples Drawn from Normal Populations. The χ2, t, and ν2 Distributions. (a) In the case of samples drawn from a normal population (normal samples), all sample values are normal variables, and many sampling distributions can be calculated explicitly with the aid of Secs. 18.5-7 and 18.8-9. The assumption of a normal population can often be justified by the central-limit theorem (Sec. 18.6-5, e.g., errors in measurements), or the method of Sec. 19.3-2 can be used.
For any sample of size n drawn from a normal population with mean ξ and variance σ2:
4. (x̄ − ξ)√n/σ has a standardized normal distribution.
5. (x̄ − ξ)√n/S has a t distribution with n − 1 degrees of freedom.
Table 19.5-1. The χ2 Distribution with m Degrees of Freedom
(Fig. 19.5-1; see also Secs. 18.8-7, 19.5-3, and 19.6-7)
(c) Typical Interpretation. Given m statistically independent standardized normal variables x1, x2, . . . , xm, the sum
y = x1² + x2² + · · · + xm²
has a χ2 distribution with m degrees of freedom.
(d) The fractiles yP of y will be denoted by χ2P or χ2P(m); published tables frequently show χ21−α(m) as a function of the level of significance α.
(e) Approximations. As m → ∞,
y is asymptotically normal with mean m and variance 2m;
y/m is asymptotically normal with mean 1 and variance 2/m;
√(2y) is asymptotically normal with mean √(2m − 1) and variance 1.
A useful approximation based on the last item listed above is
χ2P(m) ≈ ½(zP + √(2m − 1))²
where zP is the corresponding fractile of the standardized normal distribution. This approximation is worst if P is either small or large. A better approximation is given by the Wilson-Hilferty formula
χ2P(m) ≈ m[1 − 2/(9m) + zP√(2/(9m))]³
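Both approximations can be checked against a tabulated value; the sketch below assumes the "better approximation" is the Wilson-Hilferty cube formula and compares both with the table value χ2 at P = 0.95, m = 10, which is approximately 18.307:

```python
import math
from statistics import NormalDist

def chi2_fractile_simple(p, m):
    """chi2_P(m) ~ (z_P + sqrt(2m - 1))**2 / 2."""
    z = NormalDist().inv_cdf(p)
    return 0.5 * (z + math.sqrt(2 * m - 1))**2

def chi2_fractile_wh(p, m):
    """Wilson-Hilferty: chi2_P(m) ~ m*(1 - 2/(9m) + z_P*sqrt(2/(9m)))**3."""
    z = NormalDist().inv_cdf(p)
    return m * (1 - 2 / (9 * m) + z * math.sqrt(2 / (9 * m)))**3

table_value = 18.307      # chi2_0.95 with 10 degrees of freedom (from tables)
```

For P = 0.95, m = 10 the simple formula is off by about 0.3, the Wilson-Hilferty formula by less than 0.02.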
6. has an r distribution with m = n — 2 (Sec. 19.7-4).
7. Σk(xk − ξ)²/σ² has a χ2 distribution with n degrees of freedom.
Table 19.5-2. Student's t Distribution with m Degrees of Freedom
(Fig. 19.5-2; see also Secs. 19.5-3 and 19.6-6)
(c) Typical Interpretation. y is distributed like the ratio
t = x0/√[(x1² + x2² + · · · + xm²)/m]
where x0, x1, x2, . . . , xm are m + 1 statistically independent normal variables, each having the mean 0 and the variance σ². Note that t is independent of σ².
(d) The fractiles yP of y will be denoted by tP; note tP = −t1−P. The distribution of |y| = |t| is related to the t distribution by
P[|t| ≤ T] = 2P[y ≤ T] − 1    (T ≥ 0)
The fractiles |t|1−α (α values of t) are often tabulated for use in statistical tests; note that some published tables denote |t|1−α by tα.
(e) Approximations. As m → ∞, y is asymptotically normal with mean 0 and variance 1, so that (Sec. 18.8-4b) the fractiles tP may be replaced by the corresponding standardized normal fractiles for m > 30.
Table 19.5-3. The Variance-ratio Distribution (ν2 Distribution, Snedecor F Distribution, ω2 Distribution) and Related Distributions
(c) Typical Interpretation. y is distributed like the ratio v2 of two random variables having χ2 distributions with m and m′ degrees of freedom (Table 19.5-1), each divided by its number of degrees of freedom, or
v² = [(x1² + x2² + · · · + xm²)/m]/[(x1′² + x2′² + · · · + xm′²)/m′]
where the xi and xk′ are m + m′ statistically independent normal variables each having the mean 0 and the variance σ². Note that v2 is independent of σ².
(d) Change of Variables. The distribution of the variance ratio v2 is also described by other random variables, viz.
(e) Published tables usually present the fractiles v21−α(m, m′) or z1−α(m, m′) as functions of α for various values of m, m′.
Table 19.6-1 lists additional formulas. Tables 19.5-1 to 19.5-3 detail the properties of the χ2, t, and ν2 distributions; fractiles of these distributions are available in tabular form.
(b) The sample mean x̄ and the sample variance s2 are statistically independent if and only if the sample in question is drawn from a normal population.
For every normal sample, x̄, s2, and mrm2−r/2 are statistically independent for all r, and g1 = m3m2−3/2, g2 = m4m2−2 − 3 yield measures of the sample skewness and excess (Sec. 19.2-4).
19.5-4. The Distribution of the Sample Range (see also Secs. 19.2-6 and 19.7-6). (a) For every continuous distribution, the frequency function of the range w for a random sample of size n is
This function has been tabulated for a number of population distributions (Ref. 19.8).
(b) For normal populations, both the mean and the dispersion of w are multiples of the population dispersion σ:
kn, cn and cn/kn have been tabulated as functions of n (Ref. 19.3); w/kn is an unbiased estimate of σ. The average range
obtained from a set of m random samples of size n is asymptotically normal with mean knσ and variance cn2σ2/m as m → ∞; w̄/kn is an unbiased, consistent estimate of σ.
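A sketch of the range-based estimate w̄/kn on simulated data; the kn values below are the standard control-chart constants (usually tabulated as d2), and the sample sizes are illustrative:

```python
import random
import statistics

# k_n = E{w}/sigma for normal samples of size n (standard control-chart d2 values)
K_N = {2: 1.128, 3: 1.693, 4: 2.059, 5: 2.326}

def sigma_from_ranges(samples, n):
    """Unbiased range-based estimate of sigma: (average range) / k_n."""
    wbar = statistics.mean(max(s) - min(s) for s in samples)
    return wbar / K_N[n]

random.seed(2)
n, m, sigma = 5, 2000, 3.0
samples = [[random.gauss(10, sigma) for _ in range(n)] for _ in range(m)]
est = sigma_from_ranges(samples, n)    # should be close to sigma = 3.0
```

With m = 2000 samples the estimate lies within a few per cent of the true σ, in line with the stated asymptotic variance cn²σ²/m of the average range.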
(c) For a uniform (rectangular) population distribution in the interval (a,b)
and w(n + 1)/(n − 1) is a consistent, unbiased estimate of b − a. Note also that the arithmetic mean of the smallest and the largest sample value is a consistent, unbiased estimate of E{x} (Ref. 19.4).
(d) For any continuous population distribution, the probability that at least a fraction q of the population lies between the extreme values xmin, xmax of a given random sample of size n is
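A sketch assuming the standard distribution-free form of this probability, P = 1 − nq^(n−1) + (n − 1)q^n:

```python
def coverage_probability(n, q):
    """P{at least a fraction q of the population lies between x_min and x_max}
    for a random sample of size n from any continuous distribution
    (assumed standard form: 1 - n*q**(n-1) + (n-1)*q**n)."""
    return 1 - n * q**(n - 1) + (n - 1) * q**n

def required_sample_size(q, p):
    """Smallest n whose sample extremes cover a fraction q of the population
    with probability at least p."""
    n = 2
    while coverage_probability(n, q) < p:
        n += 1
    return n
```

For example, required_sample_size(0.95, 0.95) gives 93: the familiar distribution-free "95/95" sample size.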
19.5-5. Sampling from Finite Populations. Given a finite population of N elements (events, results of observations) E1, E2, . . . , EN, each labeled with one of the M ≤ N spectral values X(1), X(2), . . . , X(M) of a (necessarily discrete) random variable x, let p(X(i)) = Ni/N, where Ni is the number of elements Ek labeled by X(i). For a random sample of size n ≤ N (without replacement), n/N is called the sampling ratio, and one has
These formulas reduce to those given in Secs. 19.2-3 and 19.2-4 for N → ∞.
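For a small population these relations can be verified by enumerating every possible sample; the sketch below assumes the standard finite-population correction Var{x̄} = (σ²/n)(N − n)/(N − 1) and checks it exactly on a toy population:

```python
from fractions import Fraction
from itertools import combinations

population = [1, 2, 3, 4, 5]
N, n = len(population), 2
mean = Fraction(sum(population), N)
var = sum((x - mean)**2 for x in population) / N     # population variance sigma^2

# Enumerate every equally likely sample of size n drawn without replacement.
sample_means = [Fraction(sum(s), n) for s in combinations(population, n)]
m = sum(sample_means) / len(sample_means)
v = sum((x - m)**2 for x in sample_means) / len(sample_means)

# Finite-population correction (assumed standard form)
v_theory = (var / n) * Fraction(N - n, N - 1)
```

Exact rational arithmetic shows E{x̄} equals the population mean and Var{x̄} equals (σ²/n)(N − n)/(N − 1), here 3/4.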
19.6. CLASSICAL STATISTICAL TESTS
19.6-1. Statistical Hypotheses. Consider a “space” of samples (sample points) (x1, x2, . . . , xn), where x1, x2, . . . , xn are numerical random variables. Every self-consistent set of assumptions involving the joint distribution of x1, x2, . . . , xn is a statistical hypothesis. A hypothesis H is a simple statistical hypothesis if it defines the probability distribution uniquely; otherwise it is a composite statistical hypothesis.
More specifically, let the joint distribution of x1, x2, . . . , xn be defined by Φ(x1, x2, . . . , xn; η1, η2, . . . ), φ(x1, x2, . . . , xn; η1, η2, . . .),or p(x1, x2, . . . xn; η1, η2,. . .), where η1, η2, . . . are parameters (see also Sec. 19.1-3). Then a simple statistical hypothesis assigns definite values η10, η20, ... to the respective parameters η1, η2, . . . (“point” in parameter space), whereas a composite statistical hypothesis confines the “points” (η1, η2, . . .) to a set or region in parameter space. The class of admissible statistical hypotheses (admissible parameter combinations) is restricted by the context of the problem in question.
If (x1, x2, . . . , xn) is a random sample drawn from a single theoretical population (i.e., if the xk are statistically independent with identical marginal distributions), statistical hypotheses refer to values of, or relations between, population parameters. Note, however, that the theory of Secs. 19.6-1 to 19.6-2 is not limited to random samples (x1, x2, . . . , xn) but applies to samples from any random process.
19.6-2. Fixed-sample Tests: Definitions. Given a fixed sample size n, a test of the statistical hypothesis H is a rule rejecting or accepting the hypothesis H on the basis of a test sample (X1, X2, . . . , Xn). Each test specifies a critical set (critical region, rejection region) Sc of “points” (x1, x2, . . . , xn) such that H will be rejected if the test sample (X1, X2, . . . , Xn) belongs to the critical set; otherwise H is accepted.
Such rejection or acceptance does not constitute a logical disproof or proof, even if the sample is infinitely large. Four possible events arise:
1. H is true and is accepted by the test.
2. H is false and is rejected by the test.
3. H is true but is rejected by the test (error of the first kind).
4. H is false but is accepted by the test (error of the second kind).
For any set of true (actual) parameter values η1,η2, . . . , the probability that a critical region Sc will reject the hypothesis tested is
Whenever the hypothesis H1 ≡ [η1 = η11, η2 = η21, . . .] contradicts the hypothesis tested, the rejection probability πSc(η11, η21, . . .) is called the power (power function) of the test defined by Sc with respect to the alternative (simple) hypothesis H1 (see also Fig. 19.6-1a). A graph of the correct-acceptance probability 1 − β against the false-rejection probability α is called the operating characteristic of the test (Fig. 19.6-1b; see also Sec. 19.6-3).
19.6-3. Level of Significance. Neyman-Pearson Criteria for Choosing Tests of Simple Hypotheses. (a) It is, generally speaking, desirable to use a critical region Sc such that πSc(η1, η2, . . .) is small for parameter combinations η1, η2, . . . admitted by the hypothesis tested, and as large as possible for other parameter combinations. Given a critical region Sc used to test the simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] (“null hypothesis”), let H0 be true. Then the probability of falsely rejecting H0 (error of the first kind) is πSc(η10, η20, . . .) = α. α is called the level of significance of the test: the critical region Sc tests the simple hypothesis H0 at the level of significance α.
In the case of discrete random variables x1, x2, . . . , xn, one cannot, in general, specify α at will, but an upper bound for α may be given. A critical region used to test a composite hypothesis H ≡ [(η1, η2, . . .) ∈ D] will, in general, yield different levels of significance for different simple hypotheses (parameter combinations η1, η2, . . .) admitted by H; one may specify the least upper bound of these levels of significance.
(b) For each given sample size n and level of significance α
1. A most powerful test of the simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] relative to the simple alternative H1 ≡ [η1 = η11, η2 = η21, . . .] is defined by the critical region Sc which yields the largest value of 1 − β = πSc(η11, η21, . . .).
2. A uniformly most powerful test is most powerful relative to every admissible alternative hypothesis H1; such a test does not always exist.
3. A test is unbiased if πSc(η11, η21, . . .) ≥ α for every alternative simple hypothesis H1; otherwise the test is biased. A most powerful unbiased test relative to a given alternative H1 and a uniformly most powerful unbiased test may be defined as above.
To construct the critical region Sc for a most powerful test, use all sample points (x1, x2, . . . , xn) such that the likelihood ratio φ(x1, x2, . . . , xn; η10, η20, . . .)/φ(x1, x2, . . . , xn; η11, η21, . . .) or p(x1, x2, . . . , xn; η10, η20, . . .)/p(x1, x2, . . . , xn; η11, η21, . . .) is less than some fixed constant c; different values of c will yield “best” critical regions at different levels of significance α. Uniformly most powerful tests are of particular interest if one desires to test H0 against a composite alternative hypothesis. A uniformly most powerful unbiased test may exist even though no uniformly most powerful test exists (see also Ref. 19.4). In practice, ease of computation may be the deciding factor in a choice among several possible tests; one can usually increase the power of each test by increasing the sample size n (see also Sec. 19.6-9).
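For a normal population with known σ, the likelihood ratio is a monotone function of the sample mean, so the Neyman-Pearson critical region reduces to a one-sided region on x̄. A sketch with illustrative data and parameter values:

```python
import math

def likelihood_ratio(sample, xi0, xi1, sigma=1.0):
    """phi(sample; xi0) / phi(sample; xi1) for a normal sample with known sigma.
    For xi1 > xi0 this ratio is a decreasing function of the sample sum,
    so 'ratio < c' is equivalent to 'sample mean > some constant'."""
    def log_phi(xi):
        return -sum((x - xi)**2 for x in sample) / (2 * sigma**2)
    return math.exp(log_phi(xi0) - log_phi(xi1))

# Shifting the whole sample upward lowers the likelihood ratio monotonically,
# illustrating that the best critical region is one-sided in the sample mean.
base = [0.1, -0.3, 0.4, 0.0]
ratios = [likelihood_ratio([x + shift for x in base], 0.0, 1.0)
          for shift in (0.0, 0.5, 1.0, 1.5)]
```

The monotone relation between the likelihood ratio and the sample mean is why the threshold c translates into a fractile of the sampling distribution of x̄.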
FIG. 19.6-1a. Test of the null hypothesis H0 against a simple alternative hypothesis H1 in terms of a test statistic y = y(x1, x2, . . . , xn).
FIG. 19.6-1b. Operating characteristic of a test (see also Fig. 19.6-1a).
Table 19.6-1. Some Tests of Significance Relating to the Parameters ξ, σ2 of a Normal Population (see also Secs. 19.5-3a and 19.6-4). Tests are based on random samples. Obtain fractiles from published tables; the approximations given in Tables 19.5-1 and 19.5-2 apply if the sample size n is large*
* Note that published tables often tabulate χ21−α rather than χ2α; check carefully on the notation used in each case.
19.6-4. Tests of Significance. Many applications require one to test a hypothetical population property specified in terms of a set of parameter values η1= η10, η2= η20, . . . against a corresponding sample property described by the respective estimates y1(x1, x2 . . . ,xn), y2(x1, x2, . . . , xn), . . . of η1, η2, . . . . One attempts to construct a “test statistic”
whose values measure a deviation or ratio comparing the sample property to the hypothetical population property for each test sample (x1, x2, . . . , xn). The simple hypothesis H0 ≡ [η1 = η10, η2 = η20, . . .] is then rejected at a given level of significance α (the “deviation” is significant) whenever the sample value of y falls outside an acceptance interval yP1 ≤ y ≤ yP2 such that
Tests defined in this manner are often called tests of significance. Equation (3) specifies yp1 = yp1(η10, η20, . . .) and yP2 = yP2(η10, η20, . . .) as fractiles of the sampling distribution of y(x1, x2, . . . , xn;η10, η20, . . .). It is frequently possible to choose a test statistic y such that its fractiles yp are independent of η10, η20, . . . . Table 19.6-1 and Secs. 19.6-6 and 19.6-7 list a number of important examples (see also Fig. 19.6-1).
In quality-control applications, the acceptance limits yP1 and yP2 defined by Eq. (3) are called tolerance limits at the tolerance level α, and the interval [yP1, yp2] is called a tolerance interval (see also Fig. 19.6-1).
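A minimal sketch of such a test of significance for the mean of a normal population with known σ (illustrative data; the acceptance limits are the normal fractiles at α/2 and 1 − α/2):

```python
import math
from statistics import NormalDist

def z_test(sample, xi0, sigma, alpha=0.05):
    """Two-sided test of H0: xi = xi0 for a normal population with known sigma.
    The statistic (xbar - xi0)*sqrt(n)/sigma is standardized normal under H0;
    H0 is rejected when it falls outside the acceptance interval."""
    n = len(sample)
    xbar = sum(sample) / n
    y = (xbar - xi0) * math.sqrt(n) / sigma
    z = NormalDist().inv_cdf(1 - alpha / 2)     # acceptance limit
    return abs(y) > z                           # True -> deviation is significant

sample = [10.8, 11.2, 10.9, 11.4, 11.1, 10.7, 11.3, 11.0]   # illustrative data
```

Here the fractiles of the test statistic are independent of the hypothetical parameter values, as the text notes is often achievable.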
19.6-5. Confidence Regions. (a) Assume that one has constructed a family of critical regions Sα(η1, η2, . . .) capable of testing a corresponding set of simple hypotheses (admissible parameter combinations) (η1, η2, . . .) at some given level of significance α.† Then for any fixed test sample (x1 = X1, x2 = X2, . . . , xn = Xn), the set Dα(X1, X2, . . . ,
† In the case of discrete distributions, the critical regions Sα(η1, η2, . . .) will, in general, be defined for a level of significance less than or equal to α.
Xn) of parameter combinations (η1, η2 . . .) (“points” in parameter space) accepted on the basis of the given sample is a confidence region at the confidence level α; 1 — α is called the confidence coefficient. The confidence region comprises all admissible parameter combinations whose acceptance on the basis of the given sample is associated with a probability P[(X1, X2, . . . , Xn) not in Sα(η1, η2 . . .)] at least equal to the confidence coefficient 1 — α.
The method of confidence regions relates a given sample to a corresponding set of “probable” parameter combinations without the necessity of assigning “inverse probabilities” to essentially nonrandom parameter combinations (see also Sec. 19.6-9).
(b) Confidence Intervals Based on Tests of Significance (see also Sec. 19.6-4). To find confidence regions relating values of one of the unknown parameters, say η1, to the given sample value Y = y(X1, X2,. . . , Xn) of a suitable test statistic y, refer to Fig. 19.6-1. Plot lower
and upper acceptance limits (tolerance limits) yP1(η1) and yP2(η1) against η1 for a given level of significance α.* The intersections of these acceptance-limit curves with each line y = Y define upper and lower confidence limits (fiducial limits) η1 = γ2(Y), η1 = γ1(Y) bounding a confidence interval Dα ≡ [γ1, γ2] at the confidence level α. The confidence interval comprises those values of η1 whose acceptance on the basis of the sample value y = Y is associated with a probability P[yP1(η1) ≤ Y ≤ yP2(η1)] at least equal to the confidence coefficient 1 − α.
Table 19.6-2. Confidence Intervals for Normal Populations
Use the approximations given in Tables 19.5-1 to 19.5-3 for large n
19.6-6. Tests for Comparing Normal Populations. Analysis of Variance. (a) Pooled-sample Statistics. Consider r statistically independent random samples (xi1, xi2, . . . , xini) drawn, respectively, from r theoretical populations with means ξi and variances σi2 (i = 1, 2, . . . , r). The ith sample is of size ni and has the mean x̄i and variance si2
The r samples may be regarded as a “pooled sample” whose size, mean, and variance are given by
The statistics S02 (pooled variance) and SA2 are measures of dispersion within samples and between samples, respectively.
(b) Comparison of Normal Populations (see also Tables 19.5-1 to 19.5-3). If the r independent samples are drawn from normal populations with means ξi and identical variances σi2 = σ2, then S2, S02, and SA2 are all consistent and unbiased estimates (Sec. 19.4-1) of the (usually unknown) population variance σ2. (n — r)S02/σ2 and (r — 1)SA2/σ2 are statistically independent random variables respectively distributed like χ2 with n — r and r — 1 degrees of freedom. Note the sampling distributions of the following test statistics:
Table 19.6-3 shows the use of these test statistics in tests of significance (Sec. 19.6-4) comparing normal populations. It is also possible to calculate confidence limits (Sec. 19.6-5) for the difference ξi — ξk from the t distribution. The case of normal populations with different variances is discussed in Ref. 19.8.
(c) Analysis of Variance. The third test of Table 19.6-3 compares the mean values ξi by partitioning the over-all variance S2 into components S02 and SA2 respectively due to statistical fluctuation within samples and to differences between samples. This technique is known as analysis of variance; the particular case in question involves a one-way classification of samples corresponding to values of the index i. Many similar tests are used to analyze the effects of different medications, soil treatments, etc. (Refs. 19.3, 19.4, and 19.8).
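The one-way partition just described can be sketched in a few lines of Python (illustrative data; S0² and SA² follow the definitions of Sec. 19.6-6a, and their ratio is referred to the ν2 distribution with m = r − 1, m′ = n − r):

```python
def one_way_anova(groups):
    """Partition pooled-sample dispersion into a within-sample component (S0^2)
    and a between-sample component (SA^2); return (S0^2, SA^2, their ratio)."""
    r = len(groups)
    n = sum(len(g) for g in groups)
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_within = sum(sum((x - mi)**2 for x in g)
                    for g, mi in zip(groups, means))
    ss_between = sum(len(g) * (mi - grand)**2
                     for g, mi in zip(groups, means))
    s0_sq = ss_within / (n - r)        # pooled variance, n - r d.f.
    sa_sq = ss_between / (r - 1)       # between-sample variance, r - 1 d.f.
    return s0_sq, sa_sq, sa_sq / s0_sq # ratio ~ v^2(r - 1, n - r) under H0

groups = [[1, 2, 3], [2, 3, 4], [6, 7, 8]]   # illustrative data
```

For the data shown, S0² = 1 and SA² = 21; a variance ratio this large signals a significant difference between the group means.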
Analysis of Variance for Two-way Classification (Randomized Blocks). Consider rq sample values xik arranged in an array
and introduce row averages x̄i., column averages x̄.k, and the over-all average x̄ by
x̄i. = (1/q)Σk xik    x̄.k = (1/r)Σi xik    x̄ = (1/rq)ΣiΣk xik
The over-all variance is partitioned as follows:
If the random variables xik are normal with identical variances, S02, S2Row, and S2Col are statistically independent; (r − 1)(q − 1)S02/σ2, (r − 1)S2Row/σ2, and (q − 1)S2Col/σ2 are respectively distributed like χ2 with (r − 1)(q − 1), r − 1, and q − 1 degrees of freedom. The test statistic S2Row/S02 is, then, distributed like ν2 with m = r − 1, m′ = (r − 1)(q − 1) and serves to test the equality of the row means x̄i. in the manner of Table 19.6-3. Similarly, S2Col/S02 is distributed like ν2 with m = q − 1, m′ = (r − 1)(q − 1) and serves to test the equality of the column means x̄.k.
Table 19.6-3. Significance Tests for Comparing Normal Populations
(refer to Sec. 19.6-6; see also Tables 19.5-2 and 19.5-3)
19.6-7. The χ2 Test for Goodness of Fit (see also Table 19.5-1). (a) The χ2 test checks the “fit” of the hypothetical probabilities pk = p[Ek] associated with r simple events E1, E2, . . . , Er to their relative frequencies hk = h[Ek] = nk/n in a sample of n independent observations. In many applications, each Ek is the event that some random variable x falls within a class interval (Sec. 19.2-2), so that the test compares the hypothetical theoretical distribution of x with its empirical distribution.
The goodness of fit is measured by the test statistic
y converges in probability to χ2 with m = r − 1 degrees of freedom as n → ∞. If all npk > 10 (pool some class intervals if necessary), the resulting test rejects the hypothetical probabilities p1, p2, . . . , pr at the level of significance α whenever the test sample yields y > χ21−α(m); for m > 30 one may replace the χ2 distribution by a normal distribution with the indicated mean and variance.
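A sketch of the test for illustrative die-throw counts; 11.07 is the commonly tabulated fractile χ2 at P = 0.95 with 5 degrees of freedom:

```python
def chi2_statistic(counts, probs):
    """Goodness-of-fit statistic y = sum (n_k - n p_k)^2 / (n p_k)."""
    n = sum(counts)
    return sum((nk - n * pk)**2 / (n * pk)
               for nk, pk in zip(counts, probs))

# 60 throws of a die tested against the hypothesis p_k = 1/6 (n p_k = 10):
counts = [5, 8, 9, 8, 10, 20]       # illustrative observed frequencies
probs = [1 / 6] * 6
y = chi2_statistic(counts, probs)
# Table value chi2_0.95(5) = 11.07; here y exceeds it, so the hypothetical
# probabilities are rejected at alpha = 0.05.
```

Here m = r − 1 = 5 because no parameters were estimated from the sample; with q estimated parameters m would drop to r − q − 1, as described in (b) below.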
(b) The χ2 Test with Estimated Parameters. If the hypothetical probabilities pk depend on a set of q unknown population parameters η1, η2, . . . , ηq, obtain their joint maximum-likelihood estimates from the given sample (Sec. 19.4-4) and insert the resulting values of pk = pk(η1, η2, . . . , ηq) into Eq. (9). The test statistic y will then converge in probability to χ2 with m = r — q — 1 degrees of freedom under very general conditions (see below), and the χ2 test applies with m = r — q — 1. Tests of this type check the applicability of a normal distribution, Poisson distribution, etc., with unspecified parameters.
The theorem applies whenever the given functions pk(η1, η2, . . . , ηq) satisfy the following conditions throughout a neighborhood of the joint maximum-likelihood “point” (η1, η2 . . . , ηq):
The pk(η1, η2, . . . , ηq) have a common positive lower bound, are twice continuously differentiable, and the matrix [∂pk/∂ηi] is of rank q.
19.6-8. Parameter-free Comparison of Two Populations: The Sign Test. One desires to test the hypothesis that two random variables x and y have identical probability distributions, or
assuming that it is known that P[x = y] = 0. Consider a random sample of n pairs (x1, y1; x2, y2; . . . ; xn, yn); neglect any pairs with xi = yi (ties) in computing the sample size n. The probability that more than m differences xi − yi are positive is
Now let mα be the smallest value of m such that p(m) ≤ α and
Reject the hypothesis (1) at the level of significance ≤ α whenever the number of positive differences xi − yi exceeds mα (One-tailed Sign Test), or
Reject the hypothesis at the level of significance 2α if the number of positive or negative differences exceeds mα (Two-tailed Sign Test).
mα has been tabulated against α and n (Ref. 19.15). The sign test can also be used to test (1) the symmetry of a probability distribution; (2) the hypothesis that x = X is the median of a distribution (Ref. 19.15).
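The one-tailed critical value mα can be computed directly from the binomial tail with p = ½ (a sketch; math.comb requires Python 3.8+):

```python
from math import comb

def p_more_than(m, n):
    """p(m) = P{more than m of the n differences are positive}
    when P[x > y] = P[x < y] = 1/2."""
    return sum(comb(n, k) for k in range(m + 1, n + 1)) / 2**n

def m_alpha(n, alpha):
    """Smallest m with p(m) <= alpha (one-tailed critical value)."""
    m = 0
    while p_more_than(m, n) > alpha:
        m += 1
    return m
```

For example, with n = 10 and α = 0.05 the critical value is mα = 8: the hypothesis is rejected one-tailed only if 9 or more of the 10 differences are positive.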
19.6-9. Generalizations (Ref. 19.17). The fixed-sample tests of Secs. 19.6-1 to 19.6-7 admit only two alternatives, viz., acceptance or rejection of a given hypothesis on the basis of a test sample. Sequential tests permit an increase in the sample size (additional observations) as a third possible decision; it is then possible to specify both the level of significance and the power of the test (relative to some alternative hypothesis) when the test hypothesis will finally be accepted or rejected.
Schemes of fixed-sample and sequential tests are special examples of statistical decision functions or rules of behavior associating a decision (set of parameters η1, η2, . . . , “point” in parameter space) with each given sample (x1, x2, . . .) of some observed quantities. In practice (operations research, detection theory), the decision function is designed so as to maximize the expected value of some measure of effectiveness (payoff function) involving the gains due to correct decisions, the losses due to incorrect decisions, and the cost of testing samples of various sizes.
19.7. SOME STATISTICS, SAMPLING DISTRIBUTIONS, AND TESTS FOR MULTIVARIATE DISTRIBUTIONS
19.7-1. Introduction. Sections 19.7-2 to 19.7-7 are in no sense an exhaustive survey of multivariate statistics but present a number of frequently used definitions, formulas, and tests for convenient reference. Note that multivariate statistics often serve to estimate and test stochastic relationships between two or more random variables.
19.7-2. Statistics Derived from Multivariate Samples. (a) Given a multidimensional random variable x ≡ (x1, x2, . . . , xν) (Sec. 18.2-9), one proceeds by analogy with Secs. 19.1-2 and 19.2-2 to 19.2-4 to introduce a random sample of size n, (x1, x2, . . . , xn) ≡ (x11, x21, . . . , xν1; x12, x22, . . . , xν2; . . . ; x1n, x2n, . . . , xνn) and the statistics
The “point” (x̄1, x̄2, . . . , x̄ν) is the sample center of gravity, and the matrix L ≡ [lij] is the sample moment matrix; det [lij] is the generalized variance of the sample.
(b) Given a two-dimensional random sample (x11, x21; x12, x22; . . . ; x1n, x2n) of a bivariate random variable (x1, x2), one defines the sample regression coefficients
The empirical linear mean-square regression of x2 on x1
is that linear function ax1 + b which minimizes the sample mean-square deviation
(see also Secs. 18.4-6b and 19.7-4c).
For ν-dimensional populations, empirical multiple and partial correlation coefficients and regression coefficients are derived from the sample moments lij by analogy with Eqs. (18.4-35) to (18.4-38) and serve as estimates of the corresponding population parameters. See Refs. 19.4 and 19.8 for the complete theory.
(c) The statistics (1) to (4) can be approximated by grouped-data estimates in the manner of Secs. 19.2-3 to 19.2-5. In particular, Sheppard's corrections for grouped-data estimates likG of the second-order central moments λik are given by
where the ΔXi are the constant class-interval lengths. These formulas often render errors due to grouping negligible if ΔXi is less than one-eighth the range of xi. Practical computation schemes are given in Refs. 19.3 and 19.8.
19.7-3. Estimation of Parameters (see also Sec. 19.4-1). Since the theorem of Sec. 19.4-2a applies to multivariate distributions, the statistics (3) to (5), as well as the sample averages x̄i, are consistent estimates of the corresponding population parameters. The mean value and variance of each sample average x̄i and sample variance lii = si2 are given by Eqs. (19.2-8), (19.2-13), and (19.2-14); in addition, note
Var {rij} is of the order of 1/n as n increases (Ref. 19.4).
For multidimensional normal distributions (see also Sec. 19.7-4) only,
19.7-4. Sampling Distributions for Normal Populations (see also Secs. 18.8-6 and 18.8-8). For random samples drawn from multivariate normal populations, sampling distributions and tests involving only the sample averages x̄i and the sample variances lii = si2 are obtained simply from Sec. 19.5-3. It remains to investigate statistics which describe stochastic relationships between different random variables xi, in particular sample correlation and regression coefficients.
(a) Distribution of the Sample Correlation Coefficient. For simplicity, consider a random sample (x11, x21; x12, x22; ... ; x1n, x2n) drawn from a two-dimensional normal population described by
(Sec. 18.8-6). The probability density of the sample correlation coefficient r12 = r (Sec. 19.7-2) is
Note that φr(n)(r) is independent of ξ1, ξ2, σ1, σ2. For n = 2, one has φr(2)(r) = 0 (−1 < r < 1), since r is either 1 or −1.
It is useful to introduce the new random variable
y = ½ ln[(1 + r)/(1 − r)] = artanh r
which for n ≥ 10 may be regarded as approximately normal with the approximate mean ½ ln[(1 + ρ)/(1 − ρ)] + ρ/[2(n − 1)] and variance 1/(n − 3).
Figure 19.7-1 illustrates the behavior of the statistics r and y for different values of ρ and n.
If y and y' are values of the statistic (17) calculated from two independent random samples of respective sizes n, n' from the same normal population, then y — y' is approximately normal with mean 0 and variance l/(n — 3) + l/(n' — 3).
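A sketch of the resulting approximate confidence interval for ρ, assuming the transformation introduced above is Fisher's y = artanh r = ½ ln[(1 + r)/(1 − r)] with large-sample variance 1/(n − 3); the sample values are illustrative:

```python
import math
from statistics import NormalDist

def rho_confidence_interval(r, n, alpha=0.05):
    """Approximate confidence interval for rho from the sample correlation r:
    treat y = artanh(r) as normal with variance 1/(n - 3), then map the
    interval for E{y} back through tanh."""
    y = math.atanh(r)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half = z / math.sqrt(n - 3)
    return math.tanh(y - half), math.tanh(y + half)

lo, hi = rho_confidence_interval(0.6, 28)   # illustrative r and n
```

The interval is asymmetric about r, reflecting the skewness of the distribution of r itself; the small bias term ρ/[2(n − 1)] is neglected in this sketch.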
(b) The r Distribution. Test for Uncorrelated Variables. In the important special case ρ = 0 (test for uncorrelated variables!), the frequency function (15) reduces to
In this case, r√(n − 2)/√(1 − r²) has a t distribution with n − 2 degrees of freedom (Table 19.5-2). The statistic
is said to have an r distribution with n — 2 degrees of freedom. The r distribution has been tabulated (Ref. 19.3) and is asymptotically normal with mean 0 and variance 1 as n → ∞. Either the t distribution or the r distribution yields tests for the hypothesis ρ = 0.
FIG. 19.7-1. Probability densities of the statistics r and y used to estimate the correlation coefficient ρ of a multidimensional normal distribution. (From Burington and May, Handbook of Probability and Statistics, McGraw-Hill, New York, 1953.)
(c) Test for Hypothetical Values of the Regression Coefficient (see also Sec. 19.7-2). Given a random sample drawn from a bivariate normal population described by Eq. (14), one tests hypothetical values of the population regression coefficient β21 = λ12/λ11 = ρ12σ2/σ1 (Sec. 18.4-6b) by means of the test statistic
which is distributed like t with n − 2 degrees of freedom. This test is far more convenient than one using the sampling distribution of the sample regression coefficient b21 (Ref. 19.8).
(d) ν-dimensional Samples. For random samples drawn from a ν-dimensional normal population described by the probability density (18.8-26), the joint distribution of the sample averages x̄1, x̄2, . . . , x̄ν is normal with mean values ξ1, ξ2, . . . , ξν and moment matrix Λ/n. The joint distribution of the x̄i is statistically independent of the joint distribution of the ν(ν + 1)/2 sample moments lij (Generalized Fisher's Theorem; see also Sec. 19.5-3b).
19.7-5. The Sample Mean-square Contingency. Contingency-table Test for Statistical Independence of Two Random Variables (see also Sec. 19.6-7). (a) Given a two-dimensional random sample (x1, y1; x2, y2; ... ; xn, yn) of a pair of random variables x, y, a contingency table arranges the n sample-value pairs (xk, yk) in a matrix of s x class intervals and r y class intervals. Let there be
ni. pairs (xk, yk) in the ith x class interval
n.j pairs (xk, yk) in the jth y class interval
nij pairs (xk, yk) both in the ith x class interval and in the jth y class interval
The statistic
measures the “degree of association” (statistical dependence) between x and y. f2 ranges between 0 and min (r, s) − 1 and reaches the latter value if and only if each row (r ≥ s) or each column (r ≤ s) contains only one element different from zero.
(b) If x and y are statistically independent, then the test statistic nf2 converges in probability to χ2 with m = (r − 1)(s − 1) degrees of freedom as n → ∞ (Table 19.5-1). If all nij > 10 (pool some class intervals if necessary), the hypothesis of statistical independence is rejected at the level of significance α by the critical region nf2 > χ21−α(m) (Sec. 19.6-3).
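The statistic nf² is computed from observed and expected cell counts; a minimal sketch with an illustrative two-by-two table:

```python
def contingency_chi2(table):
    """n*f^2 = sum over cells of (n_ij - n_i. n_.j / n)^2 / (n_i. n_.j / n),
    referred to chi2 with (r - 1)(s - 1) degrees of freedom."""
    n = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, nij in enumerate(row):
            e = row_sums[i] * col_sums[j] / n   # expected count under independence
            stat += (nij - e)**2 / e
    return stat

table = [[10, 20], [20, 10]]   # illustrative two-by-two contingency table
```

For this table nf² = 20/3 ≈ 6.67 with one degree of freedom, so the hypothesis of independence would be rejected at the usual levels.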
(c) The special case r = s = 2 (success or failure, two-by-two contingency table) is of special interest; in this case (see also Ref. 19.4).
19.7-6. Spearman's Rank Correlation. A Nonparametric Test of Statistical Dependence. Suppose that a random sample of n observation pairs (x1, y1; x2, y2; . . . ; xn, yn) yields only the information that xk is the Akth largest value of x in the sample, and yk is the Bkth largest value of y (k = 1, 2, . . . , n). If x and y are statistically independent, the test statistic
is asymptotically normal with mean 0 and variance 1/(n − 1) as n → ∞. For any value of n > 1, the hypothesis of statistical independence is rejected at the level of significance ≤ α (Sec. 19.6-3) if
NOTE: If x and y have a normal joint distribution, then 2 sin (πR/6) is a consistent estimate of their correlation coefficient ρxy.
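A sketch assuming the usual rank-difference form of Spearman's statistic, R = 1 − 6 Σk Dk²/[n(n² − 1)] with Dk = Ak − Bk and no ties:

```python
def spearman_R(x, y):
    """Spearman rank correlation R = 1 - 6*sum(d_k^2)/(n*(n^2 - 1)),
    where d_k is the difference between the ranks of x_k and y_k
    (assumes no tied values)."""
    n = len(x)
    rank = lambda v: [sorted(v).index(t) + 1 for t in v]
    a, b = rank(x), rank(y)
    d2 = sum((ai - bi)**2 for ai, bi in zip(a, b))
    return 1 - 6 * d2 / (n * (n**2 - 1))

x = [3.1, 1.2, 5.4, 2.2, 4.0]      # illustrative data
y = [30, 11, 52, 25, 41]           # same ordering as x, so R = 1
```

R reaches +1 for identical rankings and -1 for exactly reversed rankings; under independence, R/√(1/(n − 1)) is referred to the standardized normal distribution for large n.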
19.8. RANDOM-PROCESS STATISTICS AND MEASUREMENTS
19.8-1. Simple Finite-time Averages. Let
be a given measurable function of the sample values x(ti), y(ti), . . . generated by the one-dimensional or multidimensional random process x(t), y(t), . . . (Sec. 18.9-1). The finite-time averages
can be obtained, respectively, through sampled-data and continuous averaging of finite real data. [f]n and (f)T are random variables whose distributions are determined by the given random process. If x(t) represents a stationary (but not necessarily ergodic) random process, then
The finite-time averages [f]n and (f)T are, then, unbiased estimates of the unknown expected value E{f} and will be useful for estimating E{f} if their random fluctuations about their expected value (4) are reasonably small. More specifically, the mean-square error associated with estimation of E{f} by a measured value of [f]n is
If one introduces k Δt = λ and lets Δt → 0, n = T/Δt → ∞, then the sampled-data average (2) converges to the continuous average (f)T if the latter exists. A similar limiting process applied to Eq. (3) yields
Depending on the nature of the autocovariance function
the mean-square error (4) or (5) may or may not decrease to an acceptably small value with increasing sample size n or integration time T.
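The dependence of the estimate variance on the autocovariance can be made concrete. The sketch below (names and the exponential autocovariance are illustrative assumptions) evaluates the standard expression for the variance of the sampled-data average of a stationary sequence, Var{[f]n} = (1/n) Σ|k|<n (1 − |k|/n) K(k), which reduces to K(0)/n for uncorrelated samples:

```python
def mean_estimate_variance(K, n):
    """Variance of the sampled-data average [f]_n = (1/n)*sum f(t_k)
    for a stationary sequence with autocovariance K(k):
        Var{[f]_n} = (1/n) * sum_{|k|<n} (1 - |k|/n) * K(k).
    Whether this decreases acceptably with n depends on how fast
    K(k) decays."""
    return sum((1 - abs(k) / n) * K(k) for k in range(-n + 1, n)) / n

# Illustrative exponentially decaying autocovariance K(k) = rho^|k|:
rho = 0.9
K = lambda k: rho ** abs(k)
v10 = mean_estimate_variance(K, 10)
v1000 = mean_estimate_variance(K, 1000)
```

Strongly correlated samples (rho near 1) require a much larger n than uncorrelated ones for the same estimate variance.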
19.8-2. Averaging Filters. Time averaging is often accomplished by low-pass filters implemented as electrical networks, by electromechanical measuring devices with inertia and damping, or by digital sampled-data operations. Consider, in particular, a general time-invariant averaging filter with stationary input f(t) ≡ f(t1 + t, t2 + t, . . . , tn + t), bounded weighting function h(ζ), and frequency-response function H(iω) (Sec. 9.4-7). If the filter input is applied between t = 0 and t = T (averaging time), the filter output is
so that z(T)/a(T) is an unbiased estimate of E{f}.
The estimate variance is given by
As T → ∞, Var {z(T)} will not in general go to zero but, rather, approaches the stationary output variance
where φ(ω) is the power spectral density (Sec. 18.10-3) of f(t) − E{f}, and BEQ is the bandwidth of an equivalent “rectangular” low-pass filter having the frequency response
BEQ is a useful measure of the variance-reducing properties of a given averaging filter. Table 19.8-1 lists H(iw) and BEQ for some practical filters (flat-spectrum input).
Table 19.8-1. Averaging Filters (Ref. 19.25)
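BEQ can also be estimated numerically for a filter given only as a frequency response. The sketch below (function names are ours) uses one common convention for the equivalent noise bandwidth, BEQ = (1/|H(0)|²) ∫0^∞ |H(i2πf)|² df in cps, and checks it against the first-order low-pass filter, for which BEQ = 1/(4T0) analytically:

```python
import math

def b_eq(H_mag2, f_max=1.0e4, steps=100000):
    """Numerically estimate the equivalent 'rectangular' bandwidth
        B_EQ = (1/|H(0)|^2) * integral_0^inf |H(i*2*pi*f)|^2 df
    by trapezoidal integration up to f_max (assumed large enough
    that the truncated tail is negligible)."""
    df = f_max / steps
    total = 0.0
    for i in range(steps + 1):
        w = H_mag2(i * df)
        total += w * (0.5 if i in (0, steps) else 1.0)
    return total * df / H_mag2(0.0)

# First-order low-pass filter: |H(i*2*pi*f)|^2 = 1/(1 + (2*pi*f*T0)^2);
# analytically B_EQ = 1/(4*T0).
T0 = 1.0e-2
H2 = lambda f: 1.0 / (1.0 + (2 * math.pi * f * T0) ** 2)
```

For T0 = 0.01 sec the numerical result is close to the analytic value 25 cps.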
19.8-3. Examples (Ref. 19.25). (a) Measurement of Mean Value. It is desired to measure the mean value E{x} = ξ of a stationary random voltage x(t) with
(white noise passed through a simple filter with 3-db bandwidth α/2π cps, or random telegraph wave with counting rate α/2, Sec. 18.11-5). In this case,
and for the first-order averaging filter of Table 19.8-1 with T >> T0 (T > 4T0 for most practical purposes),
(b) Measurement of Mean Square. It is desired to measure E{f} = E{x2} for Gaussian noise x(t) satisfying Eq. (14). In this case,
and for the first-order averaging filter with T >> T0 and f ≡ x2,
(c) Measurement of Correlation Functions (see also Sec. 18.9-3b). The variances of correlation-function estimates for jointly stationary x(t), y(t) are given by Eqs. (5), (6), and (11) with f(t) ≡ x(t)y(t + τ). Unfortunately, each variance depends on
and this fourth-order moment of the joint distribution of x(t) and y(t) is hardly ever known. In the special case of jointly Gaussian and stationary signals x(t), y(t) with zero mean values,
but even this involves the unknown correlation function Rxy(τ) itself and hence yields useful information only in simple special cases.
For stationary Gaussian x(t),
To illustrate the dependence of the autocorrelation-function estimate
on signal bandwidth and delay, consider again Gaussian noise x(t) satisfying Eq. (14). In this case,
For τ = 0, this agrees with Eq. (18). For observation times T large compared to the reciprocal signal bandwidth,
(within 1 per cent for αT ≥ 104).
For the more general case of a stationary Gaussian signal x(t) with
one similarly obtains
19.8-4. Sample Averages. Different sample functions generated by the same random process will be denoted by 1x(t), 2x(t), . . . (Fig. 18.9-1) and are regarded as statistically independent; i.e., every finite set of samples ix(t1), ix(t2), . . . is statistically independent of every set of samples from a different sample function kx(t). If one can realize a set of sample functions 1x(t), 2x(t), . . . , nx(t) in independent repeated experiments, then the sample values 1x(t1), 2x(t1), . . . , nx(t1) constitute a classical random sample of size n; i.e., the kx(t1) are statistically independent random variables with identical probability distributions. Similarly, 1x(t1), 1x(t2); 2x(t1), 2x(t2); . . . ; nx(t1), nx(t2), or 1x(t1), 1y(t2); 2x(t1), 2y(t2); . . . ; nx(t1), ny(t2) constitute bivariate random samples.
Sample averages, like
are, then, random-sample statistics in the sense of classical statistical theory. Sample averages must be obtained from repeated or multiple experiments, but it is usually much simpler to derive variances and probability distributions for sample averages than it is for time averages. In particular,
just as in Sec. 19.2-3 (see also Fig. 19.8-1).
FIG. 19.8-1. Four sample functions x(t) = kx(t) from a continuous random process represented by x(t).
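The ensemble-averaging idea of Sec. 19.8-4 can be sketched as follows (the process model and all names are illustrative assumptions, not the handbook's): each realization is an independent sample function, and the sample average over realizations at a fixed instant t1 is a classical random-sample mean, unbiased for E{x(t1)} with variance Var{x(t1)}/n.

```python
import random

def sample_average(realizations, t_index):
    """Ensemble (sample) average over n independent sample functions
    evaluated at the same instant t1: a classical random-sample mean,
    unbiased for E{x(t1)} with variance Var{x(t1)}/n."""
    vals = [r[t_index] for r in realizations]
    return sum(vals) / len(vals)

# Hypothetical stationary process x(t) = xi + unit-variance Gaussian
# noise, xi = 2; each call yields an independent sample function.
random.seed(1)
def make_realization(length=10, xi=2.0):
    return [xi + random.gauss(0.0, 1.0) for _ in range(length)]

realizations = [make_realization() for _ in range(2000)]
xbar = sample_average(realizations, t_index=0)
```

With n = 2000 realizations the standard deviation of the sample average is about 1/√2000 ≈ 0.022, so xbar lies close to the true mean 2.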
19.9. TESTING AND ESTIMATION WITH RANDOM PARAMETERS
19.9-1. Problem Statement. A practically important class of decision situations can be represented by the model of Fig. 19.9-1. The cost C (risk) associated with one operation of the system shown is a function of the state of the environment represented by the m-dimensional random variable (s1, s2, . . . , sm) ≡ s and by a decision (response) represented by the r-dimensional variable (y1, y2, ... , yr) ≡ y, i.e.,
The decision maker (man, machine, or system) arrives at the decision y on the basis of received, observed, or measured data represented by the
n-dimensional random variable (x1, x2, . . . , xn) ≡ x, which is related to the state of the environment through the given joint distribution of s and x. The decision maker forms y as a deterministic function (decision function, see also Sec. 19.6-9) of the received data:*
Given the joint distribution of s and x together with the cost function (1) representing the system performance for each combination of environment state and decision, one desires to minimize the expected risk
through an optimal choice of the decision function y(x). s, x, and y may be discrete or continuous variables.
19.9-2. Bayes Estimation and Tests. If the environment-state parameters s1, s2, . . . , sm are regarded as parameters of the unknown probability distribution of the observed sample (x1, x2, . . . , xn), the problem is somewhat similar to the classical problems of testing and estimation; the essential difference is that the parameters s1, s2, . . . , sm are now random variables.
For continuous variables s, x, the decision maker's knowledge of the environment state on the basis of a received sample x is, then, expressed by the conditional probability density φ(s|x). Minimization of the expected risk (3), or
then requires minimization of the conditional risk
for each sample x through a proper choice of the decision function y(x). If one is given the “a priori” distribution of s and the conditional probability density φ(x|s), then the “a posteriori” probability distribution required for Eq. (5) is obtained with the aid of Bayes's theorem (Secs. 18.2-6 and 18.4-5) as
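For a discrete state variable, Bayes's theorem takes the elementary form p(s|x) = p(s)p(x|s)/Σs′ p(s′)p(x|s′); a minimal sketch (names are ours):

```python
def posterior(prior, likelihood, x):
    """Discrete Bayes's theorem: p(s|x) = p(s)*p(x|s) / sum over s'
    of p(s')*p(x|s').  'prior' maps each state s to p(s);
    'likelihood' maps an observation x and state s to p(x|s)."""
    joint = {s: prior[s] * likelihood(x, s) for s in prior}
    total = sum(joint.values())
    return {s: v / total for s, v in joint.items()}
```

For instance, with equal priors on two states and p(x|s) = 0.8 when the observation matches the state (0.2 otherwise), observing x = 1 gives posterior probability 0.8 for s = 1.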
Decision processes based on such minimization of the expected risk are known as Bayes estimation, and as Bayes tests if y is a discrete variable. Note that not all the unknown parameters si need affect C[s, y] explicitly (e.g., signal-carrier phase in amplitude-modulated radio transmission, Ref. 19.23).
If, as is often the case, one has no reliable knowledge of the cost function C(s, y) and/or of the “a priori” distribution of s, then Bayes estimation and testing becomes impossible. One may assume “worst-case” C(s, y) and φ(s) (minimax tests, Refs. 19.24 and 19.26), or one returns to the “classical” procedures of maximum-likelihood estimation (Sec. 19.4-4) and Neyman-Pearson tests (Sec. 19.6-3).
19.9-3. Binary State and Decision Variables: Statistical Tests (Detection) (see also Secs. 19.6-1 to 19.6-4). Assume that there exist only two environment states, s = 0 (null hypothesis) and s = 1 (alternative hypothesis, corresponding, say, to the absence and presence of a target to be detected). Assume two possible decisions y = 0, y = 1, which correspond to acceptance or rejection of the null hypothesis on the basis of the observed data sample (x1, x2, . . . , xn). The problem amounts to the choice of a critical region (rejection region) Sc of sample points (x1, x2, . . . , xn) which will minimize the expected risk
E{C(s,y)}
where p0 = P[s = 0], and the complement S̄c of Sc is the acceptance region. E{C(s, y)} will be minimized if one rejects the null hypothesis whenever the likelihood ratio
(see also Sec. 19.6-3) exceeds the critical value
Note that (1) any increasing or decreasing function of the likelihood ratio (8) can replace the latter as a test statistic, and (2) the likelihood ratio (8) is itself a monotonic function of the “a posteriori” conditional probability p(s|x1, x2, . . . , xn), which may be regarded as a basic test statistic.
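A sketch of the binary Bayes test follows. Since the elided displays are not available, the code assumes the common special case C(0, 0) = C(1, 1) = 0, in which the critical likelihood-ratio value reduces to p0·Cfa/[(1 − p0)·Cmiss], where Cfa is the cost of a false alarm and Cmiss the cost of a miss (names are ours):

```python
def bayes_threshold(p0, c_fa, c_miss):
    """Critical likelihood-ratio value for the binary Bayes test in
    the special case C(0,0) = C(1,1) = 0: reject the null hypothesis
    (decide y = 1) when Lambda(x) > p0*c_fa / ((1 - p0)*c_miss)."""
    return p0 * c_fa / ((1 - p0) * c_miss)

def bayes_test(likelihood_ratio, p0, c_fa=1.0, c_miss=1.0):
    """Return the decision y (0 or 1) for an observed likelihood
    ratio, given the prior p0 = P[s = 0] and the two error costs."""
    return 1 if likelihood_ratio > bayes_threshold(p0, c_fa, c_miss) else 0
```

With equal priors and equal error costs the threshold is 1, i.e., the test reduces to choosing the more likely hypothesis.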
EXAMPLE: Signal Detection with Additive Flat-spectrum Gaussian Noise. The objective is to decide whether a received signal x(t) of bandwidth B is due to noise alone [s = 0, or x(t) = n(t)] or to signal and additive noise [s = 1, or x(t) = s(t) + n(t)]; in view of the finite bandwidth, one can describe signals and noise in terms of sample values
where T is the observation time (Sec. 18.11-2). One is given
so that
Since
is an increasing function of the likelihood ratio Λ(x1, x2, . . . , x2BT), z rather than Λ may be used as a test statistic. The receiver forms z(x1, x2, . . . , x2BT), either by discrete summation or by continuous integration, and compares it with a threshold value zc determined by p0 and C(s, y); z > zc corresponds to the decision y = 1 (crosscorrelation detector or matched-filter detector). See Refs. 19.22, 19.23, and 19.26 to 19.28 for additional examples and applications.
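The discrete-summation form of the crosscorrelation detector can be sketched as follows (the statistic z = Σk sk·xk is one standard form for known-signal detection in white Gaussian noise; names are ours, and the threshold z_c is taken as given):

```python
def crosscorrelation_statistic(x, s):
    """Test statistic z = sum_k s_k * x_k formed by discrete summation
    of the received samples x_k against the known signal samples s_k."""
    return sum(sk * xk for sk, xk in zip(s, x))

def detect(x, s, z_c):
    """Compare z with the threshold z_c (determined by p0 and the
    cost function); z > z_c corresponds to the decision y = 1."""
    return 1 if crosscorrelation_statistic(x, s) > z_c else 0
```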
19.9-4. Estimation (Signal Extraction, Regression). In a typical measurement situation, the environment has a continuously distributed set of states represented by values of s ≡ (s1, s2, . . . , sm), and the m components yk of the m-dimensional decision function y ≡ (y1, y2, . . . , ym) must be chosen so as to approximate the corresponding sk as closely as possible in some sense specified by the cost function C(s, y). In the practically important case of least-square estimation, one assumes a cost function of the form
In this case, E{C(s,y)} is minimized if each yk equals the conditional expected value of sk for the measured sample (x1, x2, . . . , xn):
(see also Secs. 18.4-5, 18.4-6, and 18.4-9), where φp(sk|x1, x2, . . . , xn) must be obtained from φ(x1, x2, . . . , xn|s1, s2, . . . , sm) with the aid of Bayes's formula
EXAMPLE: D-c Measurement with Additive Gaussian Noise. The objective is to measure a single random quantity s from a sample (x1, x2, . . . , xn) of measured values xk = s + nk, given
which corresponds to additive-noise samples nk statistically independent of each other and of s. Bayes's formula yields
E{(y — s)2} is then minimized by the least-squares estimate
which depends on the sample values xk only by way of the statistic (sample average) and is biased by a priori knowledge of φ(s); the bias increases with increasing PN and decreasing sample size n, both of which reduce the “information” in the measured sample (x1, x2, . . . , xn). The resulting expected risk is
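The biased pull toward the prior can be sketched with the standard conjugate-Gaussian result (assumed here for illustration, since the displays are elided): with prior s ~ N(μs, σs²) and independent noise nk ~ N(0, PN), the posterior mean is a weighted combination of the sample average and the prior mean.

```python
def bayes_mean_estimate(x, mu_s, var_s, PN):
    """Least-squares (posterior-mean) estimate of s from samples
    x_k = s + n_k, assuming a Gaussian prior s ~ N(mu_s, var_s) and
    independent Gaussian noise n_k ~ N(0, PN):
        y = (var_s*xbar + (PN/n)*mu_s) / (var_s + PN/n).
    Increasing PN or decreasing n pulls y toward the prior mean mu_s,
    reflecting the reduced 'information' in the sample."""
    n = len(x)
    xbar = sum(x) / n
    return (var_s * xbar + (PN / n) * mu_s) / (var_s + PN / n)
```

In the limits PN → 0 or n → ∞ the estimate reduces to the sample average; for an uninformative sample it reduces to the prior mean.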
References 19.22, 19.23, and 19.26 to 19.28 describe additional applications.
19.10. RELATED TOPICS, REFERENCES, AND BIBLIOGRAPHY
19.10-1. Related Topics. The following topics related to the study of mathematical statistics are treated in other chapters of this handbook:
Probability theory, special distributions, limit theorems, random processes Chap. 18
Numerical calculations Chap. 20
Linear programming, game theory Chap. 11
Combinatorial analysis Appendix C
19.10-2. References and Bibliography (see also Sec. 18.13-2).
19.1. Brownlee, K. A.: Statistical Theory and Methodology in Science and Engineering, Wiley, New York, 1961.
19.2. Brunk, H. D.: An Introduction to Mathematical Statistics, 2d ed., Blaisdell, New York, 1964.
19.3. Burington, R. S., and D. C. May: Handbook of Probability and Statistics with Tables, 2d ed., McGraw-Hill, New York, 1967.
19.4. Cramér, H.: Mathematical Methods of Statistics, Princeton, Princeton, N.J., 1951.
19.5. Dixon, W. J., and F. J. Massey, Jr.: An Introduction to Statistical Analysis, 2d ed., McGraw-Hill, New York, 1957.
19.6. Eisenhart, C., M. W. Hastay, and W. A. Wallis: Techniques of Statistical Analysis, McGraw-Hill, New York, 1947.
19.7. Elderton, W. P.: Frequency Curves and Correlation, 3d ed., Cambridge, New York, 1938.
19.8. Hald, A.: Statistical Theory with Engineering Applications, Wiley, New York, 1952.
19.9. Hoel, P. G.: Elementary Statistics, 2d ed., Wiley, New York, 1966.
19.10. Hogg, R. V., and A. T. Craig: Introduction to Mathematical Statistics, Macmillan, New York, 1959.
19.11. Lehmann, E. L.: Testing Statistical Hypotheses, Wiley, New York, 1959.
19.12. Mood, A. M., and F. A. Graybill: Introduction to the Theory of Statistics, 2d ed., McGraw-Hill, New York, 1963.
19.13. Neyman, J.: First Course in Probability and Statistics, Holt, New York, 1950.
19.14. Scheffe, H.: The Analysis of Variance, Wiley, New York, 1959.
19.15. Van der Waerden, B. L.: Mathematische Statistik, 2d ed., Springer, Berlin, 1965.
19.16. Walsh, J. E.: Handbook of Nonparametric Statistics (2 vols.), Van Nostrand, Princeton, N.J., 1960/65.
19.17. Weiss, L.: Statistical Decision Theory, McGraw-Hill, New York, 1961.
19.18. Wilks, S. S.: Mathematical Statistics, 2d ed., Wiley, New York, 1962.
Random-process Statistics and Decision Theory
19.19. Bendat, J. S.: Principles and Applications of Random Noise Theory, Wiley, New York, 1958.
19.20. —— and A. G. Piersol: Measurement and Analysis of Random Data, Wiley, New York, 1966.
19.21. Blackman, R. B., and J. W. Tukey: The Measurement of Power Spectra, Dover, New York, 1958.
19.22. Davenport, W. B., and W. L. Root: Introduction to Random Signals and Noise, McGraw-Hill, New York, 1958.
19.23. Hancock, J. C.: Signal Detection, McGraw-Hill, New York, 1966.
19.24. Helstrom, C. W.: Statistical Theory of Signal Detection, Pergamon Press, New York, 1960.
19.25. Korn, G. A.: Random-process Simulation and Measurements, McGraw-Hill, New York, 1966.
19.26. Middleton, D.: An Introduction to Statistical Communication Theory, McGraw-Hill, New York, 1960.
19.27. Wainstein, L. A., and V. D. Zubakov: Extraction of Signals from Noise, Prentice-Hall, Englewood Cliffs, N.J., 1962.
19.28. Wozencraft, J. M., and I. M. Jacobs: Principles of Communication Engineering, Wiley, New York, 1965.
* See footnote to Sec. 18.3-4.
* See footnote to Sec. 19.6-5a.
* Random or partially random selection of decisions (as in games with mixed strategies, Sec. 11.4-4b) will not be considered here.