8
PROBABILITY

In this chapter I shall only deal with the probability of events and the problems it raises. They arise in connection with the theory of games of chance, and with the probabilistic laws of physics. I shall leave the problems of what may be called the probability of hypotheses—such questions as whether a frequently tested hypothesis is more probable than one which has been little tested—to be discussed in sections 79 to 85 under the title of ‘Corroboration’.

Ideas involving the theory of probability play a decisive part in modern physics. Yet we still lack a satisfactory, consistent definition of probability; or, what amounts to much the same, we still lack a satisfactory axiomatic system for the calculus of probability. The relations between probability and experience are also still in need of clarification. In investigating this problem we shall discover what will at first seem an almost insuperable objection to my methodological views. For although probability statements play such a vitally important rôle in empirical science, they turn out to be in principle impervious to strict falsification. Yet this very stumbling block will become a touchstone upon which to test my theory, in order to find out what it is worth.

Thus we are confronted with two tasks. The first is to provide new foundations for the calculus of probability. This I shall try to do by developing the theory of probability as a frequency theory, along the lines followed by Richard von Mises, but without the use of what he calls the ‘axiom of convergence’ (or ‘limit axiom’), and with a somewhat weakened ‘axiom of randomness’. The second task is to elucidate the relations between probability and experience. This means solving what I call the problem of decidability of probability statements.

My hope is that these investigations will help to relieve the present unsatisfactory situation in which physicists make much use of probabilities without being able to say, consistently, what they mean by ‘probability’.*1


47 THE PROBLEM OF INTERPRETING PROBABILITY STATEMENTS

I shall begin by distinguishing two kinds of probability statements: those which state a probability in terms of numbers—which I will call numerical probability statements—and those which do not.

Thus the statement, ‘The probability of throwing eleven with two (true) dice is 1/18’, would be an example of a numerical probability statement. Non-numerical probability statements can be of various kinds. ‘It is very probable that we shall obtain a homogeneous mixture by mixing water and alcohol’, illustrates one kind of statement which, suitably interpreted, might perhaps be transformed into a numerical probability statement. (For example, ‘The probability of obtaining... is very near to 1’.) A very different kind of non-numerical probability statement would be, for instance, ‘The discovery of a physical effect which contradicts the quantum theory is highly improbable’; a statement which, I believe, cannot be transformed into a numerical probability statement, or put on a par with one, without distorting its meaning. I shall deal first with numerical probability statements; non-numerical ones, which I think less important, will be considered afterwards.

In connection with every numerical probability statement, the question arises: ‘How are we to interpret a statement of this kind and, in particular, the numerical assertion it makes?’


48 SUBJECTIVE AND OBJECTIVE INTERPRETATIONS

The classical (Laplacean) theory of probability defines the numerical value of a probability as the quotient obtained by dividing the number of favourable cases by the number of equally possible cases. We might disregard the logical objections which have been raised against this definition,1 such as that ‘equally possible’ is only another expression for ‘equally probable’. But even then we could hardly accept this definition as providing an unambiguously applicable interpretation. For there are latent in it several different interpretations which I will classify as subjective and objective.

A subjective interpretation of probability theory is suggested by the frequent use of expressions with a psychological flavour, like ‘mathematical expectation’ or, say, ‘normal law of error’, etc.; in its original form it is psychologistic. It treats the degree of probability as a measure of the feelings of certainty or uncertainty, of belief or doubt, which may be aroused in us by certain assertions or conjectures. In connection with some non-numerical statements, the word ‘probable’ may be quite satisfactorily translated in this way; but an interpretation along these lines does not seem to me very satisfactory for numerical probability statements.

A newer variant of the subjective interpretation,*1 however, deserves more serious consideration here. This interprets probability statements not psychologically but logically, as assertions about what may be called the ‘logical proximity’2 of statements. Statements, as we all know, can stand in various logical relations to one another, like derivability, incompatibility, or mutual independence; and the logico-subjective theory, of which Keynes3 is the principal exponent, treats the probability relation as a special kind of logical relationship between two statements. The two extreme cases of this probability relation are derivability and contradiction: a statement q ‘gives’,4 it is said, to another statement p the probability 1 if p follows from q. In case p and q contradict each other the probability given by q to p is zero. Between these extremes lie other probability relations which, roughly speaking, may be interpreted in the following way: The numerical probability of a statement p (given q) is the greater the less its content goes beyond what is already contained in that statement q upon which the probability of p depends (and which ‘gives’ to p a probability).

The kinship between this and the psychologistic theory may be seen from the fact that Keynes defines probability as the ‘degree of rational belief’. By this he means the amount of trust it is proper to accord to a statement p in the light of the information or knowledge which we get from that statement q which ‘gives’ probability to p.

A third interpretation, the objective interpretation, treats every numerical probability statement as a statement about the relative frequency with which an event of a certain kind occurs within a sequence of occurrences.5

According to this interpretation, the statement ‘The probability of the next throw with this die being a five equals 1/6’ is not really an assertion about the next throw; rather, it is an assertion about a whole class of throws of which the next throw is merely an element. The statement in question says no more than that the relative frequency of fives, within this class of throws, equals 1/6.
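The frequency reading of such a statement can be made concrete by a minimal Python sketch; the simulated die and the size of the class of throws are, of course, merely illustrative assumptions, not part of the text:

```python
import random

# The objective interpretation reads 'the probability of a five is 1/6'
# as a statement about the relative frequency of fives within a large
# class of throws, not about any single throw.
throws = [random.randint(1, 6) for _ in range(100_000)]
print(throws.count(5) / len(throws))  # close to 1/6 ~ 0.1667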

According to this view, numerical probability statements are only admissible if we can give a frequency interpretation of them. Those probability statements for which a frequency interpretation cannot be given, and especially the non-numerical probability statements, are usually shunned by the frequency theorists.

In the following pages I shall attempt to construct anew the theory of probability as a (modified) frequency theory. Thus I declare my faith in an objective interpretation; chiefly because I believe that only an objective theory can explain the application of the probability calculus within empirical science. Admittedly, the subjective theory is able to give a consistent solution to the problem of how to decide probability statements; and it is, in general, faced by fewer logical difficulties than is the objective theory. But its solution is that probability statements are nonempirical; that they are tautologies. And this solution turns out to be utterly unacceptable when we remember the use which physics makes of the theory of probability. (I reject that variant of the subjective theory which holds that objective frequency statements should be derived from subjective assumptions—perhaps using Bernoulli’s theorem as a ‘bridge’:6 I regard this programme for logical reasons as unrealizable.)

49 THE FUNDAMENTAL PROBLEM OF THE THEORY OF CHANCE

The most important application of the theory of probability is to what we may call ‘chance-like’ or ‘random’ events, or occurrences. These seem to be characterized by a peculiar kind of incalculability which makes one disposed to believe—after many unsuccessful attempts— that all known rational methods of prediction must fail in their case. We have, as it were, the feeling that not a scientist but only a prophet could predict them. And yet, it is just this incalculability that makes us conclude that the calculus of probability can be applied to these events.

This somewhat paradoxical conclusion from incalculability to calculability (i.e. to the applicability of a certain calculus) ceases, it is true, to be paradoxical if we accept the subjective theory. But this way of avoiding the paradox is extremely unsatisfactory. For it entails the view that the probability calculus is not a method of calculating predictions, in contradistinction to all the other methods of empirical science. It is, according to the subjective theory, merely a method for carrying out logical transformations of what we already know; or rather what we do not know; for it is just when we lack knowledge that we carry out these transformations.1 This conception dissolves the paradox indeed, but it does not explain how a statement of ignorance, interpreted as a frequency statement, can be empirically tested and corroborated. Yet this is precisely our problem. How can we explain the fact that from incalculability—that is, from ignorance— we may draw conclusions which we can interpret as statements about empirical frequencies, and which we then find brilliantly corroborated in practice?

Even the frequency theory has not up to now been able to give a satisfactory solution of this problem—the fundamental problem of the theory of chance, as I shall call it. It will be shown in section 67 that this problem is connected with the ‘axiom of convergence’ which is an integral part of the theory in its present form. But it is possible to find a satisfactory solution within the framework of the frequency theory, after this axiom has been eliminated. It will be found by analysing the assumptions which allow us to argue from the irregular succession of single occurrences to the regularity or stability of their frequencies.


50 THE FREQUENCY THEORY OF VON MISES

A frequency theory which provides a foundation for all the principal theorems of the calculus of probability was first proposed by Richard von Mises.1 His fundamental ideas are as follows.

The calculus of probability is a theory of certain chance-like or random sequences of events or occurrences, i.e. of repetitive events such as a series of throws with a die. These sequences are defined as ‘chance-like’ or ‘random’ by means of two axiomatic conditions: the axiom of convergence (or the limit-axiom) and the axiom of randomness. If a sequence of events satisfies both of these conditions it is called by von Mises a ‘collective’.

A collective is, roughly speaking, a sequence of events or occurrences which is capable in principle of being continued indefinitely; for example a sequence of throws made with a supposedly indestructible die. Each of these events has a certain character or property; for example, the throw may show a five and so have the property five. If we take all those throws having the property five which have appeared up to a certain element of the sequence, and divide their number by the total number of throws up to that element (i.e. its ordinal number in the sequence) then we obtain the relative frequency of fives up to that element. If we determine the relative frequency of fives up to every element of the sequence, then we obtain in this way a new sequence—the sequence of the relative frequencies of fives. This sequence of frequencies is distinct from the original sequence of events to which it corresponds, and which may be called the ‘event-sequence’ or the ‘property-sequence’.

As a simple example of a collective I choose what we may call an ‘alternative’. By this term we denote a sequence of events supposed to have two properties only—such as a sequence of tosses of a coin. The one property (heads) will be denoted by ‘1’, and the other (tails) by ‘0’. A sequence of events (or sequence of properties) may then be represented as follows:


0 1 1 0 0 0 1 1 1 0 1 0 1 0 . . .

Corresponding to this ‘alternative’—or, more precisely, correlated with the property ‘1’ of this alternative—is the following sequence of relative frequencies, or ‘frequency-sequence’:2


0/1  1/2  2/3  2/4  2/5  2/6  3/7  4/8  5/9  5/10  6/11  6/12  7/13  7/14 . . .
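How the frequency-sequence is obtained from the event-sequence may be shown by a short Python sketch (an editorial illustration under the assumption that the sequence begins as printed above):

```python
# Compute the sequence of relative frequencies of ones from the
# event-sequence given above: count of ones so far, divided by the
# ordinal number of the element.
events = [0, 1, 1, 0, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0]

ones = 0
frequencies = []
for n, e in enumerate(events, start=1):
    ones += e
    frequencies.append(f"{ones}/{n}")
print(" ".join(frequencies))
# 0/1 1/2 2/3 2/4 2/5 2/6 3/7 4/8 5/9 5/10 6/11 6/12 7/13 7/14
```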

Now the axiom of convergence (or ‘limit-axiom’) postulates that, as the event-sequence becomes longer and longer, the frequency-sequence shall tend towards a definite limit. This axiom is used by von Mises because we have to make sure of one fixed frequency value with which we can work (even though the actual frequencies have fluctuating values). In any collective there are at least two properties; and if we are given the limits of the frequencies corresponding to all the properties of a collective, then we are given what is called its ‘distribution’.

The axiom of randomness or, as it is sometimes called, ‘the principle of the excluded gambling system’, is designed to give mathematical expression to the chance-like character of the sequence. Clearly, a gambler would be able to improve his chances by the use of a gambling system if sequences of penny tosses showed regularities such as, say, a fairly regular appearance of tails after every run of three heads. Now the axiom of randomness postulates of all collectives that there does not exist a gambling system that can be successfully applied to them. It postulates that, whatever gambling system we may choose for selecting supposedly favourable tosses, we shall find that, if gambling is continued long enough, the relative frequencies in the sequence of tosses supposed to be favourable will approach the same limit as those in the sequence of all tosses. Thus a sequence for which there exists a gambling system by means of which the gambler can improve his chances is not a collective in the sense of von Mises.
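The point of the excluded gambling system can be illustrated by a small simulation; the sketch assumes a fair coin with independent tosses, which is precisely the kind of sequence the axiom is meant to characterize:

```python
import random

random.seed(3)
tosses = [random.randint(0, 1) for _ in range(300_000)]  # 1 = heads

# A gambling system: bet on the toss that follows a run of three heads.
selected = [tosses[i] for i in range(3, len(tosses))
            if tosses[i - 3:i] == [1, 1, 1]]

print(round(sum(tosses) / len(tosses), 3))      # ~0.5 in the whole sequence
print(round(sum(selected) / len(selected), 3))  # ~0.5 in the selection too
```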

Probability, for von Mises, is thus another term for ‘limit of relative frequency in a collective’. The idea of probability is therefore applicable only to sequences of events; a restriction likely to be quite unacceptable from a point of view such as Keynes’s. To critics objecting to the narrowness of his interpretation, von Mises replied by stressing the difference between the scientific use of probability, for example in physics, and the popular uses of it. He pointed out that it would be a mistake to demand that a properly defined scientific term has to correspond in all respects to inexact, pre-scientific usage.

The task of the calculus of probability consists, according to von Mises, simply and solely in this: to infer certain ‘derived collectives’ with ‘derived distributions’ from certain given ‘initial collectives’ with certain given ‘initial distributions’; in short, to calculate probabilities which are not given from probabilities which are given.

The distinctive features of his theory are summarized by von Mises in four points:3 the concept of the collective precedes that of probability; the latter is defined as the limit of the relative frequencies; an axiom of randomness is formulated; and the task of the calculus of probability is defined.


51 PLAN FOR A NEW THEORY OF PROBABILITY

The two axioms or postulates formulated by von Mises in order to define the concept of a collective have met with strong criticism— criticism which is not, I think, without some justification. In particular, objections have been raised against combining the axiom of convergence with the axiom of randomness1 on the ground that it is inadmissible to apply the mathematical concept of a limit, or of convergence, to a sequence which by definition (that is, because of the axiom of randomness) must not be subject to any mathematical rule or law. For the mathematical limit is nothing but a characteristic property of the mathematical rule or law by which the sequence is determined. It is merely a property of this rule or law if, for any chosen fraction arbitrarily close to zero, there is an element in the sequence such that all elements following it deviate by less than that fraction from some definite value—which is then called their limit.

To meet such objections it has been proposed to refrain from combining the axiom of convergence with that of randomness, and to postulate only convergence, i.e. the existence of a limit. As to the axiom of randomness, the proposal was either to abandon it altogether (Kamke) or to replace it by a weaker requirement (Reichenbach). These suggestions presuppose that it is the axiom of randomness which is the cause of the trouble.

In contrast to these views, I am inclined to blame the axiom of convergence no less than the axiom of randomness. Thus I think that there are two tasks to be performed: the improvement of the axiom of randomness—mainly a mathematical problem; and the complete elimination of the axiom of convergence—a matter of particular concern for the epistemologist.2 (Cf. section 66.)

In what follows I propose to deal first with the mathematical, and afterwards with the epistemological question.

The first of these two tasks, the reconstruction of the mathematical theory,3 has as its main aim the derivation of Bernoulli’s theorem— the first ‘Law of Great Numbers’—from a modified axiom of randomness; modified, namely, so as to demand no more than is needed to achieve this aim. Or to be more precise, my aim is the derivation of the Binomial Formula (sometimes called ‘Newton’s Formula’), in what I call its ‘third form’. For from this formula, Bernoulli’s theorem and the other limit theorems of probability theory can be obtained in the usual way.

My plan is to work out first a frequency theory for finite classes, and to develop the theory, within this frame, as far as possible—that is, up to the derivation of the (‘first’) Binomial Formula. This frequency theory for finite classes turns out to be a quite elementary part of the theory of classes. It will be developed merely in order to obtain a basis for discussing the axiom of randomness.

Next I shall proceed to infinite sequences, i.e. to sequences of events which can be continued indefinitely, by the old method of introducing an axiom of convergence, since we need something like it for our discussion of the axiom of randomness. And after deriving and examining Bernoulli’s theorem, I shall consider how the axiom of convergence might be eliminated, and what sort of axiomatic system we should be left with as the result.

In the course of the mathematical derivation I shall use three different frequency symbols: F″ is to symbolize relative frequency in finite classes; F′ is to symbolize the limit of the relative frequencies of an infinite frequency-sequence; and finally F is to symbolize objective probability, i.e. relative frequency in an ‘irregular’ or ‘random’ or ‘chance-like’ sequence.


52 RELATIVE FREQUENCY WITHIN A FINITE CLASS

Let us consider a class α of a finite number of occurrences, for example the class of throws made yesterday with this particular die. This class α, which is assumed to be non-empty, serves, as it were, as a frame of reference, and will be called a (finite) reference-class. The number of elements belonging to α, i.e. its cardinal number, is denoted by ‘N(α)’, to be read ‘the number of α’. Now let there be another class, β, which may be finite or not. We will call β our property-class: it may be, for example, the class of all throws which show a five, or (as we shall say) which have the property five.

The class of those elements which belong to both α and β, for example the class of throws made yesterday with this particular die and having the property five, is called the product-class of α and β, and is denoted by ‘α.β’, to be read ‘α and β’. Since α.β is a subclass of α, it can at most contain a finite number of elements (it may be empty). The number of elements in α.β is denoted by ‘N(α.β)’.

Whilst we symbolize (finite) numbers of elements by N, the relative frequencies are symbolized by F″. For example, ‘the relative frequency of the property β within the finite reference-class α’ is written ‘αF″(β)’, which may be read ‘the α-frequency of β’. We can now define

αF″(β) = N(α.β)/N(α)        (Definition 1)


In terms of our example this would mean: ‘The relative frequency of fives among yesterday’s throws with this die is, by definition, equal to the quotient obtained by dividing the number of fives, thrown yesterday with this die, by the total number of yesterday’s throws with this die.’*1

From this rather trivial definition, the theorems of the calculus of frequency in finite classes can very easily be derived (more especially, the general multiplication theorem; the theorem of addition; and the theorems of division, i.e. Bayes’s rules. Cf. appendix ii). Of the theorems of this calculus of frequency, and of the calculus of probability in general, it is characteristic that cardinal numbers (N-numbers) never appear in them, but only relative frequencies, i.e. ratios, or F-numbers. The N-numbers only occur in the proofs of a few fundamental theorems which are directly deduced from the definition; but they do not occur in the theorems themselves.*2

How this is to be understood will be shown here with the help of one very simple example. (Further examples will be found in appendix ii.) Let us denote the class of all elements which do not belong to β by ‘β̄’ (read: ‘the complement of β’ or simply: ‘non-β’). Then we may write

αF″(β) + αF″(β̄) = 1

While this theorem only contains F-numbers, its proof makes use of N-numbers. For the theorem follows from the definition (1) with the help of a simple theorem from the calculus of classes which asserts that N(α.β) + N(α.β̄) = N(α).
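The definition and the theorem can be put into a few lines of Python; the particular classes and numbers are invented for the illustration:

```python
from fractions import Fraction

def F(alpha, beta):
    """Definition 1: the relative frequency N(alpha.beta) / N(alpha)."""
    return Fraction(len(alpha & beta), len(alpha))

alpha = set(range(1, 13))   # yesterday's twelve throws (the reference-class)
beta = {3, 7, 12}           # those showing a five (the property-class)
non_beta = alpha - beta     # the complement of beta, relative to alpha

print(F(alpha, beta))                            # 1/4
print(F(alpha, beta) + F(alpha, non_beta) == 1)  # True: F(b) + F(non-b) = 1
```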


53 SELECTION, INDEPENDENCE, INSENSITIVENESS, IRRELEVANCE

Among the operations which can be performed with relative frequencies in finite classes, the operation of selection1 is of special importance for what follows.

Let a finite reference-class α be given, for example the class of buttons in a box, and two property-classes, β (say, the red buttons) and γ (say, the large buttons). We may now take the product-class α.β as a new reference-class, and raise the question of the value of α.βF″(γ), i.e. of the frequency of γ within the new reference-class.2 The new reference-class α.β may be called ‘the result of selecting β-elements from α’, or the ‘selection from α according to the property β’; for we may think of it as being obtained by selecting from α all those elements (buttons) which have the property β (red).

Now it is just possible that γ may occur in the new reference-class, α.β, with the same relative frequency as in the original reference-class α; i.e. it may be true that

α.βF″(γ) = αF″(γ)

In this case we say (following Hausdorff3) that the properties β and γ are ‘mutually independent, within the reference-class α’. The relation of independence is a three-termed relation and is symmetrical in the properties β and γ.4 If two properties β and γ are (mutually) independent within a reference-class α we can also say that the property γ is, within α, insensitive to the selection of β-elements; or perhaps that the reference-class α is, with respect to this property γ, insensitive to a selection according to the property β.

The mutual independence, or insensitiveness, of β and γ within α could also—from the point of view of the subjective theory—be interpreted as follows: If we are informed that a particular element of the class α has the property β, then this information is irrelevant if β and γ are mutually independent within α; irrelevant namely, to the question whether this element also has the property γ, or not.*1 If, on the other hand, we know that γ occurs more often (or less often) in the subclass α.β (which has been selected from α according to β), then the information that an element has the property β is relevant to the question whether this element also has the property γ or not.5
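A Python sketch of selection and independence follows; the box of buttons is constructed so that the two properties come out mutually independent:

```python
from fractions import Fraction

# Buttons as (red, large) pairs: 6 of each red kind, 3 of each non-red
# kind, so that 'large' occurs among the red buttons with the same
# relative frequency as in the whole box.
alpha = ([(True, True)] * 6 + [(True, False)] * 6 +
         [(False, True)] * 3 + [(False, False)] * 3)

def freq(ref, prop):
    """Relative frequency of prop within the reference-class ref."""
    return Fraction(sum(1 for x in ref if prop(x)), len(ref))

red = lambda b: b[0]      # the property beta
large = lambda b: b[1]    # the property gamma

selection = [b for b in alpha if red(b)]   # the product-class alpha.beta
print(freq(alpha, large))       # 1/2
print(freq(selection, large))   # 1/2 -- equal, so beta and gamma are independent
```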


54 FINITE SEQUENCES. ORDINAL SELECTION AND NEIGHBOURHOOD SELECTION

Let us suppose that the elements of a finite reference-class α are numbered (for instance that a number is written on each button in the box), and that they are arranged in a sequence, in accordance with these ordinal numbers. In such a sequence we can distinguish two kinds of selection which have special importance, namely selection according to the ordinal number of an element, or briefly, ordinal selection, and selection according to its neighbourhood.

Ordinal selection consists in making a selection, from the sequence α, in accordance with a property β which depends upon the ordinal number of the element (whose selection is to be decided on). For example β may be the property even, so that we select from α all those elements whose ordinal number is even. The elements thus selected form a selected sub-sequence. Should a property γ be independent of an ordinal selection according to β, then we can also say that the ordinal selection is independent with respect to γ; or we can say that the sequence α is, with respect to γ, insensitive to a selection of β-elements.

Neighbourhood selection is made possible by the fact that, in ordering the elements in a numbered sequence, certain neighbourhood relations are created. This allows us, for example, to select all those members whose immediate predecessor has the property γ; or, say, those whose first and second predecessors, or whose second successor, have the property γ; and so on.

Thus if we have a sequence of events—say tosses of a coin—we have to distinguish two kinds of properties: its primary properties such as ‘heads’ or ‘tails’, which belong to each element independently of its position in the sequence; and its secondary properties such as ‘even’ or ‘successor of tails’, etc., which an element acquires by virtue of its position in the sequence.

A sequence with two primary properties has been called an ‘alternative’. As von Mises has shown, it is possible to develop (if we are careful) the essentials of the theory of probability as a theory of alternatives, without sacrificing generality. Denoting the two primary properties of an alternative by the figures ‘1’ and ‘0’, every alternative can be represented as a sequence of ones and zeros.

Now the structure of an alternative can be regular, or it can be more or less irregular. In what follows we will study this regularity or irregularity of certain finite alternatives more closely.*1


55 N-FREEDOM IN FINITE SEQUENCES

Let us take a finite alternative α, for example one consisting of a thousand ones and zeros regularly arranged as follows:


1 1 0 0 1 1 0 0 1 1 0 0 1 1 0 0 . . .        (α)

In this alternative we have equal distribution, i.e. the relative frequencies of the ones and the zeros are equal. If we denote the relative frequency of the property 1 by ‘αF″(1)’ and that of 0 by ‘αF″(0)’, we can write:

αF″(1) = αF″(0) = 1/2        (1)

We now select from α all terms with the neighbourhood-property of immediately succeeding a one (within the sequence α). If we denote this property by ‘β’, we may call the selected sub-sequence ‘α.β’. It will have the structure:


1 0 1 0 1 0 . . .        (α.β)

This sequence is again an alternative with equal distribution. Moreover, neither the relative frequency of the ones nor that of the zeros has changed; i.e. we have

α.βF″(1) = αF″(1);        α.βF″(0) = αF″(0)        (2)

In the terminology introduced in section 53, we can say that the primary properties of the alternative α are insensitive to selection according to the property β; or, more briefly, that α is insensitive to selection according to β.

Since every element of α has either the property β (that of being the successor of a one) or that of being the successor of a zero, we can denote the latter property by ‘β̄’. If we now select the members having the property β̄ we obtain the alternative:

0 1 0 1 0 1 . . . 0 1 0        (α.β̄)


This sequence shows a very slight deviation from equal distribution in so far as it begins and ends with zero (since α itself ends with ‘0, 0’ on account of its equal distribution). If α contains 2000 elements, then α.β̄ will contain 500 zeros, and only 499 ones. Such deviations from equal distribution (or from other distributions) arise only on account of the first or last elements: they can be made as small as we please by making the sequence sufficiently long. For this reason they will be neglected in what follows; especially since our investigations are to be extended to infinite sequences, where these deviations vanish. Accordingly, we shall say that the alternative α.β̄ has equal distribution, and that the alternative α is insensitive to the selection of elements having the property β̄. As a consequence, α, or rather the relative frequency of the primary properties of α, is insensitive to both, a selection according to β and a selection according to β̄; and we may therefore say that α is insensitive to every selection according to the property of the immediate predecessor.

Clearly, this insensitivity is due to certain aspects of the structure of the alternative α; aspects which may distinguish it from other alternatives. For example, the alternatives α.β and α.β̄ are not insensitive to selection according to the property of a predecessor.

We can now investigate the alternative α in order to see whether it is insensitive to other selections, especially to selection according to the property of a pair of predecessors. We can, for example, select from α all those elements which are successors of a pair 1,1. And we see at once that α is not insensitive to the selection of the successor of any of the four possible pairs 1,1; 1,0; 0,1; 0,0. In none of these cases have the resulting sub-sequences equal distribution; on the contrary, they all consist of uninterrupted blocks (or ‘iterations’), i.e. of nothing but ones, or of nothing but zeros.

The fact that α is insensitive to selection according to single predecessors, but not insensitive to selection according to pairs of predecessors, might be expressed, from the point of view of the subjective theory, as follows. Information about the property of one predecessor of any element in α is irrelevant to the question of the property of this element. On the other hand, information about the properties of its pair of predecessors is of the highest relevance; for given the law according to which α is constructed, it enables us to predict the property of the element in question: the information about the properties of its pair of predecessors furnishes us, so to speak, with the initial conditions needed for deducing the prediction. (The law according to which α is constructed requires a pair of properties as initial conditions; thus it is ‘two-dimensional’ with respect to these properties. The specification of one property is ‘irrelevant’ only in being composite in an insufficient degree to serve as an initial condition. Cf. section 38.*1)

 Remembering how closely the idea of causality—of cause and effect—is related to the deduction of predictions, I shall now make use of the following terms. The assertion previously made about the alternative α, ‘α is insensitive to selection according to a single predecessor’, I shall now express by saying, ‘α is free from any after-effect of single predecessors’ or briefly, ‘α is 1-free’. And instead of saying as before, that α is (or is not) ‘insensitive to selection according to pairs of predecessors’, I shall now say: ‘α is (not) free from the after-effects of pairs of predecessors’, or briefly, ‘α is (not) 2-free.’*2

Using the 1-free alternative α as our prototype we can now easily construct other sequences, again with equal distribution, which are not only free from the after-effects of one predecessor, i.e. 1-free (like α), but which are, in addition, free from the after-effects of a pair of predecessors, i.e., 2-free; and after this, we can go on to sequences which are 3-free, etc. In this way we are led to a general idea which is fundamental for what follows. It is the idea of freedom from the after-effects of all the predecessors up to some number n; or, as we shall say, of n-freedom. More precisely, we shall call a sequence ‘n-free’ if, and only if, the relative frequencies of its primary properties are ‘n-insensitive’, i.e. insensitive to selection according to single predecessors and according to pairs of predecessors and according to triplets of predecessors... and according to n-tuples of predecessors.1
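The definition of n-freedom can be put into a short Python test; the sketch reads the alternative cyclically, which amounts to neglecting the boundary deviations mentioned above:

```python
from fractions import Fraction
from itertools import product

def successors_of(seq, tup):
    """Select (cyclically) the elements whose immediate predecessors form tup."""
    k = len(tup)
    ext = seq + seq[:k]
    return [ext[i + k] for i in range(len(seq)) if tuple(ext[i:i + k]) == tup]

def is_n_free(seq, n):
    """True if the frequency of 1 is insensitive to selection according
    to every k-tuple of predecessors, for k = 1, ..., n."""
    overall = Fraction(seq.count(1), len(seq))
    for k in range(1, n + 1):
        for tup in product([0, 1], repeat=k):
            sel = successors_of(seq, tup)
            if sel and Fraction(sel.count(1), len(sel)) != overall:
                return False
    return True

alpha = [1, 1, 0, 0] * 250   # the thousand-element alternative of this section
print(is_n_free(alpha, 1))   # True:  alpha is 1-free
print(is_n_free(alpha, 2))   # False: alpha is not 2-free
```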

An alternative α which is 1-free can be constructed by repeating the generating period

1 1 0 0        (A)

any number of times. Similarly we obtain a 2-free alternative with equal distribution if we take

0 0 0 1 0 1 1 1        (B)

as its generating period. A 3-free alternative is obtained from the generating period

0 0 0 0 1 0 0 1 1 0 1 0 1 1 1 1        (C)

and a 4-free alternative is obtained from the generating period

 

0 0 0 0 0 1 0 0 0 1 1 0 0 1 0 1 0 0 1 1 1 0 1 0 1 1 0 1 1 1 1 1        (D)

It will be seen that the intuitive impression of being faced with an irregular sequence becomes stronger with the growth of the number n of its n-freedom.

The generating period of an n-free alternative with equal distribution must contain at least 2ⁿ⁺¹ elements. The periods given as examples can, of course, begin at different places; (C) for example can begin with its fourth element, so that we obtain, in place of (C)

0 1 0 0 1 1 0 1 0 1 1 1 1 0 0 0        (C′)

There are other transformations which leave the n-freedom of a sequence unchanged. A method of constructing generating periods of n-free sequences for every number n will be described elsewhere.*3

If to the generating period of an n-free alternative we add the first n elements of the next period, then we obtain a sequence of the length 2ⁿ⁺¹ + n. This has, among others, the following property: every arrangement of n + 1 zeros and ones, i.e. every possible (n + 1)-tuple, occurs in it at least once.*4
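These properties can be checked mechanically; the following Python sketch tests the periods given above (each read cyclically, i.e. as endlessly repeated, so that boundary effects vanish):

```python
from fractions import Fraction
from itertools import product

def n_free(period, n):
    """Test n-freedom of the alternative generated by repeating 'period'."""
    overall = Fraction(period.count(1), len(period))
    ext = period + period[:n]
    for k in range(1, n + 1):
        for tup in product([0, 1], repeat=k):
            sel = [ext[i + k] for i in range(len(period))
                   if tuple(ext[i:i + k]) == tup]
            if sel and Fraction(sel.count(1), len(sel)) != overall:
                return False
    return True

A = [1, 1, 0, 0]
B = [0, 0, 0, 1, 0, 1, 1, 1]
C = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1]
D = [int(c) for c in "00000100011001010011101011011111"]

print(n_free(A, 1), n_free(B, 2), n_free(C, 3), n_free(D, 4))  # all True
print(n_free(B, 3))  # False: (B) is 2-free but not 3-free
```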


56 SEQUENCES OF SEGMENTS. THE FIRST FORM OF THE BINOMIAL FORMULA

Given a finite sequence α, we call a sub-sequence of α consisting of n consecutive elements a ‘segment of α of length n’; or, more briefly, an ‘n-segment of α’. If, in addition to the sequence α, we are given some definite number n, then we can arrange the n-segments of α in a sequence—the sequence of n-segments of α. Given a sequence α, we may construct a new sequence, of n-segments of α, in such a way that we begin with the segment of the first n elements of α. Next comes the segment of the elements 2 to n + 1 of α. In general, we take as the xth element of the new sequence the segment consisting of the elements x to x + n − 1 of α. The new sequence so obtained may be called the ‘sequence of the overlapping n-segments of α’. This name indicates that any two consecutive elements (i.e. segments) of the new sequence overlap in such a way that they have n − 1 elements of the original sequence α in common.

Now we can obtain, by selection, other n-sequences from a sequence of overlapping segments; especially sequences of adjoining n-segments.

A sequence of adjoining n-segments contains only such n-segments as immediately follow each other in α without overlapping. It may begin, for example, with the n-segments of the elements numbered 1 to n of the original sequence α, followed by that of the elements n + 1 to 2n, 2n + 1 to 3n, and so on. In general, a sequence of adjoining segments will begin with the kth element of α and its segments will contain the elements of α numbered k to n + k − 1, n + k to 2n + k − 1, 2n + k to 3n + k − 1, and so on.

In what follows, sequences of overlapping n-segments of α will be denoted by ‘α(n)’, and sequences of adjoining n-segments by ‘αn’.
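The two constructions may be sketched in Python (with α(n) and αn as just defined; the sample sequence is invented for the illustration):

```python
def overlapping(alpha, n):        # the sequence alpha(n)
    return [tuple(alpha[i:i + n]) for i in range(len(alpha) - n + 1)]

def adjoining(alpha, n, k=0):     # the sequence alpha_n, beginning at element k+1
    return [tuple(alpha[i:i + n]) for i in range(k, len(alpha) - n + 1, n)]

alpha = [0, 1, 1, 0, 0, 0, 1, 1]
print(overlapping(alpha, 2))  # [(0,1), (1,1), (1,0), (0,0), (0,0), (0,1), (1,1)]
print(adjoining(alpha, 2))    # [(0,1), (1,0), (0,0), (1,1)]
```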

Let us now consider the sequences of overlapping segments α(n) a little more closely. Every element of such a sequence is an n-segment of α. As a primary property of an element of α(n), we might consider, for instance, the ordered n-tuple of zeros and ones of which the segment consists. Or we could, more simply, regard the number of its ones as the primary property of the element (disregarding the order of the ones and zeros). If we denote the number of ones by ‘m’ then, clearly, we have m ≤ n.

Now from every sequence α(n) we again get an alternative if we select a particular m (m ≤ n), ascribing the property ‘m’ to each element of the sequence α(n) which has exactly m ones (and therefore n − m zeros) and the property ‘m̄’ (non-m) to all other elements of α(n). Every element of α(n) must then have one or the other of these two properties.

Let us now imagine again that we are given a finite alternative α with the primary properties ‘1’ and ‘0’. Assume that the frequency of the ones, αF″(1), is equal to p, and that the frequency of the zeros, αF″(0), is equal to q. (We do not assume that the distribution is equal, i.e. that p = q.)

Now let this alternative α be at least n−1-free (n being an arbitrarily chosen natural number). We can then ask the following question: What is the frequency with which the property m occurs in the sequence α(n)? Or in other words, what will be the value of α(n)F″(m)?

Without assuming anything beyond the fact that α is at least n−1-free, we can settle this question1 by elementary arithmetic. The answer is contained in the following formula, the proof of which will be found in appendix iii:

α(n)F″(m) = ⁿCₘpᵐqⁿ⁻ᵐ        (1)

The right-hand side of the ‘binomial’ formula (1) was given—in another connection—by Newton. (It is therefore sometimes called Newton’s formula.) I shall call it the ‘first form of the binomial formula’.*1
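The formula can be checked numerically; the sketch below does so for the 2-free period (B) of section 55, read cyclically (so that boundary effects vanish), with n = 3 and p = q = 1/2:

```python
from fractions import Fraction
from math import comb

B = [0, 0, 0, 1, 0, 1, 1, 1]          # a 2-free alternative, read cyclically
n = 3
ext = B + B[:n - 1]
segments = [tuple(ext[i:i + n]) for i in range(len(B))]   # cyclic alpha(n)

for m in range(n + 1):
    observed = Fraction(sum(1 for s in segments if sum(s) == m), len(segments))
    predicted = Fraction(comb(n, m), 2 ** n)   # nCm * p^m * q^(n-m), p = q = 1/2
    print(m, observed, predicted, observed == predicted)
```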

With the derivation of this formula, I now leave the frequency theory as far as it deals with finite reference-classes. The formula will provide us with a foundation for our discussion of the axiom of randomness.


57 INFINITE SEQUENCES. HYPOTHETICAL ESTIMATES OF FREQUENCY

It is quite easy to extend the results obtained for n-free finite sequences to infinite n-free sequences which are defined by a generating period (cf. section 55). An infinite sequence of elements playing the rôle of the reference-class to which our relative frequencies are related may be called a ‘reference-sequence’. It more or less corresponds to a ‘collective’ in von Mises’s sense.*1

The concept of n-freedom presupposes that of relative frequency; for what its definition requires to be insensitive—insensitive to selection according to certain predecessors—is the relative frequency with which a property occurs. In our theorems dealing with infinite sequences I shall employ, but only provisionally (up to section 64), the idea of a limit of relative frequencies (denoted by F′), to take the place of relative frequency in finite classes (F″). The use of this concept gives rise to no problem so long as we confine ourselves to reference-sequences which are constructed according to some mathematical rule. We can always determine for such sequences whether the corresponding sequence of relative frequencies is convergent or not. The idea of a limit of relative frequencies leads to trouble only in the case of sequences for which no mathematical rule is given, but only an empirical rule (linking, for example the sequence with tosses of a coin); for in these cases the concept of limit is not defined (cf. section 51).

An example of a mathematical rule for constructing a sequence is the following: ‘The nth element of the sequence α shall be 0 if, and only if, n is divisible by four’. This defines the infinite alternative

1 1 1 0 1 1 1 0 1 1 1 0 . . .        (α)

with the limits of the relative frequencies αF′(1) = 3/4 and αF′(0) = 1/4. Sequences which are defined in this way by means of a mathematical rule I shall call, for brevity, ‘mathematical sequences’.
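The rule just stated is easily programmed (a sketch; the choice of how many elements to generate is arbitrary):

```python
from fractions import Fraction

def element(n):                     # n = 1, 2, 3, ...
    return 0 if n % 4 == 0 else 1   # 0 if, and only if, n is divisible by four

N = 10_000
ones = sum(element(n) for n in range(1, N + 1))
print(Fraction(ones, N))            # 3/4, the limit aF'(1) of the rule
```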

By contrast, a rule for constructing an empirical sequence would be, for instance: ‘The nth element of the sequence α shall be 0 if, and only if, the nth toss of the coin c shows tails.’ But empirical rules need not always define sequences of a random character. For example, I should describe the following rule as empirical: ‘The nth element of the sequence shall be 1 if, and only if, the nth second (counting from some zero instant) finds the pendulum p to the left of this mark.’

The example shows that it may sometimes be possible to replace an empirical rule by a mathematical one—for example on the basis of certain hypotheses and measurements relating to some pendulum. In this way, we may find a mathematical sequence approximating to our empirical sequence with a degree of precision which may or may not satisfy us, according to our purposes. Of particular interest in our present context is the possibility (which our example could be used to establish) of obtaining a mathematical sequence whose various frequencies approximate to those of a certain empirical sequence.

In dividing sequences into mathematical and empirical ones I am making use of a distinction that may be called ‘intensional’ rather than ‘extensional’. For if we are given a sequence ‘extensionally’, i.e. by listing its elements singly, one after the other—so that we can only know a finite piece of it, a finite segment, however long—then it is impossible to determine, from the properties of this segment, whether the sequence of which it is a part is a mathematical or an empirical sequence. Only when a rule of construction is given—that is, an ‘intensional’ rule—can we decide whether a sequence is mathematical or empirical.

Since we wish to tackle our infinite sequences with the help of the concept of a limit (of relative frequencies), we must restrict our investigation to mathematical sequences, and indeed to those for which the corresponding sequence of relative frequencies is convergent. This restriction amounts to introducing an axiom of convergence. (The problems connected with this axiom will not be dealt with until sections 63 to 66, since it turns out to be convenient to discuss them along with the ‘law of great numbers’.)

Thus we shall be concerned only with mathematical sequences. Yet we shall be concerned only with those mathematical sequences of which we expect, or conjecture, that they approximate, as regards frequencies, to empirical sequences of a chance-like or random character; for these are our main interest. But to expect, or to conjecture, of a mathematical sequence that it will, as regards frequencies, approximate to an empirical one is nothing else than to frame a hypothesis—a hypothesis about the frequencies of the empirical sequence.1

The fact that our estimates of the frequencies in empirical random sequences are hypotheses is without any influence on the way we may calculate these frequencies. Clearly, in connection with finite classes, it does not matter in the least how we obtain the frequencies from which we start our calculations. These frequencies may be obtained by actual counting, or from a mathematical rule, or from a hypothesis of some kind or other. Or we may simply invent them. In calculating frequencies we accept some frequencies as given, and derive other frequencies from them.

The same is true of estimates of frequencies in infinite sequences. Thus the question as to the ‘sources’ of our frequency estimates is not a problem of the calculus of probability; which, however, does not mean that it will be excluded from our discussion of the problems of probability theory.

In the case of infinite empirical sequences we can distinguish two main ‘sources’ of our hypothetical estimates of frequencies—that is to say, two ways in which they may suggest themselves to us. One is an estimate based upon an ‘equal-chance hypothesis’ (or equi-probability hypothesis), the other is an estimate based upon an extrapolation of statistical findings.

By an ‘equal-chance hypothesis’ I mean a hypothesis asserting that the probabilities of the various primary properties are equal: it is a hypothesis asserting equal distribution. Equal-chance hypotheses are usually based upon considerations of symmetry.2 A highly typical example is the conjecture of equal frequencies in dicing, based upon the symmetry and geometrical equivalence of the six faces of the cube.

For frequency hypotheses based on statistical extrapolation, estimates of rates of mortality provide a good example. Here statistical data about mortality are empirically ascertained; and upon the hypothesis that past trends will continue to be very nearly stable, or that they will not change much—at least during the period immediately ahead—an extrapolation to unknown cases is made from known cases, i.e. from occurrences which have been empirically classified, and counted.

People with inductivist leanings may tend to overlook the hypothetical character of these estimates: they may confuse a hypothetical estimate, i.e. a frequency-prediction based on statistical extrapolation, with one of its empirical ‘sources’—the classifying and actual counting of past occurrences and sequences of occurrences. The claim is often made that we ‘derive’ estimates of probabilities—that is, predictions of frequencies—from past occurrences which have been classified and counted (such as mortality statistics). But from a logical point of view there is no justification for this claim. We have made no logical derivation at all. What we may have done is to advance a non-verifiable hypothesis which nothing can ever justify logically: the conjecture that frequencies will remain constant, and so permit of extrapolation. Even equal-chance hypotheses are held to be ‘empirically derivable’ or ‘empirically explicable’ by some believers in inductive logic who suppose them to be based upon statistical experience, that is, upon empirically observed frequencies. For my own part I believe, however, that in making this kind of hypothetical estimate of frequency we are often guided solely by our reflections about the significance of symmetry, and by similar considerations. I do not see any reason why such conjectures should be inspired only by the accumulation of a large mass of inductive observations. However, I do not attach much importance to these questions about the origins or ‘sources’ of our estimates. (Cf. section 2.) It is more important, in my opinion, to be quite clear about the fact that every predictive estimate of frequencies, including one which we may get from statistical extrapolation—and certainly all those that refer to infinite empirical sequences—will always be pure conjecture since it will always go far beyond anything which we are entitled to affirm on the basis of observations.

My distinction between equal-chance hypotheses and statistical extrapolations corresponds fairly well to the classical distinction between ‘a priori’ and ‘a posteriori’ probabilities. But since these terms are used in so many different senses,3 and since they are, moreover, heavily tainted with philosophical associations, they are better avoided.

In the following examination of the axiom of randomness, I shall attempt to find mathematical sequences which approximate to random empirical sequences; which means that I shall be examining frequency-hypotheses.*2


58 AN EXAMINATION OF THE AXIOM OF RANDOMNESS

The concept of an ordinal selection (i.e. of a selection according to position) and the concept of a neighbourhood-selection, have both been introduced and explained in section 55. With the help of these concepts I will now examine von Mises’s axiom of randomness—the principle of the excluded gambling system—in the hope of finding a weaker requirement which is nevertheless able to take its place. In von Mises’s theory this ‘axiom’ is part of his definition of the concept of a collective: he demands that the limits of frequencies in a collective shall be insensitive to any kind of systematic selection whatsoever. (As he points out, a gambling system can always be regarded as a systematic selection.)

Most of the criticism which has been levelled against this axiom concentrates on a relatively unimportant and superficial aspect of its formulation. It is connected with the fact that, among the possible selections, there will be the selection, say, of those throws which come up five; and within this selection, obviously, the frequency of the fives will be quite different from what it is in the original sequence. This is why von Mises in his formulation of the axiom of randomness speaks of what he calls ‘selections’ or ‘choices’ which are ‘independent of the result’ of the throw in question, and are thus defined without making use of the property of the element to be selected.1 But the many attacks levelled against this formulation2 can all be answered merely by pointing out that we can formulate von Mises’s axiom of randomness without using the questionable expressions at all.3 For we may put it, for example, as follows: The limits of the frequencies in a collective shall be insensitive both to ordinal and to neighbourhood selection, and also to all combinations of these two methods of selection that can be used as gambling systems.*1

With this formulation the above mentioned difficulties disappear. Others however remain. Thus it might be impossible to prove that the concept of a collective, defined by means of so strong an axiom of randomness, is not self-contradictory; or in other words, that the class of ‘collectives’ is not empty. (The necessity for proving this has been stressed by Kamke.4) At least it seems to be impossible to construct an example of a collective and in that way to show that collectives exist. This is because an example of an infinite sequence which is to satisfy certain conditions can only be given by a mathematical rule. But for a collective in von Mises’s sense there can be, by definition, no such rule, since any rule could be used as a gambling system or as a system of selection. This criticism seems indeed unanswerable if all possible gambling systems are ruled out.*2

Against the idea of excluding all gambling systems, another objection may be raised, however: that it really demands too much. If we are going to axiomatize a system of statements—in this case the theorems of the calculus of probability, particularly the special theorem of multiplication or Bernoulli’s theorem—then the axioms chosen should not only be sufficient for the derivation of the theorems of the system, but also (if we can make them so) necessary. Yet the exclusion of all systems of selection can be shown to be unnecessary for the deduction of Bernoulli’s theorem and its corollaries. It is quite sufficient to demand the exclusion of a special class of neighbourhood-selection: it suffices to demand that the sequence should be insensitive to selections according to arbitrarily chosen n-tuples of predecessors; that is to say, that it should be n-free from after-effects for every n, or more briefly, that it should be ‘absolutely free’.

I therefore propose to replace von Mises’s principle of the excluded gambling system by the less exacting requirement of ‘absolute freedom’, in the sense of n-freedom for every n, and accordingly to define chance-like mathematical sequences as those which fulfil this requirement. The chief advantage of this is that it does not exclude all gambling systems, so that it is possible to give mathematical rules for constructing sequences which are ‘absolutely free’ in our sense, and hence to construct examples. (Cf. section (a) of appendix iv.) Thus Kamke’s objection, discussed above, is met. For we can now prove that the concept of chance-like mathematical sequences is not empty, and is therefore consistent.*3

It may seem odd, perhaps, that we should try to trace the highly irregular features of chance sequences by means of mathematical sequences which must conform to the strictest rules. Von Mises’s axiom of randomness may seem at first to be more satisfying to our intuitions. It seems quite satisfying to learn that a chance sequence must be completely irregular, so that every conjectured regularity will be found to fail, in some later part of the sequence, if only we keep on trying hard to falsify the conjecture by continuing the sequence long enough. But this intuitive argument benefits my proposal also. For if chance sequences are irregular, then, a fortiori, they will not be regular sequences of one particular type. And our requirement of ‘absolute freedom’ does no more than exclude one particular type of regular sequence, though an important one.

That it is an important type may be seen from the fact that by our requirement we implicitly exclude the following three types of gambling systems (cf. the next section). First we exclude ‘normal’ or ‘pure’*4 neighbourhood selections, i.e. those in which we select according to some constant characteristic of the neighbourhood. Secondly we exclude ‘normal’ ordinal selection which picks out elements whose distance apart is constant, such as the elements numbered k, n + k, 2n + k ... and so on. And finally, we exclude [many] combinations of these two types of selection (for example the selection of every nth element, provided its neighbourhood has certain specified [constant] characteristics). A characteristic property of all these selections is that they do not refer to an absolute first element of the sequence; they may thus yield the same selected sub-sequence if the numbering of the original sequence begins with another (appropriate) element. Thus the gambling systems which are excluded by my requirement are those which could be used without knowing the first element of the sequence: the systems excluded are invariant with respect to certain (linear) transformations: they are the simple gambling systems (cf. section 43). Only*5 gambling systems which refer to the absolute distances of the elements from an absolute (initial) element5 are not excluded by my requirement.

The requirement of n-freedom for every n—of ‘absolute freedom’— also seems to agree quite well with what most of us, consciously or unconsciously, believe to be true of chance sequences; for example that the result of the next throw of a die does not depend upon the results of preceding throws. (The practice of shaking the die before the throw is intended to ensure this ‘independence’.)


59 CHANCE-LIKE SEQUENCES. OBJECTIVE PROBABILITY

In view of what has been said I now propose the following definition. An event-sequence or property-sequence, especially an alternative, is said to be ‘chance-like’ or ‘random’ if and only if the limits of the frequencies of its primary properties are ‘absolutely free’, i.e. insensitive to every selection based upon the properties of any n-tuple of predecessors. A frequency-limit corresponding to a sequence which is random is called the objective probability of the property in question, within the sequence concerned; it is symbolized by F. This may also be put as follows. Let the sequence α be a chance-like or random-like sequence with the primary property β; in this case, the following holds:

αF(β) = αF′(β)

We shall have to show now that our definition suffices for the derivation of the main theorems of the mathematical theory of probability, especially Bernoulli’s theorem. Subsequently—in section 64—the definition here given will be modified so as to make it independent of the concept of a limit of frequencies.*1

60 BERNOULLI’S PROBLEM

The first binomial formula which was mentioned in section 56, viz.

α(n)F″(m) = ⁿCₘpᵐqⁿ⁻ᵐ        (1)

holds for finite sequences of overlapping segments. It is derivable on the assumption that the finite sequence α is at least n-1-free. Upon the same assumption, we immediately obtain an exactly corresponding formula for infinite sequences; that is to say, if α is infinite and at least n-1-free, then

α(n)F′(m) = ⁿCₘpᵐqⁿ⁻ᵐ        (2)

Since chance-like sequences are absolutely free, i.e. n-free for every n, formula (2), the second binomial formula, must also apply to them; and it must apply to them, indeed, for whatever value of n we may choose.

In what follows, we shall be concerned only with chance-like sequences, or random sequences (as defined in the foregoing section). We are going to show that, for chance-like sequences, a third binomial formula (3) must hold in addition to formula (2); it is the formula

$$
{}_{\alpha_{n}}F(m) \;=\; {}^{n}C_{m}\,p^{m}q^{n-m} \qquad (3)
$$

Formula (3) differs from formula (2) in two ways: First, it is asserted for sequences of adjoining segments αn instead of for sequences of overlapping segments α(n). Secondly, it does not contain the symbol F′ but the symbol F. This means that it asserts, by implication, that the sequences of adjoining segments are in their turn chance-like, or random; for F, i.e. objective probability, is defined only for chance-like sequences.

The question, answered by (3), of the objective probability of the property m in a sequence of adjoining segments—i.e. the question of the value of αnF(m)—I call, following von Mises, ‘Bernoulli’s problem’.1 For its solution, and hence for the derivation of the third binomial formula (3), it is sufficient to assume that α is chance-like or random.2 (Our task is equivalent to that of showing that the special theorem of multiplication holds for the sequence of adjoining segments of a random sequence α.)

The proof*1 of formula (3) may be carried out in two steps. First we show that formula (2) holds not only for sequences of overlapping segments α(n), but also for sequences of adjoining segments αn. Secondly, we show that the latter are ‘absolutely free’. (The order of these steps cannot be reversed, because a sequence of overlapping segments α(n) is definitely not ‘absolutely free’; in fact, a sequence of this kind provides a typical example of what may be called ‘sequences with after-effects’.3)

First step. Sequences of adjoining segments αn are sub-sequences of α(n). They can be obtained from these by normal ordinal selection. Thus if we can show that the limits of the frequencies in overlapping sequences α(n)F΄(m) are insensitive to normal ordinal selection, we have taken our first step (and even gone a little farther); for we shall have proved the formula:

$$
{}_{\alpha_{n}}F'(m) \;=\; {}_{\alpha(n)}F'(m) \qquad (4)
$$

I shall first sketch this proof in the case of n = 2; i.e. I shall show that

$$
{}_{\alpha_{2}}F'(m) \;=\; {}_{\alpha(2)}F'(m) \qquad (4a)
$$

is true; it will then be easy to generalize this formula for every n.

From the sequence of overlapping segments α(2) we can select two and only two distinct sequences α2 of adjoining segments; one, which will be denoted by (A), contains the first, third, fifth,..., segments of α(2), that is, the pairs of α consisting of the numbers 1,2; 3,4; 5,6;... The other, denoted by (B), contains the second, fourth, sixth,..., segments of α(2), that is, the pairs of elements of α consisting of the numbers 2,3; 4,5; 6,7;..., etc. Now assume that formula (4a) does not hold for one of the two sequences, (A) or (B), so that the segment (i.e. the pair) 0,0 occurs too often in, say, the sequence (A); then in sequence (B) a complementary deviation must occur; that is, the segment 0,0 will occur not often enough (‘too often’, or ‘not often enough’, as compared with the binomial formula). But this contradicts the assumed ‘absolute freedom’ of α. For if the pair 0,0 occurs in (A) more often than in (B), then in sufficiently long segments of α the pair 0,0 must appear more often at certain characteristic distances apart than at other distances. The more frequent distances would be those which would obtain if the 0,0 pairs belonged to one of the two α2-sequences. The less frequent distances would be those which would obtain if they belonged to both α2-sequences. But this would contradict the assumed ‘absolute freedom’ of α; for according to the second binomial formula, the ‘absolute freedom’ of α entails that the frequency with which a particular sequence of the length n occurs in any α(n)-sequence depends only on the number of ones and zeros occurring in it, and not on their arrangement in the sequence.*2

This proves (4a); and since this proof can easily be generalized for any n, the validity of (4) follows; which completes the first step of the proof.
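The counting argument of this first step is easily reproduced numerically. In the following sketch (Python; once again a finite pseudo-random sample merely stands in for an ‘absolutely free’ α), the overlapping pairs α(2) are split into the two adjoining selections (A) and (B), and the frequencies of the four possible pairs are compared:

```python
import random

alpha = [random.randint(0, 1) for _ in range(200_000)]

# Overlapping 2-segments alpha(2); (A) takes the 1st, 3rd, 5th, ...
# of these segments, (B) the 2nd, 4th, 6th, ...
overlapping = [(alpha[i], alpha[i + 1]) for i in range(len(alpha) - 1)]
A = overlapping[0::2]
B = overlapping[1::2]

def pair_freqs(segments):
    return {pair: round(sum(s == pair for s in segments) / len(segments), 3)
            for pair in [(0, 0), (0, 1), (1, 0), (1, 1)]}

print("alpha(2):", pair_freqs(overlapping))
print("(A):     ", pair_freqs(A))
print("(B):     ", pair_freqs(B))
```

If (A) showed an excess of some pair, (B) would have to show the corresponding deficit; for a chance-like α all three tables nearly coincide, as formula (4a) requires.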

Second step. The fact that the αn-sequences are ‘absolutely free’ can be shown by a very similar argument. Again, we first consider α2-sequences only; and with respect to these it will only be shown, to start with, that they are 1-free. Assume that one of the two α2-sequences, e.g. the sequence (A), is not 1-free. Then in (A) after at least one of the segments consisting of two elements (a particular α-pair), say after the segment 0,0, another segment, say 1,1, must follow more often than would be the case if (A) were ‘absolutely free’; this means that the segment 1,1 would appear with greater frequency in the sub-sequence selected from (A) according to the predecessor-segment 0,0 than the binomial formula would lead us to expect.

This assumption, however, contradicts the ‘absolute freedom’ of the sequence α. For if the segment 1,1 follows in (A) the segment 0,0 too frequently then, by way of compensation, the converse must take place in (B); for otherwise the quadruple 0,0,1,1 would, in a sufficiently long segment of α, occur too often at certain characteristic distances apart— namely at the distances which would obtain if the double pairs in question belonged to one and the same α2-sequence. Moreover, at other characteristic distances the quadruple would occur not often enough—at those distances, namely, which would obtain if they belonged to both α2-sequences. Thus we are confronted with precisely the same situation as before; and we can show, by analogous considerations, that the assumption of a preferential occurrence at characteristic distances is incompatible with the assumed ‘absolute freedom’ of α.

This proof can again be generalized, so that we may say of αn-sequences that they are not only 1-free but n-free for every n; and hence that they are chance-like, or random.

This completes our sketch of the two steps. Thus we are now entitled to replace, in (4), F′ by F; and this means that we may accept the claim that the third binomial formula solves Bernoulli’s problem.

Incidentally we have shown that sequences α(n) of overlapping segments are insensitive to normal ordinal selection whenever α is ‘absolutely free’.

The same is also true for sequences αn of adjoining segments, because every normal ordinal selection from αn can be regarded as a normal ordinal selection from α(n); and it must therefore apply to the sequence α itself, since α is identical with both α(1) and α1.

We have thus shown, among other things, that from ‘absolute freedom’—which means insensitiveness to a special type of neighbourhood selection—insensitiveness to normal ordinal selection follows. A further consequence, as can easily be seen, is insensitiveness to any ‘pure’ neighbourhood selection (that is, selection according to a constant characterization of its neighbourhood—a characterization that does not vary with the ordinal number of the element). And it follows, finally, that ‘absolute freedom’ will entail insensitivity to all*3 combinations of these two types of selection.


61 THE LAW OF GREAT NUMBERS (BERNOULLI’S THEOREM)

Bernoulli’s theorem, or the (first1) ‘law of great numbers’ can be derived from the third binomial formula by purely arithmetical reasoning, under the assumption that we can take n to the limit, n→∞. It can therefore be asserted only of infinite sequences α; for it is only in these that the n-segments of αn-sequences can increase in length indefinitely. And it can be asserted only of such sequences α as are ‘absolutely free’, for it is only under the assumption of n-freedom for every n that we can take n to the limit, n→∞.

Bernoulli’s theorem provides the solution of a problem which is closely akin to the problem which (following von Mises) I have called ‘Bernoulli’s problem’, viz. the problem of the value of αnF(m). As indicated in section 56, an n-segment may be said to have the property ‘m’ when it contains precisely m ones; the relative frequency of ones within this (finite) segment is then, of course, m/n. We may now define: An n-segment of α has the property ‘Δp’ if and only if the relative frequency of its ones deviates by less than δ from the value αF(1) = p, i.e. the probability of ones in the sequence α; here, δ is any small fraction, chosen as near to zero as we like (but different from zero). We can express this condition by saying: an n-segment has the property ‘Δp’ if and only if $|m/n - p| < \delta$; otherwise, the segment has the property ‘$\overline{\Delta p}$’. Now Bernoulli’s theorem answers the question of the value of the frequency, or probability, of segments of this kind—of segments possessing the property ‘Δp’—within the αn-sequences; it thus answers the question of the value of αnF(Δp).

Intuitively one might guess that if the value δ (with δ > 0) is fixed, and if n increases, then the frequency of these segments with the property Δp, and therefore the value of αnF(Δp), will also increase (and that its increase will be monotonic). Bernoulli’s proof (which can be found in any textbook on the calculus of probability) proceeds by evaluating this increase with the help of the binomial formula. He finds that if n increases without limit, the value of αnF(Δp) approaches the maximal value 1, for any fixed value of δ, however small. This may be expressed in symbols by

$$
\lim_{n\to\infty} {}_{\alpha_{n}}F(\Delta p) \;=\; 1 \qquad \text{(for any value of } \Delta p\text{)} \qquad (1)
$$

This formula results from transforming the third binomial formula for sequences of adjoining segments. The analogous second binomial formula for sequences of overlapping segments would immediately lead, by the same method, to the corresponding formula

$$
\lim_{n\to\infty} {}_{\alpha(n)}F'(\Delta p) \;=\; 1 \qquad (2)
$$

which is valid for sequences of overlapping segments and normal ordinal selection from them, and hence for sequences with after-effects (which have been studied by Smoluchowski2). Formula (2) itself yields (1) in case sequences are selected which do not overlap, and which are therefore n-free. (2) may be described as a variant of Bernoulli’s theorem; and what I am going to say here about Bernoulli’s theorem applies mutatis mutandis to this variant.
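The limit asserted by formula (1) can also be checked by direct computation. The following sketch (Python; bernoulli_freq is my own name for the exact binomial sum, evaluated in logarithms so that large n does not overflow) computes αnF(Δp) for a fixed δ and increasing n:

```python
from math import lgamma, log, exp

def bernoulli_freq(n, p, delta):
    """The exact binomial sum for alpha_n F(Delta p): the probability
    that an n-segment's relative frequency m/n deviates from p
    by less than delta."""
    def log_pmf(m):
        return (lgamma(n + 1) - lgamma(m + 1) - lgamma(n - m + 1)
                + m * log(p) + (n - m) * log(1 - p))
    return sum(exp(log_pmf(m))
               for m in range(n + 1) if abs(m / n - p) < delta)

p, delta = 0.5, 0.02
for n in [100, 1_000, 10_000, 100_000]:
    print(n, round(bernoulli_freq(n, p, delta), 6))
```

For any fixed δ > 0, however small, the printed values climb towards 1 as n grows, exactly as the theorem asserts.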


Bernoulli’s theorem, i.e. formula (1), may be expressed in words as follows. Let us call a long finite segment of some fixed length, selected from a random sequence α, a ‘fair sample’ if, and only if, the frequency of the ones within this segment deviates from p, i.e. the value of the probability of the ones within the random sequence α, by no more than some small fixed fraction (which we may freely choose). We can then say that the probability of chancing upon a fair sample approaches 1 as closely as we like if only we make the segments in question sufficiently long.*1

In this formulation the word ‘probability’ (or ‘value of the probability’) occurs twice. How is it to be interpreted or translated here? In the sense of my frequency definition it would have to be translated as follows (I italicize the two translations of the word ‘probability’ into the frequency language): The overwhelming majority of all sufficiently long finite segments will be ‘fair samples’; that is to say, their relative frequency will deviate from the frequency value p of the random sequence in question by an arbitrarily fixed small amount; or, more briefly: The frequency p is realized, approximately, in almost all sufficiently long segments. (How we arrive at the value p is irrelevant to our present discussion; it may be, say, the result of a hypothetical estimate.)

Bearing in mind that the Bernoulli frequency αnF(Δp) increases monotonically with the increasing length n of the segment and that it decreases monotonically with decreasing n, and that, therefore, the value of the relative frequency is comparatively rarely realized in short segments, we can also say:

Bernoulli’s theorem states that short segments of ‘absolutely free’ or chance-like sequences will often show relatively great deviations from p and thus relatively great fluctuations, while the longer segments, in most cases, will show smaller and smaller deviations from p with increasing length. Consequently, most deviations in sufficiently long segments will become as small as we like; or in other words, great deviations will become as rare as we like.

Accordingly, if we take a very long segment of a random sequence, in order to find the frequencies within its sub-sequences by counting, or perhaps by the use of other empirical and statistical methods, then we shall get, in the vast majority of cases, the following result. There is a characteristic average frequency, such that the relative frequencies in the whole segment, and in almost all long sub-segments, will deviate only slightly from this average, whilst the relative frequencies of smaller sub-segments will deviate further from this average, and the more often, the shorter we choose them. This fact, this statistically ascertainable behaviour of finite segments, may be referred to as their ‘quasi-convergent-behaviour’; or as the fact that random sequences are statistically stable.*2

Thus Bernoulli’s theorem asserts that the smaller segments of chance-like sequences often show large fluctuations, whilst the large segments always behave in a manner suggestive of constancy or convergence; in short, that we find disorder and randomness in the small, order and constancy in the great. It is this behaviour to which the expression ‘the law of great numbers’ refers.


62 BERNOULLI’S THEOREM AND THE INTERPRETATION OF PROBABILITY STATEMENTS

We have just seen that in the verbal formulation of Bernoulli’s theorem the word ‘probability’ occurs twice.

The frequency theorist has no difficulty in translating this word, in both cases, in accordance with its definition: he can give a clear interpretation of Bernoulli’s formula and the law of great numbers. Can the adherent of the subjective theory in its logical form do the same?

The subjective theorist who wants to define ‘probability’ as ‘degree of rational belief’ is perfectly consistent, and within his rights, when he interprets the words ‘The probability of... approaches 1 as closely as we like’ as meaning, ‘It is almost certain1 that...’. But he merely obscures his difficulties when he continues ‘... that the relative frequency will deviate from its most probable value p by less than a given amount...’, or in the words of Keynes,2 ‘that the proportion of the event’s occurrences will diverge from the most probable proportion p by less than a given amount... ’. This sounds like good sense, at least on first hearing. But if here too we translate the word ‘probable’ (sometimes suppressed) in the sense of the subjective theory, then the whole story runs: ‘It is almost certain that the relative frequencies deviate from the value p of the degree of rational belief by less than a given amount...’, which seems to me complete nonsense.*1 For relative frequencies can be compared only with relative frequencies, and can deviate or not deviate only from relative frequencies. And clearly, it must be inadmissible to give, after the deduction of Bernoulli’s theorem, a meaning to p different from the one which was given to it before the deduction.3

Thus we see that the subjective theory is incapable of interpreting Bernoulli’s formula in terms of the statistical law of great numbers. Derivation of statistical laws is possible only within the framework of the frequency theory. If we start from a strict subjective theory, we shall never arrive at statistical statements—not even if we try to bridge the gulf with Bernoulli’s theorem.*2

63 BERNOULLI’S THEOREM AND THE PROBLEM OF CONVERGENCE

From the point of view of epistemology, my deduction of the law of great numbers, outlined above, is unsatisfactory; for the part played in our analysis by the axiom of convergence is far from clear.

I have in effect tacitly introduced an axiom of this kind, by confining my investigation to mathematical sequences with frequency limits. (Cf. section 57.) Consequently one might even be tempted to think that our result—the derivation of the law of great numbers—is trivial; for the fact that ‘absolutely free’ sequences are statistically stable might be regarded as entailed by their convergence, which has been assumed axiomatically, if only implicitly.

But this view would be mistaken, as von Mises has clearly shown. For there are sequences1 which satisfy the axiom of convergence although Bernoulli’s theorem does not hold for them, since with a frequency close to 1, segments of any length occur in them which may deviate from p to any extent. (The existence of the limit p is in these cases due to the fact that the deviations, although they may increase without limit, cancel each other.) Such sequences look as if they were divergent in arbitrarily large segments, even though the corresponding frequency sequences are in fact convergent. Thus the law of great numbers is anything but a trivial consequence of the axiom of convergence, and this axiom is quite insufficient for its deduction. This is why my modified axiom of randomness, the requirement of ‘absolute freedom’, cannot be dispensed with.

Our reconstruction of the theory, however, suggests the possibility that the law of great numbers may be independent of the axiom of convergence. For we have seen that Bernoulli’s theorem follows immediately from the binomial formula; moreover, I have shown that the first binomial formula can be derived for finite sequences and so, of course, without any axiom of convergence. All that had to be assumed was that the reference-sequence α was at least n-1-free; an assumption from which the validity of the special multiplication theorem followed, and with it that of the first binomial formula. In order to make the transition to the limit, and to obtain Bernoulli’s theorem, it is only necessary to assume that we may make n as large as we like. From this it can be seen that Bernoulli’s theorem is true, approximately, even for finite sequences, if they are n-free for an n which is sufficiently large.

It seems therefore that the deduction of Bernoulli’s theorem does not depend upon an axiom postulating the existence of a frequency limit, but only on ‘absolute freedom’ or randomness. The limit concept plays only a subordinate rôle: it is used for the purpose of applying some conception of relative frequency (which, in the first instance, is only defined for finite classes, and without which the concept of n-freedom cannot be formulated) to sequences that can be continued indefinitely.

Moreover, it should not be forgotten that Bernoulli himself deduced his theorem within the framework of the classical theory, which contains no axiom of convergence; also, that the definition of probability as a limit of frequencies is only an interpretation—and not the only possible one—of the classical formalism.

I shall try to justify my conjecture—the independence of Bernoulli’s theorem of the axiom of convergence—by deducing this theorem without assuming anything except n-freedom (to be appropriately defined).*1 And I shall try to show that it holds even for those mathematical sequences whose primary properties possess no frequency limits.

Only if this can be shown shall I regard my deduction of the law of great numbers as satisfactory from the point of view of the epistemologist. For it is a ‘fact of experience’—or so at least we are sometimes told—that chance-like empirical sequences show that peculiar behaviour which I have described as ‘quasi-convergent’ or ‘statistically stable’. (Cf. section 61.) By recording statistically the behaviour of long segments one can establish that the relative frequencies approach closer and closer to a definite value, and that the intervals within which the relative frequencies fluctuate become smaller and smaller. This so-called ‘empirical fact’, so much discussed and analysed, which is indeed often regarded as the empirical corroboration of the law of great numbers, can be viewed from various angles. Thinkers with inductivist leanings mostly regard it as a fundamental law of nature, not reducible to any simpler statement; as a peculiarity of our world which has simply to be accepted. They believe that expressed in a suitable form—for example in the form of the axiom of convergence—this law of nature should be made the basis of the theory of probability which would thereby assume the character of a natural science.

My own attitude to this so-called ‘empirical fact’ is different. I am inclined to believe that it is reducible to the chance-like character of the sequences; that it may be derived from the fact that these sequences are n-free. I see the great achievement of Bernoulli and Poisson in the field of probability theory precisely in their discovery of a way to show that this alleged ‘fact of experience’ is a tautology, and that from disorder in the small (provided it satisfies a suitably formulated condition of n-freedom), there follows logically a kind of order or stability in the large.

If we succeed in deducing Bernoulli’s theorem without assuming an axiom of convergence, then we shall have reduced the epistemological problem of the law of great numbers to one of axiomatic independence, and thus to a purely logical question. This deduction would also explain why the axiom of convergence works quite well in all practical applications (in attempts to calculate the approximate behaviour of empirical sequences). For even if the restriction to convergent sequences should turn out to be unnecessary, it can certainly not be inappropriate to use convergent mathematical sequences for calculating the approximate behaviour of empirical sequences which, on logical grounds, are statistically stable.

64 ELIMINATION OF THE AXIOM OF CONVERGENCE. SOLUTION OF THE ‘FUNDAMENTAL PROBLEM OF THE THEORY OF CHANCE’

So far frequency limits have had no other function in our reconstruction of the theory of probability than that of providing an unambiguous concept of relative frequency applicable to infinite sequences, so that with its help we may define the concept of ‘absolute freedom’ (from after-effects). For it is a relative frequency which is required to be insensitive to selection according to predecessors.

Earlier we restricted our inquiry to alternatives with frequency limits, thus tacitly introducing an axiom of convergence. Now, so as to free us from this axiom, I shall remove the restriction without replacing it by any other. This means that we shall have to construct a frequency concept which can take over the function of the discarded frequency limit, and which may be applied to all infinite reference sequences.*1

One frequency concept fulfilling these conditions is the concept of a point of accumulation of the sequence of relative frequencies. (A value a is said to be a point of accumulation of a sequence if after any given element there are elements deviating from a by less than a given amount, however small.) That this concept is applicable without restriction to all infinite reference sequences may be seen from the fact that for every infinite alternative at least one such point of accumulation must exist for the sequence of relative frequencies which corresponds to it. Since relative frequencies can never be greater than 1 nor less than 0, a sequence of them must be bounded by 1 and 0. And as an infinite bounded sequence, it must (according to a famous theorem of Bolzano and Weierstrass) have at least one point of accumulation.1

For brevity, every point of accumulation of the sequence of relative frequencies corresponding to an alternative α will be called ‘a middle frequency of α’. We can then say: If a sequence α has one and only one middle frequency, then this is at the same time its frequency limit; and conversely: if it has no frequency limit, then it has more than one2 middle frequency.
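A concrete example may make the notion of a middle frequency more tangible. The following construction (Python; alternating blocks of 1s and 0s of doubling length, a standard textbook device rather than anything taken from the text) produces an alternative whose relative frequencies oscillate for ever between roughly 1/3 and 2/3, so that it has no frequency limit but a whole range of middle frequencies; note that, being perfectly regular, it illustrates only the accumulation-point concept, not randomness:

```python
def oscillating_alternative(n_blocks=20):
    """Alternating blocks of 1s and 0s whose lengths double:
    one 1, two 0s, four 1s, eight 0s, ...  The running relative
    frequency of 1s then has no limit."""
    seq = []
    for k in range(n_blocks):
        seq.extend([1 - (k % 2)] * 2**k)
    return seq

seq = oscillating_alternative()
ones, freqs = 0, []
for i, x in enumerate(seq, start=1):
    ones += x
    freqs.append(ones / i)

# After the initial transient the running frequency keeps swinging
# between about 1/3 and 2/3; both are points of accumulation.
print(round(min(freqs[1000:]), 3), round(max(freqs[1000:]), 3))
```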

The idea of a middle frequency will be found very suitable for our purpose. Just as previously it was our estimate—perhaps a hypothetical estimate—that p was the frequency limit of a sequence α, so we now work with the estimate that p is a middle frequency of α. And provided we take certain necessary precautions,3 we can make calculations with the help of these estimated middle frequencies, in a way analogous to that in which we calculate with frequency limits. Moreover the concept of middle frequency is applicable to all possible infinite reference sequences, without any restriction.

If we now try to interpret our symbol αF′(β) as a middle frequency, rather than a frequency limit, and if we accordingly alter the definition of objective probability (section 59), most of our formulae will still be derivable. One difficulty arises, however: middle frequencies are not unique. If we estimate or conjecture that a middle frequency is αF′(β) = p, then this does not exclude the possibility that there are values of αF′(β) other than p. If we postulate that this shall not be so, we thereby introduce, by implication, the axiom of convergence. If on the other hand we define objective probability without such a postulate of uniqueness,4 then we obtain (in the first instance, at least) a concept of probability which is ambiguous; for under certain circumstances a sequence may possess at the same time several middle frequencies which are ‘absolutely free’ (cf. section c of appendix iv). But this is hardly acceptable, since we are accustomed to work with unambiguous or unique probabilities; to assume, that is to say, that for one and the same property there can be one and only one probability p, within one and the same reference sequence.

However, the difficulty of defining a unique probability concept without the limit axiom can easily be overcome. We may introduce the requirement of uniqueness (as is, after all, the most natural procedure) as the last step, after having postulated that the sequence shall be ‘absolutely free’. This leads us to propose, as a solution of our problem, the following modification of our definition of chance-like sequences, and of objective probability.

Let α be an alternative (with one or several middle frequencies). Let the ones of α have one and only one middle frequency p that is ‘absolutely free’; then we say that α is chance-like or random, and that p is the objective probability of the ones, within α.

It will be helpful to divide this definition into two axiomatic requirements.*2

(1) Requirement of randomness: for an alternative to be chance-like, there must be at least one ‘absolutely free’ middle frequency, i.e. its objective probability p.

(2) Requirement of uniqueness: for one and the same property of one and the same chance-like alternative, there must be one and only one probability p.

The consistency of the new axiomatic system is ensured by the example previously constructed. It is possible to construct sequences which, whilst they have one and only one probability, yet possess no frequency limit (cf. section b of appendix iv). This shows that the new axiomatic demands are actually wider, or less exacting, than the old ones. This fact will become even more evident if we state (as we may) our old axioms in the following form:

(1) Requirement of randomness: as above.

(2) Requirement of uniqueness: as above.

(2′) Axiom of convergence: for one and the same property of one and the same chance-like alternative there exists no further middle frequency apart from its probability p.

From the proposed system of requirements we can deduce Bernoulli’s theorem, and with it all the theorems of the classical calculus of probability. This solves our problem: it is now possible to deduce the law of great numbers within the framework of the frequency theory without using the axiom of convergence. Moreover, not only do formula (1) of section 61 and the verbal formulation of Bernoulli’s theorem remain unchanged,5 but the interpretation we have given to it also remains unchanged: in the case of a chance-like sequence without a frequency limit it will still be true that almost all sufficiently long sequences show only small deviations from p. In such sequences (as in chance-like sequences with frequency limits) segments of any length behaving quasi-divergently will of course occur at times, i.e. segments which deviate from p by any amount. But such segments will be comparatively rare, since they must be compensated for by extremely long parts of the sequence in which all (or almost all) segments behave quasi-convergently. As calculation shows, these stretches will have to be longer by several orders of magnitude, as it were, than the divergently-behaving segments for which they compensate.*3

This is also the place to solve the ‘fundamental problem of the theory of chance’ (as it was called in section 49). The seemingly paradoxical inference from the unpredictability and irregularity of singular events to the applicability of the rules of the probability calculus to them is indeed valid. It is valid provided we can express the irregularity, with a fair degree of approximation, in terms of the hypothetical assumption that one only of the recurring frequencies—of the ‘middle frequencies’—so occurs in any selection according to predecessors that no after-effects result. For upon these assumptions it is possible to prove that the law of great numbers is tautological. It is admissible and not self-contradictory (as has sometimes been asserted6) to uphold the conclusion that in an irregular sequence in which, as it were, anything may happen at one time or another—though some things only rarely—a certain regularity or stability will appear in very large sub-sequences. Nor is this conclusion trivial, since we need for it specific mathematical tools (the Bolzano and Weierstrass theorem, the concept of n-freedom, and Bernoulli’s theorem). The apparent paradox of an argument from unpredictability to predictability, or from ignorance to knowledge, disappears when we realize that the assumption of irregularity can be put in the form of a frequency hypothesis (that of freedom from after-effects), and that it must be put in this form if we want to show the validity of that argument.

It now also becomes clear why the older theories have been unable to do justice to what I call the ‘fundamental problem’. The subjective theory, admittedly, can deduce Bernoulli’s theorem; but it can never consistently interpret it in terms of frequencies, after the fashion of the law of great numbers (cf. section 62). Thus it can never explain the statistical success of probability predictions. On the other hand, the older frequency theory, by its axiom of convergence, explicitly postulates regularity in the large. Thus within this theory the problem of inference from irregularity in the small to stability in the large does not arise, since it merely involves inference from stability in the large (axiom of convergence), coupled with irregularity in the small (axiom of randomness) to a special form of stability in the large (Bernoulli’s theorem, law of great numbers).*4

The axiom of convergence is not a necessary part of the foundations of the calculus of probability. With this result I conclude my analysis of the mathematical calculus.7

We now return to the consideration of more distinctively methodological problems, especially the problem of how to decide probability statements.


65 THE PROBLEM OF DECIDABILITY

In whatever way we may define the concept of probability, or whatever axiomatic formulations we choose: so long as the binomial formula is derivable within the system, probability statements will not be falsifiable. Probability hypotheses do not rule out anything observable; probability estimates cannot contradict, or be contradicted by, a basic statement; nor can they be contradicted by a conjunction of any finite number of basic statements; and accordingly not by any finite number of observations either.

Let us assume that we have proposed an equal-chance hypothesis for some alternative α; for example, that we have estimated that tosses with a certain coin will come up ‘1’ and ‘0’ with equal frequency, so that αF(1) = αF(0) = 1/2; and let us assume that we find, empirically, that ‘1’ comes up over and over again without exception: then we shall, no doubt, abandon our estimate in practice, and regard it as falsified. But there can be no question of falsification in a logical sense. For we can surely observe only a finite sequence of tosses. And although, according to the binomial formula, the probability of chancing upon a very long finite segment with great deviations from 1/2 is exceedingly small, it must yet always remain greater than zero. A sufficiently rare occurrence of a finite segment with even the greatest deviation can thus never contradict the estimate. In fact, we must expect it to occur: this is a consequence of our estimate. The hope that the calculable rarity of any such segment will be a means of falsifying the probability estimate proves illusory, since even a frequent occurrence of a long and greatly deviating segment may always be said to be nothing but one occurrence of an even longer and more greatly deviating segment. Thus there are no sequences of events, given to us extensionally, and therefore no finite n-tuple of basic statements, which could falsify a probability statement.

Only an infinite sequence of events—defined intensionally by a rule—could contradict a probability estimate. But this means, in view of the considerations set forth in section 38 (cf. section 43), that probability hypotheses are unfalsifiable because their dimension is infinite. We should therefore really describe them as empirically uninformative, as void of empirical content.1

Yet any such view is clearly unacceptable in face of the successes which physics has achieved with predictions obtained from hypothetical estimates of probabilities. (This is the same argument as has been used here much earlier against the interpretation of probability statements as tautologies by the subjective theory.) Many of these estimates are not inferior in scientific significance to any other physical hypothesis (for example, to one of a determinist character). And a physicist is usually quite well able to decide whether he may for the time being accept some particular probability hypothesis as ‘empirically confirmed’, or whether he ought to reject it as ‘practically falsified’, i.e., as useless for purposes of prediction. It is fairly clear that this ‘practical falsification’ can be obtained only through a methodological decision to regard highly improbable events as ruled out—as prohibited. But with what right can they be so regarded? Where are we to draw the line? Where does this ‘high improbability’ begin?

Since there can be no doubt, from a purely logical point of view, about the fact that probability statements cannot be falsified, the equally indubitable fact that we use them empirically must appear as a fatal blow to my basic ideas on method which depend crucially upon my criterion of demarcation. Nevertheless I shall try to answer the questions I have raised—which constitute the problem of decidability—by a resolute application of these very ideas. But to do this, I shall first have to analyse the logical form of probability statements, taking account both of the logical inter-relations between them and of the logical relations in which they stand to basic statements.*1


66 THE LOGICAL FORM OF PROBABILITY STATEMENTS

Probability estimates are not falsifiable. Neither, of course, are they verifiable, and this for the same reasons as hold for other hypotheses, seeing that no experimental results, however numerous and favourable, can ever finally establish that the relative frequency of ‘heads’ is, and will always be, 1/2.

Probability statements and basic statements can thus neither contradict one another nor entail one another. And yet, it would be a mistake to conclude from this that no kind of logical relations hold between probability statements and basic statements. And it would be equally wide of the mark to believe that while logical relations do obtain between statements of these two kinds (since sequences of observations may obviously agree more or less closely with a frequency statement), the analysis of these relations compels us to introduce a special probabilistic logic1 which breaks the fetters of classical logic. In opposition to such views I believe that the relations in question can be fully analysed in terms of the ‘classical’ logical relations of deducibility and contradiction.*1

From the non-falsifiability and non-verifiability of probability statements it can be inferred that they have no falsifiable consequences, and that they cannot themselves be consequences of verifiable statements. But the converse possibilities are not excluded. For it may be (a) that they have unilaterally verifiable consequences (purely existential consequences, or there-is-consequences) or (b) that they are themselves consequences of unilaterally falsifiable universal statements (all-statements).

Possibility (b) will scarcely help to clarify the logical relation between probability statements and basic statements: it is only too obvious that a non-falsifiable statement, i.e. one which says very little, can belong to the consequence class of one which is falsifiable, and which thus says more.

What is of greater interest for us is possibility (a) which is by no means trivial, and which in fact turns out to be fundamental for our analysis of the relation between probability statements and basic statements. For we find that from every probability statement, an infinite class of existential statements can be deduced, but not vice versa. (Thus the probability statement asserts more than does any of these existential statements.) For example, let p be a probability which has been estimated, hypothetically, for a certain alternative (and let 0 ≠ p ≠ 1); then we can deduce from this estimate, for instance, the existential consequence that both ones and zeros will occur in the sequence. (Of course many far less simple consequences also follow—for example, that segments will occur which deviate from p only by a very small amount.)

But we can deduce much more from this estimate; for example that there will ‘over and over again’ be an element with the property ‘1’ and another element with the property ‘0’; that is to say, that after any element x there will occur in the sequence an element y with the property ‘1’, and also an element z with the property ‘0’. A statement of this form (‘for every x there is a y with the observable, or extensionally testable, property α’) is both non-falsifiable—because it has no falsifiable consequences—and non-verifiable—because of the ‘all’ or ‘for every’ which makes it hypothetical.*2 Nevertheless, it can be better, or less well, ‘confirmed’—in the sense that we may succeed in verifying many, few, or none of its existential consequences; thus it stands to the basic statement in the relation which appears to be characteristic of probability statements. Statements of the above form may be called ‘universalized existential statements’ or (universalized) ‘existential hypotheses’.

My contention is that the relation of probability estimates to basic statements, and the possibility of their being more, or less, well ‘confirmed’, can be understood by considering the fact that from all probability estimates, existential hypotheses are logically deducible. This suggests the question whether the probability statements themselves may not, perhaps, have the form of existential hypotheses.

Every (hypothetical) probability estimate entails the conjecture that the empirical sequence in question is, approximately, chance-like or random. That is to say, it entails the (approximate) applicability, and the truth, of the axioms of the calculus of probability. Our question is, therefore, equivalent to the question whether these axioms represent what I have called ‘existential hypotheses’.

If we examine the two requirements proposed in section 64 then we find that the requirement of randomness has in fact the form of an existential hypothesis.2 The requirement of uniqueness, on the other hand, has not this form; it cannot have it, since a statement of the form ‘There is only one ...’ must have the form of a universal statement. (It can be translated as ‘There are not more than one...’ or ‘All... are identical’.)

Now it is my thesis here that it is only the ‘existential constituent’, as it might be called, of probability estimates, and therefore the requirement of randomness, which establishes a logical relation between them and basic statements. Accordingly, the requirement of uniqueness, as a universal statement, would have no extensional consequences whatever. That a value p with the required properties exists, can indeed be extensionally ‘confirmed’—though of course only provisionally; but not that only one such value exists. This latter statement, which is universal, could be extensionally significant only if basic statements could contradict it; that is to say, if basic statements could establish the existence of more than one such value. Since they cannot (for we remember that non-falsifiability is bound up with the binomial formula), the requirement of uniqueness must be extensionally without significance.*3

This is the reason why the logical relations holding between a probability estimate and basic statements, and the graded ‘confirmability’ of the former, are unaffected if we eliminate the requirement of uniqueness from the system. By doing this we could give the system the form of a pure existential hypothesis.3 But we should then have to give up the uniqueness of probability estimates,*4 and thereby (so far as uniqueness is concerned) obtain something different from the usual calculus of probability.

Therefore the requirement of uniqueness is obviously not superfluous. What, then, is its logical function?

Whilst the requirement of randomness helps to establish a relation between probability statements and basic statements, the requirement of uniqueness regulates the relations between the various probability statements themselves. Without the requirement of uniqueness some of these might, as existential hypotheses, be derivable from others, but they could never contradict one another. Only the requirement of uniqueness ensures that probability statements can contradict one another; for by this requirement they acquire the form of a conjunction whose components are a universal statement and an existential hypothesis; and statements of this form can stand to one another in exactly the same fundamental logical relations (equivalence, derivability, compatibility, and incompatibility) as can ‘normal’ universal statements of any theory—for example, a falsifiable theory.

If we now consider the axiom of convergence, then we find that it is like the requirement of uniqueness in that it has the form of a non-falsifiable universal statement. But it demands more than our requirement does. This additional demand, however, cannot have any extensional significance either; moreover, it has no logical or formal but only an intensional significance: it is a demand for the exclusion of all intensionally defined (i.e. mathematical) sequences without frequency limits. But from the point of view of applications, this exclusion proves to be without significance even intensionally, since in applied probability theory we do not of course deal with the mathematical sequences themselves but only with hypothetical estimates about empirical sequences. The exclusion of sequences without frequency limits could therefore only serve to warn us against treating those empirical sequences as chance-like or random of which we hypothetically assume that they have no frequency limit. But what possible action could we take in response to this warning?4 What sort of considerations or conjectures about the possible convergence or divergence of empirical sequences should we indulge in or abstain from, in view of this warning, seeing that criteria of convergence are no more applicable to them than are criteria of divergence? All these embarrassing questions5 disappear once the axiom of convergence has been got rid of.

Our logical analysis thus makes transparent both the form and the function of the various partial requirements of the system, and shows what reasons tell against the axiom of convergence and in favour of the requirement of uniqueness. Meanwhile the problem of decidability seems to be growing ever more menacing. And although we are not obliged to call our requirements (or axioms) ‘meaningless’,6 it looks as if we were compelled to describe them as non-empirical. But does not this description of probability statements—no matter what words we use to express it—contradict the main idea of our approach?


67 A PROBABILISTIC SYSTEM OF SPECULATIVE METAPHYSICS

The most important use of probability statements in physics is this: certain physical regularities or observable physical effects are interpreted as ‘macro laws’; that is to say, they are interpreted, or explained, as mass phenomena, or as the observable results of hypothetical and not directly observable ‘micro events’. The macro laws are deduced from probability estimates by the following method: we show that observations which agree with the observed regularity in question are to be expected with a probability very close to 1, i.e. with a probability which deviates from 1 by an amount which can be made as small as we choose. When we have shown this, then we say that by our probability estimate we have ‘explained’ the observable effect in question as a macro effect.

But if we use probability estimates in this way for the ‘explanation’ of observable regularities without introducing special precautions, then we may immediately become involved in speculations which, in accordance with general usage, can well be described as typical of speculative metaphysics.

For since probability statements are not falsifiable, it must always be possible in this way to ‘explain’, by probability estimates, any regularity we please. Take, for example, the law of gravity. We may contrive hypothetical probability estimates to ‘explain’ this law in the following way. We select events of some kind to serve as elementary or atomic events; for instance the movement of a small particle. We select also what is to be a primary property of these events; for instance the direction and velocity of the movement of a particle. We then assume that these events show a chance-like distribution. Finally we calculate the probability that all the particles within a certain finite spatial region, and during a certain finite period of time—a certain ‘cosmic period’—will with a specified accuracy move, accidentally, in the way required by the law of gravity. The probability calculated will, of course, be very small; negligibly small, in fact, but still not equal to zero. Thus we can raise the question how long an n-segment of the sequence would have to be, or in other words, how long a duration must be assumed for the whole process, in order that we may expect, with a probability close to 1 (or deviating from 1 by not more than an arbitrarily small value ε) the occurrence of one such cosmic period in which, as the result of an accumulation of accidents, our observations will all agree with the law of gravity. For any value as close to 1 as we choose, we obtain a definite, though extremely large, finite number. We can then say: if we assume that the segment of the sequence has this very great length—or in other words, that the ‘world’ lasts long enough—then our assumption of randomness entitles us to expect the occurrence of a cosmic period in which the law of gravity will seem to hold good, although ‘in reality’ nothing ever occurs but random scattering. This type of ‘explanation’ by means of an assumption of randomness is applicable to any regularity we choose. In fact we can in this way ‘explain’ our whole world, with all its observed regularities, as a phase in a random chaos—as an accumulation of purely accidental coincidences.
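The arithmetic behind such an ‘explanation’ is elementary. Suppose, purely for illustration (the letters here are mine, not the text’s), that a single cosmic period of the required length agrees with the law of gravity with some minute probability q, and that successive periods may be treated as independent. The probability that at least one such period occurs among N successive periods is then 1 − (1 − q)^N, and this exceeds 1 − ε as soon as

$$
N \;\ge\; \frac{\ln \varepsilon}{\ln(1-q)} \;\approx\; \frac{1}{q}\,\ln\frac{1}{\varepsilon},
$$

a finite, though inconceivably large, number for every ε > 0; which is all that this style of ‘explanation’ requires.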

It seems clear to me that speculations of this kind are ‘metaphysical’, and that they are without any significance for science. And it seems equally clear that this fact is connected with their non-falsifiability—with the fact that we can always and in all circumstances indulge in them. My criterion of demarcation thus seems to agree here quite well with the general use of the word ‘metaphysical’.

Theories involving probability, therefore, if they are applied without special precautions, are not to be regarded as scientific. We must rule out their metaphysical use if they are to have any use in the practice of empirical science.*1


68 PROBABILITY IN PHYSICS

The problem of decidability troubles only the methodologist, not the physicist.*1 If asked to produce a practically applicable concept of probability, the physicist might perhaps offer something like a physical definition of probability, on lines such as the following: There are certain experiments which, even if carried out under controlled conditions, lead to varying results. In the case of some of these experiments—those which are ‘chance-like’, such as tosses of a coin—frequent repetition leads to results with relative frequencies which, upon further repetition, approximate more and more to some fixed value which we may call the probability of the event in question. This value is ‘... empirically determinable through long series of experiments to any degree of approximation’;1 which explains, incidentally, why it is possible to falsify a hypothetical estimate of probability.

Against definitions on these lines both mathematicians and logicians will raise objections; in particular the following:

(1) The definition does not agree with the calculus of probability since, according to Bernoulli’s theorem, only almost all very long segments are statistically stable, i.e. behave as if convergent. For that reason, probability cannot be defined by this stability, i.e. by quasi-convergent behaviour. For the expression ‘almost all’—which ought to occur in the definiens—is itself only a synonym for ‘very probable’. The definition is thus circular; a fact which can be easily concealed (but not removed) by dropping the word ‘almost’. This is what the physicist’s definition did; and it is therefore unacceptable.

(2) When is a series of experiments to be called ‘long’? Without being given a criterion of what is to be called ‘long’, we cannot know when, or whether, we have reached an approximation to the probability.

(3) How can we know that the desired approximation has in fact been reached?

Although I believe that these objections are justified, I nevertheless believe that we can retain the physicist’s definition. I shall support this belief by the arguments outlined in the previous section. These showed that probability hypotheses lose all informative content when they are allowed unlimited application. The physicist would never use them in this way. Following his example I shall disallow the unlimited application of probability hypotheses: I propose that we take the methodological decision never to explain physical effects, i.e. reproducible regularities, as accumulations of accidents. This decision naturally modifies the concept of probability: it narrows it.*2 Thus objection (1) does not affect my position, for I do not assert the identity of the physical and the mathematical concepts of probability at all; on the contrary, I deny it. But in place of (1), a new objection arises.

(1′) When can we speak of ‘accumulated accidents’? Presumably in the case of a small probability. But when is a probability ‘small’? We may take it that the proposal which I have just submitted rules out the use of the method (discussed in the preceding section) of manufacturing an arbitrarily large probability out of a small one by changing the formulation of the mathematical problem. But in order to carry out the proposed decision, we have to know what we are to regard as small.

In the following pages it will be shown that the proposed methodological rule agrees with the physicist’s definition, and that the objections raised by questions (1′), (2), and (3) can be answered with its help. To begin with, I have in mind only one typical case of the application of the calculus of probability: the case of certain reproducible macro effects which can be described with the help of precise (macro) laws—such as gas pressure—and which we interpret, or explain, as due to a very large accumulation of micro processes, such as molecular collisions. Other typical cases (such as statistical fluctuations or the statistics of chance-like individual processes) can be reduced without much difficulty to this case.*3


Let us take a macro effect of this type, described by a well-corroborated law, which is to be reduced to random sequences of micro events. Let the law assert that under certain conditions a physical magnitude has the value p. We assume the effect to be ‘precise’, so that no measurable fluctuations occur, i.e. no deviations from p beyond that interval, ± φ (the interval of imprecision; cf. section 37), within which our measurements will in any case fluctuate, owing to the imprecision inherent in the prevailing technique of measurement. We now propose the hypothesis that p is a probability within a sequence α of micro events; and further, that n micro events contribute towards producing the effect. Then (cf. section 61) we can calculate for every chosen value δ the probability αnF(Δp), i.e. the probability that the value measured will fall within the interval Δp. The complementary probability may be denoted by ‘ε’. Thus we have ${}_{\alpha_n}F(\overline{\Delta p}) = \varepsilon$. According to Bernoulli’s theorem, ε tends to zero as n increases without limit.


We assume that ε is so ‘small’ that it can be neglected. (Question (1′), which concerns what ‘small’ means in this assumption, will be dealt with soon.) Δp is to be interpreted, clearly, as the interval within which the measurements approach the value p. From this we see that the three quantities ε, n, and Δp correspond to the three questions (1′), (2), and (3). Δp or δ can be chosen arbitrarily, which restricts the arbitrariness of our choice of ε and n. Since it is our task to deduce the exact macro effect p (± φ), we shall not assume δ to be greater than φ. As far as the reproducible effect p is concerned, the deduction will be satisfactory if we can carry it out for some value δ ≤ φ. (Here φ is given, since it is determined by the measuring technique.) Now let us choose δ so that it is (approximately) equal to φ. Then we have reduced question (3) to the two other questions, (1′) and (2).

By the choice of δ (i.e. of Δp) we have established a relation between n and ε, since to every n there now corresponds uniquely a value of ε. Thus (2), i.e. the question ‘When is n sufficiently long?’, has been reduced to (1′), i.e. the question ‘When is ε small?’ (and vice versa).

But this means that all three questions could be answered if only we could decide what particular value of ε is to be neglected as ‘negligibly small’. Now our methodological rule amounts to the decision to neglect small values of ε; but we shall hardly be prepared to commit ourselves for ever to a definite value of ε.

If we put our question to a physicist, that is, if we ask him what ε he is prepared to neglect—0.001, or 0.000001, or... ? he will presumably answer that ε does not interest him at all; that he has chosen not ε but n; and that he has chosen n in such a way as to make the correlation between n and Δp largely independent of any changes of the value ε which we might choose to make.

The physicist’s answer is justified, because of the mathematical peculiarities of the Bernoullian distribution: it is possible to determine for every n the functional dependence between ε and Δp.*4 An examination of this function shows that for every (‘large’) n there exists a characteristic value of Δp such that in the neighbourhood of this value Δp is highly insensitive to changes of ε. This insensitiveness increases with increasing n. If we take an n of an order of magnitude which we should expect in the case of extreme mass-phenomena, then, in the neighbourhood of its characteristic value, Δp is so highly insensitive to changes of ε that Δp hardly changes at all even if the order of magnitude of ε changes. Now the physicist will attach little value to more sharply defined boundaries of Δp. And in the case of typical mass phenomena, to which this investigation is restricted, Δp can, we remember, be taken to correspond to the interval of precision ± φ, which depends upon our technique of measurement; and this has no sharp bounds but only what I called in section 37 ‘condensation bounds’. We shall therefore call n large when the insensitivity of Δp in the neighbourhood of its characteristic value, which we can determine, is at least so great that even changes in the order of magnitude of ε cause the value of Δp to fluctuate only within the condensation bounds of ± φ. (If n→∞, then Δp becomes completely insensitive.) But if this is so, then we need no longer concern ourselves with the exact determination of ε: the decision to neglect a small ε suffices, even if we have not exactly stated what has to be regarded as ‘small’. It amounts to the decision to work with the characteristic values of Δp above mentioned, which are insensitive to changes of ε.
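The insensitiveness on which the physicist relies can be made vivid by the normal (De Moivre–Laplace) approximation to the Bernoullian distribution; I use it here only as an illustrative shortcut, not as the exact function mentioned above. For an n-segment, the half-width δ of the interval Δp outside which the relative frequency falls with probability ε is roughly z(ε)·√(pq/n), where z(ε) grows only as √(2 ln(1/ε)):

```python
from statistics import NormalDist

def delta_for(n, eps, p=0.5):
    """Half-width delta such that the relative frequency of an
    n-segment falls outside p +/- delta with probability ~eps
    (normal approximation to the Bernoullian tail)."""
    z = NormalDist().inv_cdf(1 - eps / 2)
    return z * (p * (1 - p) / n) ** 0.5

n = 10**9   # an n of the order met with in extreme mass phenomena
for eps in [1e-3, 1e-6, 1e-9, 1e-12]:
    print(f"eps = {eps:.0e}:  delta = {delta_for(n, eps):.2e}")
```

Changing ε through nine orders of magnitude moves δ by barely a factor of two: this is precisely why the physicist may choose n, and leave ε unspecified.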

The rule that extreme improbabilities have to be neglected (a rule which becomes sufficiently explicit only in the light of the above) agrees with the demand for scientific objectivity. For the obvious objection to our rule is, clearly, that even the greatest improbability always remains a probability, however small, and that consequently even the most improbable processes—i.e. those which we propose to neglect—will some day happen. But this objection can be disposed of by recalling the idea of a reproducible physical effect—an idea which is closely connected with that of objectivity (cf. section 8). I do not deny the possibility that improbable events might occur. I do not, for example, assert that the molecules in a small volume of gas may not, perhaps, for a short time spontaneously withdraw into a part of the volume, or that in a greater volume of gas spontaneous fluctuations of pressure will never occur. What I do assert is that such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will. Even if a physicist happened to observe such a process, he would be quite unable to reproduce it, and therefore would never be able to decide what had really happened in this case, and whether he might not have made an observational mistake. If, however, we find reproducible deviations from a macro effect which has been deduced from a probability estimate in the manner indicated, then we must assume that the probability estimate is falsified.

Such considerations may help us to understand pronouncements like the following of Eddington’s in which he distinguishes two kinds of physical laws: ‘Some things never happen in the physical world because they are impossible; others because they are too improbable. The laws which forbid the first are primary laws; the laws which forbid the second are secondary laws.’2 Although this formulation is perhaps not beyond criticism (I should prefer to abstain from non-testable assertions about whether or not extremely improbable things occur), it agrees well with the physicist’s application of probability theory.

Other cases to which probability theory may be applied, such as statistical fluctuations, or the statistics of chance-like individual events, are reducible to the case we have been discussing, that of the precisely measurable macro effect. By statistical fluctuations I understand phenomena such as the Brownian movement. Here the interval of precision of measurement (± φ) is smaller than the interval Δp characteristic of the number n of micro events contributing to the effect; hence measurable deviations from p are to be expected as highly probable. The fact that such deviations occur will be testable, since the fluctuation itself becomes a reproducible effect; and to this effect my earlier arguments apply: fluctuations beyond a certain magnitude (beyond some interval Δp) must not be reproducible, according to my methodological requirements, nor long sequences of fluctuations in one and the same direction, etc. Corresponding arguments would hold for the statistics of chance-like individual events.
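
A short simulation may make the distinction concrete: fluctuations on the scale characteristic of n recur in a sizeable share of repetitions (a reproducible effect), while much larger ones are practically never repeated. The segment length, the number of repetitions, and the five-fold threshold below are illustrative assumptions:

    import random

    def segment_frequency(n, p=0.5):
        """Relative frequency of successes in one segment of n chance-like events."""
        return sum(random.random() < p for _ in range(n)) / n

    n, repetitions = 10_000, 200
    scale = (0.25 / n) ** 0.5  # characteristic fluctuation scale for p = 1/2
    typical = sum(abs(segment_frequency(n) - 0.5) > scale for _ in range(repetitions))
    extreme = sum(abs(segment_frequency(n) - 0.5) > 5 * scale for _ in range(repetitions))
    print(typical, extreme)  # typically about 60 out of 200, and almost always 0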

I may now summarize my arguments regarding the problem of decidability.

Our question was: How can probability hypotheses—which, we have seen, are non-falsifiable—play the part of natural laws in empirical science? Our answer is this: Probability statements, in so far as they are not falsifiable, are metaphysical and without empirical significance; and in so far as they are used as empirical statements they are used as falsifiable statements.

But this answer raises another question: How is it possible that probability statements—which are not falsifiable—can be used as falsifiable statements? (The fact that they can be so used is not in doubt: the physicist knows well enough when to regard a probability assumption as falsified.) This question, we find, has two aspects. On the one hand, we must make the possibility of using probability statements understandable in terms of their logical form. On the other hand, we must analyse the rules governing their use as falsifiable statements.

According to section 66, accepted basic statements may agree more or less well with some proposed probability estimate; they may represent better, or less well, a typical segment of a probability sequence. This provides the opportunity for the application of some kind of methodological rule; a rule, for instance, which might demand that the agreement between basic statements and the probability estimate should conform to some minimum standard. Thus the rule might draw some arbitrary line and decree that only reasonably representative segments (or reasonably ‘fair samples’) are ‘permitted’, while atypical or non-representative segments are ‘forbidden’.

A closer analysis of this suggestion showed us that the dividing line between what is permitted and what is forbidden need not be drawn quite as arbitrarily as might have been thought at first. And in particular, that there is no need to draw it ‘tolerantly’. For it is possible to frame the rule in such a way that the dividing line between what is permitted and what is forbidden is determined, just as in the case of other laws, by the attainable precision of our measurements.

Our methodological rule, proposed in accordance with the criterion of demarcation, does not forbid the occurrence of atypical segments; neither does it forbid the repeated occurrence of deviations (which, of course, are typical for probability sequences). What this rule forbids is the predictable and reproducible occurrence of systematic deviations; such as deviations in a particular direction, or the occurrence of segments which are atypical in a definite way. Thus it requires not a mere rough agreement, but the best possible one for everything that is reproducible and testable; in short, for all reproducible effects.
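
The rule can be given a crude operational form. In the sketch below the function name, the inputs, and the ‘same side every time’ criterion are all illustrative assumptions; the point is only that isolated atypical segments remain permitted, while reproducible one-directional deviations are forbidden:

    def systematic_deviation(frequencies, p, phi):
        """frequencies: segment frequencies from repeated experiments;
        p: the hypothetical probability; phi: the precision of measurement.
        Flags a reproducible, one-directional deviation from p beyond phi."""
        return (all(f > p + phi for f in frequencies) or
                all(f < p - phi for f in frequencies))

    # A single stray value does not falsify; a repeatable one-sided shift does:
    print(systematic_deviation([0.52, 0.49, 0.55], p=0.5, phi=0.01))  # False
    print(systematic_deviation([0.53, 0.54, 0.52], p=0.5, phi=0.01))  # True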

69 LAW AND CHANCE

One sometimes hears it said that the movements of the planets obey strict laws, whilst the fall of a die is fortuitous, or subject to chance. In my view the difference lies in the fact that we have so far been able to predict the movement of the planets successfully, but not the individual results of throwing dice.

In order to deduce predictions one needs laws and initial conditions; if no suitable laws are available or if the initial conditions cannot be ascertained, the scientific way of predicting breaks down. In throwing dice, what we lack is, clearly, sufficient knowledge of initial conditions. With sufficiently precise measurements of initial conditions it would be possible to make predictions in this case also; but the rules for correct dicing (shaking the dice-box) are so chosen as to prevent us from measuring initial conditions. The rules of play and other rules determining the conditions under which the various events of a random sequence are to take place I shall call the ‘frame conditions’. They consist of such requirements as that the dice shall be ‘true’ (made from homogeneous material), that they shall be well shaken, etc.

There are other cases in which prediction may be unsuccessful. Perhaps it has not so far been possible to formulate suitable laws; perhaps all attempts to find a law have failed, and all predictions have been falsified. In such cases we may despair of ever finding a satisfactory law. (But it is not likely that we shall give up trying unless the problem does not interest us much—which may be the case, for example, if we are satisfied with frequency predictions.) In no case, however, can we say with finality that there are no laws in a particular field. (This is a consequence of the impossibility of verification.) This means that my view makes the concept of chance subjective.*1 I speak of ‘chance’ when our knowledge does not suffice for prediction; as in the case of dicing, where we speak of ‘chance’ because we have no knowledge of the initial conditions. (Conceivably a physicist equipped with good instruments could predict a throw which other people could not predict.)

In opposition to this subjective view, an objective view has sometimes been advocated. In so far as this uses the metaphysical idea that events are, or are not, determined in themselves, I shall not examine it further here. (Cf. sections 71 and 78.) If we are successful with our prediction, we may speak of ‘laws’; otherwise we can know nothing about the existence or non-existence of laws or of irregularities.*2

Perhaps more worth considering than this metaphysical idea is the following view. We encounter ‘chance’ in the objective sense, it may be said, when our probability estimates are corroborated; just as we encounter causal regularities when our predictions deduced from laws are corroborated.

The definition of chance implicit in this view may not be altogether useless, but it should be strongly emphasized that the concept so defined is not opposed to the concept of law: it was for this reason that I called probability sequences chance-like. In general, a sequence of experimental results will be chance-like if the frame conditions which define the sequence differ from the initial conditions: the individual experiments, carried out under identical frame conditions, will proceed under different initial conditions, and so yield different results. Whether there are chance-like sequences whose elements are in no way predictable, I do not know. From the fact that a sequence is chance-like we may not even infer that its elements are not predictable, or that they are ‘due to chance’ in the subjective sense of insufficient knowledge; and least of all may we infer from this fact the ‘objective’ fact that there are no laws.*3

Not only is it impossible to infer from the chance-like character of the sequence anything about the conformity to law, or otherwise, of the individual events: it is not even possible to infer from the corroboration of probability estimates that the sequence itself is completely irregular. For we know that chance-like sequences exist which are constructed according to a mathematical rule. (Cf. appendix iv.) The fact that a sequence has a Bernoullian distribution is not a symptom of the absence of law, and much less identical with the absence of law ‘by definition’.1 In the success of probability predictions we must see no more than a symptom of the absence of simple laws in the structure of the sequence (cf. sections 43 and 58)—as opposed to the events constituting it. The assumption of freedom from after-effect, which is equivalent to the hypothesis that such simple laws are not discoverable, is corroborated, but that is all.
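
That a strict mathematical rule can generate a chance-like sequence may be illustrated by a deterministic sketch; the particular recurrence used below (a linear congruential rule) is a stand-in, chosen for brevity, for the constructions of appendix iv:

    def lawful_alternative(length, seed=1):
        """A 0-1 sequence produced by a strict mathematical rule, yet stable
        in its relative frequencies: chance-like without being 'due to chance'."""
        x, bits = seed, []
        for _ in range(length):
            x = (1103515245 * x + 12345) % 2**31
            bits.append(x >> 30)  # the leading bit of the state
        return bits

    bits = lawful_alternative(100_000)
    print(sum(bits) / len(bits))  # close to 1/2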


70 THE DEDUCIBILITY OF MACRO LAWS FROM MICRO LAWS

There is a doctrine which has almost become a prejudice, although it has recently been criticized severely—the doctrine that all observable events must be explained as macro events; that is to say, as averages or accumulations or summations of certain micro events. (The doctrine is somewhat similar to certain forms of materialism.) Like other doctrines of its kind, this seems to be a metaphysical hypostatization of a methodological rule which in itself is quite unobjectionable. I mean the rule that we should see whether we can simplify or generalize or unify our theories by employing explanatory hypotheses of the type mentioned (that is to say, hypotheses explaining observable effects as summations or integrations of micro events). In evaluating the success of such attempts, it would be a mistake to think that non-statistical hypotheses about the micro events and their laws of interaction could ever be sufficient to explain macro events. For we should need, in addition, hypothetical frequency estimates, since statistical conclusions can only be derived from statistical premises. These frequency estimates are always independent hypotheses which at times may indeed occur to us whilst we are engaged in studying the laws pertaining to micro events, but which can never be derived from these laws. Frequency estimates form a special class of hypotheses: they are prohibitions which, as it were, concern regularities in the large.1 Von Mises has stated this very clearly: ‘Not even the tiniest little theorem in the kinetic theory of gases follows from classical physics alone, without additional assumptions of a statistical kind.’2

Statistical estimates, or frequency statements, can never be derived simply from laws of a ‘deterministic’ kind, for the reason that in order to deduce any prediction from such laws, initial conditions are needed. In their place, assumptions about the statistical distribution of initial conditions—that is to say specific statistical assumptions—enter into every deduction in which statistical laws are obtained from micro assumptions of a deterministic or ‘precise’ character.*1

It is a striking fact that the frequency assumptions of theoretical physics are to a great extent equal-chance hypotheses, but this by no means implies that they are ‘self-evident’ or a priori valid. That they are far from being so may be seen from the wide differences between classical statistics, Bose-Einstein statistics, and Fermi-Dirac statistics. These show how special assumptions may be combined with an equal-chance hypothesis, leading in each case to different definitions of the reference sequences and the primary properties for which equal distribution is assumed.
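
The difference can be displayed in a toy case: two particles distributed over three cells. The equal-chance hypothesis is the same in all three statistics; what differs is the set of elementary arrangements to which it is applied. The numbers below are purely illustrative:

    from itertools import product, combinations_with_replacement, combinations

    cells, k = range(3), 2  # three cells, two particles (a toy case)
    spaces = {
        "Maxwell-Boltzmann": list(product(cells, repeat=k)),             # 9 cases
        "Bose-Einstein": list(combinations_with_replacement(cells, k)),  # 6 cases
        "Fermi-Dirac": list(combinations(cells, k)),                     # 3 cases
    }
    for name, space in spaces.items():
        both_in_first = sum(all(c == 0 for c in s) for s in space) / len(space)
        print(name, len(space), both_in_first)  # 1/9, 1/6, and 0 respectively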

The following example may perhaps illustrate the fact that frequency assumptions are indispensable even when we may be inclined to do without them.

Imagine a waterfall. We may discern some odd kind of regularity: the size of the currents composing the fall varies; and from time to time a splash is thrown off from the main stream; yet throughout all such variations a certain regularity is apparent which strongly suggests a statistical effect. Disregarding some unsolved problems of hydrodynamics (concerning the formation of vortices, etc.) we can, in principle, predict the path of any volume of water—say a group of molecules—with any desired degree of precision, if sufficiently precise initial conditions are given. Thus we may assume that it would be possible to foretell of any molecule, far above the waterfall, at which point it will pass over the edge, where it will reach bottom, etc. In this way the path of any number of particles may, in principle, be calculated; and given sufficient initial conditions we should be able, in principle, to deduce any one of the individual statistical fluctuations of the waterfall. But only this or that individual fluctuation could be so obtained, not the recurring statistical regularities we have described, still less the general statistical distribution as such. In order to explain these we need statistical estimates—at least the assumption that certain initial conditions will again and again recur for many different groups of particles (which amounts to a universal statement). We obtain a statistical result if, and only if, we make such specific statistical assumptions—for example, assumptions concerning the frequency distribution of recurring initial conditions.
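
The same point can be restated as a toy computation: a strictly deterministic micro law yields a statistical macro regularity only after a statistical assumption about initial conditions has been added. The doubling map below is an illustrative stand-in for the hydrodynamics of the example:

    import random

    def trajectory_outcome(x0, steps=20):
        """Deterministic micro law: the doubling map x -> 2x (mod 1).
        Each initial condition x0 fixes the whole path."""
        x = x0
        for _ in range(steps):
            x = (2 * x) % 1.0
        return x < 0.5

    # The statistical regularity appears only once we add a statistical
    # assumption about the initial conditions (here: uniform distribution):
    outcomes = [trajectory_outcome(random.random()) for _ in range(100_000)]
    print(sum(outcomes) / len(outcomes))  # close to 1/2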


71 FORMALLY SINGULAR PROBABILITY STATEMENTS

I call a probability statement ‘formally singular’ when it ascribes a probability to a single occurrence, or to a single element of a certain class of occurrences;*1 for example, ‘the probability of throwing five with the next throw of this die is 1/6’ or ‘the probability of throwing five with any single throw (of this die) is 1/6’. From the standpoint of the frequency theory such statements are as a rule regarded as not quite correct in their formulation, since probabilities cannot be ascribed to single occurrences, but only to infinite sequences of occurrences or events. It is easy, however, to interpret these statements as correct, by appropriately defining formally singular probabilities with the help of the concept of objective probability or relative frequency. I use ‘αPk(β)’ to denote the formally singular probability that a certain occurrence k has the property β, in its capacity as an element of a sequence α—in symbols:1 k ε α—and I then define the formally singular probability as follows:

αPk(β) = αF(β)

(Definition)

This can be expressed in words as: The formally singular probability that the event k has the property β—given that k is an element of the sequence α—is, by definition, equal to the probability of the property β within the reference sequence α.

This simple, almost obvious, definition proves to be surprisingly useful. It can even help us to clarify some intricate problems of modern quantum theory. (Cf. sections 75–76.)

As the definition shows, a formally singular probability statement would be incomplete if it did not explicitly state a reference-class. But although α is often not explicitly mentioned, we usually know in such cases which α is meant. Thus the first example given above does not specify any reference sequence α, but it is nevertheless fairly clear that it relates to all sequences of throws with true dice.

In many cases there may be several different reference sequences for an event k. In these cases it may be only too obvious that different formally singular probability statements can be made about the same event. Thus the probability that an individual man k will die within a given period of time may assume very different values according to whether we regard him as a member of his age-group, or of his occupational group, etc. It is not possible to lay down a general rule as to which out of several possible reference-classes should be chosen. (The narrowest reference-class may often be the most suitable, provided that it is numerous enough to allow the probability estimate to be based upon reasonable statistical extrapolation, and to be supported by a sufficient amount of corroborating evidence.)
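
A minimal sketch of the definition in use (the reference classes and frequency values below are invented for illustration): one and the same event k receives different formally singular probabilities according to the reference class chosen:

    # Hypothetical frequency estimates, each successfully tested within its class:
    frequency = {
        ("men aged forty", "dies within ten years"): 0.03,
        ("miners aged forty", "dies within ten years"): 0.09,
    }

    def P(alpha, beta):
        """alpha-P_k(beta) = alpha-F(beta): the event k enters only through
        its membership of the reference sequence alpha."""
        return frequency[(alpha, beta)]

    print(P("men aged forty", "dies within ten years"))     # 0.03
    print(P("miners aged forty", "dies within ten years"))  # 0.09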

Not a few of the so-called paradoxes of probability disappear once we realize that different probabilities may be ascribed to one and the same occurrence or event, as an element of different reference-classes. For example, it is sometimes said that the probability αPk(β) of an event before its occurrence is different from the probability of the same event after it has occurred: before, it may equal 1/6, while afterwards it can only be equal to 1 or 0. This view is, of course, quite mistaken. αPk(β) is always the same, both before and after the occurrence. Nothing has changed except that, on the basis of the information k ε β (or k ε β̄)—information which may be supplied to us upon observing the occurrence—we may choose a new reference-class, namely β (or β̄), and then ask what is the value of βPk(β). The value of this probability is of course 1; just as β̄Pk(β) = 0. Statements informing us about the actual outcome of single occurrences—statements which are not about some frequency but rather of the form ‘k ε φ’—cannot change the probability of these occurrences; they may, however, suggest to us the choice of another reference-class.

The concept of a formally singular probability statement provides a kind of bridge to the subjective theory, and thereby also, as will be shown in the next section, to the theory of range. For we might agree to interpret formally singular probability as ‘degree of rational belief’ (following Keynes)—provided we allow our ‘rational beliefs’ to be guided by an objective frequency statement. This then is the information upon which our beliefs depend. In other words, it may happen that we know nothing about an event except that it belongs to a certain reference-class in which some probability estimate has been successfully tested. This information does not enable us to predict what the property of the event in question will be; but it enables us to express all we know about it by means of a formally singular probability statement which looks like an indefinite prediction about the particular event in question.*2

Thus I do not object to the subjective interpretation of probability statements about single events, i.e. to their interpretation as indefinite predictions—as confessions, so to speak, of our deficient knowledge about the particular event in question (concerning which, indeed, nothing follows from a frequency statement). I do not object, that is to say, so long as we clearly recognize that the objective frequency statements are fundamental, since they alone are empirically testable. I reject, however, any interpretation of these formally singular probability statements—these indefinite predictions—as statements about an objective state of affairs, other than the objective statistical state of affairs. What I have in mind is the view that a statement about the probability 1/6 in dicing is not a mere confession that we know nothing definite (subjective theory), but rather an assertion about the next throw—an assertion that its result is objectively both indeterminate and undetermined— something which as yet hangs in the balance.*3 I regard all attempts at this kind of objective interpretation (discussed at length by Jeans, among others) as mistaken. Whatever indeterministic airs these interpretations may give themselves, they all involve the metaphysical idea that not only can we deduce and test predictions, but that, in addition, nature is more or less ‘determined’ (or ‘undetermined’); so that the success (or failure) of predictions is to be explained not by the laws from which they are deduced, but over and above this by the fact that nature is actually constituted (or not constituted) according to these laws.*4


72 THE THEORY OF RANGE

In section 34 I said that a statement which is falsifiable to a higher degree than another statement can be described as the one which is logically more improbable; and the less falsifiable statement as the one which is logically more probable. The logically less probable statement entails1 the logically more probable one. Between this concept of logical probability and that of objective or formally singular numerical probability there are affinities. Some of the philosophers of probability (Bolzano, von Kries, Waismann) have tried to base the calculus of probability upon the concept of logical range, and thus upon a concept which (cf. section 37) coincides with that of logical probability; and in doing so, they also tried to work out the affinities between logical and numerical probability.

Waismann2 has proposed to measure the degree of interrelatedness between the logical ranges of various statements (their ratios, as it were) by means of the relative frequencies corresponding to them, and thus to treat the frequencies as determining a system of measurement for ranges. I think it is feasible to erect a theory of probability on this foundation. Indeed we may say that this plan amounts to the same thing as correlating relative frequencies with certain ‘indefinite predictions’ —as we did in the foregoing section, when defining formally singular probability statements.

It must be said, however, that this method of defining probability is only practicable when a frequency theory has already been constructed. Otherwise one would have to ask how the frequencies used in defining the system of measurement were defined in their turn. If, however, a frequency theory is at our disposal already, then the introduction of the theory of range becomes really superfluous. But in spite of this objection I regard the practicability of Waismann’s proposal as significant. It is satisfactory to find that a more comprehensive theory can bridge the gaps—which at first appeared unbridgeable—between the various attempts to tackle the problem, especially between the subjective and the objective interpretations. Yet Waismann’s proposal calls for some slight modification. His concept of a ratio of ranges (cf. note 2 to section 48) not only presupposes that ranges can be compared with the help of their subclass relations (or their entailment relations); but it also presupposes, more generally, that even ranges which only partially overlap (ranges of non-comparable statements) can be made comparable. This latter assumption, however, which involves considerable difficulties, is superfluous. It is possible to show that in the cases concerned (such as cases of randomness) the comparison of subclasses and that of frequencies must lead to analogous results. This justifies the procedure of correlating frequencies to ranges in order to measure the latter. In doing so, we make the statements in question (non-comparable by the subclass method) comparable. I will indicate roughly how the procedure described might be justified.

If between two property classes γ and β the subclass relation

γ ⊂ β

holds, then we have:

(k ε γ) → (k ε β)

(cf. section 33)

so that the logical probability or the range of the statement (k ε γ) must be smaller than, or equal to, that of (k ε β). It will be equal only if there is a reference class α (which may be the universal class) with respect to which the following rule holds which may be said to have the form of a ‘law of nature’:

(k ε α·β) → (k ε γ)

If this ‘law of nature’ does not hold, so that we may assume randomness in this respect, then the inequality holds. But in this case we obtain, provided α is denumerable, and acceptable as a reference sequence:

αF(γ) < αF(β)

This means that, in the case of randomness, a comparison of ranges must lead to the same inequality as a comparison of relative frequencies. Accordingly, if we have randomness, we may correlate relative frequencies with the ranges in order to make the ranges measurable. But this is just what we did, although indirectly, in section 71, when we defined the formally singular probability statement. Indeed, from the assumptions made, we might have inferred immediately that

αPk(γ) < αPk(β)

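The argument admits of a finite toy model (the classes below are invented): with γ a proper subclass of β inside the reference class α, relative frequencies order the two statements just as their ranges do, and equality would require the ‘law of nature’ that every β in α is also a γ:

    alpha = range(100)                         # the reference class
    beta = {x for x in alpha if x % 2 == 0}    # property beta
    gamma = {x for x in alpha if x % 10 == 0}  # gamma, a subclass of beta

    F = lambda prop: sum(x in prop for x in alpha) / len(alpha)
    print(F(gamma), F(beta))  # 0.1 < 0.5, matching the ordering of the ranges
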
So we have come back to our starting point, the problem of the interpretation of probability. And we now find that the conflict between objective and subjective theories, which at first seemed so obdurate, may be eliminated altogether by the somewhat obvious definition of formally singular probability.

*1 Within the theory of probability, I have made since 1934 three kinds of changes.

(1) The introduction of a formal (axiomatic) calculus of probabilities which can be interpreted in many ways—for example, in the sense of the logical and of the frequency interpretations discussed in this book, and also of the propensity interpretation discussed in my Postscript.

(2) A simplification of the frequency theory of probability through carrying out, more fully and more directly than in 1934, that programme for reconstructing the frequency theory which underlies the present chapter.

(3) The replacement of the objective interpretation of probability in terms of frequency by another objective interpretation—the propensity interpretation—and the replacement of the calculus of frequencies by the neo-classical (or measure-theoretical) formalism.

The first two of these changes date back to 1938 and are indicated in the book itself (i.e. in this volume): the first by some new appendices, *ii to *v, and the second—the one which affects the argument of the present chapter—by a number of new footnotes to this chapter, and by the new appendix *vi. The main change is described here in footnote *1 to section 57.

The third change (which I first introduced, tentatively, in 1953) is explained and developed in the Postscript, where it is also applied to the problems of quantum theory.

1 Cf. for example von Mises, Wahrscheinlichkeit, Statistik und Wahrheit, 1928, pp. 62 ff.; 2nd edition, 1936, pp. 84 ff.; English translation by J. Neyman, D. Sholl, and E. Rabinowitsch, Probability, Statistics and Truth, 1939, pp. 98 ff. *Although the classical definition is often called ‘Laplacean’ (also in this book), it is at least as old as De Moivre’s Doctrine of Chances, 1718. For an early objection against the phrase ‘equally possible’, see C. S. Peirce, Collected Papers 2, 1932 (first published 1878), p. 417, para. 2.673.

*1 The reasons why I count the logical interpretation as a variant of the subjective interpretation are more fully discussed in chapter *ii of the Postscript, where the subjective interpretation is criticized in detail. Cf. also appendix *ix.

2 Waismann, Logische Analyse des Wahrscheinlichkeitsbegriffs, Erkenntnis 1, 1930, p. 237: ‘Probability so defined is then, as it were, a measure of the logical proximity, the deductive connection between the two statements’. Cf. also Wittgenstein, op. cit., proposition 5.15 ff.

3 J. M. Keynes, A Treatise on Probability, 1921, pp. 95 ff.

4 Wittgenstein, op. cit., proposition 5.152: ‘If p follows from q, the proposition q gives to the proposition p the probability 1. The certainty of logical conclusion is a limiting case of probability.’

5 For the older frequency theory cf. the critique of Keynes, op. cit., pp. 95 ff., where special reference is made to Venn’s The Logic of Chance. For Whitehead’s view cf. section 80 (note 2). Chief representatives of the new frequency theory are: R. von Mises (cf. note 1 to section 50), Dörge, Kamke, Reichenbach and Tornier. *A new objective interpretation, very closely related to the frequency theory, but differing from it even in its mathematical formalism, is the propensity interpretation, introduced in sections *53 ff. of my Postscript.

6 Keynes’s greatest error; cf. section 62, below, especially note 3. *I have not changed my view on this point even though I now believe that Bernoulli’s theorem may serve as a ‘bridge’ within an objective theory—as a bridge from propensities to statistics. See also appendix *ix and sections *55 to *57 of my Postscript.

1 Waismann, Erkenntnis 1, 1930, p. 238, says: ‘There is no other reason for introducing the concept of probability than the incompleteness of our knowledge.’ A similar view is held by C. Stumpf (Sitzungsberichte der Bayerischen Akademie der Wissenschaften, phil.-hist. Klasse, 1892, p. 41). *I believe that this widely held view is responsible for the worst confusions. This will be shown in detail in my Postscript, chapters *ii and *v.

1 R. von Mises, Fundamentalsätze der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift 4, 1919, p. 1; Grundlagen der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift 5, 1919, p. 52; Wahrscheinlichkeit, Statistik, und Wahrheit (1928), 2nd edition 1936, English translation by J. Neyman, D. Sholl, and E. Rabinowitsch: Probability, Statistics and Truth, 1939; Wahrscheinlichkeitsrechnung und ihre Anwendung in der Statistik und theoretischen Physik (Vorlesungen über angewandte Mathematik 1), 1931.

2 We can correlate with every sequence of properties as many distinct sequences of relative frequencies as there are properties defined in the sequence. Thus in the case of an alternative there will be two distinct sequences. Yet these two sequences are derivable from one another, since they are complementary (corresponding terms add up to 1). For this reason I shall, for brevity, refer to ‘the (one) sequence of relative frequencies correlated with the alternative (α)’, by which I shall always mean the sequence of frequencies correlated with the property ‘1’ of this alternative (α).

3 Cf. von Mises, Wahrscheinlichkeitsrechnung, 1931, p. 22.

1 Waismann, Erkenntnis 1, 1930, p. 232.

2 This concern is expressed by Schlick, Naturwissenschaften 19, 1931. *I still believe that these two tasks are important. Although I almost succeeded in the book in achieving what I set out to do, the two tasks were satisfactorily completed only in the new appendix *vi.

3 A full account of the mathematical construction will be published separately. *Cf. the new appendix *vi.

*1 Definition 1 is of course related to the classical definition of probability as the ratio of the favourable cases to the equally possible cases; but it should be clearly distinguished from the latter definition: there is no assumption involved here that the elements of α are ‘equally possible’.

*2 By selecting a set of F-formulae from which the other F-formulae can be derived, we obtain a formal axiom system for probability; compare the appendices ii, *ii, *iv, and *v.

1 Von Mises’s term is ‘choice’ (‘Auswahl’).

2 The answer to this question is given by the general division theorem (cf. appendix ii).

3 Hausdorff, Berichte über die Verhandlungen der sächsischen Ges. d. Wissenschaften, Leipzig, mathem.-physik. Klasse 53, 1901, p. 158.

4 It is even triply symmetrical, i.e. for α, β and γ, if we assume β and γ also to be finite. For the proof of the symmetry assertion cf. appendix ii, (1s) and (2s). *The condition of finitude for triple symmetry asserted in this note is insufficient. I may have intended to express the condition that β and γ are bounded by the finite reference class α, or, most likely, that α should be our finite universe of discourse. (These are sufficient conditions.) The insufficiency of the condition, as formulated in my note, is shown by the following counter-example. Take a universe of 5 buttons; 4 are round (α); 2 are round and black (αβ); 2 are round and large (αγ); 1 is round, black, and large (αβγ); and 1 is square, black, and large (ᾱβγ). Then we do not have triple symmetry since αβF″(γ) ≠ βF″(γ).

*1 Thus any information about the possession of properties is relevant, or irrelevant, if and only if the properties in question are, respectively, dependent or independent. Relevance can thus be defined in terms of dependence, but the reverse is not the case. (Cf. the next footnote, and note *1 to section 55.)

5 Keynes objected to the frequency theory because he believed that it was impossible to define relevance in its terms; cf. op. cit., pp. 103 ff. *In fact, the subjective theory cannot define (objective) independence, which is a serious objection as I show in my Postscript, chapter *ii, especially sections *40 to *43.

*1 I suggest that sections 55 to 64, or perhaps only 56 to 64, be skipped at first reading. It may even be advisable to turn from here, or from the end of section 55, direct to chapter 10.

*1 This is another indication of the fact that the terms ‘relevant’ and ‘irrelevant’, figuring so largely in the subjective theory, are grossly misleading. For if p is irrelevant, and likewise q, it is a little surprising to learn that p.q may be of the highest relevance. See also appendix *ix, especially points 5 and 6 of the first note.

*2 The general idea of distinguishing neighbourhoods according to their size, and of operating with well-defined neighbourhood-selections was introduced by me. But the term ‘free from after-effect’ (‘nachwirkungsfrei’) is due to Reichenbach. Reichenbach, however, used it at the time only in the absolute sense of ‘insensitive to selection according to any preceding group of elements’. The idea of introducing a recursively definable concept of 1-freedom, 2-freedom, ... and n-freedom, and of thus utilizing the recursive method for analysing neighbourhood selections and especially for constructing random sequences is mine. (I have used the same recursive method also for defining the mutual independence of n events.) This method is quite different from Reichenbach’s. See also footnote 4 to section 58, and especially footnote 2 to section 60, below. Added 1968: I have now found that the term was used long before Reichenbach by Smoluchowski.

1 As Dr. K. Schiff has pointed out to me, it is possible to simplify this definition. It is enough to demand insensitivity to selection of any predecessor n-tuple (for a given n). Insensitivity to selection of (n−1)-tuples (etc.) can then be proved easily.

*3 Cf. note *1 to appendix iv. The result is a sequence of the length 2ⁿ + n − 1 such that by omitting its last n − 1 elements, we obtain a generating period for an m-free alternative, with m = n − 1.

*4 The following definition, applicable to any given long but finite alternative A, with equidistribution, seems appropriate. Let N be the length of A, and let n be the greatest integer such that 2ⁿ⁺¹ ≤ N. Then A is said to be perfectly random if and only if the relative number of occurrences of any given pair, triplet, ..., m-tuplet (up to m = n) deviates from that of any other pair, triplet, ..., m-tuplet, by not more than, say, m/√N respectively. This characterization makes it possible to say of a given alternative A that it is approximately random; and it even allows us to define a degree of approximation. A more elaborate definition may be based upon the method (of maximizing my E-function) described under points 8 ff. of my Third Note reprinted in appendix *ix.
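
Read computationally, this definition amounts to a nested frequency test. The sketch below is one crude rendering; the reading 2ⁿ⁺¹ ≤ N and the tolerance m/√N follow the text above, and everything else is an assumption of the sketch:

    from collections import Counter
    from itertools import product
    from math import sqrt, floor, log2

    def is_perfectly_random(A):
        """A: a 0-1 sequence. For every tuplet length m up to n (the greatest
        n with 2**(n+1) <= len(A)), checks that all m-tuplets occur with
        relative frequencies agreeing to within m/sqrt(N)."""
        N = len(A)
        n = floor(log2(N)) - 1
        for m in range(1, n + 1):
            counts = Counter(tuple(A[i:i + m]) for i in range(N - m + 1))
            freqs = [counts[t] / (N - m + 1) for t in product((0, 1), repeat=m)]
            if max(freqs) - min(freqs) > m / sqrt(N):
                return False
        return True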

1 The corresponding problem in connection with infinite sequences of adjoining segments I call ‘Bernoulli’s problem’ (following von Mises, Wahrscheinlichkeitsrechnung, 1931, p. 128); and in connection with infinite sequences of overlapping segments I call it ‘the quasi-Bernoulli problem’ (cf. note 1 to section 60). Thus the problem here discussed would be the quasi-Bernoulli problem for finite sequences.

*1 In the original text, I used the term ‘Newton’s formula’; but since this seems to be rarely used in English, I decided to translate it by ‘binomial formula’.

*1 I come here to the point where I failed to carry out fully my intuitive programme—that of analysing randomness as far as it is possible within the region of finite sequences, and of proceeding to infinite reference sequences (in which we need limits of relative frequencies) only afterwards, with the aim of obtaining a theory in which the existence of frequency limits follows from the random character of the sequence. I could have carried out this programme very easily by constructing, as my next step, (finite) shortest n-free sequences for a growing n, as I did in my old appendix iv. It can then be easily shown that if, in these shortest sequences, n is allowed to grow without bounds, the sequences become infinite, and the frequencies turn without further assumption into frequency limits. (See note *2 to appendix iv, and my new appendix *vi.) All this would have simplified the next sections which, however, retain their significance. But it would have solved completely and without further assumption the problems of sections 63 and 64; for since the existence of limits becomes demonstrable, points of accumulation need no longer be mentioned.

These improvements, however, remain all within the framework of the pure frequency theory: except in so far as they define an ideal standard of objective disorder, they become unnecessary if we adopt a propensity interpretation of the neo-classical (measure-theoretical) formalism, as explained in sections *53 ff of my Postscript. But even then it remains necessary to speak of frequency hypotheses—of hypothetical estimates and their statistical tests; and thus the present section remains relevant, as does much in the succeeding sections, down to section 64.

1 Later, in sections 65 to 68, I will discuss the problem of decidability of frequency hypotheses, that is to say, the problem whether a conjecture or hypothesis of this kind can be tested; and if so, how; whether it can be corroborated in any way; and whether it is falsifiable. *Cf. also appendix *ix.

2 Keynes deals with such questions in his analysis of the principle of indifference. Cf. op. cit., Chapter IV, pp. 41–64.

3 Born and Jordan, for instance, in Elementare Quantenmechanik, 1930, p. 308, use the first of these terms in order to denote a hypothesis of equal distribution. A. A. Tschuprow, on the other hand, uses the expression ‘a priori probability’ for all frequency hypotheses, in order to distinguish them from their statistical tests, i.e. the results, obtained a posteriori, of empirical counting.

*2 This is precisely the programme here alluded to in note *1 above, and carried out in appendices iv and *vi.

1 Cf. for example von Mises’s Wahrscheinlichkeit, Statistik und Wahrheit, 1928, p. 25; English translation, 1939, p. 33.

2 Cf. for instance, Feigl, Erkenntnis 1, 1930, p. 256, where that formulation is described as ‘not mathematically expressible’. Reichenbach’s criticism, in Mathematische Zeitschrift 34, 1932, p. 594 f., is very similar.

3 Dörge has made a similar remark, but he did not explain it.

*1 The last seven words (which are essential) were not in the German text.

4 Cf. for instance, Kamke, Einführung in die Wahrscheinlichkeitstheorie, 1932, p. 147, and Jahresbericht der Deutschen mathem. Vereinigung 42, 1932. Kamke’s objection must also be raised against Reichenbach’s attempt to improve the axiom of randomness by introducing normal sequences, since he did not succeed in proving that this concept is non-empty. Cf. Reichenbach, Axiomatik der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift 34, 1932, p. 606.

*2 It is, however, answerable if any given denumerable set of gambling systems is to be ruled out; for then an example of a sequence may be constructed (by a kind of diagonal method). See section *54 of the Postscript (text after note 5), on A. Wald.

*3 The reference to appendix iv is of considerable importance here. Also, most of the objections which have been raised against my theory were answered in the following paragraph of my text.

*4 Cf. the last paragraph of section 60, below.

*5 The word ‘only’ is only correct if we speak of (predictive) gambling systems; cf. note *3 to section 60, below, and note 6 to section *54 of my Postscript.

5 Example: the selection of all terms whose number is a prime.

*1 At present I should be inclined to use the concept of ‘objective probability’ differently—that is, in a wider sense, so as to cover all ‘objective’ interpretations of the formal calculus of probabilities, such as the frequency interpretation and, more especially, the propensity interpretation which is discussed in the Postscript. Here, in section 59, the concept is used merely as an auxiliary concept in the construction of a certain form of the frequency theory.

1 The corresponding question for sequences of overlapping segments, i.e. the problem of α(n)F′(m), answered by (2), can be called the ‘quasi-Bernoulli problem’; cf. note 1 to section 56 as well as section 61.

2 Reichenbach (Axiomatik der Wahrscheinlichkeitsrechnung, Mathematische Zeitschrift 34, 1932, p. 603) implicitly contests this when he writes, ‘... normal sequences are also free from after-effect, whilst the converse does not necessarily hold’. But Reichenbach’s normal sequences are those for which (3) holds. (My proof is made possible by the fact that I have departed from previous procedure, by defining the concept ‘freedom from after-effect’ not directly, but with the help of ‘n-freedom from after-effect’, thus making it accessible to the procedure of mathematical induction.)

*1 Only a sketch of the proof is here given. Readers not interested in the proof may turn to the last paragraph of the present section.

3 Von Smoluchowski based his theory of the Brownian movement on after-effect sequences, i.e. on sequences of overlapping segments.

*2 The following formulation may be intuitively helpful: if the 0,0 pairs are more frequent in certain characteristic distances than in others, then this fact may be easily used as the basis of a simple system which would somewhat improve the chances of a gambler. But gambling systems of this type are incompatible with the ‘absolute freedom’ of the sequence. The same consideration underlies the ‘second step’ of the proof.

*3 Here the word ‘all’ is, I now believe, mistaken, and should be replaced, to be a little more precise, by ‘all those... that might be used as gambling systems’. Abraham Wald showed me the need for this correction in 1935. Cf. footnotes *1 and *5 to section 58 above (and footnote 6, referring to A. Wald, in section *54 of my Postscript).

1 Von Mises distinguishes Bernoulli’s—or Poisson’s—theorem from its inverse which he calls ‘Bayes’s theorem’ or ‘the second law of great numbers’.

2 Cf. note 3 to section 60, and note 5 to section 64.

*1 This sentence has been reformulated (without altering its content) in the translation by introducing the concept of a ‘fair sample’: the original operates only with the definiens of this concept.

*2 Keynes says of the ‘Law of Great Numbers’ that ‘the “Stability of Statistical Frequencies” would be a much better name for it’. (Cf. his Treatise, p. 336.)

1 Von Mises also uses the expression ‘almost certain’, but according to him it is of course to be regarded as defined by ‘having a frequency close to [or equal to] 1’.

2 Keynes, A Treatise on Probability, 1921, p. 338. *The preceding passage in quotation marks had to be inserted here because it re-translates the passage I quoted from the German edition of Keynes on which my text relied.

*1 It may be worth while to be more explicit on this point. Keynes writes (in a passage preceding the one quoted above): ‘If the probability of an event’s occurrence under certain conditions is p, then... the most probable proportion of its occurrences to the total number of occasions is p ...’ This ought to be translatable, according to his own theory, into: ‘If the degree of rational belief in the occurrence of an event is p, then p is also a proportion of occurrences, i.e. a relative frequency—that, namely, in whose emergence the degree of our rational belief is greatest.’ I am not objecting to the latter use of the expression ‘rational belief ’. (It is the use which might also be rendered by ‘It is almost certain that...’.) What I do object to is the fact that p is at one time a degree of rational belief and at another a frequency; in other words, I do not see why an empirical frequency should be equal to a degree of rational belief; or that it can be proved to be so by any theorem however deep. (Cf. also section 49 and appendix *ix.)

3 This was first pointed out by von Mises in a similar connection in Wahrscheinlichkeit, Statistik und Wahrheit, 1928, p. 85 (2nd edition 1936, p. 136; the relevant words are missing in the English translation). It may be further remarked that relative frequencies cannot be compared with ‘degrees of certainty of our knowledge’ if only because the ordering of such degrees of certainty is conventional and need not be carried out by correlating them with fractions between 0 and 1. Only if the metric of the subjective degrees of certainty is defined by correlating relative frequencies with it (but only then) can it be permissible to derive the law of great numbers within the framework of the subjective theory (cf. section 73).

*2 But it is possible to use Bernoulli’s theorem as a bridge from the objective interpretation in terms of ‘propensities’ to statistics. Cf. sections *49 to *57 of my Postscript.


1 As an example von Mises cites the sequence of figures occupying the last place of a six-figure table of square roots. Cf. for example, Wahrscheinlichkeit, Statistik und Wahrheit, 1928, p. 86 f.; (2nd edition 1936, p. 137; English translation, p. 165), and Wahrscheinlichkeitsrechnung, 1931, p. 181 f.

*1 I still consider my old doubt concerning the assumption of an axiom of convergence, and the possibility of doing without it, perfectly justified: it is justified by the developments indicated in appendix iv, note *2, and in appendix *vi, where it is shown that randomness (if defined by ‘shortest random-like sequences’) entails convergence which therefore need not be separately postulated. Moreover, my reference to the classical formalism is justified by the development of the neo-classical (or measure-theoretical) theory of probability, discussed in chapter *iii of the Postscript; in fact, it is justified by Borel’s ‘normal numbers’. But I do not agree any longer with the view implicit in the next sentence of my text, although I agree with the remaining paragraphs of this section.

*1 In order not to postulate convergence, I appealed in the following paragraph to what can be demonstrated—the existence of points of accumulation. All this becomes unnecessary if we adopt the method described in note *1 to section 57, and in appendix *vi.

1 A fact which, surprisingly enough, has not hitherto been utilized in probability theory.

2 It can easily be shown that if more than one middle frequency exists in a reference sequence then the values of these middle frequencies form a continuum.

3 The concept of ‘independent selection’ must be interpreted more strictly than hitherto, since otherwise the validity of the special multiplication theorem cannot be proved. For details see my work mentioned in note 3 to section 51. (*This is now superseded by appendix *vi.)

4 We can do this because it must be possible to apply the theory for finite classes (with the exception of the theorem of uniqueness) immediately to middle frequencies. If a sequence α has a middle frequency p, then it must contain—whatever the term with which the counting starts—segments of any finite magnitude, the frequency of which deviates from p as little as we choose. The calculation can be carried out for these. That p is free from after-effect will then mean that this middle frequency of α is also a middle frequency of any predecessor selection of α.

*2 It is possible to combine the approach described in note *1 to section 57, and in appendices iv and *vi, with these two requirements by retaining requirement (1) and replacing requirement (2) by the following:

(+ 2) Requirement of finitude: the sequence must become, from its commencement, as quickly n-free as possible, and for the largest possible n; or in other words, it must be (approximately) a shortest random-like sequence.

5 The quasi-Bernoulli formulae (symbol: F′) also remain unambiguous for chance-like sequences (according to the new definition), although ‘F′’ now symbolizes only a middle frequency.

*3 I am in full agreement with what follows here, even though any reference to ‘middle frequencies’ becomes redundant if we adopt the method described in section 57, note *1, and appendix iv.

6 Cf., for instance, Feigl, Erkenntnis 1, 1930, p. 254: ‘In the law of great numbers an attempt is made to reconcile two claims which prove on closer analysis to be in fact mutually contradictory. On the one hand... every arrangement and distribution is supposed to be able to occur once. On the other hand, these occurrences... are to appear with a corresponding frequency.’ (That there is in fact no incompatibility here is proved by the construction of model sequences; cf. appendix iv.)

*4 What is said in this paragraph implicitly enhances the significance, for the solution of the ‘fundamental problem’, of an objectively interpreted neo-classical theory. A theory of this kind is described in chapter *iii of my Postscript.

7 Cf. note 3 to section 51. In retrospect I wish to make it clear that I have taken a conservative attitude to von Mises’s four points (cf. end of section 50). I too define probability only with reference to random sequences (which von Mises calls ‘collectives’). I too set up a (modified) axiom of randomness, and in determining the task of the calculus of probability I follow von Mises without reservation. Thus our differences concern only the limit axiom which I have shown to be superfluous and which I have replaced by the demand for uniqueness, and the axiom of randomness which I have so modified that model sequences can be constructed. (Appendix iv.) As a result, Kamke’s objection (cf. note 3 to section 53) ceases to be valid.

1 But not as void of ‘logical content’ (cf. section 35); for clearly, not every frequency hypothesis holds tautologically for every sequence.

*1 I believe that my emphasis upon the irrefutability of probabilistic hypotheses—which culminates in section 67—was healthy: it laid bare a problem which had not been discussed previously (owing to the general emphasis on verifiability rather than falsifiability, and the fact that probability statements are, as explained in the next section, in some sense verifiable or ‘confirmable’). Yet my reform, proposed in note *1 to section 57 (see also note *2 to section 64), changes the situation entirely. For this reform, apart from achieving other things, amounts to the adoption of a methodological rule, like the one proposed below in section 68, which makes probability hypotheses falsifiable. The problem of decidability is thereby transformed into the following problem: since empirical sequences can only be expected to approximate to shortest random-like sequences, what is acceptable and what is unacceptable as an approximation? The answer to this is clearly that closeness of approximation is a matter of degree, and that the determination of this degree is one of the main problems of mathematical statistics.

Added 1972. A new solution is given by D. Gillies. See below p. 443.

1 Cf. Section 80, especially notes 3 and 6.

*1 Although I do not disagree with this, I now believe that the probabilistic concepts ‘almost deducible’ and ‘almost contradictory’ are extremely useful in connection with our problem; see appendix *ix, and chapter *iii of the Postscript.

*2 Of course, I never intended to suggest that every statement of the form ‘for every x, there is a y with the observable property α’ is non-falsifiable and thus non-testable: obviously, the statement ‘for every toss with a penny resulting in 1, there is an immediate successor resulting in 0’ is both falsifiable and in fact falsified. What creates non-falsifiability is not just the form ‘for every x there is a y such that ...’ but the fact that the ‘there is’ is unbounded—that the occurrence of the y may be delayed beyond all bounds: in the probabilistic case, y may, as it were, occur as late as it pleases. An element ‘0’ may occur at once, or after a thousand tosses, or after any number of tosses: it is this fact that is responsible for non-falsifiability. If, on the other hand, the distance of the place of occurrence of y from the place of occurrence of x is bounded, then the statement ‘for every x there is a y such that ...’ may be falsifiable.

My somewhat unguarded statement in the text (which tacitly presupposed section 15) has led, to my surprise, in some quarters to the belief that all statements—or ‘most’ statements, whatever this may mean—of the form ‘for every x there is a y such that...’ are non-falsifiable; and this has then been repeatedly used as a criticism of the falsifiability criterion. See, for example, Mind 54, 1945, pp. 119 f. The whole problem of these ‘all-and-some statements’ (this term is due to J. W. N. Watkins) is discussed more fully in my Postscript; see especially sections *24 f.

2 It can be put in the following form: For every positive ε, for every predecessor n-tuple, and every element with the ordinal number x there is an element, selected according to predecessor selection, with the ordinal number y > x such that the frequency up to the term y deviates from a fixed value p by an amount less than ε.

*3 The situation is totally different if the requirement (+ 2) of note *2 to section 64 is adopted: this is empirically significant, and renders the probability hypotheses falsifiable (as asserted in note *1 to section 65).

3 The formulae of the probability calculus are also derivable in this axiomatization, only the formulae must be interpreted as existential formulae. The theorem of Bernoulli, for example, would no longer assert that the single probability value for a particular n of α(n)F(Δp) lies near to 1, but only that (for a particular n) among the various probability values of α(n)F(Δp) there is at least one which lies near to 1.

*4 As has been shown in the new footnote *2 to section 64, any special requirement of uniqueness can be eliminated, without sacrificing uniqueness.

4 Both the axiom of randomness and the axiom of uniqueness can properly be regarded as such (intensional) warnings. For example, the axiom of randomness cautions us not to treat sequences as random if we suppose (no matter on what grounds) that certain gambling systems will be successful for them. The axiom of uniqueness cautions us not to attribute a probability q (with q ≠ p) to a sequence which we suppose can be approximately described by means of the hypothesis that its probability equals p.

5 Similar misgivings made Schlick object to the limit axiom (Die Naturwissenschaften 19, 1931, p. 158).

6 Here the positivist would have to recognize a whole hierarchy of ‘meaninglessnesses’. To him, non-verifiable natural laws appear ‘meaningless’ (cf. section 6, and quotations in notes 1 and 2), and thus still more so probability hypotheses, which are neither verifiable nor falsifiable. Of our axioms, the axiom of uniqueness, which is not extensionally significant, would be more meaningless than the meaningless axiom of irregularity, which at least has extensional consequences. Still more meaningless would be the limit axiom, since it is not even intensionally significant.

*1 When writing this, I thought that speculations of the kind described would be easily recognizable as useless, just because of their unlimited applicability. But they seem to be more tempting than I imagined. For it has been argued, for example by J. B. S. Haldane (in Nature 122, 1928, p. 808; cf. also his Inequality of Man, pp. 163 f.) that if we accept the probability theory of entropy, we must regard it as certain, or as almost certain, that the world will wind itself up again accidentally if only we wait long enough. This argument has of course been frequently repeated since by others. Yet it is, I think, a perfect example of the kind of argument here criticized, and one which would allow us to expect, with near certainty, anything we liked. Which all goes to show the dangers inherent in the existential form shared by probability statements with most of the statements of metaphysics. (Cf. section 15.)

*1 The problem here discussed has been treated in a clear and thorough way long ago by the physicists P. and T. Ehrenfest, Encycl. d. Math. Wiss., 4. Teilband, Heft 6 (12.12.1911), section 30. They treated it as a conceptual and epistemological problem. They introduced the idea of ‘probability hypotheses of first, second, ... kth order’: a probability hypothesis of second order, for example, is an estimate of the frequency with which certain frequencies occur in an aggregate of aggregates. However, P. and T. Ehrenfest do not operate with anything corresponding to the idea of a reproducible effect, which is here used in a crucial way in order to solve the problem which they expounded so well. See especially the opposition between Boltzmann and Planck to which they refer in notes 247 f., and which can, I believe, be resolved by using the idea of a reproducible effect. For under appropriate experimental conditions, fluctuations may lead to reproducible effects, as Einstein’s theory of Brownian movement showed so impressively. See also note *1 to section 65, and appendices *vi and *ix.
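
The idea of a hypothesis of second order may be illustrated by a small simulation; the numbers of aggregates and of elements, and the interval chosen, are arbitrary illustrative assumptions.

    # An 'aggregate of aggregates': 1,000 aggregates of 100 tosses each.
    from random import Random

    rng = Random(1)
    aggregates = [[rng.randint(0, 1) for _ in range(100)] for _ in range(1000)]

    # First order: the frequency of 1 within each single aggregate.
    first_order = [sum(a) / len(a) for a in aggregates]

    # Second order: the frequency, among the aggregates, with which the
    # first-order frequencies fall inside a given interval, say [0.45, 0.55].
    inside = [0.45 <= f <= 0.55 for f in first_order]
    print(sum(inside) / len(inside))   # an estimate of a second-order frequency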

1 The quotation is from Born-Jordan, Elementare Quantenmechanik, 1930, p. 306; cf. also the beginning of Dirac’s Quantum Mechanics, p. 10 of the 1st edition, 1930. A parallel passage (slightly abbreviated) is to be found on p. 14 of the 3rd edition, 1947. See also Weyl, Gruppentheorie und Quantenmechanik, 2nd edition, 1931, p. 66; English translation by H. P. Robertson: The Theory of Groups and Quantum Mechanics, 1931, p. 74 f.

*2 The methodological decision or rule here formulated narrows the concept of probability—just as it is narrowed by the decision to adopt shortest random-like sequences as mathematical models of empirical sequences, cf. note *1 to section 65.

*3 I am now a little dubious about the words ‘without much difficulty’; in fact, in all cases, except those of the extreme macro effects discussed in this section, very subtle statistical methods have to be used. See also appendix *ix, especially my ‘Third Note’.

*4 The remarks that follow in this paragraph (and some of the discussions later in this section) are, I now believe, clarified and superseded by the considerations in appendix *ix; see especially points 8 ff. of my Third Note. With the help of the methods there used, it can be shown that almost all possible statistical samples of large size n will strongly undermine a given probabilistic hypothesis, that is to say, give it a high negative degree of corroboration; and we may decide to interpret this as refutation or falsification. Of the remaining samples, most will support the hypothesis, that is to say, give it a high positive degree of corroboration. Comparatively few samples of large size n will give a probabilistic hypothesis an indecisive degree of corroboration (whether positive or negative). Thus we can expect to be able to refute a probabilistic hypothesis, in the sense here indicated; and we can expect this perhaps even more confidently than in the case of a non-probabilistic hypothesis. The methodological rule or decision to regard (for a large n) a negative degree of corroboration as a falsification is, of course, a specific case of the methodological rule or decision discussed in the present section—that of neglecting certain extreme improbabilities.
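
The decision here described may be sketched as a rule of method. The sketch below replaces the degree of corroboration of appendix *ix by the elementary standard-error criterion; the threshold k, like every name in it, is an illustrative assumption.

    import math

    def verdict(successes, n, p, k=3.0):
        # Standardised deviation of the observed frequency from the
        # hypothesised p; by decision, a deviation beyond k standard
        # errors is counted as a falsification (an 'extreme improbability'
        # which we agree to neglect as a mere possibility).
        f = successes / n
        se = math.sqrt(p * (1 - p) / n)   # standard error under the hypothesis
        z = (f - p) / se
        return "treat as falsified" if abs(z) > k else "treat as compatible"

    print(verdict(5300, 10_000, 0.5))   # f = 0.530, z = 6.0 -> falsified
    print(verdict(5040, 10_000, 0.5))   # f = 0.504, z = 0.8 -> compatible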

2 Eddington, The Nature of the Physical World, 1928, p. 75.

*1 This does not mean that I made any concession here to a subjective interpretation of probability, or of disorder or randomness.

*2 In this paragraph, I dismissed (because of its metaphysical character) a metaphysical theory which I am now, in my Postscript, anxious to recommend because it seems to me to open new vistas, to suggest the resolution of serious difficulties, and to be, perhaps, true. Although when writing this book I was aware of holding metaphysical beliefs, and although I even pointed out the suggestive value of metaphysical ideas for science, I was not alive to the fact that some metaphysical doctrines were rationally arguable and, in spite of being irrefutable, criticizable. See especially the last section of my Postscript.

*3 It would have been clearer, I think, had I argued as follows. We can never repeat an experiment precisely—all we can do is to keep certain conditions constant, within certain limits. It is therefore no argument for objective fortuity, or chance, or absence of law, if certain aspects of our results repeat themselves, while others vary irregularly; especially if the conditions of the experiment (as in the case of spinning a penny) are designed with a view to making conditions vary. So far, I still agree with what I have said. But there may be other arguments for objective fortuity; and one of these, due to Alfred Landé (‘Landé’s blade’), is highly relevant in this context. It is now discussed at length in my Postscript, sections *90 f.

1 As Schlick says in Die Kausalität in der gegenwärtigen Physik, Naturwissenschaften 19, 1931, p. 157.

1 A. March well says (Die Grundlagen der Quantenmechanik, 1931, p. 250) that the particles of a gas cannot behave ‘... as they choose; each one must behave in accordance with the behaviour of the others. It can be regarded as one of the most fundamental principles of quantum theory that the whole is more than the mere sum of the parts’.

2 Von Mises, Über kausale und statistische Gesetzmässigkeiten in der Physik, Erkenntnis 1, 1930, p. 207 (cf. Naturwissenschaften 18, 1930).

*1 The thesis here advanced by von Mises and taken over by myself has been contested by various physicists, among them P. Jordan (see Anschauliche Quantentheorie, 1936, p. 282, where Jordan uses, as an argument against my thesis, the fact that certain forms of the ergodic hypothesis have recently been proved). But in the form that probabilistic conclusions need probabilistic premises—for example, measure-theoretical premises into which certain equiprobabilistic assumptions enter—my thesis seems to me supported rather than invalidated by Jordan’s examples. Another critic of this thesis was Albert Einstein, who attacked it in the last paragraph of an interesting letter which is here reprinted in appendix *xii. I believe that, at that time, Einstein had in mind a subjective interpretation of probability, and a principle of indifference (which looks, in the subjective theory, as if it were not an assumption about equiprobabilities). Much later Einstein adopted, at least tentatively, a frequency interpretation (of the quantum theory).

*1 The term ‘formalistisch’ in the German text was intended to convey the idea of a statement which is singular in form (or ‘formally singular’) although its meaning can in fact be defined by statistical statements.

1 The sign ‘... ε...’, called the copula, means ‘... is an element of the class...’; or else, ‘... is an element of the sequence...’.

*2 At present I think that the question of the relation between the various interpretations of probability theory can be tackled in a much simpler way—by giving a formal system of axioms or postulates and proving that it is satisfied by the various interpretations. Thus I regard most of the considerations advanced in the rest of this chapter (sections 71 and 72) as being superseded. See appendix *iv, and chapters *ii, *iii, and *v of my Postscript. But I still agree with most of what I have written, provided my ‘reference classes’ are determined by the conditions defining an experiment, so that the ‘frequencies’ may be considered as the result of propensities.

*3 I do not now object to the view that an event may hang in the balance, and I even believe that probability theory can best be interpreted as a theory of the propensities of events to turn out one way or another. (See my Postscript.) But I should still object to the view that probability theory must be so interpreted. That is to say, I regard the propensity interpretation as a conjecture about the structure of the world.

*4 This somewhat disparaging characterization fits perfectly my own views which I now submit to discussion in the ‘Metaphysical Epilogue’ of my Postscript, under the name of ‘the propensity interpretation of probability’.

1 Usually (cf. section 35).

2 Waismann, Logische Analyse des Wahrscheinlichkeitsbegriffes, Erkenntnis 1, 1930, p. 128 f.