A random sample of a dataset is a subset whose elements are randomly selected. The given dataset is called the population and is usually very large; for example, all male Americans aged 25-35 years.
In a simulation, selecting a random sample is straightforward using a random number generator. But in a real-world context, random sampling is nontrivial.
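In a simulation, drawing a random sample takes a single library call; a minimal Python sketch, where the population of numeric IDs is a made-up stand-in for a real dataset:

```python
import random

# A small stand-in "population"; a real one (e.g., survey subjects) would be far larger.
population = list(range(1000))

random.seed(17)  # fixed seed so the run is reproducible
sample = random.sample(population, 10)  # 10 distinct elements, chosen uniformly

print(sample)
```

Note that `random.sample` draws without replacement, so the sample never contains a repeated element.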
Random sampling is an important part of quality control in nearly all types of manufacturing, and it is an essential aspect of polling in the social sciences. In industrial sectors such as pharmaceuticals, random sampling is critical to both production and testing.
To understand the principles of random sampling, we must take a brief detour into the field of mathematical probability theory. This requires a few technical definitions.
A random experiment is a process, real or imagined, that has a specified set of possible outcomes, any one of which could result from the experiment. The set S of all possible outcomes is called a sample space.
For example, consider the process of flipping four balanced coins. If the four coins are recognizably different, such as a penny, a nickel, a dime, and a quarter, then we could specify the sample space as S1 = {HHHH, HHHT, HHTH, …, TTTT}. An element such as HHTH denotes a head on the penny, a head on the nickel, a tail on the dime, and a head on the quarter. That set has 16 elements. Alternatively, we could run the same process without distinguishing the coins, say using four quarters, and then specify the sample space as S2 = {0, 1, 2, 3, 4}, each number indicating how many of the coins turned up heads. These are two different possible sample spaces for the same experiment.
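Both sample spaces can be enumerated mechanically; a short Python sketch (the variable names mirror S1 and S2 from the text):

```python
from itertools import product

# S1: all 16 distinguishable outcomes of flipping four labeled coins.
S1 = ["".join(flips) for flips in product("HT", repeat=4)]
print(len(S1))        # 16
print(S1[0], S1[-1])  # HHHH TTTT

# S2: the same process, recording only the number of heads.
S2 = sorted({outcome.count("H") for outcome in S1})
print(S2)             # [0, 1, 2, 3, 4]
```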
A probability function on a sample space S is a function p that assigns a number p(s) to each element s of S, and is subject to these rules:

  * p(s) ≥ 0 for every s ∈ S
  * the sum of p(s) over all s ∈ S equals 1
For the first example, we could assign p(s) = 1/16 for each s ∈ S1; that choice would make a good model of the experiment, if we assume that the coins are all balanced.
For the second example, a good model would be to assign the probabilities shown in the following table:
s | p(s)
---|---
0 | 1/16
1 | 1/4
2 | 3/8
3 | 1/4
4 | 1/16

Table 4-1. Probability distribution
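The table's entries can be derived by counting, for each s, how many of the 16 equally likely outcomes in S1 produce s heads; a Python sketch using exact fractions:

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Tally the number of heads across all 16 equally likely outcomes.
counts = Counter("".join(flips).count("H") for flips in product("HT", repeat=4))

# Dividing each count by 16 reproduces Table 4-1.
p = {s: Fraction(counts[s], 16) for s in range(5)}
for s in range(5):
    print(s, p[s])
```

The counts 1, 4, 6, 4, 1 are the binomial coefficients C(4, s), which is why p(2) = 6/16 = 3/8 is the largest entry.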
As a third example, let S3 be the set of the 26 lowercase letters of the Latin alphabet, as used in English text. The frequency distribution shown in Figure 3-20 provides a good probability function for that sample space. For example, p("a") = 0.082.
The alphabet example reminds us of two general facts about probabilities:

  * In practice, probabilities are often estimated empirically, by the relative frequencies observed in data.
  * However they are obtained, the probabilities of all the elements of a sample space must sum to 1.
A probability function p assigns a number p(s) to each element s of a sample space S. From it, we can derive the corresponding probability set function, which assigns a number P(U) to each subset U of the sample space S, simply by summing the probabilities of the subset's elements:

P(U) = ∑s ∈ U p(s)
For example, in S2, let U be the subset U = {3, 4}. This subset represents the event of getting at least three heads from the four coins:

P(U) = p(3) + p(4) = 1/4 + 1/16 = 5/16
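The set-function computation can be written directly; a minimal Python sketch (the dictionary p and the function P are illustrative names, not from the text):

```python
from fractions import Fraction

# The probability function on S2, as given in Table 4-1.
p = {0: Fraction(1, 16), 1: Fraction(1, 4), 2: Fraction(3, 8),
     3: Fraction(1, 4), 4: Fraction(1, 16)}

def P(U):
    """Probability set function: sum p(s) over the elements of the event U."""
    return sum(p[s] for s in U)

U = {3, 4}       # the event "at least three heads"
print(P(U))      # 5/16
```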
Probability set functions obey these rules:

  * P(∅) = 0 and P(S) = 1
  * 0 ≤ P(U) ≤ 1 for every subset U of S
  * P(U ⋃ V) = P(U) + P(V) − P(U ⋂ V)
The third rule can be seen from the following Venn diagram:
Figure 4-4. Venn diagram
The disk on the left represents the elements of U, and the disk on the right represents the elements of V. The values of P(U) and P(V) both include the value of P(U ⋂ V), the green area, so those elements get counted twice. So, to count the probabilities in all three regions only once each, we must subtract those that were counted twice.
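The inclusion-exclusion rule can be checked numerically on S2; a Python sketch with two illustrative overlapping events (U is "at least two heads", V is "at most two heads"):

```python
from fractions import Fraction

# The probability function on S2 from Table 4-1.
p = {0: Fraction(1, 16), 1: Fraction(1, 4), 2: Fraction(3, 8),
     3: Fraction(1, 4), 4: Fraction(1, 16)}

def P(U):
    return sum(p[s] for s in U)

U, V = {2, 3, 4}, {0, 1, 2}   # overlapping events: U ⋂ V = {2}
lhs = P(U | V)                 # P(U ⋃ V)
rhs = P(U) + P(V) - P(U & V)   # inclusion-exclusion
print(lhs, rhs)                # the two sides agree
```

Here U ⋃ V covers all of S2, so both sides equal 1; without subtracting P(U ⋂ V) = 3/8, the right side would overshoot.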
A subset U of a sample space S is called an event, and the number P(U) represents the probability that the event will occur during the experiment. So, in this example, the probability of the event of getting at least three heads is 5/16, or about 31%. In terms of relative frequencies, we could say that if we repeated the experiment many times, we should expect that event to occur about 31% of the time.
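The relative-frequency claim can be tested by simulation; a Python sketch (the seed and trial count are arbitrary choices):

```python
import random

random.seed(42)
trials = 100_000

# Flip four balanced coins per trial; count trials with at least three heads.
hits = sum(
    1 for _ in range(trials)
    if sum(random.choice((0, 1)) for _ in range(4)) >= 3
)
print(hits / trials)  # close to 5/16 = 0.3125
```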