Chapter Three: Statistics for Data Science
This chapter sheds some light on the statistical concepts used in data science. It is important to remember that data science is not a new field, and most statisticians can work as data scientists. Data science borrows many concepts from statistics, since statistics is the best tool to help you process and interpret the information in a data set, and statistical tools let you extract a great deal of information from it. If you want to master data science and become an expert in the field, you need to learn everything you can about statistics. While there are many topics in statistics a data scientist should know about, the following are the most important to consider:
● Descriptive Statistics
● Inferential Statistics
Descriptive Statistics
Descriptive statistics is the practice of summarizing and presenting data so that it is easy to read and interpret. It deals with quantitative summaries of the data, expressed through numbers or graphs. Some topics you need to learn about are listed below:
- Normal Distribution
- Central Tendency
- Variability
- Kurtosis
Normal Distribution
A normal distribution, also known as a Gaussian distribution, is a continuous distribution often used in statistics. Any data set following a normal distribution spreads across the graph as a bell-shaped curve. In a normal distribution, the data points peak at the center of the bell-shaped curve, which represents the mean of the data set. As data points move away from the mean, they fall toward the tails of the curve. Many inferential techniques assume normality, so you need to ensure the data you look at is distributed normally before you make inferences from the data set.
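A minimal sketch of such a check in Python, using NumPy and SciPy; the sample size and the mean of 50 and standard deviation of 10 are arbitrary values chosen for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
data = rng.normal(loc=50, scale=10, size=1_000)  # normally distributed sample

print("sample mean:", data.mean())
print("sample standard deviation:", data.std(ddof=1))

# A normality test (Shapiro-Wilk) gives a quick sense of whether the data
# could plausibly have come from a normal distribution.
statistic, p_value = stats.shapiro(data)
print("Shapiro-Wilk p-value:", p_value)
```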
Central Tendency
Measures of central tendency help you determine the center values of the data set. There are three commonly used measures: mean, median and mode. The mean, or arithmetic mean, of any distribution lies at the center of the data set. You can use the following formula to calculate the mean or average of the data set: (sum of all points in the data set) / (number of data points)
The median is the middle value of the data set once the points have been sorted in ascending order. If you have an odd number of values, you can easily find the middle value; if you have an even number of data points, you take the average of the two data points in the middle of the data set. The last measure is the mode, which is the data point that occurs most often in the data set.
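A minimal sketch of the three measures in Python; the data points are made up for illustration:

```python
import statistics

points = [2, 3, 3, 5, 7, 8, 9]

mean = sum(points) / len(points)    # (sum of all points) / (number of data points)
median = statistics.median(points)  # middle value of the sorted data
mode = statistics.mode(points)      # most frequently occurring value

print(mean, median, mode)           # 5.285..., 5, 3
```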
Variability
Variability helps you determine how far the data points lie from the mean of the data set, and how much the data points differ from one another. You can assess variability with measures such as the range, variance and standard deviation. The range is the difference between the largest and smallest values in the data set.
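A minimal sketch of these measures, reusing the same made-up data points:

```python
import statistics

points = [2, 3, 3, 5, 7, 8, 9]

data_range = max(points) - min(points)  # largest value minus smallest value
variance = statistics.variance(points)  # sample variance
std_dev = statistics.stdev(points)      # sample standard deviation

print(data_range, variance, std_dev)
```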
Kurtosis and Skewness
The skewness of the data set helps you determine how symmetrical it is. If the data is distributed evenly around its center, the curve takes the shape of a bell and the data is not skewed. If the tail of the curve stretches to the right, the data is positively skewed; if it stretches to the left, the data is negatively skewed. In other words, the bulk of the data sits on one side of the measures of central tendency. Kurtosis is a measure that describes the tails of the distribution: when you plot the data points on a graph, it tells you whether the distribution is light-tailed or heavy-tailed compared with a normal distribution.
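A minimal sketch of measuring skewness and kurtosis with SciPy; the two generated samples are arbitrary examples of a symmetric and a right-skewed distribution, and the kurtosis reported here is excess kurtosis, so 0 corresponds to the tails of a normal distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
symmetric = rng.normal(size=1_000)          # roughly symmetric
right_skewed = rng.exponential(size=1_000)  # long tail to the right

print("symmetric:    skew =", stats.skew(symmetric),
      " kurtosis =", stats.kurtosis(symmetric))
print("right-skewed: skew =", stats.skew(right_skewed),
      " kurtosis =", stats.kurtosis(right_skewed))
```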
Inferential Statistics
Descriptive statistics describe the data you have, while inferential statistics are about drawing insights that go beyond it. Inferential statistics lets you draw conclusions about a large population based on a small data set or sample. Let us assume you need to estimate how many people in Africa have received the polio vaccine. You can perform this analysis in two different ways:
- Ask every individual in Africa if they have been given the polio vaccine
- Take a sample of people from the continent, make sure it is drawn from different parts of the continent, and extrapolate the findings across the entire continent
The first approach is practically impossible: you cannot go around the entire continent asking every person whether they have received the vaccine. The second method is the better way to do this, since you can draw insights or conclusions from the selected sample and extrapolate the findings to the larger population. The following are some tools used in inferential statistics.
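Before turning to those tools, here is a minimal sketch of the sampling idea itself; the population size, vaccination rate and sample size are all invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend population: True = vaccinated, False = not (true rate of 80%).
population = rng.random(1_000_000) < 0.80

# Survey a random sample instead of every individual.
sample = rng.choice(population, size=2_000, replace=False)
print("estimated vaccination rate from the sample:", sample.mean())
```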
Central Limit Theorem
According to the central limit theorem, when you draw sufficiently large samples from a population, the averages of those samples cluster around the average of the entire population and follow an approximately normal distribution, no matter how the underlying data is distributed. The spread of those sample averages, known as the standard error, is the population's standard deviation divided by the square root of the sample size. The more samples you draw, the more closely the curve of their averages resembles a normal curve.
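A minimal sketch of this in Python; the exponential population and the sample size of 50 are arbitrary choices, picked only because an exponential distribution is clearly not normal:

```python
import numpy as np

rng = np.random.default_rng(7)
population = rng.exponential(scale=2.0, size=100_000)  # skewed, far from normal

n = 50
sample_means = np.array(
    [rng.choice(population, size=n).mean() for _ in range(2_000)]
)

print("population mean:           ", population.mean())
print("mean of the sample means:  ", sample_means.mean())
print("theoretical standard error:", population.std() / np.sqrt(n))
print("spread of the sample means:", sample_means.std())
```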
You need to understand the concept of confidence intervals if you want to use the central limit theorem. A confidence interval gives an approximate range for the mean of the population. The interval is built around the sample mean plus or minus a margin of error, and you can calculate that margin by "multiplying the standard error of the mean with the z-score of the chosen confidence level."
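A minimal sketch of a 95% confidence interval built that way; the sample itself is generated with arbitrary values for illustration:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
sample = rng.normal(loc=100, scale=15, size=200)  # stand-in for observed data

mean = sample.mean()
standard_error = sample.std(ddof=1) / np.sqrt(len(sample))
z = stats.norm.ppf(0.975)  # z-score for a 95% confidence level

margin_of_error = z * standard_error
print(f"95% confidence interval: {mean - margin_of_error:.2f} "
      f"to {mean + margin_of_error:.2f}")
```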
Hypothesis Testing
Hypothesis testing is the process of testing an assumption you make about the data set. In this process, you analyze a smaller sample and use the results to evaluate the assumption. The hypothesis you need to test is called the null hypothesis, and you check its validity against an alternative hypothesis. Consider the following example: you are conducting a survey to determine who smokes and who does not, and whether people who smoke are more likely to have cancer.
When you conduct this survey, you begin with the assumption that the number of subjects with cancer who smoke is the same as the number of subjects with cancer who do not smoke. This is your null hypothesis, and you test it to see whether the evidence lets you reject it. The alternative hypothesis is that the number of subjects with cancer who smoke is greater than the number of subjects with cancer who do not smoke.
Based on the given data and evidence, you can test the hypotheses and analyze the data to determine whether the null hypothesis can be rejected.
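A minimal sketch of such a test; the survey counts are invented, and a chi-square test of independence is used here as one common way to compare the two groups:

```python
from scipy import stats

# Rows: smokers, non-smokers; columns: cancer, no cancer (invented counts).
observed = [[90, 910],
            [60, 1440]]

chi2, p_value, dof, expected = stats.chi2_contingency(observed)
print("p-value:", p_value)  # a small p-value is evidence against the null hypothesis
```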
ANOVA
ANOVA (analysis of variance) is another statistical technique used to test hypotheses across different groups of data. It helps you determine whether the groups you are comparing have similar averages and variances, while keeping the error rate of the comparison limited. ANOVA is based on the F-ratio, which is the ratio of the mean square between groups to the mean square within groups. The following are the steps to set up an ANOVA (a short sketch in code follows the list):
- Write down the hypotheses and understand why you need them; every analysis should have a null and an alternative hypothesis
- The null hypothesis assumes the averages of the groups are identical
- The alternative hypothesis states that at least one group has a different average
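A minimal sketch of a one-way ANOVA with SciPy; the three groups of values are invented for illustration:

```python
from scipy import stats

group_a = [23, 20, 25, 22, 24]
group_b = [30, 28, 33, 29, 31]
group_c = [22, 21, 24, 23, 20]

# Null hypothesis: all three group means are identical.
f_ratio, p_value = stats.f_oneway(group_a, group_b, group_c)
print("F-ratio:", f_ratio, "p-value:", p_value)
```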