Missing data, where a variable score has not been gathered for an observation
(for example, where some people have not answered a survey question),
can be a serious issue in most data analysis situations.
If even a single variable
involved in the analysis is missing its data point, then
most
statistical programs will throw out the entire observation!
Since missing data is common in many datasets, especially survey-type
data, you can lose substantial amounts of data this way.
Figure 4.5 (Example of the missing data problem in data analysis) illustrates
this problem.
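To make the scale of the loss concrete, here is a minimal sketch in Python (not the SAS approach covered in Chapter 9). The survey rows, variable names, and the `complete_cases` helper are all hypothetical, invented only to illustrate how a few scattered blanks can remove a large share of observations when whole rows are discarded.

```python
# Hypothetical survey data: None marks a missing answer.
rows = [
    {"id": 1, "age": 34, "trust": 4, "stress": 3},
    {"id": 2, "age": None, "trust": 5, "stress": 2},
    {"id": 3, "age": 45, "trust": None, "stress": 4},
    {"id": 4, "age": 29, "trust": 3, "stress": None},
    {"id": 5, "age": 51, "trust": 2, "stress": 5},
]
analysis_vars = ["age", "trust", "stress"]


def complete_cases(rows, variables):
    """Keep only observations with no missing value in any analysis variable."""
    return [r for r in rows if all(r[v] is not None for v in variables)]


kept = complete_cases(rows, analysis_vars)
missing_cells = sum(r[v] is None for r in rows for v in analysis_vars)
total_cells = len(rows) * len(analysis_vars)

print(f"Missing cells: {missing_cells} of {total_cells}")
print(f"Observations kept: {len(kept)} of {len(rows)}")
```

In this made-up example only 3 of the 15 data points are missing, yet only 2 of the 5 observations survive, because each blank sits in a different row.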
Missing data can severely
affect the accuracy of statistics, so you must think about and analyze
the missing data issue carefully before starting statistical analysis.
Note that missing data can be a problem in a variable (if many observations
do not have data for a specific variable column) or for a given observation
(if a given observation row is missing data for a lot of the variables).
I suggest the following practical
steps as a process to use when performing missing data analysis.
First,
always try to pre-design the study to minimize the missing data problem.
Sometimes you can plan your research methodology in such a way that
missing data is minimized. For instance:
-
Multi-item
scales (where each variable is measured through multiple measurements
as discussed previously) can help. If most of the multiple items have
data, then when you aggregate the items into the final total variable
score, you can smooth over the missing pieces. For instance, if your
Stress index is made up of 10 survey questions and someone fails to
answer two of them, the average of the other eight survey items should
be a reasonable approximation of the overall Stress score. This does
not work for those specific individual observations where most of
the data is missing for a given set of multi-item scales. (If all
of the Stress items are missing, then you have nothing to work with;
if most of the Stress items are missing, the combined index is possibly
inaccurate.) In the case of these individuals, you could either replace
the missing data or delete the individuals from the analysis, if necessary.
(See below for more on both of these options.)
-
If you use an online
or computerized data collection method, it can be programmed to not
allow missing data, although this can discourage participation. (Sometimes
people leave out data for good reason, especially if you ask silly
questions that don’t apply to everyone.)
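The multi-item idea above can be sketched as follows. This is an illustrative Python sketch, not the SAS method from Chapter 9: the `scale_score` function and its `max_missing` cutoff are my own assumptions, and the cutoff of two missing items simply mirrors the Stress example rather than being a recommended rule.

```python
def scale_score(items, max_missing=2):
    """Average the answered items of a multi-item scale.

    items: list of item answers, with None for a missing answer.
    max_missing: illustrative cutoff (an assumption); if more items than
    this are missing, no score is returned because the average would
    rest on too little data.
    """
    answered = [x for x in items if x is not None]
    if len(items) - len(answered) > max_missing:
        return None
    return sum(answered) / len(answered)


# Hypothetical respondent who skipped two of ten Stress items.
stress_items = [3, 4, None, 2, 5, 3, None, 4, 3, 4]
# Hypothetical respondent who skipped most of the items.
mostly_blank = [None, None, 4, None, None, None, 2, None, None, None]

print(scale_score(stress_items))  # average of the eight answered items
print(scale_score(mostly_blank))  # None: too little data to score
```

Respondents for whom the function returns no score are the ones you would then either handle by replacement or delete, as discussed above.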
Second,
assess missing data within observations: Here
you assess how much missing data exists in each row. If an observation
is missing any data at all, it is often (though not always) left out of
analyses, which we would like to avoid; usually we try
to save data. However, some observations have so much missing data
(such as people who stop answering surveys halfway through) that
they’re not worth saving. Chapter 9 discusses how to do this
assessment practically in SAS. You can then decide whether to try to save
the observation. In the following two situations, saving an observation
might not be worth the trouble:
-
Missing dependent
variables: If you are doing a study with definite
dependent variables that you are trying to explain or predict, then
I do not recommend trying to retain any observations that are missing
the data for the dependent variables. Delete them instead. Any attempt
to guess or smooth over the main dependent variables is likely to
throw off the whole analysis.
-
Large section missing: If
an observation is missing data for a lot of the final main variables,
then consider deleting it.
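The two situations above can be expressed as a simple row-level check. The following Python sketch is only an illustration (Chapter 9 covers the SAS approach): the function name, the variable names, and the 50% cutoff for "a lot" of missing main variables are all assumptions I have made for the example, and you should choose your own threshold.

```python
def assess_observation(row, dependent_vars, main_vars, max_missing_share=0.5):
    """Return a keep/drop decision for one observation (row).

    max_missing_share is an illustrative cutoff, not a recommended value.
    """
    # Situation 1: do not try to rescue a row lacking a dependent variable.
    if any(row.get(v) is None for v in dependent_vars):
        return "drop: dependent variable missing"
    # Situation 2: drop rows missing a large share of the main variables.
    n_missing = sum(row.get(v) is None for v in main_vars)
    if n_missing / len(main_vars) > max_missing_share:
        return "drop: large section missing"
    return "keep"


# Hypothetical data: None marks a missing value.
dependent_vars = ["satisfaction"]
main_vars = ["age", "trust", "stress", "income"]
rows = [
    {"id": 1, "satisfaction": 4, "age": 34, "trust": 4, "stress": 3, "income": 52},
    {"id": 2, "satisfaction": None, "age": 41, "trust": 5, "stress": 2, "income": 61},
    {"id": 3, "satisfaction": 3, "age": None, "trust": None, "stress": None, "income": 48},
    {"id": 4, "satisfaction": 5, "age": 29, "trust": None, "stress": 4, "income": 39},
]

decisions = {r["id"]: assess_observation(r, dependent_vars, main_vars) for r in rows}
for obs_id, decision in decisions.items():
    print(obs_id, decision)
```

Observation 4 is kept despite one blank, since a single missing main variable may still be dealt with at the variable level in the next step.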
Third,
assess and deal with missing data in variables:
Once you are content that each observation has most of its relevant
data and variables, start analyzing the variables (columns). Again,
Chapter 9 shows you how to do this practically in SAS. Differentiate
between those variables that stand alone as single-item variable measures
(such as a person’s age) and those that are part of bigger
multi-item scales:
-
Variables that stand alone as single-item variable
measures: Here you are concerned with the proportion
of missing data and whether it is small enough to either leave alone
(you will generally lose all observations for which it is missing)
or do something about it.
-
If the proportion of data missing in the
variable is extremely large, you may have no other
option than to delete the variable, but this depends on the importance
of the variable.
-
If the proportion missing is moderate, then
you might consider remedial action. There are various options:
-
Simple imputation:
Under some (but not all) circumstances you can replace the missing
data with a score that reflects the usual answers for that question.
For example, if respondent #5 left out a trust question, we could
replace the blank with an answer reflective of the rest of the group’s
answers on that question. For answer scales using a 1-5 or 1-7 type
scale (e.g. the Likert scale), you could replace the blanks with the
median of
each question (not the average). For answer scales with truly continuous
data (for example, percentages), you could replace the blanks with the average.
-
More complex solutions:
There are more complex missing data protocols through which you can
deal with the issue (notably multiple imputation, full information
maximum likelihood, and bootstrapping). I cannot go into much more
detail here; the interested reader should consult more specialist
texts.
If at all possible, use these.
-
If the proportion missing is very small (say,
less than 5%, although the threshold is your call), consider leaving it alone.
If you are unsure whether this is right, one way to check is to run the
analysis with each option (leaving it alone, replacement,
variable deletion, etc.) and see whether there is a big difference in results.
-
Variables that are items in a multi-item scale: Most
of the above considerations for single-item measures also apply to
these, except that you have one extra option. With multi-item scales
we can sometimes smooth over the missing items when we aggregate.
For instance, say that for a given observation you have ten survey
items for a scale and without missing data the answers are 2, 3, 3,
4, 2, 3, 2, 4, 1, 2. When we aggregate these items into a variable
score by averaging we get 2.60. Now, if the person is missing one
of the scores, the average of the other nine will range between 2.44
and 2.78 depending on which is missing. Is this an acceptable approximation?
The smaller the proportion of missing items the better; the choice
is yours as to whether you are happy to take this route.
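The arithmetic in the ten-item example can be checked with a few lines of Python. This is just a verification sketch using the item answers given above; it drops each item in turn and averages the remaining nine.

```python
items = [2, 3, 3, 4, 2, 3, 2, 4, 1, 2]

# Average with no missing data.
full_mean = sum(items) / len(items)

# Average of the other nine items when each single item is missing in turn.
leave_one_out = [(sum(items) - x) / (len(items) - 1) for x in items]

print(f"{full_mean:.2f} {min(leave_one_out):.2f} {max(leave_one_out):.2f}")
# prints "2.60 2.44 2.78"
```

The lowest value arises when one of the 4s is missing and the highest when the 1 is missing, so the approximation is worst when the missing item would have been an extreme answer.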
Chapter 9 discusses
how to analyze and deal with missing data in SAS.
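For readers who want a quick feel for the variable-level step before turning to SAS, here is a rough Python sketch of the simple imputation option described earlier. The `missing_share` and `impute` helpers and the trust answers are hypothetical. The sketch computes the proportion missing in one variable, fills the blank with the median (as suggested for 1-5 type scales), and compares the mean with and without replacement as a crude check on how much the choice matters.

```python
from statistics import mean, median


def missing_share(values):
    """Proportion of observations with no data for this variable."""
    return sum(v is None for v in values) / len(values)


def impute(values, how="median"):
    """Replace missing values with the median (or mean) of the observed ones."""
    observed = [v for v in values if v is not None]
    fill = median(observed) if how == "median" else mean(observed)
    return [fill if v is None else v for v in values]


# Hypothetical answers to a 1-5 trust question; None marks a blank.
trust = [4, 5, None, 3, 4, 2, 4, 5]

share = missing_share(trust)
imputed = impute(trust, how="median")

# Compare the two options: leave the blank alone versus replace it.
mean_left_alone = mean(v for v in trust if v is not None)
mean_imputed = mean(imputed)

print(f"Proportion missing: {share:.3f}")
print(f"Mean, blanks left alone: {mean_left_alone:.3f}")
print(f"Mean, median imputed:    {mean_imputed:.3f}")
```

For truly continuous data you would pass `how="mean"` instead. Note that this covers only simple replacement; the more complex protocols mentioned above (such as multiple imputation) are not attempted here.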