Dealing with Data Once It Has Been Captured

Post-Capturing Issues

Once you have gathered and captured raw data into your spreadsheet or database, you are ready to do further cleaning and organization of the data. The major things you might wish to deal with are (a) checking for mistakes, (b) dealing with the multi-item scales, and (c) dealing with missing data. The next sections deal with each of these issues.

Checking for Mistakes in Your Data

When doing data analysis, it is crucial that you prevent mis-entered data from corrupting the entire analysis. For example, if we manually capture data on a 1- to 5-point answer scale, it is possible to enter a “55” by mistake. If this happens, even simple averages and correlations might be badly affected by the mis-entered data point.

To check data for each column, you can at least check the minimum and maximum values in the range to ensure that there are no major errors. I like to also check the frequencies of all answers; we will see later how to do this. If you do run across a mis-entered data point, then simply correct it.
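
As a minimal sketch of this check (written in Python purely for illustration, not the SAS approach covered later), the following summarizes one column by its minimum, maximum, and answer frequencies, and flags anything outside the expected scale. The function name check_column, the 1-to-5 bounds, and the sample answers are all hypothetical.

```python
from collections import Counter

def check_column(values, low=1, high=5):
    """Summarize one column and flag values outside the expected answer scale."""
    present = [v for v in values if v is not None]  # ignore missing answers
    return {
        "min": min(present),
        "max": max(present),
        # How often each distinct answer occurs, sorted by answer value
        "frequencies": dict(sorted(Counter(present).items())),
        # (row position, value) pairs that fall outside low..high
        "out_of_range": [(i, v) for i, v in enumerate(values)
                         if v is not None and not low <= v <= high],
    }

# Hypothetical answers on a 1-5 scale, with one mis-entered "55"
answers = [3, 4, 55, 2, 5, 1, 4]
report = check_column(answers)
print(report)
```

Here the maximum of 55 and the single out-of-range entry at position 2 both point to the same mis-entered value, which you would then correct against the original record.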

Chapter 9 will tell you how to use SAS descriptive and associational statistics to check, clean and prepare data.

Dealing with Missing Data

Missing data, where a variable score has not been gathered for an observation (for example where various people have not answered a survey question), can be a serious issue in most data analysis situations.

The problem is made worse by the way most statistical programs handle missing values. If a single variable involved in the analysis is missing its data point, then most statistical programs will throw out the entire observation! Since missing data is common in many datasets, especially survey-type data, you can lose substantial amounts of data this way. Figure 4.5 illustrates this problem.

Missing data can severely affect the accuracy of statistics, so you must think about and analyze the missing data issue carefully before starting statistical analysis. Note that missing data can be a problem in a variable (if many observations do not have data for a specific variable column) or for a given observation (if a given observation row is missing data for a lot of the variables).
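
To make the scale of the loss concrete, here is a small illustrative sketch in Python (not SAS). The data, the variable names, and the helper complete_cases are hypothetical; None stands for a missing answer. It mimics the default behavior described above, where any observation with a missing value is dropped.

```python
# Hypothetical survey data: None marks a missing answer.
rows = [
    {"age": 34, "trust": 4, "stress": 2},
    {"age": None, "trust": 5, "stress": 3},
    {"age": 51, "trust": None, "stress": 4},
    {"age": 29, "trust": 3, "stress": None},
    {"age": 42, "trust": 2, "stress": 5},
]

def complete_cases(rows):
    """Keep only observations with no missing values (listwise deletion)."""
    return [r for r in rows if all(v is not None for v in r.values())]

total_cells = sum(len(r) for r in rows)
missing_cells = sum(v is None for r in rows for v in r.values())
kept = complete_cases(rows)

print(f"cells missing: {missing_cells}/{total_cells}")
print(f"rows kept: {len(kept)}/{len(rows)}")
```

In this made-up example only 3 of 15 data points are missing, yet 3 of the 5 observations are lost, because each missing value sits in a different row.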

Figure 4.5 Example of the missing data problem in data analysis

I suggest several practical steps and a process to use when performing missing data analysis.

First, always try to pre-design the study to minimize the missing data problem. Sometimes you can plan your research methodology in such a way that missing data is minimized. For instance:

Multi-item scales (where each variable is measured through multiple measurements, as discussed previously) can help. If most of the multiple items have data, then when you aggregate the items into the final total variable score, you can smooth over the missing pieces. For instance, if your Stress index is made up of 10 survey questions and someone fails to answer two of them, the average of the other eight survey items should be a reasonable approximation of the overall Stress score. This does not work for those specific individual observations where most of the data is missing for a given set of multi-item scales. (If all of the Stress items are missing, then you have nothing to work with; if most of the Stress items are missing, the combined index is possibly inaccurate.) In the case of these individuals, you could either replace missing data or delete the individuals from the analysis, if necessary. (See below for more on both of these options.)
If you use an online or computerized data collection method, it can be programmed to not allow missing data, although this can discourage participation. (Sometimes people leave out data for good reason, especially if you ask silly questions that don’t apply to everyone.)

Second, assess missing data within observations: Here you assess how much missing data exists in each row. An observation that is missing any data at all is often (though not always) left out of analyses, which we would like to avoid; usually we try to save data. However, some observations have so much missing data (such as people who stop answering a survey halfway through) that they are not worth saving. Chapter 9 discusses how to do this assessment practically in SAS. You can then decide whether to try to save the observation. In the following two situations, saving an observation might not be worth the trouble:

Missing dependent variables: If you are doing a study with definite dependent variables that you are trying to explain or predict, then I do not recommend trying to retain any observations that are missing the data for the dependent variables. Delete them instead. Any attempt to guess or smooth over the main dependent variables is likely to throw off the whole analysis.
Large section missing: If an observation is missing data for a lot of the final main variables, then consider deleting it.
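
The two situations above can be sketched as a simple screening rule. This is an illustrative Python sketch, not the SAS procedure from Chapter 9. The variable names, the sample rows, and especially the 50% cut-off (max_missing) are assumptions chosen for the example; the text does not prescribe a specific threshold.

```python
# Hypothetical variable names; "satisfaction" plays the role of the dependent variable.
VARIABLES = ["satisfaction", "trust", "age", "tenure"]

def missing_share(row, variables=VARIABLES):
    """Proportion of the listed variables that are missing (None) in one row."""
    return sum(row[v] is None for v in variables) / len(variables)

def screen_observations(rows, dependent, max_missing=0.5):
    """Split rows into (kept, dropped).

    A row is dropped if its dependent variable is missing, or if more than
    max_missing of its variables are missing (an assumed, adjustable cut-off).
    """
    kept, dropped = [], []
    for row in rows:
        if row[dependent] is None or missing_share(row) > max_missing:
            dropped.append(row)
        else:
            kept.append(row)
    return kept, dropped

rows = [
    {"id": 1, "satisfaction": 4, "trust": 3, "age": 30, "tenure": 5},
    {"id": 2, "satisfaction": None, "trust": 4, "age": 41, "tenure": 2},
    {"id": 3, "satisfaction": 2, "trust": None, "age": None, "tenure": None},
    {"id": 4, "satisfaction": 5, "trust": None, "age": 27, "tenure": 1},
]

kept, dropped = screen_observations(rows, dependent="satisfaction")
print("kept:", [r["id"] for r in kept])
print("dropped:", [r["id"] for r in dropped])
```

Observation 2 is dropped because its dependent variable is missing, and observation 3 because three of its four variables are missing. Observation 4 has only one gap and is kept for the variable-level steps that follow.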

Third, assess and deal with missing data in variables: Once you are content that each observation has most of its relevant data and variables, start analyzing the variables (columns). Again, Chapter 9 shows you how to do this practically in SAS. Differentiate between those variables that stand alone as single-item variable measures (such as a person’s age) and those that are part of bigger multi-item scales:

Variables that stand alone as single-item variable measures: Here you are concerned with the proportion of missing data and whether it is small enough to leave alone (in which case you will generally lose all observations for which it is missing) or large enough that you need to do something about it.
1. If the proportion of data missing in the variable is extremely large you may have no other option than to delete the variable, but this depends on the importance of the variable.
2. If the proportion missing is moderate then you might consider remedial action. There are various options:
  
  Simple imputation: Under some (but not all) circumstances you can replace the missing data with a score that reflects the usual answers for that question. For example, if respondent #5 left out a trust question, you could replace the blank with an answer reflective of the rest of the group’s answers on that question. For answer scales using a 1-5 or 1-7 type scale (e.g. the Likert scale), you could replace the blanks with the median of each question (not the average). For answer scales with truly continuous data, such as percentages, you could replace the blanks with the average.
  
  More complex solutions: There are more complex missing data protocols for dealing with the issue (notably multiple imputation, full information maximum likelihood, and bootstrapping). If at all possible, use one of these methods. I cannot go into much more detail here; the interested reader should consult more specialist texts[1].
3. If the proportion missing is very small (say less than 5%, but you can decide), consider leaving it alone. How can you be sure that this is the right choice? One way to check is to run the analysis with each option (leaving it alone, replacement, variable deletion, etc.) and see if there is a big difference in results.
Variables that are items in a multi-item scale: Most of the above considerations for single-item measures also apply to these, except that you have one extra option. With multi-item scales we can sometimes smooth over the missing items when we aggregate. For instance, say that for a given observation you have ten survey items for a scale and without missing data the answers are 2, 3, 3, 4, 2, 3, 2, 4, 1, 2. When we aggregate these items into a variable score by averaging we get 2.60. Now, if the person is missing one of the scores, the average of the other nine will range between 2.44 and 2.78 depending on which is missing. Is this an acceptable approximation? The smaller the proportion of missing items the better; the choice is yours as to whether you are happy to take this route.
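
Two of the options above lend themselves to a short worked sketch, given here in Python for illustration only: median replacement for a single-item variable, and the effect of one missing item on an averaged multi-item scale. The function impute_median and the trust answers are hypothetical; the ten item scores are the ones from the example above.

```python
from statistics import mean, median

def impute_median(values):
    """Replace missing (None) answers with the median of the observed answers."""
    fill = median([v for v in values if v is not None])
    return [fill if v is None else v for v in values]

# Hypothetical answers to one 1-5 trust question; the third respondent left it blank.
trust = [4, 5, None, 3, 4, 2, 5, 4]
print(impute_median(trust))

# The ten item scores from the multi-item example in the text.
items = [2, 3, 3, 4, 2, 3, 2, 4, 1, 2]
full_mean = mean(items)

# Average of the remaining nine items when each item in turn is missing.
leave_one_out = [mean(items[:i] + items[i + 1:]) for i in range(len(items))]

print(round(full_mean, 2))           # average with no missing items
print(round(min(leave_one_out), 2))  # lowest nine-item average (a 4 is missing)
print(round(max(leave_one_out), 2))  # highest nine-item average (the 1 is missing)
```

The last three lines reproduce the figures quoted above: 2.6 with complete data, and between 2.44 and 2.78 when one item is missing.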

Chapter 9 discusses how to analyze and deal with missing data in SAS.

Dealing with Multi-Item Scales

The Multi-Item Scale Issue

In the previous section on research methodology around gathering data, I highlighted the usefulness in many situations of using multi-item scales (multiple measurements of a single construct). Once you have gathered this raw data, you need to undertake various analyses of the multi-item scales.

The major question when you have a multi-item scale is “do these different variables actually form part of the same thing – in which case I can potentially aggregate them into a summary variable – or not?”

Figure 4.6 shows a pictorial version of this process, where some initial Satisfaction items and some initial Trust items may or may not be eligible to be gathered together into aggregated, summary variables.

Figure 4.6 Pictorial summary of collapsing multi-item scales into aggregate scores

Short Summary of the Main Tasks in Preparing Multi-Item Scales

Given the above discussion, there are two main steps when you have multi-item scales.

Establish whether the set of variables in a multi-item scale really does belong together. Just because you have a set of items that you think indicate one underlying idea, such as Trust, does not mean that the data agree. As a technical issue, you may first have to deal with a situation where some of your items measure trust and some measure distrust; we call the latter reverse items. More importantly, there are statistical tests to establish whether or not the final set of trust sub-variables seems to form a consistent common measure. Chapter 9 deals with such tests.
If the items are eligible to group together, then aggregate them. There are different ways to aggregate sub-items into one main item. You might average the sub-items, sum them (if no missing data exists), or use various, more complex techniques. Your final step is to choose your aggregation method.
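
As an illustrative sketch of the reverse-item and aggregation steps (in Python, not the SAS steps of Chapter 9), the following flips reverse-worded items and then averages or sums the sub-items. The item names, the 1-to-5 scale, and both function names are hypothetical. The statistical tests of whether the items belong together are not shown here.

```python
from statistics import mean

def reverse_code(score, low=1, high=5):
    """Flip a reverse-worded item so that a high score means high trust."""
    return low + high - score

def aggregate(items, reverse=(), how="mean"):
    """Combine sub-item scores into one scale score.

    items:   dict of item name -> score (None if missing)
    reverse: names of reverse-worded items to flip before aggregating
    how:     "mean" averages the available items; "sum" requires complete data
    """
    scores = [reverse_code(s) if name in reverse else s
              for name, s in items.items() if s is not None]
    if how == "sum":
        if len(scores) < len(items):
            raise ValueError("sum aggregation needs complete data")
        return sum(scores)
    return mean(scores)

# Hypothetical respondent: three trust items and one distrust (reverse) item.
respondent = {"trust1": 4, "trust2": 5, "distrust1": 2, "trust3": 4}

print(aggregate(respondent, reverse={"distrust1"}))             # average of 4, 5, 4, 4
print(aggregate(respondent, reverse={"distrust1"}, how="sum"))  # sum of 4, 5, 4, 4
```

Averaging tolerates a missing item because it divides by the number of items actually answered, whereas a sum would be artificially low, which is why the sum is refused here when any item is missing.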

Chapter 9 deals with the basic practical SAS steps for dealing with multi-item scales.