Chapter 4

Describing Your Research Design: Studies With and Without Experimental Manipulations

As I mentioned in Chapter 1, Table A1.1 should be filled out whenever the research project involves the collection of new data. This table will get you through the common material that should be contained in all research reports, regardless of the research design and statistical analyses used. But different research designs require that different aspects of the design be reported. This chapter describes these standards for research designs that do and do not involve a manipulation of experimental conditions.

Table A4.1 describes reporting standards for all research designs involving purposive or experimental manipulations. Table A4.2a provides separate items for reporting a study that used random assignment of subjects to conditions, and Table A4.2b provides items for reporting a study that used a purposive manipulation but assigned subjects to conditions using a procedure other than random assignment. You would use either Table A4.2a or Table A4.2b, but not both. Table A6.4 presents additional standards for reporting clinical trials. It also highlights a few extra reporting standards for the abstract and the introduction; these are included primarily to emphasize their particular importance in the case of clinical trials.

Studies With Experimental Manipulations or Interventions

The Journal Article Reporting Standards for quantitative studies (JARS; Appelbaum et al., 2018) recommends that research reports contain detailed descriptions of all of the different conditions that subjects were exposed to in studies with experimental manipulations (Table A4.1). To illustrate the different items in JARS, I use primarily an article by Adank, Evans, Stuart-Smith, and Scott (2009) titled “Comprehension of Familiar and Unfamiliar Native Accents Under Adverse Listening Conditions,” published in the Journal of Experimental Psychology: Human Perception and Performance. The researchers created two groups of listeners. One group of listeners was drawn from the residents of London, England. They heard taped sentences spoken in standard English (SE) or Glaswegian English (GE) and were assumed to be familiar with the former but unfamiliar with the latter. The second group of listeners was drawn from Glasgow, Scotland, and listened to the same tapes but were assumed to be familiar with both accents (because of media exposure). Also, noise at three volumes was added to the tapes to create four conditions of noise (one being a no-noise condition). Therefore, there were eight experimental conditions. In their overview of their design, Adank et al. also informed readers that

Ninety-six true/false sentences were presented to the participants, that is, 12 sentences per experimental condition. The sentences were counterbalanced across conditions, so that all 96 sentences were presented in all conditions across all subjects within a listener group. All listeners heard sentences in all four conditions. This was repeated for the SE and GE listener groups, thus ensuring that all sentences were presented in all conditions for both groups. Furthermore, not more than two sentences of one speaker were presented in succession, as Clarke and Garrett (2004) found that familiarization occurs after as few as two sentences from a speaker. (p. 521)

Thus, Adank et al. used a two-factor (2 × 4) within-subjects experimental design with an additional individual difference—whether the subject spoke SE or GE—serving as a between-subjects factor. The study focused on perceptual processes and was conducted in a laboratory.

Experimental Manipulation

If your study involved an experimental manipulation, your report should describe

  • the details of the experimental manipulations intended for each study condition, including the comparison conditions, and
  • how and when manipulations were actually administered, including
    • the content of the specific experimental manipulations (a summary or paraphrasing of instructions is OK unless they are unusual, in which case they may be presented verbatim) and
    • the method of experimental manipulation delivery, including a description of apparatus and materials used, such as specialized equipment by model and supplier, and their function in the experiment.

The methods and materials Adank et al. (2009) used to deliver the spoken sentences began with the creation of the stimulus materials themselves. The authors gave an impressive amount of detail about how the manipulation was delivered. First, they described how the tapes were made to ensure that only the speakers’ accents varied across the spoken sentences:

For every speaker, recordings were made of the 100 sentences of Version A of the SCOLP [Speed and Capacity of Language Processing] test (Baddeley et al., 1992). The sentence was presented on the screen of a notebook computer, and speakers were instructed to quietly read the sentence and subsequently to pronounce the sentence as a declarative statement. All sentences were recorded once. However, if the speaker made a mistake, the interviewer went back two sentences, and the speaker was instructed to repeat both. . . .
The GE speakers were recorded in a sound-treated room, using an AKG SE300B microphone (AKG Acoustics, Vienna, Austria), which was attached to an AKG N6-6E preamplifier, on a Tascam DA-P1 DAT recorder (Tascam Div., TEAC Corp., Tokyo, Japan). Each stimulus was transferred directly to hard disk using a Kay Elemetrics DSP sonagraph (Kay Elemetrics, Lincoln Park, NJ). . . . The recordings of the SE speakers were made in an anechoic room, using a Brüel and Kjær 2231 sound level meter (Brüel and Kjær Sound & Vibration Measurement, Nærum, Denmark) as a microphone/amplifier. This microphone was fitted with a 4165 microphone cartridge and its A/C output was fed to the line input of a Sony 60ES DAT recorder (Sony Corp., Tokyo) and the digital output from the DAT recorder fed to the digital input of a Delta 66 sound card (M-Audio UK, Watford, UK) in the Dell Optiplex GX280 personal computer (Dell Corp., Fort Lauderdale, FL). . . . The difference in recording conditions between the two speaker groups was not noticeable in the recordings, and it is thus unlikely that intelligibility of the two accents was affected.
Next, all sentences were saved into their own file with beginning and end trimmed at zero crossings (trimming on or as closely as possible to the onset and offset of initial and final speech sounds) and resampled at 22,050 Hz. Subsequently, the speech rate differences across all eight speakers were equalized, so that every sentence had the same length across all eight speakers. This was necessary to ensure straightforward interpretation of the dependent variable (i.e., to be able to express the results in milliseconds). First, for each of the 96 sentences, the average duration across all speakers was calculated. Second, we used the Pitch Synchronous Overlap Add Method, or PSOLA (Moulines & Charpentier, 1990), as implemented in the Praat software package (Boersma & Weenink, 2003), to digitally shorten or lengthen the sentence for each speaker separately. The effect of the shortening or lengthening was in some cases just audible, but it was expected that any effects due to this manipulation were small to negligible, as the manipulations were relatively small and were carried out across all sentences for all speakers in the experiment. (p. 522)

This level of detail is called for in the Publication Manual of the American Psychological Association (APA, 2010):

If a mechanical apparatus was used to present stimulus materials or collect data, include in the description of procedures the apparatus model number and manufacturer (when important, as in neuroimaging studies), its key settings or parameters (e.g., pulse settings), and its resolution (e.g., regarding stimulus delivery, recording precision). (p. 31)

The level of detail also suggests that this area of research probably deals with subtle and sensitive effects (note the attention to the speed of speech), which requires experimenters to exert a great deal of control over what stimulus materials are presented and how. Note that Adank et al. also mentioned the locations (London and Glasgow) at which the recordings for the experiment took place.

Adank et al. (2009) then went on to describe how they accomplished the noise manipulation:

Finally, speech-shaped noise was added. . . . This speech-shaped noise was based on an approximation to the long-term average speech spectrum for combined male and female voices. . . . The root mean square levels per one third of the octave band were converted into spectrum level and plotted on an octave scale. A three-line approximation was used to capture the major part of the shape from 60 Hz to 9 kHz. This consisted of a low-frequency portion rolling off below 120 Hz at 17.5 dB/octave and a high frequency portion rolling off at 7.2 dB/octave above 420 Hz, with a constant spectrum portion in between. Per sentence, the noise sound file was cut at a random position from a longer (6-s) segment of speech-shaped noise, so that the noise varied randomly across sentences. The speech-shaped noise had the same duration as the sentence and started and ended with the onset and offset of the sentence. The root mean squares of the sentence and the noise were determined and scaled as to fit the SNR [signal-to-noise] level and finally were combined through addition. Finally, using Praat, we peak normalized and scaled the intensity of the sound file to 70 dB sound pressure level (SPL). (p. 522)
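To make the noise-mixing arithmetic concrete, here is a minimal sketch of how a noise segment might be scaled to a target signal-to-noise ratio using root-mean-square levels and then added to a sentence recording. This is not Adank et al.’s code; the function name, array contents, and SNR value are hypothetical, and the peak-normalization step is omitted.

```python
import numpy as np

def add_noise_at_snr(sentence: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested SNR (in dB), then add it.

    Assumes `sentence` and `noise` are 1-D float arrays of equal length.
    """
    rms_signal = np.sqrt(np.mean(sentence ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2))
    # SNR(dB) = 20 * log10(rms_signal / rms_noise_target), solved for the target noise RMS.
    rms_noise_target = rms_signal / (10 ** (snr_db / 20))
    return sentence + noise * (rms_noise_target / rms_noise)

# Hypothetical usage: mix a 2-s "sentence" with stand-in noise at an SNR of +2 dB.
rng = np.random.default_rng(0)
fs = 22_050                                   # sampling rate reported by Adank et al.
sentence = rng.normal(0.0, 0.1, size=2 * fs)  # stand-in for a recorded sentence
noise = rng.normal(0.0, 0.1, size=2 * fs)     # stand-in for speech-shaped noise
mixture = add_noise_at_snr(sentence, noise, snr_db=2.0)
```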

Deliverers

If your study involved an experimental manipulation, describe

  • the deliverers, including
    • who delivered the manipulations,
    • their level of professional training, and
    • their level of training in the specific experimental manipulations;
  • the number of deliverers; and
  • the mean, standard deviation, and range of number of individuals or units treated by each deliverer.

For Adank et al. (2009), the deliverers of the manipulation might be considered the people whose voices were recorded:

Recordings were made of four SE speakers and four GE speakers. All speakers were male, middle class, and between 20 and 46 years old. Only male speakers were selected because including both genders would have introduced unwanted variation related to the gender differences in larynx size and vocal tract length. . . . The GE speakers were recorded in Glasgow, and the SE speakers were recorded in London. (p. 522)

Taking an extra precaution, these researchers also described the persons who helped the speakers create the tapes:

Because our investigative team speaks in Southern English accents, we arranged for the GE recordings to be conducted by a native GE interviewer to avoid the possibility of speech accommodation toward Southern English (Trudgill, 1986). . . . The SE recordings were conducted by a native SE interviewer. (p. 522)

The level of detail these authors provide about how their experimental conditions were delivered is also impressive. What might seem like trivial detail to an outsider can be critical to enabling someone immersed in this field to draw strong causal inferences from a study. How will you know what these details are? By reading the detailed reports of those who have done work in your area before you.

Setting, Exposure, and Time Span

If your study involved an experimental manipulation, describe

  • the setting where the experimental manipulations occurred and
  • the exposure quantity and duration, including
    • how many sessions, episodes, or events were intended to be delivered;
    • how long they were intended to last; and
    • the time span, or how long it took to deliver the manipulation to each unit.

Adank et al.’s (2009) description of the experimental procedures focused on the physical characteristics of the experimental room, the mode of the manipulation delivery, and what subjects had to do to respond:

Procedure. The SE listeners were tested in London, and the GE listeners were tested in Glasgow. All listeners were tested individually in a quiet room while facing the screen of a notebook computer. They received written instructions. The listeners responded using the notebook’s keyboard. Half of the participants were instructed to press the q key with their left index finger for true responses and to press the p key with their right index finger for false responses. The response keys were reversed (i.e., p for true and q for false) for the other half of the participants. Listeners were not screened for handedness. The stimuli were presented over headphones (Philips SBC HN110; Philips Electronics, Eindhoven, the Netherlands) at a sound level that was kept constant for all participants. Stimulus presentation and the collection of the responses were performed with Cogent 2000 software (Cogent 2000 Team, Wellcome Trust, London, UK), running under Matlab (Mathworks, Cambridge, UK). The response times were measured relative to the end of the audio file, following the computerized SCOLP task in May et al. (2001).
Each trial proceeded as follows. First, the stimulus sentence was presented. Second, the program waited for 3.5 s before playing the next stimulus, allowing the participant to make a response. If the participant did not respond within 3.5 s, the trial was recorded as no response. The participants were asked to respond as quickly as they could and told that they did not have to wait until the sentence had finished (allowing for negative response times, as response time was calculated from the offset of the sound file).
Ten familiarization trials were presented prior to the start of the experiment. The familiarization sentences had been produced by a male SE speaker. This speaker was not included in the actual experiment, and neither were the 10 familiarization sentences. The experiment’s duration was 15 min, without breaks. (pp. 522–523)

Activities to Increase Compliance

If your study involved an experimental manipulation, describe in detail the activities you engaged in to increase compliance or adherence (e.g., incentives).

Adank et al. (2009) did little to increase compliance, but in this type of study it is likely that little was needed. They told readers that the subjects in their study were recruited from London and Glasgow and that “all were paid for their participation” (p. 522). We can consider the payments as a method to increase compliance.

Use of Language Other Than English

If your study involved an experimental manipulation, describe the use of a language other than English and the translation method.

In Adank et al.’s (2009) study, the use of a language other than SE was one of the experimental manipulations. JARS recommends (see Table A4.1) that the research report indicate whether the language used to conduct the study was other than English. According to the Publication Manual,

When an instrument is translated into a language other than the language in which it was developed, describe the specific method of translation (e.g., back-translation, in which a text is translated into another language and then back into the first to ensure that it is equivalent enough that results can be compared). (APA, 2010, p. 32)

Replication

If your study involved an experimental manipulation, include sufficient detail to allow for replication, including

  • reference to or a copy of the manual of procedures (i.e., whether the manual of procedures is available and, if so, how others may obtain it).

The design descriptions presented above provide excellent examples of the details that should be reported so that others can repeat the studies with high fidelity. Adank et al.’s (2009) report went to great lengths to describe the manipulations, equipment, deliverers, and setting of their study.

Unit of Delivery and Analysis

If your study involved an experimental manipulation, describe the unit of delivery—that is, how participants were grouped during delivery—including

  • a description of the smallest unit that was analyzed (in the case of experiments, that was randomly assigned to conditions) to assess manipulation effects (e.g., individuals, work groups, classes) and
  • if the unit of analysis differed from the unit of delivery, a description of the analytical method used to account for this (e.g., adjusting the standard error estimates by the design effect, using multilevel analysis).

JARS calls for a description of how subjects were grouped during delivery of the experimental manipulation. Was the manipulation administered individual by individual, to small or large groups, or to intact, naturally occurring groups? It also recommends that you describe the smallest unit you used in your data analysis.

The reason for recommending that this information be reported goes beyond the reader’s need to know the method of manipulation delivery and analysis. In most instances, the unit of delivery and the unit of analysis need to be the same for the assumptions of statistical tests to be met. Specifically, at issue is the assumption that each data point in the analysis is independent of the other data points. When subjects are tested in groups, it is possible that the individuals in each group affect one another in ways that would influence their scores on the dependent variable.

Suppose you conducted your fragrance labeling study by having 10 groups of five subjects each smell the fragrance when it was labeled roses and another 10 groups of five subjects each smell it when it was labeled manure. Before you had subjects rate the fragrances, you allowed them to engage in a group discussion. The discussions might have gone very differently in the different groups. The first person to speak up in one of the manure groups might have framed the discussion for the entire group by stating, “I would never buy a fragrance called manure!” whereas in another group, the first speaker might have said, “Manure! What a bold marketing strategy!” Such comments could lower and raise, respectively, all group members’ ratings. Therefore, the ratings within groups are no longer independent. By knowing the subjects’ groups, you know something about whether their ratings will be lower or higher than those of other subjects not in their group, even within the manure and roses conditions.

This example, of course, is unlikely to happen (I hope); clearly, subjects should complete this type of labeling study individually, unless the research question deals with the influence of groups on labeling (in which case the framing comments likely would also be experimentally manipulated). However, there are many instances in which treatments are applied to groups formed purposively (e.g., group therapy) or naturalistically (e.g., classrooms). Still, in many of these instances, there will be a temptation to conduct the data analysis with the individual participant as the unit of analysis because either (a) the analysis is simpler or (b) using the participant as the unit of analysis increases the power of the test by increasing the available degrees of freedom. In your labeling experiment run in groups, there are 100 subjects but only 20 groups.

Some data analysis techniques take into account the potential dependence of scores within groups (Luke, 2004). If you used such an analytic method, describe it. A full treatment of these techniques is beyond the scope of this discussion; for now, it is important only that you understand why JARS asks for this information.

A study that addressed the issue of units of delivery and analysis was conducted by Vadasy and Sanders (2008), who examined the effects of a reading intervention on second and third graders’ reading fluency. The intervention was delivered in groups of two students. Here is how Vadasy and Sanders described their method of composing the units of delivery:

Group assignment was a two-stage process. First, eligible students were randomly assigned to dyads (pairs of students) within grade and school using a random number generator. Although it would have been preferable to randomly assign students to dyads within classrooms in order to minimize extreme mismatches between students within a dyad, there were too few eligible students within classrooms to make this practical. Nevertheless, there were sufficient numbers of eligible students for random assignment of dyads within grades (within schools). For schools with uneven numbers of students within grades, we used random selection to identify singletons that were subsequently excluded from participation. Dyads were then randomly assigned to one of two conditions: treatment (supplemental fluency tutoring instruction) or control (no tutoring; classroom instruction only). (p. 274)

These researchers went on to describe their unit of analysis and the rationale for their choice of data analysis strategy by simply stating,

Due to nesting structures present in our research design, a multilevel (hierarchical) modeling approach was adopted for analyzing group differences on pretests and gains. (Vadasy & Sanders, 2008, p. 280)
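For readers who want to see what such an analysis can look like in practice, here is a minimal sketch using a mixed-effects (multilevel) model in Python’s statsmodels package. The data file and variable names are hypothetical, and this is only one of several ways to account for the dependence of scores within units of delivery.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: one row per student, with the dyad (the unit of delivery)
# identified so that the dependence of the two students within a dyad can be modeled.
df = pd.read_csv("tutoring_outcomes.csv")  # columns: student_id, dyad_id, condition, gain

# A random intercept for dyad acknowledges that students who shared a dyad
# are not independent observations.
model = smf.mixedlm("gain ~ condition", data=df, groups=df["dyad_id"])
result = model.fit()
print(result.summary())
```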

Studies Using Random Assignment

If your study used random assignment to place subjects into conditions, your report should describe

  • the unit of randomization and
  • the procedure used to generate the random assignment sequence, including details of any restriction (e.g., blocking, stratification).

Knowing whether subjects have been assigned to conditions in a study at random or through some self-choice or administrative procedure is critical to readers’ ability to understand the study’s potential for revealing causal relationships. Random assignment makes it unlikely that there will be any differences between the experimental groups before the manipulation occurs (Christensen, 2012; see Table A4.2a). If subjects are placed into experimental groups using random assignment, then you can assume the groups do not differ on characteristics that might be related to the dependent variable.1

In the case of self- or administrative selection into groups, it is quite plausible that the groups will differ on many dimensions before the manipulation is introduced, and some of these dimensions are likely related to the dependent variable. This possibility compromises the ability of studies using self-selection or administrative assignment to support strong causal inferences about the manipulation. For example, if subjects in the labeling study were first asked whether they would like to smell a fragrance called “Roses” or one called “Manure” and then were assigned to their preference, there is simply no way to determine whether their subsequent ratings of the fragrances were caused by a labeling effect or by individual differences subjects brought with them to the study (perhaps more subjects in the manure condition grew up on a farm?). For this reason, studies using random assignment are called experiments or randomized field trials, whereas those using nonrandom assignment are called quasi-experiments.

Types of random assignment. It is essential that your report tell readers what procedure you used to assign subjects to conditions. When random assignment is used in a laboratory context, it is often OK to say only that random assignment happened. However, you can add how the random assignment sequence was generated—for example, by using a random numbers table, an urn randomization method, or statistical software.

Your study may also have involved a more sophisticated random assignment technique. For example, when the number of subjects in an experiment is small, it is often good practice to first create blocks of subjects, each containing as many subjects as there are conditions in the study, and then to randomly assign one member of each block to each condition. This technique, called blocking, ensures that you end up with equal numbers of subjects in each condition.

Sometimes stratified random assignment is used to ensure that groups are equal across important strata of subjects—for example, using the blocking strategy to assign each pair of women and each pair of men to the two labeling conditions. In this way, you can ensure that the roses and manure conditions have equal numbers of subjects of each sex. This is important if you think fragrance ratings will differ between sexes. If your study uses such a technique, say so in your report.
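Here is a minimal sketch of how blocked and stratified random assignment might be generated in software, using the two-condition labeling example. The block size, condition labels, sample sizes, and seeds are illustrative only.

```python
import random

CONDITIONS = ["roses", "manure"]

def blocked_assignment(n_subjects: int, seed: int) -> list[str]:
    """Build the assignment sequence in blocks the size of the number of
    conditions, shuffling within each block so group sizes stay equal."""
    rng = random.Random(seed)
    sequence: list[str] = []
    while len(sequence) < n_subjects:
        block = CONDITIONS.copy()
        rng.shuffle(block)
        sequence.extend(block)
    return sequence[:n_subjects]

# Stratified version: run the blocked procedure separately within each stratum
# (here, sex), so each condition receives equal numbers of women and men.
assignments = {
    "women": blocked_assignment(20, seed=1),
    "men": blocked_assignment(20, seed=2),
}
print(assignments["women"][:4])  # e.g., ['manure', 'roses', 'roses', 'manure']
```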

Random assignment implementation and method of concealment.

If your study used random assignment, state whether and how the sequence was concealed until experimental manipulations were assigned. Also state

  • who generated the assignment sequence,
  • who enrolled participants, and
  • who assigned participants to groups.

The next JARS recommendation concerns concealment of the allocation process: It asks whether the person enrolling people in a study knew which condition was coming up next. In most psychology laboratory experiments, this is of little importance because enrollment and assignment are handled by computer.

Studies Using Nonrandom Assignment: Quasi-Experiments

If your study used a method other than random assignment to place subjects into conditions, describe

  • the unit of assignment (i.e., the unit being assigned to study conditions; e.g., individual, group, community);
  • the method used to assign units to study conditions, including details of any restriction (e.g., blocking, stratification, minimization); and
  • the procedures employed to help minimize selection bias (e.g., matching, propensity score matching).

When a nonrandom process is used to assign subjects to conditions, some other technique typically is used to make the groups as equivalent as possible (Table A4.2b; see Shadish, Cook, & Campbell, 2002, for how this is done). Frequently used techniques include (a) matching people across the groups on important dimensions and removing the data of subjects who do not have a good match and (b) post hoc statistical control of variables believed related to the outcome measure using an analysis of covariance or multiple regression.
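To illustrate one equating technique, here is a minimal sketch of propensity score matching: a logistic regression estimates each person’s propensity to be in the treated group, and each treated case is then matched to the untreated case with the nearest score. The data file and variable names are hypothetical, and real applications require additional diagnostics (e.g., checking covariate balance after matching).

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical data: one row per participant, a 0/1 indicator for the treated
# condition, numerically coded covariates, and the outcome measure.
df = pd.read_csv("participants.csv")  # columns: treated, age, sex, tenure, outcome
covariates = ["age", "sex", "tenure"]

# Step 1: estimate each participant's propensity to be in the treated group.
ps_model = LogisticRegression(max_iter=1000).fit(df[covariates], df["treated"])
df["pscore"] = ps_model.predict_proba(df[covariates])[:, 1]

# Step 2: nearest-neighbor matching without replacement on the propensity score.
treated = df[df["treated"] == 1]
controls = df[df["treated"] == 0].copy()
matched_indices = []
for _, row in treated.iterrows():
    best = (controls["pscore"] - row["pscore"]).abs().idxmin()
    matched_indices.append(best)
    controls = controls.drop(index=best)

matched_controls = df.loc[matched_indices]
print("Matched mean difference on the outcome:",
      treated["outcome"].mean() - matched_controls["outcome"].mean())
```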

Evers, Brouwers, and Tomic (2006) provided an example of reporting a quasi-experiment in their article “A Quasi-Experimental Study on Management Coaching Effectiveness,” published in Consulting Psychology Journal: Practice and Research. In this study, the experimental group was composed of managers who had signed up (self-selected) to be coached. Evers et al. were then faced with constructing a no-coaching comparison group. Here is how they described the process:

The Experimental Group
We asked staff managers of various departments which managers were about to be coached. . . . We got the names of 41 managers who were about to register for coaching; 30 managers agreed to participate in our quasi-experiment, 19 men (63.3%) and 11 women (36.7%). Their ages ranged between 27 and 53 years (M = 38.8; SD = 8.20). The mean number of years as a manager was 5.34 years (SD = 5.66), and the mean number of years in the present position was 1.76 years (SD = 2.52).
The Control Group
We asked 77 managers of the Department of Housing and Urban Development to fill out our questionnaires: Of this group, 22 did not respond, whereas 48 of them answered the questionnaires both at the beginning and at the end of our quasi-experiment. We matched the groups with the help of salary scales, for these scales of the federal government are indicative of the weight of a position. We also matched the groups as much as possible according to sex and age, which ultimately resulted in a control group of 30 managers, of which 20 (66.7%) were men and 10 (33.3%) were women. Their ages ranged from 26 to 56 years, with a mean of 43.6 years (SD = 8.31), which means that the mean age of the control group is 4.8 years older. The mean number of years as a manager was 8.64 years (SD = 7.55), which means that they worked 3.3 years longer in this position than members of the experimental group did. The members of the control group had been working in the present position for a mean of 2.76 years (SD = 2.39), which is 1 year more than the members of the experimental group.
The experimental and the control group . . . were equal on sex, χ2(1) = 0.14, p = .71; the number of years as a manager, t(58) = 1.91, p = .06; and the total number of years in the present position, t(58) = 1.58, p = .12; but not on age, t(58) = 2.25, p = .03. (pp. 176–177)

First, as recommended in JARS, it is clear from Evers et al.’s (2006) description that the individual manager was the unit of assignment. The first sentence tells readers that subjects had not been randomly assigned to the coaching condition; the method of assignment was self-choice or administrative selection. We do not know (and I suspect the researchers did not either) whether subjects requested coaching or were nominated for it (or both), but the managers in the coaching group were clearly people who might have been different from all managers in numerous ways: They might have needed coaching because of poor job performance, or they might have been good managers who were highly motivated to become even better. The group of not-coached managers who were invited to take part was chosen because they were similar in salary.

The researchers stated that 77 managers were asked to be in the control group, of which 48 provided the necessary data. These numbers relate more to our assessment of the representativeness of control group managers than to issues related to condition assignment. Then, within the responding invitees, further matching was done on the basis of the managers’ sex and age. This resulted in a control group of 30, comprising the best matches for the managers in the coached group.

Finally, Evers et al. (2006) informed readers that their matching procedure produced groups that were equivalent in sex and experience but not in age. The researchers found that coached managers scored higher on some outcome expectancies and self-efficacy beliefs. Because the coached and not-coached groups also differed in age, is it possible that the coached managers scored higher on these two variables simply because these self-beliefs grow more positive as a function of growing older?

Methods to Report for Studies Using Random and Nonrandom Assignment

Masking

Regardless of whether random assignment or another method was used to place subjects into experimental conditions, your Method section should

  • report whether participants, those administering experimental manipulations, and those assessing the outcomes were unaware of condition assignments, and
  • provide a statement regarding how masking was accomplished and how the success of masking was evaluated.

If you did not use masking, this should be stated as well.

As I mentioned in Chapter 3, masking refers to steps to ensure that subjects, those administering the interventions or experimental manipulations, and those assessing the outcomes are unaware of which condition each subject is in. Masking is most likely to take place in circumstances in which knowledge of the condition could itself influence the behavior of the subject or the person interacting with him or her. In the fragrance labeling experiment, for example, subjects should not know that there is also another condition in which people are reacting to a fragrance called something else. Otherwise, they might react to their assignment (“Why did I have to smell manure?”) rather than the fragrance itself.

The person interacting with the subject also should not know what the label is for this subject; this knowledge might influence how the person behaves, either wittingly or unwittingly (e.g., smiling more at subjects in the roses condition). Also, if you had someone rate, say, the facial expression of subjects after sniffing the fragrance, you would want them to be unaware of the subjects’ condition to guard against a possible bias in their ratings.

Masking used to be referred to as blinding, and it still is not unusual to see reference to single-blind and double-blind experiments. The first refers to a study in which subjects do not know which condition they are in, and the second refers to a study in which neither the subject nor the experimenter knows the condition.2 For example, Amir et al.’s (2009) article “Attention Training in Individuals With Generalized Social Phobia: A Randomized Controlled Trial” was discussed earlier for its power analysis. In this study, all parties who could introduce bias were kept unaware of condition assignment. Clinicians rated some outcome measures and did not know whether the subjects received the experimental treatment or placebo:

Measures
We used a battery of clinician- and self-rated measures at pre- and postassessment. Clinician ratings were made by raters blind to treatment condition. (Amir et al., 2009, p. 964)

Also, subjects and experimenters were kept unaware of the subjects’ condition:

Procedure
Prior to each experimental session, participants entered the number in their file into the computer. . . . Participants did not know which condition the numbers represented, and the research assistants working with the participants could not see the number in the envelope. Thus, participants, experimenters, and interviewers remained blind to a participant’s condition until all posttreatment assessments had been conducted. (Amir et al., 2009, p. 965)

Further, the researchers assessed whether their procedure to keep subjects unaware of their condition was successful:

To assess whether participants remained blind to their respective experimental condition, we asked participants at postassessment whether they thought they had received the active versus placebo intervention. Of the participants who provided responses, 21% (4/19) of participants in the AMP [attention modification program] and 28% (5/18) of ACC [attention control condition] participants thought they had received the active treatment, χ2(1, N = 37) = 0.23, p = .63. These findings bolster confidence that participants did not systematically predict their respective experimental condition. (Amir et al., 2009, p. 965)
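A manipulation check like the one Amir et al. reported can be computed directly from the guess counts. The following sketch uses scipy; with correction=False it should closely reproduce the uncorrected Pearson chi-square they reported.

```python
from scipy.stats import chi2_contingency

# Rows: AMP vs. ACC; columns: guessed "active" vs. guessed "placebo".
counts = [[4, 15],   # AMP: 4 of 19 participants guessed active
          [5, 13]]   # ACC: 5 of 18 participants guessed active
chi2, p, dof, expected = chi2_contingency(counts, correction=False)
print(f"chi2({dof}, N = 37) = {chi2:.2f}, p = {p:.2f}")  # approximately 0.23 and .63
```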

Amir et al. (2009) also pointed out in their discussion two limitations of the masking procedures they used:

Follow-up data should be interpreted with caution because assessors and participants were no longer blind to participant condition. . . . (p. 969)
We did not collect data regarding interviewers’ or experimental assistants’ guesses regarding participants’ group assignment and therefore cannot definitively establish that interviewers and assistants remained blind to participant condition in all cases. (p. 971)

From these descriptions, it is easy for the reader to assess whether knowledge of the treatment condition might have affected the study’s outcomes (unlikely, I suspect) and for replicators to know what they need to do to reproduce or improve on what Amir et al. did.

Statistical Analyses

Regardless of whether random assignment or another method was used to place subjects into experimental conditions, your Method section should describe the statistical methods used

  • to compare groups on primary outcomes;
  • for additional analyses, such as subgroup comparisons and adjusted analysis (e.g., methods for modeling pretest differences and adjusting for them when random assignment was not used); and
  • for mediation or moderation analyses, if conducted.

Research reports should include a subsection titled “Statistical Analysis” or “Analysis Strategy” or some variation thereof. Often this subsection concludes the Method section. Other times, the description of each analysis strategy appears in the Results section just before the results of that analysis are described. Regardless of where it appears, you should lay out the broad strategy you used to analyze your data and give the rationale for your choices.

As an example of how the analysis strategy might be described in a study using nonexperimental data, here is how Tsaousides et al. (2009) described their two-step analytic approach in their study of efficacy belief and brain injury:

Data Analysis
First, correlations were conducted to explore the relationship among demographic, injury related, employment, predictor (PEM [Perceived Employability subscale] and IND [Independence subscale of the Bigelow Quality of Life Questionnaire]), and outcome variables (PQoL [perceived quality of life] and UIN [unmet important needs]). Subsequently, two hierarchical regression analyses were conducted, to determine the relevant contribution of each predictor to each outcome variable. In addition to the predictor variables, all demographic variables that correlated moderately (r ≤ .20) with either outcome variable and all injury-related variables were included in the regression analysis. . . . Employment status was included in both regression analyses as well, to compare the predictive value of employment to the self-efficacy variables. (p. 302)
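For readers unfamiliar with this two-step approach, here is a minimal sketch of a hierarchical (sequential) regression in which a block of covariates is entered first and the predictor variables of interest second, followed by a test of the change in R². The data file and variable names are hypothetical stand-ins, not Tsaousides et al.’s actual variables.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical columns: pqol (outcome), age, injury_severity, employed, pem, ind.
df = pd.read_csv("tbi_survey.csv")

# Step 1: demographic, injury-related, and employment covariates only.
step1 = smf.ols("pqol ~ age + injury_severity + employed", data=df).fit()

# Step 2: add the predictor variables of interest.
step2 = smf.ols("pqol ~ age + injury_severity + employed + pem + ind", data=df).fit()

print(f"R-squared, step 1: {step1.rsquared:.3f}; step 2: {step2.rsquared:.3f}")
print(step2.compare_f_test(step1))  # F test of the R-squared change: (F, p, df difference)
```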

Amir et al. (2009) provided an example of how to report the analytic strategy for an experiment using analysis of variance (ANOVA). The researchers’ subsection on their statistical analysis of the study on attention training covered about a column of text. Here is a portion of what they wrote about their primary measures:

To examine the effect of the attention training procedure on the dependent variables, we submitted participants’ scores on self-report and interviewer measures to a series of 2 (group: AMP, ACC) × 2 (time: preassessment, postassessment) ANOVAs with repeated measurement on the second factor. Multivariate ANOVAs were used for conceptually related measures . . . and a univariate ANOVA was used for disability ratings. . . . Significant multivariate effects were followed up with corresponding univariate tests. We followed up significant interactions with within-group simple effects analyses (t tests) as well as analyses of covariance on posttreatment scores covarying pretreatment scores to determine whether the AMP resulted in a significant change on the relevant dependent measures from pre- to postassessment. . . . A series of 2 (group) × 2 (time) × 2 (site: UGA [University of Georgia], SDSU [San Diego State University]) analyses established that there was no significant effect of site on the primary dependent measures (all ps > .20). The magnitude of symptom change was established by calculating (a) within-group effect sizes = (preassessment mean minus postassessment mean)/preassessment standard deviation, and (b) between-group controlled effect sizes (postassessment ACC covariance adjusted mean minus postassessment AMP covariance adjusted mean)/pooled standard deviation. (Amir et al., 2009, pp. 966–967)

The researchers began by telling readers the variables of primary interest. Then they described how the initial data analysis mirrored the research design; there were two experimental groups with each participant measured at two times, so they had a 2 × 2 design with the time of measurement as a repeated-measures factor. They protected against an inflated experiment-wise error rate by grouping conceptually related variables and followed up only significant multivariate effects with univariate tests. Significant interactions (tests of potential moderators that might influence the strength of the treatment effect) were followed by simple effects tests within subgroups. They finished by telling readers how they calculated their effect sizes—in this case, standardized mean differences, or d indexes. This description perfectly matches the recommendations of JARS.
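The effect size formulas quoted above translate directly into code. Here is a minimal sketch of both calculations with made-up summary statistics; obtaining the covariance-adjusted posttest means for the between-group version would, of course, require fitting the analysis of covariance first.

```python
def within_group_d(pre_mean: float, post_mean: float, pre_sd: float) -> float:
    """Within-group effect size: (preassessment mean - postassessment mean) / preassessment SD."""
    return (pre_mean - post_mean) / pre_sd

def between_group_d(adj_post_control: float, adj_post_treatment: float, pooled_sd: float) -> float:
    """Between-group controlled effect size: difference between covariance-adjusted
    posttest means divided by the pooled standard deviation."""
    return (adj_post_control - adj_post_treatment) / pooled_sd

# Hypothetical summary statistics, for illustration only.
print(within_group_d(pre_mean=62.0, post_mean=48.0, pre_sd=14.0))  # 1.0
print(between_group_d(55.0, 47.0, pooled_sd=16.0))                 # 0.5
```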

Studies With No Experimental Manipulation

Many studies in psychology do not involve conditions that were manipulated by the experimenter or anyone else. The main goals of these studies can be to describe, classify, or estimate, as well as to look for naturally occurring relations between the variables of interest. Such studies are sometimes called observational, correlational, or natural history studies. In an observational study, the investigators watch participants and measure the variables of interest without attempting to influence participants’ behavior in any way. Participants’ status is determined by their naturally occurring characteristics, circumstances, or history and is not at all affected (at least not intentionally) by the researchers. Of course, observations can occur in experimental studies as well.

In a correlational study, the researchers observe the participants, ask them questions, or collect records on them from archives. The data gathered are then correlated with one another (or are subjected to complex data analyses, discussed in Chapter 5) to see how they relate to one another. Again, the key is that the data are collected under the most naturalistic circumstances possible, with no experimental manipulation.

A natural history study is used most often in medicine, but it can apply to studies of psychological phenomena as well, especially studies concerned with mental health issues. As the name implies, natural history studies collect data over time on an issue, often a disease or psychological condition, so that the researchers can create a description of how the condition changes or emerges with the passage of time. It is hoped that the progression of the changes will provide insight into how and when it might be best to treat the condition. Thus, a natural history study can be considered a special type of observational study.

Given the nature of the research question, such studies may have special features to their designs or sampling plans. The studies may be prospective—that is, they follow people over time with the intention of collecting planned data in the future. Or they may be retrospective—that is, the participants’ states or conditions in the past are gleaned from present-day self-reports, reports from others, or archival records.

A cohort study is a type of nonexperimental design that can be prospective or retrospective. In a prospective cohort study, participants are enrolled before the potential causal effect has occurred. For example, indigent mothers-to-be may be enlisted to take part in a study that tracks the effects of poverty on their children. In a retrospective cohort study, the study begins after the dependent event occurs, such as after the mothers have given birth and their children have exhibited developmental problems in preschool.

In case-control studies, existing groups that differ on an outcome are compared to one another to see if some past event correlates with the outcome. So students who are having trouble reading at grade level and those who are on grade level might be compared to see if they grew up under different conditions, perhaps in or out of poverty.

These are just a few examples of many nonexperimental designs and how they can be described. Table A4.3 presents the additional JARS standards for reporting these types of studies. These standards were added to JARS because they were part of the STROBE standards (Strengthening the Reporting of Observational Studies in Epidemiology; Vandenbroucke et al., 2014) and they either elaborated on the general JARS or asked for additional information. Table A4.3 is intended to be used along with Table A1.1. Because the STROBE guidelines are generally more detailed than JARS and are specific to nonexperimental designs, the article by Vandenbroucke et al. (2014) containing an example and explanation of each STROBE entry is very useful.3

Participant Selection

If your study did not include an experimental manipulation,

  • describe methods of selecting participants (e.g., the units to be observed or classified), including
    • methods of selecting participants for each group (e.g., methods of sampling, place of recruitment),
    • number of cases in each group, and
    • matching criteria (e.g., propensity score) if matching was used;
  • identify data sources used (e.g., sources of observations, archival records) and, if relevant, include codes or algorithms to select participants or link records; and
  • define all variables clearly, including
    • exposure;
    • potential predictors, confounders, and effect modifiers; and
    • method of measuring each variable.

I have already introduced several studies that were not experimental in design. The study I described by Fagan, Palkovitz, Roy, and Farrie (2009) looked at whether risk factors such as fathers’ drug problems, incarceration, and unemployment were associated with resilience factors (e.g., fathers’ positive attitudes, job training, religious participation) and ultimately with the fathers’ paternal engagement with offspring. O’Neill, Vandenberg, DeJoy, and Wilson (2009) reported a study on the relationships between (a) retail store employees’ anger, turnover rate, absences, and accidents on the job and (b) the employees’ perception of how fully the organization they worked for supported them. All of the variables were measured, not manipulated.

Moller, Forbes-Jones, and Hightower (2008) studied the relationship between the age composition of a classroom, specifically how much variability there was in children’s ages in the same preschool class, and the developmental changes they underwent. This also was a correlational study. These researchers measured the variability in classroom age composition and children’s scores on the cognitive, motor, and social subscales of the Child Observation Record. They then related the scores to the classrooms’ variability in age composition. Finally, Tsaousides et al. (2009) looked at the relationship between self-efficacy beliefs and measures of perceived quality of life among individuals with self-reported traumatic brain injury attending a community-based training center.

I presented the last three of these studies as examples of good descriptions of how to report measures in Chapter 3, under the Intended Sample Size subsection and the Measures, Methods of Data Collection, Measurement Quality, and Instrumentation section. I will not describe them again here, but you can return to them and see how they fulfilled the JARS requirements for reporting measures in nonexperimental designs. Good descriptions of measures are always important, but this is especially true when the study contains no manipulation. When a manipulation is present, the description of the purposively created conditions also must be detailed and transparent.

Comparability of Assessment

If your study did not include an experimental manipulation, your Method section should describe the comparability of assessment across groups (e.g., the likelihood of observing or recording an outcome in each group for reasons unrelated to the effect of the intervention).

An added reporting standard for the Method section of nonexperimental studies calls for the researchers to describe whether the measurements were taken under comparable conditions across groups. This is especially important when the study compares intact groups, or groups that existed before the study began. Such groups might differ not only because of differences in the measured variable of interest but also because the measurements were taken at different times, in different circumstances, and by different data collectors. If this was true of your study, JARS suggests that you say so in your report and describe what you did to demonstrate that any differences between your static groups were not caused by these differences in measurement.

None of our article examples involve static groups, so let’s complicate one for purposes of demonstration. Suppose a retail organization approached you with a problem. They had a store (in a chain) that was experiencing an unusually high rate of absenteeism and employee turnover (similar to O’Neill et al., 2009). They wanted to know why. They suspected the management team at the store was seen as unsupportive by the staff. So they asked you to survey the workers at the location of concern and at another store in the chain. Because the employees were not randomly assigned to store locations, you would need to take special care to determine that absenteeism (did both stores regard personal and sick days in the same way when calculating absences?), turnover (did both consider a request for transfer as an instance of turnover?), and perceived organizational support (how similar were the questionnaires, administrators, and administration circumstances?) were measured in comparable ways. These efforts need to be described in the report. Otherwise, readers will not be able to assess the importance of any of these potential sources of bias in the measures.

With the description of the Method section behind us, let’s look at the JARS standards for reporting the results of studies.