The Journal Article Reporting Standards (JARS; Appelbaum et al., 2018) contain additional recommendations for research designs that have a few other unique features. These standards relate to studies that (a) collect data on participants at more than one time; (b) are intended to be replications of previously conducted studies; (c) are conducted using only one subject or unit; and (d) are clinical trials—that is, meant to test the effectiveness of physical or mental health treatments. The flowchart in Figure 1.1 diagrams the strategy for deciding which JARS tables are relevant to your project. Here, I discuss the four additional design features in turn.
Table A6.1 presents the additional JARS recommendations for studies that collect longitudinal data—that is, that collect data on each participant on more than one occasion. These types of studies are most frequent in developmental psychology, when you want to see how people change over time, or in clinical and health psychology, when you want to see whether the effects of an intervention persist over time. Because these types of studies have this time dimension, the JARS table covers an aspect of design covered in none of the other tables. Therefore, you would use Table A6.1 in addition to the other tables that are relevant to your design if you collected longitudinal data.
If your study involves collecting data on subjects at more than one time, your Method section should describe for each unit sampled
The reporting recommendation on sample characteristics is really a reiteration of a recommendation made in the general reporting standards (Table A1.1): Your report should present a complete picture of who the participants were. There are some characteristics, such as age and sex, that need to be reported universally because they so often interact with psychological phenomena. The importance of other characteristics will depend on the topic and context of the study you are conducting.
The key reason sample characteristics are reemphasized in the JARS table for longitudinal studies is not that the characteristics to be reported in these types of studies are different from those of other studies. Rather, it is a reminder that the composition of the sample needs to be described for each of the measurement occasions. For example, the study by Fagan, Palkovitz, Roy, and Farrie (2009) on risk factors (e.g., fathers’ drug problems, incarceration, unemployment) associated with resilience factors (e.g., fathers’ positive attitudes, job training, religious participation) and paternal engagement with offspring was a longitudinal study. The description of the samples is provided in Chapter 3. Remember that Fagan et al. used data drawn from the Fragile Families and Child Wellbeing Study, which followed nearly 5,000 children born in the United States between 1998 and 2000. The researchers used longitudinal data collected at a child’s birth, 1-year follow-up, and 3-year follow-up. To obtain their final sample from the larger database, Fagan et al. did the following:
We excluded married and cohabiting couples at baseline to ensure that the study focused on the most fragile families at the time of the child’s birth. Next, we selected fathers who also participated at Year 1. Of the 815 eligible fathers who participated at baseline and Year 1, 130 were missing substantial data, so we removed them from the sample. These steps resulted in a sample of 685 fathers. We then excluded fathers who were not interviewed at Year 3, yielding a sample of 569 fathers. About 4% of the 569 fathers had substantial missing Year 3 data that could not be easily imputed. We omitted these fathers from the study, yielding a final sample of 549 fathers with complete data. (p. 1392)
The researchers then presented the average age of their sample at baseline and its composition by ethnic group and level of education. Also, because it was relevant to this study, they specified the relationship between the mother and father at 1-year and 3-year follow-up.
Fagan et al.’s (2009) study did not involve an experimental manipulation. One of the other examples did: Norman, Maley, Li, and Skinner’s (2008) study evaluating the effectiveness of an Internet-based intervention to prevent and stop adolescents from smoking. Norman et al. took measures at baseline, postintervention, and 3-month and 6-month follow-up. After presenting a thorough description of their sampling technique and sample, they informed readers,
The distribution of cultural groups in the smoker sample was significantly different, t(13) = 3.105, p < .001, than the nonsmoker sample, with a higher proportion of smokers identifying with Eastern European or Mediterranean cultures. Those from Central Asian or African cultural groups were the least represented in the smoker sample. (Norman et al., 2008, p. 801)
If your study involves collecting data on subjects at more than one time, your Method section should describe for each unit sampled
Fagan et al. (2009) told readers that the reason for attrition (the loss of participants’ data over the course of the study) was too much missing data; they began with a sample of 815 fathers who were eligible and participated at baseline, but only 549 of these fathers had complete data after the two follow-ups. This is an attrition rate of about 33%. The question you must ask is whether the initial sample, if meant to be representative of a national sample of unwed fathers at baseline, would still be representative if one in three fathers were not included in the 1-year and 3-year follow-up sample. Could the missing fathers have differed from the retained fathers in some way that was related to the variables of interest?
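To make the arithmetic concrete, here is a minimal sketch (in Python; the variable names are mine, not Fagan et al.’s) showing how the roughly 33% figure follows from the sample sizes quoted above:

```python
# Sample sizes reported by Fagan et al. (2009)
baseline_n = 815  # eligible fathers who participated at baseline and Year 1
final_n = 549     # fathers with complete data after the Year 1 and Year 3 follow-ups

attrition_rate = (baseline_n - final_n) / baseline_n
print(f"Attrition rate: {attrition_rate:.1%}")  # about 32.6%, or roughly one in three
```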
In fact, Fagan et al. (2009) looked at whether fathers who dropped out differed from ones who did not:
We conducted attrition analyses to compare the sample of fathers who met criteria for inclusion in the present study at baseline, Year 1, and Year 3 and who were interviewed at all three times (n = 549) with fathers who were interviewed at baseline and Year 1 but not at Year 3 (n = 116). Chi-square or one-way analyses of variance were conducted on father’s age, race, education, relationship with the mother of the baby, and prenatal involvement; family risk and resilience indexes at baseline; individual, family, and child risk indexes at Year 1; individual and family resilience indexes at Year 1; and whether the father resided with the child at Year 1. One significant difference was found: Fathers in the sample were more involved prenatally, F(1, 664) = 4.7, p < .05, than fathers who dropped out at Year 3. (p. 1396)
Regardless of how you answer the question of how much this difference might matter to the generalizability of the findings, the important point here is that the report gave you sufficient information to make the judgment.
Norman et al. (2008) showed the attrition rates for their different conditions in the diagram presented in Figure 5.2. For an experimental study, you are concerned about not only whether overall attrition has been appreciable, potentially threatening external validity, but also whether attrition is different for each of the conditions at each measurement, a possible threat to internal validity. If it is, the ability of the researchers to make claims about the effect of the experimental variable is compromised. If differential attrition occurs, it is possible that the composition of the groups changed, creating a plausible rival hypothesis in addition to the treatment’s effect if the groups differ on subsequent tests. You can see from Figure 5.2 that attrition was less than 10% among nonsmokers in the two experimental conditions but between 15% and 30% in the two different conditions at different times for smokers. Is this difference large enough to raise your concern about whether the subjects who were smokers and completed the study were different from the smokers who did not complete the study? Regardless, JARS recommends that researchers report attrition rates for different experimental conditions so that such a determination can be made.
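If you want to check differential attrition formally, one option is to cross-tabulate completion status by condition and test the association. The sketch below is a hypothetical Python example with invented counts (not Norman et al.’s data) showing one way to do this with a chi-square test:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of completers and dropouts in two conditions
attrition_table = [
    [90, 10],  # condition A: completed, dropped out
    [75, 25],  # condition B: completed, dropped out
]

chi2, p, dof, expected = chi2_contingency(attrition_table)
print(f"chi-square({dof}) = {chi2:.2f}, p = {p:.3f}")
# A significant result suggests attrition differed across conditions,
# a potential threat to internal validity.
```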
If your study involves collecting data on subjects at more than one time, your Method section should describe
This reporting recommendation is pretty straightforward. People, local contexts, and larger societies change over time. The researchers should be aware of any of these changes that might influence how participants respond and should report these, along with any effort they made to adjust for or take these changes into account. For example, a new report showing that smoking is bad for your health that got widespread media attention between the baseline and 3-year follow-up waves of the Norman et al. (2008) study might have affected participants’ follow-up responses. It could be a plausible rival hypothesis to a treatment effect. Researchers should make readers aware of any changes in participants and context that might affect their results.
If your study involves collecting data on subjects at more than one time, your Method section should describe the independent variables and dependent variables at each wave of data collection. It should also report the years in which each wave of data collection occurred.
This recommendation of JARS points out that measurements are sometimes not the same from one data collection time to another. For example, a revision of one of your instruments might have been developed between the posttreatment and follow-up data collections. Most researchers would be reluctant to change measurements midstream, but if the validity and reliability of the measures have improved greatly, the researchers might find that the tradeoff of quality over consistency is worth it. If you do this, you should fully detail all the measures and tell how you adjusted scores to make them as equivalent as possible (e.g., rescaling, adjustment for attenuation). Thus, just as you should mention in your report all of the measures that were taken as part of a study at only a single point in time, you should repeat this detail for every occasion of measurement, even if just to say that the measurements were the same.
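As one illustration of rescaling, scores from instruments with different response scales can be placed on a common 0-to-100 metric (the percentage of maximum possible, or POMP, score). The sketch below is a minimal Python example; the scale ranges and scores are hypothetical:

```python
def pomp(score, scale_min, scale_max):
    """Rescale a raw score to the percentage of maximum possible (POMP)."""
    return 100 * (score - scale_min) / (scale_max - scale_min)

# Hypothetical: the same construct measured on a 1-5 scale at posttreatment
# and on a revised 1-7 scale at follow-up
print(pomp(4.0, 1, 5))  # 75.0
print(pomp(5.5, 1, 7))  # 75.0 -> comparable values on the same 0-100 metric
```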
If your study involves collecting data on subjects at more than one time, your Method section should describe the amount of missing data and how issues of missing data were handled analytically.
This is also a reiteration of an earlier JARS recommendation. Note above how Fagan et al. (2009) removed from their sample all participants with “substantial” missing data. Did this removal mean the fathers used in the analyses were different from fathers without missing data? Fagan et al. (2009) provided the following information:
We compared the sample of fathers who participated at all three times and who had no missing data with fathers who had substantial missing data. The fathers with no missing data were more likely to be in a romantic relationship with the mother of the baby at baseline, χ2(1, N = 569) = 7.9, p < .01, were less likely to have been merely friends with the mother of the baby at baseline, χ2(1, N = 569) = 6.2, p < .05, and experienced more individual resilience factors between baseline and Year 1, F(1, 568) = 4.3, p < .05, than fathers who had missing data. (p. 1396)
When data were missing, but not substantially so, Fagan et al. (2009) did the following:
The mothers’ data also had missing cases that needed to be imputed. At Year 1, 19 mothers were not interviewed and 49 had missing data; at Year 3, 29 mothers were not interviewed and 82 had missing data. For those cases in which all mothers’ data were missing, the fathers’ data were first consulted to see whether they reported having any contact with the child. If they did not, zeros were imputed for all mother items. Among mothers who had some missing items, we imputed the respondent’s mean for those items that were answered. For the remaining cases, the sample mean for the mother’s index was imputed. The Cronbach alpha was .95 for Year 1 and .96 for Year 3. (p. 1393)
Throughout, these researchers substituted fathers’ data for missing mothers’ data and substituted means when only an insubstantial amount of data was missing for mothers at each time of measurement.
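A simplified sketch of this kind of two-step imputation (first the respondent’s own mean over answered items, then the sample mean when a respondent answered nothing) appears below. The data frame and values are hypothetical, not drawn from the Fragile Families study:

```python
import numpy as np
import pandas as pd

# Hypothetical responses from four mothers on a three-item index (NaN = missing)
items = pd.DataFrame({
    "item1": [3.0, np.nan, 2.0, np.nan],
    "item2": [4.0, 5.0, np.nan, np.nan],
    "item3": [3.0, 4.0, 2.0, np.nan],
})

# Step 1: for respondents who answered some items, fill each missing item
# with that respondent's mean over the items she did answer
respondent_means = items.mean(axis=1)
filled = items.apply(lambda column: column.fillna(respondent_means))

# Step 2: for respondents who answered nothing, impute the sample mean of the index
index_scores = filled.mean(axis=1)
index_scores = index_scores.fillna(index_scores.mean())
print(index_scores)
```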
Norman et al. (2008) used a very different strategy for imputing missing data:
The use of multilevel logistic regression models provided alternative strategies to deal with missing data that are comparable to estimation procedures, but rely on less data to effectively assess change over time (Little & Rubin, 1987). (p. 804)
If your study involves collecting data on subjects at more than one time, your Method section should specify the analytic approaches used and assumptions made in performing these analyses.
Longitudinal data can be analyzed in numerous ways, from repeated measures analysis of variance to highly complex analysis strategies. For example, Fagan et al. (2009) first did a series of univariate analyses using data from each time point separately. Then they conducted a sophisticated path analysis that used the time of measurement as one variable. Thus, variables at baseline were used to predict variables at 1-year follow-up, and then both of these times of measurement were used to predict all variables at Year 3. Norman et al. (2008) used logistic growth models to analyze their data. Because these strategies require very advanced calculations (and a new set of terms), I will not describe their results here. However, the important point is that your paper needs full descriptions of all the analytic strategies in the Method or Results section.
If your study involves collecting data on subjects at more than one time, your Method section should provide information on where any portions of the data have been previously published and the degree of overlap with the current report.
You might publish a report of your data collected at baseline and immediately after treatment soon after those data are analyzed. Then, perhaps years later, you might write another report covering your follow-up data. As with reports of data that have been used in more than one article published at roughly the same time, you should provide information in your later reports whenever you reuse data that were published previously, even if only part of the data in the later report is reused. Fagan et al. (2009) did this when they noted that their data were drawn from an already existing database, the Fragile Families and Child Wellbeing Study, collected in the United States between 1998 and 2000. It is not good practice, however, to wait until the follow-up data are collected and then publish the longitudinal data in pieces. This is called piecemeal publication (see Cooper, 2016).
Table A6.2 sets out reporting standards for studies that are meant to repeat earlier studies and to do so with as much fidelity as possible. The most important reporting issues concern the similarities and differences between the original study and the study meant to repeat it. Because these issues are unique to replication studies, Table A6.2 should be used with any combination of other JARS tables I have described, depending on the design of the study that is being replicated.
It will be clearest if I use as the example the hypothetical fragrance labeling study. If you are interested in reading a published replication study, look at Gurven, von Rueden, Massenkoff, Kaplan, and Lero Vie (2013) for a study examining whether the five-factor model of personality was replicable using participants from a largely illiterate society or Jack, O’Shea, Cottrell, and Ritter (2013) for an attempt to replicate the ventriloquist illusion in selective listening.
If your study is meant to be a replication of a previously conducted study, your report should
If your study is a replication of an earlier work, it is critical that the title reflects this. So, if you were replicating an earlier study by “Capulet (2018)” on the effect of labeling fragrances, you might title your study “The Effects of Labels on Subjective Evaluations: A Replication of Capulet (2018).” Note that I used part of the title of the original study (from Chapter 2) but not all of it. This way, the two studies are clearly linked but cannot be confused as the same. The fact that your study is a replication is also made clear.
The replicative nature of your experiment should also be mentioned as early in the introduction as possible, preferably the first paragraph. You should also mention as early as possible whether your study is a direct or approximate replication—that is, whether it is meant to be as operationally similar to the original study as is possible. The difference between direct and approximate replication is not black and white. For example, a replication can never be direct, or exact, because it will always be done at a different time, often at a different place, and with different participants, researchers, and data collectors. Still, if these are the only things that vary, the study is often called a direct replication. It could also be called an approximate replication but typically is not. The label approximate replication should be used if, for example, changes were made to the way measures of the pleasantness of the fragrance were constructed (e.g., a scale with a different number or type of responses) but these are expected to have little or no effect on the outcomes of the study.
A conceptual replication occurs when the manipulations or measures were changed but the underlying relationships between variables are expected to hold. For example, instead of using roses and manure to label your fragrances, you might want to see whether the different positive and negative labels for neutral fragrances would be rated as similarly pleasant if the labels are lilies and mud.
JARS also recommends that any additional conditions you added to the original study design be fully described. For example, you might include in your replication study an extract from roses but also one from manure, making for four conditions rather than only the two conditions with different labels for the same fragrance used in the original study by “Capulet (2018).” Thus, the two conditions using extract of roses might be a direct replication of “Capulet (2018),” but the two conditions using an extract with a less pleasant fragrance are an extension of the original research.
Describe in your introduction why you added the new conditions and your hypotheses involving them. In your Method section, detail how you created and manipulated the new conditions. In addition to new conditions, JARS recommends that any additional measures be fully described in your report. So, if “Capulet (2018)” did not include a measure of subjects’ intention to buy the fragrance but you did, this needs to be detailed in your report.
Also, report any indications of the fidelity of the experimental treatments that were reported in the original study and in your study. This allows readers to assess how comparable your manipulations were to the originals. What was the rated pleasantness of the fragrances in the original studies before the labels were applied? Were your prelabeling ratings similar? Did the studies measure whether subjects remembered the label? Were the rates of recollection the same in each study?
If your study is meant to be a replication of a previously conducted study, your report should include
Your report should compare the demographic characteristics of the subjects in both studies. If the units of analysis are not people, such as classrooms, then you should report the appropriate descriptors of their characteristics.
These JARS recommendations allow the reader to assess whether the subjects in your study were similar to or different from those in the original study in any important ways that might relate to the labeling effect. For example, if the original study was conducted with a sample of female subjects only but your study was conducted with both female and male subjects, you need to say this in your Method section. Also, you should present any appraisal you make about whether this difference in subjects might influence the results. Might there be a difference between the sexes—might Juliet think both labels are equally sweet but Romeo think they are different? The results for female subjects might be a direct or approximate replication of the original study and the results for male subjects an extension of it, testing the generalizability of the original findings across the sexes.
In both the original labeling study and your replication, it is likely that subjects were randomly assigned to conditions. Your report should mention this. If the assignment to conditions was different in the two studies, it is critical that you report both and discuss the implications. For example, in an original school-based study of a reading intervention, the principal might have picked teachers for the new program, but in your study, teachers might have volunteered. This could create differences between the groups before the intervention began. For example, whereas the principal may have picked the teachers most in need, volunteer teachers may have been the most motivated to have their students improve, yielding two groups of teachers who would not necessarily react to the intervention in the same way.
If your study is meant to be a replication of a previously conducted study, you should report
As noted above, any changes you made to the apparatus and instruments you used in your replication need careful documentation. Such changes might include any language translations, how these were accomplished, and how successful they were in creating equivalent instruments. Was the original fragrance labeling study conducted in Italy, with measures written in Italian, and your replication study conducted in the United States, with measures in English? If so, how were the measures translated, and what evidence do you have that the measures were equivalent?
Also, the psychometric properties of your measures and the equivalent measures used in the earlier study need to be reported. What were their reliabilities and internal consistencies? Did these differ in a way that might affect the results? You should also detail any differences in the settings and instructions to subjects. Was one study conducted in a lab and the other at a department store perfume counter? How might this difference have affected the results?
If your study is meant to be a replication of a previously conducted study, you should report
The notion of replication extends not only to the methods used in the original study but to the data analysis strategy as well. It is important that when you conduct a replication, you use the same analytic strategy that was used in the original study. For example, if the original labeling study used a two-group t test to test the differences between labels, you should do this test as well. This also means that it is important to do a power analysis for your sample size and report it. Readers will be interested in whether the statistical power of your study roughly matches that of the original study, whether your study had a greater or lesser chance of finding statistical significance, and whether it measured parameters (e.g., means, standard deviations, effect sizes) with more or less precision.
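A power analysis of this kind is straightforward to run. The sketch below, in Python with the statsmodels package, assumes a hypothetical original effect size of d = 0.50 and 40 subjects per group; substitute the values reported in the study you are replicating:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Power of a two-group t test with the original study's per-group n (hypothetical values)
power = analysis.solve_power(effect_size=0.50, nobs1=40, alpha=0.05, ratio=1.0)
print(f"Power with 40 subjects per group: {power:.2f}")

# Per-group n needed for 80% power to detect the original effect size
n_per_group = analysis.solve_power(effect_size=0.50, power=0.80, alpha=0.05, ratio=1.0)
print(f"Subjects per group needed for 80% power: {n_per_group:.0f}")
```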
You can also report any new ways you used to analyze the data. For example, you might do the t test used by the original researchers but also do a two-way analysis of variance by adding the sex of subjects to the analysis. You can improve on the original study’s analyses, but only after you have reproduced them as closely as possible.
Finally, you need to spell out what criteria you used to decide whether the original results were, in fact, successfully replicated. You might use the criteria of whether the findings were statistically significant (did both you and Capulet find that the roses label led to significantly higher favorability ratings?), how the magnitudes of the experimental effect size compared, whether the two effect sizes were contained within each other’s confidence intervals, or how the Bayesian postexperiment (a posteriori) expectations were similar or different. There are guidelines for when you might deem a replication as successful or unsuccessful (e.g., Open Science Collaboration, 2015), and Verhagen and Wagenmakers (2014) introduced a Bayesian approach and gave a brief overview of alternative methods using frequentist statistics. You can use these resources to help you pick your criteria.
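To illustrate one of these criteria, the sketch below checks whether each study’s effect size falls within the other’s 95% confidence interval, using a common large-sample approximation to the standard error of Cohen’s d. The effect sizes and sample sizes are hypothetical:

```python
import math

def d_confidence_interval(d, n1, n2, z=1.96):
    """Approximate 95% confidence interval for Cohen's d (large-sample standard error)."""
    se = math.sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

# Hypothetical results: original study and replication
d_original, ci_original = 0.60, d_confidence_interval(0.60, n1=40, n2=40)
d_replication, ci_replication = 0.35, d_confidence_interval(0.35, n1=80, n2=80)

print("Original d within replication CI:",
      ci_replication[0] <= d_original <= ci_replication[1])
print("Replication d within original CI:",
      ci_original[0] <= d_replication <= ci_original[1])
```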
Studies that are conducted using a single subject or unit focus on trying to understand how an individual (or other type of single unit) changed over time or was affected by a treatment. This design is in contrast to those I have discussed thus far, which are concerned with what makes one group of people different from another group on average. Single-case research is used most frequently in clinical psychology, especially when a treatment is meant to help an individual who has a rare disorder or when people with the disorder are hard to find and recruit for a group study with enough power to detect a reasonable effect size. It is also used in developmental psychology, when researchers are interested in how individuals change over time, with or without an intervention. Finally, single-case designs are sometimes used in applied psychology when, say, the effects of an intervention on a large unit motivate the study. This unit might be a company, a classroom or school, or an entire city, state, or country rather than the individuals who make up the unit. In this case, the average or sum of the individual scores of everyone in the unit (e.g., a factory’s productivity, the percentage of a school’s students scoring below grade level) represents the “behavior” of the single unit.
Table A6.3 will be of interest if your study is being conducted on a single individual or other type of unit. The single-case design is a unique type of research, not just because of the question it asks. You would complete Table A6.3 along with Table A1.1 and any other JARS tables that are relevant (see especially the section below on clinical trials).
The material called for in the introduction and Discussion sections of Table A1.1 applies to all types of designs. So does much of the material reported in the Method section, including participant characteristics and sampling procedures (in the case of single-case research, this would constitute the individual’s demographic characteristics and how the unit was recruited). However, some of the material in the Method and Results sections would look quite different. For example, in a study with a single subject, the sample size and total number of participants is 1, and that is the intended sample size. Power analysis is irrelevant for an N-of-1 design, but a complete description of the characteristics of the measures is no less important. Similarly, you should examine the entries in Table A1.1 relating to results. You will see that most are relevant to N-of-1 designs, but they would be reported in a different way.
I will not use the hypothetical fragrance labeling study as an example of a single-case design. If you were conducting a labeling study, you might be able to use the two label conditions as a within-subjects factor, perhaps by exposing subjects to the same fragrance twice but once labeling it roses and once manure. But N-of-1 designs require numerous measurements, sometimes hundreds, over time, so this would not be the design of choice where labeling of fragrances is concerned. A subject would be incredulous if asked to sniff the fragrances too many times with the same two labels used over and over again. Also, if the trials were conducted too close together, the subject’s sense of smell would change. You could ask the subject to return day after day, but this likely would be impractical.
So instead, I will use a hypothetical example in which you wish to test the effectiveness of a drug to help a smoker cut down on the number of cigarettes smoked each day or to eliminate smoking entirely. I will also introduce another example drawn from the literature involving the effectiveness of a video modeling treatment meant to improve how well an 8-year-old child with autism can perceive different emotions (happy, sad, angry, and afraid). It was conducted by Corbett (2003) and appeared in The Behavior Analyst Today. Video modeling was described by Corbett as follows:
Video modeling is a technique that has been developed to facilitate observational learning. It generally involves the subject observing a videotape of a model showing a behavior that is subsequently practiced and imitated. The child is typically seated in front of a television monitor and asked to sit quietly and pay attention to the video. The child is praised for attending and staying on task. Following the presentation of a scene showing a target behavior, the child is asked to engage in or imitate the behavior that was observed. This procedure is then repeated across examples and trials. (p. 368)
Corbett went on to say that video modeling had been used successfully with children with autism to teach them a variety of skills and that there were some instances in which it was shown to be more effective than live modeling. Can video modeling be used to teach a child with autism how to better recognize the emotions expressed on other people’s faces, a skill that is notably difficult for these children?
If your study was conducted on a single subject, your report should describe
Single-case designs can take many forms. When an experimental manipulation is examined in the study (such as the autism study), the most general variations in design relate to (a) whether the intervention is introduced but then withdrawn or reversed after a time, (b) whether more than one treatment is used and the treatments are introduced individually or in combination, (c) whether multiple measures are taken and whether these were meant to be affected differently by the treatment, and (d) what criteria will be used to decide when an intervention should be introduced or changed. Single-case studies without an intervention can be less complicated, but the researchers still need to decide whether multiple different measures will be taken and what length of time to allow between measures. JARS recommends that these basic design features be described in the report, and it would be odd, indeed, if they were not. Two excellent books for studying up on single-case research designs are Barlow, Nock, and Hersen (2008) and Kazdin (2011).
JARS also specifies that the report describe the length of the phases used in the study and the method used to determine these lengths. Sometimes the length of phases is preset before the experiment begins. Other times phase length is determined by the subject’s pattern of behavior once the study begins. For example, in single-case studies with interventions, it is important to achieve a stable baseline for the behaviors of interest before the intervention is introduced; if you do not have a good estimate of how the subject was behaving before the intervention begins, how can you tell whether the intervention had an effect? Once the intervention starts, it might last for a preset amount of time, or it might be withdrawn or changed when a preset criterion is met. For example, if a drug meant to help people stop smoking reduces the number of cigarettes smoked in a day from 30 to 20, this might lead you to increase the drug dosage to test whether the rate of smoking can be reduced yet further.
Corbett (2003) described the phases of the video modeling treatment as follows:
The video modeling treatment consisted of the participant observing brief (3 to 15 seconds) videotaped scenes showing the target behavior of happy, sad, angry or afraid performed by typically developing peer models. The videotape consisted of five examples of each emotion for a total of twenty different social or play situation scenes. Each scene was shown only once per day resulting in a total intervention time of 10 to 15 minutes per day. The videotape was shown five days per week in the participant’s home generally at the same time and in the same location. (p. 370)
She reported the baseline conditions as follows:
The baseline condition consisted of the child being seated in front of the television monitor. The therapist instructed the child to “Pay attention.” The first scene was played for the child. The therapist asked, “What is he/she feeling?” The therapist presented the child with a laminated sheet of four primary emotions with cartoons and words representing the four emotion categories. The child could provide a verbal response or point to the word or picture on the sheet. The therapist documented a “1” for correct responding or a “0” for incorrect responding for each scene across the four emotion categories. In order to establish a stable trend, a minimum of two days of baseline data was obtained. (Corbett, 2003, p. 371)
Her treatment conditions were described as follows:
The treatment condition consisted of the child being seated in front of the television. Again, the therapist said, “Pay attention.” After the first scene, the participant was asked: “What is he/she feeling?” referring to the primary emotion conveyed in the scene. If the child answered correctly, social reinforcement was given. If the child answered incorrectly, then corrective feedback was provided and the therapist said, “The child is feeling (Emotion).” Next, the therapist enthusiastically responded, “Let’s do what we saw on the tape. Let’s do the same!” The therapist then initiated the interaction observed on the tape. The therapist interacted with the child in a role-play using imaginary materials to simulate the social and play situations. This interaction was intended to be enjoyable and was not rated or scored. The child was encouraged to imitate and display the feeling that he observed. The emotion response was scored 1 or 0 and the next scene was introduced. (Corbett, 2003, p. 371)
In sum, Corbett described her design as a
multiple-probe-across-behaviors design. The treatment involved providing the subject with the fewest probes and modifications needed followed by more supportive intervention, as required. (p. 371)
If your study was conducted on a single subject, your description of the research design should include any procedural changes that occurred during the course of the investigation after the start of the study.
You would need to document in your report of a single-case study any changes that occurred to your procedures once the study began. Corbett (2003) had no such procedural change:
The participant’s mother, who was familiar with basic behavioral principles, was trained and supervised weekly on the procedure by the author. The child observed the tapes in a structured, supportive environment that was devoid of extraneous visual and auditory stimuli during video watching and rehearsal periods. (p. 370)
In the hypothetical smoking cessation study, a change in procedure would need to be reported if, say, after a short period of time the original dosage of the drug proved ineffective and you decided to increase it.
If your study was conducted on a single subject, your report should include a description of any planned replications.
Although a single-case study, by definition, is conducted on only a single unit, it is not uncommon for the same report to include multiple replications of the same study using different units. Thus, Corbett (2003) might have reported the effects of video modeling separately for five different children with autism. If she did this, she would present the data for each child individually and then consider the similarities and differences in results in the Discussion.
If your study was conducted on a single subject, your report should
Randomization can occur in single-case research but, obviously, not by randomly assigning subjects to conditions. Instead, randomization can occur by randomly deciding the phases in which the different conditions of the experiment will appear. For example, in the antismoking drug experiment, you might decide that each phase will be a week long. After establishing a baseline (How many cigarettes does the subject smoke per day before the drug is introduced?), you flip a coin to decide whether the subject will or will not get the drug during each of the next 10 weeks. Note that the subject might then get the drug for some consecutive weeks and get no treatment for other consecutive weeks. This randomization helps you rule out numerous threats to the validity of your assertion that the drug was the cause of any change in smoking behavior (especially history or other events that are confounded with your treatment).
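A schedule like the one just described can be generated (and, importantly, documented) in a few lines. The sketch below is a hypothetical Python example of flipping a coin for each of the 10 weekly phases:

```python
import random

random.seed(2024)  # record the seed so the schedule can be reproduced and reported

# After the baseline phase, flip a coin for each of the next 10 weeks
schedule = ["drug" if random.random() < 0.5 else "no drug" for _ in range(10)]
for week, phase in enumerate(schedule, start=1):
    print(f"Week {week}: {phase}")
```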
If you used randomization in your single-case study, you should be sure to report it: It is a positive feature of your study that you will want readers to know about. In Corbett’s (2003) description of her phases of treatment above, she did not say explicitly whether the presentation of the emotions on the faces of the video models was randomized. In this type of experiment, it is not clear that randomization is necessary; the treatment was administered fairly quickly (over 3 weeks) and introducing and withdrawing the treatment would likely only diminish its effectiveness.
If your study was conducted on a single subject, your description of the research design should report
Corbett (2003) reported that the full sequence was completed and that there were 15 trials. She reported no incomplete replications. All reports of single-case designs should include this information.
If your study was conducted on a single subject, your description should include the results for each participant, including raw data for each target behavior and other outcomes.
Analyses of the data from single-case research designs can be conducted using techniques that vary from simple to highly complex. Highly complex techniques have been developed specifically for data that are interdependent because of time. A compendium of these techniques can be found in Kratochwill and Levin (2014). What is also unique about reporting results of single-case designs is that the raw data should be presented, which often occurs in the form of a graph. Corbett (2003) presented her results in the graph shown in Figure 6.1. She described her results without conducting any formal statistical analysis, a procedure that is not uncommon in simple single-case studies when the data are clear and irrefutable:
Video modeling resulted in the rapid acquisition of the perception of the four basic emotions. As can be observed in Figure 1, D.W. quickly mastered the identification of happy, which was already somewhat established in his repertoire. Subsequently, D.W. showed gradual improvement across the remaining emotion categories. D.W. showed rapid, stable acquisition and maintenance of all the emotion categories. (Corbett, 2003, p. 372)
Figure 6.1. Example of Raw Data Presentation in a Single-Subject Design
When studies evaluate a specific type of experimental treatment, namely the effects of health interventions, these are called clinical trials. To briefly revisit what I wrote in Chapter 1, clinical trials can relate to mental health or physical health interventions. In psychology, clinical trials often examine the psychological components of maintaining good physical health (for example, interventions meant to motivate people to comply with drug regimens) or the implications of maintaining good physical health for good mental functioning (such as the effects of exercise programs on cognitive abilities in older adults). Clinical trials may involve manipulated interventions introduced by the experimenters. Or they can be used to evaluate an intervention that was introduced by others, perhaps administrators at a hospital who want to know whether a program they have begun is working. A clinical trial may use random or nonrandom assignment.
There are other types of interventions or treatments for which you might consider completing Table A6.4 even though health issues are not involved. For example, reports on studies like that by Vadasy and Sanders (2008) examining the effects of a reading intervention might benefit from completing this table. In these cases, many of the questions on standards in Table A6.4 will help you make your report as useful to others as possible.
The example I will use to illustrate the reporting of a clinical trial is from an article by Vinnars, Thormählen, Gallop, Norén, and Barber (2009) titled “Do Personality Problems Improve During Psychodynamic Supportive–Expressive Psychotherapy? Secondary Outcome Results From a Randomized Controlled Trial for Psychiatric Outpatients With Personality Disorders,” published in Psychotherapy: Theory, Research, Practice, Training. In this study, two treatments for personality disorder were compared. The experimental treatment was a manualized version of supportive–expressive psychotherapy (SEP), and the control condition was nonmanualized community-delivered psychodynamic treatment (CDPT). Thus, this was a simple one-factor between-subjects design with two experimental conditions.
If you are reporting a clinical trial, your report should
Vinnars et al. (2009) described their inclusion and exclusion criteria for participants and how they recruited them in the same paragraph:
Participants were consecutively recruited from two community mental health centers (CMHCs) in the greater Stockholm area. They had either self-applied for treatment or were referred, mainly from primary health care. In general, participants asked for nonspecific psychiatric help, although a few specifically asked for psychotherapy. The inclusion criteria consisted of presence of at least one DSM–IV [Diagnostic and Statistical Manual of Mental Disorders, fourth edition; American Psychiatric Association, 1994] PD [personality disorder] diagnosis, or a diagnosis of passive–aggressive or depressive PD from the DSM–IV appendix. Exclusion criteria included age over 60 years, psychosis, bipolar diagnosis, severe suicidal intent, alcohol or drug dependence during the year before intake, organic brain damage, pregnancy, or unwillingness to undergo psychotherapy. Participants also needed to be fluent in Swedish. (p. 364)
This description gives readers a good idea about who was included and excluded from treatment so that researchers who want to replicate the results or extend them to different populations will know how to do so.
If you are reporting a clinical trial, you should state
From the description of subjects, it appears that in the Vinnars et al. (2009) clinical trial, people had been diagnosed with any personality disorder or specifically with passive–aggressive or depressive personality disorder. It also suggests that this clinical diagnosis happened before the study began (because subjects self-nominated or were referred). So the clinical assessors prior to treatment were not involved in providing treatment.
With regard to posttreatment assessments, Vinnars et al. (2009) told readers that clinical assessments for the study were taken at three time points: pretreatment, 1 year after treatment, and 1 additional year later. They stated that “Clinical psychologists with extensive clinical experience conducted the Structural Clinical Interview for DSM–IV II [as well as other clinical assessment instruments]” (p. 366). Thus, we can assume that the clinicians taking treatment-related assessments (the outcome variables) were not the same clinicians who administered the treatment. The researchers did not say whether the assessors were unaware of the subjects’ experimental condition (and it would have been nice if they did so), but I think it is safe to assume they were.
In your report of a clinical trial, include
Vinnars et al. (2009) used a manualized version of the psychotherapy manipulation, so we should not be surprised to find that their description of it begins with reference to where the manuals can be found:
The SEP in this study followed Luborsky’s treatment manual (Barber & Crits-Christoph, 1995; Luborsky, 1984). The treatment consisted of 40 sessions delivered, on average, over the course of 1 calendar year. (p. 365)
Vinnars et al. sent us elsewhere to find the details of their experimental treatment. This is a perfectly appropriate reporting approach, as long as (a) the document referred to is easily available (e.g., has been published, is available on the Internet); (b) the conceptual underpinnings of the treatment were carefully detailed (as they were in the introduction to Vinnars et al., 2009); and (c) any details about how the treatment was implemented that differ from the manual description, or are not contained therein, are reported. Vinnars et al.’s description also contains information on the exposure quantity (40 treatment sessions) and the time span of the treatment (about 1 calendar year).
Equally important is the researchers’ description of the comparison group:
The comparison group was intended to be a naturalistic treatment as usual for PD patients. The basic psychotherapeutic training of these clinicians was psychoanalytically oriented, and they received supervision from a psychoanalyst during the study period. They were not provided with any clinical guidelines on how to treat their patients. They chose their own preferred treatments and had the freedom to determine the focus of the treatment, its frequency, and when to terminate treatment. However, it was quite clear that the predominant orientation of all but one therapist was psychodynamic. This reinforced our conclusion that this trial involved the comparison of two psychodynamic treatments, one manualized and time limited and one nonmanualized. (Vinnars et al., 2009, p. 365)
In much evaluation research conducted in naturalistic settings, the comparison group often receives “treatment as usual.” Many times, this creates a problem for readers and potential replicators of the study findings because they have no way of knowing exactly what the comparison treatment was. The revealed effect of a treatment—the differences between it and its comparator on the outcome variables—is as much dependent on what it is being compared to as on what effect the treatment had. After all, even the effect of different perfume labels depends on both labels (roses vs. manure as opposed to roses vs. daisies). If your control group receives treatment as usual, consider in your research design how you will collect information on what usual treatment is. This may involve observing the control group and/or giving treatment deliverers and control group members additional questions about what they did or what happened to them. The important point is that just collecting the outcome variables from control group members typically will not be enough for you (and your readers) to interpret the meaning of any group differences you find. To address this problem, JARS explicitly asks for descriptions of the control groups and how they were treated. Vinnars et al. (2009) described this; be sure to do the same.
Vinnars et al. (2009) told readers that the aim of their study was “to explore the extent to which SEP can improve maladaptive personality functioning in patients with any PD from the Diagnostic and Statistical Manual of Mental Disorders (4th ed.; DSM–IV)” (p. 363). Thus, we might conclude that their aim was to gather data on the generalizability of the treatment’s effectiveness.
In your report of a clinical trial, include
It appears that in the Vinnars et al. (2009) study, there were no changes to the protocol once the study began. If there were, these would be reported in this section.
A data and safety monitoring board is an independent group of experts whose job it is to monitor the clinical trial to ensure the safety of the participants and the integrity of the data. The National Institutes of Health (NIH, 1998) has a policy about when and how a formal monitoring board should function:
The establishment of the data safety monitoring boards (DSMBs) is required for multi-site clinical trials involving interventions that entail potential risk to the participants. The data and safety monitoring functions and oversight of such activities are distinct from the requirement for study review and approval by an Institutional Review Board (IRB).
The policy goes on to say that although all clinical trials need to be monitored, exactly how they are to be monitored depends on (a) the risks involved in the study and (b) the size and complexity of the study:
Monitoring may be conducted in various ways or by various individuals or groups, depending on the size and scope of the research effort. These exist on a continuum from monitoring by the principal investigator or NIH program staff in a small phase I study to the establishment of an independent data and safety monitoring board for a large phase III clinical trial.
NIH also has rules regarding how monitoring boards should operate and when they are needed. Because the Vinnars et al. (2009) study was conducted in Sweden, the researchers would have had to meet that country’s requirements for safety and data integrity. We might conclude, because this was a relatively small study with minimal risk to participants, that the establishment of an independent group of experts to monitor the trial was not necessary. It is also the case that if your study is being conducted through a research center or institute, the organization may have an independent board that oversees many projects done under its auspices.
Clinical trials are instances in which stopping rules can play an important role in ensuring the safety of participants. If too many instances of defined adverse effects occur, the trial should be stopped. Similarly, if the treatment is showing strong and clear benefits to people, stopping the trial early may facilitate getting the treatment to people in need as soon as possible.
If your study involved a clinical trial, your report should include a description of
Treatment fidelity refers to how reliably the treatment was administered to subjects. If the treatment was not administered the way it was intended—“by the book”—then it is hard to interpret what the active ingredients in the treatment were. The same is true if there was lots of variation in administration from one treatment deliverer to the next. Vinnars et al. (2009) took care to provide details on how the fidelity of their treatment administration was monitored and with what reliability:
All therapy sessions were videotaped, and two trained raters evaluated adherence to the SEP method using an adherence/competence scale (Barber & Crits-Christoph, 1996; Barber, Krakauer, Calvo, Badgio, & Faude, 1997). The 7th session for each patient for whom we had recordings available was rated. The raters had a training period of 2 years and completed roughly 30 training tapes during this period. The scale includes three technique subscales: general therapeutic (non–SEP-specific interventions), supportive (interventions aimed at strengthening therapeutic alliance), and interpretative–expressive (primarily CCRT [Core Conflictual Relationship Theme] specific interventions). Intraclass correlations for the two independent raters’ adherence ratings were calculated and found to be good in the current sample (general therapeutic techniques .81, supportive techniques .76, and expressive techniques .81). (p. 365)
In the Vinnars et al. (2009) study, the experimental and comparison interventions were delivered by psychotherapists. So as JARS recommends, the researchers paid careful attention in their Method section to the professional training of these treatment deliverers:
Six psychologists conducted the SEP, and 21 clinicians performed the CDPT treatment. Three senior SEP therapists, with more than 20 years of experience in psychiatry and dynamic psychotherapy, had trained the remaining SEP therapists, whose experience varied from 1 to 10 years. The three senior SEP therapists had received their training in both SEP and the use of the adherence/competence rating scale . . . by the developers of the treatment. The training cases of these three therapists were translated into English and rated by adherence raters at the University of Pennsylvania Center for Psychotherapy Research until they reached acceptable adherence levels to start training the other SEP therapists and adherence raters in Sweden.
The CDPT clinicians had a mean experience of 12.5 years in psychiatry and dynamic psychotherapy. All therapists except one had at least 1 year of full-time formal postgraduate training in dynamic psychotherapy consistent with a psychotherapist certification. They all received weekly psychodynamic psychotherapy supervision prior to and during the study. Within the public health care system, dynamic therapists tend to emphasize supportive techniques when dealing with patients with severe pathology. The CDPT group included two psychiatrists who had three patients in treatment, six psychologists who had 13 patients in treatment, five psychiatric nurses who had 42 patients in treatment, six psychiatric social workers who had 16 patients in treatment, and two psychiatric nurses’ assistants who had two patients in treatment. All therapists were unaware of the specific hypotheses. (p. 365)
Vinnars et al. (2009) discussed the length and amount of treatment in the comparison group when they described the treatment itself:
Although the manualized SEP was time limited, CDPT was open ended. That is, the research team did not determine the length of treatment in the CDPT group. However, this did not mean that all CDPT patients received long-term therapy. In fact, the mean number of total treatment sessions attended between pretreatment and the 1-year follow-up assessment did not differ between the two groups (Mann–Whitney U = 2,994.00, p < .87). On average, SEP patients received 26 sessions (SD = 15.2, Mdn = 30, range = 0–78) and CDPT patients received 28 sessions (SD = 23.7, Mdn = 22, range = 0–101). Even when one focuses only on the time between the pretreatment and posttreatment assessments, the SEP patients had a mean of 25 (SD = 13.0, Mdn = 30, range = 0–40) sessions in contrast to 22 sessions for the CDPT patients (SD = 15.5, Mdn = 21, range = 0–61), respectively (Mann–Whitney U = 2,638.00, p < .19). (pp. 365–366)
Vinnars et al. (2009) said that subjects were recruited from two community mental health centers and had either applied to take part themselves or been referred for treatment. They had a compliance issue because subjects in the SEP condition entered into contracts before the treatment began but CDPT subjects did not. Here is how they reported on the problem:
Treatment Attendance
As the nonmanualized treatment, by its nature, did not have a protocol on which treatment contracts were agreed, the definition of dropouts for this condition was complicated. To solve the attendance issue, we collected session data from patients’ medical records. Treatment attendance was classified into (a) regular once a week (SEP = 52, 65.0%; CDPT = 43, 56.6%); (b) irregular, that is, less frequent than once a week, including an inability to keep regular appointments scheduled once a week (SEP = 16, 20.0%; CDPT = 17, 22.4%); or (c) no treatment, that is, not attending more than two sessions after randomization (SEP = 12, 15.0%; CDPT = 16, 21.1%). No significant difference was found between the two treatments in terms of attendance, χ2(2) = 1.35, p < .51. (Vinnars et al., 2009, p. 366)
If you were enrolling subjects in Vinnars et al.’s (2009) study of SEP therapy versus treatment as usual, you would not want to steer people in or out of one group or the other on the basis of whether you felt the subject would benefit from the experimental treatment. This is why Vinnars et al. reported both that “patients were randomized using a computerized stratification randomization procedure” (p. 368) and that “participants were consecutively recruited from two community mental health centers” (p. 364). The word consecutively indicates that subjects were screened only on whether they had a personality disorder and the other prespecified inclusion and exclusion criteria; the enroller had no personal leeway in deciding whom to invite into the study and into which condition. Similarly, who generates the assignment sequence, who enrolls subjects, and who assigns subjects to groups needs to be reported in experiments in which some form of bias in recruiting and assignment might creep into the process at each of these time points.
If you are reporting a clinical trial, your report should provide a rationale for the length of follow-up assessment.
Vinnars et al. (2009) did not provide an explicit rationale for their timing of measurements. However, the intervals they chose are easily defended on the basis of the type of clinical intervention they were testing. Interestingly, these researchers encountered an ethical issue with regard to treatment and assessments that deserves mention:
Because the comparison condition was not a time-limited treatment, patients could still conceivably be in treatment at the time that SEP patients were in posttreatment and follow-up. The questionnaires were filled out at the CMHCs in connection with the assessment interviews. Twenty-nine patients (38.2%) in the CDPT condition and 13 patients (16.3%) in the SEP group were still in treatment during the follow-up phase. Ethical protocols for this study allowed patients who were still under considerable distress to continue SEP treatment. As a result, there were still a number of patients in SEP treatment during follow-up. (p. 366)
Thus, for ethical reasons, the treatment for subjects with personality disorders did not stop when the clinical trial was over.
Now that we have covered the major research designs and statistical analyses that psychological researchers use, we need to look at what your readers will need you to report in your Discussion section so they can make sense of your findings.