12
MYTH: MULTIPLE-CHOICE EXAMS ARE INFERIOR TO OTHER EXAM FORMATS

Multiple-choice tests are one of the most maligned educational tools in existence. Instructors often criticize multiple-choice tests based primarily on the assumption that such tests can only assess superficial knowledge and that other forms of examination are therefore better measures of deeper and more meaningful student learning (see Frederiksen, 1984). Students sometimes complain that multiple-choice tests are tricky (Appleby, 2008), even going so far as to claim a multiple-choice learning disability (Demystifying learning disabilities, n.d.). Moreover, it is easy to identify critics who claim that multiple-choice tests are “nearly always worthless” (Yermish, 2010: para. 1). The key to evaluating such claims is to determine whether multiple-choice tests assess fundamentally different forms of knowledge than tests in other formats. Hundreds of studies concerning multiple-choice tests have been conducted over nearly a century. These studies vary widely in terms of both research methodology and the specific subject matter being tested. The objective of this chapter is not to demonstrate that the multiple-choice format is superior to other formats, but rather to briefly review some representative findings to demonstrate that multiple-choice tests have their place and are not the educational boogeymen they are often made out to be.

Many researchers have compared multiple-choice with alternate test formats. These alternate formats vary, but always involve some sort of open-ended or constructed response so that the test-taker must provide an answer rather than choosing an answer from a set of provided options. To evaluate whether tests in various formats are assessing the same type of knowledge or skill, it is useful to begin by reviewing tests where data are available from large samples of students taking the same test. Classroom research generally does not provide such an opportunity, but large data sets are available for research using Advanced Placement exams, which hundreds of thousands of high school students take each year to earn college credit. One group of researchers (Lukhele, Thissen, & Wainer, 1994) compared multiple-choice items with essay items on a number of Advanced Placement exams. These researchers pointed out that, relative to multiple-choice exams, essay exams carry great costs in terms of the time required for students to take them and for instructors to score them. Moreover, essay exams introduce concerns about inter-rater reliability – whether different scorers evaluate essay responses in the same way – that are of little concern on more objective multiple-choice tests. Justifying this relative inefficiency, and the potential for subjectivity in scoring, demands a search for evidence that essay items provide information that multiple-choice items do not.

Lukhele and colleagues (1994) analyzed data from Advanced Placement Chemistry and United States History exams, and determined that the multiple-choice items were better measures of student proficiency than the essay items. Moreover, they concluded that at least for these two exams, “There is no evidence to indicate that these two kinds of questions are measuring fundamentally different things” (p. 245). The researchers also examined five years of data from seven different Advanced Placement exams and found that the multiple-choice sections correlated more highly with the essay sections than essay items correlated with one another. In other words, multiple-choice items were better than essay items at predicting a student’s performance on other essay items. Lukhele and colleagues noted that Advanced Placement essays are constructed by highly trained test developers and scored by highly trained raters, which reduces measurement error. On less rigorously designed and scored tests, the researchers concluded, multiple-choice items would present an even greater advantage over essays.

Other researchers have likewise examined the equivalence of the multiple-choice and free-response sections of Advanced Placement exams. Bennett, Rock, and Wang (1991) tested the claim that free-response items assess higher-order thinking skills, while multiple-choice items assess only recognition of factual knowledge. Bennett and colleagues examined data from the Advanced Placement Computer Science test, which, they note, is designed specifically so that the multiple-choice and open-ended items assess the same content, but differ in terms of the required depth of analysis. The researchers examined data from 2,000 students drawn randomly from all students who took the computer science exam in one year. Using factor analysis, a statistical procedure for detecting underlying factors that link various tasks together, the researchers concluded that a single-factor model best fit the exam data. That is, both item formats appeared to measure the same underlying psychological characteristic. Bennett and colleagues noted that although individual multiple-choice items might have some limitations in terms of measuring some cognitive processes, groups of items together likely assess many processes usually assumed to be measured only by open-ended items. They concluded that there is little evidence that multiple-choice and open-ended items are measuring different things. In a subsequent study of both computer science and chemistry Advanced Placement exams (Thissen, Wainer, & Wang, 1994), researchers reached essentially the same conclusion, asserting that although there may be some very small statistical effects associated with differences in test format, these effects are likely to have little practical significance.
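
To make the logic of such a factor analysis concrete, the sketch below simulates several multiple-choice and free-response subscores that all depend on a single underlying proficiency and then compares the fit of one- and two-factor models. It is a hypothetical illustration only: the data, loadings, and use of scikit-learn's FactorAnalysis are assumptions made for demonstration, not a reconstruction of Bennett and colleagues' procedure.

```python
# Hypothetical illustration: do the two formats need more than one factor?
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
n_students = 2000
proficiency = rng.normal(size=n_students)  # a single latent trait

def subscore(loading, noise_sd):
    # observed subscore = latent proficiency + format-specific noise
    return loading * proficiency + rng.normal(scale=noise_sd, size=n_students)

# Three multiple-choice and three free-response subscores (all simulated)
scores = np.column_stack([
    subscore(0.8, 0.6), subscore(0.8, 0.6), subscore(0.7, 0.7),   # MC
    subscore(0.7, 0.7), subscore(0.6, 0.8), subscore(0.6, 0.8),   # FR
])

for k in (1, 2):
    fa = FactorAnalysis(n_components=k, random_state=0).fit(scores)
    print(f"{k}-factor model, mean log-likelihood: {fa.score(scores):.3f}")
# If adding a second factor barely improves fit, a single underlying
# characteristic is the more parsimonious account of both formats.
```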

In another interesting study, Bridgeman and Lewis (1994) compared the effectiveness of the multiple-choice and essay sections of several Advanced Placement exams for predicting subsequent college course grades. The researchers analyzed data from more than 7,000 students from 32 public and private colleges. For the biology and American history exams, scores on the multiple-choice sections correlated more highly with first-year college grade point average (GPA) than essay scores did. Further, composite scores using both multiple-choice and essay items did not predict first-year GPA any better than multiple-choice items alone. For the English and European history exams, multiple-choice and essay sections correlated equally well with first-year GPA. For no subject test was the essay portion more highly correlated with college GPA than the multiple-choice portion. Bridgeman and Lewis acknowledged that Advanced Placement exams are not designed to predict future college performance, but the observed pattern of correlations does not support the notion that the essay and multiple-choice portions of the exams are measuring different things, nor that the essays are better measures of deeper knowledge or understanding.

Although most researchers comparing test formats study achievement tests designed to assess acquired knowledge, Ward (1982) examined format equivalence on a test of verbal aptitude. Based on previous research, Ward suspected that tests requiring examinees to produce an answer might demand different skills than tests requiring examinees to select an answer. He gave verbal aptitude tests containing multiple-choice items and three open-ended formats of varying complexity to 315 college students. After correcting for measurement error – the imprecision inherent in measuring any characteristic – the median correlation between test formats was .80 on a scale where 1.00 would indicate a perfect association between formats. A factor analysis revealed a single primary factor underlying all item types. Ward concluded that with respect to verbal aptitude, open-ended items provide little information not assessed by multiple-choice items, and both types of items assess similar abilities.
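
The correction referred to here is conventionally done with Spearman's formula for disattenuation, reproduced below for context; it is offered as a general illustration rather than a claim about the exact computations Ward performed.

```latex
% Spearman's correction for attenuation.
% r_xy is the observed correlation between the two test formats;
% r_xx and r_yy are the reliabilities of the two tests.
\[
  \hat{r} \;=\; \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}
\]
% For example, an observed correlation of .70 between two sections with
% reliabilities of .80 and .95 corrects to .70 / \sqrt{.80 \times .95} \approx .80.
```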

Rodriguez (2003) used meta-analysis, a technique for combining the results from multiple studies, to examine the equivalence of test formats across a variety of domains and educational levels. He combined the results from 67 studies and found that the average correlation between multiple-choice and open-ended tests, after correcting for measurement error, was .87. There was some variation in the correlations across studies based on whether the items with different response formats contained similar wording, but all correlations between test formats designed to measure the same content knowledge or cognitive process were very high, suggesting that tests in different formats are assessing similar characteristics. Rodriguez further noted that the correlation between different test formats was very similar regardless of whether the study had been conducted with primary, secondary, or post-secondary students.
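
For readers unfamiliar with the mechanics of such a synthesis, the sketch below shows one standard way to pool correlations across studies: Fisher's z-transformation, inverse-variance weighting, and a DerSimonian-Laird estimate of between-study variance. The study correlations and sample sizes are invented for illustration, and the code is not a reproduction of Rodriguez's analysis.

```python
# Illustrative random-effects pooling of correlations (hypothetical data).
import numpy as np

r = np.array([0.85, 0.90, 0.82, 0.88])   # made-up study correlations
n = np.array([120, 300, 150, 200])       # made-up study sample sizes

z = np.arctanh(r)                        # Fisher z-transform of each correlation
v = 1.0 / (n - 3)                        # approximate sampling variance of z
w = 1.0 / v                              # fixed-effect (inverse-variance) weights

# DerSimonian-Laird estimate of between-study variance (tau^2)
z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)
c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(r) - 1)) / c)

w_re = 1.0 / (v + tau2)                  # random-effects weights
z_re = np.sum(w_re * z) / np.sum(w_re)
print(f"Pooled correlation: {np.tanh(z_re):.3f}")  # back-transform to r
```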

Perhaps most relevant to teachers deciding what test format to use in their classes are studies conducted in actual classroom settings. In one study conducted more than four decades ago, Bracht and Hopkins (1970) gave students in five college psychology courses an exam containing 24 multiple-choice items assessing content from assigned reading, and two essay items designed to measure higher-order thinking such as application and analysis. A unique strength of this study is that the essays were scored by course instructors who were trained to apply a carefully constructed rubric. The correlations between the multiple-choice and essay sections, after correcting for measurement error, ranged from .81 to .95. Bracht and Hopkins concluded that many common concerns about multiple-choice tests are not grounded in empirical evidence, and that empirical data are not consistent with the claim that multiple-choice and essay exams assess different things.

In one very impressive study in a classroom setting, Hancock (1994) designed items to evaluate both knowledge and higher-order thinking skills. Hancock cited many common criticisms of multiple-choice tests, most notably that they can measure only knowledge and that open-ended items are necessary for measuring important thinking skills. He suggested that many critics fail to define what they mean by higher-order thinking and assume that essay exams measure complex thinking when often they do not. Hancock used Bloom’s (1984) taxonomy of learning objectives as a framework for constructing tests for two undergraduate courses. Bloom’s well-known framework hierarchically categorizes levels of learning ranging from memory of content knowledge to evaluation of information based on evidence. Each exam in Hancock’s study contained both multiple-choice and open-ended items assessing course content along four of Bloom’s dimensions: knowledge, comprehension, application, and analysis. Like other researchers, Hancock found that the two types of items were highly correlated, and that a single factor appeared to underlie both types of tasks – suggesting that the two test formats are comparable in terms of what they assess. He acknowledged that multiple-choice and open-ended items may at times demand somewhat different skills, but the skills are highly correlated. Hancock asserted that a multiple-choice test must be carefully designed if it is to measure higher thinking skills, but this is equally true of open-ended exams. He noted common assumptions that open-ended items measure complex thinking and multiple-choice items cannot measure complex thinking, and concluded that both assumptions are incorrect.

There is at least one specific academic skill – writing ability – that may not be adequately assessed by multiple-choice tests. Ackerman and Smith (1988) cited evidence that for tests of writing, multiple-choice and essay formats may measure different abilities. These researchers studied more than 200 tenth-grade students who took a multiple-choice test of basic skills such as spelling and punctuation, as well as more complex writing skills such as verbal expression and appropriate paragraph structure. Two weeks after taking the multiple-choice test, the students completed a free-response essay exam, which was scored by six English teachers who were not employed at the school from which the students were recruited. The researchers concluded from the students’ test scores that when assessing writing, multiple-choice and essay exams provide different types of information. This is perhaps unsurprising for two reasons. First, the multiple-choice and essay exams used in this study were specifically designed to assess different types of knowledge. Second, an essay test in writing assesses a specific skill rather than knowledge in some particular content domain. Since declarative and procedural memory represent distinct systems pertaining to factual information and skills, respectively, it makes sense that an adequate test of writing skills would require the procedural task of actual writing. Ackerman and Smith recommended that instructors assessing writing skills use both multiple-choice and essay formats, since multiple-choice items are effective for assessing the declarative aspects of writing and essay items are effective for assessing its procedural aspects.

Aside from the question of whether tests in differing formats assess similar constructs, a secondary criticism of multiple-choice tests is that many students are consistently disadvantaged by such tests. It is common to hear students assert that they are simply bad test-takers or do not do well on multiple-choice tests. Three researchers (Bleske-Rechek, Zeug, & Webb, 2007) recently investigated this issue. The researchers studied students in three different college psychology courses. The exams in these courses all contained both multiple-choice and short, open-ended items assessing the same content. All items were designed to assess application of concepts in addition to general retention and comprehension. The researchers compared discrepancies in performance between multiple-choice and open-ended sections across students and exams, and made two important observations. First, students’ performance was not usually discrepant across the two exam formats. In other words, students tended to perform at approximately the same level on both types of exam item. Second, even students whose multiple-choice and open-ended performance was discrepant on one exam tended not to repeat the pattern on other exams. In fact, students whose scores on one test suggested format discrepancies in favor of one type of item were just as likely to demonstrate the opposite pattern on other tests. Bleske-Rechek and colleagues concluded that “students were not consistently favored by one form of assessment over another” (p. 98).

The bulk of the existing research does not support claims that multiple-choice and other exam formats assess meaningfully different constructs, or that many students are disadvantaged by having to demonstrate their knowledge using a multiple-choice format. Although students perceive essay tests to be more effective for assessing knowledge (Zeidner, 1987), Bleske-Rechek and colleagues (2007) concluded that little is objectively gained by using such tests. Wainer and Thissen (1993) went so far as to challenge readers to provide any evidence they could to counter the conclusion, based on numerous large data sets, that the multiple-choice format is superior to the open-ended format when effective item writers design tests to assess specific knowledge. These authors emphasized, however, that rather than claiming that multiple-choice tests are always superior, they were simply insisting that conclusions should be based on data rather than rhetoric. In their words: “Departing from a test format that can span the content domain of a subject without making undue time demands on examinees and that yields objective, reliable scores ought not to be done without evidence that the replacement test format does a better job” (p. 116).

Students’ beliefs about test formats certainly seem to affect their approach to studying. Although many students complain that multiple-choice tests are unfair, Zeidner (1987) found that most students tend to view multiple-choice tests more positively than essay tests – believing the former to be easier, clearer, fairer, and less anxiety-provoking than the latter. In one study (Kulhavy, Dyer, & Silver, 1975), researchers found that students prepare differently for multiple-choice exams than they do for open-ended exams, but the results of the study did not show that students’ study strategies for multiple-choice tests are necessarily less effective. In a subsequent study, Rickards and Friedman (1978) found that students expecting to take an essay exam took better quality notes than students expecting a multiple-choice test, but that these differing approaches did not lead to differences in subsequent test performance.

In recent years there has been a great expansion of research on how test-taking can actually enhance student learning – a phenomenon known as the testing effect. Most research on the testing effect has been based on multiple-choice tests. This research illustrates that the potential utility of multiple-choice tests goes well beyond their efficiency. For example, Roediger and Marsh (2005) had undergraduate students read nonfiction reading comprehension passages and then take a multiple-choice test on the content. A short time later, the students took a recall test on the same material. Although taking the initial multiple-choice test led students to provide some incorrect information on the recall test, students also answered more recall items correctly as a result of having taken the multiple-choice test – despite the fact that they received no feedback on their multiple-choice performance.

In a review of research on the testing effect (Marsh, Roediger, Bjork, & Bjork, 2007), researchers cited many studies demonstrating that testing improves performance on subsequent tests. Marsh and colleagues acknowledged that multiple-choice tests expose students to incorrect information in the form of incorrect response options, but they argued that the benefits of multiple-choice testing with respect to enhancing memory and improving later test performance outweigh the effects of misinformation. Butler and Roediger (2008) further demonstrated that providing feedback to students after they respond to multiple-choice items strengthens the testing effect and also reduces the amount of misinformation retained by students. Importantly, students’ likelihood of retaining misinformation from having been exposed to incorrect alternative answers was reduced whether the feedback they received came immediately or after a delay. Instructors can therefore maximize the learning benefits of multiple-choice testing whether or not it is practical to provide feedback immediately.

In addition to enhancing performance on subsequent multiple-choice tests, recent evidence suggests that taking multiple-choice tests can enhance later performance on tests in another format. A group of researchers (Little, Bjork, Bjork, & Angello, 2012) had undergraduate participants read nonfiction passages and then complete either a multiple-choice or completion test on the material. After a short delay, all students took another completion test. The researchers found that taking the multiple-choice test led to better subsequent performance than taking the completion test. Moreover, taking the multiple-choice test slightly enhanced student performance on subsequent items related (but not identical) to those on the original test, whereas taking a completion test actually led to poorer performance on subsequent related items. Little and colleagues concluded that multiple-choice tests can enhance learning of content that is specifically tested, as well as related content associated with plausible incorrect alternative response options – conclusions echoed in other current literature (e.g., Glass & Sinha, 2013). Although the learning benefits of taking tests may not be limited to the multiple-choice format, the demonstrated positive effects of responding to multiple-choice items further demonstrate that the multiple-choice format has value.

Given that realities in contemporary education are likely to continue to make multiple-choice testing necessary, it is fortunate that a rich literature exists to assist instructors and other professionals who must construct such exams. Cantor (1987) noted that multiple-choice is the most popular testing format because it can be used to assess knowledge of many different subject areas as well as higher-order thinking. Cantor and others (Aiken, 1982; Stupans, 2006) provide useful guidelines for writing good multiple-choice items. In addition, Haladyna, Downing, and Rodriguez (2002) conducted a comprehensive review of research and textbook guidelines for writing multiple-choice items. Their work is an excellent resource for instructors wishing to maximize the reliability, validity, and perceived fairness of their multiple-choice tests. Appleby (2008) even developed a teaching exercise to help students recognize the “myth” that multiple-choice tests can assess only simple recognition and rote memory (p. 119): students read a brief passage about different memory systems and then consider multiple-choice questions designed to assess different levels of thinking about the passage content.

In light of practical educational realities it appears that multiple-choice tests are here to stay. As class sizes at many schools continue to increase, there is a corresponding necessity to develop assessment instruments that are both valid and efficient. Fortunately, research to date suggests that many common concerns about multiple-choice testing are exaggerated or unfounded. Multiple-choice items allow instructors to assess student knowledge of broad content domains in a short period of time. Although open-ended test formats appear to permit instructors to assess higher cognitive processes and greater depth of content knowledge, evidence is sparse that typical free-response tests achieve such objectives – or that open-ended and multiple-choice tests consistently assess different things. Tests are tools and, as is always the case, choosing the right tool depends on an accurate assessment of the specific objectives one wishes to achieve.

References

  1. Ackerman, T. A. & Smith, P. L. (1988). A comparison of the information provided by essay, multiple-choice, and free-response writing tests. Applied Psychological Measurement, 12, 117–128.
  2. Aiken, L. R. (1982). Writing multiple-choice items to measure higher-order educational objectives. Educational and Psychological Measurement, 42, 803–806.
  3. Appleby, D. C. (2008). A cognitive taxonomy of multiple-choice questions. In: L. T. Benjamin Jr. (Ed.), Favorite activities for the teaching of psychology (pp. 119–123). Washington, DC: American Psychological Association.
  4. Bennett, R. E., Rock, D. A., & Wang, M. (1991). Equivalence of free-response and multiple-choice items. Journal of Educational Measurement, 28, 77–92.
  5. Bleske-Rechek, A., Zeug, N., & Webb, R. M. (2007). Discrepant performance on multiple-choice and short-answer assessments and the relation of performance to general scholastic aptitude. Assessment & Evaluation in Higher Education, 32, 89–105.
  6. Bloom, B. (Ed.). (1984). Taxonomy of educational objectives, book 1: Cognitive domain. 2nd edn. New York: Longman.
  7. Bracht, G. H. & Hopkins, K. D. (1970). The communality of essay and objective tests of academic achievement. Educational and Psychological Measurement, 30, 359–364.
  8. Bridgeman, B. & Lewis, C. (1994). The relationship of essay and multiple-choice scores with grades in college courses. Journal of Educational Measurement, 31, 37–50.
  9. Butler, A. C. & Roediger, H. L. (2008). Feedback enhances the positive and reduces the negative effects of multiple-choice testing. Memory & Cognition, 36, 604–616.
  10. Cantor, J. A. (1987). Developing multiple-choice test items. Training and Development Journal, 41, 85–88.
  11. Demystifying learning disabilities (n.d.). Available at: http://www.emory.edu/ACAD_EXCHANGE/2000/octnov/learningdis.html.
  12. Frederiksen, N. (1984). The real test bias: Influences of testing on teaching and learning. American Psychologist, 39, 193–202.
  13. Glass, A. L. & Sinha, N. (2013). Multiple-choice questioning is an efficient instructional methodology that may be widely implemented in academic courses to improve exam performance. Current Directions in Psychological Science, 22, 471–477.
  14. Haladyna, T. M., Downing, S. M., & Rodriguez, M. C. (2002). A review of multiple-choice item-writing guidelines for classroom assessment. Applied Measurement in Education, 15, 309–334.
  15. Hancock, G. R. (1994). Cognitive complexity and the comparability of multiple-choice and constructed-response test formats. Journal of Experimental Education, 62, 143–157.
  16. Kulhavy, R. W., Dyer, J. W., & Silver, L. (1975). The effects of notetaking and test expectancy on the learning of text material. Journal of Educational Research, 68, 363–365.
  17. Little, J. L., Bjork, E. L., Bjork, R. A., & Angello, G. (2012). Multiple-choice tests exonerated, at least of some charges: Fostering test-induced learning and avoiding test-induced forgetting. Psychological Science, 23, 1337–1344.
  18. Lukhele, R., Thissen, D., & Wainer, H. (1994). On the relative value of multiple-choice, constructed-response and examinee-selected items on two achievement tests. Journal of Educational Measurement, 31, 234–250.
  19. Marsh, E. J., Roediger, H. L., Bjork, R. A., & Bjork, E. L. (2007). The memorial consequences of multiple-choice testing. Psychonomic Bulletin & Review, 14, 194–199.
  20. Rickards, J. P. & Friedman, F. (1978). The encoding versus the external storage hypothesis in note taking. Contemporary Educational Psychology, 3, 136–143.
  21. Rodriguez, M. C. (2003). Construct equivalence of multiple-choice and constructed-response items: A random effects synthesis of correlations. Journal of Educational Measurement, 40, 163–184.
  22. Roediger, H. L. & Marsh, E. J. (2005). The positive and negative consequences of multiple-choice testing. Journal of Experimental Psychology: Learning, Memory, and Cognition, 31, 1155–1159.
  23. Stupans, I. (2006). Multiple-choice questions: Can they examine application of knowledge? Pharmacy Education, 6, 59–63.
  24. Thissen, D., Wainer, H., & Wang, X. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113–123.
  25. Wainer, H. & Thissen, D. (1993). Combining multiple-choice and constructed-response test scores: Toward a Marxist theory of test construction. Applied Measurement in Education, 6, 103–118.
  26. Ward, W. C. (1982). A comparison of free-response and multiple-choice forms of verbal aptitude tests. Applied Psychological Measurement, 6, 1–11.
  27. Yermish, A. (2010). Bubble, bubble, toil and trouble … (multiple choice exams). Available at: https://davincilearning.wordpress.com/2010/12/23/bubble-bubble-toil-and-trouble-multiple-choice-exams.
  28. Zeidner, M. (1987). Essay versus multiple-choice type classroom exams: The student’s perspective. Journal of Educational Research, 80, 352–358.