In this chapter, we present a list of guidelines for writing MC items and demonstrate their application with examples. After some background on the development of the guidelines, we provide recommendations on the selection of item content. We then review the basic structure of the MC item and present the item-writing guidelines in detail.
Item writers now have a strong set of item-writing guidelines that support item development and validation (see Haladyna and Rodriguez, 2013). Although these guidelines have mostly been evaluated in the context of K–12 classrooms, there is some evidence to support their use in college classrooms. Moreover, many of these guidelines are commonly found in the item-writing training materials of large-scale testing programs, including K–12 state testing programs; college, graduate, and professional school admissions testing programs; and certificate and licensure testing programs.
Measurement specialists and researchers have accumulated evidence of the effectiveness of professional development opportunities to improve item-writing skills of college instructors. We know that college instructors can improve their item-writing and test-development skills. With some training, instructors can write items that are more cognitively challenging, with better-functioning distractors and fewer item-writing flaws (Abdulghani et al., 2015; Naeem, van der Vleuten, and Alfaris, 2012). Some researchers have focused on the use of MC items in the health sciences, including medicine (Downing, 2005; Jozefowicz et al., 2002) and nursing (Tarrant, Knierim, Hayes, and Ware, 2006; Tarrant and Ware, 2008), illustrating how common and problematic item-writing flaws can be. Some of these flaws tend to be about the stem or question, including unfocused stems, negatively worded or contradictory stems, or stems with unnecessary information. Most flaws tend to be about the options, including no correct or multiple correct answers, implausible distractors, more details in the correct answer, or options with clues. These are addressed below.
The MC item-writing guidelines address four aspects of the item: the content, the structural format, the stem, and the options. Each guideline is presented in a systematic manner: a statement of the guideline, details on why it is important (evidence based when possible), and an example item that violates the guideline, followed by an improved version of the item.
Your instructional learning objectives provide the first level of support for identifying the appropriate content for test items. Beyond this, if you use a textbook, many textbooks now have learning objectives at the beginning of each chapter, which attempt to identify key learning targets. However, these may not align with the instructional approach you take in your course, so published objectives should be reviewed carefully in light of your own intentions.
Good resources for test item content are class lectures, discussions, and activities. Following each session, keep a list of main ideas, concepts, and issues discussed. Take particular note of the challenges students faced or misconceptions they may have offered in discussions or questions. Often, questions raised by students can be rephrased to become excellent test items. When you collect these ideas for test items, it is often helpful to write them in declarative complete sentences. This makes the content available for a number of item formats (reviewed earlier).
Another source of test items is the item bank that accompanies your textbook if you are using a textbook. However, these items are typically quickly written, often by undergraduate or graduate students. They have not necessarily been piloted or reviewed by measurement specialists and almost never come with quality information—no evidence of their statistical quality. In our review of textbook item banks, many are poorly written and contain multiple item-writing flaws.
Because of this, we used item banks from many different college course textbooks to provide examples of poor and better test items in the next chapter. And, to be sure, we are not content experts in every possible field, so we may have missed technical flaws in many of these items, as the question itself or the options may be technically inadequate, inappropriate, or simply wrong. It’s nearly impossible to know if an item has only one best answer, no best answer, or multiple best answers unless you are a subject-matter expert. It is your subject-matter expertise that makes you qualified to judge the content adequacy of a test item. So if you use a textbook item bank as a resource, be sure to review the item using the item-writing guidelines we provide here. And feel free to edit, modify, or alter the item to meet your instructional learning objectives.
Example:
What will improve test score reliability? [STEM]
These guidelines are primarily based on the work of Haladyna and Rodriguez (2013), reflecting decades of item-writing research and guidance. In addition, we reviewed dozens of guidelines from offices of teaching and learning at a variety of colleges and universities to secure the most comprehensive guidance available. Some of these guidelines have a research basis. The empirical evidence for these guidelines is comprehensively reviewed by Haladyna, Downing, and Rodriguez (2003) and Haladyna and Rodriguez (2013). When a guideline is research based, we note this in its introduction.
Each guideline is described here, including examples of MC items with poor and better versions. In Chapter 6, we provide many examples of items from different fields, again including poor and better versions.
The main goal here is to make each item as direct and precise as possible—clearly stating the question, challenge, or problem being posed to the student. Many classroom test items are flawed because of failure to understand and follow this guideline.
The poor version is a common format, but it is asking for two different elements of content. In addition, option C is not plausible, as “zero” is too absolute. This item is less problematic than most, since both pieces of information are unique in each option. Consider this version:
In both cases, we can improve measurement by focusing on one aspect of the task, which allows us to be more certain about the nature of student understanding or misunderstanding.
Sometimes the complexity is subtle. The following item poses a question with two different scenarios when only one is required.
Avoid using direct quotes, examples, and other materials from textbooks, reading assignments, and lectures from class. This invites simple recall and remembering the materials. Novel material and contexts let you tap cognitive tasks that are more important and that measure the ability of students to generalize their knowledge and skills. The poor example that follows is based on the correlation item from guideline 1. With the context now removed, we’re assessing recall of information rather than application to a new situation.
At the end of this chapter, we provide a list of recommendations and question templates that can be used to measure higher-order thinking.
This concern is more common with constructed-response items, particularly in mathematics, where the solution from one question is used to solve the next question. But this is also important in MC items.
The correct response to one item should not depend on the response to another item. If the student gets the first item incorrect, and the next item depends on this response, the next item will likely be answered incorrectly as well. You want each item to provide unique and independent information about a student’s KSAs.
Poor: If the mean is 50, the median is 40, and the mode is 30, what is the shape of the distribution?
Based on this distribution (from the previous question), is the distribution symmetric?
Another way this can create problems is by providing clues to the correct answer. If the following question accompanied the first poor question, it might help the student realize the correct option to the preceding item.
A distribution that is positively skewed has one tail “pulled” or extended in which direction?
It is best to create unique and independent items. Many times, multiple items must be written that tap important content, but they should not be connected in any way.
This guideline should really state: Test your course and instructional learning objectives. And those learning objectives should declare the important elements of knowledge, skills, and abilities that result from taking the class. There are many possible examples of overly specific questions that students should not be required to memorize.
Consider the following question from a course on educational and psychological measurement, which focuses on application of the principles of measurement.
There is no real answer to this question (it is hypothetical). Understandably, we might all want to cry when faced with such a question. But we can salvage it.
Better: Consider the following test question:
The reliability of the Hypochondria scale of the MMPI-2 is
The item-writing flaw in this item is that it
A better option might be to offer a series of MC items, each with an item-writing flaw, with a list of guidelines. The students can then match or identify the guideline being violated in each MC item.
Sometimes the content is simply too general. Not only do these tend to be easy items, but we also lose the opportunity to test content that is more important—content we want students to understand and be able to apply, analyze, and so on.
The poor question is very general and doesn’t really tap specific knowledge or understanding about validity—hopefully we all know now that we validate test score interpretations and uses. It is much more important to inquire about students’ understanding of the characteristics and principles of validity and what affects it.
Opinion items often don’t have a clear and consistent best option, as the answer depends on whom you ask. However, if the point is to connect the opinion with a specific person, then the item must include the source of the opinion. This is one of the dangers of true-false items; they are often written so that they appear to be opinions.
This item still has a problem since the word “school” appears in both the stem and the options (clang association). If the term “school” is not used in the stem, then the level of accountability is left ambiguous, and any option could be correct. Unfortunately, it is sometimes not easy (or even possible) to ask questions we want without providing some level of clue to students.
Trick items are a different story. There is very little research on the use of trick items, but it is a common complaint from students. Dennis Roberts (1993) surveyed 174 college students and 41 college faculty, asking them to define trick questions. Nearly half reported that MC items can be trick items, with fewer reporting that true-false (20%), short-answer (3%), matching (1%), essay (1%), or any item format (27%) could be tricky. Most participants also reported that when trick questions appear on the test, it is deliberate. There were several themes that emerged from the definitions of trick items from these college students and faculty:
Roberts then created a 25-item test, with 13 of the items being trick items that included many of the issues presented in the tricky-item-characteristic list above (irrelevant content, ambiguous stems, principles presented in the opposite way from how they were introduced in class, and others). More than 100 students were asked not to answer the items but simply to rate each one on the extent to which it might be a trick item, on a scale from 1 (not a trick item) to 4 (definitely a trick item). Students were not effective at distinguishing between trick items and non-trick items: they were more accurate in identifying the non-trick items as such but far less able to correctly identify the trick items.
Here are two of the example trick items from Roberts:
This item is tricky because it is testing trivial content (knowing the sum is the mean times the sample size), but more importantly, it contains irrelevant information about other descriptive statistics.
This item is tricky because of the ambiguity in the stem. Are the test scores X or Y? Is the relationship described by .4 a correlation? Is “variability” the same as variance?
Another example of a tricky item is one similar to an item found in an educational measurement textbook item bank. Consider the following item.
This is a tricky item because there are two things being considered in the list. One is the predictor itself: which predictor of school grades might be most useful? We probably believe successful students are committed to learning. Note that positive identity is a strange characteristic, typically meaning self-assured and motivated. And of course we hope that successful students should not be mentally distressed (although it may depend on the school!). But the most accurate prediction is simply a function of the largest correlation, regardless of its sign or the construct being measured. We often think about “prediction” being positive, but statistically, we get more accurate prediction from the correlation with the largest absolute value; it doesn’t matter what the predictor is.
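To make the statistical point concrete, here is a minimal simulation sketch (our own illustration with hypothetical data and made-up correlation values, assuming NumPy is available) showing that prediction error depends on the absolute value of the correlation, not its sign.

```python
import numpy as np

# Hypothetical predictor and two outcomes correlated with it at +.30 and -.60.
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)

def simulate_outcome(x, r, rng):
    """Return an outcome correlated with x at (approximately) correlation r."""
    noise = rng.normal(size=x.size)
    return r * x + np.sqrt(1 - r**2) * noise

for r in (0.30, -0.60):
    y = simulate_outcome(x, r, rng)
    slope, intercept = np.polyfit(x, y, 1)          # least-squares prediction of y from x
    rmse = np.sqrt(np.mean((y - (slope * x + intercept)) ** 2))
    print(f"r = {r:+.2f}  RMSE of prediction = {rmse:.3f}")

# The predictor with r = -.60 yields the smaller prediction error,
# even though the correlation is negative.
```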
MC items can take up a fair amount of room on a page, especially if there are four or five or more options listed one per line vertically on the page. But this is the preferred formatting, as it avoids potential errors due to formatting the item horizontally, which we sometimes want to do to save room. Consider the following examples:
In the horizontal format, it’s possible to read through the row of letters and numbers and inadvertently select the corresponding letter following the answer choice rather than the letter that precedes it. It can be confusing to tell which letter goes with which option. This is especially possible if the space between options is reduced:
This should go without saying. If you write items throughout the course of the term, when you start to put them together in a test form, you can review your earlier work and potentially edit items with fresh eyes. Even we occasionally administer class tests that contain typos—oftentimes discovered by students during the exam. This is where we reluctantly get up in front of the class and make the correction announcement, or write the correction on the board. But the best way to limit errors is to have a colleague or teaching assistant review the test before you administer it to the class.
In mathematics and other technical fields, the potential for typographical errors is great. Small errors can indicate very different things. Consider the following:
Poor: (2.5 × 10³⁾
Here, the closing parenthesis is in the superscript. Is this supposed to be (2.5 × 10³) or is it supposed to be (2.5 × 10)³?
Poor: The concepts of reliability and validity of test scores is …
Grammar must be consistent (“is” should be “are”).
The word “each” should not be underlined. If you do underline key words, there should be one underlined in each option. But such writing techniques tend to be distracting and not helpful.
This is a challenging guideline. We know from a great deal of research that language complexity often interferes with our ability to measure substantive KSAs. This is particularly important for students with some kinds of disabilities and for students who are nonnative speakers of the language of the test. Unless you are testing language skills and the complexity of reading passages and test questions is relevant to instruction and reflecting the learning objectives, language should be as simple and direct as possible. We don’t want unnecessarily complex language in our test items to interfere with our measurement of KSAs in mathematics, science, business, education, or other areas. Even in language-based courses (e.g., literature), the complexity of the language used in test items should be appropriate for the students and reflect the learning objectives.
Unless it was part of the curriculum in the course, the term “attenuate” is not one commonly understood to be synonymous with “decrease.” A better version of this stem is provided as an example for guideline #7.
Words matter, particularly in diverse classrooms that are likely to include first-generation college students, nonnative speakers of the language of instruction (for most of us, nonnative English speakers), and even students who are not majoring in the field of the course. Terms and language that we might take for granted can make the difference between understanding a question or not. If we want to test for understanding, application, or other higher-order skills, we don’t want language to get in the way—unless of course the point of the question is knowledge of language or terminology.
Things to avoid in item writing, unless it is part of the learning objective:
For more information on creating test items that minimize irrelevant complex language, particularly for nonnative English speakers, see the work of Jamal Abedi (2016), who provides many examples of how language can interfere with measuring KSAs.
This is related to but different from the previous guideline on minimizing language complexity. Another way to address the same issue is to reduce the amount of reading in the context materials and test items—unless of course reading is the target of measurement. If the learning objectives are about reading skills and abilities, then the reading load should be appropriate. But in most subject-matter tests, reading skills may interfere with our ability to measure KSAs in the (nonreading) subject matter. Reducing the amount of reading also has the effect of reducing the cognitive load, so that the task presented in the item is focused and allows us to interpret the response as a function of the intended content and cognitive level. In this way, we know how to interpret responses. And it reduces the amount of time required to take the test, possibly making room for additional test items, improving our coverage of the content and learning objectives.
The better version has a little more than one-third the number of words (36) compared to the poor version (95). Notice that the options are also more direct in the better version.
In this case, the number of words was reduced from 51 (poor) to 30 (better). In both cases, the poor versions started with context information that is irrelevant to the target of measurement.
Context is often used in mathematics and science tests, as we want students to be able to apply mathematical and scientific principles in context. But in many cases, the context is not required to solve the problem. If context is used, it must be essential to the question. Otherwise, it introduces irrelevant information that may interfere with measuring the intended learning objective.
Now, the poor version may be more interesting than the better version, but it does contain irrelevant information and requires more time and effort to read; the number of words was reduced from 48 to 14.
The fact that Maria José is doing a study of test anxiety is interesting but not needed in order to answer the question. Both versions of this item also have something called a “clang-association” which is described in what follows (guideline 22c), where “positively skewed” appears in the stem and is also one of the options. But in this case, it is an important element of the question. There is a common misconception that standardizing scores also normalizes distributions—which is not true. It would be awkward to avoid this clang association:
Introducing “a heavier right tail” also suggests a strange variation in the kurtosis of the distribution rather than a simple focus on skewness. It also requires an additional inference: When one tail is heavier, it is probably skewed. But the focus of the item is on the fact that transforming scores to z-scores does not change the shape of the distribution.
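As a quick illustration of the underlying point (our own sketch with simulated data, not an item from the text; assumes NumPy and SciPy are available), the linear z-score transformation leaves the skewness of a distribution unchanged:

```python
import numpy as np
from scipy.stats import skew

# Simulated, positively skewed raw scores (hypothetical data)
rng = np.random.default_rng(1)
scores = rng.chisquare(df=3, size=5_000)

# Linear z-score transformation: changes location and spread only
z_scores = (scores - scores.mean()) / scores.std()

print(f"skewness of raw scores: {skew(scores):.3f}")
print(f"skewness of z-scores:   {skew(z_scores):.3f}")  # same value, up to rounding
```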
There is some research on this guideline. The evidence suggests that either format is effective for MC items. The following examples illustrate the same item in both formats.
Complete: What is the difference between the observed score and the error score?
Phrase: The difference between the observed score and the error score is the
Notice that when we use the phrase-completion format, each option starts with a lowercase word and ends with a period. Each option is written so that it completes the stem. We often see other formats that are not appropriate.
Ideas for questions or statements for test items can come from many places, including textbook test banks, items in online resources, and course materials, lectures, and discussions. Our most valuable sources of test items are the interactions we have as instructors with our students. Discussions, questions, and comments made during class are full of potential test items. Unfortunately, textbook item banks are not carefully constructed, and often the majority of items contain item-writing flaws. We use textbook item banks as a source for many of the example items in Chapter 6.
If you use textbook test banks for your test items, be sure to modify the item to be consistent with these guidelines.
The goal is to make the question clear and unambiguous. This is a challenge when the stem is a phrase or part of a statement that is completed in the options. When the item stem does not clearly convey the intended content, too much is left up to the student for interpretation, leaving greater ambiguity in the item and the student’s response. In these cases, there are often multiple correct answers, since the distractors are written to be plausible.
This item also was flawed because the word “percentile” appears in both the stem and the first option. To correct that, we simply removed the word percentile from the stem.
We’ve seen examples such as the following four. Each of these shows the entire stem.
Poor: In the 1970s …
Psychotherapy is …
According to the textbook …
Gender …
These stems are simply inadequate and create a great deal of ambiguity and possibly multiple correct options. A good stem is one that allows a student to hypothesize or produce the correct response directly from the stem without reading the options.
This guideline has some empirical research behind it. The trouble with negatively worded stems occurs when students overlook the negative word and respond incorrectly because of that rather than responding in a way consistent with their ability. In the health sciences, negatively worded stems are more common and serve an important purpose. In these cases, it is important for the test taker to distinguish among a set of conditions, contexts, symptoms, tests, or related options to identify the one that is not appropriate or relevant. In such cases, the negative word should be underlined, italicized, bolded, or even set in all caps, so the test taker understands what is being asked and doesn’t overlook the negative term.
This revised item also is flawed because two of the options start with a common word, “only,” making them a pair, and the word “only” is an absolute term that is extreme and would rarely be true. Usually this can be corrected by simply removing the word “only.”
Negative phrasing of the stem tends to make items slightly more difficult but also reduces test score reliability—suggesting that it introduces more measurement error in item responses and ultimately in total test scores.
This helps achieve earlier guidelines by minimizing the amount of reading. Repetitive words are unnecessary and can usually be consolidated in the stem, reducing the amount of reading effort and time for the student.
This reduced the number of words from 27 to 19 (a 30% reduction). This adds up quickly across items in a test.
The quality of the distractors is more important than the number of distractors. But nearly 100 years of experimental research on this topic is unanimous in its finding: Three options is sufficient (Rodriguez, 2005). If three is the optimal number of options, why do most tests have items with four or five options? In large part, this is because of the fear of increasing the chance of randomly guessing correctly. But this same body of research has shown that the effect of guessing does not materialize, particularly in classroom tests and tests that matter to the test taker. Even so, the chance of obtaining a high score on a test of three-option items is relatively low, especially if there are enough items (see Table 4.1).
However, in some cases, four options is the better choice, particularly when trying to achieve balance in the options. Especially for quantitative options, balance can only be achieved by including two positive and two negative options, two odd and two even options, or two high and two low options. We shouldn’t strive to write three-option items simply for the sake of writing three-option items, especially if it creates an unbalanced set of options. The real goal is to write distractors that are plausible and relevant to the content of the item and that might provide us with feedback about the nature of student errors.
Table 4.1 Chance of scoring 70% by guessing on three-option items

Number of test items | Chance to score 70%
10 | 1 out of 52
20 | 1 out of 1,151
30 | 1 out of 22,942
40 | 1 out of 433,976
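For readers who want to see where figures like these come from, here is a minimal check (our own sketch, not the authors’ calculation) under a simple binomial model of blind guessing on three-option items; the results may differ slightly from Table 4.1 depending on the rounding or approximation used there.

```python
from math import comb

def chance_of_70_percent(n_items, p=1/3):
    """Probability of answering at least 70% of items correctly by guessing alone,
    assuming independent guesses with probability p per item (binomial model)."""
    cutoff = -(-7 * n_items // 10)  # smallest k with k/n >= 0.70
    return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
               for k in range(cutoff, n_items + 1))

for n in (10, 20, 30, 40):
    prob = chance_of_70_percent(n)
    print(f"{n:2d} items: about 1 out of {round(1 / prob):,}")
```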
You do not need to use the same number of options for every item. The number of options should reflect the nature of the item, the important plausible distractors, and a single best option. In Chapter 6, nearly all of the example items from various fields have four or five options. A common edit could be the reduction of the number of options. Reducing the number of options to three reduces the time required to develop one or two more plausible distractors (see the next guideline) and reduces the reading time for students—particularly reading that is mostly irrelevant to the task at hand. In addition, by reducing the reading time, it’s possible to include a few more items and cover the content more thoroughly. This does far more for improving content coverage and the validity of inferences regarding what students know and can do.
To ensure distractors are plausible, they should reflect common misconceptions, typical errors, or careless reasoning. Some of the best distractors are based on uninformed comments that come from students during class discussions or errors students repeatedly make on assignments and other class projects.
We should be able to justify each distractor as plausible. Here, the distractors are not plausible:
Again, each distractor should be justifiable. Here, the distractors are plausible:
Both distractors are plausible and related to the stem but are not the best answer. Herein we have a great deal of control over the item difficulty. To make the item easier, we could write distractors that are more different (more heterogeneous) but also use real (plausible) scores, such as the T-score or z-score. Similarly, to make items more difficult, we could write distractors that are more similar (more homogeneous), such as the latent score, trait score, or something that is a kind of true score. These other types of scores may be plausible and correct, but only under certain kinds of measurement models that are not described in the item. To keep “true score” the single best answer when using even more similar terms such as these, it might be necessary to add context to the stem, such as “In Classical Test Theory …,” making “true score” the only best option.
Option A contains a word from the stem (validity), which is a clang-association (see guideline 22c) but will result in a more stable score. Option C should improve test score reliability since students will not have to guess if they run out of time. Option D is not plausible since statistical power does not apply to individual items.
Now each option will improve the quality of test scores, but only one affects content-related validity evidence.
Many of the example items in Chapter 6 illustrate this guideline. Very few four- or five-option items have three or four plausible distractors.
This is a challenge. Make the distractors plausible but not the best answer. The correct option should be unequivocally correct. This is another reason why it is important to have your test reviewed by a colleague, an advanced student, or the teaching assistant.
In the context of what was discussed in class (and in most measurement textbooks), standardizing scores does not change the skewness of the distribution, only the location and spread of the distribution on the score scale. However, there are methods of standardizing scores that are nonlinear and rarely used but that can actually change the shape of the distribution. One such transformation is the normal curve equivalent (NCE), which normalizes the scores, resulting in a normal distribution.
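As a small follow-up to the earlier z-score sketch (again our own illustration with simulated data; this is a generic rank-based normalization in the spirit of the NCE, not the exact NCE formula), a nonlinear normalizing transformation does change the shape of the distribution:

```python
import numpy as np
from scipy.stats import norm, rankdata, skew

# Simulated, positively skewed raw scores (hypothetical data)
rng = np.random.default_rng(2)
scores = rng.chisquare(df=3, size=5_000)

# Rank-based normalization: map percentile ranks onto a normal curve
percentile_ranks = (rankdata(scores) - 0.5) / scores.size
normalized = norm.ppf(percentile_ranks)

print(f"skewness of raw scores:        {skew(scores):.3f}")
print(f"skewness of normalized scores: {skew(normalized):.3f}")  # near zero
```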
Option A is also correct, since the sampling frame represents the population available from which to draw the sample. Option D is not of the same kind as the others and is not plausible.
This is one of the most common item-writing flaws we find in textbook item banks. It commonly results from limited knowledge of the subject matter or from hasty item writing. We identify this flaw in the example items in the next chapter. How do we catch such errors? Subject-matter expertise is one step, but even then, peer review is still helpful.
In order to avoid clues in how we order the options, particularly the location of the correct option, it is best to be systematic and adopt a simple rule when we order the options. This also helps support the student’s thinking and reading clarity. It’s distracting and confusing when a list of numbers or dates or names is ordered randomly—without structure. Using logical systems for ordering options reduces cognitive load and allows the student to focus on the important content issues in the item rather than trying to reorder or decipher the order of the options. Not to mention, it’s just a nice thing to do.
Notice we also removed one option, 1.00, as this is likely to be the least plausible. In addition to ordering the options numerically, we provided more white space between the letters and the numeric values, so the decimals are not immediately adjacent to each other. When using numbers as options, it is best to align them at the decimal point, again to support the reading effort and to remove additional irrelevant challenges.
There are a few problems with the poor version. First, there are two measurements that are “more” precise than the other. The measurements are not ordered in terms of their magnitudes (centimeters to kilometers), although this might be part of the problem you hope to have students address—but recognize that this does violate the first guideline: Test one aspect of content at a time. But the real issue is the ordering and alignment of the numeric values. The better version eliminates visual challenges and makes the options more accessible to the student.
The measurement community recommends that each option should be designated to be the correct option about equally often. This is to avoid the tendency to make the middle or the last option correct (which some of us have done). This is also something that many students believe: “If you don’t know, guess C!”
Traditionally, the guideline was to vary the location of the right answer according to the number of options. Tests were often devised so that every item had the same number of options, such as all items having five options. In this case, each option would be correct about 20% (1/5) of the time. But we now know that the number of options should fit the demands of the item, in terms of what’s plausible and what fits the specific question. So that version of the guideline doesn’t really work. We just need to make sure that all options have about the same probability of being correct so that one option is not obviously more likely to be correct.
By logically or numerically ordering the options, according to guideline 17, we likely avoid these tendencies. But even with this guideline, be sure to check through the key and make adjustments if necessary. Also, we don’t want a string of As being correct, or a string of Bs being correct, and so on. We want to mix it up and spread out the correct option across all options. Some students, particularly those who are less prepared, will look for patterns to capitalize on their chances of guessing correctly.
This is another way of saying: Make sure only one option is the best answer. If the options overlap, then multiple options may be correct.
The problem with the poor version is that B includes C, so both B and C are correct. This issue arises from the use of the term “approximate” in the stem.
In the poor version, A (less than chance) is also part of B (less than 50%). However, there is a more significant technical flaw in this item that only a well-versed item response theory expert will recognize (or a well-read graduate student). The correct answer is 50% for one- and two-parameter IRT models, but not for three-parameter models, in which case the probability of a correct response is greater than 50% (because of the influence of a nonzero lower asymptote).
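To make that last point concrete, here is a minimal sketch (our own illustration, with hypothetical parameter values) of the three-parameter logistic (3PL) model: when ability equals item difficulty, the probability of a correct response is (1 + c) / 2, which exceeds .50 whenever the lower asymptote c is above zero.

```python
import math

def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL IRT model."""
    return c + (1 - c) / (1 + math.exp(-a * (theta - b)))

a, b = 1.2, 0.0  # hypothetical discrimination and difficulty
for c in (0.00, 0.20, 0.25):
    print(f"c = {c:.2f}: P(correct | theta = b) = {p_3pl(b, a, b, c):.3f}")
# c = 0.00 reproduces the 1PL/2PL result of .500; a nonzero c pushes it above .50
```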
This guideline is one of the most studied empirically. Evidence suggests that it tends to make items slightly less discriminating or less correlated with the total score.
When we use none of the above as the correct option, students who may not know the correct answer select this option by recognizing that none of the options is correct. We want students to answer items correctly because they have the KSAs being tapped by the item, not because they know what’s not correct.
This is a challenging item. The poor version has a negative term in the stem (NOT) and then in the options (none), creating a double-negative situation. But, more importantly, the question isn’t a particularly useful one, since there are potentially many possible modifications to MC items. If a modification is not possible, then it’s probably not possible for any items. Even the better option is not ideal, since supporting higher-order thinking depends on the context, and C could also be a correct option for some graphical displays and depending on what is being asked of the student.
When we use all of the above as a correct option, students only need to know two of the options are correct, again allowing them to respond correctly with only partial knowledge. If the student recognizes that at least one of the options is incorrect, all of the above cannot be the correct option. This option provides clues in both cases, making it a bad choice in item writing.
The problem with the poor version is that a student only needs to know that two of the four options are correct to realize E is the correct answer. Another technical flaw is that, according to Samuel Messick (a prominent validity theorist), all validity evidence supports the construct interpretation of scores.
Finally, you may be tempted to use I don’t know as an option, but this leaves us without any information about student understanding, since they did not select the option that best reflected their interpretation of the item. If the distractors are common errors or misconceptions, we can obtain diagnostic information about student abilities. The I don’t know option eliminates this function of testing and wastes our hard effort to develop plausible distractors.
We’ve seen examples already of using negative words in the stem and sometimes in the options. Using negative words in the options simply increases the possibility that students will inadvertently miss the negative term and get the answer wrong, not because of misinformation or lack of knowledge but because of reading too quickly.
The poor version has a number of problems. First, it uses a vague quantifier in “a significant limitation.” But the primary problem here is the use of NOT within the correct option. For one thing, its use is unique among the options. Moreover, the first option is not a limitation, but it is also not true (since it is challenging to develop many MC items). The better version is still problematic, since all of the options are limitations, so we rely on the clarity of instruction to identify the most critical limitation—we trust that this is clear to students.
Furthermore, when it comes to developing MC items, we argue that the first three options in the better item can be addressed, but the exclusion of innovative responses is nearly impossible to avoid.
There are several ways that clues can be given to the right answer. These tend to be clues for testwise students—students who are so familiar with testing that they are able to find clues to the correct answer without the appropriate content knowledge. The six most common clues are included here.
This guideline has been studied empirically. The evidence consistently indicates that making the correct option longer makes items easier (because it provides a clue to the correct answer) and significantly reduces validity (correlations with other similar measures).
The most common error committed by classroom teachers and college instructors is to make the correct option longer than the distractors. We tend to make the correct option longer by including details that help us defend its correctness. Testwise students recognize that the longest option is often the correct option.
The wording on this guideline is deliberate. The options don’t need to be exactly the same length, which is not practical. They just need to be “about equal.” There should not be noticeably large differences in option length.
Words such as “always,” “never,” “all,” “none,” and others that are specific and extreme are rarely, if ever, true. These words clue testwise students, who know that these terms are rarely defensible.
The poor version has at least two issues. One is that A and C use the term “all.” The other is that A and C present a pair of options, since they have the same structure, which is different from that of B. This second issue is still somewhat of a problem in the better item.
Here, the poor version includes “complete” in option B, which was changed to “some” in the better version. Also, option A in the poor version is probably not true (especially these days with the negative environment around testing) and is likely the least plausible option, so it was removed.
This is also a common flaw for true-false items. Here, the term “only” is too difficult to justify, potentially leading students to select false.
Another clue to testwise students (most students actually) is when a word that appears in the stem also appears in one or more options. This is also a problem when a different form or version of the same word appears in both the stem and an option. There were multiple examples of this in some of the other item-writing guidelines, including example items for guidelines 9 and 15 (it’s a common error).
A clang association is a speech characteristic of some psychiatric patients. It is the tendency to choose words because of their sound, not their meaning. Sometimes the choice of words is because of rhyming or the beginning sound of the words. It is related to another speech pattern psychiatrists call “word salad,” a random string of words that do not make sense but might sound interesting together. We don’t want items to contain clang associations or take the form of a word salad!
The key point here is that we don’t want students to select options or discredit options because of a similar word or words that sound the same in the stem and the options.
The poor version contains the term “validity” in both the stem and option B. In addition, in the poor version, options A and B present a pair (see guideline 22d), since they are both “coefficients.” All of the distractors are forms of validity evidence, but only C directly addresses the content of the test.
When two or three options are similar in structure, share common terms, or are similar in other ways, students can often see which options to eliminate or which options might be correct.
The poor version has a number of problems, including the pair of options in A and B, both about the test itself, and “property” and “characteristic” are synonyms. If you know one is wrong, you know the other is wrong. In addition, the first sentence of the stem is unnecessary—window dressing. Also, in the better version, the distractors A and B are both forms of validity evidence but not the focus of validity itself.
Other examples of option pairs can be found in the example items presented in guidelines 12, 22b, and 22c.
There is no experimental research on this guideline itself, but the use of humor in testing has been written about. Berk (2000) found evidence that humor can reduce anxiety, tension, and stress in the college classroom. The bulk of the research on humor in testing has occurred in undergraduate psychology classes, and the results regarding its effects on anxiety and stress were mixed. Berk conducted a set of experiments using humorous test items and measuring anxiety among students in an undergraduate statistics course. Students reported that the use of humor on exams was effective in reducing anxiety and encouraging their best performance. Here are some examples of how Berk illustrated his recommendations to use humor in the stem or in the options.
Another example he provided is for the matching format, and although it is not an MC item format, it is a humorous example of how this guideline can be applied to other formats. This is slightly modified from the item presented by Berk (2000).
Questionnaire Item | Level of Measurement |
B Wait time to see your doctor | |
☐ 10 minutes or less | A. Nominal |
☐ more than 10 but less than 30 minutes | B. Ordinal |
☐ between 30 minutes and 1 hour | C. Interval |
☐ more than 1 hour and less than 1 day | D. Ratio |
☐ I'm still waiting | |
B Degree of frustration | |
☐ Totally give up | |
☐ Might give up | |
☐ Thinking about giving up | |
☐ Refuse to give up | |
☐ Don't know the meaning of "give up" | |
D Scores on the "Where's Waldo" final exam (0-100) | |
A Symptoms of exposure to statistics | |
☐ Vertigo | |
☐ Nausea | |
☐ Vomiting | |
☐ Numbness | |
☐ Hair loss | |
D Quantity of blood consumed by Dracula per night (in ml) |
Berk (2000) wisely recommended evaluating the use of humor if you decide to use it in your tests. He suggested giving a brief anonymous survey to students, including questions such as:
We recognize that Berk himself is a very humorous writer and measurement specialist. He is known for taking complex materials and controversial topics and making them understandable through humor. Think about how many social controversies have been clarified through cartoons (especially political cartoons). Sometimes it works. But we also recognize that one person’s humor may not be so humorous to another. We adopt the guidelines of McMorris, Boothroyd, and Pietrangelo (1997), who did an earlier review of research on the use of humor in testing. They recommended that if humor is used in testing, the following conditions should be ensured:
When options cover a variety of content, it becomes easier to detect the correct answer, even without much content knowledge. Similarly, a common clue to students is inconsistent or incorrect grammar. These are common item-writing flaws that can be avoided through good editing and review of the test items by peers.
This can also be an effective way to control item difficulty. As described in the section on item difficulty that follows, the extent to which options are similar and homogeneous often determines the difficulty of the task presented to students.
The poor version includes distractors that are related to score distributions but not to central tendency—A is about variability and B is about shape. In the better version, the two distractors are both about central tendency.
Items with absurd options also violate this guideline, since such options are not homogeneous with the other options. But the degree to which options are absurd can be subjective, sometimes including options that are just not plausible.
This item has a number of problems. First, Ronald Reagan was elected to the governorship of California twice (in 1966 and 1970) and the presidency twice (in 1980 and 1984)—so which election is the question asking about? Second, these are not all election years. There is a difference between when one is elected and when one takes office. Third, the years are not in order. Fourth, 2000 is not plausible for students who are studying the national leadership timeline. Finally, this item is tapping simple recall and is not particularly informative.
We now see that there are many elements of MC items that may contribute to difficulty. However, in classroom tests, we want item difficulty to be a natural reflection of the cognitive demands of the curriculum. In standardized testing, we typically want items with a wide range of difficulty to assess the KSAs of students across the full ability continuum. But in classroom assessment, assuming that students have had the opportunity to learn the content material and have made some effort to learn it, a large percentage of students should display mastery of the content on the test. Content mastery is, after all, the goal of instruction and the typical purpose for most courses.
Some measurement specialists have argued that for classroom assessments, the target for item difficulty should be about .70—that is, about 70% of the students should get an item correct. This may be a result of the conventional (yet unfounded) wisdom that “70% is passing.” However, this fails to account for the nature of the content, the quality of instruction and learning opportunities, and the purpose of the test—areas in which you now have much more expertise.
More important is our ability to write items that help us distinguish between students who have mastered the content materials and those who have not. We want to develop items so that students with the KSAs will answer the item correctly and students with misconceptions, misinformation, problem-solving errors, and the like will answer the item incorrectly. If everyone has learned the topic or issue presented in a given item, then the item p-value should be 1.0. If all of the students retain misconceptions or make a common error in solving a problem, then the item p-value should be 0.0. But in all cases, each item should cover important and relevant content.
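As a concrete reminder of what the item p-value is (a small sketch with made-up response data, not data from the text), it is simply the proportion of students who answered an item correctly:

```python
import numpy as np

# Rows are students, columns are items; 1 = correct, 0 = incorrect (hypothetical data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [1, 1, 0, 1],
])

# Item p-values: proportion correct for each item (column means)
p_values = responses.mean(axis=0)
print(p_values)  # item difficulties of 0.8, 0.8, 0.2, and 1.0 -- everyone answered the last item correctly
```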
As described, we can manipulate the difficulty of MC items most effectively through the development and selection of distractors (incorrect options). The distractors should capture common misconceptions, misinformation, and problem-solving errors. In addition, when distractors are plausible and closely related to the key, we can create items that distinguish students with deeper understanding and assess their ability to parse important nuances of KSAs.
This chapter described the general process of developing MC items, with recommendations on item structure and selection of tested content. MC item-writing guidelines were then provided, with poor and better examples for each one. In these exercises, you will apply your understanding of the guidelines by using them to write and evaluate MC items.
Abdulghani, H. M., Ahmad, F., Irshad, M., Khalil, M. S., Al-Shaikh, G. K., Syed, S., Aldrees, A. A., Alrowais, N., and Haque, S. (2015). Faculty development programs improve the quality of multiple choice questions items’ writing. Scientific Reports, 5 (9556), 1–7.
Abedi, J. (2016). Language issues in item development. In S. Lane, M. R. Raymond, and T. M. Haladyna (Eds.), Handbook of test development (2nd ed., pp. 355–373). New York, NY: Routledge.
Berk, R. A. (2000). Does humor in course tests reduce anxiety and improve performance? College Teaching, 48 (4), 151–158.
Downing, S. M. (2005). The effects of violating standard item writing principles on tests and students: The consequences of using flawed test items on achievement examinations in medical education. Advances in Health Sciences Education, 10, 133–143.
Haladyna, T. M., and Rodriguez, M. C. (2013). Developing and validating test items. New York, NY: Routledge.
Jozefowicz, R. F., Koeppen, B. M., Case, S., Galbraith, R., Swanson, D., and Glew, R. H. (2002). The quality of in-house medical school examinations. Academic Medicine, 77 (2), 156–161.
McMorris, R. F., Boothroyd, R. A., and Pietrangelo, D. J. (1997). Humor in educational testing: A review and discussion. Applied Measurement in Education, 10 (3), 269–297.
Naeem, N., van der Vleuten, C., and Alfaris, E. A. (2012). Faculty development on item writing substantially improves item quality. Advances in Health Sciences Education, 17, 369–376.
Roberts, D. M. (1993). An empirical study on the nature of trick test questions. Journal of Educational Measurement, 30 (4), 331–344.
Rodriguez, M. C. (2005). Three options are optimal for multiple-choice items: A meta-analysis of 80 years of research. Educational Measurement: Issues and Practice, 24 (2), 3–13.
Tarrant, M., Knierim, A., Hayes, S. K., and Ware, J. (2006). The frequency of item writing flaws in multiple-choice questions used in high stakes nursing programs. Nurse Education Today, 26 (8), 662–671.
Tarrant, M., and Ware, J. (2008). Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Medical Education, 42, 198–206.