EVALUATION

Types of Evaluation

Subjective Evaluation
Objective Evaluation

Types of Tests

Guidelines for Classroom Testing

Concepts of Testing

Sample Objective Test Items

Testing Procedures

Testing Competence
Testing Performance Skills
Testing Speaking
Testing Culture

Types of Grading Systems

Grading and Recording Grades

INTRODUCTION

Evaluation is inseparably related to both objectives and classroom procedures and must be given equal consideration as the class proceeds through the course. The teacher begins any given learning segment by establishing instructional goals. She then proceeds to plan activities that will enable the students to achieve these goals. In planning the lesson, the teacher is in effect hypothesizing that the planned procedures will foster and develop the desired learning by the students. She is anticipating what will work and what will not work with this particular group of students. Sometimes she anticipates correctly, and sometimes she does not. The important point, if she wants to improve her teaching, is not whether she fails to predict correctly in planning the lesson, but whether she evaluates the results in order to improve subsequent teaching activities.

The primary purpose of evaluation in the classroom is to judge achievement, both student and teacher. Evaluation of achievement is the feedback
that makes improvement possible. By means of evaluation, strengths and weaknesses are identified. Evaluation, in this sense, is another aspect of learning, one that enables the learner to grasp what was missed previously and the teacher to comprehend what can be done in subsequent lessons to improve her teaching. It is the final step in the sequence toward mastery of content and accomplishment of objectives. At times, poor students are not learning because they have failed to develop a practical learning strategy for second-language study. A test may provide clues to them or to their teachers as to what their problems are. Certainly, the teacher should continually stress the importance of evaluation as a learning device. In addition to improving the opportunity for learning, evaluation serves as a prime source of motivation for many students. After all, like most humans, they need extra incentives to operate at maximum capacity. The fact that teachers should never cease in their efforts to instill intrinsic desires in their students is no reason to abandon all forms of extrinsic motivation in the classroom.

Teachers need to evaluate constantly their teaching on the basis of student reaction, interest, and achievement. The conclusions drawn from this evaluation constitute their main source for measuring the effectiveness of instruction. In this sense, their first pressing need is to judge themselves. Second, they must accept the basic, and often burdensome, responsibility of judging the students' performance and attainment. In addition to teacher evaluation, the students are also participants in the evaluation process. Although they may not be so concerned as the teacher (some may be more so), they do, consciously or unconsciously, judge teacher effectiveness as well as their own level of learning in the course. If they are satisfied that they are learning, individual motivation and class participation will be high. If not, both the teacher and the students need to consider adjustments that will improve the situation.

TYPES OF EVALUATION

The term evaluation is an all-inclusive one that encompasses both subjective estimates of student work and formalized testing procedures. The advantages of subjective estimates are that the feedback which the teacher and students receive is immediate, constant, and informal. The weaknesses are that the value judgments reached are highly subjective, may be based on short-term learning, and, if given daily, may be confusing and burdensome to record in the grade book. On the other hand, tests have the advantage of being more concrete and thereby more objective. They also measure more material over longer periods of time, thereby stressing a greater degree of retention, organization, and comprehension of the material. The weakness lies in the fact
that tests cause such anxiety among some students that they are unable to perform to their capability. (It is appropriate here to add that such a reaction is rare in a class in which the teacher is maintaining a feeling of confidence and success among the students and in which his tests are a true reflection of his goals and classroom activities.) Too, the argument is made that the score on any given test cannot represent adequately what the students have learned.

Subjective Evaluation

Both subjective evaluation and tests should be component parts of the total picture of classroom performance. Each complements the other and provides information which the other may not, although the correlation between the two is normally quite high. The teacher should be especially aware of the effects of daily classroom activities because it is at this point that the whole process of evaluation begins. As she teaches, she can sense from the student response whether the students are adequately prepared for what they are being asked to do. In this way she has the feedback necessary to judge whether her sequencing of prior activities and her teaching have been adequate. The students' ability to participate in the classroom activities also demonstrates to them whether or not they have learned what they should.

The daily classroom work itself, then, is the crucial factor in the entire evaluation process. This is true from the teacher viewpoint because the alert teacher sensitive to student response receives more feedback during class recitation than from any test. In fact, the experienced teacher in a beginning language class usually knows in advance the approximate scores that the students will receive on any given test, even though she may not be able to predict which items will be missed. The above statement is also true from the viewpoint of the students, because their general attitude toward the class is based on the challenge offered and the success and confidence achieved during daily recitation.

The teacher may decide to assign specific grades based on daily work. Giving a class grade is indeed a nebulous judgment at the end of the grading period, but such a grade often has the effect of increasing participation in the class, especially in the case of shy students who are reluctant to speak in front of the group. Also, for the teacher who chooses not to give speaking tests, the class grade is the only definite evaluation made of the students' ability to speak the language. Giving a daily grade based on classroom recitation hardly seems necessary, and the paperwork can become quite a chore.

In addition to a grade based on classroom participation, a grade may be assigned on the basis of performance in the language laboratory. Certainly, if the lab is used regularly, part of the grade in the course should reflect the students' work there. This grade may range all the way from a single letter
grade based on a global subjective estimate by the teacher to a structured grading system for monitoring, such as that suggested by Stack (1966).

The preceding paragraphs stress the evaluation of cognitive achievement by subjective means. Equally important are the teacher's subjective estimates of the affective-social factors influencing student progress in the class. The teacher should continually be sensitive to the total class atmosphere and individual attitudes within that social environment. From this subjective interpretation of the feelings within the group she can strive to enhance the positive and counter the negative. By keeping an eye open to student actions and reactions in class, the teacher can cultivate a positive, productive relationship with her class and among the students that will make the class more enjoyable for everyone and that will improve cognitive achievement and affective-social attitudes.

For two reasons, subjective evaluation is especially important in the area of affective-social variables. First, attitudes are rather difficult to determine by means of tests, even though self-reporting attitude inventories are available. (For example, see Jakobovits [1970].) And second, the teacher needs to make continuous adjustments during the course of the semester in response to the affective-social forces affecting the classroom teaching-learning situation. The completion of an inventory, as helpful as the results can be, is not sufficient to give the teacher the everyday feedback she requires to be responsive to student feelings and needs.

Objective Evaluation

Subjective evaluation may be more fundamental and more important for the teacher, both from a cognitive and an affective-social point of view, than objective evaluation. Specific testing procedures, however, carry more weight with the students in the areas of cognitive acquisition of knowledge and skills and in the affective domain also. Although the students already have a general feeling for how they are doing prior to the test, they need test items over the instructional objectives to indicate to them which goals they have failed to achieve and which weaknesses in their preparation should be strengthened. (Prior to the test, the astute teacher should already be aware of student progress, but he needs concrete information from the results of the tests to impress on the students what he knows beforehand.) In this sense, tests are merely another type of learning activity. For those students who lack the insight to evaluate realistically their achievement during the classroom teaching-learning activities, test results provide the hard facts that give them the feedback they must have to improve and which they are unable to supply for themselves.

Testing in the second-language class is often quite inadequate. The teacher is so preoccupied with classroom activities that he fails to maintain a comprehensive perspective of the flow of the language-learning sequence from plans to activities to testing. If he has failed to delineate clearly his objectives, he will not be in a knowledgeable position for preparing the test. The preparation of the test begins with the objectives. They, in effect, are the criteria upon which to judge student performance and from which to prepare test items.

The test should be a sampling of these performance criteria stated in the lesson-plan objectives. Unfortunately, the teacher often does not test the objectives of the material covered. At times, he may not even test the planned classroom activities. (Whether or not these activities reflect the objectives is another question.) In such a situation, the practical objectives of the course are set by the tests. No matter what the teacher states as the goals of the course, the students study for the tests. The tests, then, in spite of all protestations to the contrary, determine what the students emphasize in their study. Some of the dissatisfaction associated with the audio-lingual approach, for example, can be traced to this very problem. Teachers who had been retrained in audio-lingual classroom techniques continued to administer traditional, paper-and-pencil tests. The result was predictably disastrous, and student achievement in oral skills was understandably lower than had been expected. If the teacher is to insure learning, his objectives, planned activities, and testing procedures must reflect his philosophy of learning and language.

TYPES OF TESTS

There are several types of classroom tests. The most common one, which is a favorite of many second-language teachers, is the quiz. The quiz is a compromise between short-term, subjective evaluation based on daily work and the unit, six-week, or semester examination. Quite often the threat of a quiz is used as a scare technique to stimulate the students to prepare their daily lesson. A more desirable situation is the positive approach in which they prepare in order to be able to participate in the day's activities. The quiz may be announced or unannounced. Both are intended to check short-term learning and to encourage daily preparation. There is no consensus on the value of periodic quizzes. Some teachers use them very effectively. Others feel that quizzes, especially the "pop" variety, frustrate students who are anxious and conscientious, and that such tension may actually hinder learning and create a negative reaction toward the class itself.

If she does decide to give quizzes, the teacher should be wary of three tendencies that may negatively affect learning. The first is to place such reliance on the threat of a quiz to motivate the students that she neglects to plan adequately for her own teaching. Quizzes may be used to give an extra incentive, but proper motivation cannot be maintained on the basis of quizzes alone. Second, the teacher should take care not to ask questions of the same difficulty level on a quiz that she would on an examination. The purpose is to check initial comprehension of any segment of material, not final mastery. Third, the teacher should not overtest. If she decides to give quizzes, she should not give one every day, at least not for a grade. Most of the time in class should be devoted to teaching-learning situations, not testing. A good rule-of-thumb is to limit all testing to a maximum of 10 percent of class time.

In second-language testing and evaluation, the greatest amount of expertise has been developed in the area of prediction. Carroll and Sapon (1958) and Pimsleur (1965) have published aptitude tests that correlate quite highly with student achievement in language classes. Other factors such as past academic record, verbal and quantitative ability, interest, sense modality preference for learning, etc. have been found to affect achievement. One study even indicated that it might be possible to determine which students would be most likely to succeed in an audio-lingual class and which students would prefer a cognitive type of class. At any rate, it now seems possible to predict the strengths and weaknesses of each student on the basis of information in the counselor's office and from the scores on an aptitude test (Chastain, 1969). In spite of the possibilities in this area for sectioning students to enhance their opportunities for learning the language, few schools do so at present. Perhaps a future trend in this direction may enable the teacher to anticipate and avoid, or at least minimize, student difficulties.

Periodically, the teacher should give an achievement test that covers a major segment of material. With the typical textbook, tests are given at the end of each unit, lesson, or chapter. Tests of this type encourage the students to organize their knowledge, to assimilate larger chunks of material, and to learn for long-term retention. They also present the students with objective evidence with regard to what they have and have not learned. (The teacher is usually more aware of this prior to the test than they are.) Some authors provide tests to be given after each major division of the book. These tests may or may not be appropriate for any given teacher's interpretation of the book. She should examine them carefully in light of her own classroom procedures. Teacher-made tests are likely to be more consistent with the individual classroom approach, but the teacher may not have the time or, in the beginning, the expertise to prepare them.

Standardized achievement or proficiency tests are also available. The MLA-Cooperative Foreign Language Tests (1965) are designed to be given at the end of the second level and at the end of the fourth level in high school. These tests are obviously not correlated with any particular textbook, and they would be inappropriate for use as a final examination over the textual materials used in the class. However, they are a valuable asset in curriculum evaluation and for comparing any particular students with the norms on the national sample. By comparing the scores over a period of years, the department can obtain a more objective measure of the quality and focus of instruction.

Although they are not commonly used in most second-language programs in secondary schools, diagnostic tests could be used to advantage. Their major function is to ascertain what the students know and what they do not know. Since the study of second languages is a cumulative process, the results of a diagnostic test would enable the teacher to determine those areas that should be reviewed and those areas that can safely be assumed to be a part of the students' existing knowledge. Such tests could be used beneficially to facilitate student progress from one level to another.

One way to incorporate diagnostic testing and diversified instruction into intermediate and advanced classes is to use pretests over basic grammar points as a diagnostic tool. At the beginning of the week the students take a pretest over the grammar for that week. Those students who score above a teacher-selected percentage are free for the remainder of the week to concentrate on other work. Those students who need additional work or who desire to raise their scores are directed to work they can do to prepare themselves for a posttest to be given at the end of the week.

Classroom tests can also be categorized on the basis of the way the information obtained from the results is to be used. If the test scores are to be used to rank the students, the test is referred to as a norm-referenced test: that is, student achievement is related to some norm. In the case of standardized tests, the results are compared to those of a nation-wide random sample of students at a comparable level. In the case of a teacher-prepared classroom test, the assumption is made that the students in that class are a normal group, i.e., that there is a normal distribution of abilities in that class. As a result of this assumption, the student scores are listed on a sheet in order to find out the numbers of students who have received the various scores. Using this chart, the teacher "curves" the results, i.e., he decides what percentage of the students receive an A, a B, etc. for the test. In theory, approximately 10 percent of the students get A's, 20 percent B's, 40 percent C's, 20 percent D's, and 10 percent F's. In practice the teacher gives grades that he feels to be consistent with the students' abilities in the class. The weakness of such a practice is that students may rank very high and learn very little, or they may learn a great deal and receive a comparatively low grade if they happen to be in an especially strong academic group of students. Too, the concept of a normal distribution of abilities is based on the total population. Determining the extent to which any particular group corresponds to that distribution is practically impossible. In addition, it is assumed that the teacher can prepare examinations that generate normally distributed student scores; justifying this assumption is a tenuous proposition.
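
To make the arithmetic of "curving" concrete, the sketch below ranks a set of scores and assigns letter grades according to the theoretical 10-20-40-20-10 percent split mentioned above. It is only an illustration: the sample scores and the function name are invented, and in practice the teacher adjusts the cutoffs by judgment rather than applying them mechanically.

```python
# A minimal sketch of norm-referenced ("curved") grading, assuming the
# theoretical 10-20-40-20-10 percent split described in the text.
# The sample scores and the function name are hypothetical.

def curve_grades(scores, split=(0.10, 0.20, 0.40, 0.20, 0.10)):
    """Rank scores from high to low and assign A-F by fixed percentages."""
    letters = ["A", "B", "C", "D", "F"]
    ranked = sorted(scores, reverse=True)        # highest scores first
    n = len(ranked)
    grades, start = {}, 0
    for letter, share in zip(letters, split):
        count = round(share * n)                 # how many students fall in this band
        for score in ranked[start:start + count]:
            grades.setdefault(score, letter)     # ties keep the higher band
        start += count
    for score in ranked[start:]:                 # any leftovers from rounding
        grades.setdefault(score, "F")
    return {s: grades[s] for s in scores}

scores = [92, 88, 85, 81, 78, 74, 70, 66, 61, 55]
print(curve_grades(scores))
```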

If the test is prepared on the basis of the instructional objectives to determine to what degree the students have learned the material presented and practiced in class, the test is called a criterion-referenced test. Although this term is primarily associated with the use of behavioral objectives, the concept can be used with any instructional objectives that the teacher may have for the students. Tests of this type give the students and the teacher information regarding what percentage of the content they have learned, not what percentage of their classmates are above or below their level of achievement. Using this method, the teacher does not curve the scores. He merely determines the percentage of correct answers on the test. Those students who have received a predetermined score receive an A, a B, etc. The emphasis in this type of test is not on comparisons, but on achievement. The advantage of this type of test is that the teacher is obliged to prepare tests that evaluate how well the students have learned. The stress is placed on learning, and the possibility exists for all students to do well or to do poorly, as the case may be.
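
By way of contrast with the curved grades above, the following sketch assigns a criterion-referenced grade from the percentage of items answered correctly and a set of predetermined cutoffs. The cutoff values shown are hypothetical examples, not a recommended scale.

```python
# A minimal sketch of criterion-referenced grading: the grade depends only on
# the percentage of items answered correctly, compared with predetermined
# cutoffs. The cutoffs below are hypothetical, not a recommended scale.

def criterion_grade(correct, total, cutoffs=((90, "A"), (80, "B"), (70, "C"), (60, "D"))):
    percent = 100.0 * correct / total
    for minimum, letter in cutoffs:
        if percent >= minimum:
            return letter
    return "F"

print(criterion_grade(43, 50))   # 86 percent -> "B"
print(criterion_grade(28, 50))   # 56 percent -> "F"
```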

Criterion-referenced examinations, which are prepared to test the acquisition of behavioral objectives, do have certain weaknesses. (These are discussed in chapter 8, page 226.) In response to the criticisms made against criterion-referenced tests, some writers are proposing domain-referenced tests, which purport to test an entire spectrum of behaviors needed to master some aspect of knowledge. (The interested reader is referred to Educational Technology, vol. 14, June, 1974.)

There are also formative and summative tests. (See chapter 8, pages 226-27.) A formative test consists of items designed to test student achievement of instructional objectives. The test is given at the end of a particular segment of material to give the students feedback on what they have or have not learned. Subsequently, compensatory exercises and activities are provided to help the students fill in the gaps in their learning. The students receive no grade on these tests. The purpose is to assist them in learning, not to grade their performance. Summative examinations are given less often over larger quantities of material. Grades are given on achievement indicated by the scores received on these tests. Both types of tests are intended to be geared toward instructional objectives, not toward student rankings.

Another type of test which has received a great deal of attention among teachers of English as a second language is the cloze test. In a cloze test, every nth word is deleted from a written passage. The students complete the test by filling in the missing words or suitable equivalents. It has been suggested that
cloze tests measure the learner's "grammar of expectancy," and that the "learner's internalized grammar of expectancy" is the central core of language competence (Jorstad, 1974).
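
Constructing such a passage is mechanical enough to sketch in a few lines. The example below deletes every nth word and keeps the deleted words as the scoring key; the sample passage, the value of n, and the function name are purely illustrative.

```python
# A minimal sketch of cloze-test construction: every nth word of a passage is
# replaced by a blank, and the deleted words are kept as the scoring key.
# The sample passage and the choice of n = 5 are illustrative only.

def make_cloze(passage, n=5):
    words = passage.split()
    key = []
    for i in range(n - 1, len(words), n):   # delete every nth word (0-based index)
        key.append(words[i])
        words[i] = "_____"
    return " ".join(words), key

text = ("The students read the passage quickly and then tried to supply "
        "the missing words from the context of the sentences around them.")
cloze, answers = make_cloze(text, n=5)
print(cloze)
print(answers)
```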

GUIDELINES FOR CLASSROOM TESTING

1. Test objectives. The purpose of a classroom achievement test is to evaluate student attainment of objectives for any given segment of material. Any test, or any portion of a test, that the teacher cannot relate to specific objectives in his lesson plans needs to be carefully scrutinized and probably revamped. Time restrictions may make it impossible to prepare questions for all the objectives, but a representative sample should be covered. The important thing is to remember that any group of items should be thought of as a check on the attainment of goals.

2. Test what has been taught. Just as the test should reflect the objectives, it should also reflect the classroom procedures that were employed to accomplish those objectives. In other words, the test should only include items that are of the same type as the exercises used to teach the structures and vocabulary involved. Words and forms certainly should be recombined, but the types of exercises should be the same. Too often the language teacher is guilty of using certain procedures in class but testing with different techniques. If one were learning to play the piano, for example, the test should include various pieces that have been practiced. It is not uncommon in language classes for the teacher to ask the students to play a different piece, or perhaps even to write a composition, rather than play anything at all.

3. Test all four language skills. If the objectives in the course include all four language skills, all four should be tested. The implication here is that each test should include one section to test listening comprehension, one for speaking, one for reading, and one for writing. If the teacher customarily limits herself to pencil-and-paper tests, these types of exercises become, in effect, the practical goals of the students. She can correct this common fault by including some items involving each of the language skills on each examination.

The aforestated guideline also carries another implication—the word skill. As well as testing all four skills, the teacher should also be careful to test the material covered as a skill, not as a set of memorized vocabulary words and verb forms. Admittedly, language is made up of sounds, words, and structures. There is a great deal of difference, however, between testing concepts in some communicative context as opposed to regurgitating a word from a memorized list. Testing the students' ability to answer the question, "Are you a student?"
by writing, "Yes, I am a student," is entirely different from asking them to write the first-person singular of the verb to be. Asking the students to choose a word in the second language that correctly completes a series of statements such as the following does not involve the same skill as asking them how to say "get up" in the second language:

The sun was shining. The alarm was ringing. Slowly I realized that it was time to _____.

4. Test one thing at a time. There are two considerations related to this statement. First, as far as possible each skill should be tested separately from any other skill. The answers to listening comprehension items, for example, should be given orally. Or the students might possibly choose from pictures or some other visual aid rather than read the answers in the language. There are limits to the implications of this guideline. It would be most difficult to prepare sufficient visual cues to elicit the desired responses in speaking. Therefore, and even though visual aids provide excellent stimuli to certain portions of the speaking test, many items on the speaking test will be cued in the language. This type of item does involve listening comprehension as well as speaking, but this practice can be defended on the basis that such a combination is the normal one in everyday conversational situations. The teacher should, however, remember that the primary function of the speaking portion of the test is to test the speaking skill and consequently to limit the oral cues to those that the students are certain to understand. The same relationship also exists between the written receptive skill, reading, and the productive skill, writing.

Second, within each of the language skill tests, the teacher should test only one point at a time. If he is writing an item to test vocabulary, the students should be aware that a word or words are to be tested and that syntax is not involved. For example:

For their golden wedding anniversary my grandparents received many gifts and had many _____.

a. weddings   b. visitors   c. new excuses

If he is testing structural knowledge, all the distractors, i.e., the answers given, should involve forms of words rather than just words. For example:

His sister is _____ the gifts.

a. opening   b. received   c. looks for

Another case in point is the speaking test. When judging the students' pronunciation, the teacher should grade only one sound per utterance. When evaluating the use of grammatical forms, he should grade only the grammar and not focus upon the pronunciation and the vocabulary. Similarly, vocabulary should be graded independently of the pronunciation and structural
manipulation. Only in this fashion are objectivity and identification of specific weaknesses possible.

5. Weight exam according to objectives. Since the test is to reflect the classroom objectives, it should be weighted according to the stress placed on each of the language skills. If the teacher wants to emphasize all four language skills equally and has endeavored to do so in his class activities, one fourth of the exam should test listening comprehension, one fourth speaking, one fourth reading, and one fourth writing. This is not to say that each portion of the exam should be equal in length, but that the weights assigned to each should be equal. On an examination graded according to a scale of one hundred points, each portion would be worth twenty-five points. (A test of any length may be converted to a hundred-point scale by multiplying or dividing.) Normally, a test of speaking and writing will include more separate factors to grade. Therefore, each might be worth fewer points even though the total score would be the same as on the listening and reading portions of the test. Conversely, the listening and reading sections will most likely have fewer points to grade, but each will be worth more.
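
The conversion described above, turning each section's raw score into its weighted share of a hundred points, can be illustrated with a short sketch. The section maxima, the raw scores, and the equal twenty-five-point weights below are hypothetical.

```python
# A minimal sketch of weighting an exam by objectives: each section's raw
# score is converted to a percentage of its own maximum and then multiplied
# by the weight assigned to that skill. The section maxima, raw scores, and
# equal 25-point weights are hypothetical.

sections = {
    # skill: (points earned, points possible, weight out of 100)
    "listening": (18, 20, 25),
    "speaking":  (33, 40, 25),
    "reading":   (14, 15, 25),
    "writing":   (41, 50, 25),
}

total = 0.0
for skill, (earned, possible, weight) in sections.items():
    contribution = weight * earned / possible
    total += contribution
    print(f"{skill:10s} {contribution:5.1f} / {weight}")

print(f"{'total':10s} {total:5.1f} / 100")
```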

6. Sequence the items from easy to more difficult. Most students enter an examination slightly apprehensive about their ability to answer the questions. Even though most of this anxiety can be extinguished in any given class before it becomes a real problem, the aura of testing remains. Therefore, one of the teacher's tasks in preparing the items is to include a few at the beginning which most students should be able to do fairly easily. This progressive introduction will help to reestablish their confidence and to set them at ease. Thus, the resulting score should be more representative of their true ability.

Another reason for arranging the items in each portion of the test in an order of ascending difficulty is to insure that some items are included that will challenge each student's ability level. There should be some items that each student can get fairly easily and some that challenge each student. In this sense, arranging the items from easier to more difficult is merely another way of acknowledging individual differences on the test. Slower students, for example, should not be expected to complete all the items if they do not feel capable of doing them. Faster students may be given extra-credit items, which will keep them occupied while the others finish the examination.

7. Avoid incorrect language. This guideline is obvious. The students make enough errors in the classroom without the teacher contributing some of her own. Equally obvious is the fact that, with some ingenuity, the teacher can test the common errors without actually committing them herself.

8. Make directions clear. The purpose of a test is to measure achievement, not cleverness in reading the directions. The students should be encouraged to ask questions if they fail to comprehend what is expected, and, if possible, the teacher should circulate through the class during the exam to check
whether they did in fact understand. (Many beginning teachers are astounded to learn that asking, "Are there any questions?" is not sufficient. Many students do not know what kind of questions they should be asking. The teacher is always responsible for determining whether there indeed are questions; asking if there are any questions is usually an ineffective procedure.)

9. Prepare items in advance and prepare more than you need. The preparation of test items is a difficult and time-consuming task. Often the teacher is not aware of the weaknesses of his examinations. He can improve this situation by preparing the test in advance and by having more items than are needed. In a few days he can go over all the items again, eliminating the weaker ones. After the test, those questions which appear to be good ones can be placed in an item file for future reference. Categorizing these by structural topic will speed up location later on. The vocabulary associated with each item is easily and quickly changed.

10. Specify the material to be tested. Although the teacher may feel that the students should know what the examination will cover, and although he may have been extremely careful to convey the objectives of each class to them, he still should delineate exactly the material that will be tested. Such a summary motivates the students to study and focuses their attention on the most important concepts in the lessons while they are studying. Much time is wasted by students as they attempt to develop a learning strategy and to organize the basic knowledge to be assimilated. By outlining what should be reviewed, the teacher can assist them in spending their time more efficiently.

11. Acquaint the students with techniques. Normally there should be little difference between the normal classroom activities and test items. However, if the teacher does occasionally consider it necessary to include an innovative type of item distinct from classroom techniques, he should provide the students some practice with this type of item before counting the results as part of a grade. For example, if the teacher decides to ask the students to record their speaking tests in the language laboratory, they must first be well acquainted with the lab. Even then, he may want to consider the first test as a practice run.

12. Test in context. This concept was previously considered in the discussion of testing all four language skills. The teacher who tests vocabulary and structure has a tendency to revert to teaching a problem-solving class, i.e., one in which the teacher and the students focus on the means rather than on the ends, on the building blocks of language rather than on the application of these components to actual communication situations. He should always try to write items as similar as possible to language that would occur in actual usage. Furthermore, testing in context implies that often a contextual situation must be established, clarifying for the students exactly what the correct answer
would be. In distinguishing between the two past tenses in French and Spanish, for example, the teacher is almost forced to write an entire paragraph in order to establish a definite relationship between the two tenses. Even in testing a vocabulary item, a contextual situation may need to be established. For example:

Did you get the letter?
No, the _____ still has not arrived.

a. card   b. code   c. male   d. mail

13. Avoid sequential items. A sequential item is one that has two or three stages, with the later parts being dependent upon the first. Those students who fail on the first part will automatically be unable to do well on the second or third. An obvious example is giving a dictation and then asking the students to translate the paragraph into English. Those who were not able to write the dictated sentences receive a double penalty. Another example is the test item with a blank in which they are supposed to write the correct form of a verb (given in the first language) in the second language. If they do not know the vocabulary word, it is impossible for them to give the correct form of the verb.

14. Sample fairly the material covered. The teacher should strive on each examination to sample fairly all the objectives and all the various types of activities included in the lesson plans. She may be unable to write specific items testing all the concepts taught, but by randomly sampling those taught in class, she should receive scores that are representative of the students' total learning. This statement also implies two other important concepts in testing: (a) all four language skills should be tested, and (b) items of varying levels of difficulty should be included.

One effective aid for obtaining a random sample is to keep a checklist of all material covered in the view and review in each of the language skills. Using this outline, the teacher should have little difficulty sampling the content in each of the language skills at various levels of difficulty. Attempting to implement the guidelines mentioned in this paragraph, the teacher may tend toward examinations that are too long. Over any given amount of material the possibilities for items are almost unlimited, and the teacher must refrain from attempting to include everything. Testing each selected concept or form once is sufficient. In testing verbs, for example, there is no need to include more than one item testing the ability to use the first-person singular form. In fact, testing all four language skills permits the teacher to test some forms or concepts in one skill and others in another skill, thereby shortening considerably the necessary length of the test.

15. Allow sufficient time to finish. This statement must be qualified in the case of listening and speaking. All students must answer the same listening
items given at the same speed. This also applies to speaking tests given in the laboratory, in which all students must respond at approximately the same time. On the written portions of the examination, however, some students will be able to finish much more quickly than others. Since this variety of speeds is common in most classes, and since having to turn in an unfinished test is frustrating to most students, the teacher should take care to allow everyone to finish before collecting the papers. He can accomplish this trick of keeping everyone busy either by giving the faster students extra-credit questions or by permitting the slower students to omit certain difficult items on the test. The teacher is protecting the students' right to answer all the items, but he cannot permit them to sit and meditate over particular items. Quizzes especially must be completed quickly and collected. Normally, those students in a second-language class who wait for a sudden flash of inspiration continue to remain in the dark. In the case of languages, they either know it or they do not, so little is accomplished by letting the few who are unwilling to accept their fate waste valuable class time.

16. Make the test easy to grade, if possible. Of course, making the test easy to grade is not a primary consideration, but the teacher must arrange her time as efficiently as possible. One way of decreasing her work load is to consider beforehand the amount of time necessary to grade the examination. By choosing types of items that can be graded quickly, she can eliminate some unnecessary work. One simple example is to have the students always place their answers in a column to the left of the number of the item rather than filling in a blank or circling the letter of the correct answer in multiple-choice questions. Their answers then become a list that can be graded almost in a single glance. Another technique that greatly speeds up the grading of examinations is to grade one group of items on all the papers before continuing to the next section. By grading in this fashion, the teacher has the correct answers memorized after the first few papers, and her speed increases tremendously. Grading the entire test of each student before proceeding to the next student's paper is much slower. Writing the number of errors in each section in the margin will speed up the tabulation of the total score at the completion of the last section. Too, preparation of an answer key along with the test is a time-saving device.

Of course, grading tests in which students select answers is easier and quicker than grading tests in which students give their own answers in writing or speaking. In this sense, normally it takes less time to grade tests over the receptive skills than over the productive. This relative ease in grading, however, is counterbalanced by the fact that good objective tests in which the students select answers take longer to prepare than short-answer and essay-type questions. Since a test covering all four skills must, by definition, include both selecting and giving answers, the teacher needs to gain expertise in preparing and grading both types of items.

17. Go over the test immediately afterwards, if possible. The students' interest is highest during and immediately after a test. The teacher should take advantage of this peak of interest by giving the correct answers as soon as he collects the papers. Some teachers prefer to have the students exchange papers and grade them in class. At any rate, the point is that an examination is also a learning experience, and as such, it is always preferable to review the correct answers with the students. The best time to clarify difficulties and doubts is at the end of the test. The students should never have to wait more than a day, possibly two, to see the results.

18. Help those students with problems to determine specific areas in need of improvement. Whether or not the test is indeed a learning device depends a great deal upon the teacher. He can return the tests with a lengthy harangue about the lethargy and apathy of the younger generation, or he can praise the students for their fine work, when they deserve it. He can scold them for the poor grades, or he can indicate just what their problems seem to be by means of short notations on the tests or in personal interviews. The student in question may need to work with sounds, with vocabulary, or with grammar. The student may need additional practice in listening, speaking, reading, and/or writing. The teacher can provide assistance not only with identifying the problems, but also with suggesting ways and means of improvement. A personal interview may also reveal affective-social problems with which the teacher can help even though they may have little to do with second-language learning per se.

19. Test each concept only once. Testing a single concept more than once not only inflates the true score of those students who know the correct answer; it deflates the true score of those students who do not know the answer. For example, in a group of items testing the students' knowledge of regular verb endings of a single class, one item testing each of the person forms in the singular and the plural covers all the concepts and forms. Additional items not only take additional time and cause extra work but falsify the results as well.

20. Give credit for what a student knows. Although the argument that a response is not correct unless it is entirely correct can be defended strongly, and often is, such a method of grading test papers is not fair to the students. A closer examination of the question often reveals that the students have been asked to make more than one decision in answering the item. Each decision should be given credit separately rather than one single grade that is either totally correct or totally incorrect. In working with object pronouns, for example, some students may know the correct form but err in placing it correctly in the sentence or vice versa. In choosing between the indicative and subjunctive, some may choose the right mood of the verb but fail to make it agree with its subject or vice versa. The point is that the teacher should grade from the point of view of what the students do know rather than
rejecting all but perfect responses. When testing language skills, the teacher should place primary stress on the ability to communicate. Although students who take rather marked liberties with the grammar of the second language obviously do not deserve as good a grade as those who are able to express themselves with great fluency and with very few errors, they have accomplished the major goal in second-language learning, and they deserve credit for being able to communicate.

21. Distinguish between recall and recognition tests. In taking a recognition test, the students select the correct answer from a number of alternatives. The alternatives may range from two in true-false questions to more than two in a multiple-choice test (normally three, four, or five) to several grouped items in a matching exercise. The tendency is to have fewer distractors in listening comprehension items and more in reading items. Since listening and reading are receptive skills, recognition-type questions are most suited to the testing of these two skills.

In taking a recall test, the students must produce the cued responses rather than select the appropriate one. A recall test measures the students' ability to apply functional understanding in order to produce language. Recall items, then, must be used to test the productive skills, speaking and writing, since the nature of the skill itself involves production rather than recognition. One commonly used type of recall item is to ask students to write answers to personalized questions such as "What's your name?" The principle of testing only one language skill at a time can be modified in this case, since in the normal language situation the receptive skills quite commonly serve as stimuli to the productive skills.

22. Distinguish between discrete items and global items. Clark (1972) identifies discrete items and global items. A discrete item consists of a short stimulus followed either by an answer to be chosen (a recognition item) or a space for an answer to be given (a recall item). The important characteristic of a discrete item is that it tests only one point. An example of a discrete item is a question followed by four possible replies. An example of a global item is a longer passage followed by a series of questions testing comprehension of content. The purpose of such an item is to test one of the language skills in a more sustained sequence.

23. At some point in the learning sequence, the students should have to choose between various grammar structures. Tests, as well as exercises, should foster the assimilation of each component of the language system into a large, meaningful whole. To be discouraged is the memorization of fragmented chunks of information that can be placed in indicated slots on tests, but that the students cannot use in any context involving a choice. After the students have studied the past tense, they should be able to complete an exercise in which they use past tense verb forms. However, if they are going to learn to
use the language, they should also be able to choose when to use the present or the past and to use both in the same exercise as the context indicates. Students should be able to complete an exercise on a test labeled direct object pronouns. They must also be able to reach the point at which they can use both direct and indirect object pronouns in the same exercise if they are ever to achieve any functional ability to employ these forms for purposes of communication. On each test there should be some items in which the students must make choices that indicate functional control of the components of the language system.

24. Be prepared to incorporate new types of items. The best place to find new types of test items is in the teacher's own classroom. If at all possible, the students should be asked to perform only those types of exercises and drills that they have been doing in class. There is no need to resort to the exotic or the esoteric in the preparation of test items. The object is to assess the language skill that has been developed in the work being done in class. However, if the teacher feels obliged to "do something different," the bibliography at the end of this chapter contains sources for "new" ideas. She should have as a ready reference the most complete listing of example test items to date, Modern Language Testing: A Handbook by Valette. (See Selected References.)

CONCEPTS OF TESTING

The first concept normally mentioned in a discussion of testing is that of validity. The test is valid when it tests what it purports to test, i.e., when it tests entirely or in a random sample all the objectives and content of the material being learned.

Another often mentioned concept is reliability. The implication of this term is that the test continues to give the same or similar results on successive occasions.

Both validity and reliability are concepts with which the teacher is most likely familiar even though he may not be able to define the words themselves. Certainly, he should examine his own tests to determine the relationship between the content of the examination and the activities pursued in the class.

Standard error of measurement is an important concept to be kept in mind while assigning grades for a particular score on a test. Standard error implies that a certain amount of error is part of each test score. The amount of error varies from test to test. Therefore, a student's true score probably lies between one standard error above and one standard error below the score the student received on the examination. For example, if Mary's score was 85 on a test with a standard error of 3, the odds are rather high, 68 percent, that her true score is somewhere between 82 and 88.¹
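
Stated as arithmetic, the reported score defines a band rather than a single point. A minimal sketch using the figures from the example above:

```python
# A minimal sketch of the standard-error band described above: roughly 68
# percent of the time the student's true score falls within one standard
# error of the observed score. The numbers repeat the example in the text.

observed_score = 85
standard_error = 3

low = observed_score - standard_error
high = observed_score + standard_error
print(f"True score lies between {low} and {high} about 68 percent of the time.")
```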

Two other concepts with which the teacher should be familiar are item difficulty and item discrimination. Item difficulty refers to the percentage of students who answer the item correctly; item discrimination is a measure of which students answer the item correctly. Item discrimination involves comparing the student's answer on the item with her total score on the examination. For the item to have high discrimination, those students who give the correct answer should be those students who receive the higher scores on the test.
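
Both indices can be computed from a simple record of who answered the item correctly and what total score each student earned. The sketch below estimates item difficulty as the proportion answering correctly and estimates discrimination with a point-biserial correlation, one common way of expressing the relationship the paragraph describes; the response data, the sample scores, and the function names are hypothetical.

```python
# A minimal sketch of classroom item analysis. Item difficulty is the
# proportion of students answering the item correctly; discrimination is
# estimated here as a point-biserial correlation between getting the item
# right (1/0) and the total test score. The data below are hypothetical.

from statistics import mean, pstdev

def item_difficulty(right):
    """right: list of 1 (correct) / 0 (incorrect) for one item."""
    return mean(right)

def item_discrimination(right, totals):
    """Point-biserial correlation between item correctness and total score."""
    p = mean(right)
    q = 1 - p
    sd = pstdev(totals)
    if p in (0, 1) or sd == 0:          # no variation: discrimination undefined
        return 0.0
    mean_right = mean(t for r, t in zip(right, totals) if r == 1)
    mean_wrong = mean(t for r, t in zip(right, totals) if r == 0)
    return (mean_right - mean_wrong) / sd * (p * q) ** 0.5

# One item answered by ten students, with their total test scores.
right  = [1, 1, 1, 0, 1, 0, 1, 0, 0, 0]
totals = [95, 91, 88, 84, 80, 75, 72, 66, 60, 55]

difficulty = item_difficulty(right)
discrimination = item_discrimination(right, totals)
print(f"difficulty = {difficulty:.2f}, discrimination = {discrimination:.2f}")

# With four choices, anything below the guessing level (0.25) signals trouble.
if difficulty < 0.25:
    print("Fewer students than chance chose the correct answer; revise the item.")
```

Applying the same calculation to each alternative in turn (treating "chose a," "chose b," and so on as the 1/0 variable) yields per-choice figures of the kind shown in the sample items below.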

Even if the teacher does not have the means for a complete item analysis by computer, he may want to check the difficulty and discrimination of certain items by a show of hands in class or by a subjective estimate gained while grading the examinations. The teacher should always grade the papers or examine them carefully afterwards in order to improve his testing techniques and in order to determine what the focus of the follow-up discussion should be.

With regard to level of difficulty, only the first few items in each part of the examination should be so easy that most students will get them right. The remaining items should be increasingly difficult. One of the characteristics of a good examination is that the range of scores be sufficiently spread out to guard against chance deviation from one letter grade to another. For example, when the B's on the examination range from 59 to 61 and the standard error of measurement is 3.5, then any given student's grade could go from a C+ to an A- merely by chance. Therefore, the examination should be difficult enough to insure a proper range of scores for the purpose of assigning grades and to challenge the brighter students in the class. At the same time, the test should not be so difficult as to discourage the students. Most good test items are answered correctly by 60 to 85 percent of the students. Certainly, any item in which the correct answer is chosen by fewer than a guessing percentage is not a good item. For example, in a multiple-choice question with four possible answers, any time fewer than 25 percent of the students select the correct answer, something is wrong with the item or with the teaching leading up to the test.

With regard to discrimination, the teacher needs to keep in mind that his better students should be answering the questions correctly. He should reexamine any item answered properly by the less able students, but missed by the more capable. Such an item has obviously failed to discriminate the knowledge level of the students properly.

¹ Sixty-eight percent of the scores on a test having a normal distribution fall within one standard deviation above and below the mean score on the test. Standard error of measurement follows the same principle. See Lemke & Wiersma (1976) in the Selected References section of this chapter.

In improving his test preparation, the teacher should begin by reconsidering his objectives and his classroom activities in relation to the test. After this, he can begin to revise the items themselves. Many items, of course, will need to be discarded or completely revised. Others may need to have only one distractor changed to improve the quality of the item. All distractors should make sense if they are to fulfill their function. Writing three distractors that would seem plausible if the students do not completely understand the item requires a great deal of thought and time. A good item is often ruined by the failure to work on the item until the teacher thinks of satisfactory distractors. Writing the stimulus and the answer is not too difficult. The secret to a successful item, however, lies in the quality of the distractors. Since writing such items does require much time, the teacher should, in his own self-interest, begin an item file during his first year of teaching and faithfully improve and expand it in succeeding years.

Sample Objective Test Items

The following examples are not intended to provide a classification of item types, but to demonstrate the concepts of item difficulty and item discrimination. The items are designed to test listening comprehension and reading. The items are multiple choice, and the correct answer depends either on an application of knowledge of vocabulary or an understanding of structure. Item difficulty, i.e., the proportion choosing each alternative, is given in the first column to the left of the letter of each choice. Item discrimination, i.e., the alternative's correlation with total score, is given in the right column. A good item has a relatively high proportion in the left column opposite the correct answer, a positive correlation in the right column opposite the correct answer, and negative correlations in the right column opposite the incorrect answers. These correlations indicate that the students with the higher scores on the test chose the correct answer and vice versa.

1. Good listening comprehension and/or reading items (vocabulary):

J'habite New York. Je n'ai pas envie de prendre le bateau pour aller en France. Cette fois je vais prendre _____.

.188   -.255    a. le train
.079   -.237    b. l'autocar
.109   -.160    c. le métro
.624    .441   *d. l'avion

Señora, ¿quiere usted que sirva la carne ahora?

.020   -.118    a. ¿Cuánto es la ganancia?
.608    .430   *b. No, espere usted un rato.
.137   -.219    c. Sí, hay mucho ganado.
.235   -.279    d. No sirve para nada.

2. Good listening comprehension and/or reading item (grammar):

¿Vinieron ellos a la inauguración?

.641    .370   *a. No, no pudieron aceptar la invitación.
.109   -.260    b. Sí, pude ir.
.250   -.222    c. No, no vinimos en coche.

3. Good reading item (grammar):

—¿Se va a casar con el rubio o con el moreno?
—Eso no lo _____ hasta el domingo próximo.

.083   -.267    a. supimos
.229   -.221    b. saldremos
.458    .503   *c. sabremos
.229   -.200    d. sepamos

All the preceding items are good items from the point of view of item difficulty and item discrimination because an acceptable portion of the students answered the questions correctly. The positive correlations with correct answers and negative correlations with incorrect answers indicate that the better students answered the items correctly. Item number 3 was difficult, but the figures show that its discriminative power was very high, i.e., the best students got the question right. Even the best items can be improved. In the preceding examples some distractors were chosen by a very small percentage of the students. Some distractors should be changed for more realistic choices if the students have not understood the items completely.

4. Poor listening comprehension and/or reading item (vocabulary):

Wo kommen Sie her?

.543    .096    a. Sie kommt aus Clinton in Iowa.
.163    .010    b. Ich komme mit dem Zug.
.209   -.018   *c. Ich komme aus Lafayette.
.085   -.159    d. Sie fahren nach Deutschland.

5. Poor reading item (vocabulary):

Bonn ist _____ der Bundesrepublik Deutschland.

.008   -.211    a. das Gebäude
.938    .131   *b. die Hauptstadt
.008   -.076    c. die Reihe
.047    .031    d. der Salz

6. Poor listening comprehension and/or reading item (grammar):

¿Qué era necesario?

.188    .155   *a. Que dejara el puesto.
.708   -.282    b. Que pague la cuenta.
.104    .221    c. Que lo llamó anoche.

7. Poor reading item (grammar):

Paul rêve de son voyage à Paris. Il _____ pense souvent.

.386    .172    a. en
.123   -.171    b. le
.465   -.023   *c. y
.026   -.099    d. lui

An item is inherently poor because it is either too difficult or too easy or because it fails to discriminate between those students receiving higher scores and those receiving lower scores on the test.² The results obtained from any item may be unsatisfactory either because of a flaw or because the students had not been taught to make the correct selection.

In the preceding examples the correlations indicate that these items were not discriminating between better and slower students. In item number 4, for example, the better students were guessing either a or b with a few of the better and a few of the slower choosing c. Too, the correct answer was chosen by fewer than a guessing percentage. The only definite information obtained from the item was that students who chose d did not do well on the test in general and that it was a poor item in particular. Distractors a, b, and c should be rewritten.

For ranking purposes, item 5 was too easy and failed to discriminate. Distractors a, c, and d need to be changed, the first two to more plausible choices and d to an item that will not deceive almost 5 percent of the better students.

Item 6 has an unsatisfactory difficulty index, and the discrimination correlations reveal that the better students selected c, a wrong answer, and the slower students chose b, also incorrect. The item would seem to be a good one, but the results reveal that it is not. One can only conclude that this item
was not valid for the group of students taking the test. Evidently, they had not learned to distinguish between a cue that would elicit the past subjunctive and one that would elicit the present subjunctive, or between the indicative and the subjunctive.

In the case of item 7, almost 47 percent of the students answered the question correctly, but almost 39 percent of the better students chose a as the correct answer. Either the context of the stimulus did not make the meaning entirely clear to the students, or they had not learned the difference between en and y and the contexts in which each is used. Changing distractor a would eliminate much of the problem with the item, but the results seem to indicate that the difference between en and y should be retaught.

The reader should notice that many of the same items can be used to test either listening comprehension or reading. Of course, those that are too long, too complex, or that have blanks in the middle of the sentence should be rewritten. Too, the reader should realize that item analysis reveals the problem, i.e., what should be changed, but not the solution.

TESTING PROCEDURES

Learning requires feedback. Otherwise, the learners have no means of judging the extent and appropriateness of their learning. Although the second- language learners' feedback should not be limited to the objective testing program, carefully prepared tests can be an important component in clarifying progress to the students and in specifying individual strengths and weaknesses.

Tests can encourage or discourage. Tests can promote cognitive achievement and productive affective-social attitudes, or they can produce negative effects, canceling much of what the teacher has gained by choosing suitable objectives and teaching-learning activities for the students. For the most part, student reaction depends upon how well the teacher correlates the content of the tests with class goals and activities. Students who participate in class, learn the material, and practice the skills should know what to expect on a test, and they should be able to do well. Tests should support, not surprise. Test format should be consistent and predictable from classroom activities.

One testing procedure that provides feedback in a supportive manner is to give a short, formative, self-graded test over each grammar item during the class period following its coverage in the view class activities. The format is quite simple. The teacher gives the students five or six items. To answer these items the students must demonstrate a knowledge of the structure being studied. Afterwards, the students supply the correct answers, thereby giving all students an opportunity to grade their papers and an opportunity to check their progress up to that point. Answering the questions takes only a few
minutes, yet each student receives regular progress reports prior to the graded test.


Chapter Notes