Foreign Language Assessment
Compiled by Po-Sen Liao

Introduction
Why do we test at all? Is testing the only way to get students to learn? Do we test out of habit, or as a punitive measure? Is a test a discouraging barrier for students to face, or a hurdle they have to jump at prescribed points? Can tests be a positive challenge? (On questionable and promising testing procedures, see p.4.)

Assessment tasks should be nonthreatening and developmental in nature, allowing learners ample opportunities to demonstrate what they know and do not know, and providing useful feedback both for the learners and for their teachers.

Advances in language testing over the last decade include:
1. The development of a theoretical view that sees language ability as multi-componential and recognizes the influence of test-method and test-taker characteristics on test performance.
2. The use of more sophisticated measurement and statistical tools.
3. The development of communicative language tests commensurate with the increased teaching of communicative skills.

Terminology
1. Assessment: A variety of ways of collecting information on a learner's language ability or achievement; an umbrella term encompassing tests, observation, and project work.

Proficiency assessment: The assessment of general language abilities acquired by the learner independently of a course of study (e.g. TOEFL, TOEIC).

Achievement assessment: Establishes what a student has learned in relation to a particular course (e.g. tests carried out by the teacher and based on the specific content of the course). It determines acquisition of course objectives at the end of instruction.

Diagnostic assessment: Designed to diagnose a particular aspect of a language. A diagnostic test in pronunciation might have the purpose of determining which phonological features of English are difficult for learners and should therefore become part of a curriculum. Such assessment may offer a checklist of features for the teacher to use in pinpointing difficulties.

Placement assessment: Its purpose is to place a student into an appropriate level or section of a language curriculum. Certain proficiency tests and diagnostic tests can serve as placement assessments.

Aptitude assessment: Designed to measure a person's capacity or general ability to learn a foreign language and to be successful at it. Aptitude is considered to be independent of any particular language. Such a test usually requires learners to perform tasks like memorizing numbers and vocabulary, listening to foreign words, and detecting spelling clues and grammatical patterns.

Formative assessment: Often closely related to the instructional program; it may take the form of quizzes and chapter tests. Its results are often used in a diagnostic manner by teachers to modify instruction.

Summative assessment: Occurs at the end of a period of study. It goes beyond the material of specific lessons and focuses on evaluating general course outcomes.

Norm-referenced assessment: Evaluates ability against a standard or normative performance of a group. It provides a broad indication of relative standing (e.g. a score on an exam reports a learner's standing compared to other students).

Criterion-referenced assessment: Assesses achievement or performance against a cut-off score that is determined as a reflection of mastery or attainment of specified objectives. This approach is used to see whether a respondent has met certain instructional objectives or criteria. The focus is on the ability to perform tasks rather than on group ranking (e.g. whether a learner can give basic personal information). See the sketch at the end of this section for a minimal contrast of the two interpretations.

2. Evaluation: Refers to the overall language program, not just to what individual students have learned. Assessment of an individual student's progress or achievement is an important component of evaluation, but evaluation goes beyond student achievement to consider all aspects of teaching and learning, and to look at how educational decisions can be informed by the results of alternative forms of assessment.
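A minimal sketch in Python of the norm-referenced versus criterion-referenced contrast above (the scores, the group, and the 70% mastery cut-off are invented for illustration):

    def percentile_rank(score, group_scores):
        # Norm-referenced: percent of the group scoring below this learner.
        below = sum(1 for s in group_scores if s < score)
        return 100 * below / len(group_scores)

    def has_mastery(score, total_items, cutoff=0.70):
        # Criterion-referenced: compare against a fixed cut-off, not the group.
        return score / total_items >= cutoff

    group = [12, 15, 18, 20, 22, 25, 27, 28]     # raw scores out of 30 items
    learner = 22
    print(percentile_rank(learner, group))       # 50.0: standing relative to the group
    print(has_mastery(learner, total_items=30))  # True: 22/30 meets the 70% criterion

The same raw score thus yields two different statements: a relative standing for norm-referenced reporting, and a yes/no mastery decision for criterion-referenced reporting.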
How to evaluate the assessment instrument?
1. Validity: A test is valid when it measures effectively what it is intended to measure. A test must be reliable in order to be valid.

Types of validity:
A. Content validity: Checking all test items to make certain that they correspond to the instructional objectives of the course.
B. Criterion-related validity: Determining how closely learners' performance on a given new test parallels their performance on another instrument, or criterion. If the instrument to be validated is correlated with another criterion instrument at the same time, this is referred to as concurrent validity. If the correlation takes place at some future time, it is referred to as predictive validity.
C. Construct validity: The degree to which scores on an assessment instrument permit inferences about an underlying trait. It examines whether the instrument is a true reflection of the theory of the trait being measured.
D. Systemic validity: The effects of instructional changes brought about by the introduction of the test into an educational system (p.41). Washback effect: how assessment instruments affect educational practices and beliefs.
E. Face validity / perceived validity: Whether the test looks as if it is measuring what it is supposed to measure.

2. Reliability: The degree to which a test can be trusted to produce the same result upon repeated administrations. A language test must produce consistent results and give consistent information.

Types of reliability:
A. Test-retest reliability: The consistency of scores for the same test given to the same students on different occasions.
B. Alternate-forms reliability: The consistency of scores for the same students on different but comparable forms of the test, taken on different occasions.
C. Split-half reliability: A special case of alternate-forms reliability. The same individuals are tested on one occasion with a single test; a score is calculated for each half of the test for each individual, and the consistency of the two halves is compared.
D. Scorer reliability: The consistency of scores from different scorers for the same individuals on the same test (interrater reliability), or from the same scorer for the same individuals on the same test but on different occasions (intrarater reliability). Scorer reliability is an issue whenever scores are based on subjective judgments (e.g. rater unreliability: two teachers observe two ESL students in conversation together; both teachers listen to and assess the same conversation yet report different interpretations). Reliability can also be instrument-related or person-related.

Measuring reliability:
The reliability index is a number ranging from .00 to 1.00 that indicates what proportion of the measurement is reliable. An index of .80 means that your measurement is 80% reliable and 20% error. For the purposes of classroom testing, a reliability coefficient of at least .70 is good; higher coefficients (.80 or better) would be expected of standardized tests used for large-scale administration.
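A minimal sketch in Python of estimating split-half reliability (the half-test scores are invented; the final step uses the standard Spearman-Brown correction, which is not named above and is added here as a common companion to the split-half method):

    def pearson(x, y):
        # Correlation between the two half-test scores across students.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
        sx = sum((a - mx) ** 2 for a in x) ** 0.5
        sy = sum((b - my) ** 2 for b in y) ** 0.5
        return cov / (sx * sy)

    # One pair of scores per student: odd-numbered vs. even-numbered items.
    odd_half = [5, 7, 3, 8, 6, 4, 7, 2]
    even_half = [6, 7, 4, 7, 5, 3, 8, 3]

    r_half = pearson(odd_half, even_half)
    r_full = 2 * r_half / (1 + r_half)  # Spearman-Brown: whole-test reliability
    print(round(r_full, 2))             # about .94 for these invented data

By the .70 guideline above, such a test would be more than reliable enough for classroom use.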
Improving reliability:
Rater reliability: use more than one well-trained and experienced observer, interviewer, or composition reader.
Person-related reliability: assess on several occasions.
Instrument-related reliability: use a variety of methods of information collection.

(A test that is reliable is not necessarily valid: a test may have maximum consistency but may not be measuring what it is specifically intended to measure. But an instrument can be only as valid as it is reliable; inconsistency in a measurement reduces validity.)

3. Practicality: Practical considerations such as cost, administration time, administrator qualifications, and acceptability.

Background:
Historically, language-testing trends and practices have followed the changing winds of teaching methodology. In the 1950s and 1960s, under the influence of behaviorism and structural linguistics, language tests were designed to assess learners' mastery of different areas of the linguistic system, such as phoneme discrimination, grammatical knowledge, and vocabulary. Tests often used objective formats such as multiple choice. However, such discrete-item tests provided no information on learners' ability to use language for communicative purposes. In the 1970s and early 1980s this led to an upsurge of integrative tests such as cloze and dictation, which required learners to use linguistic and contextual knowledge to reconstitute the meaning of written or spoken texts. Since the early 1980s, with the widespread adoption of Communicative Language Teaching (CLT), assessment has become increasingly direct. Many language tests now contain tasks that resemble the kinds of language-use situations test takers would encounter in using the language for communicative purposes in everyday life. The tasks typically include activities such as oral interviews, listening to and reading extracts from the media, and various kinds of authentic writing tasks that reflect real-life demands. Today, test designers are still challenged in their quest for more authentic, content-valid instruments that simulate real-world interaction while still meeting reliability and practicality criteria.

The best way to evaluate students' performance in a second language is still a matter of debate. Given the wide variety of assessment methods available and the lack of consensus on the most appropriate means to use, the best way to assess language performance in the classroom may be through a multifaceted or eclectic approach, whereby a variety of methods are used.

Discrete-point / integrative assessment: (p.161)
Since the 1960s, the notion of discrete-point assessment, that is, assessing one and only one point at a time, has met with some disfavor among theorists. They feel that such a method provides little information on the student's ability to function in actual language-use situations. They also contend that it is difficult to determine which points are being assessed. In the past, testing points were determined in part by a contrastive analysis of differences between the target and native languages, but this contrastive analysis was criticized as too limiting. About 20 years ago an integrative approach emerged, with the emphasis on testing more than one point at a time. There is actually a continuum from the most discrete-point items on the one hand to the most integrative items or procedures on the other.
Most items fall somewhere in between.

Direct / indirect assessment:
A direct measure samples explicitly from the behavior being evaluated, while an indirect measure is contrived to the extent that the task differs from a normal language-use task. There is increasing concern that assessments need to be developed that directly reflect the traits they are supposed to measure.

Traditional test items:
1. Multiple-choice (show your items to your colleagues)
The multiple-choice item, like other paper-and-pencil items (e.g. true-false items, matching items, short questions), measures whether the student knows or understands what to do when confronted with a problem situation. Multiple-choice items are favored because their scoring can be reliable, rapid, and economical. However, they cannot determine how the student will actually perform in that situation. Furthermore, the format is not well adapted to measuring some problem-solving skills, or the ability to organize and present ideas.

Suggestions for constructing multiple-choice items:
1. The correct answer must not be dubious.
(e.g.) Which is the odd one out?
a. rabbit  b. hare  c. bunny  d. deer
2. Items should be presented in context.
(e.g.) Fill in the blank with the most suitable option:
Visitor: Thank you very much for such a wonderful visit.
Hostess: We were so glad you could come. Come back ______.
a. soon  b. later  c. today  d. tomorrow
3. All distracters should be plausible.
(e.g.) What is the major purpose of the United Nations?
a. To develop a new system of international law.
b. To provide military control of nations that have recently attained their independence (vs. the less plausible "To provide military control").
c. To maintain peace among the peoples of the world.
d. To establish and maintain democratic forms of government in newly formed nations (vs. "To form new governments").
4. For young students, 3-choice items may be preferable in order to reduce the amount of reading. For other learners, 4 or 5 choices are favored to reduce the chances of guessing the correct answer.

2. Essay questions:
Learning outcomes concerned with the abilities to select, organize, integrate, relate, and evaluate ideas require the freedom of response provided by essay questions. The essay emphasizes the integration and application of thinking and problem-solving skills. However, its most serious limitation is the unreliability of the scoring. Another closely related limitation is the amount of time required for scoring the answers. A series of studies has shown that answers to essay questions are scored differently by different teachers, and that even the same teachers score the answers differently at different times. One teacher stresses factual content, another the organization of ideas, and another writing skills. With each teacher evaluating the degree to which different learning outcomes are achieved, it is not surprising that their scoring diverges so widely. Scoring reliability can be increased by clearly defining the outcomes to be measured, properly framing the questions, carefully following scoring rules, and obtaining practice in scoring.

Suggestions for constructing essay questions:
A prompt for the essay is presented in the form of a mini-text that the respondents need to understand and operationalize. We need to give careful consideration to the instructions the respondents attend to in accomplishing the testing tasks.
1. Instructions should be brief, but explicit.
2. Be specific about the form the answers are to take; if possible, present a sample question and answer.
3. Be informative as to the value of each item and section of the assessment instrument, the time allowed for the test, and whether speed is a factor.
4. Formulate questions that will call forth the behavior specified in the learning outcomes.
5. Phrase each question so that the student's task is clearly defined. (Otherwise, incorrect answers may be due to misinterpretation rather than lack of achievement.)
(e.g.) Write a one-page statement defending the importance of conserving our natural resources. Your answer will be evaluated in terms of its organization, comprehensiveness, and the relevance of the arguments presented. (30%, 30 minutes)
Describe the similarities and differences between ... (comparing)
What are the major causes of ... (cause and effect)
Briefly summarize the contents of ... (summarizing)
Describe the strengths and weaknesses of the following ... (evaluating)

Appraising tests:
Instead of discarding a classroom test after it has been administered and the students have discussed the results, a better approach is to appraise the effectiveness of the test items and to build a file of high-quality items for future use. Scoring tests is not the final step in the evaluation process; scores are arbitrary, and the main concern is the interpretation of the scores: (p.98)

Raw score: the score obtained directly by tallying up all the items answered correctly; usually not easy to interpret.
Percentage score: the number of items that a student answered correctly divided by the total number of items on the test.
Percentile: a number that tells what percent of individuals within the specified norm group scored lower than the raw score of a given student.
Mean score: the average score of a given group of students; the students' scores added together, divided by the number of scores involved.

Item difficulty: the ratio of correct responses to total responses for a given test item. A norm-referenced assessment (which aims to differentiate between high and low achievers) should have items that approximately 60% to 80% of the respondents answer correctly. A criterion-referenced assessment (which aims to determine whether students have achieved the objectives of a course) should aim for an item difficulty of 90% or better.
Formula: P = R / N x 100
(P = item difficulty, R = the number of students who got the item right, N = the total number of students who tried the item)

Item discrimination / item discriminating power: how well an item performs in separating the better students from the weaker ones. (While the item-difficulty index focuses on how the items fare, item discrimination looks at how the respondents fare from item to item.) An item-discrimination level of .30 or above is generally agreed to be desirable.
Formula: D = (Ru - Rl) / (T / 2)
(D = item discriminating power, Ru = the number of students in the upper group who got the item right, Rl = the number of students in the lower group who got the item right, T = the total number of students included in the item analysis)

An item with maximum positive discriminating power is one in which all students in the upper group get the item right and all students in the lower group get it wrong: D = (10 - 0) / 10 = 1. An item with no discriminating power is one in which an equal number of students in both groups gets the item right: D = (10 - 10) / 10 = 0. It is also possible for an item to show negative discriminating power, that is, for more students in the lower group than in the upper group to get the item right. Such items should be revised so that they discriminate positively, or they should be discarded.
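A minimal sketch in Python of both formulas, applied to one column of an invented response matrix (rows are ten students ranked from highest to lowest total score; 1 means the item was answered correctly):

    def item_difficulty(item_responses):
        # P = R / N x 100: percent of all students answering the item correctly.
        return 100 * sum(item_responses) / len(item_responses)

    def item_discrimination(item_responses):
        # D = (Ru - Rl) / (T / 2), using the upper and lower halves of the ranked group.
        half = len(item_responses) // 2
        upper, lower = item_responses[:half], item_responses[-half:]
        return (sum(upper) - sum(lower)) / half

    item = [1, 1, 1, 1, 1, 1, 0, 1, 0, 0]  # responses, best students first

    print(item_difficulty(item))      # 70.0: suits a norm-referenced test (60-80%)
    print(item_discrimination(item))  # 0.6: well above the .30 guideline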
Piloting of assessment instruments:
Ideally, an assessment instrument that is intended to perform an important function in an institution would undergo piloting on a small sample of respondents similar to those for whom it is designed. The pilot administration provides the assessor with feedback on the items and procedures, and the assessor can gain valuable insights about which parts of the instrument need to be revised before it is administered to a larger group.

Assessing speaking skills:
If students are asked to participate in communicative, open-ended activities in the classroom, then it is hypocritical to assess their progress with discrete-point grammar tests. The test should be designed to give students a real-life, culturally authentic task.
1. Interviews: greeting, warm-up chat, close-up.
2. Pair discourse
3. Group oral
4. Tape recording
A. Oral descriptions of visuals: Visuals are appropriate conversation stimuli at the novice level, not only because they provide a psychological prop, but also because they facilitate listing and identifying tasks for the students. There may be many possible appropriate answers. Sources of visuals: the teacher's or students' personal slides and photos, yearbook pictures, magazine pictures.
B. Role-play: To function in a "survival situation" that students might encounter in real life.
(e.g.) Situation cards, on which role-playing instructions for the learners are listed:
1. When you see two of your friends at the mall, you decide to invite them to your birthday party. Tell them when and where it is, what you will do at the party, how many people will be there, and any other details you think they would be interested in knowing.
2. Leave a message on the answering machine with the following information:
- Leave your name and the time you called.
- Tell the person where you are going tonight.
- Tell the person you'll see him/her tomorrow at a particular place and time.

Assessing reading comprehension: (p.211)
When readers approach a text on the basis of the prior content, language, and textual schemata they may have with regard to that particular text, this is referred to as top-down reading. When readers focus exclusively on what is present in the text itself, especially the words and sentences of the text, this is referred to as bottom-up reading. Successful learners usually display a combination of top-down and bottom-up reading.

Test constructors and users of assessment instruments should be aware of the skills tested by reading comprehension questions. There are numerous taxonomies of such skills, for example:
1. The recognition of words and phrases of similar or opposing meaning.
2. The identifying or locating of information.
3. The discriminating of elements or features within context; the analysis of elements within a structure and of the relationships among them (e.g. causal, sequential, chronological, hierarchical).
4. The interpreting of complex ideas, actions, events, and relationships.
5. Inferring: deriving conclusions and predicting continuations.
6. Synthesis.
7. Evaluation.

Testing methods:
1. Fixed-response format: multiple-choice.
2. Structured-response format: cloze.
The cloze test is extensively used as a completion measure, ideally aimed at tapping reading skills interactively, with respondents using cues from the text in a bottom-up fashion as well as bringing their background knowledge to bear on the task.
(e.g.) I am one of ______ people who simply cannot ______ up. To me, the ______ of an orderly living ______ is as remote and as ______ as trying to climb ______ Fuji.
The cloze has been used as a measure of readability, global reading skills, grammar, and writing. It can be scored according to an exact-word, acceptable-word, or multiple-choice approach. (p.139)

Assessing listening skills:
The following are examples of listening comprehension items and procedures, ranging from the most discrete-point items to more integrative assessment tasks. The teacher must decide when it is appropriate to use any of these approaches for assessing listening in the classroom.
1. Discrimination of sounds: Sound-discrimination items are of particular benefit in assessing points of contrast between two given languages.
(e.g.) The respondents indicate which sound of three is different from the other two: (1) sun, (2) put, (3) dug
2. Listening for grammatical distinctions: The respondents could listen for an inflectional marker, or determine whether the subject and verb are in the singular or the plural.
(e.g.) The boys sing well: (1) singular, (2) plural, (3) same form for singular and plural
3. Listening for vocabulary: The respondents perform an action in response to a command (e.g. getting up, walking to the window) or draw a picture according to oral instructions (e.g. coloring a picture a certain way).
4. Auditory comprehension:
a. The respondents hear a statement and must indicate the appropriate paraphrase for the statement.
(e.g.) What'd you get yourself into this time?
(1) What are you wearing this time? (2) What did you buy this time? (3) What's your problem this time?
b. The respondents listen in on a telephone conversation and at appropriate times must indicate what they would say if they were one of the speakers in the conversation.
(e.g.) Mother: Well, Mary, you know you were supposed to call me last week.
Mary: I know, Mom, but I got tied up.
Mother: That's really no excuse.
Mary: (1) Yes, I'll call him. (2) You're right. I'm sorry. (3) I've really had nothing to do.
5. Communicative assessment: There has been a general effort in recent years to make communicative types of listening comprehension tests more authentic. It has been suggested that increased attention be paid to where the content of the assessment instrument falls along the oral/literate continuum, from a news broadcast, to a lecture, to a consultative dialogue.
a. Lecture task: The respondents hear a lecture, with filled pauses and other features that make it different from the oral recitation of a written text. After the lecture, tape-recorded multiple-choice, structured, or open-ended questions are presented, with responses to be written on the answer sheet.
b. Dictation: Dictation can serve as a measure of auditory comprehension if it is given at a fast enough pace that it is not simply a spelling test. Nonnative respondents must segment the sounds into words, phrases, and sentences.
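A minimal sketch in Python of scoring such a dictation at the word level rather than letter by letter, so that segmentation errors, not just spelling, lower the score (the sentences and the scoring choice are invented for illustration, not a prescribed procedure):

    import difflib

    def dictation_score(dictated, transcript):
        # Proportion of the dictated words the learner reproduced, in order.
        a, b = dictated.lower().split(), transcript.lower().split()
        matcher = difflib.SequenceMatcher(None, a, b)
        matched = sum(block.size for block in matcher.get_matching_blocks())
        return matched / len(a)

    dictated = "the boys sing well every morning before class"
    transcript = "the boy sing well every morning for class"
    print(round(dictation_score(dictated, transcript), 2))  # 0.75

Scoring on words rather than characters keeps the task a measure of auditory segmentation and comprehension rather than a spelling test.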
Assessing writing skills:
Two principal types of scoring scales are used for rating writing:
1. Holistic scoring: A single score is assigned to a student's overall writing performance. This is basically what teachers do when they assign number or letter grades to students' compositions. Holistic scores represent teachers' overall impressions and judgments. As such, they can serve as general incentives for learning, and they can distinguish students with respect to their general achievement in writing. However, because they provide no detailed information about specific aspects of performance (e.g. grammatical ability), they are not very useful in guiding teaching and learning.
2. Analytic scoring: Different components or features of the students' responses are given separate scores (on an essay, spelling, grammar, organization, and punctuation might be scored separately). Individual analytic scores are sometimes added together to yield a total score. The scoring categories included in an analytic system should reflect instructional objectives and plans, and the levels of performance set for each category generally reflect teachers' expectations, based on past experience. Analytic scoring provides useful feedback to students and diagnostic information to teachers about the specific areas of performance that are satisfactory or unsatisfactory. This information can be useful for planning instruction and studying.
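A minimal sketch in Python of an analytic scoring record (the categories and point weights are invented; an actual system should reflect the instructional objectives, as noted above):

    max_points = {"content": 30, "organization": 20, "grammar": 20,
                  "vocabulary": 15, "spelling/punctuation": 15}

    essay = {"content": 24, "organization": 15, "grammar": 14,
             "vocabulary": 12, "spelling/punctuation": 13}

    for category, score in essay.items():
        print(f"{category}: {score}/{max_points[category]}")  # feedback per category
    print(f"total: {sum(essay.values())}/{sum(max_points.values())}")  # 78/100

Unlike a single holistic grade, the per-category lines tell the student and the teacher exactly where performance is satisfactory or unsatisfactory.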
Alternative assessment methods:
Not only is assessment reflecting more of an integrative approach, but it has also become clear that the assessment of language benefits from the use of multiple means over time. Alternative assessment methods, including observation, portfolios, conferences, and dialogue journals, have led to the incorporation of the results they provide into students' grades.

Observation:
Observation is an integral part of everyday teaching. Teachers continuously observe their students' language use during formal instruction or while the students are working individually at their desks; teachers may also arrange individual conference times during which they observe students carefully on a one-to-one basis. It is important to identify why you want to observe and what kinds of decisions you want to be able to make based on your observations. A number of decisions need to be made when planning observation:
1. Why do you want to observe, and what decisions do you want to make as a result of your observations?
2. What aspects of teaching or learning that are appropriate to these decisions do you want to observe?
3. Do you want to observe individual students, small groups of students, or the whole class?
4. Will you observe on one occasion or repeatedly?
5. How will you record your observations?
Three ways of recording classroom observations:
1. Anecdotal records
2. Checklists
3. Rating scales

Portfolios:
A portfolio is a purposeful collection of students' work that demonstrates to students and others their efforts, progress, and achievements in given areas. Portfolios provide a continuous record of students' language development that can be shared with others. If portfolios are reviewed routinely by teachers and students in conference together, they can also provide information about students' views of their own language learning and the strategies they apply in learning. This in turn can enhance student involvement in and ownership of their own learning. Portfolios can encourage students to assess their own strengths and weaknesses, and to identify their own goals for learning. Portfolio assessment is student-centered, collaborative, and holistic, in contrast to many assessment methods that treat students as objects of evaluation and place the responsibility and task of assessment in the hands of teachers.

A file folder, box, or any durable container can serve as a portfolio. Samples of writing, lists of books that have been read, book reports, tape recordings of speaking samples, favorite short stories, and so on can all be included.
1. Have students include a brief note describing why each piece is included in their portfolio, what they like about it, what they learned when they did it, and where there could be improvement.
2. During portfolio conferences, ask students to describe their current strengths and weaknesses and to indicate where they have made progress; ask them to give evidence of this progress.
3. Teachers should be interested, supportive, and constructive when responding to portfolio pieces and to students' reflections on their work.

Problems with portfolio assessment:
One problem is that teachers have a great deal of work to do in this approach to assessment, and there is the issue of how much time they are willing to devote to the endeavor. Another problem is that the approach may make teachers anxious, since any failures will reflect on them. In addition, the emphasis on revising has been viewed as pampering students too much and letting lazy students get by.

Conferences:
Conferences can be used as part of evaluation, and generally take the form of a conversation or discussion between teachers and students about school work. Conferences can include individual students, several students, or even the whole class. They can be conversations about completed work, or about work in progress. At all times, students must feel that the conference is under their control and for their benefit. Begin by having students review the work for you; permit them to comment on whatever is important from their point of view, even though it might not seem so to you. To facilitate discussion, the following kinds of questions can be asked of students:
1. What do you like about this work?
2. What do you think you did well?
3. How does it show improvement from previous work? Can you show me the improvement?
4. Did you have any difficulties with this piece of work? Can you show me where you had difficulty? What did you do to overcome it?
5. What strategies did you use to figure out the meaning of words you could not read? Or what did you do when you did not know a word that you wanted to write?
6. Are there things about this work you do not like? Are there things you would like to improve?
Conduct your conferences with each student on a regular basis throughout the year or course in order to monitor progress, identify difficulties that might be impeding progress, and plan lessons or instruction that is responsive to students' ongoing needs. We do not recommend using conferences for grading purposes, because grading generally focuses on learning outcomes or achievement, whereas the primary focus of conferences is process.

Dialogue journals:
Journals are written conversations between students and teachers. They provide opportunities for students to give feedback about their learning experiences. There are a number of important benefits:
1. They provide useful information for individualizing instruction, for example about:
a. writing skills
b. writing strategies
c. students' experiences in and outside of school
d. learning processes
e. attitudes and feelings about themselves, their teachers, and schooling
f. students' interests, expectations, and goals
2. They increase opportunities for functional communication between students and teachers.
3. They give students opportunities to use language for genuine communication and personalized reading.
4. They permit teachers to individualize language teaching by modeling writing skills in their responses to student journals.
5. They promote the development of certain writing skills.
6. They enhance student involvement in and ownership of learning.

The following guidelines for using journals are suggested:
1. Collect students' journals on a regular basis and read them carefully before returning them. Keep the interval between readings as brief as possible; otherwise, students may perceive the feedback contained in their journals as unimportant and the journal as merely a writing exercise.
2. Encourage students to write about their successes as well as their difficulties and hardships. Similarly, encourage them to write about classroom activities and events that they found useful, effective, and fun, as well as those they found confusing, useless, uninteresting, or frustrating.
3. Be patient and allow students time to develop confidence in the process of sharing their personal impressions.
4. Avoid evaluative or judgmental comments, to preserve students' confidence and candor.

Self-assessment: Having the learners evaluate their own performance. Learners could be asked to rate their ability to perform specific functional tasks, or to indicate whether they would be able to respond successfully to given assessment items or procedures. Careful self-assessment by students in a classroom can be one of the means of multiple assessment. (p.197)

Reporting results:
We need to deemphasize the significance of certain standardized test scores by complementing them with a host of ongoing and comprehensive assessment measures, including portfolios, self-assessment, observations of student performance in class, student journals, and other forms of assessment. Having a variety of information about teaching and learning can enhance the reliability of your assessments and the validity of your decision making.

Appendix:
1. Halo effect: Rating an examinee high on one trait or characteristic simply because he or she rates high on other characteristics.
2. Hawthorne effect (cf. the placebo effect): The influence of the researcher's presence on the outcome of a study (as in the classic studies of lighting and worker productivity: the attention itself increases productivity). If a new teaching method or textbook is used, there may be an improvement in learning that is due not to the method or textbook but to the fact that it is new. To avoid the Hawthorne effect, do not emphasize the experiment or observation once the new treatment becomes routine.
3. John Henry effect (improved control-group performance): The tendency of control-group subjects who know they are in an experiment to exert extra effort and hence perform above their expected average (after the legend of John Henry, the railroad worker who raced a steam drill in laying railroad tracks). To avoid it, collect baseline data before the experiment is introduced; measurement should also occur after the experiment is over.