testing

LANGUAGE ASSESSMENT
ELT Teacher Training
Tarık İNCE

CHAPTER 1 TESTING, ASSESSING AND TEACHING

In an era of communicative language teaching:
· Tests should measure up to standards of authenticity and meaningfulness.
· Teachers should design tests that serve as motivating learning experiences rather than anxiety-provoking threats.

Tests
· should be positive experiences
· should build a person’s confidence and become learning experiences
· should bring out the best in students
· shouldn’t be degrading
· shouldn’t be artificial
· shouldn’t be anxiety-provoking

Language assessment aims to create more authentic, intrinsically motivating assessment procedures that are appropriate for their context and designed to offer constructive feedback to students.

What is a test?
A test is a method of measuring a person’s ability, knowledge or performance in a given domain.

1. Method
A set of techniques, procedures or items. To qualify as a test, the method must be explicit and structured, for example:
· Multiple-choice questions with prescribed correct answers
· A writing prompt with a scoring rubric
· An oral interview based on a question script and a checklist of expected responses to be filled in by the administrator

2. Measure
A means of offering the test-taker some kind of result. If an instrument does not specify a form of reporting measurement, then that technique cannot be defined as a test. Scoring may take forms such as the following:
· A classroom-based short-answer essay test may earn the test-taker a letter grade accompanied by the instructor’s marginal comments.
· Large-scale standardized tests provide a total numerical score, a percentile rank, and perhaps some sub-scores.

3. The test-taker (the individual)
The person who takes the test. Testers need to understand:
· who the test-takers are
· what their previous experience and background are
· whether the test is appropriately matched to their abilities
· how test-takers should interpret their scores

4. Performance
A test measures performance, but the results imply the test-taker’s ability or competence.
Some language tests measure one’s ability to perform language: to speak, write, read or listen to a subset of language. Others measure a test-taker’s knowledge about language: defining a vocabulary item, reciting a grammatical rule or identifying a rhetorical feature in written discourse.

5. Measuring a given domain
This means measuring the desired criterion and not including other factors.
· Proficiency tests: Even though the actual performance on the test involves only a sampling of skills, the domain is overall proficiency in a language, that is, general competence in all skills of a language.
· Classroom-based performance tests: These have more specific criteria. For example, a test of pronunciation might well be a test of only a limited set of phonemic minimal pairs, and a vocabulary test may focus on only the set of words covered in a particular lesson.

A well-constructed test is an instrument that provides an accurate measure of the test-taker’s ability within a particular domain.

TESTING, ASSESSMENT & TEACHING

TESTING
· Tests are prepared administrative procedures that occur at identifiable times in a curriculum.
· When tested, learners know that their performance is being measured and evaluated.
· When tested, learners muster all their faculties to offer peak performance.
· Tests are a subset of assessment. They are only one among many procedures and tasks that teachers can ultimately use to assess students.
· Tests are usually time-constrained (usually spanning a class period or at most several hours) and draw on a limited sample of behaviour.

ASSESSMENT
· Assessment is an ongoing process that encompasses a much wider domain.
· A good teacher never ceases to assess students, whether those assessments are incidental or intended.
· Whenever a student responds to a question, offers a comment, or tries out a new word or structure, the teacher subconsciously makes an assessment of the student’s performance.
· Assessment includes testing.
· Assessment is more extended and includes many more components.

What about TEACHING?
For optimal learning to take place, learners must have opportunities to “play” with language without being formally graded. Teaching sets up the practice games of language learning: the opportunities for learners to listen, think, take risks, set goals, and process feedback from the teacher (coach), and then recycle through the skills that they are trying to master.
During these practice activities, teachers are indeed observing students’ performance and making various evaluations of each learner. It can therefore be said that testing and assessment are subsets of teaching.

ASSESSMENT

Informal Assessment
Informal assessments are incidental, unplanned comments and responses. Examples include “Nice job!”, “Well done!”, “Good work!”, “Did you say can or can’t?”, “Broke or break?”, or putting a ☺ on some homework.
Classroom tasks are designed to elicit performance without recording results or making fixed judgements about a student’s competence.
Examples of unrecorded assessment:
· marginal comments on papers
· responding to a draft of an essay
· advice about how to better pronounce a word
· a suggestion for a strategy for compensating for a reading difficulty
· showing how to modify a student’s note-taking to better remember the content of a lecture

Formal Assessment
Formal assessments are exercises or procedures specifically designed to tap into a storehouse of skills and knowledge. They are systematic, planned sampling techniques constructed to give teachers and students an appraisal of student achievement. They are the tournament games that occur periodically in the course of teaching.
It can be said that all tests are formal assessments, but not all formal assessment is testing.
Example 1: A student’s journal or portfolio of materials can be used as a formal assessment of the attainment of certain course objectives, but it is problematic to call those two procedures “tests”.
Example 2: A systematic set of observations of a student’s frequency of oral participation in class is certainly a formal assessment, but not a “test”.

THE FUNCTION OF AN ASSESSMENT

Formative Assessment
· It evaluates students in the process of “forming” their competencies and skills, with the goal of helping them to continue that growth process.
· It supports the ongoing development of the learner’s language.
· Example: When you give students a comment or a suggestion, or call attention to an error, that feedback is offered to improve the learner’s language ability.
· Virtually all kinds of informal assessment are formative.

Summative Assessment
· It aims to measure, or summarize, what a student has grasped, and typically occurs at the end of a course.
· It does not necessarily point the way to future progress.
· Example: Final exams in a course and general proficiency exams.
· All tests/formal assessments (quizzes, periodic review tests, midterm exams, etc.) are summative.

IMPORTANT: As far as summative assessment is concerned, in the aftermath of any test students tend to think, “Whew! I’m glad that’s over. Now I don’t have to remember that stuff anymore!” An ideal teacher should try to change this attitude among students. A teacher should:
· instill a more formative quality into his or her lessons
· offer students an opportunity to convert tests into “learning experiences”.

TESTS

Norm-Referenced Tests
Each test-taker’s score is interpreted in relation to a mean (average score), median (middle score), standard deviation (extent of variance in scores), and/or percentile rank. The purpose is to place test-takers along a mathematical continuum in rank order. Scores are usually reported back to the test-taker in the form of a numerical score (230 out of 300, 84%, etc.).
Typical of these tests are standardized tests such as SAT, TOEFL, ÜDS, KPDS, DS, etc. These tests are intended to be administered to large audiences, with results efficiently disseminated to test-takers.
They must have fixed, predetermined responses in a format that can be scored quickly at minimum expense. Money and efficiency are primary concerns in these tests.

Criterion-Referenced Tests
They are designed to give test-takers feedback, usually in the form of grades, on specific course or lesson objectives. Tests that involve the students in only one class, and are connected to a curriculum, are criterion-referenced tests.
Much time and effort on the part of the teacher are required to deliver useful, appropriate feedback to students. The distribution of students’ scores across a continuum may be of little concern as long as the instrument assesses appropriate objectives.
Where the emphasis is on classroom-based testing, as opposed to standardized, large-scale testing, criterion-referenced testing is of more prominent interest than norm-referenced testing.

Approaches to Language Testing: A Brief History
Historically, language-testing trends have followed the trends of teaching methods.
· During the 1950s: An era of behaviourism and special attention to contrastive analysis. Testing focused on specific language elements such as phonological, grammatical, and lexical contrasts between two languages.
· During the 1970s and 80s: Communicative theories were widely accepted, bringing a more integrative view of testing.
· Today: Test designers are trying to form authentic, valid instruments that simulate real-world interaction.

APPROACHES TO LANGUAGE TESTING

A) Discrete-Point Testing
Language can be broken down into its component parts, and those parts can be tested successfully.
· Component parts: listening, speaking, reading and writing.
· Units of language (discrete points): phonology, graphology, morphology, lexicon, syntax and discourse.
A language proficiency test should sample all four skills and as many linguistic discrete points as possible.

B) Integrative Testing
Language competence is a unified set of interacting abilities that cannot be tested separately. Communicative competence is global and requires such integration that it cannot be captured in additive tests of grammar, reading, vocabulary, and other discrete points of language.
Two types of tests are cited as examples of integrative tests: the cloze test and dictation.

Unitary trait hypothesis: It suggests an “indivisible” view of language proficiency: that vocabulary, grammar, phonology, the “four skills”, and other discrete points of language cannot be disentangled from each other in language performance. In the face of evidence from a study in which each student scored differently in various skills depending on his or her background, country and major field, Oller admitted that the “unitary trait hypothesis was wrong.”

Cloze Test: Cloze test results are said to be good measures of overall proficiency. The ability to supply appropriate words in blanks requires a number of abilities that lie at the heart of competence in a language: knowledge of vocabulary, grammatical structure, discourse structure, reading skills and strategies. It was argued that successful completion of cloze items taps into all of those abilities, which were said to be the essence of global language proficiency.

Dictation: Essentially, learners listen to a passage of 100 to 150 words read aloud by an administrator (or played from an audiotape) and write what they hear, using correct spelling. Supporters argue that dictation is an integrative test because success on a dictation requires careful listening, reproduction in writing of what is heard, efficient short-term memory and, to an extent, some expectancy rules to aid the short-term memory.

C) Communicative Language Testing (a recent approach, after the mid-1980s)
What does it criticise?
In order for a particular language test to be useful for its intended purposes, test performance must correspond in demonstrable ways to language use in non-test situations. Integrative tests such as cloze only tell us about a candidate’s linguistic competence. They do not tell us anything directly about a student’s performance ability (they capture knowledge about a language, not the use of language).

Any suggestion?
A quest for authenticity began as test designers centered on communicative performance. Supporters emphasized the importance of strategic competence (the ability to employ communicative strategies to compensate for breakdowns as well as to enhance the rhetorical effect of utterances) in the process of communication.

Any problem in using this approach?
Yes. Communicative testing presented challenges to test designers as they began to identify the real-world tasks that language learners were called upon to perform. It was clear that the contexts for those tasks were extraordinarily varied and that the sampling of tasks for any one assessment procedure needed to be validated by what language users actually do with language.
As a result, the assessment field became more and more concerned with the authenticity of tasks and the genuineness of texts.

D) Performance-Based Assessment
Performance-based assessment of language typically involves oral production, written production, open-ended responses, integrated performance (across skill areas), group performance, and other interactive tasks.

Any problems?
It is time-consuming and expensive, but those extra efforts are paying off in more direct testing, because students are assessed as they perform actual or simulated real-world tasks.

The advantage of this approach?
Higher content validity is achieved because learners are measured in the process of performing the targeted linguistic acts.
IMPORTANT: Performance-based assessment means that teachers should rely a little less on formally structured tests and a little more on evaluation while students are performing various tasks.
In performance-based assessment:
· IN ☺: interactive tests (speaking, requesting, responding, etc.)
· OUT: paper-and-pencil tests
Result: In interactive tests, tasks can approach the authenticity of real-life language use.

CURRENT ISSUES IN CLASSROOM TESTING
The design of communicative, performance-based assessment continues to challenge both assessment experts and classroom teachers. Three issues are helping to shape our current understanding of effective assessment:
· the effect of new theories of intelligence on the testing industry
· the advent of what has come to be called “alternative assessment”
· the increasing popularity of computer-based testing

New Views on Intelligence
In the past:
· Intelligence was viewed strictly as the ability to perform linguistic and logical-mathematical problem solving.
· For many years, we lived in a world of standardized, norm-referenced tests that are timed, in a multiple-choice format consisting of a multiplicity of logic-constrained items, many of which are inauthentic.
· We relied on timed, discrete-point, analytical tests in measuring language.
· We were forced to stay within the limits of objectivity and give impersonal responses.
Recently, further kinds of intelligence have been recognized:
· spatial intelligence
· musical intelligence
· bodily-kinesthetic intelligence
· interpersonal intelligence
· intrapersonal intelligence
EQ (Emotional Quotient) underscores the role of emotions in our cognitive processing. Those who manage their emotions tend to be more capable of fully intelligent processing, because anger, grief, resentment, and other feelings can easily impair peak performance in everyday tasks as well as in higher-order problem solving.
The intuitive appeal of these conceptualizations of intelligence infused the 1990s with a sense of both freedom and responsibility in our testing agenda.
Our challenge was to test interpersonal, creative, communicative, interactive skills and, in doing so, to place some trust in our subjectivity and intuition.

Traditional and “Alternative” Assessment

Traditional Assessment
- One-shot, standardized exams
- Timed, multiple-choice format
- Decontextualized test items
- Scores suffice for feedback
- Norm-referenced scores
- Focus on the “right” answer
- Summative
- Oriented to product
- Non-interactive process
- Fosters extrinsic motivation

Alternative Assessment
- Continuous long-term assessment
- Untimed, free-response format
- Contextualized communicative tests
- Individualized feedback and washback
- Criterion-referenced scores
- Open-ended, creative answers
- Formative
- Oriented to process
- Interactive process
- Fosters intrinsic motivation

IMPORTANT: It is difficult to draw a clear line of distinction between traditional and alternative assessment. Many forms of assessment fall in between the two, and some combine the best of both.
More time and higher institutional budgets are required to administer and score assessments that presuppose more subjective evaluation, more individualization, and more interaction in the process of offering feedback. But the payoff of “alternative assessment” comes with more useful feedback to students, the potential for intrinsic motivation, and ultimately a more complete description of a student’s ability.

Computer-Based Testing
Some computer-based tests are small-scale. Others are standardized, large-scale tests (e.g. TOEFL) in which thousands of test-takers are involved.
One type of computer-based test is the Computer-Adaptive Test (CAT). In a CAT, the test-taker sees only one question at a time, and the computer scores each question before selecting the next one. Test-takers cannot skip questions, and once they have entered and confirmed their answers, they cannot return to questions.
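The one-question-at-a-time flow described above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the Item class, the 1 to 5 difficulty scale, and the simple "harder after a correct answer, easier after an incorrect one" rule are all hypothetical and are not taken from any real test. Operational CATs generally rely on more elaborate statistical models to choose the next item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    prompt: str
    answer: str
    difficulty: int  # hypothetical scale: 1 (easiest) to 5 (hardest)


def pick_next(bank, used, target):
    """Return the unused item whose difficulty is closest to the target."""
    candidates = [item for item in bank if item not in used]
    if not candidates:
        return None
    return min(candidates, key=lambda item: abs(item.difficulty - target))


def run_cat(bank, respond, start_level=3, max_items=4):
    """Administer items one at a time and return a list of (item, correct)."""
    used, log, level = [], [], start_level
    while len(log) < max_items:
        item = pick_next(bank, used, level)
        if item is None:
            break
        used.append(item)  # once answered, an item is never shown again
        # Score the confirmed answer before choosing the next item.
        correct = respond(item).strip().lower() == item.answer
        log.append((item, correct))
        # Assumed staircase rule: one level harder after a correct answer,
        # one level easier after an incorrect one, kept within 1 to 5.
        level = min(5, level + 1) if correct else max(1, level - 1)
    return log
```

Here respond stands for whatever collects the test-taker’s confirmed answer (for example, a function that wraps input()). Because each item is scored inside the loop before the next one is picked, the sketch mirrors the two constraints in the description: no skipping ahead and no returning to earlier questions.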
Advantages of Computer-Based Testing:
o Classroom-based testing
o Self-directed testing on various aspects of a language (vocabulary, grammar, discourse, etc.)
o Practice for upcoming high-stakes standardized tests
o Some individualization, in the case of CATs
o Electronic scoring for rapid reporting of results

Disadvantages of Computer-Based Testing:
o Lack of security and the possibility of cheating in unsupervised computerized tests
o Home-grown quizzes may be mistaken for validated assessments
o Open-ended responses are less likely to appear because of the need for human scorers
o The human interactive element is absent

AN OVERALL SUMMARY

Tests
· Assessment is an integral part of the teaching-learning cycle.
· In an interactive, communicative curriculum, assessment is almost constant.
· Tests can provide authenticity, motivation, and feedback to the learner.
· Tests are essential components of a successful curriculum and learning process.

Assessments
· Periodic assessments can increase motivation as milestones of student progress.
· Appropriate assessments aid in the reinforcement and retention of information.
· Assessments can confirm areas of strength and pinpoint areas needing further work.
· Assessments provide a sense of periodic closure to modules within a curriculum.
· Assessments promote student autonomy by encouraging self-evaluation of progress.
· Assessments can spur learners to set goals for themselves.
· Assessments can aid in evaluating teaching effectiveness.

Decide whether the following statements are TRUE or FALSE.
1. It’s possible to create authentic and motivating assessment to offer constructive feedback to the students.
2. All tests should offer the test-takers some kind of measurement or result.
3. Performance-based tests measure test-takers’ knowledge about language.
4. Tests are the best tools to assess students.
5. Assessment and testing are synonymous terms.
6. Teachers’ incidental and unplanned comments and responses to students are an example of formal assessment.
7. Most of our classroom assessment is summative assessment.
8. Formative assessment always points toward future formation of learning.
9. The distribution of students’ scores across a continuum is a concern in norm-referenced tests.
10. Criterion-referenced testing has more instructional value than norm-referenced testing for classroom teachers.

Answers:
1. TRUE
2. TRUE
3. FALSE (They are designed to test actual use of language, not knowledge about language.)
4. FALSE (We cannot say they are the best, but they are one of the useful devices to assess students.)
5. FALSE (They are not.)
6. FALSE (They are informal assessment.)
7. FALSE (It is formative assessment.)
8. TRUE
9. TRUE
10. TRUE

CHAPTER 2 PRINCIPLES OF LANGUAGE ASSESSMENT

There are five criteria for “testing a test”:
1. Practicality
2. Reliability
3. Validity
4. Authenticity
5. Washback

1. PRACTICALITY
A practical test
· is not excessively expensive,
· stays within appropriate time constraints,
· is relatively easy to administer, and
· has a scoring/evaluation procedure that is specific and time-efficient.

For a test to be practical
· administrative details should be clearly established before the test,
· students should be able to complete the test reasonably within the set time frame,
· the test should be able to be administered smoothly (it should not bog test-takers down in procedure),
· all materials and equipment should be ready,
· the cost of the test should be within budgeted limits,
· the scoring/evaluation system should be feasible in the teacher’s time frame,
· methods for reporting results should be determined in advance.

2. RELIABILITY
A reliable test is consistent and dependable.
The issue of reliability of a test may best be addressed by considering a number of factors that may contribute to the unreliability of a test.
Consider the following possibilities: fluctuations
· in the student (Student-Related Reliability),
· in scoring (Rater Reliability),
· in test administration (Test Administration Reliability), and
· in the test itself (Test Reliability).

Student-Related Reliability:
Temporary illness, fatigue, a bad day, anxiety, and other physical or psychological factors may make an “observed” score deviate from one’s “true” score. A test-taker’s “test-wiseness”, or strategies for efficient test taking, can also be included in this category.

Rater Reliability:
Human error, subjectivity, lack of attention to scoring criteria, inexperience, inattention, or even preconceived biases may enter into the scoring process.
Inter-rater unreliability occurs when two or more scorers yield inconsistent scores for the same test.
Intra-rater unreliability arises from unclear scoring criteria, fatigue, bias toward particular “good” and “bad” students, or simple carelessness.
One solution to such intra-rater unreliability is to read through about half of the tests before rendering any final scores or grades, and then to recycle back through the whole set of tests to ensure an even-handed judgment.
The careful specification of an analytical scoring instrument can increase rater reliability.

Test Administration Reliability:
Unreliability may also result from the conditions in which the test is administered: street noise, photocopying variations, poor light, temperature, and the condition of desks and chairs.

Test Reliability:
Sometimes the nature of the test itself can cause measurement errors. Timed tests may discriminate against students who do not perform well with a time limit. Poorly written test items may be a further source of test unreliability.

3. VALIDITY
Validity is the extent to which the assessment requires students to perform tasks that were included in the previous classroom lessons.
How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of evidence may be invoked in support.
· In some cases, it may be appropriate to examine the extent to which a test calls for performance that matches that of the course or unit of study being tested.
· In other cases, we may be concerned with how well a test determines whether or not students have reached an established set of goals or level of competence.
· In still other cases, it could be appropriate to study statistical correlation with other related but independent measures.
· Other concerns about a test’s validity may focus on the consequences of a test, beyond measuring the criteria themselves, or even on the test-taker’s perception of validity.
We will look at these five types of evidence below.

Content Validity:
If a test requires the test-taker to perform the behaviour that is being measured, it can claim content-related evidence of validity, often popularly referred to as content validity.
If you want to assess a person’s ability to speak the target language, asking students to answer paper-and-pencil multiple-choice questions requiring grammatical judgements does not achieve content validity.
For content validity to be achieved, the following conditions should be met:
· Classroom objectives should be identified and appropriately framed. The first measure of an effective classroom test is the identification of objectives.
· Lesson objectives should be represented in the form of test specifications. A test should have a structure that follows logically from the lesson or unit you are testing.
If you clearly perceive the performance of test-takers as reflective of the classroom objectives, then you can argue that content validity has probably been achieved.
To understand content validity, consider the difference between direct and indirect testing. Direct testing involves the test-taker in actually performing the target task. Indirect testing involves performing not the target task itself but a task that is related to it in some way.
Direct testing is the most feasible way to achieve content validity in assessment.

Criterion-Related Validity:
It examines the extent to which the criterion of the test has actually been reached.
For example, a classroom test designed to assess a point of grammar in communicative use will have criterion validity if test scores are corroborated either by observed subsequent behavior or by other communicative measures of the grammar point in question.
Criterion-related evidence usually falls into one of two categories:
· Concurrent validity: A test has concurrent validity if its results are supported by other concurrent performance beyond the assessment itself. For example, the validity of a high score on the final exam of a foreign language course will be substantiated by actual proficiency in the language.
· Predictive validity: The assessment criterion in such cases is not to measure concurrent ability but to assess (and predict) a test-taker’s likelihood of future success. For example, the predictive validity of an assessment becomes important in the case of placement tests, language aptitude tests, and the like.

Construct Validity:
Every issue in language learning and teaching involves theoretical constructs. In the field of assessment, construct validity asks, “Does this test actually tap into the theoretical construct as it has been identified?” (In other words: does the test really have the structural features needed to test the topic or skill I want to test?)
Imagine that you have been given a procedure for conducting an oral interview. The scoring analysis for the interview includes several factors in the final score: pronunciation, fluency, grammatical accuracy, vocabulary use, and sociolinguistic appropriateness. The justification for these five factors lies in a theoretical construct that claims those factors to be major components of oral proficiency.
So if you were asked to conduct an oral proficiency interview that evaluated only pronunciation and grammar, you could be justifiably suspicious about the construct validity of that test.
The exams we describe as “large-scale standardized tests” are not very satisfactory in terms of construct validity, because for reasons of practicality (that is, both time and cost) these tests cannot measure all the language skills that should be measured. For example, the absence of an oral production section in the TOEFL is a major obstacle to its construct validity.

Consequential Validity:
Consequential validity encompasses all the consequences of a test, including such considerations as its accuracy in measuring intended criteria, its impact on the preparation of test-takers, its effect on the learner, and the (intended and unintended) social consequences of a test’s interpretation and use.
McNamara (2000, p. 54) cautions against test results that may reflect socioeconomic conditions such as opportunities for coaching (private tutoring, special attention). For example, only some families can afford coaching, and children with more highly educated parents get help from their parents.
Teachers should consider the effect of assessments on students’ motivation, subsequent performance in a course, independent learning, study habits, and attitude toward school work.

Face Validity:
Face validity is the degree to which a test looks right, and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of test-takers.
· Face validity means that the students perceive the test to be valid. Face validity asks the question “Does the test, on the ‘face’ of it, appear from the learner’s perspective to test what it is designed to test?”
· Face validity is not something that can be empirically tested by a teacher or even by a testing expert. It depends on the subjective evaluation of the test-taker.
· A classroom test is not the time to introduce new tasks.
· If a test samples the actual content of what the learner has achieved or expects to achieve, face validity will be more likely to be perceived.
· Content validity is a very important ingredient in achieving face validity.
· Students will generally judge a test to be face valid if directions are clear, the structure of the test is organized logically, its difficulty level is appropriately pitched, the test has no “surprises”, and timing is appropriate.
· To give an assessment procedure that is “biased for best”, a teacher offers students appropriate review and preparation for the test, suggests strategies that will be beneficial, and structures the test so that the best students will be modestly challenged and the weaker students will not be overwhelmed.

4. AUTHENTICITY
In an authentic test
· the language is as natural as possible,
· items are as contextualized as possible,
· topics and situations are interesting, enjoyable and/or humorous,
· some thematic organization, such as through a story line or episode, is provided,
· tasks represent real-world tasks.
Reading passages are selected from real-world sources that test-takers are likely to have encountered or will encounter.
Listening comprehension sections feature natural language with hesitations, white noise, and interruptions.
More and more tests offer items that are “episodic” in that they are sequenced to form meaningful units, paragraphs, or stories.

5. WASHBACK
Washback includes the effects of an assessment on teaching and learning prior to the assessment itself, that is, on preparation for the assessment.
Informal performance assessment is by nature more likely to have built-in washback effects because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no washback if the students receive only a simple letter grade or a single overall numerical score.
Tests should serve as learning devices through which washback is achieved.
Students’ incorrect responses can become windows of insight into further work. Their correct responses need to be praised, especially when they represent accomplishments in a student’s interlanguage.
Washback enhances a number of basic principles of language acquisition: intrinsic motivation, autonomy, self-confidence, language ego, interlanguage, and strategic investment, among others.
To enhance washback, comment generously and specifically on test performance.
Washback implies that students have ready access to the teacher to discuss the feedback and evaluation he or she has given.
Teachers can raise the washback potential by asking students to use test results as a guide to setting goals for their future effort.

What is washback?
· In general terms: the effect of testing on teaching and learning.
· In large-scale assessment: the effects that tests have on instruction in terms of how students prepare for the test.
· In classroom assessment: the information that washes back to students in the form of useful diagnoses of strengths and weaknesses.

What does washback enhance?
· Intrinsic motivation
· Language ego
· Autonomy
· Interlanguage
· Self-confidence
· Strategic investment

What should teachers do to enhance washback?
· Comment generously and specifically on test performance
· Respond to as many details as possible
· Praise strengths
· Criticize weaknesses constructively
· Give strategic hints to improve performance

Decide whether the following statements are TRUE or FALSE.
1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Students, raters, the test, and its administration may affect the test’s reliability.
4. In indirect tests, students do not actually perform the task.
5. If students are aware of what is being tested when they take a test, and think that the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.
Answers: 1. TRUE 2. FALSE 3. TRUE 4. TRUE 5. TRUE 6. FALSE 7. TRUE 8. FALSE

Decide which type of validity each sentence belongs to.
1. It is based on subjective judgment.
2. It questions the accuracy of measuring the intended criteria.
3. It appears to measure the knowledge and abilities it claims to measure.
4. It measures whether the test meets the classroom objectives.
5. It requires the test to be based on a theoretical background.
6. Washback is part of it.
7. It requires the test-taker to perform the behavior being measured.
8. The students (test-takers) think they are given enough time to do the test.
9. It assesses a test-taker's likelihood of future success (e.g. placement tests).
10. The students' psychological mood may affect it negatively or positively.
11. It includes the consideration of the test's effect on the learner.
12. Items of the test do not seem to be complicated.
13. The test covers the objectives of the course.
14. The test has clear directions.
Answers: 1. Face 2. Consequential 3. Face 4. Content 5. Construct 6. Content 7. Criterion-related 8. Face 9. Criterion-related 10. Consequential 11. Consequential 12. Face 13. Content 14. Face

Decide which type of reliability each sentence is related to.
1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.
Answers: 1. Test reliability 2. Student-related reliability 3. Test administration reliability 4. Rater reliability 5. Test reliability 6. Test administration reliability 7. Student-related reliability 8. Rater reliability 9. Test administration reliability

CHAPTER 3 DESIGNING CLASSROOM LANGUAGE TESTS
In this chapter we examine test types and learn how to design tests and revise existing ones. To start the process of designing tests, we ask some critical questions. Five questions should form the basis of your approach to designing tests for your class.

Question 1: What is the purpose of the test?
· Why am I creating this test?
· For an evaluation of overall proficiency? (Proficiency Test)
· To place students into a course? (Placement Test)
· To measure achievement within a course? (Achievement Test)
Once you have established the major purpose of a test, you can determine its objectives.

Question 2: What are the objectives of the test?
· What specifically am I trying to find out?
· What language abilities are to be assessed?

Question 3: How will test specifications reflect both purpose and objectives?
· When a test is designed, the objectives should be incorporated into a structure that appropriately weights the various competencies being assessed.

Question 4: How will test tasks be selected and the separate items arranged?
· The tasks need to be practical.
· They should also achieve content validity by presenting tasks that mirror those of the course being assessed.
· They should be evaluated reliably by the teacher or scorer.
· The tasks themselves should strive for authenticity, and the progression of tasks ought to be biased for best performance.

Question 5: What kind of scoring, grading, and/or feedback is expected?
· Tests vary in the form and function of feedback, depending on their purpose.
· For every test, the way results are reported is an important consideration.
· Under some circumstances a letter grade or a holistic score may be appropriate; other circumstances may require that a teacher offer substantive washback to the learner.

TEST TYPES
Defining your purpose will help you choose the right kind of test, and it will also help you to focus on the specific objectives of the test. Below are the test types to be examined:
1. Language Aptitude Tests
2. Proficiency Tests
3. Placement Tests
4. Diagnostic Tests
5. Achievement Tests

1. Language Aptitude Tests
They predict a person’s success prior to exposure to the second language.
An aptitude test is designed to measure capacity or general ability to learn a foreign language.
Aptitude tests are designed to apply to the classroom learning of any language.
Two standardized aptitude tests have been used in the US: the Modern Language Aptitude Test (MLAT) and the Pimsleur Language Aptitude Battery (PLAB).
Tasks in the MLAT include number learning, phonetic script, spelling clues, words in sentences, and paired associates.
There is no unequivocal evidence that language aptitude tests predict communicative success in a language.
Any test that claims to predict success in learning a language is undoubtedly flawed, because we now know that with appropriate self-knowledge and active strategic involvement in learning, everyone can succeed eventually.

2. Proficiency Tests
A proficiency test is not limited to any one course, curriculum, or single skill in the language; rather, it tests overall ability.
It typically includes standardized multiple-choice items on grammar, vocabulary, reading comprehension, and aural comprehension. Sometimes a sample of writing is added, and more recent tests also include oral production.
Such tests often have content validity weaknesses.
Proficiency tests are almost always summative and norm-referenced.
They are usually not equipped to provide diagnostic feedback.
Their role is to accept or to deny someone’s passage into the next stage of a journey.
TOEFL is a typical standardized proficiency test.
Creating proficiency tests and validating them with research is a time-consuming and costly process.
Choosing one of the commercially available proficiency tests is a far more practical method for classroom teachers.

3. Placement Tests
The objective of a placement test is to correctly place students into a course or level.
Certain proficiency tests can act in the role of placement tests.
A placement test usually includes a sampling of the material to be covered in the various courses in a curriculum.
Students should find the test neither too easy nor too difficult, but challenging.
The ESL Placement Test (ESLPT) at San Francisco State University has three parts.
Part 1: students read a short article and then write a summary essay.
Part 2: students write a composition in response to an article.
Part 3: multiple-choice; students read an essay and identify grammar errors in it.
The ESLPT is more authentic but less practical, because human evaluators are required for the first two parts.
Reliability problems are present but are mitigated by conscientious training of the evaluators.
What is lost in practicality and reliability is gained in the diagnostic information that the ESLPT provides.

4. Diagnostic Tests
A diagnostic test is designed to diagnose specified aspects of a language.
A diagnostic test can help a student become aware of errors and encourage the adoption of appropriate compensatory strategies.
A test of pronunciation might diagnose phonological features that are difficult for students and that should become part of a curriculum.
Such tests offer a checklist of features for the administrator to use in pinpointing difficulties.
A writing diagnostic would elicit a writing sample from students that allows teachers to identify those rhetorical and linguistic features on which the course needs to focus special attention.
A diagnostic test of oral production was created by Clifford Prator (1972) to accompany a manual of English pronunciation. In the test, test-takers are directed to read a 150-word passage while they are tape-recorded.
The test administrator then refers to an inventory (a recorded list) of phonological items for analyzing a learner’s production.
After listening several times, the administrator produces a checklist of errors in five categories: stress and rhythm, intonation, vowels, consonants, and other factors.
This information helps teachers make decisions about which aspects of English phonology to focus on.

5. Achievement Tests
An achievement test is related directly to lessons, units, or even a total curriculum.
Achievement tests should be limited to particular material addressed in a curriculum within a particular time frame and should be offered after a course has focused on the objectives in question.
There is a fine line between a diagnostic test and an achievement test.
Achievement tests analyze the extent to which students have acquired language features that have already been taught. (They look back at what has been covered.)
Diagnostic tests should elicit information on what students need to work on in the future. (They look ahead.)
The primary role of an achievement test is to determine whether course objectives have been met, and appropriate knowledge and skills acquired, by the end of a period of instruction.
Achievement tests are often summative because they are administered at the end of a unit or term. But effective achievement tests can also provide useful washback by showing students their errors and helping them analyze their weaknesses and strengths.
Achievement tests range from five- or ten-minute quizzes to three-hour final examinations, with an almost infinite variety of item types and formats.

PRACTICAL STEPS IN CONSTRUCTING CLASSROOM TESTS
A) Assessing Clear, Unambiguous Objectives
Before giving a test, examine the objectives for the unit you are testing.
Your first task in designing a test, then, is to determine appropriate objectives. For example:
“Students will recognize and produce tag questions, with the correct grammatical form and final intonation pattern, in simple social conversations.”
B) Drawing Up Test Specifications
Test specifications will simply comprise
a) a broad outline of the test
b) what skills you will test
c) what the items will look like
This is an example for test specifications based on the objective stated above: “Students will recognize and produce tag questions, with the correct grammatical form and final intonation pattern, in simple social conversations.”

C) Devising Test Tasks
In devising test tasks, consider
· how students will perceive them (face validity)
· the extent to which authentic language and contexts are present
· potential difficulty caused by cultural schemata
In revising your draft, you should ask yourself some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple-choice item have appropriate distractors; that is, are the wrong items clearly wrong and yet sufficiently “alluring” that they aren’t ridiculously easy?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Do the sum of the items and the test as a whole adequately reflect the learning objectives?
In the final revision of your test,
· time yourself; if the test should be shortened or lengthened, make the necessary adjustments
· make sure your test is neat and uncluttered on the page
· if there is an audio component, make sure that the script is clear

D) Designing Multiple-Choice Test Items
There are a number of weaknesses in multiple-choice items:
· The technique tests only recognition knowledge.
· Guessing may have a considerable effect on test scores.
· The technique severely restricts what can be tested.
· It is very difficult to write successful items.
· Washback may be harmful.
· Cheating may be facilitated.
However, two principles support multiple-choice formats: practicality and reliability.
Some important terms for multiple-choice items:
Multiple-choice items are all receptive, or selective; that is, the test-taker chooses from a set of responses rather than creating a response. Other receptive item types include true-false questions and matching lists.
Every multiple-choice item has a stem, which presents a stimulus, and several options or alternatives to choose from.
One of those options, the key, is the correct response; the others serve as distractors.

IMPORTANT!!! Consider the following four guidelines for designing multiple-choice items for both classroom-based and large-scale situations:
1. Design each item to measure a specific objective. (Do not measure, for example, knowledge of modals and knowledge of articles in the same item.)
2. State both stem and options as simply and directly as possible. Do not use superfluous (unnecessary) words; another rule of succinctness is to remove needless redundancy from your options.
3. Make certain that the intended answer is clearly the only correct one. Eliminating unintended possible answers is often the most difficult problem of designing multiple-choice items. With only a minimum of context in each stem, a wide variety of responses may be perceived as correct.
4. Use item indices to accept, discard, or revise items. The appropriate selection and arrangement of suitable multiple-choice items on a test can best be accomplished by measuring items against three indices:
a) item facility (IF), or item difficulty
b) item discrimination (ID), or item differentiation
c) distractor analysis

a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed group of test-takers.
If 13 of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%). Values between 15% and 85% are acceptable.
Two good reasons for including a very easy item (85% or higher) are to build in some affective feelings of “success” among lower-ability students and to serve as warm-up items.
Very difficult items, in turn, can provide a challenge to high-ability students.

b) Item discrimination (ID) is the extent to which an item differentiates between high- and low-ability test-takers.
An item on which high-ability students and low-ability students score equally well would have poor ID because it did not discriminate between the two groups.
An item that garners correct responses from most of the high-ability group and incorrect responses from most of the low-ability group has good discrimination power.
Rank 30 students from highest to lowest score and divide them into three equal groups. Then compare the 10 highest scorers with the 10 lowest scorers on a single item, as follows:

Item #       High-ability students (top 10)   Low-ability students (bottom 10)
Correct      7                                2
Incorrect    3                                8

ID = (7 - 2) / 10 = 5 / 10 = 0.50
The result tells us that the item has a moderate level of ID.
A highly discriminating item would approach 1.0, and an item with no discriminating power at all would score zero.
In most cases, you would want to discard an item that scored near zero.
No absolute rule governs the establishment of acceptable and unacceptable ID indices.

c) Distractor efficiency (DE) is the extent to which the distractors “lure” a sufficient number of test-takers, especially lower-ability ones, and the extent to which those responses are somewhat evenly distributed across all distractors.
Example (*Note: C is the correct response.)

Choices   High-ability students (10)   Low-ability students (10)
A         0                            3
B         1                            5
C*        7                            2
D         0                            0
E         2                            0

The item might be improved in two ways:
a) Distractor D doesn’t fool anyone. Therefore it probably has no utility. A revision might provide a distractor that actually attracts a response or two.
b) Distractor E attracts more responses (2) from the high-ability group than from the low-ability group (0). Why are good students choosing this one? Perhaps it includes a subtle reference that entices the high group but is “over the head” of the low group, so that the latter students don’t even consider it.
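The three indices above come down to simple arithmetic. The following is a minimal sketch in Python that reproduces the worked figures from this section. The function names (item_facility, item_discrimination, flag_distractors) are illustrative assumptions and do not come from any testing library, and the two flagging rules are only one simple reading of the discussion above, not a standard procedure.

```python
def item_facility(n_correct, n_total):
    """Proportion of test-takers who answered the item correctly."""
    return n_correct / n_total


def item_discrimination(high_correct, low_correct, group_size):
    """(correct in top group - correct in bottom group) / size of one group."""
    return (high_correct - low_correct) / group_size


def flag_distractors(high, low, key):
    """Return a note for each distractor that looks suspicious.

    high and low map each option letter to the number of high- and
    low-ability test-takers who chose it; key is the correct option.
    """
    notes = {}
    for option in high:
        if option == key:
            continue  # the key is not a distractor
        if high[option] + low[option] == 0:
            notes[option] = "chosen by no one"
        elif high[option] > low[option]:
            notes[option] = "attracts more high-ability than low-ability test-takers"
    return notes


# Figures taken from the examples in this section.
if_value = item_facility(13, 20)           # 13 of 20 students correct
id_value = item_discrimination(7, 2, 10)   # top 10 vs. bottom 10

high = {"A": 0, "B": 1, "C": 7, "D": 0, "E": 2}
low = {"A": 3, "B": 5, "C": 2, "D": 0, "E": 0}
flags = flag_distractors(high, low, key="C")

print(if_value, id_value, flags)
```

Run as written, the sketch gives an IF of 0.65 and an ID of 0.50, and it flags only distractors D and E, which matches the two suggested improvements for the example item.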
Distractors A and B, by contrast, seem to be fulfilling their function of attracting some attention from the lower-ability students.

SCORING, GRADING AND GIVING FEEDBACK
A) Scoring
As you design a test, you must consider how the test will be scored and graded.
The scoring plan reflects the relative weight that you place on each section and on the items within it.
Whichever skill you consider more important should carry more points, for example: Oral production 30%, Listening 30%, Reading 20%, and Writing 20%.

B) Grading
Grading doesn’t mean just giving an “A” for 90-100. It’s not that simple. How you assign letter grades is a product of
· the country, culture, and context of the class,
· institutional expectations (most of them unwritten),
· explicit and implicit definitions of grades that you have set forth,
· the relationship you have established with the class,
· students’ expectations that have been engendered in previous tests and quizzes in class.

C) Giving Feedback
Feedback should become beneficial washback. These are some examples of feedback:
1. a letter grade
2. a total score
3. four subscores (speaking, listening, reading, writing)
4. for the listening and reading sections
   a. an indication of correct/incorrect responses
   b. marginal comments
5. for the oral interview
   a. scores for each element being rated
   b. a checklist of areas needing work
   c. oral feedback after the interview
   d. a post-interview conference to go over the results
6. on the essay
   a. scores for each element being rated
   b. a checklist of areas needing work
   c. marginal and end-of-essay comments and suggestions
   d. a post-test conference to go over the work
   e. a self-assessment
7. on all or selected parts of the test, peer checking of results
8. a whole-class discussion of the results of the test
9. individual conferences with each student to review the whole test

Decide whether the following statements are TRUE or FALSE.
1. A language aptitude test measures a learner’s future success in learning a foreign language.
2. Language aptitude tests are very common today.
3. A proficiency test is limited to a particular course or curriculum.
4. The aim of a placement test is to place a student into a particular level.
5. Placement tests have many varieties.
6. Any placement test can be used in a particular teaching program.
7. Achievement tests are related to classroom lessons, units, or curriculum.
8. A five-minute quiz can be an achievement test.
9. The first task in designing a test is to determine test specifications.
Answers: 1. TRUE 2. FALSE 3. FALSE 4. TRUE 5. TRUE 6. FALSE (Not all placement tests suit every teaching program.) 7. TRUE 8. TRUE (Achievement tests range from five- or ten-minute quizzes to three-hour final examinations.) 9. FALSE (The first task is to determine appropriate objectives.)

Decide whether the following statements are TRUE or FALSE.
1. It is very easy to develop multiple-choice tests.
2. Multiple-choice tests are practical but not reliable.
3. Multiple-choice tests are time-saving in terms of scoring and grading.
4. Multiple-choice items are receptive.
5. Each multiple-choice item in a test should measure a specific objective.
6. The stem of a multiple-choice item should be as long as possible in order to help students to understand the context.
7. If the item facility value is .10 (10%), it means the item is very easy.
8. The item discrimination index differentiates between high- and low-ability students.
Answers: 1. FALSE (It seems easy, but it is not.) 2. FALSE (They can be both practical and reliable.) 3. TRUE 4. TRUE 5. TRUE 6. FALSE (It should be short and to the point.) 7. FALSE (An item with an IF value of .10 is a very difficult one.) 8. TRUE

CHAPTER 4 STANDARDIZED TESTING
WHAT IS STANDARDIZATION?
A standardized test presupposes certain standard objectives or criteria that are held constant from one form of the test to another.
Standardized tests measure a broad band of competencies that are not limited to one particular curriculum.
They are norm-referenced, and the main goal is to place students in a rank order.
Examples:
· Scholastic Aptitude Test (SAT): college entrance exam seeking further information
· The Graduate Record Exam (GRE): test for entry into many graduate school programs
· Graduate Management Admission Test (GMAT) and Law School Admission Test (LSAT): tests that specialize in particular disciplines
· Test of English as a Foreign Language (TOEFL), produced by the Educational Testing Service (ETS), and the International English Language Testing System (IELTS): tests of English proficiency
These tests are standardized because they specify a set of competencies for a given domain and, through a process of construct validation, they program a set of tasks.
In general, standardized test items are in multiple-choice (MC) form, which provides an “objective” means for determining correct and incorrect responses.
However, MC is not the only item type in standardized tests. Human-scored tests of oral and written production are also involved.

ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS
- Advantages:
* They are ready-made (teachers don’t need to spend time preparing them)
* They can be administered to a large number of students within a time constraint
* They are easy to score thanks to the MC format (computerized or hole-punched grid scoring)
* They have face validity
- Disadvantages:
* Inappropriate use of tests
* Misunderstanding of the difference between direct and indirect testing

DEVELOPING A STANDARDIZED TEST
- Knowing how to develop a standardized test can be helpful in order to revise an existing test, adapt or expand an existing test, or create a smaller-scale standardized test.
Three tests illustrate the steps:
(A) The Test of English as a Foreign Language (TOEFL): “general ability or proficiency”
(B) The English as a Second Language Placement Test (ESLPT), San Francisco State University (SFSU): “placement test at a university”
(C) The Graduate Essay Test (GET), SFSU: “gate-keeping essay test”

1. Determine the purpose and objectives of the test.
- Standardized tests are expected to be valid and practical.
TOEFL
* To evaluate the English proficiency of people whose native language is not English.
* Colleges and universities in the US use the TOEFL score to admit or refuse international applicants for admission.
ESLPT
* To place already admitted students at SFSU in an appropriate course in academic writing and oral production.
* To provide teachers with some diagnostic information about students.
GET
* To determine whether students’ writing ability is sufficient to permit them to enter graduate-level courses in their programs (it is offered at the beginning of each term).

2. Design test specifications.
TOEFL
The first step is to define the construct of language proficiency.
After language competence is broken down into a subset of four skills, each performance mode can be examined on a continuum of linguistic units (pronunciation, spelling, words, grammar).
* The oral production section tests fluency and pronunciation by using imitation.
* The listening section focuses on a particular feature of language or on overall listening comprehension.
* The reading section aims to test comprehension of long or short passages, single sentences, phrases, or words.
* The writing section tests writing ability in an open-ended form (free composition), or it can be structured to elicit anything from correct spelling to discourse-level competence.
ESLPT
Designing test specifications for the ESLPT was a simpler task: the purpose is placement, and construct validation of the test consisted of an examination of the content of the ESL courses.
* In the recent revision of the ESLPT, content and face validity were important theoretical issues. Practicality and reliability in tasks and item response formats were equally important.
* The specifications mirrored the reading-based, process writing approach used in class.
GET
The specifications for the GET are the skills of writing grammatically and rhetorically acceptable prose on a topic, with clearly produced organization of ideas and logical development.

3. Design, select, and arrange test tasks/items.
TOEFL
• Content coding: items cover the skills and a variety of subject matter without bias (the content must be universal and as neutral as possible).
• Statistical characteristics: these include IF and ID.
• Before administration, items are piloted and scientifically selected to meet difficulty specifications within each subsection, each section, and the test overall.
ESLPT
• For the written parts, the main problems are a) selecting appropriate passages (which must conform to the standards of content validity), b) providing appropriate prompts (which should fit the passages), and c) processing data from pilot testing.
• In the MC editing test, the first (easier) task is to choose an appropriate essay within which to embed errors. A more complicated task is to embed a specified number of errors from pre-determined error categories. (The teacher can derive the categories from students’ previous errors in written work, and students’ errors can be used as distractors.)
GET
• Topics are appealing and capable of yielding the intended product: an essay that requires an organized, logical argument and conclusion.
• No pilot testing of prompts is conducted.
• Be careful about the potential cultural effect on the numerous international students who must take the GET.

4. Make appropriate evaluations of different kinds of items.
- IF, ID, and distractor analysis may not be necessary for a classroom (one-time) test, but they are a must for a standardized MC test.
- For production responses, different forms of evaluation become important (i.e. practicality, reliability, and facility).
* Practicality: clarity of directions, timing of the test, ease of administration, and how much time is required to score.
* Reliability: a major player in instances where more than one scorer is employed, and to a lesser extent when a single scorer has to evaluate tests over long spans of time that could lead to deterioration of standards.
* Facility: key for valid and successful items.
Unclear directions, complex language, obscure topics, fuzzy data, and culturally biased information may lead to a higher level of difficulty.
GET
* No data are collected from students on their perceptions, but the scorers have an opportunity to reflect on the validity of a given topic.

5. Specify scoring procedures and reporting formats.
TOEFL
- Scores are calculated and reported for
* three sections of the TOEFL
* a total score
* a separate score
ESLPT
* It reports a score for each of the essay sections (each essay is read by two readers).
* The editing section is machine-scanned.
* It provides data to place students as well as diagnostic information.
* Students don’t receive their essays back.
GET
* Each GET is read by two trained readers. Each gives a score between 1 and 4.
* The recommended threshold for allowing students to pursue graduate-level courses is a score of 6.
* A student who scores below 6 either repeats the test or takes a remedial course.

6. Perform ongoing construct validation studies.
Any standardized test must be accompanied by systematic periodic corroboration of its effectiveness and by steps toward its improvement.
TOEFL
* The latest study on the TOEFL examined its content characteristics from a communicative perspective, based on current research in applied linguistics and language proficiency assessment.
ESLPT
* The development of the new ESLPT involved a lengthy process of both content and construct validation, along with facing such practical issues as scoring the written sections and using a machine-scorable MC answer sheet.
GET
* There is no research to validate the GET itself. Administrators rely on the research on university-level academic writing tests such as the TWE.
* Some criticism of the GET has come from international test-takers who posit that the topics and time limits of the GET work to the disadvantage of writers whose native language is not English.

TOEFL
Primary market: U.S. universities and colleges, for admission purposes
Type: Computer-based and paper-based
Response modes: Multiple-choice responses and essay
Time allocation: Up to 4 hours (CB); 3 hours (PB)
Specifications (CB): A listening section which includes dialogs, short conversations, academic discussions, and mini lectures; a structure section which tests formal language with two types of questions (completing incomplete sentences and identifying one of four underlined words or phrases that is not acceptable in English); a reading section which includes four to five passages on academic subjects with 10-14 questions for each passage; a writing section which requires examinees to compose an essay on a given topic

MELAB
Primary market: U.S. and Canadian language programs and colleges; some worldwide educational settings
Type: Paper-based
Response modes: Multiple-choice responses and essay
Time allocation: 2.5 to 3.5 hours
Specifications: A 30-minute impromptu essay on a given topic; a 25-minute multiple-choice listening comprehension test; a 100-item, 75-minute multiple-choice test of grammar, cloze reading, vocabulary, and reading comprehension; an optional oral interview

IELTS
Primary market: Australian, British, Canadian, and New Zealand academic institutions and professional organizations, and some American academic institutions
Type: Computer-based for the Reading and Writing sections; paper-based for the Listening and Speaking parts
Response modes: Multiple-choice responses, essay, and oral production
Time allocation: 2 hours, 45 minutes
Specifications: A 60-minute reading section; a 60-minute writing section; a 30-minute listening section of four parts; a 10- to 15-minute speaking section of five parts

TOEIC
Primary market: Worldwide; workplace settings
Type: Computer-based and paper-based
Response modes: Multiple-choice responses
Time allocation: 2 hours
Specifications: A 100-item, approximately 45-minute listening section administered by audiocassette, which includes statements, questions, short conversations, and short talks; a 100-item, 75-minute reading section which includes cloze sentences, error recognition, and reading comprehension

CHAPTER 5 STANDARDS-BASED ASSESSMENT
Mid 20th Century
* Standardized tests had unchallenged popularity and growth.
* Standardized tests brought convenience, efficiency, and an air of empirical science.
* Tests were considered to be a way of making reforms in education.
* Quickly and cheaply assessing students became a political issue.
Late 20th Century
* There was possible inequity and disparity between the content of such tests and what teachers taught in their classes.
* The claims made in the mid-20th century began to be questioned and criticized in all areas.
* Teachers were in the leading position of those challenges.
The Last 20 Years
* Educators became aware of weaknesses in standardized testing: the tests were not accurate measures of achievement and success, and they were not based on carefully framed, comprehensive, and validated standards of achievement.
* A movement started to establish standards by which to assess students of all ages and subject-matter areas.
* There have been efforts to base standardized tests on clearly specified criteria for each content area being measured.
Criticism: Some teachers claimed that those tests were unfair, because there was a dissimilarity between the content and tasks of the tests and what they were teaching in their classes.
Solutions: Having become aware of these weaknesses, educators started to establish standards on which students of all ages and subject-matter areas might be assessed.
Most state-level departments of education in the US have specified the appropriate standards (criteria, objectives) for each grade level (pre-school to grade 12) and each content area (math, science, arts, ...).
The construction of standards makes possible a concordance between standardized test specifications and the goals and objectives.
Terms in use: ESL, ESOL, ELD, ELLs. (LEP has been discarded because of the negative connotation of the word “limited”.)

ELD STANDARDS
In creating benchmarks for accountability, there is a tremendous responsibility to carry out a comprehensive study of a number of domains:
* Categories of language: phonology, discourse, pragmatic, functional, and sociolinguistic elements.
* A specification of what ELD students’ needs are.
* A realistic scope of standards to be included in the curriculum. (The standards in the curriculum must be realistic.)
* Standards for teachers (qualifications, expertise, training). (This sets standards for teachers as well.)
* A thorough analysis of the means available to assess student attainment of those standards. (How will we assess what students have learned?)

ELD ASSESSMENT
The development of standards obviously implies the responsibility for correctly assessing their attainment.
It was found that the standardized tests of past decades were not in line with the newly developed standards, so the interactive process of not only developing standards but also creating standards-based assessments began.
Specialists design, revise, and validate many tests.
The California English Language Development Test (CELDT) is a battery of instruments designed to assess attainment of ELD standards across grade levels.
(not publicly available)
A language and literacy assessment rubric collected students’ work and teachers’ observations, recorded on scannable forms. It provided useful data on students’ performance in oral production, reading and writing in different grades.
CASAS AND SCANS
CASAS (Comprehensive Adult Student Assessment System): designed to provide broadly based assessments of ESL curricula across the US. It includes more than 80 standardized assessment instruments used to:
*place students in programs
*diagnose learners’ needs
*monitor progress
*certify mastery of functional skills
It is used at higher levels of education (colleges, adult and language schools, the workplace).
SCANS (Secretary’s Commission on Achieving Necessary Skills): outlines the competencies necessary for language in the workplace. The competencies are acquired and maintained through training in basic skills (the 4 skills); thinking skills (reasoning and problem solving); and personal qualities (self-esteem and sociability). The competencies cover resources (allocating time, materials, staff etc.); interpersonal skills (teamwork, customer service etc.);
Information (processing and evaluating data, organising files etc.); systems (understanding social and organizational systems); technology (use and application).
TEACHER STANDARDS (what a teacher should be like)
*Linguistics and language development
*Culture and the interrelationship between language and culture
*Planning and managing instruction
Consequences of standards-based and standardized testing
Positive
*High level of practicality and reliability
*Provides insights into academic performance
*Accuracy in placing a number of test-takers on a norm-referenced scale
*Ongoing construct validation studies
Negative
*They involve a number of test biases
*A small but significant number of test-takers are assessed neither fairly nor accurately
*Fosters extrinsic motivation
*Multiple intelligences are not considered
*There is a danger of test-driven learning and teaching
*In general, performance is not directly assessed
Test bias
Standardized tests involve many kinds of test bias (language, culture, race, gender, learning styles). The National Center for Fair and Open Testing reports claims of test bias from teachers, parents, students, and legal consultants (for example, in reading texts and listening stimuli). Standardised tests promote logical-mathematical and verbal-linguistic intelligences to the virtual exclusion of the other contextualised, integrative intelligences. Some learners may need to be assessed with interviews, portfolios, samples of work, demonstrations, and observation reports, that is, with more formative than summative assessment. That would solve test bias problems, but it is difficult to control in standardized items. Those who use standardised tests for gate-keeping purposes, with few if any other assessments, would do well to consider multiple measures before attributing infallible predictive power to standardised tests.
Test-driven learning and teaching
It is another consequence of standardized testing.
When students know that one single measure of performance will determine their lives, they are less likely to take a positive attitude towards learning; their motivation becomes extrinsic, not intrinsic. Teachers are also affected by test-driven policies. They are under pressure to make sure their students excel in the exam, at the risk of ignoring other objectives in the curriculum. A more serious effect was the punishment of schools in lower-socioeconomic neighbourhoods.
ETHICAL ISSUES: CRITICAL LANGUAGE TESTING
One of the by-products of a rapidly growing testing industry is the danger of an abuse of power. ‘Tests represent a social technology deeply embedded in education, government and business; tests are most powerful as they are often the single indicators for determining the future of individuals’ (Shohamy). Standards, specified by client educational institutions, bring with them certain ethical issues surrounding the gate-keeping nature of standardized tests.
*Teachers can demonstrate standards in their teaching.
*Teachers can be assessed through their classroom performance.
*Performance can be detailed with ‘indicators’: examples of evidence that the teacher can meet a part of a standard.
*Indicators are more than ‘how to’ statements (they are complex evidence of performance).
*Performance-based assessment is integrated (not a checklist or a set of discrete assessments).
*Each assessment has performance criteria against which performance can be measured.
*Performance criteria identify to what extent the teacher meets the standard.
*Student learning is at the heart of the teacher’s performance.
CHAPTER 6 ASSESSING LISTENING
OBSERVING THE PERFORMANCE OF FOUR SKILLS
1. Two interacting concepts: performance and observation. Sometimes the performance does not indicate true competence because of a bad night’s rest, illness, an emotional distraction, test anxiety, a memory block, or another student-related reliability factor.
One important principle for assessing a learner’s competence is to consider the fallibility of the results of a single performance, such as that produced in a test. Assessment should therefore draw on multiple performances and contexts, such as the following:
*Several tests that are combined to form an assessment. (The listening tasks are designed to assess the candidate’s ability to process forms of spoken English.)
*A single test with multiple test tasks to account for learning styles and performance variables
*In-class and extra-class graded work
*Alternative forms of assessment (e.g. journal, portfolio, conference, observation, self-assessment, peer-assessment)
Multiple measures give a more reliable and valid assessment than a single measure.
For the receptive skills, we can observe neither the process of performing nor a product.
1. Receptive skills (listening performance): the process of listening is invisible and inaudible, a process of internalizing meaning from the auditory signals being transmitted to the ear and brain.
2. Productive skills: these allow us to hear and see the process as it is performed. Writing gives a permanent product in the form of the written piece. For speaking, unless the speech is recorded, there is no permanent observable product.
THE IMPORTANCE OF LISTENING
Listening has often played second fiddle to its counterpart, speaking, and it is rare to find just a listening test. Listening is often implied as a component of speaking. Oral production ability, other than monologues, speeches, reading aloud and the like, is only as good as one’s listening comprehension. Input in the aural-oral mode accounts for a large proportion of successful language acquisition.
BASIC TYPES OF LISTENING
For an effective test, designing appropriate assessment tasks in listening begins with the specification of objectives, or criteria. The following processes flash through your brain:
1. Recognize speech sounds and hold a temporary “imprint” of them in short-term memory.
2.
Simultaneously determine the type of speech event.
3. Use (bottom-up) linguistic decoding skills and/or (top-down) background schemata to bring a plausible interpretation to the message, and assign a literal and intended meaning to the utterance. (See Jeremy Harmer, p. 305, on activating students’ schemata.)
4. In most cases, delete the exact linguistic form in which the message was originally received in favor of conceptually retaining important or relevant information in long-term memory.
Four commonly identified types of listening performance
1. Intensive: listening for perception of the components. Teachers use audio material on tape or hard disk when they want their students to practice listening skills.
2. Responsive.
3. Selective.
4. Extensive: extensive listening will usually take place outside the classroom. Material for extensive listening can be obtained from a number of sources.
Micro- and Macroskills
Microskills: attending to the smaller bits and chunks of language, in more of a bottom-up process
*Discriminate among the sounds of English.
*Retain chunks of language of different lengths in short-term memory.
*Recognize stress patterns, words in stressed and unstressed positions, rhythmic structure, intonation contours, and their role in signaling information.
*Recognize reduced forms of words.
*Distinguish word boundaries, recognize a core of words, and interpret word order patterns and their significance.
*Process speech at different rates of delivery.
*Process speech containing pauses, errors, corrections, and other performance variables.
*Recognize grammatical word classes (nouns, verbs, etc.), systems (e.g. tense, agreement, pluralization), patterns, rules, and elliptical forms.
*Detect sentence constituents and distinguish between major and minor constituents.
*Recognize that a particular meaning may be expressed in different grammatical forms.
*Recognize cohesive devices in spoken discourse.
Macroskills: focusing on the larger elements involved in a top-down approach
*Recognize the communicative functions of utterances, according to situations, participants, and goals.
*Infer situations, participants, and goals using real-world knowledge.
*From events, ideas, and so on, described, predict outcomes, infer links and connections between events, deduce causes and effects, and detect such relations as main idea, supporting idea, new information, given information, generalization, and exemplification.
*Distinguish between literal and implied meanings.
*Use facial, kinesic, body-language, and other nonverbal clues to decipher meanings.
*Develop and use a battery of listening strategies, such as detecting key words, guessing the meaning from context, appealing for help, and signaling comprehension or lack thereof.
What Makes Listening Difficult
1. Clustering: chunking into phrases, clauses, constituents.
2. Redundancy: repetitions, rephrasing, elaborations and insertions.
3. Reduced forms: understanding reduced forms that may not be a part of the learner’s past experience in classes where only formal “textbook” language has been presented.
4. Performance variables: hesitations, false starts, corrections, diversions.
5. Colloquial language: idioms, slang, reduced forms, shared cultural knowledge.
6. Rate of delivery: keeping up with the speed of delivery, processing automatically as the speaker continues.
7. Stress, rhythm, and intonation: correctly understanding prosodic elements of spoken language, which is more difficult than understanding the smaller phonological bits and pieces.
8.
Interaction: negotiation, clarification, attending signals, turn taking, maintenance, termination.
Designing Assessment Tasks: Intensive Listening
• Recognizing Phonological and Morphological Elements
Phonemic pair, consonants
Test-takers hear: He’s from California.
Test-takers read: A. He’s from California. B. She’s from California.
Phonemic pair, vowels
Test-takers hear: Is he living?
Test-takers read: A. Is he leaving? B. Is he living?
Morphological pair, -ed ending
Test-takers hear: I missed you very much.
Test-takers read: A. I missed you very much. B. I miss you very much.
Stress pattern in can’t
Test-takers hear: My girlfriend can’t go to the party.
Test-takers read: A. My girlfriend can go to the party. B. My girlfriend can’t go to the party.
One-word stimulus
Test-takers hear: vine
Test-takers read: A. vine B. wine
• Paraphrase Recognition
Sentence paraphrase
Test-takers hear: Hello, my name is Keiko. I come from Japan.
Test-takers read: A. Keiko is comfortable in Japan. B. Keiko wants to come to Japan. C. Keiko is Japanese. D. Keiko likes Japan.
Dialogue paraphrase
Test-takers hear:
Man: Hi, Maria, my name is George.
Woman: Nice to meet you, George. Are you American?
Man: No, I’m Canadian.
Test-takers read: A. George lives in the United States. B. George is American. C. George comes from Canada. D. Maria is Canadian.
Designing Assessment Tasks: Responsive Listening
• Appropriate response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers read: A. In about an hour. B. About an hour. C. About $10. D. Yes, I did.
• Open-ended response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers write or speak
: __________________________________
Designing Assessment Tasks: Selective Listening
The test-taker listens to a limited quantity of aural input and must discern some specific information within it.
Listening Cloze (cloze dictation or partial dictation)
Listening cloze tasks require the test-taker to listen to a story, monologue or conversation and simultaneously read a written text in which selected words or phrases have been deleted.
One potential weakness of the listening cloze technique: such tasks may simply become reading comprehension tasks. Test-takers who are asked to listen to a story with periodic deletions in the written version may not need to listen at all, yet may still be able to respond with the appropriate word or phrase.
Information Transfer
Aurally processed information must be transferred to a visual representation, e.g. labelling a diagram, identifying an element in a picture, completing a form, or showing routes on a map.
Chart Filling
Test-takers see a chart about Lucy’s daily schedule and fill in the schedule.
Sentence Repetition
The test-taker must retain a stretch of language long enough to reproduce it, and then must respond with an oral repetition of that stimulus.
DESIGNING ASSESSMENT TASKS: EXTENSIVE LISTENING
Dictation
Test-takers hear a passage, typically 50-100 words, recited 3 times:
*First reading: natural speed, no pauses; test-takers listen for gist.
*Second reading: slowed speed, pause at each break; test-takers write.
*Third reading: natural speed; test-takers check their work.
Communicative Stimulus-Response Tasks
The test-takers are presented with a stimulus monologue or conversation and then are asked to respond to a set of comprehension questions.
*First: test-takers hear the instructions and the dialogue or monologue.
*Second: test-takers read the multiple-choice comprehension questions and items, then choose the correct one.
Authentic Listening Tasks
Buck (2001, p. 92): “Every test requires some components of communicative language ability, and no test covers them all.
Similarly, every task shares some characteristics with target-language tasks, and no test is completely authentic”
Alternatives to assess comprehension in a truly communicative context
Note taking: listening to a lecturer and writing down the important ideas.
*Disadvantage: scoring is time consuming.
*Advantages: it mirrors the real classroom situation, and it fulfills the criteria of cognitive demand, communicative language and authenticity.
Editing: editing a written version of an aural stimulus.
Interpretive tasks: paraphrasing a story or conversation. Potential stimuli include song lyrics, poetry, radio, TV, news reports, etc.
Retelling: listen to a story and simply retell it, either orally or in writing, to show full comprehension.
*Difficulties: scoring and reliability.
*Validity, cognitive demand, communicative ability, and authenticity are well incorporated into the task.
Interactive listening (face-to-face conversations)
CHAPTER 7 ASSESSING SPEAKING
Challenges of testing speaking:
1. The interaction of speaking and listening
2. Elicitation techniques
3. Scoring
BASIC TYPES OF SPEAKING
1. Imitative (parrot back): testing the ability to imitate a word, phrase, or sentence. Pronunciation is tested. Examples: word, phrase, and sentence repetition.
2. Intensive: the purpose is producing short stretches of oral language. It is designed to demonstrate competence in a narrow band of grammatical, phrasal, lexical, or phonological relationships (stress / rhythm / intonation).
3. Responsive (interacting with the interlocutor): includes interaction and test comprehension, but at the somewhat limited level of very short conversations, standard greetings, small talk, simple requests and comments, and the like.
4. Interactive: the difference between responsive and interactive speaking is the length and complexity of the interaction, which includes multiple exchanges and/or multiple participants.
5.
Extensive (monologue): extensive oral production tasks include speeches, oral presentations, and story-telling, during which the opportunity for oral interaction from listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether.
Micro- and Macroskills of Speaking
The microskills of speaking refer to producing small chunks of language such as phonemes, morphemes, words and phrasal units. The macroskills involve the speaker’s focus on the larger elements such as fluency, discourse, function, style, cohesion, nonverbal communication and strategic options.
Macroskills:
1. Appropriately accomplish communicative functions according to situations, participants, and goals.
2. Use appropriate styles, registers, implicature, redundancies, pragmatic conventions, conversation rules, floor-keeping and floor-yielding, interrupting, and other sociolinguistic features in face-to-face conversations.
3. Convey links and connections between events and communicate such relations as focal and peripheral ideas, events and feelings, new information and given information, generalization and exemplification.
4. Convey facial features, body language, and other nonverbal cues along with verbal language.
5. Develop and use a battery of speaking strategies, such as emphasizing key words, rephrasing, providing a context for interpreting the meaning of words, appealing for help, and accurately assessing how well your interlocutor is understanding you.
Microskills:
1. Produce differences among English phonemes and allophonic variants.
2. Produce chunks of language of different lengths.
3. Produce English stress patterns, words in stressed and unstressed positions, rhythmic structure, and intonation contours.
4. Produce reduced forms of words and phrases.
5. Use an adequate number of lexical units (words) to accomplish pragmatic purposes.
6. Produce fluent speech at different rates of delivery.
7. Monitor one’s own oral production and use various devices (pauses, fillers, self-corrections, backtracking) to enhance the clarity of the message.
8. Use grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), word order, patterns, rules, and elliptical forms.
9. Produce speech in natural constituents: in appropriate phrases, pause groups, breath groups, and sentence constituents.
10. Express a particular meaning in different grammatical forms.
11. Use cohesive devices in spoken discourse.
Three important issues as you set out to design tasks:
1. No speaking task is capable of isolating the single skill of oral production. Concurrent involvement of the additional performance of aural comprehension, and possibly reading, is usually necessary.
2. Eliciting the specific criterion you have designated for a task can be tricky because, beyond the word level, spoken language offers a number of productive options to test-takers. Make sure your elicitation prompt achieves its aims as closely as possible.
3. It is important to carefully specify scoring procedures for a response so that ultimately you achieve as high a reliability index as possible.
In summary:
*Interaction effect: interaction between speaking and listening or reading is unavoidable, so speaking cannot be tested in isolation.
*Elicitation techniques: to elicit the specific criterion we expect from test-takers.
*Scoring: to achieve reliability.
Designing Assessment Tasks: Imitative Speaking
Imitative tasks pay more attention to pronunciation, especially suprasegmentals, in an attempt to help learners be more comprehensible. Repetition tasks are warranted as long as they are not allowed to occupy a dominant role in an overall oral production assessment, and as long as a negative washback effect is avoided. In a simple repetition task, test-takers repeat the stimulus, whether it is a pair of words, a sentence, or perhaps a question (to test for intonation production).
Word repetition task: scoring specifications must be clear in order to avoid reliability breakdowns.
A common form of scoring simply uses a 2- or 3-point system for each response.
Scoring scale for repetition tasks:
2 acceptable pronunciation
1 comprehensible, partially correct pronunciation
0 silence, seriously incorrect pronunciation
The longer the stretch of language, the more possibility for error and therefore the more difficult it becomes to assign a point system to the text.
PHONEPASS TEST
Research on the PhonePass test has supported the construct validity of its repetition tasks not just for phonological ability but also for discourse and overall oral production ability. The PhonePass test elicits computer-assisted oral production over a telephone. Test-takers read aloud, repeat sentences, say words, and answer questions. Test-takers are directed to telephone a designated number and listen for directions. The test has five sections:
Part A: Testees read aloud selected sentences from among those printed on the test sheet.
Part B: Testees repeat sentences dictated over the phone.
Part C: Testees answer questions with a single word or a short phrase of 2 or 3 words.
Part D: Testees hear 3 word groups in random order and link them in a correctly ordered sentence.
Part E: Testees have 30 seconds to give their opinion on a topic that is dictated over the phone.
Scores are calculated by a computerized scoring template and reported back to the test-taker within minutes. Pronunciation, reading fluency, repeat accuracy and fluency, and listening vocabulary are the sub-skills scored. The scoring procedure has been validated against human scoring with extraordinarily high reliabilities and correlation statistics.
Designing Assessment Tasks: Intensive Speaking
Test-takers are prompted to produce short stretches of discourse (no more than a sentence) through which they demonstrate linguistic ability at a specified level of language. Intensive tasks may also be described as limited response tasks, or mechanical tasks, or what classroom pedagogy would label as controlled responses.
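The 2/1/0 scoring scale given above for repetition tasks lends itself to a simple tally across items. The following is only a minimal sketch: the function name `summarize_ratings` and the percentage summary are hypothetical illustrations, not part of any published scoring procedure, and the per-item ratings themselves still have to come from a human rater.

```python
# Illustrative sketch only: the names and the percentage summary are assumptions.
# The three scale points are the ones listed for repetition tasks.
SCALE = {
    2: "acceptable pronunciation",
    1: "comprehensible, partially correct pronunciation",
    0: "silence, seriously incorrect pronunciation",
}


def summarize_ratings(ratings):
    """Total a list of per-item ratings given on the 0-2 scale."""
    for rating in ratings:
        if rating not in SCALE:
            raise ValueError(f"rating {rating!r} is not on the 0-2 scale")
    total = sum(ratings)
    maximum = 2 * len(ratings)
    # Avoid dividing by zero when no items were rated.
    percent = 100 * total / maximum if maximum else 0.0
    return {"total": total, "maximum": maximum, "percent": percent}


# Example: five repeated words or sentences rated by one scorer.
summary = summarize_ratings([2, 1, 0, 2, 2])
print(summary)
```

Rejecting any value outside the scale is one small way of guarding against the reliability breakdowns that unclear scoring specifications can cause.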
Directed Response Tasks
The administrator elicits a particular grammatical form or a transformation of a sentence. Such tasks are clearly mechanical and not communicative (possible drawback), but they do require minimal processing of meaning in order to produce the correct grammatical output (practical advantage).
Read-Aloud Tasks (to improve pronunciation and fluency)
These go beyond the sentence level up to a paragraph or two. The task is easily administered by selecting a passage that incorporates the test specs and by recording the testee’s output; scoring is easy because all of the test-taker’s oral production is controlled. Although reading aloud offers certain practical advantages (predictable output, practicality, reliability in scoring), there are drawbacks: reading aloud is somewhat inauthentic in that we seldom read anything aloud to someone else in the real world, with the exception of a parent reading to a child.
Sentence/Dialogue Completion Tasks and Oral Questionnaires (to produce omitted lines or words in a dialogue appropriately)
Test-takers read a dialogue in which one speaker’s lines have been omitted. Test-takers are first given time to read through the dialogue to get its gist and to think about appropriate lines to fill in. An advantage of this technique lies in its moderate control of the output of the test-taker (practical advantage). One disadvantage is its reliance on literacy and an ability to transfer easily from written to spoken English (possible drawback). Another disadvantage is the contrived, inauthentic nature of the task (drawback).
Picture-Cued Tasks (to elicit oral production by using pictures)
One of the more popular ways to elicit oral language performance at both intensive and extensive levels is a picture-cued stimulus that requires a description from the test-taker. Assessment of oral production may be stimulated through a more elaborate picture.
(practical advantages) Maps are another visual stimulus that can be used to assess the language forms needed to give directions and specify locations (practical advantage). Scoring may be problematic depending on the expected performance.
Scoring scale for intensive tasks:
2 comprehensible; acceptable target form
1 comprehensible; partially correct target form
0 silence, or seriously incorrect target form
Translation (of Limited Stretches of Discourse) (to translate from the native language to the target language)
The test-takers are given a native-language word, phrase, or sentence and are asked to translate it. As an assessment procedure, the advantages of translation lie in its control of the output of the test-taker, which of course means that scoring is more easily specified.
Designing Assessment Tasks: Responsive Speaking
Assessment involves brief interactions with an interlocutor, differing from intensive tasks in the increased creativity given to the test-taker and from interactive tasks in the somewhat limited length of utterances.
Question and Answer
Question and answer tasks can consist of one or two questions from an interviewer, or they can make up a portion of a whole battery of questions and prompts in an oral interview. The first question is intensive in its purpose; it is a display question intended to elicit a predetermined correct response. Questions at the responsive level tend to be genuine referential questions in which the test-taker is given more opportunity to produce meaningful language in response. Test-takers respond with a few sentences at most. Test-takers respond with questions. A potentially tricky form of oral production assessment involves more than one test-taker with an interviewer. With two students in an interview context, both test-takers can ask questions of each other.
Giving Instructions and Directions
The technique is simple: the administrator poses the problem, and the test-taker responds.
Scoring is based primarily on comprehensibility and secondarily on other specified grammatical or discourse categories.
Eliciting instructions or directions
Paraphrasing
Test-takers read or hear a number of sentences and produce a paraphrase of each sentence. Advantages: such tasks elicit short stretches of output and perhaps tap into the testee’s ability to practice conversation by reducing the output/input ratio. If you use short paraphrasing tasks as an assessment procedure, it is important to pinpoint the objective of the task clearly. In this case, the integration of listening and speaking is probably more at stake than simple oral production alone.
TEST OF SPOKEN ENGLISH (TSE)
The TSE is a 20-minute audio-taped test of oral language ability within an academic or professional environment. The scores are also used for selecting and certifying health professionals such as physicians, nurses, pharmacists, physical therapists, and veterinarians. The tasks on the TSE are designed to elicit oral production in various discourse categories rather than in selected phonological, grammatical, or lexical targets.
Designing Assessment Tasks: Interactive Speaking
Tasks include long stretches of interactive discourse (interviews, role plays, discussions, games).
Interview
A test administrator and a test-taker sit down in a direct face-to-face exchange and proceed through a protocol of questions and directives. The interview is then scored on accuracy in pronunciation and/or grammar, vocabulary usage, fluency, pragmatic appropriateness, task accomplishment, and even comprehension. Placement interviews are designed to get a quick spoken sample from a student to verify placement into a course.
Four stages:
1. Warm-up (small talk): the interviewer directs mutual introductions, helps the testee become comfortable, apprises the testee of the format, and allays anxieties. (No scoring.)
2. Level check: the interviewer stimulates the testee to respond using expected or predicted forms and functions.
This stage gives the interviewer a picture of the testee’s extroversion, readiness to speak, and confidence. Linguistic target criteria are scored in this phase.
3. Probe: probe questions and prompts challenge the testee to go to the heights of their ability, to extend beyond the limits of the interviewer’s expectation through difficult questions.
4. Wind-down: a short period of time during which the interviewer encourages the testee to relax with easy questions and sets the testee at ease.
The success of an oral interview will depend on:
*clearly specifying the administrative procedures of the assessment (practicality)
*focusing the questions and probes on the purpose of the assessment (validity)
*appropriately eliciting an optimal amount and quality of oral production from the test-taker (biased for best performance)
*creating a consistent, workable scoring system (reliability)
Role Play
Role playing is a popular pedagogical activity in communicative language teaching classes. Within constraints set forth by guidelines, it frees students to be somewhat creative in their linguistic output. While role play can be controlled or “guided” by the interviewer, this technique takes test-takers beyond simple intensive and responsive levels to a level of creativity and complexity that approaches real-world pragmatics. Scoring presents the usual issues in any task that elicits somewhat unpredictable responses from test-takers.
Discussions and Conversations
As formal assessment devices, discussions and conversations with and among students are difficult to specify and even more difficult to score. But as informal techniques to assess learners, they offer a level of authenticity and spontaneity that other assessment techniques may not provide. Scores or checklists used to assess the performance of participants should be carefully designed to suit the objectives of the observed discussion.
Discussion is an integrative task, and so it is also advisable to give some cognizance to comprehension performance in evaluating learners.
Games
Among informal assessment devices are a variety of games that directly involve language production. Assessment games:
1. “Tinkertoy” game (Lego blocks)
2. Crossword puzzles
3. Information gap grids
4. City maps
ORAL PROFICIENCY INTERVIEW (OPI)
The best-known oral interview format is the Oral Proficiency Interview. The OPI is the result of a historical progression of revisions under the auspices of several agencies, including the Educational Testing Service and the American Council on the Teaching of Foreign Languages (ACTFL). The OPI is carefully designed to elicit pronunciation, fluency and integrative ability, sociolinguistic and cultural knowledge, grammar, and vocabulary. Performance is judged by the examiner to be at one of ten possible levels on the ACTFL-designated proficiency guidelines for speaking: Superior; Advanced (high, mid, low); Intermediate (high, mid, low); Novice (high, mid, low).
Designing Assessment Tasks: Extensive Speaking
Extensive speaking involves complex, relatively lengthy stretches of discourse. The tasks are variations on monologues, with minimal verbal interaction.
Oral Presentations
It would not be uncommon to be called on to present a report, a paper, a marketing plan, a sales idea, a design of a new product, or a method. Once again the rules for effective assessment must be invoked: a) specify the criterion, b) set appropriate tasks, c) elicit optimal output, d) establish practical, reliable scoring procedures. Scoring is the key assessment challenge.
Picture-Cued Story-Telling
One technique for eliciting oral production is through visual stimuli: pictures, photographs, diagrams, and charts. Consider a picture or series of pictures as a stimulus for a longer story or description. Criteria for scoring need to make clear what it is you are hoping to assess.
Retelling a Story, News Event
In this type of task, test-takers hear or read a story or news event that they are asked to retell. The objectives in assigning such a task vary from listening comprehension of the original to production of a number of oral discourse features (communicating sequences and relationships of events, stress and emphasis patterns, “expression” in the case of a dramatic story), fluency, and interaction with the hearer. Scoring should meet the intended criteria.
Translation (of Extended Prose)
Longer texts are presented for the test-taker to read in the native language and then translate into English (dialogues, directions for assembly of a product, a synopsis of a story, play or movie, directions on how to find something on a map, and other genres). The advantage of translation is in the control of the content, vocabulary, and to some extent, the grammatical and discourse features. The disadvantage is that translation of longer texts is a highly specialized skill for which some individuals obtain post-baccalaureate degrees. Criteria for scoring should take into account not only the purpose in stimulating a translation but also the possibility of errors that are unrelated to oral production ability.
CHAPTER 8 ASSESSING READING
TYPES (GENRES) OF READING
Academic reading: reference material, textbooks, theses, essays, papers, test directions, editorials and opinion writing
Job-related reading: messages, letters/emails, memos
Personal reading: newspapers, magazines, letters, emails, cards, invitations, schedules (train, bus)
Microskills:
*Discriminate among the distinctive graphemes and orthographic patterns of English.
*Retain chunks of language of different lengths in short-term memory.
*Process writing at an efficient rate of speed to suit the purpose.
*Recognize a core of words, and interpret word order patterns and their significance.
*Recognize grammatical word classes (nouns, verbs, etc.), systems (tense, agreement, pluralization), patterns, rules and elliptical forms.
Recognize cohesive devices in written discourse and their role in signaling the relationship between and among clauses. Macroskills : Recognize the rhetorical forms of written discourse and their significance for interpretation. Recognize the communicative functions of written text, according to form and purpose Infer context that is not explicit by using background knowledge From described events, ideas, etc, infer links and connections between events, deduce causes and effects, and detect such relations as main idea, supporting idea, new information, generalization, and exemplification Distinguish between literal and implied meanings. Detect culturally specific references and interpret them in a context of the appropriate cultural schemata. Develop and use a battery of reading strategies, such as scanning and skimming, detecting discourse markers, guessing the meaning of words from the context, and activating schemata for interpretation of texts. Some principal strategies for reading comprehension: Identify your purpose in reading a text Apply spelling rule and conventions for bottom-up decoding Use lexical analysis to determine meaning Guess at meaning when you aren’t certain Skim the text for the gist and for main ideas Scan the text for specific information(names, dates, key words) Use silent reading techniques for rapid processing Use marginal notes, outlines, charts, or semantic maps for understanding and retaining information Distinguish between literal and implied meanings Capitalize on discourse markers to process relationships. TYPES OF READING Perceptive Involve attending to the components of larger stretches of discourse : letters, words, punctuation, and other graphemic symbols. Selective Is largely an artifact of assessment formats. Used picture-cued tasks, matching, true/ false, multiple-choice, etc. 
Interactive
The interactive task is to identify relevant features (lexical, symbolic, grammatical, and discourse) within texts of moderately short length with the objective of retaining the information that is processed.
Extensive
The purposes of assessment usually are to tap into a learner's global understanding of a text, as opposed to asking test-takers to "zoom in" on small details. Top-down processing is assumed for most extensive tasks.
PERCEPTIVE READING
Reading Aloud: The test-taker reads the items aloud, one by one, in the presence of an administrator.
Written Response: The test-taker reproduces the probe in writing. Evaluation of the test-taker's response must be carefully treated.
Multiple-Choice: The test-taker chooses one of four or five possible answers.
Picture-Cued Items: Test-takers are shown a picture along with written text and are given one of a number of possible tasks to perform.
SELECTIVE READING
The test designer focuses on formal aspects of language (lexical, grammatical, and a few discourse features). This category includes what many incorrectly think of as testing "vocabulary and grammar."
Multiple-Choice (for Form-Focused Criteria): Items may have little context, but might serve as a vocabulary or grammar check.
Matching Tasks: The most frequently appearing criterion in matching procedures is vocabulary.
Editing Tasks: Editing for grammatical or rhetorical errors is a widely used test method for assessing linguistic competence in reading.
Picture-Cued Tasks: Test-takers read a sentence or passage and choose one of four pictures that is described, or read a series of sentences or definitions, each describing a labeled part of a picture or diagram.
Gap-Filling Tasks: The designer creates completion items in which test-takers read part of a sentence and then complete it by writing a phrase.
INTERACTIVE READING
Cloze Tasks: Test-takers draw on the ability to fill in gaps in an incomplete image (visual, auditory, or cognitive) and to supply (from background schemata) omitted details.
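A common way to construct a cloze passage is fixed-ratio deletion, in which every n-th word is blanked out. The sketch below assumes that method; the function name `make_cloze`, the default ratio, and the blank marker are illustrative choices, not something these notes prescribe.

```python
import string


def make_cloze(text, n=7, blank="_____"):
    """Blank out every n-th word of `text` (fixed-ratio deletion).

    Returns the gapped passage and the list of deleted words (the answer key).
    Punctuation attached to a deleted word is kept in the passage.
    """
    if n < 2:
        raise ValueError("n must be at least 2")
    words = text.split()
    answers = []
    for i in range(n - 1, len(words), n):
        core = words[i].strip(string.punctuation)
        if not core:
            continue  # the token was punctuation only; leave it alone
        answers.append(core)
        words[i] = words[i].replace(core, blank, 1)
    return " ".join(words), answers


passage, key = make_cloze(
    "The quick brown fox jumps over the lazy dog near the old river bank today.",
    n=5,
)
print(passage)
print(key)
```

A rational-deletion cloze, where the designer picks the words to delete by hand, would replace the `range(...)` loop with an explicit list of word positions.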
Impromptu Reading Plus Comprehension Questions: One would rarely consider assessing reading without some component of the assessment involving impromptu reading and responding to questions.
Short-Answer Tasks: Questions following a reading passage can use the age-old short-answer format.
Editing (Longer Texts): The technique has been applied successfully to longer passages of 200 to 300 words. Advantages: first, authenticity; second, the task simulates proofreading one's own essay; third, it can be connected to a specific curriculum.
Scanning: A strategy used by all readers to find relevant information in a text.
Ordering Tasks: Variations on this can serve as an assessment of overall global understanding of a story and of the cohesive devices that signal the order of events or ideas.
Information Transfer (Reading Charts, Maps, Graphs, Diagrams): These media presuppose the reader's schemata for interpreting them and are accompanied by oral or written discourse to convey, clarify, question, argue, and debate, among other linguistic functions.
EXTENSIVE READING
Extensive reading involves longer texts than we have been dealing with up to this point.
Skimming Tasks: The process of rapid coverage of reading matter to determine its gist or main idea.
Summarizing and Responding: The test-taker writes a summary of the text and then a response to it.
Note Taking and Outlining: A teacher, perhaps in one-on-one conferences with students, can use student notes or outlines as indicators of the presence or absence of effective reading strategies, and thereby point the learners in positive directions.
UNIT 9: ASSESSING WRITING GENRES OF WRITING Academic Writing papers and general subject reports essays, compositions academically focused journals, short-answer test responses technical reports (e.g., lab reports), theses, dissertations Job-Related Writing messages letters/emails, memos (e.g., interoffice), reports (e.g., job evaluations, project reports) schedules, labels, signs, advertisements, announcements, manuals Personal Writing letters, emails, greeting cards, invitations messages, notes, calendar entries, shopping lists, reminders financial documents (e.g., checks, tax forms, loan applications) forms, questionnaires, medical reports, immigration documents diaries, personal journals, fiction (eg. Short stories, poetry) MICROSKILLS AND MACROSKILLS OF WRITING Micro-skills Produce graphemes and orthographic patterns of English. Produce writing at an efficient rate of speed to suit the purpose. Produce an acceptable core of words and use appropriate word order patterns. Use acceptable grammatical systems (Tense, agreement), patterns and rules. Express a particular meaning in different grammatical forms. Use cohesive devices in written discourse. Macro-skills Use the rhetorical forms and conventions of written discourse. Appropriately accomplish the communicative functions of written texts according to form and purpose. Convey links and connections between events, communicate such relations as main idea, supporting idea, new information, generalization, exemplification. Distinguish between literal and implied meanings when writing. Correctly convey culturally specific references in the context of the written text. Develop&use writing strategies, accurately assessing audience’s interpretation, using prewriting devices, writing fluency in first drafts, using phrases and synonyms, soliciting feedback and using feedback for revising and editing. 
Types of Writing Performance Imitative Writing Assess ability to spell correctly & perceive phoneme/grapheme correspondences Form rather than meaning (letters, words, punctuation, brief sentences, mechanics of writing) Intensive Writing To produce appropriate vocabulary within a context and correct grammatical features in a sentence More form than meaning but meaning and context are of some importance (collocations, idioms, correctness, appropriateness) Responsive Writing Connect sentences & create a logically connected 2 or 3 paragraphs Discourse conventions with strong emphasis on context and meaning (limited discourse level, connecting sentences logically) mostly 2-3 paragraphs Extensive Writing To manage all the processes of writing for all purposes to write longer text (Essays, papers, theses) Processes of writing (strategies of writing) IMITATIVE WRITING Tasks in Hand Writing Letters, Words, and Punctuation Copying ( bit __ / bet __ / bat __ ) Copy the words given in the spaces provided Listening cloze selection tasks Write the missing words in blanks by selecting according to what they hear Combination of dictation with a written text Purpose=to give practice in writing Picture-cued tasks Write the word the picture represents Make sure that pictures are not ambiguous Form completion tasks Complete the blanks in simple forms Eg. Name, address, phone number Make sure that students have practiced filling out such forms Converting numbers/abbreviations to words Either write out the numbers or converting abbreviations to words More reading than writing, so specify the criterion Low authenticity, Reliable method to stimulate handwritten English Spelling Tasks and Detecting Phoneme-Grapheme Correspondences Spelling Tests Write words that are dictated, Choose words that have been heard or spoken Scoring=correct spelling Picture-Cued Tasks Write words that are displayed by pictures Eg. 
Boot-book, read-reed, bit-bite. Choose items according to your test purpose.
Multiple-Choice Techniques: Choose and write the word with the correct spelling to fit the given sentences. Items are better with a writing component; adding homonyms makes the task more challenging. The task overlaps with reading, so be careful. Purpose: to assess the ability to spell words correctly and to process phoneme-grapheme correspondences.
Matching Phonetic Symbols: Write the correctly spelled word alphabetically. Since the Latin alphabet and phonetic alphabet symbols are different from each other, this works well.
INTENSIVE (CONTROLLED) WRITING
Dictation: Writing what is heard. Assesses listening, correct spelling, and punctuation.
Dicto-comp: Rewriting a paragraph in one's own words after hearing it two or three times. Assesses listening, vocabulary, spelling, and punctuation.
Grammatical transformation: Making grammatical transformations by changing or combining forms of language. Assesses grammatical competence. Easy to administer, practical, and reliable, but has no meaningful value; even with context it lacks authenticity.
Picture-cued tasks: 1. Short sentences 2. Picture description 3. Picture sequence description. Assess reading of non-verbal means, grammar, spelling, and vocabulary. Integrate reading and writing; scoring is problematic when pictures are not clear.
Vocabulary assessment: Either defining a word or using it in a sentence; assessing collocations and derived morphology. Assesses vocabulary and grammar. Less authentic (using a given word in a sentence).
Ordering: Ordering or reordering a scrambled set of words. If verbal, intensive speaking; if written, intensive writing. Assesses reading and grammar. Appealing to those who like word games and puzzles, but inauthentic; needs practice in class; involves both reading and writing.
Short answer and sentence completion: Answering or asking questions for the given statements, or writing two or three sentences using the given prompts. Assesses reading and writing. Scoring on a 2-1-0 scale is appropriate.
1. AUTHENTICITY (face and content validity): The teacher becomes less an instructor and more a coach or facilitator. Assessment is formative, so positive washback takes priority over practicality and reliability.
2. SCORING: Consider both how Ss string words together and what they say.
3. TIME: No time constraints, which gives freedom for drafts before the finished product. Questioned issue: is the timed impromptu format a valid method of writing assessment?
RESPONSIVE AND EXTENSIVE WRITING
1. Paraphrasing
Its importance: To say something in one's own words, to avoid plagiarism, to offer some variety in expression
Test takers' task: Paraphrasing sentences or paragraphs with purposes in mind
Assessment type: Informal and formative, positive washback
Scoring: Conveying a similar message is primary; discourse, grammar, and vocabulary are secondary
2. Guided question and answer
Its importance: To provide the benefits of guiding test takers without dictating the form of the output
Test takers' task: Paraphrasing sentences or paragraphs with purposes in mind
Assessment type: Informal and formative
Scoring: Either on a holistic scale or an analytical one
3. Paragraph Construction Tasks
Topic Sentence Writing: the presence or absence of a topic sentence; the effectiveness of the topic sentence
Topic Development in a Paragraph: the clarity of expression; the logic of the sequence; the unity and cohesion; the overall effectiveness
Multi-Paragraph Essay: addressing the topic, main idea, or purpose; organizing supporting ideas; using appropriate details for supporting ideas; facility and fluency in language use; demonstrating syntactic variety
4. Strategic Options
Free writing, outlining, drafting, and revising are strategies that help writers create effective texts. Writers need to know their subject, purpose, and audience in order to write. Developing main and supporting ideas is the purpose of essay writing only. Some tasks commonly addressed in academic writing courses are compare/contrast, problem/solution, pros/cons, and cause and effect.
Assessment of tasks in academic writing course could be formative & informal Knowing conventions &opportunities of genre will help to write effectively. Every genre of writing requires different conventions. Test of Written English (TWE®) Time allocated: 30 minutes time limit/ no preparation ahead of time Prepared by: a panel of experts Scoring: a mean score of 2 independent ratings based on a holistic scoring Number of raters: 2 trained raters working independently Limitations: inauthentic / not real life / puts test takers into artificially time constraint context inappropriate for instructional purposes Strengths: serves for administrative purposes Follow 6 steps to be successful Carefully identify the topic. Plan your supporting ideas. In introductory paragraph, restate topic and state organizational plan of essay. Write effective supporting paragraphs (show transitions, include a topic sentence, specify details). Restate your position and summarize in the concluding paragraph. Edit sentence structure and rhetorical expression. 
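The TWE scoring rule noted above (the mean of two independent holistic ratings) can be sketched in a few lines. This is only an illustration: the 1 to 6 scale bounds and the `max_gap` threshold for flagging a rating pair for further review are assumptions introduced here, and `holistic_score` is a hypothetical helper name.

```python
def holistic_score(rating_a, rating_b, scale=(1, 6), max_gap=1):
    """Combine two independent holistic ratings into one reported score.

    Returns (mean_score, needs_review). `needs_review` is True when the two
    raters differ by more than `max_gap` points (an assumed policy).
    """
    low, high = scale
    for rating in (rating_a, rating_b):
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside the {low}-{high} scale")
    needs_review = abs(rating_a - rating_b) > max_gap
    return (rating_a + rating_b) / 2, needs_review


print(holistic_score(4, 5))  # close ratings
print(holistic_score(2, 5))  # ratings far apart, flagged for review
```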
SCORING METHODS FOR RESPONSIVE AND EXTENSIVE WRITING
Holistic Scoring
Definition: Assigning a single score to represent a general overall assessment
Purpose of use: Appropriate for administrative purposes, such as admission into an institution or placement in a course
Advantages: Quick scoring; high inter-rater reliability; scores easily interpreted by lay persons; emphasizes the strengths of the written piece; applicable to many different disciplines
Disadvantages: No washback potential; masks the differences across the subskills within each score; not applicable to all genres; needs trained evaluators to use the scale accurately
Primary Trait Scoring
Definition: Assigning a score based on the effectiveness of the text in achieving its purpose (accuracy, clarity, description, expression of opinion)
Purpose of use: To focus on the principal function of the text
Advantages: Practical; allows both the writer and the scorer to focus on the function or purpose
Analytic Scoring
Definition: Breaking the text down into subcategories and giving a separate rating for each
Purpose of use: Classroom instructional purposes
Advantages: More washback into the further stages of learning; diagnoses both the weaknesses and strengths of writing
Disadvantages: Lower practicality, since scorers have to attend to details with each sub-score
BEYOND SCORING: RESPONDING TO EXTENSIVE WRITING
This section concerns the process approach to writing and how assessment takes place within it. Many educators advocate a process approach to writing, which pays attention to the various stages that any piece of writing goes through. By spending time with learners on pre-writing phases, editing, re-drafting, and finally producing a finished version of their work, a process approach aims to get to the heart of the various skills that most writers employ.
Types of responding: Self, peer, teacher responding
Assessment type: Informal / formative
Washback: Potential positive washback
Role of the assessor: Guide / facilitator
GUIDELINES FOR ASSESSING STAGES OF WRITTEN COMPOSITION
Initial stages
Focus: Meaning, main idea, and organization
Ignore: Grammatical and lexical errors / minor errors
Indicate: Global errors, but do not correct them
Later stages
Focus: Fine tuning toward a final version
Ignore:
Indicate: Problems related to cohesion, documentation, and citation
10 BEYOND TESTS: ALTERNATIVES IN ASSESSMENT
Characteristics of Alternative Assessment
Alternative assessments:
require students to perform, create, produce, or do something;
use real-world contexts or simulations;
are non-intrusive in that they extend the day-to-day classroom activities;
allow students to be assessed on what they normally do in class every day;
use tasks that represent meaningful instructional activities;
focus on processes as well as products;
tap into higher-level thinking and problem-solving skills;
provide information about both the strengths and weaknesses of students;
are multi-culturally sensitive when properly administered;
ensure that people, not machines, do the scoring, using human judgment;
encourage open disclosure of standards and rating criteria; and
call upon teachers to perform new instructional and assessment roles.
DILEMMA OF MAXIMIZING BOTH PRACTICALITY AND WASHBACK
LARGE-SCALE STANDARDIZED TESTS: one-shot performances; timed; multiple-choice; decontextualized; norm-referenced; foster extrinsic motivation; highly practical, reliable instruments; minimize time and money; much practicality or reliability; cannot offer much washback or authenticity
ALTERNATIVE ASSESSMENT: open-ended in time orientation and format; contextualized to a curriculum; referenced to the criteria (objectives) of that curriculum; likely to build intrinsic motivation; require considerable time and effort; offer much authenticity and washback
The principal purpose of this chapter is to examine some of the alternatives in assessment that are markedly different from formal tests. Formal tests, especially large-scale standardized tests, tend to be one-shot performances that are timed, multiple-choice, decontextualized, and norm-referenced, and that foster extrinsic motivation. On the other hand, tasks like portfolios, journals, conferences and interviews, and self-assessment are open-ended in their time orientation and format, contextualized to a curriculum, referenced to the criteria (objectives) of that curriculum, and likely to build intrinsic motivation.
PORTFOLIOS
One of the most popular alternatives in assessment, especially within a framework of communicative language teaching, is portfolio development. Portfolios include materials such as:
essays and compositions in draft and final forms;
reports and project outlines;
poetry and creative prose;
artwork, photos, newspaper or magazine clippings;
audio and/or video recordings of presentations, demonstrations, etc.;
journals, diaries, and other personal reflections;
tests, test scores, and written homework exercises;
notes on lectures; and
self- and peer-assessments, comments, and checklists.
Successful portfolio development will depend on following a number of steps and guidelines.
1. State objectives clearly.
2. Give guidelines on what materials to include.
3. Communicate assessment criteria to students.
4. Designate time within the curriculum for portfolio development.
5. Establish periodic schedules for review and conferencing.
6. Designate an accessible place to keep portfolios.
7. Provide positive washback when giving final assessments.
JOURNALS
A journal is a log or account of one's thoughts, feelings, reactions, assessments, ideas, or progress toward goals, usually written with little attention to structure, form, or correctness. Categories or purposes in journal writing include the following: a. Language learning logs b. Grammar journals c. Responses to readings d. Strategies-based learning logs e. Self-assessment reflections f. Diaries of attitudes, feelings, and other affective factors g. Acculturation logs
CONFERENCES AND INTERVIEWS
Conferences
Conferences are not limited to drafts of written work; they can also cover portfolios and journals. Conferences must assume that the teacher plays the role of a facilitator and guide, not of an administrator of a formal assessment.
Interviews
An interview may have one or more of several possible goals, in which the teacher:
assesses the student's oral production;
ascertains a student's needs before designing a course or curriculum;
seeks to discover a student's learning styles and preferences.
One overriding principle of effective interviewing centers on the nature of the questions that will be asked.
OBSERVATIONS
In order to carry out classroom observation, it is of course important to take the following steps:
1. Determine the specific objectives of the observation.
2. Decide how many students will be observed at one time.
3. Set up the logistics for making unnoticed observations.
4. Design a system for recording observed performances.
5. Plan how many observations you will make.
SELF AND PEER ASSESSMENT
Five categories of self and peer assessment: 1.
Assessment of performance, in this category, a student typically monitors him or herself in either oral or written production and renders some kind of evaluation of performance. 2. Indirect assessment of performance, indirect assessment targets larger slices of time with a view to rendering an evaluation of general ability as opposed to one to one specific, relatively time constrained performance. 3. Metacognitive assessment for setting goals, some kind evaluation are more strategic in nature, with the purpose not just of viewing past performance or competence but of setting goals and maintaining an eye on the process of their pursuit. 4. Socioaffective assessment, yet another type of self and peer assessment comes in the form of methods of examining affective factors in learning. Such assessment is quite different from looking at and planning linguistic aspects of acquisition. 5. Student generated tests, a final type of assessment that is not usually classified strictly as self or peer assessment is the technique of engaging students in the process of constructing tests themselves. GUIDELINES FOR SELF AND PEER ASSESSMENT Self and peer assessment are among the best possible formative types of assessment and possibly the most rewarding. Four guidelines will help teachers bring this intrinsically motivating task into the classroom successfully. 1. Tell students the purpose of assessment 2. Define the task clearly 3. Encourage impartial evaluation of performance or ability 4. Ensure beneficial washback through follow up tasks A TAXONOMY OF SELF AND PEER ASSESSMENT TASKS It is helpful to consider a variety of tasks within each of the four skills( listening skill, speaking skill, reading skill, writing skill). An evaluation of self and peer assessment according to our classic principles of assessment yields a pattern that is quite consistent with other alternatives to assessment that have been analyzed in this chapter. 
Practicality can achieve a moderate level with such procedures as checklists and questionnaires CHAPTER 11: GRADING AND STUDENT EVALUATION GUIDELINES FOR SELECTING GRADING CRITERIA It is essential for all components of grading to be consistent with an institutional philosophy and/or regulations (see below for a further discussion of this topic). All of the components of a final grade need to be explicitly stated in writing to students at the beginning of a term of study, with a designation of percentages or weighting figures for each component. If your grading system includes items (d) through (g) in the questionnaire above (improvement, behavior, effort; motivation), it is important for you to recognize their subjectivity. But this should not give you an excuse to avoid converting such factors into observable and measurable results. Finally, consider allocating relatively _ small weights to items (c) through (h) so that a grade primarily reflects achievement. A designation of 5 percent to 10 percent of a grade to such factors will not mask strong achievement in a course. CALCULATING GRADES: ABSOLUTE AND RELATIVE GRADING ABSOLUTE GRADING: If you pre-specify standards of performance on a numerical point system, you are using an absolute system of grading. For example, having established points for a midterm test, points for a final exam, and points accumulated for the semester, you might adhere to the specifications in the table below. The key to making an absolute grading system work is to be painstakingly clear on competencies and objectives, and on tests, tasks, and other assessment techniques that will figure into the formula for assigning a grade. RELATIVE GRADING: It is more commonly used than absolute grading. It has the advantage of allowing your own interpretation and of adjusting for unpredicted ease or difficulty of a test. 
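The absolute grading approach described above, with weighted components and pre-specified standards, can be sketched as follows. The component names, the weights, and the cut-off points are all hypothetical values chosen for illustration (the notes do not reproduce the original table); the only idea taken from the text is that non-achievement factors such as effort carry a small weight, here 10 percent.

```python
# Hypothetical weights (percent of the final grade); they must total 100.
WEIGHTS = {"midterm": 30, "final": 40, "coursework": 20, "effort": 10}

# Hypothetical absolute cut-offs: minimum percentage needed for each letter.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def final_percentage(scores, weights=WEIGHTS):
    """Weighted average of component scores, each on a 0-100 scale."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100 percent")
    return sum(scores[name] * weight for name, weight in weights.items()) / 100


def letter_grade(percentage, cutoffs=CUTOFFS):
    """Map a percentage onto a letter using fixed, pre-announced standards."""
    for minimum, letter in cutoffs:
        if percentage >= minimum:
            return letter
    return "F"


student = {"midterm": 80, "final": 90, "coursework": 70, "effort": 100}
pct = final_percentage(student)
print(pct, letter_grade(pct))
```

Because the standards are fixed in advance, a student's letter grade here does not depend on how classmates performed, which is the point of contrast with relative grading.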
Relative grading is usually accomplished by ranking students in order of performance (percentile ranks) and assigning cut-off points for grades. An older, relatively uncommon method of relative grading is what has been called grading "on the curve," a term that comes from the normal bell curve of normative data plotted on a graph. TEACHERS’ PERCEPTIONS OF APPROPRIATE GRADE DISTRIBUTIONS Most teachers bring to a test or a course evaluation an interpretation of estimated appropriate distributions, follow that interpretation, and make minor adjustments to compensate for such matters as unexpected difficulty. What is surprising, however, is that teachers' preconceived notions of their own standards for grading often do not match their actual practice INSTITUTIONAL EXPECTATIONS AND CONSTRAINTS For many institutions letter grading is foreign but point systems (100 pts or percentages) are common. Some institutions refuse to employ either a letter grade or a numerical system of evaluation and instead offer narrative evaluations of Ss. This preference for more individualized evaluations is often a reaction to overgeneralization of letter and numerical grading. CROSS-CULTURAL FACTORS AND THE QUESTION OF DIFFICULTY A number of variables bear on the issue. In many cultures, it is unheard of to ask a student to self-assess performance. Ts assign a grade, and nobody questions the teacher's criteria. measure of a good teacher is one who can design a test that is so difficult that no student could achieve a perfect score. The fact that students fall short of such marks of perfection is a demonstration of the teacher's superior knowledge. as a corollary, grades of A are reserved for a highly select few, and students are delighted with Bs. one single final examination is the accepted determinant of a student's entire course grade. the notion of a teacher's preparing students to do their best on a test is an educational contradiction. 
In some cultures a "hard" test is a good test, but in others, a good test results in a distribution like the one in the bar graph for a "great bunch": a large proportion of As and Bs, a few Cs, and maybe a D or an F for the "deadbeats" in the class. How do you gauge such difficulty as you design a classroom test that has not had the luxury of piloting and pre-testing? The answer is complex. It is usually a combination of a number of possible factors: experience as a teacher (with appropriate intuition) adeptness at designing feasible tasks special care in framing items that are clear and relevant mirroring in-class tasks that students have mastered variation of tasks on the test itself reference to prior tests in the same course a thorough review and preparation for the test knowledge of your students' collective abilities a little bit of luck WHAT DO LETTER GRADES “MEAN”? Typically, institutional manuals for teachers and students will list the following descriptors of letter grades: A: excellent B: good C: adequate D: inadequate/unsatisfactory F: failing/unacceptable The overgeneralization implicit in letter grading underscores the meaninglessness of the adjectives typically cited as descriptors of those letters. Is there a solution to their gate-keeping role? 1. Every teacher who uses letter grades or a percentage score to provide an evaluation, whether a summative, end-of-course assessment or on a formal assessment procedure, should a. use a carefully constructed system of grading, b. assign grades on the basis of explicitly stated criteria, and c. base criteria on objectives of course or assessment procedure(s). 2. Educators everywhere must work to persuade the gatekeepers of the world that letter/numerical evaluations are simply one side of a complex representation of a student's ability. Alternatives to letter grading are essential considerations. 
ALTERNATIVES TO LETTER GRADING For assessment of a test, paper, report, extra-class exercise, or other formal, scored task, the primary objective of which is to offer formative feedback, the possibilities beyond a simple number or letter include a teacher's marginal and/or end comments, a teacher's written reaction to a student's self-assessment of performance, a teacher's review of the test in the next class period, peer-assessment of performance, self-assessment of performance, and a teacher's conference with the student. For summative assessment of a student at the end of a course, those same additional assessments can be made, perhaps in modified forms: a teacher's marginal and/or end of exam/paper/project comments T's summative written evaluative remarks on a journal, portfolio, or other tangible product T's written reaction to a student's self assessment of performance in a course a completed summative checklist of competencies, with comments narrative evaluations of general performance on key objectives a teacher's conference with the student A more detailed look is now appropriate for a few of the summative alternatives to grading, particularly self-assessment, narrative evaluations, checklists, and conferences. 1. Self-assessment. Self-assessment of end-of-course attainment of objectives is recommended through the use of the following: Checklists a guided journal entry that directs the student to reflect on the content and linguistic objectives an essay that self-assesses, a teacher-student conference 2. Narrative evaluations. In protest against the widespread use of letter grades as exclusive indicators'of achievement, a number of institutions have at one time or another required narrative evaluations of students. In some instances those narratives replaced grades, and in others they supplemented them. (pg. 296-297) Advantages: individualization, evaluation of multiple objectives of a course, face validity, washback potential. 
Disadvantages: not quantified by admissions and transcript evaluation offices; not practical (time-consuming); Ss paying little attention to them; Ts succumbing to formulaic narratives that follow a template.
3. Checklist evaluations. To compensate for the time-consuming impracticality of narrative evaluation, some programs opt for a compromise: a checklist with brief comments from the teacher, ideally followed by a conference and/or a response from the student. Advantages: increased practicality, reliability, washback. Teacher time is minimized; uniform measures are applied across all students; some open-ended comments from the teacher are available; and the student responds with his or her own goals (in light of the results of the checklist and teacher comments). Note: When the checklist format is accompanied, as in this case, by letter grades as well, virtually none of the disadvantages of narrative evaluations remain, with only a small chance that some individualization may be slightly reduced.
4. Conferences. Perhaps enough has been said about the virtues of conferencing. You already know that the impracticality of scheduling sessions with students is offset by its washback benefits.
SOME PRINCIPLES AND GUIDELINES FOR GRADING AND EVALUATION
You should now understand that:
grading is not necessarily based on a universally accepted scale,
grading is sometimes subjective and context-dependent,
grading of tests is often done on the "curve,"
grades reflect a teacher's philosophy of grading,
grades reflect an institutional philosophy of grading,
cross-cultural variation in grading philosophies needs to be understood,
grades often conform, by design, to a teacher's expected distribution of students across a continuum,
tests do not always yield an expected level of difficulty,
letter grades may not "mean" the same thing to all people, and
alternatives to letter grades or numerical scores are highly desirable as additional indicators of achievement.
With those characteristics of grading and evaluation in mind, the following principled guidelines should help you be an effective grader and evaluator of student performance: Develop an informed, comprehensive personal philosophy of grading that is consistent with your philosophy of teaching and evaluation. Ascertain an institution’s philosophy of grading and, unless otherwise negotiated, conform to that philosophy (so that you are not out of step with others). Design tests that conform to appropriate institutional and cultural expectations of the difficulty that Ss should experience. Select appropriate criteria for grading and their relative weighting in calculating grades. Communicate criteria for grading to Ss at the beginning of the course and at subsequent grading periods (mid-term, final). Triangulate letter grade evaluations with alternatives that are more formative and that give more washback.
