LANGUAGE ASSESSMENT
ELT Teacher Training
Tarık İNCE
CHAPTER 1
TESTING, ASSESSING AND TEACHING
In an era of communicative language teaching:
Tests should measure up to standards of authenticity and meaningfulness.
Teachers should design tests that serve as motivating learning experiences rather
than anxiety-provoking threats.
Tests;
should be positive experiences
should build a person’s confidence and become learning experiences
should bring out the best in students
shouldn’t be degrading
shouldn’t be artificial
shouldn’t be anxiety-provoking
Language assessment aims:
to create more authentic, intrinsically motivating assessment procedures that
are appropriate for their context and designed to offer constructive feedback to students.
What is a test?
A test
is a method of measuring a person’s ability, knowledge or performance in a given domain.
1. Method
A set of techniques, procedures or items.
To qualify as a test, the method must be explicit and structured. For example:
Multiple-choice questions with prescribed correct answers
A writing prompt with a scoring rubric
An oral interview based on a question script and a checklist of
expected responses to be filled by the administrator
2. Measure
A means for offering the test-taker some kind of result.
If an instrument does not specify a form of reporting measurement, then that
technique cannot be defined as a test.
Scoring may take forms such as the following:
A classroom-based short-answer essay test may earn the test-taker a letter grade
accompanied by the instructor’s marginal comments.
Large-scale standardized tests provide a total numerical score, a percentile
rank, and perhaps some sub-scores.
3. The test-taker (the individual): the person who takes the test.
Testers need to understand:
who the test-takers are,
what their previous experience and background are,
whether the test is appropriately matched to their abilities,
how test-takers should interpret their scores.
4. Performance
A test measures performance, but the results imply the test-taker’s ability or competence.
Some language tests measure one’s ability to perform language:
To speak, write, read or listen to a subset of language
Some others measure a test-taker’s knowledge about language:
Defining a vocabulary item, reciting a grammatical rule or identifying a
rhetorical feature in written discourse.
5. Measuring a given domain
It means measuring the desired criterion and not including other factors.
Proficiency tests:
Even though the actual performance on the test involves only a sampling of
skills, that domain is overall proficiency in a language – general competence in
all skills of a language.
Classroom-based performance tests:
These have more specific criteria. For example:
A test of pronunciation might well be a test of only a limited set of
phonemic minimal pairs.
A vocabulary test may focus on only the set of words covered in a particular
lesson.
A well-constructed test is an instrument that provides an accurate measure of
the test taker’s ability within a particular domain.
TESTING, ASSESSMENT & TEACHING
TESTING
Tests are prepared administrative
procedures that occur at
identifiable times in a
curriculum.
When tested, learners know that
their performance is being
measured and evaluated.
When tested, learners muster all
their faculties to offer peak
performance.
Tests are a subset of assessment.
They are only one among many
procedures and tasks that
teachers can ultimately use to
assess students.
Tests are usually time-constrained
(usually spanning a class period
or at most several hours) and
draw on a limited sample of
behaviour.
ASSESSMENT
Assessment is an ongoing process that
encompasses a much wider
domain.
A good teacher never ceases to assess
students, whether those
assessments are incidental or
intended.
Whenever a student responds to a
question, offers a comment, or tries
out a new word or structure, the
teacher subconsciously makes an
assessment of the student’s
performance.
Assessment includes testing.
Assessment is more extended and it
includes a lot more components.
What about TEACHING?
For optimal learning to take place, learners must have opportunities to “play”
with language without being formally graded.
Teaching sets up the practice games of language learning:
the opportunities for learners to listen, think, take risks, set goals,
and process feedback from the teacher (coach)
and then recycle through the skills that they are trying to master.
During these practice activities, teachers are indeed observing students’
performance and making various evaluations of each learner.
Then, it can be said that testing and assessment are subsets of teaching.
ASSESSMENT
Informal Assessment
They are incidental, unplanned
comments and responses.
Examples include: “Nice job!” “Well
done!” “Good work!” “Did you say
can or can’t?” “Broke or break!”, or
putting a ☺ on some homework.
Classroom tasks are designed to
elicit performance without
recording results and making
fixed judgements about a
student’s competence.
Examples of unrecorded assessment:
marginal comments on papers,
responding to a draft of an essay,
advice about how to better
pronounce a word, a suggestion
for a strategy for compensating
for a reading difficulty, and
showing how to modify a
student’s note-taking to better
remember the content of a
lecture.
Formal Assessment
They are exercises or procedures specifically
designed to tap into a storehouse of skills
and knowledge.
They are systematic, planned sampling
techniques constructed to give Ts and sts
an appraisal of student achievement.
They are tournament games that occur
periodically in the course of teaching.
It can be said that all tests are formal
assessments, but not all formal
assessment is testing.
Example 1: A student’s journal or portfolio of
materials can be used as a formal
assessment of the attainment of certain
course objectives, but it is problematic to
call those two procedures “tests”.
Example 2: A systematic set of observations
of a student’s frequency of oral
participation in class is certainly a formal
assessment, but not a “test”.
THE FUNCTION OF AN ASSESSMENT
Formative Assessment
Evaluating students in the process of “forming” their competencies and skills with
the goal of helping them to continue that growth process.
It provides for the ongoing development of the learner’s language.
Example: When you give students a comment or a suggestion, or call attention to an
error, that feedback is offered to improve the learner’s language ability.
Virtually all kinds of informal assessment are formative.
Summative Assessment
It aims to measure, or summarize, what a student has grasped, and typically
occurs at the end of a course.
It does not necessarily point the way to future progress.
Example: Final exams in a course and general proficiency exams.
All tests/formal assessment (quizzes, periodic review tests, midterm exams, etc.)
are summative.
IMPORTANT:
As far as summative assessment is concerned, in the aftermath of any
test, students tend to think, “Whew! I’m glad that’s over.
Now I don’t have to remember that stuff anymore!”
An ideal teacher should try to change this attitude among students.
A teacher should:
· instill a more formative quality into his or her lessons
· offer students an opportunity to convert tests into “learning
experiences”.
TESTS
Norm-Referenced Tests
Each test-taker’s score is interpreted in
relation to a mean (average score),
median (middle score), standard
deviation (extent of variance in
scores), and/or percentile rank.
The purpose is to place test-takers along a
mathematical continuum in rank order.
Scores are usually reported back to the
test-taker in the form of a numerical
score. (230 out of 300, 84%, etc.)
Typical of these tests are standardized tests
like SAT, TOEFL, ÜDS, KPDS, DS, etc.
These tests are intended to be administered
to large audiences, with results
efficiently disseminated to test takers.
They must have fixed, predetermined
responses in a format that can be scored
quickly at minimum expense.
Money and efficiency are primary
concerns in these tests.
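The statistics used to interpret a norm-referenced score can be illustrated with a short sketch. It is only an illustration under stated assumptions: the scores are invented, the function names are hypothetical, and percentile rank is taken here as the percentage of scores strictly below a given score (other definitions exist).

```python
import statistics

def percentile_rank(scores, score):
    """Percentage of scores in the group that fall strictly below `score`."""
    below = sum(1 for s in scores if s < score)
    return 100.0 * below / len(scores)

def norm_referenced_report(scores, score):
    """Interpret one raw score in relation to the whole group of test-takers."""
    return {
        "mean": statistics.mean(scores),        # average score
        "median": statistics.median(scores),    # middle score
        "std_dev": statistics.pstdev(scores),   # extent of variance in scores
        "percentile_rank": percentile_rank(scores, score),
    }

# Hypothetical raw scores (out of 300) for ten test-takers.
group_scores = [150, 180, 200, 210, 220, 230, 240, 250, 260, 280]
report = norm_referenced_report(group_scores, 230)
print(report)
```

In this invented group, a raw score of 230 sits just above the mean and median, which is the kind of placement along a continuum that a norm-referenced test reports.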
Criterion-Referenced Tests
They are designed to give test-takers
feedback, usually in the form of
grades, on specific course or lesson
objectives.
Tests that involve the sts in only one class,
and are connected to a curriculum, are
Criterion-Referenced Tests.
Much time and effort on the part of the
teacher are required to deliver useful,
appropriate feedback to students.
The distribution of students’ scores across
a continuum may be of little concern as
long as the instrument assesses
appropriate objectives.
Because the emphasis here is on classroom-based
testing, as opposed to standardized, large-scale
testing, Criterion-Referenced
Testing is of more prominent interest than
Norm-Referenced Testing.
Approaches to Language Testing: A Brief History
Historically, language-testing trends have followed the trends of
teaching methods.
During the 1950s: An era of behaviourism and special attention to
contrastive analysis.
Testing focused on specific language elements such as phonological,
grammatical, and lexical contrasts between two languages.
During the 1970s and 80s: Communicative theories were widely
accepted.
This brought a more integrative view of testing.
Today: Test designers are trying to form authentic, valid
instruments that simulate real world interaction.
APPROACHES TO LANGUAGE TESTING
A) Discrete-Point Testing
Language can be broken down into its component parts and those parts
can be tested successfully.
Component parts: listening, speaking, reading and writing.
Units of language (discrete points): phonology, graphology,
morphology, lexicon, syntax and discourse.
A language proficiency test should sample all 4 skills and as many
linguistic discrete points as possible.
B) Integrative Testing
Language competence is a unified set of interacting abilities that cannot
be tested separately.
Communicative competence is global and requires such integration that
it cannot be captured in additive tests of grammar, reading, vocabulary,
and other discrete points of language.
Two examples of integrative tests: the cloze test and dictation.
Unitary trait hypothesis: It suggests an “indivisible” view of language
proficiency: that vocabulary, grammar, phonology, the “4 skills”,
and other discrete points of language could not be disentangled from
each other in language performance.
In the face of evidence from a study in which each student scored differently in
various skills depending on background, country and major
field, Oller admitted that the “unitary trait hypothesis was wrong.”
Cloze Test:
Cloze Test results are good measures of overall proficiency.
The ability to supply appropriate words in blanks requires a number of
abilities that lie at the heart of competence in a language:
knowledge of vocabulary, grammatical structure,
discourse structure, reading skills and strategies.
It was argued that successful completion of cloze items taps into all of those
abilities, which were said to be the essence of global language proficiency.
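How a cloze passage might be prepared and scored can be sketched in a few lines. The notes do not specify a deletion procedure, so this sketch assumes the common fixed-ratio approach (blanking every n-th word) and exact-word scoring; the function names and the sample passage are hypothetical.

```python
def make_cloze(passage, n=7):
    """Replace every n-th word with a numbered blank.

    Returns the gapped text and the answer key (the deleted words).
    Punctuation attached to a deleted word is removed with it.
    """
    words = passage.split()
    answers = []
    for i in range(n - 1, len(words), n):
        answers.append(words[i])
        words[i] = f"({len(answers)})____"
    return " ".join(words), answers

def score_cloze(answers, responses):
    """Exact-word scoring: count responses that match the deleted word."""
    return sum(
        1 for a, r in zip(answers, responses)
        if a.lower() == r.strip().lower()
    )

passage = ("The quick brown fox jumps over the lazy dog "
           "near the quiet river bank today")
text, key = make_cloze(passage, n=5)
print(text)
print(key)
print(score_cloze(key, ["jumps", "by", "Today"]))
```

Exact-word scoring is the strictest option; a teacher could instead accept any word that fits the context, which requires human judgment rather than a simple comparison.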
Dictation
Essentially, learners listen to a passage of 100 to 150 words read aloud by an
administrator (or audiotape) and write what they hear, using correct spelling.
Supporters argue that dictation is an integrative test because
success on a dictation requires careful listening,
reproduction in writing of what is heard, efficient short-term memory,
and, to an extent, some expectancy rules to aid the short-term memory.
C) Communicative Language Testing (a recent approach, after the mid-1980s)
What does it criticise?
In order for a particular language test to be useful for its intended purposes, test performance
must correspond in demonstrable ways to language use in non-test situations.
Integrative tests such as cloze only tell us about a candidate’s linguistic competence.
They do not tell us anything directly about a student’s performance ability. (Knowledge
about a language, not the use of language)
Any suggestion?
A quest for authenticity, as test designers centered on communicative performance.
The supporters emphasized the importance of strategic competence (the ability to
employ communicative strategies to compensate for breakdowns as well as to enhance
the rhetorical effect of utterances) in the process of communication.
Any problem in using this approach?
Yes. Communicative testing presented challenges to test designers as they began
to identify the real-world tasks that language learners were called upon to perform.
But, it was clear that the contexts for those tasks were extraordinarily widely varied and
that the sampling of tasks for any one assessment procedure needed to be validated by
what language users actually do with language.
As a result:
The assessment field became more and more concerned with the authenticity of tasks
and the genuineness of texts.
D) Performance-Based Assessment
Performance-based assessment of language typically involves oral production,
written production, open-ended responses, integrated performance (across skill areas),
group performance, and other interactive tasks.
Any problems?
It is time-consuming and expensive, but those extra efforts are paying off in more direct
testing because sts are assessed as they perform actual or simulated real-world tasks.
The advantage of this approach?
Higher content validity is achieved because learners are measured in the process of
performing the targeted linguistic acts.
Important:
Performance-based assessment means that teachers should rely a little less on formally
structured tests and a little more on evaluation while students are performing various tasks.
In performance-based assessment:
Interactive tests (speaking, requesting, responding, etc.) are IN ☺; paper-and-pencil tests are OUT.
Result: In interactive tests, tasks can approach the authenticity of real-life language use.
CURRENT ISSUES IN CLASSROOM TESTING
The design of communicative, performance-based assessment continues to
challenge both assessment experts and classroom teachers.
Three issues are helping to shape our current understanding of
effective assessment. These are:
· The effect of new theories of intelligence on the testing industry
· The advent of what has come to be called “alternative assessment”
· The increasing popularity of computer-based testing
New Views on Intelligence
In the past:
Intelligence was once viewed strictly as the ability to perform linguistic and
logical-mathematical problem solving.
For many years, we lived in a world of standardized, norm-referenced tests
that are timed, in a multiple-choice format, and consist of a multiplicity of
logic-constrained items, many of which are inauthentic.
We relied on timed, discrete-point, analytical tests in measuring language.
We were forced to stay within the limits of objectivity and give impersonal responses.
Recently, newer views of intelligence have added:
Spatial intelligence
Musical intelligence
Bodily-kinesthetic intelligence
Interpersonal intelligence
Intrapersonal intelligence
EQ (Emotional Quotient), which underscores the role of emotions in our cognitive processing.
Those who manage their emotions tend to be more capable of fully intelligent
processing, because anger, grief, resentment, other feelings can easily impair
peak performance in everyday tasks as well as higher-order problem solving.
The intuitive appeal of these conceptualizations of intelligence infused the 1990s
with a sense of both freedom and responsibility in our testing agenda.
Our challenge was to test interpersonal, creative, communicative,
interactive skills, and in doing so to place some trust in our subjectivity and intuition.
Traditional and “Alternative” Assessment
Traditional Assessment
-One-shot, standardized exams
-Timed, multiple-choice format
-Decontextualized test items
-Scores suffice for feedback
-Norm-referenced scores
-Focus on the “right” answer
-Summative
-Oriented to product
-Non-interactive process
-Fosters extrinsic motivation
Alternative Assessment
-Continuous long-term assessment
-Untimed, free-response format
-Contextualized communicative tests
-Individualized feedback and washback
-Criterion-referenced scores
-Open-ended, creative answers
-Formative
-Oriented to process
-Interactive process
-Fosters intrinsic motivation
IMPORTANT
It is difficult to draw a clear line of distinction between
traditional and alternative assessment.
Many forms of assessment fall in between the two, and some
combine the best of both.
More time and higher institutional budgets are required to
administer and score assessments that presuppose more
subjective evaluation, more individualization, and more
interaction in the process of offering feedback.
But the payoff of the “Alternative Assessment” comes with more
useful feedback to students, the potential for intrinsic
motivation, and ultimately a more complete description of a
student’s ability.
Computer-Based Testing
Some computer-based tests are small-scale. Others are standardized, large scale tests
(e.g. TOEFL) in which thousands of test-takers are involved.
A particular type of computer-based test, the Computer-Adaptive Test (CAT), is also available.
In CAT, the test-taker sees only one question at a time, and the computer scores each
question before selecting the next one.
Test-takers cannot skip questions, and, once they have entered and confirmed their
answers, they cannot return to questions.
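The one-question-at-a-time logic of a CAT can be sketched as follows. This is a deliberately simplified illustration, not how any real testing program works: the item bank, the five difficulty levels, and the "one level up after a correct answer, one level down after a wrong one" rule are all assumptions, and every name is hypothetical.

```python
def run_adaptive_test(item_bank, answer_fn, start_level=3, num_items=5):
    """Present items one at a time, scoring each before choosing the next.

    item_bank: dict mapping a difficulty level (int) to a list of item ids.
    answer_fn: callable taking an item id, returning True if answered correctly.
    Returns a list of (item, level, correct) tuples in the order presented.
    """
    levels = sorted(item_bank)
    level = start_level
    used = set()
    history = []
    for _ in range(num_items):
        candidates = [i for i in item_bank[level] if i not in used]
        if not candidates:
            break  # no unused items remain at this level
        item = candidates[0]
        used.add(item)  # an answered item is never shown again
        correct = answer_fn(item)  # scored immediately
        history.append((item, level, correct))
        # Assumed rule: harder item after a correct answer, easier otherwise.
        idx = levels.index(level) + (1 if correct else -1)
        level = levels[max(0, min(idx, len(levels) - 1))]
    return history

# Hypothetical bank: three items at each of five difficulty levels.
bank = {
    1: ["a1", "a2", "a3"],
    2: ["b1", "b2", "b3"],
    3: ["c1", "c2", "c3"],
    4: ["d1", "d2", "d3"],
    5: ["e1", "e2", "e3"],
}
# Simulated test-taker who can answer items up to level 3 only.
history = run_adaptive_test(bank, lambda item: item[0] in "abc")
print(history)
```

The simulated test-taker moves back and forth between levels 3 and 4, which shows how the test settles around the level the test-taker can handle.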
Advantages of Computer-Based Testing:
o Classroom-based testing
o Self-directed testing on various aspects of a lang (vocabulary, grammar, discourse, etc)
o Practice for upcoming high-stakes standardized tests
o Some individualization, in the case of CATs.
o Scored electronically for rapid reporting of results.
Disadvantages of Computer-Based Testing:
o Lack of security and the possibility of cheating in unsupervised computerized tests.
o Home-grown quizzes may be mistaken for validated assessments.
o Open-ended responses are less likely to appear because of the need for human scorers.
o The human interactive element is absent.
An Overall summary
Tests
Assessment is an integral part of the teaching-learning cycle.
In an interactive, communicative curriculum, assessment is almost constant.
Tests can provide authenticity, motivation, and feedback to the learner.
Tests are essential components of a successful curriculum and learning process.
Assessments
Periodic assessments can increase motivation as milestones of student progress.
Appropriate assessments aid in the reinforcement and retention of information.
Assessments can confirm areas of strength and pinpoint areas needing further work.
Assessments provide a sense of periodic closure to modules within a curriculum.
Assessments promote student autonomy by encouraging self-evaluation of progress.
Assessments can spur learners to set goals for themselves.
Assessments can aid in evaluating teaching effectiveness.
Decide whether the following statements are TRUE or FALSE.
1. It’s possible to create authentic and motivating assessment to offer constructive feedback to the students. ----------
2. All tests should offer the test-takers some kind of measurement or result. ----------
3. Performance-based tests measure test-takers’ knowledge about language. ----------
4. Tests are the best tools to assess students. ----------
5. Assessment and testing are synonymous terms. ----------
6. Teachers’ incidental and unplanned comments and responses to students are an example of formal assessment. ----------
7. Most of our classroom assessment is summative assessment. ----------
8. Formative assessment always points toward future formation of learning. ----------
9. The distribution of students’ scores across a continuum is a concern in norm-referenced tests. ----------
10. Criterion-referenced testing has more instructional value than norm-referenced testing for classroom teachers. ----------
1. TRUE
2. TRUE
3. FALSE (They are designed to test actual use of language, not knowledge about language.)
4. FALSE (We cannot say they are the best, but they are one of the useful devices for assessing students.)
5. FALSE (They are not.)
6. FALSE (They are examples of informal assessment.)
7. FALSE (Most of it is formative assessment.)
8. TRUE
9. TRUE
10. TRUE
CHAPTER 2
PRINCIPLES OF LANGUAGE ASSESSMENT
There are five criteria for “testing a test”:
1. Practicality 2. Reliability 3. Validity 4. Authenticity 5. Washback
1. PRACTICALITY
A practical test
· is not excessively expensive,
· stays within appropriate time constraints,
· is relatively easy to administer, and
· has a scoring/evaluation procedure that is specific and time-efficient.
For a test to be practical
· administrative details should clearly be established before the test,
· sts should be able to complete the test reasonably within the set time frame,
· the test should be able to be administered smoothly (it should not be bogged down in procedure),
· all materials and equipment should be ready,
· the cost of the test should be within budgeted limits,
· the scoring/evaluation system should be feasible in the teacher’s time frame.
· methods for reporting results should be determined in advance.
2. RELIABILITY
A reliable test is consistent and dependable.
The issue of reliability of a test may best be addressed by considering a number
of factors that may contribute to the unreliability of a test.
Consider the following possibilities: fluctuations
· in the student (Student-Related Reliability),
· in scoring (Rater Reliability),
· in test administration (Test Administration Reliability), and
· in the test (Test Reliability) itself.
Student-Related Reliability:
Temporary illness, fatigue, a bad day, anxiety, and other physical or psychological
factors may make an “observed” score deviate from one’s “true” score.
A test-taker’s “test-wiseness”, or strategies for efficient test taking, can also
be included in this category.
Rater Reliability:
Human error, subjectivity, lack of attention to scoring criteria, inexperience, inattention,
or even preconceived biases may enter into the scoring process.
Inter-rater unreliability occurs when two or more scorers yield inconsistent scores for the
same test.
Intra-rater unreliability (inconsistency within a single scorer) can result from unclear scoring criteria, fatigue, bias toward
particular “good” and “bad” students, or simple carelessness.
One solution to such intra-rater unreliability is to read through about half of the tests
before rendering any final scores or grades, then to recycle back through the whole set
of tests to ensure an even-handed judgment.
The careful specification of an analytical scoring instrument can increase rater reliability.
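One simple way to check whether two scorers are consistent is to count how often they give the same score to the same paper. The sketch below is an illustration only: the scores are invented, the function names are hypothetical, and exact and adjacent agreement are just two of several possible indices.

```python
def exact_agreement(scores_a, scores_b):
    """Proportion of papers on which both raters gave the same score."""
    pairs = list(zip(scores_a, scores_b))
    return sum(1 for a, b in pairs if a == b) / len(pairs)

def adjacent_agreement(scores_a, scores_b, tolerance=1):
    """Proportion of papers on which the raters differ by at most `tolerance`."""
    pairs = list(zip(scores_a, scores_b))
    return sum(1 for a, b in pairs if abs(a - b) <= tolerance) / len(pairs)

# Hypothetical scores (1 to 5 scale) given by two raters to the same six essays.
rater_a = [4, 3, 5, 2, 4, 3]
rater_b = [4, 3, 4, 2, 5, 1]

exact = exact_agreement(rater_a, rater_b)
adjacent = adjacent_agreement(rater_a, rater_b)
print(exact, adjacent)
```

Low agreement on either index would suggest that the scoring criteria need to be specified more carefully or that the raters need to discuss how they apply them.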
Test Administration Reliability:
Unreliability may also result from the conditions in which the test is administered.
Examples: street noise, photocopying variations, poor light, temperature, and the condition of desks and chairs.
Test Reliability:
Sometimes the nature of the test itself can cause measurement errors.
Timed tests may discriminate against sts who do not perform well with a time limit.
Poorly written test items may be a further source of test unreliability.
3. VALIDITY
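The extent to which inferences made from assessment results are appropriate,
meaningful, and useful in terms of the purpose of the assessment.
In classroom terms, this includes the extent to which the assessment requires students
to perform tasks that were included in the previous classroom lessons.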
How is the validity of a test established?
There is no final, absolute measure of validity, but several different kinds of
evidence may be invoked in support.
In some cases, it may be appropriate to examine the extent to which a test calls for
performance that matches that of the course or unit of study being tested.
In other cases, we may be concerned with how well a test determines whether
or not students have reached an established set of goals or level of competence.
In still other cases, it could be appropriate to study statistical correlation with other related but
independent measures.
Other concerns about a test’s validity
may focus on the consequences of a test, beyond measuring the criteria themselves, or
even on the test-taker’s perception of validity.
We will look at these five types of evidence below.
Content Validity:
If a test requires the test-taker to perform the behaviour that is being measured, it can claim
content-related evidence of validity, often popularly referred to as content validity.
If you want to assess a person’s ability to speak the target language, asking students to answer paper-and-pencil multiple-choice
questions requiring grammatical judgements does not achieve content validity.
For content validity to be achieved, the following conditions should be met:
· Classroom objectives should be identified and appropriately framed. The first measure
of an effective classroom test is the identification of objectives.
· Lesson objectives should be represented in the form of test specifications. A test
should have a structure that follows logically from the lesson or unit you are testing.
If you clearly perceive the performance of test-takers as reflective of the classroom
objectives, then you can argue that content validity has probably been achieved.
To understand content validity, consider the difference between direct and indirect testing.
Direct testing involves the test-taker in actually performing the target task.
Indirect testing involves performing not the target task itself, but a task that is related to it in some way.
Direct testing is the most feasible way to achieve content validity in assessment.
Criterion-related Validity:
It examines the extent to which the criterion of the test has actually been achieved.
For example, a classroom test designed to assess a point of grammar in
communicative use will have criterion validity if test scores are corroborated
either by observed subsequent behavior or by other communicative measures
of the grammar point in question.
Criterion-related evidence usually falls into one of two categories:
· Concurrent (occurring at the same time) validity:
A test has concurrent validity if its results are supported by other concurrent
performance beyond the assessment itself.
For example, the validity of a high score on the final exam of a foreign
language course will be substantiated by actual proficiency in the language.
· Predictive (forecasting) validity:
The assessment criterion in such cases is not to measure concurrent ability but
to assess (and predict) a test-taker’s likelihood of future success.
For example, the predictive validity of an assessment becomes important in the
case of placement tests, language aptitude tests, and the like.
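Criterion-related evidence of this kind is often examined through statistical correlation between test scores and an independent measure. The sketch below computes a Pearson correlation coefficient by hand; the placement scores and later course grades are invented, the function name is hypothetical, and the choice of Pearson's r is an assumption rather than something these notes prescribe.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sum((a - mean_x) ** 2 for a in x)
    spread_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(spread_x * spread_y)

# Hypothetical data: placement test scores and the same students' later course grades.
placement_scores = [40, 55, 60, 72, 85]
course_grades = [50, 58, 65, 70, 88]

r = pearson_r(placement_scores, course_grades)
print(round(r, 3))
```

A coefficient close to 1 would count as evidence that the placement test predicts later success; a value near 0 would suggest that it does not.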
Construct Validity:
Every issue in language learning and teaching involves theoretical constructs.
In the field of assessment, construct validity asks, “Does this test actually tap
into the theoretical construct as it has been identified?” (Does the test really
have the structural features needed to test the topic or skill I want
to test?)
Imagine that you have been given a procedure for conducting an oral
interview. The scoring analysis for the interview includes several factors in
the final score: pronunciation, fluency, grammatical accuracy, vocabulary use,
and sociolinguistic appropriateness. The justification for these five factors lies
in a theoretical construct that claims those factors to be major components of
oral proficiency. So if you were asked to conduct an oral proficiency
interview that evaluated only pronunciation and grammar, you could be
justifiably suspicious about the construct validity of that test.
The exams we describe as “large-scale standardized tests” are not very well suited
in terms of construct validity. For the sake of practicality
(that is, for reasons of both time and cost), these tests cannot measure all the
language skills that ought to be measured. For example, the absence of an “oral
production” section in the TOEFL stands out as a major obstacle
in terms of construct validity.
Consequential Validity:
Consequential validity encompasses all the consequences of a test,
including such considerations as its accuracy in measuring intended
criteria, its impact on the preparation of test-takers, its effect on the
learner, and the (intended and unintended) social consequences of a test’s
interpretation and use.
McNamara (2000, p. 54) cautions against test results that may reflect
socioeconomic conditions such as opportunities for coaching (private tutoring,
special attention). For example, only some families can afford coaching, and
children with more highly educated parents may get help from their parents.
Teachers should consider the effect of assessments on students’
motivation, subsequent performance in a course, independent learning,
study habits, and attitude toward school work.
Face Validity:
Face validity is the degree to which a test looks right, and appears to measure the knowledge
or abilities it claims to measure, based on the subjective judgment of test-takers.
· Face validity means that the students perceive the test to be valid. Face
validity asks the question “Does the test, on the ‘face’ of it, appear from the
learner’s perspective to test what it is designed to test?”
· Face validity is not something that can be empirically tested by a teacher or
even by a testing expert. It depends on subjective evaluation of the test-taker.
· A classroom test is not the time to introduce new tasks.
· If a test samples the actual content of what the learner has achieved or expects
to achieve, face validity will be more likely to be perceived.
· Content validity is a very important ingredient in achieving face validity.
· Students will generally judge a test to be face valid if directions are clear, the
structure of the test is organized logically, its difficulty level is appropriately
pitched, the test has no “surprises”, and timing is appropriate.
· To give an assessment procedure that is “biased for best” a teacher offers
students appropriate review and preparation for the test, suggests strategies
that will be beneficial, and structures the test so that the best students will be
modestly challenged and the weaker students will not be overwhelmed.
4. AUTHENTICITY
In an authentic test
· the language is as natural as possible,
· items are as contextualized as possible,
· topics and situations are interesting, enjoyable and/or humorous,
· some thematic organization, such as through a story line or
episode is provided,
· tasks represent real-world tasks.
Reading passages are selected from real-world sources that test-takers are
likely to have encountered or will encounter.
Listening comprehension sections feature natural language with hesitations,
white noise, and interruptions.
More and more tests offer items that are “episodic” in that they are sequenced
to form meaningful units, paragraphs, or stories.
5. WASHBACK
Washback includes the effects of an assessment on teaching and learning prior
to the assessment itself, that is, on preparation for the assessment.
Informal performance assessment is by nature more likely to have built-in
washback effects because the teacher is usually providing interactive feedback.
Formal tests can also have positive washback, but they provide no washback if
the students receive a simple letter grade or a single overall numerical score.
Tests should serve as learning devices through which washback is achieved.
Sts’ incorrect responses can become windows of insight into further work.
Their correct responses need to be praised, especially when they represent
accomplishments in a student’s inter-language.
Washback enhances a number of basic principles of language acquisition:
intrinsic motivation, autonomy, self-confidence, language ego, interlanguage,
and strategic investment, among others.
To enhance washback, comment generously and specifically on test performance.
Washback implies that students have ready access to the teacher to discuss the
feedback and evaluation he or she has given.
Teachers can raise the washback potential by asking students to use test results
as a guide to setting goals for their future effort.
What is washback?
In general terms: The effect of testing on teaching and learning
In large-scale assessment: Refers to the effects that the tests have on
instruction in terms of how students prepare for the test
In classroom assessment: The information that washes back to students
in the form of useful diagnoses of strengths and weaknesses
What does washback enhance?
Intrinsic motivation
Language ego
Autonomy
Inter-language
Self-confidence
Strategic investment
What should teachers do to enhance washback?
Comment generously and specifically on test performance
Respond to as many details as possible
Praise strengths
Criticize weaknesses constructively
Give strategic hints to improve performance
Decide whether the following statements are TRUE or FALSE.
1. An expensive test is not practical.
2. One of the sources of unreliability of a test is the school.
3. Sts, raters, test, and administration of it may affect the test’s reliability.
4. In indirect tests, students do not actually perform the task.
5. If students are aware of what is being tested when they take a test, and think
that the questions are appropriate, the test has face validity.
6. Face validity can be tested empirically.
7. Diagnosing strengths and weaknesses of students in language learning is a
facet of washback.
8. One way of achieving authenticity in testing is to use simplified language.
1. TRUE
2. FALSE
3. TRUE
4. TRUE
5. TRUE
6. FALSE
7. TRUE
8. FALSE
Decide which type of validity each sentence belongs to.
1. It is based on subjective judgment.
2. It questions the accuracy of measuring the intended criteria.
3. It appears to measure the knowledge and abilities it claims to measure.
4. It measures whether the test meets the classroom objectives.
5. It requires the test to be based on a theoretical background.
6. Washback is part of it.
7. It requires the test-taker to perform the behavior being measured.
8. The students (test-takers) think they are given enough time to do the test.
9. It assesses a test-taker's likelihood of future success (e.g. placement tests).
10. The students' psychological mood may affect it negatively or positively.
11. It includes the consideration of the test's effect on the learner.
12. Items of the test do not seem to be complicated.
13. The test covers the objectives of the course.
14. The test has clear directions.
1. Face
2. Consequential
3. Face
4. Content
5. Construct
6. Content
7. Criterion related
8. Face
9. Criterion related
10. Consequential
11. Consequential
12. Face
13. Content
14. Face
Decide with which type of reliability could each sentence be related?
1. There are ambiguous items.
2. The student is anxious.
3. The tape is of bad quality.
4. The teacher is tired but continues scoring.
5. The test is too long.
6. The room is dark.
7. The student has had an argument with the teacher.
8. The scorers interpret the criteria differently.
9. There is a lot of noise outside the building.
1. Test reliability
2. Student-related reliability
3. Test administration reliability
4. Rater reliability
5. Test reliability
6. Test administration reliability
7. Student-related reliability
8. Rater reliability
9. Test administration reliability
CHAPTER 3
DESIGNING CLASSROOM LANGUAGE
TESTS
In this chapter, we examine test types and learn how to design tests and revise existing ones.
To start the process of designing tests, we will ask some critical questions.
Five questions should form the basis of your approach to designing tests for class.
Question 1: What is the purpose of the test?
· Why am I creating this test?
· For an evaluation of overall proficiency? (Proficiency Test)
· To place students into a course? (Placement Test)
· To measure achievement within a course? (Achievement Test)
Once you have established the major purpose of a test, you can determine its objectives.
Question 2: What are the objectives of the test?
· What specifically am I trying to find out?
· What language abilities are to be assessed?
Question 3: How will test specifications reflect both purpose and objectives?
· When a test is designed, the objectives should be incorporated into a structure
that appropriately weights the various competencies being assessed.
Question 4: How will test tasks be selected and the separate items arranged?
· The tasks need to be practical.
· They should also achieve content validity by presenting tasks that mirror
those of the course being assessed.
· They should be evaluated reliably by the teacher or scorer.
· The tasks themselves should strive for authenticity, and the progression of
tasks ought to be biased for best performance.
Question 5: What kind of scoring, grading, and/or feedback is expected?
· Tests vary in the form and function of feedback, depending on their purpose.
· For every test, the way results are reported is an important consideration.
· Under some circumstances a letter grade or a holistic score may be appropriate;
other circumstances may require that a teacher offer substantive washback to
the learner.
TEST TYPES
Defining your purpose will help you choose the right kind of test, and it will
also help you to focus on the specific objectives of the test.
Below are the test types to be examined:
1. Language Aptitude Tests
2. Proficiency Tests
3. Placement Tests
4. Diagnostic Tests
5. Achievement Tests
1. Language Aptitude Tests
They predict a person’s success prior to exposure to the second language.
An aptitude test is designed to measure capacity or general ability to learn a FL.
They are designed to apply to the classroom learning of any language.
Two standardized aptitude tests have been used in the US:
The Modern Language Aptitude Test (MLAT)
The Pimsleur Language Aptitude Battery (PLAB)
Tasks in the MLAT include number learning, phonetic script, spelling clues,
words in sentences, and paired associates.
There’s no unequivocal evidence that language aptitude tests predict
communicative success in a language.
Any test that claims to predict success in learning a language is undoubtedly
flawed because we now know that with appropriate self-knowledge, and
active strategic involvement in learning, everyone can succeed eventually.
2. Proficiency Tests
A proficiency test is not limited to any one course, curriculum, or single skill in
the language; rather, it tests overall ability.
It includes: standardized multiple choice items on grammar, vocabulary,
reading comprehension, and aural comprehension. Sometimes a sample of
writing is added, and more recent tests also include oral production.
Such tests often have content validity weaknesses.
Proficiency tests are almost always summative and norm-referenced.
They are usually not equipped to provide diagnostic feedback.
Their role is to accept or to deny someone’s passage into the next stage of a journey.
TOEFL is a typical standardized proficiency test.
Creating and validating them with research is a time-consuming and costly process.
Choosing one of a number of commercially available proficiency tests is a far
more practical method for classroom teachers.
3. Placement Tests
The objective of a placement test is to correctly place sts into a course or level.
Certain proficiency tests can act in the role of placement tests.
A placement test usually includes a sampling of the material to be covered in
the various courses in a curriculum.
Sts should find the test neither too easy nor too difficult but challenging.
ESL Placement Test (ESLPT) at San Francisco State University has three parts.
Part 1: sts read a short article and then write a summary essay.
Part 2: sts write a composition in response to an article.
Part 3: multiple-choice; sts read an essay and identify grammar errors in it.
The ESLPT is more authentic but less practical, because human evaluators are
required for the first two parts.
Reliability problems are present but are mitigated by conscientious training of evaluators.
What is lost in practicality and reliability is gained in the diagnostic
information that the ESLPT provides.
4. Diagnostic Tests
A diagnostic test is designed to diagnose specified aspects of a language.
A diagnostic test can help a student become aware of errors and encourage the
adoption of appropriate compensatory strategies.
A test of pronunciation might diagnose phonological features that are difficult for sts
and should become part of a curriculum. Such tests offer a checklist of features
for the administrator to use in pinpointing difficulties.
A writing diagnostic would elicit a writing sample from sts that would allow Ts to
identify those rhetorical and linguistic features on which the course needs to
focus special attention.
A diagnostic test of oral production was created by Clifford Prator (1972) to
accompany a manual of English pronunciation. In the test:
Test-takers are directed to read a 150-word passage while they are tape-recorded.
The test administrator then refers to an inventory of phonological items for
analyzing a learner’s production.
After multiple listenings, the administrator produces a checklist of errors in 5 categories:
stress and rhythm, intonation, vowels, consonants, and other factors.
This information helps Ts make decisions about aspects of English phonology.
5. Achievement Tests
An achievement test is related directly to lessons, units, or even a total curriculum.
Achievement tests should be limited to particular material addressed in a
curriculum within a particular time frame and should be offered after a course
has focused on the objectives in question.
There’s a fine line of difference between a diagnostic test and an achievement test.
Achievement tests analyze the extent to which students have acquired
language features that have already been taught. (They look back at the past.)
Diagnostic tests should elicit information on what students need to work
on in the future. (They look ahead to the future.)
Primary role of achievement test is to determine whether course objectives
have been met – and appropriate knowledge and skills acquired – by the end
of a period of instruction.
They are often summative because they are administered at the end of a unit or term.
But effective achievement tests can serve as useful washback by showing the
errors of students and helping them analyze their weaknesses and strengths.
Achievement tests range from five- or ten-minute quizzes to three-hour final
examinations, with an almost infinite variety of item types and formats.
Practical steps in constructing classroom tests:
A) Assessing Clear, Unambiguous Objectives
Before giving a test;
examine the objectives for the unit you’re testing.
Your first task in designing a test, then, is to determine appropriate objectives.
“Students will recognize and produce tag questions, with the correct
grammatical form and final intonation pattern, in simple social conversations.”
B) Drawing Up Test Specifications
Test specifications will simply comprise
a) a broad outline of the test
b) what skills you will test
c) what the items will look like
This is an example for test specifications based on the objective stated above:
“Students will recognize and produce tag questions, with the correct
grammatical form and final intonation pattern, in simple social conversations.”
C) Devising Test Tasks
As you devise test tasks, consider:
how students will perceive them (face validity)
the extent to which authentic language and contexts are present
potential difficulty caused by cultural schemata
In revising your draft, you should ask yourself some important questions:
1. Are the directions to each section absolutely clear?
2. Is there an example item for each section?
3. Does each item measure a specified objective?
4. Is each item stated in clear, simple language?
5. Does each multiple choice have appropriate distractors; that is, are the wrong items
clearly wrong and yet sufficiently “alluring” that they aren’t ridiculously easy?
6. Is the difficulty of each item appropriate for your students?
7. Is the language of each item sufficiently authentic?
8. Do the sum of items and the test as a whole adequately reflect the learning objectives?
In the final revision of your test:
time yourself; if the test should be shortened or lengthened, make the necessary adjustments
make sure your test is neat and uncluttered on the page
if there is an audio component, make sure that the script is clear
D) Designing Multiple-Choice Test Items
There are a number of weaknesses in multiple-choice items:
The technique tests only recognition knowledge.
Guessing may have a considerable effect on test scores.
The technique severely restricts what can be tested.
It is very difficult to write successful items.
Washback may be harmful.
Cheating may be facilitated.
However,
two principles support multiple-choice formats: practicality and reliability.
Some important terms in multiple-choice items:
Multiple-choice items are all receptive, or selective, that is, test-taker chooses
from a set of responses rather than creating a response. Other receptive item
types include true-false questions and matching lists.
Every multiple-choice item has a stem, which presents a stimulus, and several options or
alternatives to choose from.
One of those options, the key, is the correct response; the others serve as distractors.
IMPORTANT!!!
Consider the following four guidelines for designing multiple-choice items for
both classroom-based and large-scale situations:
1. Design each item to measure a specific objective. (Do not test knowledge of modals and of articles in the same item.)
2. State both stem and options as simply and directly as possible. Do not use
superfluous words; another rule of succinctness is
to remove needless redundancy from your options.
3. Make certain that the intended answer is clearly the only correct one.
Eliminating unintended possible answers is often the most difficult problem of
designing multiple-choice items. With only a minimum of context in each
stem, a wide variety of responses may be perceived as correct.
4. Use item indices to accept, discard, or revise items: The
appropriate selection and arrangement of suitable multiple-choice items on a
test can best be accomplished by measuring items against three indices:
a) item facility (IF), or item difficulty
b) item discrimination (ID), or item differentiation
c) distractor analysis
a) Item facility (IF) is the extent to which an item is easy or difficult for the proposed
group of test-takers.
If 13 of 20 students answer an item correctly, IF = 13/20 = 0.65 (65%). An IF between 15% and 85% is acceptable.
Two good reasons for including a very easy item (85% or higher) are to build in some
affective feelings of “success” among lower-ability students and to serve as warm-up
items. And very difficult items can provide a challenge to the highest-ability sts.
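The item facility calculation can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and treating the 15% and 85% figures as hard cut-offs (with 85% itself counted as "very easy") is an assumption, since the notes give them as a rough acceptable range.

```python
def item_facility(num_correct: int, num_test_takers: int) -> float:
    """Proportion of test-takers who answered the item correctly."""
    if num_test_takers <= 0:
        raise ValueError("num_test_takers must be positive")
    return num_correct / num_test_takers


def classify_item(if_value: float, low: float = 0.15, high: float = 0.85) -> str:
    """Label an item by its IF value; the cut-offs are assumed, not fixed rules."""
    if if_value < low:
        return "very difficult"
    if if_value >= high:
        return "very easy"
    return "acceptable"


# Example from the notes: 13 of 20 students answer correctly.
if_value = item_facility(13, 20)
print(f"IF = {if_value:.2f} ({classify_item(if_value)})")
```

A very easy or very difficult label does not by itself mean the item should be discarded; as noted above, such items can serve as warm-ups or as a challenge.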
b) Item discrimination (ID) is the extent to which an item differentiates between high- and
low-ability test-takers.
An item on which high-ability students and low-ability students score equally well
would have poor ID because it did not discriminate between the two groups.
An item that garners correct responses from most of the high-ability group
and incorrect responses from most of low-ability group has good discrimination power.
Divide 30 students into three equal groups, ranked from highest to lowest score. For one item, compare the
10 highest scorers with the 10 lowest scorers as follows:
Item #       High-ability students (top 10)   Low-ability students (bottom 10)
Correct      7                                2
Incorrect    3                                8
ID = (7 - 2) / 10 = 0.50. The result tells us that the item has a moderate level of ID.
High discriminating level would approach 1.0 and no discriminating power at all would
be zero. In most cases, you would want to discard an item that scored near zero.
No absolute rule governs establishment of acceptable and unacceptable ID indices.
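As a minimal sketch, the ID calculation from the worked example can be written as follows. The function name is hypothetical, and the sketch assumes the high- and low-ability groups are the same size, as in the example.

```python
def item_discrimination(high_correct: int, low_correct: int, group_size: int) -> float:
    """ID = (correct in high group - correct in low group) / size of one group."""
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    return (high_correct - low_correct) / group_size


# Example from the notes: 7 of the top 10 and 2 of the bottom 10 answer correctly.
print(item_discrimination(7, 2, 10))

# An item on which both groups score equally well does not discriminate at all.
print(item_discrimination(6, 6, 10))
```

Because no absolute rule governs acceptable ID values, the sketch returns the raw index and leaves the accept, revise, or discard decision to the test designer.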
c) Distractor efficiency (DE) is the extent to which the distractors “lure” a
sufficient number of test-takers, especially lower-ability ones, and those
responses are somewhat evenly distributed across all distractors.
Example (note: C is the correct response):
Choices   High-ability students (10)   Low-ability students (10)
A         0                            3
B         1                            5
C*        7                            2
D         0                            0
E         2                            0
The item might be improved in two ways:
a) Distractor D doesn’t fool anyone. Therefore it probably has no utility. A
revision might provide a distractor that actually attracts a response or two.
b) Distractor E attracts more responses (2) from the high-ability group than
the low-ability group (0). Why are good students choosing this one?
Perhaps it includes a subtle reference that entices the high group but is “over
the head” of low group, and therefore latter sts don’t even consider it.
The other two distractors (A and B) seem to be fulfilling their function of
attracting some attention from the lower-ability students.
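The two checks applied in this example can be sketched in Python. The function name and the three status labels are hypothetical, and the rules encode only the two problems discussed above (a distractor that attracts no one, and a distractor that attracts more high-ability than low-ability test-takers); they are not a complete distractor analysis.

```python
def review_distractors(counts: dict[str, tuple[int, int]], key: str) -> dict[str, str]:
    """Flag distractors using (high_group, low_group) response counts per option."""
    notes = {}
    for option, (high, low) in counts.items():
        if option == key:
            continue  # the correct response is not a distractor
        if high + low == 0:
            notes[option] = "attracts no one"
        elif high > low:
            notes[option] = "attracts mainly high-ability test-takers"
        else:
            notes[option] = "working as intended"
    return notes


# Response counts from the example table; C is the key.
counts = {"A": (0, 3), "B": (1, 5), "C": (7, 2), "D": (0, 0), "E": (2, 0)}
for option, note in review_distractors(counts, key="C").items():
    print(option, "->", note)
```

On the example data this flags D and E for revision and leaves A and B alone, which matches the reasoning in the notes.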
SCORING, GRADING AND GIVING FEEDBACK
A) Scoring
As you design a test, you must consider how the test will be scored and graded
The scoring plan reflects the relative weight that you place on each section and item.
Whichever skill you consider more important should receive more points,
e.g. Oral production 30%, Listening 30%, Reading 20%, and Writing 20%.
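A weighted scoring plan like the one above can be sketched as follows. The weights come from the example in the notes; the function name and the section scores for the sample student are made up for illustration, and each section is assumed to be scored as a percentage from 0 to 100.

```python
# Section weights in percent, taken from the example scoring plan.
WEIGHTS = {"oral production": 30, "listening": 30, "reading": 20, "writing": 20}


def weighted_total(section_scores: dict[str, float], weights: dict[str, int] = WEIGHTS) -> float:
    """Combine per-section percentage scores into one weighted total (0-100)."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must add up to 100")
    return sum(weights[name] * section_scores[name] for name in weights) / 100


# Hypothetical student: section scores are illustrative only.
scores = {"oral production": 80, "listening": 70, "reading": 90, "writing": 60}
print(weighted_total(scores))
```

Changing the weights changes what the total rewards, so the plan should be settled when the test is designed rather than after scoring.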
B) Grading
Grading doesn’t mean just giving “A” for 90-100. It’s not that simple.
How you assign letter grades is a product of:
the country, culture and context of the class,
institutional expectations (most of them unwritten),
explicit and implicit definitions of grades that you have set forth,
the relationship you have established with the class,
sts’ expectations that have been engendered in previous tests and quizzes in class.
C) Giving Feedback
Feedback should become beneficial washback. These are some examples of feedback:
1. a letter grade
2. a total score
3. four subscores (speaking, listening, reading, writing)
4. for the listening and reading sections
a. an indication of correct/incorrect responses
b. marginal comments
5. for the oral interview
a. scores for each element being rated
b. a checklist of areas needing work
c. oral feedback after the interview
d. a post-interview conference to go over results
6. on the essay
a. scores for each element being rated
b. a checklist of areas needing work
c. marginal and end-of-essay comments, suggestions
d. a post-test conference to go over work
e. a self-assessment
7. on all or selected parts of the test, peer checking of results
8. a whole-class discussion of results of the test
9. individual conferences with each student to review the whole test
Decide whether the following statements are TRUE or FALSE.
1. A language aptitude test measures a learner’s future success in learning a FL.
2. Language aptitude tests are very common today.
3. A proficiency test is limited to a particular course or curriculum.
4. The aim of a placement test is to place a student into particular level.
5. Placement tests have many varieties.
6. Any placement test can be used at a particular teaching program.
7. Achievement tests are related to classroom lessons, units, or curriculum.
8. A five-minute quiz can be an achievement test.
9. The first task in designing a test is to determine test specification.
1. TRUE
2. FALSE
3. FALSE
4. TRUE
5. TRUE
6. FALSE (Not all placement tests suit every teaching program.)
7. TRUE
8. TRUE (Achievement tests range from five-minute quizzes to three-hour final examinations.)
9. FALSE (The first task is to determine appropriate objectives.)
Decide whether the following statements are TRUE or FALSE.
1. It is very easy to develop multiple-choice tests.
2. Multiple-choice tests are practical but not reliable.
3. Multiple-choice tests are time-saving in terms of scoring and grading.
4. Multiple-choice items are receptive.
5. Each multiple-choice item in a test should measure a specific objective.
6. The stem of a multiple-choice item should be as long as possible in order to
help students to understand the context.
7. If the Item Facility value is .10 (10%), it means the item is very easy.
8. Item discrimination index differentiates between high and low-ability sts.
1. FALSE (It seems easy, but is not very easy.)
2. FALSE (They can be both practical and reliable.)
3. TRUE
4. TRUE
5. TRUE
6. FALSE (It should be short and to the point.)
7. FALSE (An item with an IF value of .10 is a very difficult one.)
8. TRUE
Chapter 4 STANDARDIZED TESTING:
WHAT IS STANDARDIZATION:
A standardized test presupposes certain standard objectives or criteria that are held
constant from one form of the test to another.
They measure a broad band of competencies, not just one particular curriculum.
They are norm-referenced and the main goal is to place sts in a rank order.
Scholastic Aptitude Test (SAT):
college entrance exam seeking further information
The Graduate Record Exam (GRE):
test for entry into many graduate school programs
Graduate Management Admission Test (GMAT) & Law School Admission Test (LSAT):
tests that specialize in particular disciplines
Test of English as a Foreign Language (TOEFL):
produced by the Educational Testing Service (ETS)
International English Language Testing System (IELTS):
another standardized test of English proficiency
The tests are standardized because they specify a set of competencies for a given
domain and through a process of construct validation they program a set of tasks.
In general standardized test items are in the form of MC.
They provide ‘objective’ means for determining correct and incorrect responses.
However MC is not the only test item type in standardized test.
Human scored tests of oral and written production are also involved.
ADVANTAGES AND DISADVANTAGES OF STANDARDIZED TESTS:
-Advantages:
* Ready-made previously (Ts don’t need to spend time to prepare it)
* It can be administered to a large number of sts within a time limit
* Easy to score thanks to MC format scoring (computerized or hole-punched
grid scoring)
* It has face validity
-Disadvantages:
* Inappropriate use of tests
* Misunderstanding of the difference between direct and indirect testing
characteristics of a standardized test
DEVELOPING A STANDARDIZED TEST:
- Knowing how to develop a standardized test can be helpful to
revise an existing test, adapt or expand an existing test, create a
smaller-scale standardized test
(A) The Test of English as a Foreign Language (TOEFL) ‘general
ability or proficiency’
(B) The English as a Second Language Placement Test (ESLPT),
San Francisco State University (SFSU) ‘placement test at a
university’
(C) The Graduate Essay Test (GET), SFSU ‘gate-keeping essay
test’
1. Determine the purpose and objectives of the test.
- Standardized tests are expected to be valid and practical
TOEFL
*To evaluate the English proficiency of people whose NL is not English.
*Colleges and universities in the US use the TOEFL score to admit or
refuse international applicants for admission
ESLPT
*To place already admitted sts at SFSU in an appropriate course in academic
writing and oral production.
*To provide Ts some diagnostic information about sts
GET
*To determine whether sts’ writing ability is sufficient to permit them to enter
graduate-level courses in their programs (it is offered at the beginning of each term)
2. Design test specification.
TOEFL
The first step is to define the construct of language proficiency.
After breaking language competence down into subsets of the 4 skills, each performance mode can
be examined on a continuum of linguistic units (pronunciation, spelling, words, grammar).
The oral production section tests fluency and pronunciation by using imitation.
The listening section focuses on a particular feature of language or on overall listening comprehension.
The reading section aims to test comprehension of long/short passages, single sentences,
phrases or words.
The writing section tests writing ability in the form of open-ended (free) composition, or it
can be structured to elicit anything from correct spelling to discourse-level competence.
ESLPT
Designing test specs for the ESLPT was a simpler task. The purpose is placement, and construct
validation of the test consisted of an examination of the content of the ESL courses.
*In the recent revision of the ESLPT, content & face validity were important theoretical issues.
Practicality and reliability in tasks and item response formats were equally important.
The specifications mirrored the reading-based, process writing approach used in class.
GET
Specifications for the GET are the skills of writing grammatically and rhetorically acceptable
prose on a topic, with clearly produced organization of ideas and logical development.
3. Design, select, and arrange test tasks/items.
TOEFL
• Content coding: the skills and a variety of subject matter without biasing (the
content must be universal and as neutral as possible)
• Statistical characteristics: these include IF and ID
• Before administration, they are piloted and scientifically selected to meet difficulty
specifications within each subsection, section and the test overall.
ESLPT
For the written parts, the main problems are
a) selecting appropriate passages (conforming to the standards of content validity)
b) providing appropriate prompts (they should fit the passages)
c) processing data from pilot testing
• In the MC editing test, the first (easier) task is to choose an appropriate essay within which
to embed errors. A more complicated one is to embed a specified number of errors
from pre-determined error categories. (Ts can derive the categories from sts’
previous errors in written work, and sts’ errors can be used as distractors.)
GET
Topics are appealing and capable of yielding the intended product: an essay that requires
an organized, logical argument and conclusion. No pilot testing of prompts is conducted.
• Be careful about the potential cultural effect on the numerous international students
who must take the GET
4. Make appropriate evaluations of different kinds of items.
- IF, ID and distractor analysis may not be necessary for a classroom (one-time)
test, but they are a must for a standardized MC test.
- For production responses, different forms of evaluation become important
(i.e. practicality, reliability & facility).
*practicality: clarity of directions, timing of the test, ease of administration & how
much time is required to score
*reliability: a major player in instances where more than one scorer is
employed, and to a lesser extent when a single scorer has to evaluate tests over
long spans of time that could lead to deterioration of standards
*facility: key for valid and successful items. Unclear directions, complex
language, obscure topics, fuzzy data, or culturally biased information may lead to
a higher level of difficulty
GET
*No data are collected from sts on their perceptions, but the scorers have an
opportunity to reflect on the validity of given topic
5. Specify scoring procedures and reporting formats.
TOEFL
-Scores are calculated and reported for
*three sections of TOEFL
*a total score
*a separate score
ESLPT
*It reports a score for each of the essay sections (each essay is read by 2 readers)
*The editing section is machine-scanned
*It provides data to place sts and diagnostic information
*Sts don’t receive their essays back
GET
*Each GET is read by two trained readers. They give scores between 1 and 4
*The recommended score is 6 as the threshold for allowing sts to pursue graduate-level
courses
*If a st scores below 6, he or she either repeats the test or takes a remedial course
6. Perform ongoing construct validation studies.
Any standardized test must be accompanied by systematic periodic
corroboration of its effectiveness and by steps toward its improvement
TOEFL
*the latest study on TOEFL examined the content characteristics of the TOEFL
from a communicative perspective based on current research in applied
linguistics and language proficiency assessment
ESLPT
*The development of the new ESLPT involved a lengthy process of both content
and construct validation, along with facing such practical issues as scoring the
written sections and a machine-scorable MC answer sheet
GET
*There is no research to validate the GET itself. Administrators rely on the
research on university level academic writing tests such as TWE.
*Some criticism of the GET has come from international test-takers who posit
that the topics and time limits of the GET work to the disadvantage of writers
whose native language is not English.
TOEFL
Primary market
U.S. universities and colleges for admission purposes
Type
Computer-based and paper-based
Response modes
Multiple-choice responses and essay
Time allocation
Up to 4 hours (CB); 3 hours (PB)
Specifications
CB: A listening section which includes dialogs, short conversations, academic
discussions, and mini lectures;
a structure section which tests formal language with two types of questions
(completing incomplete sentences and identifying one of four underlined
words or phrases that is not acceptable in English);
a reading section which include four to five passages on academic subjects
with 10-14 questions for each passage;
writing section which requires examinees to compose an essay on a given topic
MELAB
Primary market
U.S. and Canadian language programs and colleges; some worldwide
educational settings
Type
Paper-based
Response modes
Multiple-choice responses and essay
Time allocation
2.5 to 3.5 hours
Specifications
A 30-minute impromptu essay on a given topic;
a 25-minute multiple-choice listening comprehension test;
a 100-item 75-minute multiple choice test of grammar, cloze reading,
vocabulary, and reading comprehension;
an optional oral interview
IELTS
Primary market
Australian, British, Canadian, and New Zealand academic institutions and
professional organizations and some American academic institutions
Type
Computer-based for Reading and Writing sections; paper-based for Listening
and Speaking parts
Response modes
Multiple-choice responses, essay, and oral production
Time allocation
2 hours, 45 minutes
Specifications
A 60-minute reading;
a 60-minute writing;
a 30-minute listening of four sections;
a 10 to 15 minute speaking of five sections
TOEIC
Primary market
Worldwide; workplace settings
Type
Computer-based and paper-based
Response modes
Multiple-choice responses
Time allocation
2 hours
Specifications
A 100-item, approximately 45-minute listening administered by audiocassette
and which includes statements, questions, short conversations, and short talks;
a 100-item, 75-minute reading which includes cloze sentences, error
recognition, and reading comprehension
CHAPTER 5 STANDARDS-BASED
ASSESSMENT:
Mid 20th Century
Standardized tests had unchallenged popularity and growth.
Standardized tests brought convenience, efficiency, air of empirical science.
Tests were considered to be a way of making reforms in education.
Quickly and cheaply assessing students became a political issue.
Late 20th Century
*There was possible inequity and disparity between the content of such tests and
what teachers taught in classes.
*The claims in mid-20th century began to be questioned/criticised in all areas.
*Teachers were in the leading position of those challenges.
The Last 20 Years
*Educators became aware of weaknesses in standardized testing: They were
not accurate measures of achievement and success and they were not based on
carefully framed, comprehensive and validated standards of achievement.
*A movement has started to establish standards to assess students of all ages
and subject-matter areas.
*There have been efforts on basing the standardised tests on clearly specified
criteria for each content area being measured.
Criticism:
Some teachers claimed that those tests were unfair: there was dissimilarity
between the content & tasks of the tests & what they were teaching in their
classes.
Solutions:
By becoming aware of these weaknesses, educators started to establish some
standards on which sts of all ages & subject-matter areas might be assessed.
Most departments of education at the state level in the US have specified the
appropriate standards (criteria, objectives) for each grade level (pre-school to
grade 12) and each content area (math, science, arts…).
The construction of standards makes possible concordance between
standardized test specifications and the goals and objectives (ESL, ESOL,
ELD, ELLs). (The term LEP is discarded because of the negative connotation of the word
‘limited’.)
ELD STANDARDS
In creating benchmarks for accountability, there is a tremendous responsibility
to carry out a comprehensive study of a number of domains:
Categories of language; phonology, discourse, pragmatic, functional and
sociolinguistic elements.
Specification of what ELD students’ needs are.
A realistic scope of standards to be included in curriculum (the standards in
the curriculum must be realistic).
Standards for teachers (qualifications, expertise, training); that is, the
standards apply to teachers too.
A thorough analysis of means available to assess student attainment of those
standards (how we will assess what students have learned).
ELD ASSESSMENT
The development of standards obviously implies the responsibility for
correctly assessing their attainment.
It was found that the standardized tests of the past decades were not in line with
the newly developed standards, so the interactive process not only of developing
standards but also of creating standards-based assessments began.
Specialists design, revise and validate many tests.
The California English Language Development Test (CELDT) is a battery of
instruments designed to assess attainment of ELD standards across grade
level. (not publicly available)
Language and literacy assessment rubric collected students’ work.
Teachers’ observations recorded on scannable forms.
It provided useful data on students’ performance for oral production, reading
and writing in different grades
CASAS AND SCANS
CASAS: (Comprehensive Adult Student Assessment System):
Designed to provide broadly based assessments of ESL curricula across the US.
It includes more than 80 standardized assessment instruments used to;
*place sts in programs *diagnose learners’ needs
*monitor progress
*certify mastery of functional skills
At higher level of education (colleges, adult and language schools, workplace)
SCANS: (Secretary’s Commission on Achieving Necessary Skills):
outlines competencies necessary for language in the workplace
the competencies are acquired and maintained through training in basic skills(4 skills);
thinking skills (reasoning & problem solving);
personal qualities (self-esteem & sociability)
Resources (allocating time, materials, staff etc.)
Interpersonal skills, teamwork, customer service etc.
Information processing, evaluating data, organising files etc,
Systems, understanding social and organizational system,
Technology use and application
TEACHER STANDARDS: WHAT A TEACHER SHOULD BE LIKE
Linguistic and language development
Culture and interrelationship between language and culture
Planning and managing instructions
Consequences of standards-based and standardized testing
Positive
High level of practicality and reliability
Provides insights into academic performance
Accuracy in placing a number of test takers on a norm-referenced scale
Ongoing construct validation studies
Negative
They involve a number of test biases
A small but significant number of test takers are not assessed fairly, nor are
they assessed accurately
Fosters extrinsic motivation
Multiple intelligences are not considered
There is danger of test driven learning and teaching
In general performance is not directly assessed
Test bias
Standardized tests involve many test biases (lang, culture, race, gender, learning styles).
The National Center for Fair and Open Testing reports claims of test bias from teachers,
parents, students, and legal consultants (e.g. in reading texts and listening stimuli).
Standardised tests promote logical-mathematical and verbal-linguistic intelligences to the
virtual exclusion of the other contextualised, integrative intelligences. (Some learners
may need to be assessed with interviews, portfolios, samples of work, demonstrations,
observation reports: more formative assessment rather than summative.)
That would solve test bias problems, but bias is difficult to control in standardized items.
Those who use standardised tests for gate-keeping purposes, with few if any other
assessments, would do well to consider multiple measures before attributing infallible
predictive power to standardised tests.
Test-driven learning and teaching
It is another consequence of standardized testing. When students know that one single
measure of performance will determine their lives, they are less likely to take positive
attitudes towards learning. Motivation becomes extrinsic, not intrinsic.
Ts are also affected by test-driven policies. They are under pressure to make sure
their sts excel in the exam, at the risk of ignoring other objectives in the curriculum.
A more serious effect was to punish schools in lower-socioeconomic neighbourhoods.
ETHICAL ISSUES: CRITICAL LANGUAGE TESTING
One by-product of the rapidly growing testing industry is the danger of an abuse of power.
‘Tests represent a social technology deeply embedded in education, government and
business; tests are most powerful as they are often the single indicators for determining
the future of individuals’ (Shohamy)
Standards, specified by client educational institutions, bring with them certain ethical
issues surrounding the gate-keeping nature of standardized tests.
Teachers can demonstrate standards in their teaching.
Teachers can be assessed through their classroom performance.
Performance can be detailed with ‘indicators’: examples of evidence that the teacher can
meet a part of a standard.
Indicators are more than ‘how to’ statements (complex evidence of performance).
Performance-based assessment is integrated (not a checklist or discrete assessments).
Each assessment has performance criteria against which performance can be measured.
Performance criteria identify to what extent the teacher meets the standard.
Student learning is at the heart of the teacher’s performance.
6 ASSESSING LISTENING
OBSERVING THE PERFORMANCE OF FOUR SKILLS
1. two interacting concepts:
Performance
Observation
Sometimes the performance does not indicate true competence
a bad night’s rest, illness, an emotional distraction, test anxiety, a memory
block, or other student-related reliability factor.
One important principle for assessing a learner’s competence is to consider the
fallibility of the results of a single performance such as that produced in a test.
Assessments that involve multiple performances and contexts should be
designed along the following lines:
Several tests that are combined to form an assessment.
The listening tasks are designed to assess the candidate’s ability to process
forms of spoken English.
A single test with multiple test tasks to account for learning styles and
performance variables
In-class and extra-class graded work
Alternative forms of assessment ( e. g journal, portfolio, conference,
observation, self – assessment, peer – assessment )
Multiple measures give more reliable & valid assessment than a single measure
We can observe neither the process of performing nor a product?
1. Receptive skills -- Listening performance
The process of listening performance is about :
Invisible, inaudible – process of internalizing meaning from the auditory
signals being transmitted to the ear and brain.
2. The productive skills allow us to hear and see the process as it is performed.
Writing gives a permanent product in the form of the written piece.
But unless speech is recorded, there is no permanent observable product for speaking.
THE IMPORTANCE OF LISTENING
Listening has often played second fiddle to its counterpart, speaking, and it’s
rare to find just a listening test.
Listening is often implied as a component of speaking.
Oral production ability – other than monologues, speeches, reading aloud and
the like– is only as good as one’s listening comprehension.
Input in the aural-oral mode accounts for a large proportion of successful
language acquisition.
BASIC TYPES OF LISTENING
For an effective test, designing appropriate assessment tasks in listening begins
with the specification of objectives, or criteria.
The following processes flash through your brain :
1. recognize speech sounds and hold a temporary “ imprint” of them in
short-term memory.
2. Simultaneously determine the type of speech event.
3. Use (bottom-up) linguistic decoding skills and/or (top-down)
background schemata to bring a plausible interpretation to the message and
assign a literal and intended meaning to the utterance. (See Jeremy Harmer,
p. 305, on activating students’ schemata.)
4. in most cases, delete the exact linguistic form in which the message
was originally received in favor of conceptually retaining important or
relevant information in long-term memory.
four commonly identified types of listening performances
1. Intensive.
Listening for perception of the components.
Teachers use audio material on tape or hard disk when they want their students
to practice listening skills.
2. Responsive.
3. Selective.
4. Extensive.
Extensive listening will usually take place outside the classroom.
Material for extensive listening can be obtained from a number of sources.
Micro and Macro skills
Micro skills
Attending to smaller bits and chunks, in more of bottom-up process
Discriminate among sounds of English
retain chunks of language of different lengths in short-term memory
Recognize stress patterns, words in stressed/ unstressed position, rhythmic
structure , intonation contours, and their role in signaling information
Recognize reduced forms of words.
Distinguish word boundaries, recognize a core of words, and interpret
word order patterns and their significance
Process speech at different rates of delivery
Process speech containing pauses, errors, corrections, other performance
variables
Recognize grammatical word classes (nouns, verbs, etc.), systems (e.g. tense,
agreement, pluralization), pattern, rules, and elliptical forms.
Detect sentence constituents and distinguish between major-minor constituents
Recognize that a particular meaning may be expressed in different grammatical forms
Recognize cohesive devices in spoken discourse
Macroskills
Focusing on larger elements involved in a top-down approach
recognize the communicative functions of utterances, according to situations,
participants, goals
Infer situations, participants, goals using real-world knowledge
From events, ideas, and so on, described, predict outcomes, infer links and
connections between events, deduce causes and effects, and detect such
relations as main idea, supporting idea, new information, given information,
generalization, and exemplification
Distinguish between literal and implied meanings
Use facial, kinesic, body language, and other nonverbal clues to decipher
meanings
Develop and use a battery of listening strategies, such as detecting key
words, guessing the meaning from context, appealing for help, and signaling
comprehension or lack thereof
What Makes Listening Difficult
1. Clustering
Chunking-phrases, clauses, constituents
2. Redundancy
Repetitions, Rephrasing, Elaborations and Insertions
3. Reduced Forms
Understanding reduced forms that may not be a part of learner’s past
experiences in classes where only formal ”textbook” lang has been presented
4. Performance variables
Hesitations, False starts, Corrections, Diversion
5 Colloquial Language
Idioms, slang, reduced forms, shared cultural knowledge
6. Rate of Delivery
Keeping up with the speed of delivery, processing automatically as the speaker continues
7. Stress, Rhythm, and Intonation:
Correctly understanding prosodic elements of spoken language, which is more
difficult than understanding the smaller phonological bits and pieces.
8. Interaction:
Negotiation,clarification,attending signals,turn taking,maintenance,termination
Designing Assessment Tasks
• Recognizing Phonological and Morphological Elements
Phonemic pair, consonants
Test-takers hear: He’s from California
Test-takers read: A. He’s from California
B. She’s from California
Phonemic pair, vowels
Test-takers hear: Is he living?
Test-takers read: A. Is he leaving?
B. Is he living?
Morphological pair, -ed ending
Test-takers hear: I missed you very much.
Test-takers read: A. I missed you very much
B. I miss you very much
Stress pattern in can’t
Test-takers hear: My girlfriend can’t go to the party
Test-takers read: A. My girlfriend can go to the party
B. My girlfriend can’t go to the party
One word stimulus
Test-takers hear: vine
Test-takers read: A. Vine
B. Wine
•Paraphrase Recognition
– Sentence Paraphrase
Test-takers hear: Hello, my name is Keiko. I come from Japan.
Test-takers read: A. Keiko is comfortable in Japan
B. Keiko wants to come to Japan
C. Keiko is Japanese
D. Keiko likes Japan
– Dialogue paraphrase
Test-takers hear:
Man: Hi, Maria, my name is George.
Woman: Nice to meet you, George. Are you American?
Man: No, I’m Canadian.
Test-takers read: A. George lives in the United States
B. George is American
C. George comes from Canada
D. Maria is Canadian
Designing Assessment Tasks
• Appropriate response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers read: A. In about an hour
B. About an hour
C. About $10
D. Yes, I did.
• Open-ended response to a question
Test-takers hear: How much time did you take to do your homework?
Test-takers write or speak: __________________________________
Designing Assessment Tasks : Selective Listening
The test-taker listens to a limited quantity of aural input and discerns some specific information.
Listening Cloze (cloze dictations or partial dictations)
Listening cloze tasks require the test-taker to listen to a story, monologue
or conversation and simultaneously read a written text in which selected words or
phrases have been deleted.
One Potential Weakness of the listening cloze technique
Such tasks may simply become reading comprehension tasks. Test-takers who are asked
to listen to a story with periodic deletions in the written version may not need to listen
at all, yet may still be able to respond with the appropriate word or phrase.
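As a rough illustration of how the written side of a listening cloze might be prepared, the sketch below blanks out a designer-chosen set of words in a transcript and keeps an answer key. The function name make_listening_cloze, the sample transcript and the target words are all hypothetical and are not taken from the source. Choosing targets that are hard to guess from the written context alone (times, names, numbers) is assumed here as one possible way to limit the reading-comprehension weakness; it is not a rule stated in these notes.

```python
BLANK = "_" * 6  # how a deleted word is shown in the written text (arbitrary choice)


def make_listening_cloze(transcript, targets, blank=BLANK):
    """Blank out the first occurrence of each target word.

    Returns (gapped_text, answer_key). Hypothetical helper for illustration only.
    """
    remaining = [t.lower() for t in targets]
    gapped_words, answer_key = [], []
    for word in transcript.split():
        core = word.rstrip(".,;:!?")   # word without trailing punctuation
        trailing = word[len(core):]    # punctuation to keep after the blank
        if core.lower() in remaining:
            remaining.remove(core.lower())
            answer_key.append(core)
            gapped_words.append(blank + trailing)
        else:
            gapped_words.append(word)
    return " ".join(gapped_words), answer_key


# Invented sample transcript; the targets are details that must be heard.
transcript = "Lucy gets up at seven and takes the bus to school at eight."
gapped_text, answer_key = make_listening_cloze(transcript, ["seven", "bus", "eight"])
print(gapped_text)
print(answer_key)
```

The test-taker would read gapped_text while listening to the full transcript; answer_key is what the scorer checks responses against.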
Information Transfer
Aurally processed information must be transferred to a visual representation, e.g. labelling a diagram,
identifying an element in a picture, completing a form, or showing routes on a map.
Chart Filling
Test-takers see a chart of Lucy’s daily schedule and fill in the schedule.
Sentence Repetition
The test-takers must retain a stretch of language long enough to reproduce it, and then
must respond with an oral repetition of that stimulus.
DESIGNING ASSESSMENT TASKS: EXTENSIVE LISTENING
Dictation: Test-takers hear a passage, typically 50-100 words, recited 3 times;
First reading, natural speed, no pauses, test-takers listen for gist.
Second reading, slowed speed, pause at each break, test-takers write.
Third reading, natural speed, test takers check their work.
Communicative Stimulus-Response Tasks
The test-takers are presented with a stimulus monologue or conversation and
then are asked to respond to a set of comprehension questions.
First: Test-takers hear the instructions and the dialogue or monologue.
Second: Test-takers read the multiple-choice comprehension questions and
items, then choose the correct one.
Authentic Listening Tasks
Buck (2001, p. 92): “Every test requires some components of communicative
language ability, and no test covers them all. Similarly, every task shares some
characteristics with target-language tasks, and no test is completely authentic”
Alternatives to assess comprehension in a truly communicative context
Note taking
Listening to a lecturer and writing down the important ideas.
Disadvantage: scoring is time-consuming
Advantages: it mirrors the real classroom situation and fulfills the criteria of
cognitive demand, communicative language & authenticity
Editing
Editing a written version of an aural stimulus
Interpretive tasks:
paraphrasing a story or conversation
Potential stimuli include: song lyrics, poetry, radio, TV, news reports, etc.
Retelling
Listen to a story & simply retell it, either orally or in writing, to show full comprehension
Difficulties: scoring and reliability
Validity, cognitive demand, communicative ability and authenticity are well incorporated
into the task.
Interactive listening (face to face conversations)
Chapter-7 Assessing Speaking
Challenges of testing speaking:
1- The interaction of speaking and listening
2- Elicitation techniques
3- Scoring
BASIC TYPES OF SPEAKING
1.Imitative: (parrot back) Testing the ability to imitate a word, phrase, sentence.
Pronunciation is tested. Examples: Word, phrase, sentence repetition
2. Intensive: The purpose is producing short stretches of oral language. It is designed
to demonstrate competence in a narrow band of grammatical, phrasal, lexical,
phonological relationships (stress / rhythm / intonation)
3. Responsive: (interacting with the interlocutor) Tasks include interaction and test
comprehension, but at the somewhat limited level of very short conversations, standard
greetings, small talk, simple requests and comments, and the like.
4. Interactive: The difference between responsive and interactive speaking is the length and
complexity of the interaction, which includes multiple exchanges and/or multiple participants.
5. Extensive (monologue): Extensive oral production tasks include speeches, oral
presentations, and story-telling, during which the opportunity for oral interaction from
listeners is either highly limited (perhaps to nonverbal responses) or ruled out altogether.
Micro- and Macroskills of Speaking
Microskills of speaking refer to producing small chunks of language such as
phonemes, morphemes, words and phrasal units. The macroskills involve the
speaker's focus on the larger elements such as fluency, discourse, function,
style, cohesion, nonverbal communication and strategic options.
Macroskills
1.Appropriately accomplish communicative functions according to situations,
participants, and goals.
2.Use appropriate styles, registers, implicature, redundancies, pragmatic
conventions, conversation rules, floor-keeping and -yielding, interrupting, and
other sociolinguistic features in face-to-face conversations.
3.Convey links and connections between events and communicate such
relations as focal and peripheral ideas, events and feelings, new information
and given information, generalization and exemplification.
4.Convey facial features, body language, and other nonverbal cues along with
verbal language.
5.Develop and use a battery of speaking strategies, such as emphasizing key
words, rephrasing, providing a context for interpreting the meaning of words,
appealing for help, and accurately assessing how well your interlocutor is
understanding you.
Microskills:
1.Produce differences among English phonemes and allophonic variants.
2.Produce chunks of language of different lengths.
3.Produce English stress patterns, words in stressed and unstressed positions,
rhythmic structure, and intonation contours.
4.Produce reduced forms of words and phrases.
5.Use an adequate number of lexical units (words) to accomplish pragmatic
purposes.
6.Produce fluent speech at different rates of delivery.
7.Monitor one’s own oral production and use various devices-pauses, fillers,
self-corrections, backtracking- to enhance the clarity of the message.
8.Use grammatical word classes (nouns,verbs,etc.),systems (tense, agreement,
pluralization), word order, patterns, rules, and elliptical forms.
9.Produce speech in natural constituents: in appropriate phrases, pause
groups,breath groups, and sentence constituents.
10.Express a particular meaning in different grammatical forms.
11.Use cohesive devices in spoken discourse.
Three important issues as you set out to design tasks;
1.No speaking task is capable of isolating the single skill of oral
production. Concurrent involvement of the additional performance of aural
comprehension, and possibly reading, is usually necessary.
2.Eliciting the specific criterion you have designated for a task can be
tricky because beyond the word level, spoken language offers a number of
productive options to test-takers. Make sure your elicitation prompt achieves
its aims as closely as possible.
3.It is important to carefully specify scoring procedures for a response
so that ultimately you achieve as high a reliability index as possible.
interaction between speaking and listening or reading is unavoidable.
Interaction effect: impossibility of testing speaking in isolation
Elicitation techniques: to elicit specific criterion we expect from test takers.
Scoring: to achieve reliability
Designing Assessment Tasks: Imitative Speaking
Imitative tasks pay more attention to pronunciation, especially suprasegmentals, in
an attempt to help learners be more comprehensible.
Repetition tasks are acceptable as long as they are not allowed to occupy a dominant
role in an overall oral production assessment and as long as a negative washback effect is avoided.
In a simple repetition task, test-takers repeat the stimulus, whether it is a pair
of words, a sentence, or perhaps a question (to test for intonation production).
Word repetition task:
Scoring specifications must be clear in order to avoid reliability breakdowns. A common
form of scoring simply uses a 2- or 3-point system for each response.
Scoring scale for repetition tasks:
2 acceptable pronunciation
1 comprehensible, partially correct pronunciation
0 silence, seriously incorrect pronunciation
The longer the stretch of language, the more possibility for error and therefore
the more difficult it becomes to assign a point system to the text.
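The scoring scale above could be applied to a set of repetition items roughly as in the following sketch. The scale values and descriptors come from the notes; the function name total_repetition_score and the sample ratings are hypothetical, and summing item ratings into a raw total is only one assumed way of reporting the result.

```python
# Scale descriptors taken from the notes above.
REPETITION_SCALE = {
    2: "acceptable pronunciation",
    1: "comprehensible, partially correct pronunciation",
    0: "silence, seriously incorrect pronunciation",
}


def total_repetition_score(ratings):
    """Return (points earned, points possible) for a list of per-item ratings.

    Hypothetical helper: rejects any rating that is not on the scale, so that
    a rater cannot silently introduce an undefined score.
    """
    for rating in ratings:
        if rating not in REPETITION_SCALE:
            raise ValueError(f"rating {rating!r} is not on the 0-2 scale")
    return sum(ratings), max(REPETITION_SCALE) * len(ratings)


# Invented ratings for four repeated items.
earned, possible = total_repetition_score([2, 1, 0, 2])
print(f"{earned}/{possible}")
```

Restricting raters to the listed values is a small guard in the spirit of the point about avoiding reliability breakdowns.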
PHONEPASS TEST
Research on the PhonePass test has supported the construct validity of its repetition tasks not
just for phonological ability but also for discourse and overall oral production ability.
The PhonePass test elicits computer-assisted oral production over a telephone.
Test-takers read aloud, repeat sentences, say words, and answer questions.
Test-takers are directed to telephone a designated number and listen for directions.
The test has five sections.
Part A Testees read aloud selected sentences from among those printed on the test sheet.
Part B Testees repeat sentences dictated over the phone.
Part C Testees answer questions with a single word or a short phrase of 2 or 3 words.
Part D Testees hear 3 word groups in random order and link them in a correctly ordered
sentence.
Part E Testees have 30 seconds to talk about their opinion on some topic that is
dictated over the phone.
Scores are calculated by a computerized scoring template and reported back to the test-taker within minutes.
Pronunciation, reading fluency, repeat accuracy and fluency, and listening vocabulary are
the sub-skills scored.
The scoring procedure has been validated against human scoring with extraordinarily
high reliabilities and correlation statistics.
Designing Assessment Tasks: Intensive Speaking
Test-takers are prompted to produce short stretches of discourse (no more than a
sentence) through which they demonstrate linguistic ability at a specified level of language.
Intensive tasks may also be described as limited response tasks, or mechanical tasks, or
what classroom pedagogy would label as controlled responses.
Directed Response Tasks
Administrator elicits a particular grammatical form or a transformation of a sentence.
Such tasks are clearly mechanical and not communicative (possible drawbacks), but they
do require minimal processing of meaning in order to produce the correct
grammatical output (practical advantages).
Read-Aloud Tasks (to improve pronunciation and fluency)
These tasks extend beyond the sentence level up to a paragraph or two. They are easily administered by
selecting a passage that incorporates test specs and by recording the testee’s output; the
scoring is easy because all of the test-taker’s oral production is controlled.
Although reading aloud shows certain practical advantages (predictable output, practicality,
reliability in scoring), there are several drawbacks.
Reading aloud is somewhat inauthentic in that we seldom read anything aloud to
someone else in the real world, with the exception of a parent reading to a child.
Sentence / Dialogue Completion Tasks and Oral Questionnaires
(to produce omitted lines or words in a dialogue appropriately)
Test-takers read a dialogue in which one speaker’s lines have been omitted. Test-takers
are first given time to read through the dialogue to get its gist and to
think about appropriate lines to fill in.
An advantage of this technique lies in its moderate control of the output of the
test-taker (practical advantage).
One disadvantage of this technique is its reliance on literacy and an ability to
transfer easily from written to spoken English.(possible drawback)
Another disadvantage is the contrived, inauthentic nature of this task. (drawback)
Picture-Cued Tasks (to elicit oral production by using pictures)
One of the more popular ways to elicit oral language performance at both
intensive and extensive levels is a picture-cued stimulus that requires a
description from the test-taker.
Assessment of oral production may be stimulated through a more elaborate
picture. (practical advantages)
Maps are another visual stimulus that can be used to assess the language
forms needed to give directions and specify locations.(practical advantage)
Scoring may be problematic depending on the expected performance.
Scoring scale for intensive tasks
2 comprehensible; acceptable target form
1 comprehensible; partially correct target form
0 silence, or seriously incorrect target form
Translation (of Limited Stretches of Discourse) (to translate from the native
language into the target language)
The test-takers are given a native language word, phrase, or sentence and are
asked to translate it.
As an assessment procedure, the advantages of translation lie in its control of
the output of the test-taker, which of course means that scoring is more easily
specified.
Designing Assessment Tasks: Responsive Speaking
Assessment involves brief interactions with an interlocutor, differing from
intensive tasks in the increased creativity given to the test-taker and from
interactive tasks by the somewhat limited length of utterances.
Question and Answer
Question and answer tasks can consist of one or two questions from an
interviewer, or they can make up a portion of a whole battery of questions and
prompts in an oral interview.
The first question is intensive in its purpose; it is a display question intended
to elicit a predetermined correct response.
Questions at the responsive level tend to be genuine referential questions in
which the test-taker is given more opportunity to produce meaningful
language in response.
Test-takers respond with a few sentences at most.
Test-takers respond with questions.
A potentially tricky form of oral production assessment involves more than
one test-taker with an interviewer. With two students in an interview context,
both test-takers can ask questions of each other.
Giving Instructions and Directions
The technique is simple: the administrator poses the problem, and the test-taker
responds. Scoring is based primarily on comprehensibility and
secondarily on other specified grammatical or discourse categories.
Eliciting instructions or directions
Paraphrasing
Test-takers read or hear a number of sentences and produce a paraphrase of each sentence.
Advantages: such tasks elicit short stretches of output and perhaps tap into the testee’s
ability to practice conversation by reducing the output/input ratio.
If you use short paraphrasing tasks as an assessment procedure, it’s important
to pinpoint the objective of the task clearly. In this case, the integration of listening
and speaking is probably more at stake than simple oral production alone.
TEST OF SPOKEN ENGLISH (TSE)
The TSE is a 20-minute audio-taped test of oral language ability within an
academic or professional environment.
The scores are also used for selecting and certifying health professionals such
as physicians, nurses, pharmacists, physical therapists, and veterinarians.
The tasks on the TSE are designed to elicit oral production in various discourse
categories rather than in selected phonological, grammatical, or lexical targets.
Designing Assessment Tasks: Interactive Speaking
Tasks include long interactive discourse ( interview, role plays, discussions, games).
Interview
A test administrator and a test-taker sit down in a direct face-to-face exchange and
proceed through a protocol of questions and directives. The interview is then scored on
accuracy in pronunciation and/or grammar, vocabulary usage, fluency, pragmatic
appropriateness, task accomplishment, and even comprehension.
Placement interviews are designed to get a quick spoken sample from a student to verify
placement into a course.
Four stages:
1.Warm-up: (small talk) the interviewer directs mutual introductions, helps the testee
become comfortable, apprises the testee of the format, and allays anxieties. (No scoring)
2.Level check: the interviewer stimulates the testee to respond using expected or predicted
forms and functions. This stage gives the interviewer a picture of the testee’s extroversion,
readiness to speak, and confidence. Linguistic target criteria are scored in this phase.
3.Probe: Probe questions and prompts challenge the testee to go to the heights of their ability,
to extend beyond the limits of the interviewer’s expectation through difficult questions.
4.Wind-down: This phase is a short period of time during which the interviewer
encourages the testee to relax with easy questions and sets the testee at ease.
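The four stages above can be written down as a small interview protocol, sketched below, so that an interviewer's scoring sheet only asks for scores where the notes call for them. The class name InterviewStage and the helper scored_stage_names are hypothetical. The notes state that the warm-up is not scored and the level check is scored; they do not say whether the probe and wind-down are scored, so those are left as None (unknown) here rather than guessed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class InterviewStage:
    name: str
    purpose: str
    scored: Optional[bool]  # None means the notes do not say


# Order and purposes follow the four stages listed in the notes.
STAGES = (
    InterviewStage("Warm-up", "small talk, introductions, putting the testee at ease", False),
    InterviewStage("Level check", "elicit expected forms and functions", True),
    InterviewStage("Probe", "challenge the testee to the heights of their ability", None),
    InterviewStage("Wind-down", "easy questions to help the testee relax", None),
)


def scored_stage_names(stages=STAGES):
    """Names of the stages that the notes explicitly mark as scored."""
    return [stage.name for stage in stages if stage.scored is True]


print(scored_stage_names())
```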
The success of an oral interview will depend on:
*clearly specifying administrative procedures of the assessment (practicality)
*focusing the questions and probes on the purpose of the assessment (validity)
*appropriately eliciting an optimal amount and quality of oral production from the
test-taker (biased for best performance)
*creating a consistent, workable scoring system (reliability).
Role Play
Role playing is a popular pedagogical activity in communicative language
teaching classes.
Within constraints set forth by guidelines, it frees students to be somewhat
creative in their linguistic output.
While role play can be controlled or “guided” by the interviewer, this
technique takes test-takers beyond simple intensive and responsive levels to a
level of creativity and complexity that approaches real-world pragmatics.
Scoring presents the usual issues in any task that elicits somewhat
unpredictable responses from test-takers.
Discussions and Conversations
As formal assessment devices, discussions and conversations with and among
students are difficult to specify and even more difficult to score.
But as informal techniques to assess learners, they offer a level of authenticity
and spontaneity that other assessment techniques may not provide.
Assessment of the performance of participants through scores or checklists should
be carefully designed to suit the objectives of the observed discussion.
Discussion is an integrative task, and so it is also advisable to give some cognizance to
comprehension performance in evaluating learners.
Games
Among informal assessment devices are a variety of games that directly
involve language production.
Assessment games:
1.“Tinkertoy” game (Lego block)
2.Crossword puzzles
3.Information gap grids
4.City maps
ORAL PROFICIENCY INTERVIEW (OPI)
The best-known oral interview format is the Oral Proficiency Interview.
The OPI is the result of a historical progression of revisions under the auspices of
several agencies, including the Educational Testing Service and the American
Council on the Teaching of Foreign Languages (ACTFL).
The OPI is carefully designed to elicit pronunciation, fluency and integrative
ability, sociolinguistic and cultural knowledge, grammar, and vocabulary.
Performance is judged by the examiner to be at one of ten possible levels on
the ACTFL-designated proficiency guidelines for speaking: Superior;
Advanced-high, mid, low; Intermediate-high, mid,low; Novice-high, mid,low.
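Because the ten levels form an ordered scale, they can be represented as an ordered sequence so that two ratings can be compared, for example against a placement cut-off. The sketch below is only an illustration: the level names are those listed in the notes, arranged from lowest to highest, while the names ACTFL_SPEAKING_LEVELS and is_at_least, and the idea of a cut-off check, are hypothetical.

```python
# The ten levels named in the notes, ordered from lowest to highest.
ACTFL_SPEAKING_LEVELS = (
    "Novice-low", "Novice-mid", "Novice-high",
    "Intermediate-low", "Intermediate-mid", "Intermediate-high",
    "Advanced-low", "Advanced-mid", "Advanced-high",
    "Superior",
)


def is_at_least(level, required):
    """True if `level` is at or above `required` on the scale.

    Hypothetical helper; tuple.index raises ValueError for an unknown level name.
    """
    return ACTFL_SPEAKING_LEVELS.index(level) >= ACTFL_SPEAKING_LEVELS.index(required)


print(len(ACTFL_SPEAKING_LEVELS))
print(is_at_least("Advanced-low", "Intermediate-high"))
```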
Designing Assessments : Extensive Speaking
Extensive speaking involves complex, relatively lengthy stretches of discourse.
The tasks are variations on monologues, with minimal verbal interaction.
Oral Presentations
It would not be uncommon to be called on to present a report, a paper, a marketing
plan, a sales idea, a design of a new product, or a method.
Once again the rules for effective assessment must be invoked:
a- specify the criterion,
b-set appropriate tasks,
c- elicit optimal output,
d-establish practical, reliable scoring procedures.
Scoring is the key assessment challenge.
Picture-Cued Story-Telling
One technique for eliciting oral production is through visual stimuli: pictures, photographs,
diagrams, and charts.
Consider a picture or series of pictures as a stimulus for a longer story or description.
Criteria for scoring need to be clear about what it is you are hoping to assess.
Retelling a Story, News Event
In this type of task, test-takers hear or read a story or news event that they are
asked to retell.
The objectives in assigning such a task vary from listening comprehension of
the original to production of a number of oral discourse features (communicating
sequences and relationships of events, stress and emphasis patterns,
“expression” in the case of a dramatic story), fluency, and interaction with the
hearer.
Scoring should meet the intended criteria.
Translation (of Extended Prose)
Longer texts are presented for the test-taker to read in the native language (NL) and then translate into
English (dialogues, directions for assembly of a product, synopsis of a story or
play or movie, directions on how to find something on a map, and other genres).
The advantage of translation is in the control of the content, vocabulary, and to
some extent, the grammatical and discourse features.
The disadvantage is that translation of longer texts is a highly specialized skill
for which some individuals obtain post-baccalaureate degrees.
Criteria for scoring should take into account not only the purpose in stimulating a
translation but also the possibility of errors that are unrelated to oral production ability.
CHAPTER 8: ASSESSING READING
TYPES (GENRES) OF READING
Academic reading
Reference material, textbooks, theses
Essays, papers, test directions, editorials and opinion writing
Job-related reading
Messages, letters/emails, memos
Personal reading
Newspapers, magazines, letters, emails, cards, invitations, schedules (train,
bus)
Microskills :
Discriminate among the distinctive graphemes and orthographic
patterns of English.
Retain chunks of language of different lengths in short-term
memory.
Process writing at an efficient rate of speed to suit the purpose.
Recognize a core of words, and interpret word order patterns and
their significance.
Recognize grammatical word classes (nouns, verbs, etc.),
systems (tense, agreement, pluralization), patterns, rules, and
elliptical forms.
Recognize cohesive devices in written discourse and their role in
signaling the relationship between and among clauses.
Macroskills :
Recognize the rhetorical forms of written discourse and their significance for
interpretation.
Recognize the communicative functions of written text, according to form and
purpose
Infer context that is not explicit by using background knowledge
From described events, ideas, etc, infer links and connections between events,
deduce causes and effects, and detect such relations as main idea, supporting
idea, new information, generalization, and exemplification
Distinguish between literal and implied meanings.
Detect culturally specific references and interpret them in a context of the
appropriate cultural schemata.
Develop and use a battery of reading strategies, such as scanning and
skimming, detecting discourse markers, guessing the meaning of words from
the context, and activating schemata for interpretation of texts.
Some principal strategies for reading comprehension:
Identify your purpose in reading a text
Apply spelling rules and conventions for bottom-up decoding
Use lexical analysis to determine meaning
Guess at meaning when you aren’t certain
Skim the text for the gist and for main ideas
Scan the text for specific information (names, dates, key words)
Use silent reading techniques for rapid processing
Use marginal notes, outlines, charts, or semantic maps for understanding and
retaining information
Distinguish between literal and implied meanings
Capitalize on discourse markers to process relationships.
TYPES OF READING
Perceptive
Involves attending to the components of larger stretches of discourse: letters,
words, punctuation, and other graphemic symbols.
Selective
Is largely an artifact of assessment formats. Uses picture-cued tasks, matching,
true/false, multiple-choice, etc.
Interactive
The interactive task is to identify relevant features (lexical, symbolic, grammatical,
and discourse) within texts of moderately short length with the objective of
retaining the information that is processed.
Extensive
The purposes of assessment usually are to tap into a learner’s global
understanding of a text, as opposed to asking test-takers to “zoom in” on small
details. Top-down processing is assumed for most extensive tasks.
PERCEPTIVE READING
Reading Aloud
The test-taker reads the items aloud, one by one, in the presence of an administrator.
Written response
The test-taker reproduces the probe in writing. Evaluation of the test-taker’s response must be
carefully treated.
Multiple-choice
Choosing one of four or five possible answers.
Picture-Cued Items
Test-takers are shown a picture along with written text and are given one of a number of possible tasks
to perform.
SELECTIVE READING
The test designer focuses on formal aspects of language (lexical, grammatical,
and a few discourse features). This category includes what many incorrectly think
of as testing “vocabulary and grammar.”
Multiple-Choice (for Form-Focused Criteria)
These items may have little context, but might serve as a vocabulary or grammar check.
Matching Tasks
The most frequently appearing criterion in matching procedures is vocabulary.
Editing Tasks
Editing for grammatical or rhetorical errors is a widely used test method for assessing
linguistic competence in reading.
Picture-Cued Tasks
Read a sentence or passage and choose the one of four pictures that is described.
Read a series of sentences or definitions, each describing a labeled part of a
picture or diagram.
Gap-Filling Tasks
The task is to create completion items where test-takers read part of a sentence and then
complete it by writing a phrase.
INTERACTIVE READING
Cloze Tasks
Draw on the ability to fill in gaps in an incomplete image (visual, auditory, or cognitive) and to supply (from
background schemata) omitted details.
Impromptu Reading Plus Comprehension Questions
Reading is rarely assessed without some component involving impromptu reading and responding
to questions.
Short-Answer Tasks
An alternative to multiple-choice questions following reading passages is the age-old short-answer format.
Editing (Longer Texts)
The technique has been applied successfully to longer passages of 200 to 300 words.
Advantages: 1st, authenticity; 2nd, the task simulates proofreading one’s own essay; 3rd, it can be connected to a
specific curriculum.
Scanning
Strategy used by all readers to find relevant information in a text.
Ordering Tasks
Variations on this can serve as an assessment of overall global understanding of a story
and of the cohesive devices that signal the order of events or ideas.
Information Transfer: Reading Charts, Maps, Graphs, Diagrams
These media presuppose the reader’s schemata for interpreting them and are accompanied by oral
or written discourse to convey, clarify, question, argue, and debate, among other linguistic
functions.
EXTENSIVE READING
Involves longer texts than we have been dealing with up to this point.
Skimming Tasks
The process of rapid coverage of reading matter to determine its gist or main idea.
Summarizing and Responding
Test-takers write a summary of the text and then respond to it.
Note Taking and Outlining
A teacher, perhaps in one-on-one conferences with students, can use student
notes/ outlines as indicators of the presence or absence of effective reading
strategies, and thereby point the learners in positive directions.
CHAPTER 9: ASSESSING WRITING
GENRES OF WRITING
Academic Writing
papers and general subject reports, essays, compositions
academically focused journals, short-answer test responses
technical reports (e.g., lab reports), theses, dissertations
Job-Related Writing
messages, letters/emails, memos (e.g., interoffice), reports (e.g., job evaluations, project
reports)
schedules, labels, signs, advertisements, announcements, manuals
Personal Writing
letters, emails, greeting cards, invitations, messages, notes, calendar entries,
shopping lists, reminders, financial documents (e.g., checks, tax forms, loan applications)
forms, questionnaires, medical reports, immigration documents
diaries, personal journals, fiction (e.g., short stories, poetry)
MICROSKILLS AND MACROSKILLS OF WRITING
Micro-skills
Produce graphemes and orthographic patterns of English.
Produce writing at an efficient rate of speed to suit the purpose.
Produce an acceptable core of words and use appropriate word order patterns.
Use acceptable grammatical systems (tense, agreement), patterns and rules.
Express a particular meaning in different grammatical forms.
Use cohesive devices in written discourse.
Macro-skills
Use the rhetorical forms and conventions of written discourse.
Appropriately accomplish the communicative functions of written texts
according to form and purpose.
Convey links and connections between events, communicate such relations as
main idea, supporting idea, new information, generalization, exemplification.
Distinguish between literal and implied meanings when writing.
Correctly convey culturally specific references in the context of the written text.
Develop and use writing strategies, such as accurately assessing the audience’s interpretation,
using prewriting devices, writing with fluency in first drafts, using paraphrases and
synonyms, soliciting feedback, and using feedback for revising and editing.
Types of Writing Performance
Imitative Writing
Assess ability to spell correctly & perceive phoneme/grapheme
correspondences
Form rather than meaning (letters, words, punctuation, brief sentences,
mechanics of writing)
Intensive Writing
To produce appropriate vocabulary within a context and correct grammatical
features in a sentence
More form than meaning but meaning and context are of some importance
(collocations, idioms, correctness, appropriateness)
Responsive Writing
Connect sentences & create a logically connected sequence of 2 or 3 paragraphs
Discourse conventions with strong emphasis on context and meaning (limited
discourse level, connecting sentences logically) mostly 2-3 paragraphs
Extensive Writing
To manage all the processes of writing, for all purposes, in order to produce longer texts
(essays, papers, theses)
Processes of writing (strategies of writing)
IMITATIVE WRITING
Tasks in Hand Writing Letters, Words, and Punctuation
Copying ( bit __ / bet __ / bat __ )
Copy the words given in the spaces provided
Listening cloze selection tasks
Write the missing words in blanks by selecting according to what they hear
Combination of dictation with a written text
Purpose=to give practice in writing
Picture-cued tasks
Write the word the picture represents
Make sure that pictures are not ambiguous
Form completion tasks
Complete the blanks in simple forms, e.g., name, address, phone number
Make sure that students have practiced filling out such forms
Converting numbers/abbreviations to words
Either write out the numbers or convert abbreviations to words
More reading than writing, so specify the criterion
Low authenticity; a reliable method to stimulate handwritten English
Spelling Tasks and Detecting Phoneme-Grapheme Correspondences
Spelling Tests
Write words that are dictated, Choose words that have been heard or spoken
Scoring=correct spelling
Picture-Cued Tasks
Write words that are displayed by pictures, e.g., boot-book, read-reed, bit-bite
Choose items according to your test purpose
Multiple Choice Techniques
Choose and write the word with the correct spelling to fit the given sentences
Items are better with a writing component; adding homonyms makes
the task more challenging
Clashes with reading, so be careful
Purpose: to assess the ability to spell words correctly
and to process phoneme-grapheme correspondences
Matching Phonetic Symbols
Write the correctly spelled word alphabetically
Since Latin alphabet and Phonetic alphabet symbols are different from each
other, this works well.
INTENSIVE (CONTROLLED) WRITING
Dictation
Writing what is heard
Listening & correct spelling & punctuation
Dicto-comp
Re-writing the paragraph in one's own words after hearing it 2 or 3 times
Listening & vocabulary & spelling & punctuation
Grammatical transformation
Making grammatical transformations by changing or combining forms of language
Grammatical competence; easy to administer, practical & reliable
No meaningful value; even with context, no authenticity
Picture-cued
1. Short sentences
2. Picture description
3. Picture sequence description
Reading non-verbal means & grammar & spelling & vocabulary
Reading-Writing integration, Scoring problematic when pictures are not clear
Vocabulary assessment
Either defining or using a word in a sentence, assessing collocations and
derived morphology
Vocabulary & grammar; less authentic: using a word in a sentence?
Ordering
Ordering / re-ordering a scrambled set of words
If verbal = intensive speaking; if written = intensive writing
Reading and grammar
Appealing to those who like word games and puzzles; inauthentic
Needs practicing in class; both reading and writing
Short answer and sentence completion
Answering or asking questions for the given statements / writing 2 or 3
sentences using the given prompts
Reading& Writing, Scoring on a 2-1-0 scale is appropriate
1. AUTHENTICITY (face and content validity)
Teacher becomes less instructor, more coach or facilitator
Assessment: formative, so positive (+) washback matters more than practicality and reliability
2. SCORING
Both how Ss string words together and what they say
3. TIME
No time constraints: freedom for drafts before the finished product
Questioned issue: is the timed impromptu format a valid method of writing assessment?
RESPONSIVE AND EXTENSIVE WRITING
1. Paraphrasing
Its importance: To say something in one's own words, to avoid plagiarism, and to
offer some variety in expression
Test takers' task: Paraphrasing sentences or paragraphs with purposes in mind
Assessment type: Informal and formative, positive washback
Scoring: Conveying a similar message is primary; discourse, grammar, and
vocabulary are secondary
2. Guided question and answer
Its importance: To provide benefits of guiding test takers without dictating
the form of the output
Test takers' task: Writing a text by answering a series of guiding
questions
Assessment type: Informal and formative
Scoring: Either on a holistic scale or an analytical one
3. Paragraph Construction Tasks
Topic Sentence Writing
The presence or absence of a topic sentence; the effectiveness of the topic sentence
Topic Development in a Paragraph
The clarity of expression; the logic of the sequence; the unity and cohesion; the
overall effectiveness
Multi Paragraph Essay
Addressing topic / main idea / purpose; organizing supporting ideas; using
appropriate details for supporting ideas; facility and fluency in language use;
demonstrating syntactic variety
4. Strategic Options
Free writing, outlining, drafting, and revising are strategies which help writers
create effective texts.
Writers need to know their subject, purpose, and audience in order to write.
Developing main and supporting ideas is the purpose only of essay writing.
Some tasks commonly addressed in academic writing courses are
compare/contrast, problem/solution, pros/cons, and cause and effect.
Assessment of tasks in an academic writing course could be formative & informal.
Knowing the conventions & opportunities of a genre will help to write effectively.
Every genre of writing requires different conventions.
Test of Written English (TWE®)
Time allocated: 30 minutes time limit/ no preparation ahead of time
Prepared by: a panel of experts
Scoring: a mean score of 2 independent ratings based on a holistic scoring
Number of raters: 2 trained raters working independently
Limitations: inauthentic / not real life / puts test takers into an artificially
time-constrained context / inappropriate for instructional purposes
Strengths: serves for administrative purposes
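The scoring rule noted above (a mean score of 2 independent holistic ratings) can be sketched in a few lines. This is only an illustration: the function name and the 1-to-6 scale bounds are assumptions made for the example, not details given in these notes.

```python
def mean_holistic_score(rating_1, rating_2, low=1, high=6):
    """Average two independent holistic ratings.

    The low/high bounds are an assumed scale, used only for illustration.
    """
    for rating in (rating_1, rating_2):
        if not low <= rating <= high:
            raise ValueError(f"rating {rating} is outside {low}-{high}")
    return (rating_1 + rating_2) / 2


# Two trained raters score the same essay without seeing each other's rating.
print(mean_holistic_score(4, 5))
```

Averaging two independent ratings is one simple way to reduce the effect of a single rater's judgment on the reported score.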
Follow 6 steps to be successful
Carefully identify the topic.
Plan your supporting ideas.
In introductory paragraph, restate topic and state organizational plan of essay.
Write effective supporting paragraphs (show transitions, include a topic
sentence, specify details).
Restate your position and summarize in the concluding paragraph.
Edit sentence structure and rhetorical expression.
SCORING METHODS FOR RESPONSIVE AND EXTENSIVE WRITING
Holistic Scoring
Definition: Assigning a single score to represent general overall assessment
Purpose of use: Appropriate for administrative purposes / Admission into
an institution or placement in a course
Advantage(s): Quick scoring
High inter-rater reliability, Easily interpreted scores by lay persons
Emphasizes strengths of written piece Applicable to many different disciplines
Disadvantages
No washback potential
Masking the differences across the sub skills within each score
Not applicable to all genres
Needs trained evaluators to use the scale accurately
Primary Trait Scoring
Assigning a score based on the effectiveness of the text's achieving its purposes
(accuracy, clarity, description, expression of opinion)
Purpose of use
To focus on the principal function of the text
Advantage(s)
Practical
Allows both the writer and scorer to focus on the function / purpose
Disadvantage(s)
Scores only the principal function; other features of the text go unrated
Analytic Scoring
Definition
Breaking the text down into subcategories and giving a separate rating for each
Purpose of use
Classroom instructional purposes
Advantage(s)
More washback into the further stages of learning; diagnoses both the
weaknesses and strengths of writing
Disadvantage(s)
Lower practicality since scorers have to attend to details with each sub-score.
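Analytic scoring, which rates separate subcategories of a text, lends itself to a small worked example. The sketch below is hypothetical: the category names and the 0-to-20 range per category are assumptions chosen for illustration, not a rubric from these notes.

```python
def analytic_report(subscores):
    """Summarize an analytic rating: total plus strongest and weakest category."""
    total = sum(subscores.values())
    strongest = max(subscores, key=subscores.get)
    weakest = min(subscores, key=subscores.get)
    return {"total": total, "strongest": strongest, "weakest": weakest}


# Hypothetical categories, each rated 0-20 by the scorer.
essay = {
    "organization": 16,
    "content": 18,
    "grammar": 11,
    "mechanics": 14,
    "style": 13,
}
report = analytic_report(essay)
print(report)
```

The per-category view is what gives the diagnostic value; the cost is that the scorer must assign every sub-score, which is where the lower practicality comes from.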
BEYOND SCORING: RESPONDING TO EXTENSIVE WRITING
Here, the writer is talking about process approach to writing and how the
assessment takes place in this approach.
Many educators advocate process approach to writing.
This pays attention to various stages that any piece of writing goes through.
By spending time with learners on pre-writing phases, editing, re-drafting and
finally producing a finished version of their work, a process approach aims to
get to the heart of the various skills that most writers employ.
Types of responding: Self, peer, teacher responding
Assessment type: Informal / formative
Washback: Potential positive washback
Role of the assessor: Guide / facilitator
GUIDELINES FOR ASSESSING STAGES OF WRITTEN
COMPOSITION
Initial stages
Focus: Meaning & Main idea & organization
Ignore: Grammatical and lexical errors / minor errors
Indicate: Global errors but not corrected
Later stages
Focus: Fine tuning toward a final version
Ignore:
Indicate: Problems related to cohesion/documentation/citation
CHAPTER 10:
BEYOND TESTS: ALTERNATIVES IN ASSESSMENT
Characteristics of Alternative Assessment
require students to perform, create, produce, or do something;
use real-world contexts or simulations;
are non-intrusive in that they extend the day-to-day classroom activities;
allow students to be assessed on what they normally do in class every day;
use tasks that represent meaningful instructional activities;
focus on processes as well as products;
tap into higher-level thinking and problem-solving skills;
provide information about both the strengths and weaknesses of students;
are multi-culturally sensitive when properly administered;
ensure that people, not machines, do the scoring, using human judgment;
encourage open disclosure of standards and rating criteria; and
call upon teachers to perform new instructional and assessment roles.
DILEMMA OF MAXIMIZING BOTH PRACTICALITY AND WASHBACK
LARGE SCALE STANDARDIZED TESTS
• one-shot performances
• timed
• multiple-choice
• decontextualized
• norm-referenced
• foster extrinsic motivation
• highly practical, reliable instruments
• minimize time and money
• much practicality or reliability
• cannot offer much washback or authenticity
ALTERNATIVE ASSESSMENT
• open-ended in their time orientation and format
• contextualized to a curriculum
• referenced to the criteria (objectives) of that curriculum
• likely to build intrinsic motivation
• considerable time and effort
• offer much authenticity and washback
The dilemma of maximizing both practicality and washback
The principal purpose of this chapter is to examine some of the alternatives in
assessment that are markedly different from formal tests.
Formal tests, especially large-scale standardized tests, tend to be one-shot performances that are
timed, multiple-choice, decontextualized, norm-referenced, and that foster extrinsic
motivation.
On the other hand, tasks like portfolios, journals,
conferences and interviews, and self-assessment are
open-ended in their time orientation and format,
contextualized to a curriculum,
referenced to the criteria (objectives) of that curriculum, and
likely to build intrinsic motivation.
PORTFOLIOS
One of the most popular alternatives in assessment, especially within a
framework of communicative language teaching, is portfolio development.
Portfolios include materials such as
Essays and compositions in draft and final forms;
Reports, project outlines;
Poetry and creative prose;
Artwork, photos, newspaper or magazine clippings;
Audio and/or video recordings of presentations, demonstrations, etc.;
Journals, diaries, and other personal reflections;
Tests, test scores, and written homework exercises;
Notes on lectures; and
Self- and peer-assessments: comments and checklists.
Successful portfolio development will depend on following a number of steps
and guidelines.
1. State objectives clearly.
2. Give guidelines on what materials to include.
3. Communicate assessment criteria to students.
4. Designate time within the curriculum for portfolio development.
5. Establish periodic schedules for review and conferencing.
6. Designate an accessible place to keep portfolios.
7. Provide positive washback when giving final assessments.
JOURNALS
A journal is a log or account of one’s thoughts, feelings, reactions, assessment,
ideas, or progress toward goals, usually written with little attention to
structure, form, or correctness.
Categories or purposes in journal writing, such as the following:
a. Language learning logs
b. Grammar journals
c. Responses to readings
d. Strategies-based learning logs
e. Self-assessment reflections
f. Diaries of attitudes, feelings, and other affective factors
g. Acculturation logs
CONFERENCES AND INTERVIEWS
Conferences
Conferences are not limited to drafts of written work; they can also address portfolios and
journals.
Conferences must assume that the teacher plays the role of a facilitator and
guide, not of an administrator of a formal assessment.
Interviews
An interview may have one or more of several possible goals, in which the teacher
assesses the student’s oral production
ascertains a student’s needs before designing a course or curriculum
seeks to discover a student’s learning style and preferences
One overriding principle of effective interviewing centers on the nature of the
questions that will be asked.
OBSERVATIONS
In order to carry out classroom observation, it is of course important to take
the following steps:
1. Determine the specific objectives of the observation.
2. Decide how many students will be observed at one time
3. Set up the logistics for making unnoticed observations
4. Design a system for recording observed performances
5. Plan how many observations you will make
SELF AND PEER ASSESSMENT
Five categories of self and peer assessment:
1. Assessment of performance. In this category, a student typically
monitors himself or herself in either oral or written production and renders some
kind of evaluation of performance.
2. Indirect assessment of performance. Indirect assessment targets larger
slices of time with a view to rendering an evaluation of general ability, as
opposed to one specific, relatively time-constrained performance.
3. Metacognitive assessment for setting goals. Some kinds of evaluation are
more strategic in nature, with the purpose not just of viewing past
performance or competence but of setting goals and maintaining an eye on the
process of their pursuit.
4. Socioaffective assessment. Yet another type of self and peer
assessment comes in the form of methods of examining affective factors in
learning. Such assessment is quite different from looking at and planning
linguistic aspects of acquisition.
5. Student-generated tests. A final type of assessment that is not usually
classified strictly as self or peer assessment is the technique of engaging
students in the process of constructing tests themselves.
GUIDELINES FOR SELF AND PEER ASSESSMENT
Self and peer assessment are among the best possible formative types of
assessment and possibly the most rewarding.
Four guidelines will help teachers bring this intrinsically motivating task into
the classroom successfully.
1. Tell students the purpose of assessment
2. Define the task clearly
3. Encourage impartial evaluation of performance or ability
4. Ensure beneficial washback through follow up tasks
A TAXONOMY OF SELF AND PEER ASSESSMENT TASKS
It is helpful to consider a variety of tasks within each of the four skills
(listening, speaking, reading, writing).
An evaluation of self and peer assessment according to our classic principles of
assessment yields a pattern that is quite consistent with other alternatives in
assessment that have been analyzed in this chapter.
Practicality can achieve a moderate level with such procedures as checklists
and questionnaires.
CHAPTER 11:
GRADING AND STUDENT EVALUATION
GUIDELINES FOR SELECTING GRADING CRITERIA
It is essential for all components of grading to be consistent with an
institutional philosophy and/or regulations (see below for a further discussion
of this topic).
All of the components of a final grade need to be explicitly stated in writing to
students at the beginning of a term of study, with a designation of percentages
or weighting figures for each component.
If your grading system includes items (d) through (g) in the questionnaire
above (improvement, behavior, effort, motivation), it is important for you to
recognize their subjectivity. But this should not give you an excuse to avoid
converting such factors into observable and measurable results.
Finally, consider allocating relatively small weights to items (c) through (h) so
that a grade primarily reflects achievement. A designation of 5 percent to 10
percent of a grade to such factors will not mask strong achievement in a
course.
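The weighting advice above can be made concrete with a small calculation. The component names, scores, and percentages in this sketch are hypothetical; the only point taken from the text is that non-achievement factors receive a small share of the grade (here 10 percent in total) so that the grade primarily reflects achievement.

```python
def weighted_grade(scores, weights):
    """Combine component scores (0-100) using weights that must sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * weights[name] for name in weights)


# Hypothetical weighting: 90% achievement, 10% non-achievement factors.
weights = {"tests": 0.50, "written_work": 0.40, "effort": 0.05, "improvement": 0.05}
scores = {"tests": 90, "written_work": 85, "effort": 60, "improvement": 70}

final = round(weighted_grade(scores, weights), 1)
print(final)
```

With these made-up numbers, low marks for effort and improvement move the final figure only slightly, which is the behavior the guideline aims for.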
CALCULATING GRADES: ABSOLUTE AND RELATIVE GRADING
ABSOLUTE GRADING:
If you pre-specify standards of performance on a numerical
point system, you are using an absolute system of grading.
For example, having established points for a midterm test, points
for a final exam, and points accumulated for the semester, you
might adhere to the specifications in the table below.
The key to making an absolute grading system work is to be
painstakingly clear on competencies and objectives, and on tests,
tasks, and other assessment techniques that will figure into the
formula for assigning a grade.
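An absolute system can be sketched as a fixed lookup from points to letters. The cut-off percentages below are hypothetical placeholders, not the specifications of any particular course; the point is only that the standards are fixed before anyone is graded.

```python
# Hypothetical cut-offs (minimum percent, letter); a real course would publish its own.
CUTOFFS = [(90, "A"), (80, "B"), (70, "C"), (60, "D")]


def absolute_grade(points, max_points):
    """Map accumulated points to a letter using pre-specified standards."""
    percent = 100 * points / max_points
    for minimum, letter in CUTOFFS:
        if percent >= minimum:
            return letter
    return "F"


# A student with 455 of 500 possible points for the semester.
print(absolute_grade(455, 500))
```

Because the cut-offs do not depend on how classmates perform, every student could in principle earn the same grade.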
RELATIVE GRADING:
Relative grading is more commonly used than absolute grading. It has the
advantage of allowing your own interpretation and of adjusting
for unpredicted ease or difficulty of a test.
Relative grading is usually accomplished by ranking students in
order of performance (percentile ranks) and assigning cut-off
points for grades.
An older, relatively uncommon method of relative grading is
what has been called grading "on the curve," a term that comes
from the normal bell curve of normative data plotted on a graph.
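Relative grading by rank can be sketched as follows. The band proportions are hypothetical, and ties are broken simply by sort order, which a real grading policy would need to handle more carefully.

```python
# Hypothetical bands: (letter, cumulative share of the class counted from the top).
BANDS = [("A", 0.20), ("B", 0.50), ("C", 0.80), ("D", 0.95), ("F", 1.00)]


def relative_grades(scores):
    """Rank students by score and assign letters by position in the ranking."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    total = len(ranked)
    grades = {}
    for position, student in enumerate(ranked):
        share_above = position / total  # fraction of the class ranked higher
        for letter, cutoff in BANDS:
            if share_above < cutoff:
                grades[student] = letter
                break
    return grades


class_scores = {"s1": 95, "s2": 88, "s3": 70, "s4": 65, "s5": 40}
grades = relative_grades(class_scores)
print(grades)
```

Unlike the absolute approach, the same raw score can earn different letters in different classes, since only the position in the ranking matters.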
TEACHERS’ PERCEPTIONS OF APPROPRIATE GRADE
DISTRIBUTIONS
Most teachers bring to a test or a course evaluation an
interpretation of estimated appropriate distributions, follow that
interpretation, and make minor adjustments to compensate for
such matters as unexpected difficulty. What is surprising,
however, is that teachers' preconceived notions of their own
standards for grading often do not match their actual practice.
INSTITUTIONAL EXPECTATIONS AND CONSTRAINTS
For many institutions letter grading is foreign but point systems
(100 pts or percentages) are common.
Some institutions refuse to employ either a letter grade or a
numerical system of evaluation and instead offer narrative
evaluations of Ss.
This preference for more individualized evaluations is often a
reaction to overgeneralization of letter and numerical grading.
CROSS-CULTURAL FACTORS AND THE QUESTION OF DIFFICULTY
A number of variables bear on the issue. In many cultures,
it is unheard of to ask a student to self-assess performance;
Ts assign a grade, and nobody questions the teacher's criteria;
the measure of a good teacher is the ability to design a test that is so
difficult that no student could achieve a perfect score;
the fact that students fall short of such marks of perfection is a
demonstration of the teacher's superior knowledge;
as a corollary, grades of A are reserved for a highly select few,
and students are delighted with Bs;
one single final examination is the accepted determinant of a
student's entire course grade; and
the notion of a teacher's preparing students to do their best on a
test is an educational contradiction.
In some cultures a "hard" test is a good test, but in others, a good
test results in a distribution like the one in the bar graph for a
"great bunch": a large proportion of As and Bs, a few Cs, and
maybe a D or an F for the "deadbeats" in the class.
How do you gauge such difficulty as you design a classroom test
that has not had the luxury of piloting and pre-testing?
The answer is complex. It is usually a combination of a number of
possible factors:
experience as a teacher (with appropriate intuition)
adeptness at designing feasible tasks
special care in framing items that are clear and relevant
mirroring in-class tasks that students have mastered
variation of tasks on the test itself
reference to prior tests in the same course
a thorough review and preparation for the test
knowledge of your students' collective abilities
a little bit of luck
WHAT DO LETTER GRADES “MEAN”?
Typically, institutional manuals for teachers and students will list
the following descriptors of letter grades:
A: excellent
B: good
C: adequate
D: inadequate/unsatisfactory
F: failing/unacceptable
The overgeneralization implicit in letter grading underscores the
meaninglessness of the adjectives typically cited as descriptors of
those letters. Is there a solution to their gate-keeping role?
1. Every teacher who uses letter grades or a percentage score to
provide an evaluation, whether on a summative, end-of-course
assessment or on a formal assessment procedure, should
a. use a carefully constructed system of grading,
b. assign grades on the basis of explicitly stated criteria, and
c. base criteria on objectives of course or assessment procedure(s).
2. Educators everywhere must work to persuade the gatekeepers
of the world that letter/numerical evaluations are simply one
side of a complex representation of a student's ability.
Alternatives to letter grading are essential considerations.
ALTERNATIVES TO LETTER GRADING
For assessment of a test, paper, report, extra-class exercise, or other formal,
scored task, the primary objective of which is to offer formative feedback, the
possibilities beyond a simple number or letter include
a teacher's marginal and/or end comments,
a teacher's written reaction to a student's self-assessment of performance,
a teacher's review of the test in the next class period,
peer-assessment of performance,
self-assessment of performance, and
a teacher's conference with the student.
For summative assessment of a student at the end of a course, those same additional assessments can be made, perhaps in modified forms:
a teacher's marginal and/or end of exam/paper/project comments
T's summative written evaluative remarks on a journal, portfolio, or other
tangible product
T's written reaction to a student's self assessment of performance in a course
a completed summative checklist of competencies, with comments
narrative evaluations of general performance on key objectives
a teacher's conference with the student
A more detailed look is now appropriate for a few of the summative
alternatives to grading, particularly self-assessment, narrative evaluations,
checklists, and conferences.
1. Self-assessment.
Self-assessment of end-of-course attainment of objectives is recommended
through the use of the following:
Checklists
a guided journal entry that directs the student to reflect on the content and
linguistic objectives
an essay that self-assesses, a teacher-student conference
2. Narrative evaluations.
In protest against the widespread use of letter grades as exclusive indicators of
achievement, a number of institutions have at one time or another required
narrative evaluations of students. In some instances those narratives replaced
grades, and in others they supplemented them. (pg. 296-297)
Advantages: individualization, evaluation of multiple objectives of a course,
face validity, washback potential.
Disadvantages: narratives are not easily quantified by admissions and transcript
evaluation offices; they are impractical (time-consuming); Ss may pay little
attention to them; Ts may succumb to formulaic narratives that follow a template.
3. Checklist evaluations.
To compensate for the time-consuming impracticality of
narrative evaluation, some programs opt for a compromise: a
checklist with brief comments from the teacher ideally followed
by a conference and/or a response from the student.
Advantages: increased practicality, reliability, washback.
Teacher time is minimized; uniform measures are applied across
all students; some open-ended comments from the teacher are
available; and the student responds with his or her own goals (in
light of the results of the checklist and teacher comments).
!!! When the checklist format is accompanied, as in this case, by
letter grades as well, virtually none of the disadvantages of
narrative evaluations remain, with only a small chance that some
individualization may be slightly reduced.
4. Conferences.
Perhaps enough has been said about the virtues of conferencing.
You already know that the impracticality of scheduling sessions
with students is offset by the washback benefits of those sessions.
SOME PRINCIPLES AND GUIDELINES FOR GRADING AND EVALUATION
You should now understand that
grading is not necessarily based on a universally accepted scale,
grading is sometimes subjective and context-dependent,
grading of tests is often done on the "curve,"
grades reflect a teacher's philosophy of grading,
grades reflect an institutional philosophy of grading,
cross-cultural variation in grading philosophies needs to be understood,
grades often conform, by design, to a teacher's expected
distribution of students across a continuum,
tests do not always yield an expected level of difficulty,
letter grades may not "mean" the same thing to all people, and
alternatives to letter grades or numerical scores are highly
desirable as additional indicators of achievement.
With those characteristics of grading and evaluation in mind, the
following principled guidelines should help you be an effective
grader and evaluator of student performance:
Develop an informed, comprehensive personal philosophy of
grading that is consistent with your philosophy of teaching and
evaluation.
Ascertain an institution’s philosophy of grading and, unless
otherwise negotiated, conform to that philosophy (so that you
are not out of step with others).
Design tests that conform to appropriate institutional and
cultural expectations of the difficulty that Ss should experience.
Select appropriate criteria for grading and their relative
weighting in calculating grades.
Communicate criteria for grading to Ss at the beginning of the
course and at subsequent grading periods (mid-term, final).
Triangulate letter grade evaluations with alternatives that are
more formative and that give more washback.
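The guideline on selecting criteria and their relative weighting can be illustrated with a small worked example. The sketch below is only one possible way to combine criteria: the criterion names (tests, written_work, participation), the weights, and the function name weighted_grade are hypothetical and do not come from the source. Each T would substitute the criteria and weights announced to Ss at the start of the course.

```python
def weighted_grade(scores, weights):
    """Combine per-criterion scores (0-100) into one weighted percentage.

    Both arguments are dicts keyed by criterion name. The weights do not
    have to sum to 100; they are normalized by their total.
    """
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same criteria")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    weighted_sum = sum(scores[name] * weights[name] for name in weights)
    return weighted_sum / total_weight


# Hypothetical criteria and weights (assumptions for illustration only).
weights = {"tests": 50, "written_work": 30, "participation": 20}
scores = {"tests": 80, "written_work": 90, "participation": 70}

final = weighted_grade(scores, weights)
print(final)  # 81.0
```

Here the final percentage is (80 x 50 + 90 x 30 + 70 x 20) / 100 = 81. Stating the weights explicitly in this way makes it easier to communicate the criteria to Ss and to apply them uniformly.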