Foreign Language Assessment

Compiled by Po-Sen Liao

Introduction
Why do we test at all? Is it the only way to get students to learn? To test out of
habit? To test as a punitive measure? A discouraging barrier for students to face or a
hurdle they have to jump at prescribed points? Can tests be a positive challenge?
(questionable and promising testing procedures, see p. 4)
The assessment tasks should be nonthreatening and developmental in nature,
allowing the learners ample opportunities to demonstrate what they know and do not
know, and providing useful feedback both for the learners and for their teachers.
Advances in language testing over the last decade have included:
1. The development of a theoretical view that sees language ability as
multi-componential and recognizes the influence of test-method and
test-taker characteristics on test performance.
2. The use of more sophisticated measurement and statistical tools.
3. The development of communicative language tests commensurate with
the increased teaching of communicative skills.
Terminology
1. Assessment:
A variety of ways of collecting information on a learner's language ability or
achievement. An umbrella term that encompasses tests, observations, and project work.
Proficiency assessment: The assessment of general language abilities acquired by
the learner independent of a course of study (e.g. TOEFL, TOEIC).
Achievement assessment: To establish what a student has learned in relation to a
particular course (e.g. tests carried out by the teacher and based on the specific
content of the course). It determines whether course objectives have been attained by
the end of instruction.
Diagnostic assessment: It is designed to diagnose a particular aspect of a
language. A diagnostic test in pronunciation might have the purpose of determining
which phonological features of English are difficult for learners and should therefore
become a part of a curriculum. Such assessment may offer a checklist of features for
the teacher to use in pinpointing difficulties.
Placement assessment: Its purpose is to place a student into an appropriate level
or section of a language curriculum. Certain proficiency tests and diagnostic tests can
act in the role of placement assessments.
Aptitude assessment: It is designed to measure a person’s capacity or general
ability to learn a foreign language and to be successful. It is considered to be
independent of a particular language. This test usually requires learners to perform
such tasks as memorizing numbers and vocabulary, listening to foreign words, and
detecting spelling clues and grammatical patterns.
Formative assessment: It is often closely related to the instructional program and
may take the form of quizzes and chapter tests. Its results are often used in a
diagnostic manner by teachers to modify instruction.
Summative assessment: The type of assessment that occurs at the end of a period
of study. It goes beyond the material of specific lessons and focuses on evaluating
general course outcomes.
Norm-referenced assessment: to evaluate ability against a standard or the normative
performance of a group. It provides a broad indication of relative standing (e.g. a
score on an exam reports a learner's standing compared to other students).
Criterion-referenced assessment: to assess achievement or performance against
a cut-off score that is determined as a reflection of mastery or attainment of specified
objectives. This approach is used to see whether a respondent has met certain
instructional objectives or criteria. Focus is on ability to perform tasks rather than
group ranking. (e.g. a learner can give basic personal information).
2. Evaluation:
Refers to the overall language program and not just to what individual students
have learned.
Assessment of an individual student's progress or achievement is an important
component of evaluation. Evaluation goes beyond student achievement to
consider all aspects of teaching and learning, and to look at how educational
decisions can be informed by the results of alternative forms of assessment.
How to evaluate the assessment instrument?
1. Validity: A test is valid when it measures effectively what it is intended to
measure. A test must be reliable in order to be valid.
Types of validity:
A. Content validity: Checking all test items to make certain that they
correspond to the instructional objectives of the course.
B. Criterion-related validity: Determining how closely learners' performance
on a given new test parallels their performance on another instrument, or
criterion. If the instrument to be validated is correlated with another criterion
instrument at the same time, then this is referred to as concurrent validity.
If the correlation takes place at some future time, then it is referred to as
predictive validity.
C. Construct validity: It refers to the degree to which scores on an assessment
instrument permit inferences about an underlying trait. It examines whether the
instrument is a true reflection of the theory of the trait being measured.
D. System validity: the effects of instructional changes brought about by the
introduction of the test into an educational system. (p.41)
Washback effect: how assessment instruments affect educational practices
and beliefs.
E. Face validity/perceived validity: whether the test looks as if it is measuring
what it is supposed to measure.
2. Reliability: The degree to which it can be trusted to produce the same result upon
repeated administrations. A language test must produce consistent results and give
consistent information.
Types of reliability:
A. Test-retest reliability: the degree of consistency of scores for the same test
given to the same students on different occasions.
B. Alternate-forms reliability: the consistency of scores for the same students
on different occasions on different but comparable forms of the test.
C. Split-half reliability: a special case of alternate-forms reliability. The same
individuals are tested on one occasion with a single test. A score is calculated
for each half of the test for each individual and the consistency of the two
halves is compared.
D. Scorer reliability: the degree of consistency of scores from different scorers
for the same individuals on the same test (interrater reliability), or from the
same scorer for the same individuals on the same test but on different
occasions (intrarater reliability). Scorer reliability is an issue when scores are
based on subjective judgments.
(e.g. rater unreliability: two teachers observe two ESL students during a
conversation together. Both teachers listen to and assess the same
conversation yet report different interpretations. Other sources of
unreliability are instrument-related and person-related.)
Measuring reliability:
The reliability index is a number ranging from .00 to 1.00 that indicates what
proportion of the measurement is reliable. An index of .80 means that your measurement
is 80% reliable and 20% error. For the purpose of classroom testing, a reliability
coefficient of at least .70 is good. Higher reliability coefficients would be expected of
standardized tests used for large-scale administration (.80 or better).
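One common way to estimate such a coefficient is to correlate two sets of scores for the same students, as in test-retest or alternate-forms reliability. The following Python sketch is purely illustrative: the scores are invented, and the calculation is an ordinary Pearson correlation, not a procedure prescribed by this text.

# Illustrative sketch: estimating test-retest reliability as the Pearson
# correlation between two administrations of the same test.
# The scores below are invented for demonstration.

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

first_sitting = [72, 85, 64, 90, 58, 77, 69, 81]    # hypothetical scores
second_sitting = [70, 88, 61, 93, 55, 74, 72, 79]   # same students, later date

print("test-retest reliability coefficient:",
      round(pearson(first_sitting, second_sitting), 2))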
Improving reliability:
Rater reliability: to use more than one well-trained and experienced observer,
interviewer, or composition reader.
Person-related reliability: to assess on several occasions.
Instrument-related reliability: to use a variety of methods of information collection.
(A test which is reliable is not necessarily valid. A test may have maximum
consistency but may not be measuring what it is specifically intended to measure.
However, an instrument can be only as valid as it is reliable: inconsistency in a
measurement reduces validity.)
3. Practicality: practical considerations like cost, administration time, administrator
qualifications, and acceptability.

Background:
Historically, language-testing trends and practices have followed the changing
winds of teaching methodology.
In the 1950s and 1960s, under the influence of behaviorism and structural
linguistics, language tests were designed to assess learners’ mastery of different areas
of the linguistic system such as phoneme discrimination, grammatical knowledge and
vocabulary. Tests often used objective testing formats such as multiple choice.
However, such discrete item tests provided no information on learners’ ability to use
language for communicative purposes.
In the 1970s and early 1980s this led to an upsurge of integrative tests such as
cloze and dictation, which required learners to use linguistic and contextual
knowledge to reconstitute the meaning of written or spoken texts.
Since the early 1980s, with the widespread adoption of Communicative Language
Teaching (CLT), assessment has become increasingly direct. Many language tests
often contain tasks which resemble the kinds of language-use situations that test
takers would encounter in using the language for communicative purposes in
everyday life. The tasks typically include activities such as oral interviews, listening
to and reading extracts from the media, and various kinds of authentic writing tasks
which reflect real-life demands. Today, test designers are still challenged in their quest
for more authentic, content-valid instruments that simulate real-world interaction
while still meeting reliability and practicality criteria.
The best way to evaluate students’ performance in a second language is still a
matter of debate. Given the wide variety of assessment methods available and the lack
of consensus on the most appropriate means to use, the best way to assess language
performance in the classroom may be through a multifaceted or eclectic approach,
whereby a variety of methods are used.
Discrete-point/ integrative assessment: (p.161)
Since the 1960s, the notion of discrete-point assessment, that is, assessing one
and only one point at a time, has met with some disfavor among theorists. They feel
that such a method provides little information on the student’s ability to function in
actual language-use situations. They also contend that it is difficult to determine
which points are being assessed. In the past, testing points have been determined in part
by a contrastive analysis of differences between the target and the native languages. But
this contrastive analysis was criticized for being too limiting. About 20 years ago an
integrative approach emerged, with the emphasis on testing more than one point at a
time.
There is actually a continuum from the most discrete-point items on the one hand to the
most integrative items or procedures on the other. Most items fall somewhere in
between.
Direct/ indirect assessment:
A direct measure samples explicitly from the behavior being evaluated, while an
indirect measure is contrived to the extent that the task differs from a normal
language-using task. There is increasing concern that assessments need to be
developed that directly reflect the traits they are supposed to measure.
Traditional test items:
1. Multiple-choice (show your items to your colleagues)
The multiple-choice item, like other paper-and-pencil tests (e.g. true-false items,
matching items, short questions), measures whether the student knows or understands
what to do when confronted with a problem situation. Multiple-choice items are
favored because their scoring can be reliable, rapid, and economical. However, they
cannot determine how the student will actually perform in that situation. Furthermore,
the format is not well adapted to measuring some problem-solving skills or the ability to
organize and present ideas.
Suggestions for constructing multiple-choice items:
1. The correct answer must not be dubious.
(e.g.) Which is the odd one out?
a. rabbit
b. hare
c. bunny
d. deer
2. Items should be presented in context.
(e.g.) Fill in the blank with the most suitable option:
Visitor: Thank you very much for such a wonderful visit.
Hostess: We were so glad you could come. Come back______.
a. soon
b. later
c. today
d. tomorrow
3. All distracters should be plausible.
(e.g.) What is the major purpose of the United Nations?
a. To develop a new system of international law.
b. To provide military control of nations that have recently attained their
independence. (vs. To provide military control).
c. To maintain peace among the peoples of the world.
d. To establish and maintain democratic forms of government in newly formed
nations (vs. To form new governments).
4. For young students, 3-choice items may be preferable in order to reduce the
amount of reading. For other learners, 4 or 5 choices are favored to reduce the
chances of guessing the correct answer.
2. Essay questions:
Learning outcomes concerned with the abilities to select, organize, integrate,
relate, and evaluate ideas require the freedom of response provided by essay questions.
The format emphasizes the integration and application of thinking and problem-solving
skills. However, the most serious limitation is the unreliability of the scoring. Another
closely related limitation is the amount of time required for scoring the answers. A
series of studies has shown that answers to essay questions are scored differently by
different teachers and that even the same teachers score the answers differently at
different times. One teacher stresses factual content, one organization of ideas, and
another writing skills. With each teacher evaluating the degree to which different
learning outcomes are achieved, it is not surprising that their scoring diverges so
widely. Scoring reliability can be increased by clearly defining the outcomes to
be measured, properly framing the questions, carefully following scoring rules, and
obtaining practice in scoring.
Suggestions for constructing essay questions:
A prompt for the essay is presented in the form of a mini-text that the respondents
need to understand and operationalize. We need to give careful consideration to the
instructions the respondents attend to in accomplishing the testing tasks.
1. Instructions should be brief, but explicit.
2. Specific about the form the answers are to take—if possible, presenting a
sample question and answer.
3. Informative as to the value of each item and section of the assessment
instrument, the time allowed for the test, and whether speed is a factor.
4. Formulate questions that will call forth the behavior specified in the
learning outcomes.
5. Phrase each question so that the students’ task is clearly defined.
(e.g.): (incorrect answers may be due to misinterpretation or to lack of
achievement)
Write a one-page statement defending the importance of conserving our
natural resources. Your answer will be evaluated in terms of its organization,
comprehensiveness, and the relevance of the arguments presented. (30%, 30
minutes)
Describe the similarities and differences between --- (comparing)
What are major causes of --- (cause and effect)
Briefly summarize the contents of --- (summarizing)
Describe the strengths and weaknesses of the following --- (evaluating)
Appraising tests:
Instead of discarding a classroom test after it has been administered and the
students have discussed the results, a better approach is to appraise the effectiveness
of the test items and to build a file of high-quality items for future use.
Scoring tests is not the final step in the evaluation process. Scores by themselves are
arbitrary; the main concern is the interpretation of the scores: (p.98)
Raw score: the score obtained directly as a result of tallying up all the items
answered correctly; it is usually not easy to interpret.
Percentage score: the number of items that students answered correctly divided by
the total items on the test.
Percentile: a number that tells what percent of individuals within the specified norm
group scored lower than the raw score of a given student.
Mean score: the average score for a given group of students, obtained by dividing the
sum of the students' scores by the number of scores involved.
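As a worked illustration of these interpretations, the short Python sketch below computes a percentage score, a percentile, and a mean from a set of invented raw scores; the numbers are hypothetical and only show the arithmetic.

# Illustrative arithmetic for the score interpretations above (invented data).

raw_scores = [34, 41, 28, 45, 38, 30, 42, 36]   # hypothetical raw scores, 50-item test
total_items = 50
student_score = 38                              # one student's raw score

percentage = student_score / total_items * 100
percentile = sum(s < student_score for s in raw_scores) / len(raw_scores) * 100
mean = sum(raw_scores) / len(raw_scores)

print("percentage score:", percentage)   # 76.0
print("percentile:", percentile)         # percent of the norm group scoring lower
print("mean score:", mean)               # 36.75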
Item difficulty: the ratio of correct responses to total responses for a given test item.
A norm-referenced assessment (which aims to differentiate among high and low achievers)
should have items that approximately 60% to 80% of the respondents answer
correctly.
A criterion-referenced assessment (which aims to determine whether students have
achieved the objectives of a course) would be expected to obtain item difficulties of
90% or better.
Formula: P = R/N x 100
(P = item difficulty, R = the number of students who got the item right, N = the total
number of students who tried the item)
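A minimal Python sketch of this formula, using invented response data (1 = correct, 0 = incorrect) for a single item:

# Item difficulty P = R/N x 100, computed from invented responses to one item.

responses = [1, 1, 0, 1, 1, 1, 0, 1, 1, 0]   # hypothetical: 10 students, 1 = correct

R = sum(responses)    # students who got the item right
N = len(responses)    # students who tried the item
P = R / N * 100       # item difficulty as a percentage

print("item difficulty P =", P)   # 70.0, within the 60-80% norm-referenced range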
Item discrimination/item discriminating power: how well an item performs in
separating the better students from the weaker ones.
(While the item-difficulty index focuses on how the items fare, item discrimination
looks at how the respondents fare from item to item)
An item-discrimination level of .30 or above is generally agreed to be desirable.
Formula: D = (Ru - Rl) / (T/2)
(D = item discriminating power, Ru = the number of students in the upper group who got
the item right, Rl = the number of students in the lower group who got the item right,
T = the total number of students included in the item analysis)
An item with maximum positive discriminating power is one in which all students in
the upper group get the item right and all the students in the lower group get the item
wrong (D = (10 - 0)/10 = 1). An item with no discriminating power is one in which an
equal number of students in both the upper and lower groups get the item right
(D = (10 - 10)/10 = 0). It is also possible to obtain an index of negative discriminating
power, that is, one in which more students in the lower group than in the upper group get
the item right. Such items should be revised so that they discriminate positively, or
they should be discarded.
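A corresponding sketch of the discrimination index, again with invented counts; by the rule of thumb above, a result of .30 or higher would be considered desirable.

# Item discrimination D = (Ru - Rl) / (T/2), computed from invented group counts.

Ru = 9    # hypothetical: upper-group students who got the item right
Rl = 4    # hypothetical: lower-group students who got the item right
T = 20    # total students in the item analysis (10 upper + 10 lower)

D = (Ru - Rl) / (T / 2)
print("item discrimination D =", D)   # 0.5, above the .30 level generally considered desirable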
Piloting of assessment instruments
Ideally, an assessment instrument that is intended to perform an important
function in an institution would undergo piloting on a small sample of respondents
similar to those for whom it is designed. The pilot administration provides the
assessor feedback on the items and procedures. The assessor can obtain some valuable
insights about what part of the instrument needs to be revised before it is administered
to a larger group.
Assessing the speaking skills
If students are asked to participate in communicative, open-ended activities in
the classroom, then it is hypocritical to assess their progress with discrete point
grammar tests. The test should be designed to give students a real-life, culturally
authentic task.
1. Interviews: Greeting, warm-up chat, close-up.
2. Pair discourse
3. Group oral
4. Tape recording
A. Oral descriptions of visuals: Visuals provide an appropriate conversation stimulus at
the novice level, not only because they serve as a psychological prop, but also because
they lend themselves to listing and identifying tasks for the students. There may be many
possibilities for appropriate answers.
Sources of visuals: the teacher’s or students’ personal slides and photos,
yearbook pictures, magazine pictures.
B. Role-play: Students must function in a "survival situation" of the kind they might
encounter in real life.
(e.g.) Situation cards: On which are listed role-playing instructions for the
learners.
1. When you see two of your friends at the mall, you decide to invite them to your
birthday party. Tell them when and where it is, what you will do at the party, how
many people will be there, and any other details you think they would be
interested in knowing.
2. Leave a message on the answering machine with the following information:
- Leave your name and the time you called
- Tell the person where you are going tonight.
- Tell the person you’ll see him/her tomorrow at a particular place and time.)
Assessing reading comprehension (p.211)
When readers approach text on the basis of the prior content, language, and
textual schemata that they may have with regard to that particular text, this is referred
to as top-down reading. When readers focus exclusively on what is present in the text
itself, and especially on the words and sentences of the text, this is referred to as
bottom-up reading. Successful learners usually display a combination of top-down
and bottom-up reading.
Test constructors and users of assessment instruments should be aware of the
skills tested by reading comprehension questions. There are numerous taxonomies of
such skills:
1. The recognition of words and phrases of similar or opposing meaning.
5. The identifying or locating of information.
6. The discriminating of elements or features within context; the analysis of
elements within a structure and of the relationship among them --- e.g.
causal, sequential, chronological, hierarchical.
7. The interpreting of complex ideas, actions, events, and relationships.
8. Inferring --- the deriving of conclusions, and predicting the continuation.
9. Synthesis
10. Evaluation
Testing methods:
1. Fixed-response format: multiple-choice
2. Structured-response format:
Cloze: The cloze test is extensively used as a completion measure, ideally
aimed at tapping reading skills interactively, with respondents using cues
from the text in a bottom-up fashion as well as bringing their background
knowledge to bear on the task.
(e.g.)
I am one of ______ people who simply cannot ______ up. To me, the ______
of an orderly living ______ is as remote and ______ as trying to climb
______ Fuji.
The cloze has been used as a measure of readability, global reading skills,
grammar, and writing. It can be scored according to an exact-word,
acceptable-word, or multiple-choice approach. (p.139)
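As a rough sketch of how a fixed-ratio cloze could be built and keyed, the Python fragment below blanks out every nth word and keeps the deleted words as an exact-word scoring key; the passage, the deletion rate, and the helper name are invented for illustration.

# Illustrative cloze construction: delete every nth word and keep an exact-word key.
# The passage, deletion rate, and helper name are invented for this sketch.

def make_cloze(text, n=7, start=3):
    """Blank out every nth word, starting at word index `start`; return text and key."""
    words = text.split()
    key = {}
    for i in range(start, len(words), n):
        key[i] = words[i]
        words[i] = "______"
    return " ".join(words), key

passage = ("I am one of those people who simply cannot tidy up. To me, the idea "
           "of an orderly living room is as remote as trying to climb Mount Fuji.")
cloze_text, answer_key = make_cloze(passage)
print(cloze_text)
print(answer_key)   # exact-word key; acceptable-word scoring would also accept synonyms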
Assessing listening skills
The following are examples of listening comprehension items and
procedure, ranging from the most discrete-point items to more integrative
assessment tasks. The teacher must decide when it is appropriate to use any of
these approaches for assessing listening in the classroom.
1. Discrimination of sounds: sound-discrimination items are of particular
benefit in assessing points of contrast between two given languages.
(e.g.) The respondents indicate which sound of three is different from the
other two.
(1) sun, (2) put, (3) dug
2. Listening for grammatical distinctions: The respondents could listen for
an inflectional marker, or determine whether the subject and verb are in the
singular or the plural.
(e.g.) The boys sing well:
(1) singular, (2) plural, (3) same form for singular and plural
3. Listening for vocabulary: The respondents perform an action in response to a
command (e.g. getting up, walking to the window) or draw a picture
according to oral instructions (e.g. coloring a picture a certain way).
4. Auditory comprehension:
a. The respondents hear a statement and must indicate the appropriate
paraphrase for the statement.
(e.g.) What’d you get yourself into this time?
(1) What are you wearing this time?
(2) What did you buy this time?
(3) What’s your problem this time?
b. The respondents listen in on a telephone conversation and at
appropriate times must indicate what they would say if they were
one of the speakers in the conversation.
(e.g.)
Mother: Well, Mary, you know you were supposed to call me last
week.
Mary: I know, Mom, but I got tied up.
Mother: That’s really no excuse.
Mary: (1) Yes, I’ll call him.
(2) You’re right. I’m sorry.
(3) I’ve really had nothing to do.
5. Communicative assessment:
There has been a general effort in recent years to make communicative types
of listening comprehension tests more authentic. It has been suggested that
increased attention be paid to where the content of the assessment instrument
falls along the oral/literate continuum --- from a news broadcast, to a lecture,
to a consultative dialogue.
a. Lecture task: the respondents hear a lecture, with filled pauses and
other features that make it different from oral recitation of a written
text. After the lecture, tape-recorded multiple-choice, structured, or
open-ended questions are presented, with responses to be written
on the answer sheet.
b. Dictation: Dictation can serve as a measure of auditory
comprehension if it is given at a fast enough pace so that it is not
simply a spelling test. Nonnative respondents must segment the
sounds into words, phrases, and sentences.
Assessing writing skills:
Two principal types of scoring scales are used for rating writing:
1. Holistic scoring: a single score is assigned to a student’s overall writing
performance. This is basically what teachers do when they assign number or
letter grades to students’ compositions. Holistic scores represent teachers’
overall impressions and judgments. As such, they can serve as general
incentives for learning, and they can distinguish students with respect to their
general achievement in writing. However, because they provide no detailed
information about specific aspects of performance (e.g. grammatical ability),
they are not very useful in guiding teaching and learning.
2. Analytic scoring: different components or features of the students’ responses
are given separate scores (on an essay, spelling, grammar, organization, and
punctuation might be scored separately). Individual analytic scores are
sometimes added together to yield a total score. The scoring categories
included in an analytic system should reflect instructional objectives and
plans. Determining levels of performance for each category generally reflects
teachers’ expectations, based on past experience. Analytic scoring provides
useful feedback to students and diagnostic information to teachers about
specific areas of performance that are satisfactory or unsatisfactory. This
information can be useful for planning instruction and studying.
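As an illustration of how analytic scores can be combined, the sketch below adds separate category scores into a total; the categories and point values are hypothetical, not a prescribed scheme.

# Illustrative analytic scoring: separate category scores added into a total.
# The categories and maximum points are hypothetical examples.

analytic_scores = {
    "content": 22,        # out of 30
    "organization": 16,   # out of 20
    "vocabulary": 15,     # out of 20
    "grammar": 18,        # out of 25
    "mechanics": 4,       # out of 5 (spelling, punctuation)
}

total = sum(analytic_scores.values())
print("total analytic score:", total, "/ 100")
for category, score in analytic_scores.items():
    print(" ", category, score)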
Alternative assessment methods:
Not only does assessment now reflect more of an integrative approach, but it has also
become clear that the assessment of language benefits from the use of multiple means
over time. Alternative assessment methods, including observation, portfolios,
conferences, and dialogue journals, have led to the incorporation of the results they
provide into students' grades.
Observation:
Observation is an integral part of everyday teaching. Teachers continuously
observe their students’ language use during formal instruction or while the students
are working individually at their desks; teachers may arrange individual conference
times during which they observe students carefully on a one-to-one basis.
It is important to identify why you want to observe and what kinds of decisions
you want to be able to make based on your observations. A number of decisions need
to be made when planning observation:
1. Why do you want to observe and what decisions do you want to make as
a result of your observations?
2. What aspects of teaching or learning appropriate to these
decisions do you want to observe?
3. Do you want to observe individual students, small groups of students, or
the whole class?
4. Will you observe on one occasion or repeatedly?
5. How will you record your observations?
Three ways of recording classroom observations:
1. Anecdotal records
2. Checklist
3. Rating scales
Portfolios:
A purposeful collection of students’ work that demonstrates to students and
others their efforts, progress, and achievements in given areas. They provide a
continuous record of students’ language development that can be shared with others.
If portfolios are reviewed routinely by teachers and students in conference together,
then they can also provide information about students’ views of their own language
learning and strategies they apply in learning. This in turn can enhance student
involvement in and ownership of their own learning. Portfolios can encourage
students to assess their own strengths and weaknesses, and to identify their own goals
for learning. Portfolio assessment is student-centered, collaborative, and holistic, in
contrast to many assessment methods that treat students as objects of evaluation and
place the responsibility and task of assessment in the hands of teachers.
A file folder, box, or any durable container can serve as a portfolio. Samples of
writing, lists of books that have been read, book reports, tape recordings of speaking
samples, favorite short stories, and so on can all be included in a portfolio.
1. Have students include a brief note describing why each piece is included in
their portfolio; what they like about it; what they learned when they did it; and
where there could be improvement.
2. During portfolio conferences, ask students to describe their current strengths
and weaknesses and to indicate where they have made progress; ask them to give
evidence of this progress.
3. Teachers should be interested, supportive, and constructive when providing
responses to portfolio pieces and students’ reflections on their work.
Problems with portfolio assessment: One problem that arises is that teachers
have a lot of work to do in this approach to assessment. There is the issue of how
much time the teachers are willing to devote to this endeavor. Another problem is that
the approach may make teachers anxious since any failures will reflect on them. In
addition, the emphasis on revising has been viewed as pampering students too much
and letting lazy students get by.
Conferences
Conferences can be used as part of evaluation, and generally take the form of a
conversation or discussion between teachers and students about school work.
Conferences can include individual students, several students, or even the whole class.
They can be conversations about completed work, or about work in progress.
At all times, students must feel that the conference is under their control and for
their benefit. Begin by having students review the work for you; permit them to
comment on whatever is important from their point of view, even though it might not
seem so to you. To facilitate discussion, the following kinds of questions can be asked
of students:
1. What do you like about this work?
2. What do you think you did well?
3. How does it show improvement from previous work? Can you show me the
improvement?
4. Did you have any difficulties with this piece of work? Can you show me where
you had difficulty? What did you do to overcome it?
5. What strategies did you use to figure out the meaning of words you could not read?
Or what did you do when you did not know a word that you wanted to write?
6. Are there things about this work you do not like? Are there things you would like
to improve?
Conduct your conferences with each student on a regular basis throughout the
year or course in order to monitor progress and difficulties that might be impeding
progress and to plan lessons or instruction that is responsive to students’ ongoing
needs. We do not recommend using conferences for grading purposes because grading
generally focuses on learning outcomes or achievement, whereas the primary focus of
conferences is process.
Dialogue journal
Journals are written conversations between students and teachers. They provide
opportunities for students to provide feedback about their learning experiences. There
are a number of important benefits:
1. They provide useful information for individualizing instruction, for example:
a. writing skills
b. writing strategies
c. students’ experiences in and outside of school
d. learning process
e. attitudes and feelings about themselves, their teachers, and schooling
f. their interests, expectations, goals
2. They increase opportunities for functional communication between students and
teachers.
3. They give students opportunities to use language for genuine communication and
personalized reading.
4. They permit teachers to individualize language teaching by modeling writing
skills in their responses to student journals.
5. They promote the development of certain writing skills.
6. They enhance student involvement in and ownership of learning.
The following guidelines for using journals are suggested:
1. Collect students’ journals on a regular basis and read them carefully before
returning them. Keep the interval between readings as brief as possible; otherwise,
students may perceive the feedback contained in their journals as unimportant and
merely a writing exercise.
2. Encourage students to write about their success as well as their difficulties and
hardships. Similarly, encourage them to write about classroom activities and
events that they found useful, effective, and fun as well as those they found to be
confusing, useless, uninteresting, or frustrating.
3. Be patient and allow students time to develop confidence in the process of sharing
their personal impressions.
4. Avoid the use of evaluative or judgmental comments to ensure students’
confidence and candor.
Self-assessment: having the learners evaluate their own performance. Learners could
be asked to rate their ability to perform specific functional tasks or to indicate whether
they would be able to respond successfully to given assessment items or procedures.
Careful self-assessment by students in a classroom could be one of the means for
multiple assessment. (p.197)
Reporting results:
We need to deemphasize the significance of certain standardized test scores by
complementing them with a host of ongoing and comprehensive assessment measures,
including portfolios, self-assessment, and observations of student performance in
class, student journals, and other forms of assessment. Having a variety of
information about teaching and learning can enhance the reliability of your
assessments and the validity of your decision making.
Appendix:
1. Halo effect:
Rating an examinee high on one trait or characteristic simply because he or she
rates high on other characteristics.
2. Hawthorne effect: (placebo effect)
The influence of researcher’s presence on the outcome of study (lighting on
worker productivity/ attention increases productivity). If a new teaching
method/textbook is used, there may be improvement in learning which is due not
the method/textbook, but to the fact that it is new.
How to avoid the Hawthorne effect: do not emphasize the experiment/observation;
take measurements once the new treatment has become routine.
3. John Henry effect: (improved control group performance)
The tendency of control group subjects who know they are in an experiment to
exert extra effort and hence to perform above their expected average (as in the legend
of John Henry, the railroad worker who competed against a steam drill in laying
railroad tracks).
How to avoid: provide baseline data before the experiment is introduced;
measurement should also occur after the experiment is over.