Language Testing
Liu Jianda
Syllabus
It is expected that, by the end of this module, participants should be
able to do the following:
 Understand the general considerations that must be addressed in the
development of new tests or the selection of existing language tests;
 Make their own judgements and decisions about either selecting an
existing language test or developing a new language test;
 Familiarise themselves with the fundamental issues, approaches, and
methods used in measurement and evaluation;
 Design, develop, evaluate and use language tests in ways that are
appropriate for a given purpose, context, and group of test takers;
 Understand the future development of language testing and the
application of IT to computerized language testing.
Syllabus
In order to achieve these objectives, the module gives participants the
opportunity to develop the following skills:
 writing test items
 collecting test data and conducting item analysis
 evaluating language tests with regard to validity and reliability
This is done by considering a wide range of issues and topics related to
language testing. These include the following:
 General concepts in language testing and evaluation
 Evaluation of a language test: reliability and validity
 Communicative approach to language testing
 Design of a language test
 Item writing and item analysis
 Interpreting test results
 Item response theory and its applications
 Computerized language testing and its future development
Class Schedule
1 Basic concepts in language testing
2 Test validation: reliability and validity (1)
3 Test validation: reliability and validity (2)
4 Test construction (1)
5 Test construction (2)
6 Test construction (3)
7 Test construction (4)
8 Test construction (5)
9 Test construction (6)
10 Rasch analysis (1)
11 Rasch analysis (2)
12 Language testing and modern technology
Assessment
One 5,000–6,000 word paper on language testing
Collaborative work:
You will be divided into groups of four to complete the development of a
test paper. Each of you will be responsible for one part of the test paper,
but each part should contribute equally to the whole. Therefore, besides
developing your own part, you will need to come together to discuss the
whole test paper in terms of reliability and validity.
Course books
 Bachman, L. F. & Palmer, A. (1996). Language Testing in Practice.
Oxford: Oxford University Press.
 Brown, J. D. (1996). Testing in Language Programs. Upper Saddle
River, NJ: Prentice Hall Regents.
 Li, X. (1997). The Science and Art of Language Testing. Changsha:
Hunan Educational Press.
 McNamara, T. (1996). Measuring Second Language Performance.
London; New York: Longman.
Website:
http://www.clal.org.cn/personal/testing/Leeds
Session 1
Basic concepts in language testing
A short history of language testing
Spolsky (1978) classified the development
of language testing into three periods, or
trends:
 the prescientific period
 the psychometric/structuralist period
 the integrative/sociolinguistic period.
The prescientific period
 grammar-translation approaches to language teaching
 translation and free composition tests
 difficult to score objectively
 no statistical techniques applied to validate the tests
 simple, but unfair to students
The psychometric-structuralist period
 audio-lingual and related teaching methods
 objectivity, reliability, and validity of tests considered
 tests measure discrete structure points
 multiple-choice format (standardized tests)
 tests follow scientific principles, with trained linguists and language
testers involved
The integrative-sociolinguistic period
Communicative competence
 Chomsky's (1965) distinction between competence and performance:
  Competence: an ideal speaker-listener's knowledge of the rules of the
language;
  Performance: the actual use of language in concrete situations.
 Hymes's (1972) proposal of communicative competence:
  the ability of native speakers to use their language in ways that are not
only linguistically accurate but also socially appropriate.
 Canale & Swain's (1980) framework of communicative competence:
  Grammatical competence: mastery of the language code, such as
morphology, lexis, syntax, semantics, and phonology;
  Sociolinguistic competence: mastery of appropriate language use in
different sociolinguistic contexts;
  Discourse competence: mastery of how to achieve coherence and cohesion
in spoken and written communication;
  Strategic competence: mastery of communication strategies used to
compensate for breakdowns in communication and to enhance the
effectiveness of communication.
The integrative-sociolinguistic period
Bachman's (1990) framework of communicative language ability:
 Language competence (subsuming Canale & Swain's grammatical,
sociolinguistic, and discourse competence):
  organizational competence
   - grammatical competence
   - textual competence
  pragmatic competence
   - illocutionary competence
   - sociolinguistic competence
 Strategic competence: performs assessment, planning, and execution
functions in determining the most effective means of achieving a
communicative goal
 Psychophysiological mechanisms: characterize the channel (auditory,
visual) and mode (receptive, productive)
The integrative-sociolinguistic period
 Oller's (1979) pragmatic proficiency test:
  temporally and sequentially consistent with the real-world occurrence
of language forms
  linked to a meaningful extralinguistic context familiar to the testees
 Clark's (1978) direct assessment: approximating the testing context to
the real world to the greatest extent possible
 Cloze test and dictation (Yang, 2002b)
 Communicative testing, or testing communicatively
The integrative-sociolinguistic period
 Performance tests (Brown, Hudson, Norris, & Bonk, 2002; Norris, 1998):
  not discrete-point in nature
  integrate two or more of the language skills of listening, speaking,
reading, and writing, along with other aspects such as cohesion and
coherence, suprasegmentals, paralinguistics, kinesics, pragmatics, and
culture
  task-based: essays, interviews, extensive reading tasks
Performance Tests
Three characteristics. First, the task should:
 be based on needs analysis (What criteria should be used? What
content and context? How should experts be used?)
 be as authentic as possible, with the goal of measuring real-world
activities
 sometimes have collaborative elements that stimulate communicative
interactions
 be contextualized and complex
 integrate skills with content
 be appropriate in terms of number, timing, and frequency of
assessment
 be generally non-intrusive, that is, aligned with the daily activities of
the language classroom
Performance Tests
 Second, raters should be appropriate in terms of:
  number of raters
  overall expertise
  familiarity with, and training in the use of, the rating scale
 Third, the rating scale should be based on appropriate:
  categories of language learning and development
  breadth of information regarding learner performance abilities
  standards that are both authentic and clear to students
 To enhance the reliability and validity of decisions as well as
accountability, performance assessments should be combined with other
methods of gathering information (e.g. self-assessments, portfolios,
conferences, classroom behaviors, and so forth).
Development graph (Li, 1997: 5)
2. Theoretical issues
 Language testing is concerned with both content and methodology.
Development since 1990
 Communicative language testing (Weir, 1990)
 Reliability and validity
 Social functions of language testing
 Ethical language testing
Washback (impact) (Qi, 2002; Wall, 1997)
 Impact: the effects of tests on individuals, policies, or practices within
the classroom, the school, the educational system, or society as a whole
 Washback: the effects of tests on language teaching and learning
 Ways of investigating washback:
  analyses of test results
  teachers' and students' accounts of what takes place in the classroom
(questionnaires and interviews)
  classroom observation
Ethics of test use
 use with care (Spolsky, 1981: 20)
 codes of practice
Professionalization of the field
 training of professionals
 development of standards of practice and mechanisms for their
implementation and enforcement
Critical language testing
 situating language testing within society
Factors affecting the performance of examinees
[Diagram: a test score reflects not only communicative language ability
but also test method facets, personal attributes, and random factors.]
Development since 1990
 Testing interlanguage pragmatic knowledge
  currently at the research level
  focus on method validation
  web-based test by Roever
 Computerized language testing
  item banking
  computer-assisted language testing
  computerized adaptive language testing
   - test items are adapted to the individual
   - the test ends when the examinee's ability has been determined
   - test time is much shorter
  web-based testing
  PhonePass testing
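The adaptive idea — choose each item to match the current ability estimate and stop once the estimate is stable — can be sketched in a toy loop. This is a simplification for illustration only: operational CATs use maximum-information item selection and maximum-likelihood or Bayesian ability estimation, and the item bank, answer rule, and step sizes below are invented.

```python
def adaptive_test(answer, item_bank, max_items=10, tolerance=0.3):
    """Toy computerized-adaptive-testing loop (illustrative sketch only).
    answer(b): simulates the examinee's response to an item of difficulty b.
    After each item the ability estimate moves up (correct) or down (wrong)
    by a shrinking step; the test stops early once the step size — a crude
    proxy for measurement precision — falls below the tolerance."""
    theta, step = 0.0, 1.0
    for _ in range(max_items):
        # pick the unused item whose difficulty best matches the estimate
        b = min(item_bank, key=lambda d: abs(d - theta))
        item_bank.remove(b)
        theta += step if answer(b) else -step
        step *= 0.7  # shrink the adjustment as evidence accumulates
        if step < tolerance:
            break
    return theta

# Simulated examinee of true ability 1.0 (answers correctly iff b <= 1.0)
bank = [-2, -1, -0.5, 0, 0.5, 1, 1.5, 2]
est = adaptive_test(lambda b: b <= 1.0, list(bank))
print(round(est, 2))  # converges near the true ability after only 4 items
```

Note how the loop homes in on the examinee's level without administering the whole bank — the property that makes adaptive tests "much shorter".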
Development since 1990
Language testing and second language acquisition (Bachman & Cohen,
1998)
 helps to define the construct of language ability
 uses findings from language testing to test hypotheses in SLA
 provides SLA researchers with testing instruments and standards of
testing
Development of research methodology
Factor analysis
The main applications of factor analytic techniques are:
 (1) to reduce the number of variables, and
 (2) to detect structure in the relationships between variables, that is,
to classify variables.
Factor analysis is therefore applied as a data reduction or structure
detection method.
Generalizability theory (Bachman, 1997; Bachman, Lynch, & Mason, 1995)
 Estimates the relative effects of different factors (facets) on test scores
 The most generalizable indicator of an individual's language ability is
the universe score; in the real world, however, we can only obtain scores
from a limited sample of measures, so we need to estimate the
dependability of a given observed score as an estimate of the universe
score.
Two stages are involved in applying G-theory to test development
G-study
The purpose is to estimate the effects of the various facets in the
measurement procedure (usually conducted in pretesting):
 main effects, e.g. persons (differences in individuals' speaking ability),
raters (differences in severity among raters), tasks (differences in the
difficulty of tasks);
 two-way interactions:
  task x rater: different raters rate the different tasks differently
  person x task: some tasks are differentially difficult for different
groups of test takers (a source of bias)
  person x rater: some raters score the performance of different groups
of test takers differently (an indication of rater bias)
Two stages are involved in applying G-theory to test development
D-study
The purpose is to design an optimal measure for the interpretations or
decisions that are to be made on the basis of the test scores (estimation
of dependability).
 The generalizability coefficient (G coefficient) provides an estimate of
the proportion of an individual's observed score that can be attributed to
his or her universe score, taking into consideration the effects of the
different conditions of measurement specified in the universe of
generalization. It is appropriate only for norm-referenced tests.
 For criterion-referenced tests, use the phi coefficient.
(Software: GENOVA)
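For the simplest persons-by-raters design, the G coefficient reduces to the person variance divided by the person variance plus the rater-averaged error variance. The sketch below uses made-up variance components (not from any study cited here) to show how averaging over more raters raises dependability:

```python
def g_coefficient(var_p, var_pr_e, n_raters):
    """G coefficient for a simple persons-by-raters (p x r) design.
    var_p:    variance component for persons (true ability differences)
    var_pr_e: person-by-rater interaction plus residual variance
    n_raters: number of raters whose ratings will be averaged
    The relative error shrinks as ratings are averaged over more raters."""
    relative_error = var_pr_e / n_raters
    return var_p / (var_p + relative_error)

# Hypothetical variance components from a G-study
print(round(g_coefficient(4.0, 2.0, 1), 2))  # single rater: 0.67
print(round(g_coefficient(4.0, 2.0, 4), 2))  # four raters averaged: 0.89
```

This is exactly the D-study logic: given the G-study's variance estimates, try out designs (here, numbers of raters) until the coefficient reaches an acceptable level.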
Item response theory (Rasch model)
IRT enables us to estimate the statistical properties of items and the
abilities of test takers in such a way that these estimates do not depend
on a particular group of test takers or a particular form of a test. It is
widely used in large-scale standardized testing.
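The Rasch model itself is a one-parameter logistic function of the difference between a test taker's ability and an item's difficulty, both expressed on the same logit scale. A minimal sketch:

```python
import math

def rasch_prob(theta, b):
    """Rasch model: probability that a test taker of ability theta answers
    an item of difficulty b correctly (both parameters on the logit scale)."""
    return 1 / (1 + math.exp(-(theta - b)))

# When ability equals item difficulty, the probability is exactly 0.5
print(rasch_prob(0.0, 0.0))              # 0.5
# A stronger candidate on the same item
print(round(rasch_prob(1.0, 0.0), 3))    # 0.731
```

Because only the difference theta - b enters the formula, item difficulties can be estimated independently of which group of test takers happened to sit the test — the sample-independence property mentioned above.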
Structural equation modeling (Antony John Kunnan, 1998)
 a combination of multiple regression, path analysis, and factor analysis
 attempts to explain a correlation or covariance matrix derived from a
set of observed variables; latent variables are posited as responsible for
the covariance among the measured variables
Basic procedures in SEM (example from Purpura, 1998)
 Examine the relationships between strategy use and second language
test performance.
 Design two questionnaires covering cognitive and metacognitive
strategies (40 items).
 Ask respondents to answer the questionnaires.
 Have the respondents take a foreign language test.
 Cluster the 40 items to measure several variables.
 Compute the reliability of the variables.
 Conduct factor analysis to identify factors.
 Conduct the SEM analysis (AMOS, EQS, LISREL).
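The "compute the reliability" step is commonly done with Cronbach's alpha, which compares the sum of the item variances with the variance of the total scores. The sketch below uses invented questionnaire data to illustrate the formula — it is not Purpura's actual analysis:

```python
def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(item_scores):
    """Cronbach's alpha: k/(k-1) * (1 - sum of item variances / total variance).
    item_scores: one list per item, each holding all respondents' scores."""
    k = len(item_scores)
    totals = [sum(resp) for resp in zip(*item_scores)]  # per-respondent totals
    item_var_sum = sum(variance(item) for item in item_scores)
    return k / (k - 1) * (1 - item_var_sum / variance(totals))

# Three questionnaire items answered by five respondents (made-up data)
items = [
    [4, 3, 5, 2, 4],
    [4, 2, 5, 3, 4],
    [5, 3, 4, 2, 5],
]
print(round(cronbach_alpha(items), 2))  # 0.89
```

High alpha indicates the clustered items covary enough to be treated as indicators of a single underlying variable before moving on to the factor analysis and SEM steps.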
Qualitative methods
 verbal report (think-aloud, introspective)
 observation
 questionnaires and interviews
 discourse analysis
3. Classification of language tests
According to test families:
 Norm-referenced tests
 Criterion-referenced tests
Norm-referenced tests
 Measure global language abilities (e.g. listening, reading, speaking,
writing)
 A score on the test is interpreted relative to the scores of all other
students who took the test
 Scores are assumed to follow a normal distribution
Normal Distribution
http://stat-www.berkeley.edu/~stark/Java/NormHiLite.htm
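The norm-referenced logic can be illustrated by converting a raw score into a z-score and then into the proportion of a normally distributed norm group scoring below it. The test mean and standard deviation below are hypothetical:

```python
import math

def percentile_from_score(score, mean, sd):
    """Return the z-score and the proportion of a normally distributed
    norm group falling below the given raw score."""
    z = (score - mean) / sd
    # Standard normal CDF expressed via the error function
    pct_below = 0.5 * (1 + math.erf(z / math.sqrt(2)))
    return z, pct_below

# A learner scoring 65 on a test with mean 50 and SD 10 (hypothetical figures)
z, pct = percentile_from_score(65, 50, 10)
print(f"z = {z:.2f}, outscores about {pct:.1%} of the norm group")
# z = 1.50, outscores about 93.3% of the norm group
```

This is the sense in which a norm-referenced score is relative: 65 means little by itself, but "1.5 standard deviations above the mean" places the learner against everyone else who took the test.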
Norm-referenced tests
 Students know the format of the test but do not know what specific
content or skills will be tested
 A few relatively long subtests with a variety of question contents
Criterion-referenced tests
 Measure well-defined and fairly specific objectives
 Interpretation of scores is absolute, without reference to other
students' scores
 The distribution of scores need not be normal
 Students know in advance what types of questions, tasks, and content
to expect on the test
 A series of short, well-defined subtests with similar question contents
According to decision purposes
 Proficiency tests
 Placement tests
 Achievement tests
 Diagnostic tests
Proficiency tests
 Test students' general levels of language proficiency
 The test must provide scores that form a wide distribution, so that
interpretations of the differences among students will be as fair as
possible
 Can dramatically affect students' lives, so slipshod decision making in
this area would be particularly unprofessional
Placement tests
 Group students of similar (homogeneous) ability levels
 Help decide what each student's appropriate level will be within a
specific program
 Right tests for right purposes
Achievement tests
 Concern the amount of learning that students have done
 The decision may involve who will be advanced to the next level of
study or which students should graduate
 Must be designed with specific reference to a particular course
 Criterion-referenced, conducted at the end of the program
 Used to make decisions about students' levels of learning; can also be
used to effect curriculum changes and to test those changes continually
against the realities of the program
Diagnostic tests
 Aimed at fostering achievement by promoting the strengths and
eliminating the weaknesses of individual students
 Require more detailed information about the very specific areas in
which students have strengths and weaknesses
 Criterion-referenced, conducted at the beginning or in the middle of a
language course
 The same test can be diagnostic at the beginning or in the middle of a
course but an achievement test at the end
 Perhaps the most effective use of a diagnostic test is to report the
performance level on each objective (as a percentage) to each student,
so that he or she can decide how and where to invest time and energy
most profitably
Formative assessment vs. summative assessment
 Formative: a judgment of an ongoing program, used to provide
information for program review, identification of the effectiveness of the
instructional process, and assessment of the teaching process
 Summative: a terminal evaluation employed in the general assessment
of the degree to which the larger outcomes have been attained over a
substantial part of, or all of, a course. It is used in determining whether
or not the learner has achieved the ultimate objectives for instruction,
which were set up in advance of the instruction.
Public examinations vs. classroom tests
 Purpose: proficiency vs. achievement (placement, diagnostic)
 Format: standardized vs. open (objective vs. subjective)
 Scale: large-scale vs. small-scale (self-assessment)
 Scores: normality, backwash