
INVESTIGATING THE RELIABILITY AND VALIDITY OF HIGH-STAKES ESL TESTS
Dr. Paula Winke
Michigan State University
winke@msu.edu
March 22, 2013, TESOL Dallas
Background

 Tests must be valid.
 Validity = evidence that a test actually measures what it is intended to measure.
 [Cartoon: "I can fly planes!"]
Background

 Validity can be measured:
  Quantitatively: psychometric measures of reliability (narrow). Tests should be reliable and consistent.
  Qualitatively: social, ethical, practical, and consequential considerations (broad). Tests should be fair, meaningful, cost efficient, developmentally appropriate, and do no harm. (e.g., Messick, 1980, 1989, 1994; Moss, 1998; McNamara & Roever, 2006; Ryan, 2002; Shohamy, 2001, 2006)
K-12, NCLB ELPA in Michigan
Who determines reliability & validity?
• Test developers
• Researchers
• Test stakeholders

Validity is a multifaceted concept that includes value judgments and social consequences (Messick, 1980, 1989, 1994, 1995; McNamara & Roever, 2006).
"Justifying the validity of test use is the responsibility of all test users." (Chapelle, 1999, p. 258)
Study 1: The ELPA in Michigan

 The purpose of this study was to evaluate the perceived effectiveness of the English Language Proficiency Assessment (ELPA), which is used in the state of Michigan to fulfill No Child Left Behind (NCLB) requirements.
 In particular, I wanted to look at the validity of the ELPA.
Goal of this Study

 Investigate the views of teachers and school administrators (educators) on the administration of the ELPA in Michigan
ELPA in Michigan

 Administered to 70,000 English Language Learners (ELLs) in K-12 annually since spring of 2006
 Fulfillment of No Child Left Behind (NCLB) and Federal Title I and Title III requirements
The ELPA

 Based on standards adopted by the State
 Subtests
   Listening
   Reading
   Writing
   Speaking
   Comprehension
 Scoring
   Basic
   Intermediate
   Proficient
Levels of the 2007 ELPA

 Level I for kindergarteners
 Level II for grades 1 and 2
 Level III for grades 3 through 5
 Level IV for grades 6 through 8
 Level V for grades 9 through 12
ELPA Technical Manual

 The 2006 MI-ELPA Technical Manual claimed the ELPA is valid:
  A. Item writers were trained
  B. Items and test blueprints were reviewed by content experts
  C. Item discrimination indices were calculated
  D. Item response theory was used to measure item fit and correlations among items and test sections
 The validity argument did not consider the test's consequences, fairness, meaningfulness, or cost and efficiency, all of which are part of a test's validation criteria (Linn et al., 1991).
Research Questions
1. What are educators' opinions about the ELPA and its administration?
2. Do educators' opinions vary according to the demographic or teaching environment in which the ELPA was administered?
267 Participants

Educator Type                                      Number   Percentage
Teachers of ESL & Language Arts                       166        62.2%
Teachers of English Literature & Other Subjects         5         1.9%
School Principals & Administrators                     62        23.2%
Others                                                 30        11.2%
Not Identified                                          4         1.5%
Total                                                 267       100%
Materials

 3-part online survey
  Part 1: 6 items on demographic information
  Part 2: 40 belief statements + comments
  Part 3: 5 open-ended questions
Procedure

 ELPA testing window: 3/19-4/27; online survey window: 3/29-5/20
 MITESOL listserv email sent 3/29
 Names and emails culled off the Web 3/30-4/7
 Reminder email to MITESOL and the culled list sent 5/14
Analysis

 Quantitative: factor analysis of the data generated from the 40 belief items; derived factor scores and used those to see if the ELL demographic or teaching environment affected survey outcomes
 Qualitative: deductive-analytic, qualitative analysis of answers to the open-ended questions and comments on the 40 closed items
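For readers who want to see what this kind of quantitative analysis looks like in practice, here is a minimal sketch of an exploratory factor analysis of Likert-type survey items in Python. The file name, column names, and five-factor solution are hypothetical stand-ins, and the third-party factor_analyzer package is assumed; this is an illustration of the technique, not the study's actual code.

```python
# Illustrative sketch only: exploratory factor analysis of Likert-type survey
# items, roughly parallel to the quantitative analysis described above.
# Assumes a CSV of responses ("elpa_survey.csv") with columns item_1..item_40;
# the file name, column names, and 5-factor solution are hypothetical.
import pandas as pd
from factor_analyzer import FactorAnalyzer  # third-party package: factor_analyzer

# Load the 40 belief items (one row per respondent).
items = pd.read_csv("elpa_survey.csv").loc[:, [f"item_{i}" for i in range(1, 41)]]
items = items.dropna()  # listwise deletion for the sketch; real studies may impute

# Extract five factors with an oblique (promax) rotation, since belief
# factors about the same test are likely to correlate with one another.
fa = FactorAnalyzer(n_factors=5, rotation="promax")
fa.fit(items)

# Loadings show which items cluster on which factor; items that load weakly
# on every factor are candidates for dropping from further analysis.
loadings = pd.DataFrame(fa.loadings_, index=items.columns)
print(loadings.round(2))

# Factor scores (one column per factor, one row per respondent) can then be
# compared across demographic or teaching-environment groups.
scores = pd.DataFrame(fa.transform(items),
                      columns=[f"factor_{k}" for k in range(1, 6)])
print(scores.describe())
```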
How Factor Analysis Works

[Diagram: Items 1-5 cluster on Factor 1, speaking ability; Items 6-10 cluster on Factor 2, listening ability.]

 Factor analysis runs correlations among the items that were answered.
 Items that cluster together (items that were answered in the same way) can be seen as related, that is, as tapping into the same underlying construct.
 After items are clustered, we can look at the items in each cluster and label the cluster: what is the theme underlying the cluster?
 Items that don't correlate with any larger factor (items that don't fit into a cluster) can be dropped from further analysis or discussion.
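As a toy illustration of the clustering idea on this slide (not the study's data), the following sketch simulates ten items driven by two latent abilities and prints their correlation matrix; the two blocks of high correlations are exactly the kind of structure factor analysis extracts as factors.

```python
# Toy illustration of the clustering idea above (not the study's data).
# Two latent abilities generate ten observed items: items 1-5 from "speaking",
# items 6-10 from "listening". Items driven by the same latent ability
# correlate highly with each other and only weakly across the two groups.
import numpy as np

rng = np.random.default_rng(0)
n = 500  # simulated respondents

speaking = rng.normal(size=n)   # latent ability 1
listening = rng.normal(size=n)  # latent ability 2

# Each observed item = latent ability + item-specific noise.
items = np.column_stack(
    [speaking + rng.normal(scale=0.6, size=n) for _ in range(5)] +
    [listening + rng.normal(scale=0.6, size=n) for _ in range(5)]
)

corr = np.corrcoef(items, rowvar=False)
print(corr.round(2))  # two visible blocks of high correlations: items 1-5 and 6-10
```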
Results: 1. Factor Analysis

Factor                          Alpha   Mean    SE     SD
1. Reading and writing tests     0.90   5.59   0.15   2.54
2. Effective administration      0.88   7.34   0.15   2.77
3. Impacts                       0.88   5.43   0.15   2.61
4. Speaking test                 0.90   5.88   0.15   2.49
5. Listening test                0.88   5.64   0.15   2.62
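The Alpha column above reports each factor's internal consistency (Cronbach's alpha). As a minimal sketch, assuming made-up item responses rather than the study's data, alpha can be computed from the items loading on one factor like this:

```python
# Minimal sketch of Cronbach's alpha, the internal-consistency statistic
# reported per factor in the table above. `items` is a respondents-by-items
# array holding only the items that load on one factor (hypothetical data).
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: 2-D array, rows = respondents, columns = items on one factor."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = items.sum(axis=1).var(ddof=1)     # variance of total scores
    return (k / (k - 1)) * (1 - item_vars / total_var)

# Example with made-up responses for a five-item factor:
rng = np.random.default_rng(1)
latent = rng.normal(size=(200, 1))
fake_items = latent + rng.normal(scale=0.7, size=(200, 5))
print(round(cronbach_alpha(fake_items), 2))
```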
ANOVA results: Factor 2 (effective administration) by ELL concentration

[Figure: mean Factor 2 scores, with upper and lower bounds, by ELL concentration—less than 5%: -0.97; 5 to 25%: -1.99; more than 25%: -2.14.]
ANOVA results: Factor 4 (speaking test) by ELL concentration

[Figure: mean Factor 4 scores, with upper and lower bounds, by ELL concentration—less than 5%: 0.19; 5 to 25%: -0.27; more than 25%: -0.71.]
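A hedged sketch of the kind of group comparison plotted above: a one-way ANOVA of factor scores across the three ELL-concentration groups. The scores below are simulated (centered on the chart's group means) rather than the study's data.

```python
# Hedged sketch of the group comparison suggested by the two charts above:
# a one-way ANOVA of factor scores across three ELL-concentration groups.
# The scores are simulated stand-ins for the derived factor scores.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# Simulated Factor 2 scores for three groups of educators.
low = rng.normal(loc=-0.97, scale=1.0, size=60)    # less than 5% ELLs
mid = rng.normal(loc=-1.99, scale=1.0, size=60)    # 5 to 25% ELLs
high = rng.normal(loc=-2.14, scale=1.0, size=60)   # more than 25% ELLs

f_stat, p_value = stats.f_oneway(low, mid, high)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```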
Qualitative Results

 A. Perceptions on the ELPA subtests
  Factor 1. Reading and writing
  Factor 4. Speaking
  Factor 5. Listening
 B. Logistics
  Factor 2. Effective administration
 C. Test impacts
  Factor 3. Impacts
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 65 educators wrote that the test was too difficult for lower grade levels.
 Of the 145 respondents who administered ELPA Level I, the test for kindergarteners, 30 commented that it was too hard or inappropriate.
 Most of these comments centered on the reading and writing portions of the test.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 1: Having given the ELPA grades K-4, I feel it is somewhat appropriate at levels 1-4. However I doubt many of our American, English speaking K students could do well on the K test. This test covered many literacy skills that are not part of our K curriculum. It was upsetting for many of my K students. The grade 1 test was very difficult for my non-readers who often just stopped working on it. I felt stopping was preferable to having them color in circles randomly. Educator 130, ESL teacher, administered Levels I, II, & III.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 The speaking subsection of the ELPA was seen as too subjective.
 31 educators wrote that the rubrics did not contain example responses or did not provide enough levels to allow for an accurate differentiation among student abilities.
 4 educators wrote that ELLs who were shy or did not know the test administrator did poorly on the speaking test.
 15 mentioned problems with two-question prompts that were designed to elicit two-part responses; some learners only answered the second question, resulting in low scores.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 2: [I]t [the rubric] focuses on features of language that are not important and does not focus on language features that are important; some items test way too many features at a time; close examination of the rubric language makes it impossible to make a decision to give 2 or 3 points because it is too subjective and not quantifiable enough because too many features are being assessed at a time. Educator 79, ESL teacher, administered Levels I, II, & III.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 3: For some this was adequate but for those students who are shy it didn't give an accurate measure at all of their true ability. I have a student who is an excellent English speaker but painfully shy and she scored primarily zeros because she was nervous and too shy to speak. Educator 30, ESL teacher, administered Levels III & IV.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 4: When you say in succession: "What would you say to give your partner directions? What pictures would be fun to make?" The students are only going to answer the last question and 100% of mine did! So it is already half wrong. Too many questions had a two part answer but only one bubble to fill in how they did. Not very adequate to me! Educator 133, ESL and bilingual teacher, administered Levels I & II.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 30 of the 151 educators who administered lower levels of the ELPA commented that the listening section was unable to hold the students' attention because it was too long and repetitive or had bland topics.
 9 complained of the reliance on memory and reading skills to answer questions.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 5: Some of the stories were very long for kindergarten and first graders to sit through and then they were repeated! The kids didn't listen well the second time it was read and had trouble with the questions because of the length of the story. Educator 176, Reading Recovery and literacy coach, administered Levels I, II, & III.
A. Perceptions on the ELPA subtests (Factors 1, 4 and 5)

 Example 6: The kindergarten level is where I saw the most difference between actual ability and test results. Here students could give me the correct answer orally but mark the wrong answer in the test booklet. Educator 80, ESL teacher, administered Levels I, II, & III.
B. Effective Administration (Factor 2)

 There was not adequate space or time for testing.
 Test materials did not arrive on time or were missing.
 Other national and state tests were being conducted at the same time, which overburdened the test administrators and the ELLs.
B. Effective Administration (Factor 2)

 Example 7: There are only six teachers to test over 600 students in 30 schools. We had to [administer the ELPA] ourselves because it was during IOWA testing and that was where the focus was. Because we are itinerant in our district we were given whatever hallway or closet was available. Educator 249, ESL teacher, administered all Levels.
B. Effective Administration (Factor 2)

 Example 8: The CD's and cassettes were late. Several of the boxes of materials were also late and I spent a lot of time trying to track them. NONE of the boxes came to the ESL office as requested. It was difficult with four grades in a building to find the right time to pull students and not mess with classwork and schedules. Educator 151, school administrator, administered all Levels.
C. Impacts (Factor 3)

 49 educators wrote that the test did not directly impact the ESL curricular content.
 86 educators wrote that the test reduced the quantity and quality of ESL services during all or part of the ELPA test window.
 79 educators commented on the negative psychological impact on students.
 7 of the 79 wondered how much money and resources the ELPA cost Michigan.
C. Impacts (Factor 3)

 Example 9: I feel it makes those students stand out from the rest of the school because they have to be pulled from their classes. They thoroughly despise the test. One student told me and I quote, "I wish I wasn't Caldean* so I didn't have to take this" and another stated she would have her parents write her a note stating that she was not Hispanic (which she is). Educator 71, Title I teacher, administered Levels I, II, III, & IV.

*Chaldeans speak Aramaic and are from the part of Iraq originally called Mesopotamia. As Christians, they are a minority among Iraqis. Many have immigrated to Michigan.
C. Impacts (Factor 3)

 Example 10: This test took too long to administer and beginning students, who did not even understand the directions, were very frustrated and in some cases crying because they felt so incapable. Educator 109, ELL and resource room teacher, administered Levels II & III.
C. Impacts (Factor 3)

 Example 11: Just that the ELPA is a really bad idea and bad test... we spend the entire year building students confidence in using the English language, and the ELPA successfully spends one week destroying it again. Educator 171, ESL teacher, administered Level IV.
Discussion

 Research question 1: What are educators' opinions about the ELPA and its administration?
 Answers
  Q1.1. Difficult for lower grade students
  Q1.2. Speaking tests are problematic for youngsters
  Q1.3. Logistic problems with the administration
Q1.1. Difficult for lower grade students

The educators may be right. What do we know about young language learners?
 The attention span of young learners in the early years of schooling is short, 10 to 15 minutes.
 They are easily diverted and distracted.
 They may drop out of a task when they find it difficult, though they are often willing to try a task in order to please the teacher (McKay, 2006, p. 6).
Q1.1. Difficult for lower grade students

 Children under 8 are unable to use language to talk about language. (They have no metalanguage.) (McKay, 2006)
 Children do not develop the ability to read silently to themselves until between the ages of 7 and 9 (Puckett & Black, 2000). Before that, children are just starting to understand how writing and reading work.
Q1.1. Difficult for lower grade students

 Children between 5 and 7 are only beginning to develop feelings of independence. They may become anxious when separated from familiar people and places. Having unfamiliar adults administer tests in unfamiliar settings might introduce a testing environment that is not "psychologically safe" (McKay, 2006) for children.
 Children between 5 and 7 are still developing their gross and fine motor skills (McKay, 2006). They may not be able to fill in bubble answer sheets.
Q1.1. Difficult for lower grade students

 Recommendations
  Test administrators should read directions aloud and be allowed to clarify them if necessary.
  Students should be allowed to give verbal answers in the listening and reading subsections.
  Better still would be a test that could be stopped if the educator felt it should stop.
  Computer Adaptive Testing for kids?
 May be extremely problematic…
 http://www.upi.com/blog/2013/03/04/Seattle-teacher-revolt-other-districts-join-fight-against-high-stakes-tests/2741362401320/
Q1.2. Speaking tests are unreliable

 Recommendations
  Stop the standardized, high-stakes, oral proficiency testing of young children.
  If it must be done, use a puppet to reduce test anxiety and stranger-danger.
  Allow non-verbal responses to count as responses.
Q1.3. Effective Administration

What do test logistics have to do with validity?
 A test cannot be reliable if the physical context of the exam is not in order (Brown, 2004). Materials and equipment should be ready, audio should be clear, and the classroom should be quiet, well lit, and a comfortable temperature.
 If a test is not reliable, it is not valid (Bachman, 1990; Brown, 2004; Chapelle, 1999; McNamara, 2000).
Q1.3. Effective Administration

 Recommendations
  Test administrators should be required to flag tests given under unfavorable conditions and provide information to the testing agency about the situation.
  This could spur improvements in the conditions of future test administrations, especially if districts and states were held accountable for providing adequate evidence of fair test conditions.
Discussion

 The unintended effect of reduction in ELL services raises the question of whether the ELPA program in Michigan is feasible (Nevo & Shohamy, 1986).
 Feasibility standards are intended to "ensure that a testing method will be realistic, prudent and frugal" (Shohamy, 2001, p. 152).
 It may be unrealistic to expect ESL teachers with limited resources to administer the existing ELPA within the given time frame.
Discussion

 Language tests can affect students' psychological well-being by mirroring back to students a sense of how society views them and how they should view themselves (Crooks, 1998; Shohamy, 2000, 2001, 2006; Spratt, 1995; Taylor, 1994).
 The test does not recognize the ELLs' heritage languages or cultural identities, perhaps communicating that these are unimportant to their education and thus to society (Schmidt, 2000).
Recommendation 1 of 4

 Collect validity data anonymously from educators immediately after the test is administered.
 Educators can complete a survey about the test logistics, their perceptions of the test, and the test's impacts on the curriculum and students' psyches. Sampling issues can be avoided if all educators are required to complete the surveys. If educators know in advance that they will be asked about the validity of the test, they may feel more ownership over the testing process.
Recommendation 2 of 4

 Use an outside evaluator to construct and conduct the validity survey.
 States and for-profit testing agencies like Harcourt Assessment, Pearson, or any testing body have an incentive to avoid criticizing the NCLB tests they manage. It is better to have a neutral party conduct the survey, summarize the results and present them to the public, and perhaps suggest ways to improve the test (Ryan, 2002).
Recommendation 3 of 4

 Use the validity data to form research questions.
 These questions should be answered by literature reviews and additional data collection. The answers will shed light on why certain aspects of the test went wrong or were viewed negatively by the educators.
Recommendation 4 of 4

 Publish all validity evidence publicly.
 Presenting the data in open forums, on the web, at conferences, and in the test's technical manual will encourage discussion about the uses of the test and the inferences that can be drawn from it.
 Disseminating validity data may also increase trust in the state and in the corporate and non-profit testing agencies it hires.
The way forward in Michigan?

 The ELPA is not feasible. Michigan just joined WIDA.
 The ELPA is being given now (right now!) in Michigan, but this will be the last year of its administration.
 WIDA assessments do not give reading or writing tests to young children still working on pre-literacy skills. (Yeah!)
 Still, the WIDA ELL tests are not normed on like-aged children who are native speakers of English.
 There is much work to be done.
Questions?
THANK YOU!
Dr. Paula Winke, Michigan State University
winke@msu.edu
Clarification on two kinds of scales, with correlated or uncorrelated indices
http://www.stat-help.com/factor.pdf (This is the overview paper I gave out a few weeks ago.)

 1. Effect indicators: the three observed variables "Worth," "Much proud," and "Good as others" are effect indicators. Self-esteem causes the scores on the three indicator variable scales, so we expect the three to correlate. This structure is appropriate for, and modeled with, factor analysis.
 2. Causal indicators: the three indicator variables "Job probs," "Home probs," and "No coping skills" are what cause stress. Note that these three things may not correlate. Principal Components Analysis can be used to investigate this type of structure.
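A brief sketch, with simulated data, of how the two structures differ in practice using scikit-learn: factor analysis for effect indicators that share a latent cause, PCA for indicators that need not correlate. The variable names and data below are illustrative only, not taken from the examples above.

```python
# Hedged sketch contrasting the two structures above with scikit-learn.
# Factor analysis models observed items as effects of a latent cause
# (e.g., self-esteem), while PCA simply summarizes variance and does not
# assume the indicators share a common cause. Data here are simulated.
import numpy as np
from sklearn.decomposition import FactorAnalysis, PCA

rng = np.random.default_rng(3)
n = 300

# Effect indicators: one latent trait drives three correlated items.
latent = rng.normal(size=(n, 1))
effect_items = latent + rng.normal(scale=0.5, size=(n, 3))

# Causal indicators: three largely unrelated stressors.
causal_items = rng.normal(size=(n, 3))

fa = FactorAnalysis(n_components=1).fit(effect_items)
print("FA loadings (effect indicators):", fa.components_.round(2))

pca = PCA(n_components=1).fit(causal_items)
print("PCA loadings (causal indicators):", pca.components_.round(2))
```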
Factor Analysis How-to Papers

 Costello, A. B., & Osborne, J. W. (2005). Best practices in exploratory factor analysis: Four recommendations for getting the most from your analysis. Practical Assessment, Research & Evaluation, 10(7). Available online: http://pareonline.net/getvn.asp?v=10&n=17
 DiStefano, C., Zhu, M., & Mîndrilă, D. (2009). Understanding and using factor scores: Considerations for the applied researcher. Practical Assessment, Research & Evaluation, 14(20). Available online: http://pareonline.net/getvn.asp?v=14&n=20