April - Rachel A. Gordon

Rachel A. Gordon
University of Illinois at Chicago
Kerry G. Hofer
Peabody Research Institute, Vanderbilt University
Presentation in the Presidential Session on Universal Preschool: What Have We Learned, and What Does It Mean
for Practice and Policy? Annual Meeting of the American Educational Research Association (April 6, 2014).
Principal Investigators
 Rachel Gordon
 Kerry Hofer
 Other Investigators
 Sandra Wilson
 Everett Smith
 Graduate Students
 Elisabeth Stewart
 Jenny Kushto-Hoban
 Hillary Rowe
 Anna Colaner

•Graduate Students (cont.)
– Rowena Crabbe
– Fang Peng
– Danny Lambouths
– Ken Fujimoto
•Consultant
– Betsy Becker
•Institute for Education Sciences
Grant #R305A130118
We present some results from our preliminary
investigations in this new project. Although we have
confidence in what is presented here, these analyses
are the first steps towards a more thorough look at
the validity of two quality measures. The results may
change as we move forward, including as we revise
the details of the regression models and the meta-analytic techniques used and as we take the products
through peer review.

Principal Investigator
 Rachel Gordon

Other Investigators
 Everett Smith
 Robert Kaestner
 Sanders Korenman
Graduate Students
Ken Fujimoto
Kristin Abner
Anna Colaner
Nicole Colwell
Xue Wang
IES R305A090065
NIH R01HD060711

Policy initiatives focus on high-quality
preschool…
Source: http://www.whitehouse.gov/issues/education/early-childhood
high quality early childhood education
Source: http://www.whitehouse.gov/issues/education/early-childhood
high-quality early learning programs
Source: http://www.whitehouse.gov/issues/education/early-childhood

This question seems simple at first glance,
but upon reflection is difficult to answer.

In the case of early care and education,
thinking carefully about this question, and
how to answer it, has increasing importance.

With the push toward expanding access to
high quality preschool, measures designed
for other purposes have been adopted for
high stakes use.
http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file
ECERS-R:
Average overall score: At
least 4.5 with no
classroom below a 4.0,
verified by on-site
independent assessment
http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file
CLASS:
Emotional support and
classroom organization
average scores above 5.0
with no classroom below
4.0, as verified by on-site
independent assessment
http://www.excelerateillinois.com/docman/resources/2-gold-excelerate-illinois-chart/file
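The two rules quoted above share one shape: a floor on the program average plus a floor on every classroom. A minimal Python sketch of that logic follows. The function name, the example programs, and their scores are hypothetical and are not taken from the Illinois chart. It also assumes that "at least 4.5" is inclusive and "above 5.0" is strict.

```python
from statistics import mean

def meets_rule(classroom_scores, min_average, min_classroom, strict_average=False):
    """Check an 'average at least X with no classroom below Y' style rule.

    classroom_scores: per-classroom scores on the 1-7 scale (hypothetical data).
    strict_average: if True, the average must be strictly above min_average.
    """
    avg = mean(classroom_scores)
    average_ok = avg > min_average if strict_average else avg >= min_average
    return average_ok and min(classroom_scores) >= min_classroom

# Hypothetical ECERS-R programs: average at least 4.5, no classroom below 4.0.
ecers_program_a = [4.2, 4.6, 5.0]   # average passes, lowest classroom is 4.2
ecers_program_b = [3.8, 5.5, 5.5]   # higher average, but one classroom is below 4.0
ecers_a_ok = meets_rule(ecers_program_a, 4.5, 4.0)
ecers_b_ok = meets_rule(ecers_program_b, 4.5, 4.0)

# Hypothetical CLASS domain scores: average above 5.0, no classroom below 4.0.
class_program = [5.0, 5.0, 5.0]     # average is exactly 5.0, so not "above" 5.0
class_ok = meets_rule(class_program, 5.0, 4.0, strict_average=True)
```

In this sketch, program B fails despite the higher average because a single classroom falls under the floor. That is the kind of cut-point decision for which measurement error near the threshold matters most.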

I will review the case that:
 Both were developed for other purposes.
 Both have limitations for these high stakes uses.

And discuss the rationale for:
 Rethinking how we verify program quality.
 Establishing research-policy-practice partnerships
to support a “next generation” of quality
measures.

I will begin by examining evidence regarding
whether the scales predict child outcomes.

This aspect of validity is relevant to policy, to
the extent that public investments in early
care and education are meant to promote
optimal child development and school
readiness.

There may be many different reasons for these
small associations of ECERS-R and CLASS
scores with child achievement outcomes.

One possibility is low validity in other, more
fundamental features of the measures…
 at least for assessing aspects of quality that
promote school readiness in ways suitable for high
stakes uses.

ECERS-R
 Items mix different aspects of quality.
 Standard scoring makes it difficult to “pull out”
aspects of quality most relevant for school readiness.

CLASS
 Highly inferential rating process may limit inter-rater
reliability.
 Limited empirical evidence for theoretical
dimensions.


Developed in the 1970s from a checklist to help
practitioners improve the quality of their settings.
Reflects the early childhood education field’s
concept of developmentally appropriate practice:
 predominance of child-initiated activities selected from a wide
array of options;
 a “whole child” approach that integrates physical, emotional,
social and cognitive development;
 teacher facilitation of development by being responsive to
children’s age-related and individual needs.
 Standard “stop scoring” structure reflects this checklist, practice and philosophical origin.
Over 400 indicators across the 43 items!
Categories from 1 to 7 have several “indicators”
Conditions in the indicators of lower scores must be met
before indicators of higher scores are evaluated.
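The stop-scoring idea can be sketched in a few lines of Python. This is a simplified illustration, not the manual's scoring procedure. It assumes the common convention that an even score is assigned when all lower-level indicators are met and at least half of the next level's indicators are met; the presentation itself states only that lower-level conditions must be met first. The function name and the example data are hypothetical.

```python
def stop_score(met_by_level):
    """Simplified stop-scoring for a single item.

    met_by_level maps each anchor level (1, 3, 5, 7) to a list of booleans,
    where True means that indicator's requirement is satisfied.
    """
    # Any unmet level-1 requirement stops scoring immediately.
    if not all(met_by_level[1]):
        return 1
    score = 1
    for level in (3, 5, 7):
        met = met_by_level[level]
        if all(met):
            score = level            # level fully met: keep climbing
        elif sum(met) >= len(met) / 2:
            score = level - 1        # at least half met: in-between score, then stop
            break
        else:
            break                    # fewer than half met: stop at the last full level
    return score

# Hypothetical item: every level-7 indicator is met, but most level-5
# indicators are not, so scoring stops before level 7 is ever evaluated.
example = {
    1: [True, True, True],
    3: [True, True, True],
    5: [True, False, False, False],
    7: [True, True, True],
}
example_score = stop_score(example)  # 3
```

The example shows why the standard score can hide information: the classroom meets every level-7 indicator, yet the item score is 3.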

Especially within some items, indicators are often
organized around contexts of practice and
reflect multiple aspects of quality.
Source: Harms, T., Clifford, R.M., & Cryer, D. (1998). Early
Childhood Environment Rating Scale, Revised Edition. New
York, NY: Teachers College Press.

If higher scores reflect higher quality, then
average quality scores should be higher for
centers rated in higher categories versus
lower categories.

In item response theory models, the
thresholds between categories should also
show a stair-step progression, if they are
ordered so that higher categories mark
higher quality.
Source: Gordon, Rachel A., Ken Fujimoto, Robert Kaestner, Sanders Korenman, and Kristin Abner.
2013. “An Assessment of the Validity of the ECERS-R with Implications for Assessments of Child
Care Quality and its Relation to Child Development.” Developmental Psychology, 49: 146-160
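Both checks described above reduce to asking whether a sequence rises strictly from one category to the next. The Python sketch below illustrates this with made-up numbers. The data, the threshold values, and the function names are hypothetical and are not results from the cited study.

```python
from collections import defaultdict
from statistics import mean

def mean_quality_by_category(item_categories, overall_quality):
    """Average an overall quality score within each category of one item."""
    groups = defaultdict(list)
    for category, quality in zip(item_categories, overall_quality):
        groups[category].append(quality)
    return {category: mean(values) for category, values in sorted(groups.items())}

def is_stair_step(values):
    """True if each value is strictly higher than the one before it."""
    return all(lower < higher for lower, higher in zip(values, values[1:]))

# Hypothetical centers: the category each received on one item, and an
# overall quality score for the same center.
item_categories = [1, 1, 3, 3, 5, 5, 7, 7]
overall_quality = [2.0, 3.0, 4.5, 5.5, 4.0, 5.0, 5.5, 6.5]

category_means = mean_quality_by_category(item_categories, overall_quality)
means_ordered = is_stair_step(list(category_means.values()))

# Hypothetical IRT category thresholds, one ordered set and one disordered set.
ordered_thresholds = [-1.2, -0.3, 0.4, 1.1]
disordered_thresholds = [-1.2, 0.4, -0.3, 1.1]
```

In this made-up example, centers rated 5 on the item average lower overall quality than centers rated 3, so the stair-step check fails. That is the pattern that would call the category ordering into question.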
Study Name | Initial Year | Focal Population
3-City Study | 1999 | Low-income families from low-income neighborhoods in Boston, Chicago and San Antonio.
ECLS-B | 2001 | Nationally representative sample drawn from birth records in 46 states.
Early Head Start Research Eval Project | 1996-1998 | New Early Head Start applicants with a child under 12 months of age.
FACES 1997 | 1997 | New Head Start 3- and 4-year-old participants.
FACES 2000 | 2000 | New Head Start 3- and 4-year-old participants.
FACES 2003 | 2003 | New Head Start 3- and 4-year-old participants.
Head Start Impact Study | 2002 | New Head Start 3- and 4-year-old participants.
Fragile Families | 1998-2000 | Birth records sampled from hospitals in twenty large U.S. cities.
PCER | 2003 | Twelve sites implemented curricula in preschool programs. Each site had 14-20 programs.
QUINCE | 2004 | Twenty-four CCR&R agencies in five states (CA, IA, MN, NE, NC).
[Figure: ECERS-R 10: Meals/Snacks. Vertical axis: Difficulty. Horizontal axis: Indicator. Legend: Score 1, Score 3, Score 5, Score 7. Labeled indicators: Sanitary condition; Well-balanced meals; Schedule appropriate; Nonpunitive atmosphere; Sanitary conditions usually maintained; Eat independently; Pleasant atmosphere; Staff sits with children; Conversation; Child-sized utensils; Children help; Acceptable nutritional value; Appropriate meal schedule; Positive atmosphere.]
Source: Gordon, Rachel, Kerry Hofer, Ken Fujimoto, Nicole Colwell, Robert Kaestner, and Sanders Korenman.
“Measuring Aspects of Child Care Quality Specific to Domains of Child Development: An Indicator-level
Analysis of the ECERS-R.” Presented in the Paper Symposium "Measuring Early Care and Education
Quality: New Insights about the Early Childhood Environment System Rating Scale - Revised" (Chair:
Rachel Gordon; Discussant: Margaret Burchinal) (Saturday, April 20, 2013, Seattle, WA).

Unlike the checklist and practice origins of the ECERS-R several
decades ago…
 The CLASS was developed more recently based on “developmental theory and
research suggesting that interactions between students and adults are the primary
mechanism of student development and learning.” (Pianta, La Paro & Hamre, p. 1)
 Its predecessor was part of a research study, and it was aimed at professional
development and coaching use before being adopted in high stakes policy contexts.
 The CLASS manual requires observers to assimilate what they see in order to assign
scores to just a few items.
 The manual advises: “Because of the highly inferential nature of the CLASS, scores
should never be given without referring to the manual.” (Pianta, La Paro & Hamre,
p. 17, bold in original)
Source: Pianta, R. C., La Paro, K. M., & Hamre, B. K. (2008). Classroom Assessment
Scoring System Manual, PreK. Baltimore MD: Brookes Publishing.

A recent publication from the CLASS developers
(Cash, Hamre, Pianta, & Myers, 2012) reveals:
 Exact reliability is low: 41% overall exact agreement
with master score in training of 2,093 Head Start staff.
 Black and Latino raters placed their Instructional
Support scores farther from the master score, as did
raters who disagreed with intentional teaching beliefs.
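For readers less familiar with the statistic, exact agreement is the share of a rater's scores that match the master score exactly. The Python sketch below computes it for made-up data. The scores and the function name are hypothetical. The within-one-point rate is included only for comparison and is an assumption of this sketch, not a figure reported in the cited publication.

```python
def agreement_rates(rater_scores, master_scores):
    """Return (exact agreement, within-one-point agreement) as proportions."""
    if len(rater_scores) != len(master_scores):
        raise ValueError("rater and master score lists must be the same length")
    pairs = list(zip(rater_scores, master_scores))
    exact = sum(r == m for r, m in pairs) / len(pairs)
    within_one = sum(abs(r - m) <= 1 for r, m in pairs) / len(pairs)
    return exact, within_one

# Hypothetical scores for ten observation segments on a 1-7 scale.
master = [5, 4, 6, 3, 5, 2, 6, 4, 5, 3]
rater = [5, 5, 6, 2, 4, 2, 7, 4, 3, 3]

exact_rate, within_one_rate = agreement_rates(rater, master)  # 0.5 and 0.9
```

The gap between the two rates in this example shows how a rater can look reliable under a lenient criterion while still missing the master score half the time. That gap matters when policy cut points are only one point apart.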

The CLASS developers also recently found (Hamre,
Hatfield, Pianta & Jamil, in press):
 a bi-factor structure with one general dimension
(responsive teaching) and two specific dimensions
(proactive management and routines; cognitive
facilitation).
 these differ from the subscales written into policy.

In our work, we are replicating these results, and
also examining the targeting and content of items
with IRT models.

This body of evidence highlights the way
in which measures developed for other
purposes have been adopted for high
stakes policy uses.

Not surprisingly, there are limitations in
the validity of these measures for this
high stakes purpose.

Consistent with the latest Standards for Educational and
Psychological Testing, we contend there is a need to step
back and consider the intents of these policy uses, build in
continuous and local validation of measures selected for
these uses, and allow for the refinement of measures over
place and time.

We need to consider big picture questions like:
 What are the goals of public investments in preschool?
 How do we design quality measures to help assure we
are meeting those specific goals?
http://www.apa.org/science/programs/testing/standards.aspx

As a concrete example, if it is desirable to distinguish
classrooms that fall above and below specific
thresholds, as in current policy uses, then measures with
very high information (and low error) at those
thresholds are needed.
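"High information at a threshold" can be made concrete with the item information function from item response theory. The Python sketch below uses a two-parameter logistic model for yes/no indicators, where an item with discrimination a and difficulty b contributes a^2 * P * (1 - P) at quality level theta. This is a simplification chosen for illustration, since ECERS-R and CLASS items are rated on ordered categories. The cut point and the item parameters are hypothetical.

```python
import math

def item_information(theta, discrimination, difficulty):
    """Information from one two-parameter logistic item at quality level theta."""
    p = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return discrimination ** 2 * p * (1.0 - p)

def total_information(theta, items):
    """Sum of item information over (discrimination, difficulty) pairs."""
    return sum(item_information(theta, a, b) for a, b in items)

def standard_error(theta, items):
    """Approximate standard error of the quality estimate at theta."""
    return 1.0 / math.sqrt(total_information(theta, items))

# Hypothetical policy cut point on the latent quality scale.
cut_point = 0.5

# Two hypothetical five-item measures with equal discrimination:
# one spreads item difficulties widely, one clusters them near the cut point.
spread_items = [(1.0, b) for b in (-2.0, -1.0, 0.0, 1.0, 2.0)]
targeted_items = [(1.0, b) for b in (0.3, 0.4, 0.5, 0.6, 0.7)]

spread_se = standard_error(cut_point, spread_items)
targeted_se = standard_error(cut_point, targeted_items)
```

With these made-up parameters, the targeted measure has the smaller standard error at the cut point. The spread measure would do better at covering a wide continuum, which is the trade-off between the threshold use and the coaching use.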

If instead it is desirable to invest public dollars in
improving quality through coaching, then we would like
to have a measure (or two linked measures) that cover
the continuum of quality over which growth is expected.

As another example, it is essential to think carefully
about variation in quality across children, classrooms,
times of day, days of week, and weeks of year.

We currently have very little evidence about such
variation, or about the extent to which choices about when
and how to observe classrooms affect measure validity,
including for high stakes uses.

Even with these challenges, there is much
potential to build new evidence.