Correlates of Satisfaction with
Graduate School Applicants’ Performance
on the GRE Writing Measure
Donald E. Powers
Mary E. Fowles
GRE Board Report No. 93-18R
November 1997
This report presents the findings of a
research project funded by and carried
out under the auspices of the Graduate
Record Examinations Board.
Educational Testing Service, Princeton, NJ 08541
********************
Researchers are encouraged to express freely their professional
judgment. Therefore, points of view or opinions stated in Graduate
Record Examinations Board Reports do not necessarily represent official
Graduate Record Examinations Board position or policy.
The Graduate Record Examinations Board and Educational Testing Service are
dedicated to the principle of equal opportunity, and their programs,
services, and employment policies are guided by that principle.
EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS,
and GRE are registered trademarks of Educational Testing Service.
Copyright © 1997 by Educational Testing Service. All rights reserved.
Acknowledgments
The authors appreciate the helpful reviews provided by Hunter Breland, Brent
Bridgeman, and Mark Farnum; the data analyses conducted by Patricia O’Reilly; and the
administrative assistance of Ruth Yoder.
Abstract
The focus of the study was the new GRE writing measure and the proposed GRE
scoring guide. The objective was to determine the degree to which the features of student
essays upon which scores will be based are the same features that graduate educators use at
their institutions to evaluate students’ writing. A sample of essays, which had been previously
reviewed and judged by graduate deans and faculty, was rescored several times by trained
readers. Each rescoring was based on a specific trait in the GRE scoring guide -- development
of ideas, sentence structure, and so on. The influence of each feature on the judgments made
by faculty/deans and on the scores assigned by trained readers was compared. No evidence was
uncovered to suggest any differences between graduate deans/faculty and GRE essay readers
with respect to the bases on which they judge essay quality.
Correlates of Satisfaction with Graduate School Applicants’ Performance
on the GRE Writing Measure
The manner in which test scores or other judgments are assigned to test performances is
an integral aspect of test validity. As Messick (1989) has noted, validity is not an intrinsic
property of tests themselves, but rather it is a function of how test scores are interpreted and
used. To ensure the validity (and maximize the utility) of test scores for a particular purpose,
the methods and criteria used to evaluate test performances should reflect the values held by
potential users of test scores. This assertion applies both to traditional multiple-choice measures
and to assessments for which judgmental scoring is employed, as is the case with the evaluation
of writing ability.
At its January 1993 meeting, the GRE Board tentatively approved the addition of a
writing measure to the GRE testing program, along with a draft set of integrated criteria
(holistic scoring guide) for judging candidates’ essays. Earlier, as part of a survey to determine
possible interest in a writing measure, graduate deans and faculty reviewed the proposed scoring
guide to determine its relevance to the writing proficiency expected of graduate students at their
institutions. Eighty percent (80%) of the respondents indicated that they were either somewhat
or very satisfied that the scoring guide addressed the writing skills required of first-year graduate
students. In addition, the degree to which graduate deans and faculty were satisfied with the
quality of writing exhibited in a sample of student essays corresponded reasonably strongly with
the scores that had been assigned by trained essay readers (Powers, Fowles, & Willard, 1994).
These results are discussed in more detail below. Thus, the criteria in the proposed scoring
guide, as well as the scores themselves, appear to reflect the values of GRE constituents
reasonably well, at least at a global level. The study reported here was designed to determine
the degree to which these values are also reflected at a more specific level.
The primary objective was to examine the possible influence that particular features of
student essays might have on judgments of writing proficiency. By comparing the relative
influence of these features on ratings of writing quality as reflected in (a) holistic (global) scores
assigned by trained essay readers and (b) the general satisfaction of graduate faculty with the
student essays, we hoped to determine with greater precision the extent to which the proposed
scoring guide does in fact reflect the values of graduate educators.
Related Research
The influence of numerous specific characteristics of essays could have been studied.
Breland and Jones (1982), for example, investigated the influence of 20 essay characteristics on
scores assigned for the College Board’s English Composition Achievement Test. They found
that certain characteristics of discourse (organization, transition, use of supporting evidence, and
originality of ideas) related more highly to the scores awarded by readers than did a variety of
syntactic and lexical characteristics (e.g., subject-verb agreement, punctuation, and pronoun
usage).
Bridgeman and Carlson (1984) asked graduate faculty in several disciplines to rate the
importance of selected features of writing. They found that such traits as the quality of content,
the development of ideas, and the extent to which essays addressed the topic were regarded as
more important than were such features as vocabulary size, punctuation/spelling, and sentence
structure.
Carlson, Bridgeman, Camp, and Waanders (1985) found that holistic scores, discourse-level
scores, and sentence-level scores were all very highly related for a sample of non-native
undergraduate applicants. They also found that the scores assigned by English writing experts,
English-as-a-second-language specialists, and subject matter professors were all strongly
correlated, suggesting that these different groups probably based their evaluations on many of
the same factors.
Breland (1991) investigated the relationship of scores on the College Board Advanced
Placement history exams to response characteristics that were regarded as directly relevant (e.g.,
historical content), indirectly relevant (e.g., number of words written and composition quality),
and irrelevant (e.g., neatness and handwriting quality). The correlation of these features with
scores differed somewhat according to the content of the examination (European vs. U.S.
history) and its format (e.g., whether or not it was document based).
A number of researchers (see, for example, Miller & Crocker, 1990, for a summary) have
investigated the relationship between holistic scores and more molecular or analytic scores. For
instance, Olson and Swadener (1984) found that four analytic scores explained more than 60%
of the variation in holistic scores. But Moss, Cole, and Khampalikit (1982) found relatively low
correlations between analytic and holistic scores, due in large measure to the low reliabilities of
the analytical scores. Prater and Padia (1983) found somewhat higher correlations (in the .70s).
The studies cited here were primarily concerned with the relationships between scores assigned
by trained essay readers, not with the features on which test users base their judgments of
writing quality.
Survey of Graduate Faculty and Deans
As mentioned earlier, in order to ascertain interest in the assessment of prospective
graduate students’ writing skills, graduate school personnel were surveyed at some 115 graduate
institutions in October 1992. Questionnaires were sent to graduate school deans and faculty,
along with the essay prompts, proposed scoring guide, and a sample of student essays
representing different levels of quality on a 1-to-6 holistic score scale.¹ The survey sample was
comprised of graduate department chairs and faculty (about two thirds) and graduate deans
(about one third), who represented the sciences (biology, chemistry, computer science, and
electrical engineering), the social sciences (economics, political science, and psychology), and the
humanities/other (communications, history, English, and fine arts) from each of six major
Carnegie classifications of the academic institutions (Research I and II, Doctoral-granting I and
II, and Comprehensive I and II).
Three survey forms were developed, each containing six different student responses so as
to counterbalance essay quality, topic content, and order of presentation, as shown in Table 1.
Essays representing the lowest score level possible, “1,” were not included, as these occurred very
infrequently (about 1% of the time) in the pilot sample.
¹ These sample essays were obtained earlier from first-year graduate students and third- and
fourth-year undergraduates, who had one hour to compose (draft, write, and revise) an essay.
Prompts presented a general issue or a complex idea and asked students to develop their views
on the issue or idea, using reasons and examples drawn from their reading, observations, or
experiences. The essays were then scored twice holistically, according to criteria presented in
the scoring guide, at a centralized scoring session. Scorers were university faculty from a variety
of disciplines (the sciences, the social sciences, and the humanities) and from a variety of
colleges and universities, all trained to apply the same scoring criteria.
Table 1
Order of Presentation of Student Essays in Three Survey Forms

Order of essay  Survey Form A                   Survey Form B                   Survey Form C
in booklet      Content         Holistic score  Content         Holistic score  Content         Holistic score
1st             Science         4               Social Science  5               Humanities      6
2nd             Science         5               Social Science  2               Humanities      3
3rd             Humanities      5               Science         3               Social Science  4
4th             Humanities      2               Science         6               Social Science  5
5th             Social Science  3               Humanities      5               Science         5
6th             Social Science  6               Humanities      4               Science         2
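The counterbalancing in Table 1 can be checked mechanically. The following is a minimal sketch, not part of the original study: it transcribes the table as (prompt, holistic score) pairs and confirms that each form, and each prompt pooled over forms, covers the same set of score levels. The names `FORMS`, `score_profile`, `prompt_counts`, and `scores_by_prompt` are illustrative.

```python
from collections import Counter

# Table 1 transcribed as (prompt, holistic score) pairs in booklet order.
FORMS = {
    "A": [("Science", 4), ("Science", 5), ("Humanities", 5),
          ("Humanities", 2), ("Social Science", 3), ("Social Science", 6)],
    "B": [("Social Science", 5), ("Social Science", 2), ("Science", 3),
          ("Science", 6), ("Humanities", 5), ("Humanities", 4)],
    "C": [("Humanities", 6), ("Humanities", 3), ("Social Science", 4),
          ("Social Science", 5), ("Science", 5), ("Science", 2)],
}


def score_profile(essays):
    """Sorted holistic scores for a list of (prompt, score) pairs."""
    return sorted(score for _, score in essays)


def prompt_counts(essays):
    """Number of essays per prompt within one form."""
    return Counter(prompt for prompt, _ in essays)


def scores_by_prompt(forms):
    """Pool all forms and collect the sorted scores seen for each prompt."""
    pooled = {}
    for essays in forms.values():
        for prompt, score in essays:
            pooled.setdefault(prompt, []).append(score)
    return {prompt: sorted(scores) for prompt, scores in pooled.items()}


if __name__ == "__main__":
    for name, essays in FORMS.items():
        print(name, score_profile(essays), dict(prompt_counts(essays)))
    print(scores_by_prompt(FORMS))
```

Under this transcription, every form contains two essays per prompt and one essay at each of the score levels 2, 3, 4, and 6, plus two at level 5; the same holds for each prompt across the three forms.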
Survey respondents were asked the following question: How satisfied would you be with
the general writing ability of applicants to your program/institution who wrote each [of the
following] essays?
Responses were recorded on a scale with the following values: (5) very satisfied,
(4) somewhat satisfied, (3) neither satisfied nor dissatisfied, (2) somewhat dissatisfied, (1) very
dissatisfied. The scores that were previously assigned to each essay were not revealed to survey
respondents.
The three prompts were as follows:
Science
“Most of the environmental problems we face are due to advances in technology,
yet further advances in technology are necessary to solve these problems.”
To what extent do you agree or disagree with this view? Present one or two
examples of environmental problems that either support or refute the above
statement and discuss what you think could be solutions to the problems you
describe.
Social Science
“A society that tries to maximize both the liberty and equality of its citizens is
attempting the impossible. Steps taken to ensure liberty will surely create
inequality and steps taken to promote equality will be at the expense of liberty.”
To what extent do you agree or disagree with the author’s position that there is
tension between liberty and equality? Support your views with specific examples
from your knowledge of history and/or current events.
Humanities
Describe a particular theory that has played or still plays a significant role in an
area of the humanities that interests you. Using specific examples to support your
position, explain how that theory has influenced the field.
Holistic score scale values were described in the draft scoring guide. Each score level
from 1 (low) to 6 (high) was introduced with a general statement of proficiency as follows:
6 - Presents a thorough and insightful response to the question and demonstrates
mastery of the elements of effective writing.
5 - Presents a well-developed response to the question and demonstrates a strong control
of the elements of effective writing.
4 - Presents a clearly competent response to the question and demonstrates
adequate control of the elements of writing.
3 - Presents a limited response to the question, either in content or in control of
the elements of writing.
2 - Presents a weak response to the question and demonstrates little control of
the elements of writing.
1 - Is seriously deficient in writing skills.
In addition to these general characterizations, more detailed characteristics of essays at
each level appeared in the scoring guide. For example, an essay in category 6 was described as
exhibiting all or most of the following characteristics: demonstrates a substantial understanding
of the subject discussed; effectively supports ideas with well-chosen, accurate examples; sustains
a well-focused discussion of the subject; expresses ideas with clarity and precision; uses language
fluently and effectively, with varied sentence structure and vocabulary appropriate to the subject.
An essay assigned to category 2, on the other hand, was described as having one or more of the
following characteristics: may display a poor understanding of the task and/or the subject
discussed; may provide little development of ideas and lack relevant examples; may present an
unfocused discussion of the topic; may have serious problems in expressing ideas clearly; may
have serious and frequent problems in syntax and vocabulary.
Survey Results
A total of 347 responded to the survey, and 231 of these respondents provided ratings of
the 18 sample essays included in survey forms A, B, and C. As stated earlier, 80% of the
respondents indicated that they were either somewhat or very satisfied that the scoring guide
addressed the writing skills required of first-year graduate students. There was, however, also a
contrary opinion, as expressed by one faculty member, who felt that the kind of specialized
(“arcane”) writing style thought to be typical in the discipline (psychology) was not very well
reflected in either the kinds of essay prompts or the type of scoring to be used for the new
writing measure.
Faculty and deans also made numerous suggestions about what the measure should (and
should not) reflect. Mentioned were such characteristics as fluency, sentence structure,
grammar, spelling, punctuation, ease of reading, level of maturity, clarity, expression, style,
ability to cite relevant literature, semantics, syntax, “historical imagination,” logic, organization,
and parsimony. Many of these characteristics either appear or are implied in the proposed
GRE scoring guide. In short, respondents had definite views about what they value in students’
writing.
As mentioned previously, responses also revealed a significant correspondence between
levels of satisfaction with student essays and the scores previously assigned to them at the
scoring session. Table 2 shows, by content of essay topic, the percentages of respondents who
were (a) very satisfied, (b) either very satisfied or somewhat satisfied, (c) very dissatisfied, and
(d) either very dissatisfied or somewhat dissatisfied. Table 2 shows a relatively sharp increase in
satisfaction and an equally steep decrease in dissatisfaction as a function of essay quality/score.
Thus, essay scores based on the proposed scoring guide exhibit a relatively strong
correspondence with overall faculty satisfaction. To reiterate, the objective of the study
described here was to explore the extent to which specific features of essays are related to
faculty satisfaction.
Method
Each of the 15 essays rated previously (for satisfaction) by graduate faculty and deans
was rescored by four trained readers on each of five qualities, or traits, mentioned in the
proposed GRE scoring guide:
(1) development/support of ideas with well-chosen, relevant examples
(2) ability to present/sustain a focused discussion of the topic
(3) ability to express ideas clearly
(4) ability to use language fluently, with varied sentence structure and appropriate
vocabulary
(5) ability to apply the elements of effective writing (grammar, punctuation, spelling, and
capitalization)
A sixth feature (understanding of the subject discussed) was originally specified in the guide.
However, even after training, readers felt that they could not clearly distinguish between
understanding of the subject and the development/support of ideas. The two qualities were
therefore merged.
A 1-to-6 scale, similar to that used for overall scoring, was used for each feature. In
addition, scorers were asked to identify any other characteristics that affected, either positively
or negatively, their judgments of each essay. Readers’ training consisted of discussing the
presence of each trait in benchmark essays at each score level and then scoring seven “practice”
essays. Ratings of specific essay features were obtained during a one-day scoring session.
Holistic ratings were provided by the same readers approximately three weeks later by mail.
Table 2
Percentages of Survey Respondents Who Indicated Various Levels of Satisfaction
with Student Performance, by Essay Prompt and Essay Score

                                Essay Score
Essay Prompt             2     3     4     5     6
Very Satisfied
  Science                -     8    17    25    47
  Humanities             -     4    15    15    48
  Social Science         -     0    14    41    44
  Total                  -     4    15    27    46
Somewhat or Very Satisfied
  Science                -    26    48    71    85
  Humanities             -    28    57    55    87
  Social Science         -     9    30    67    88
  Total                  -    22    45    64    86
Very Dissatisfied
  Science               84     7     3     4     -
  Humanities            71    13     5     2     -
  Social Science        64    25    14     2     -
  Total                 74    15     7     3     -
Somewhat or Very Dissatisfied
  Science               92    49    16    11     -
  Humanities            89    54    20    22     -
  Social Science        85    74    43    14     -
  Total                 89    58    27    15     -
Note. For each prompt, entries are based on 66-82 responses for essays scored as 2, 3, 4, and 6.
Because twice as many essays scored as 5 were judged, the number of respondents (145-155) was
approximately doubled for these essays. Entries for totals are based on approximately three
times these numbers. The total number of respondents who provided ratings was 231.
Source. Powers, D. E., Fowles, M., & Willard, A. (1994).
Results
The first finding, on which all the other results turn, was that the ratings among the five
specific essay features were all very highly correlated for each reader. The median correlations
among features for the four readers were AS, .85, .89, and .89. The lowest single correlation
between any two traits for any reader was .72. The correlations among mean ratings² (averaged
over all four readers) of essay features are shown in Table 3. It is clear from these correlations
either that these features occurred “hand in glove” in the essays we studied or that readers could
not distinguish among the features. The latter interpretation was reinforced by comments from
both the readers and the trainers, who reported that the readers (all highly experienced in
holistic scoring procedures) had a difficult time “deholisticizing” themselves, that is, thinking in
terms of specific essay features.
Table 3
Correlations Among Mean Ratings of Traits (Over Four Readers)

Variable     M     SD    Mechanics  Fluency  Ideas  Focus
Mechanics    4.1   1.3
Fluency      4.0   1.3   .92
Ideas        4.1   1.5   .89        .94
Focus        4.0   1.5   .90        .90      .92
Clarity      3.9   1.5   .93        .97      .95    .93
The necessary consequence of the strong intercorrelations among essay features was
that none of the individual features was differentially related to the holistic scores the readers
had independently assigned to the essays (Table 4). No consistent tendency was discernible for
any of the features to relate more strongly than others to holistic scores: all of the features
correlated about equally strongly with holistic scores. (Table 4 also reveals that the scores
assigned when essays were first collected in early tryouts correlated substantially, for each essay
reader, with the holistic scores given in the current study -- from .82 to .95.)
² We digress here to emphasize the point that correlations among means are typically substantially higher
than those among individual observations. This fact undoubtedly accounts, at least in part, for the very high
correlations reported here.
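The point made in this footnote can be illustrated with a small simulation. The sketch below is not drawn from the study's data. It assumes a hypothetical model in which each essay has a shared quality level plus a trait-specific component, and each of four readers adds independent rating noise. All parameter values (`noise_sd`, the 0.5 trait spread, the number of essays) are arbitrary choices for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate(n_essays=2000, n_readers=4, noise_sd=0.8, seed=0):
    """Return (single-reader r, mean-over-readers r) for two hypothetical traits."""
    rng = random.Random(seed)
    trait_a, trait_b = [], []
    for _ in range(n_essays):
        quality = rng.gauss(0, 1)              # shared essay quality
        true_a = quality + rng.gauss(0, 0.5)   # trait-specific component
        true_b = quality + rng.gauss(0, 0.5)
        # Each reader sees the true trait level plus independent rating noise.
        trait_a.append([true_a + rng.gauss(0, noise_sd) for _ in range(n_readers)])
        trait_b.append([true_b + rng.gauss(0, noise_sd) for _ in range(n_readers)])
    # Correlation between traits using one reader's ratings only.
    single = pearson([r[0] for r in trait_a], [r[0] for r in trait_b])
    # Correlation between traits using ratings averaged over all readers.
    averaged = pearson([mean(r) for r in trait_a], [mean(r) for r in trait_b])
    return single, averaged


if __name__ == "__main__":
    single, averaged = simulate()
    print(f"single reader: {single:.2f}  mean of readers: {averaged:.2f}")
```

Averaging over readers cancels part of the rating noise, so under these assumptions the correlation between mean ratings should exceed the correlation based on any single reader's ratings.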
Table 4
Correlations of Holistic Scores with Essay Traits (by Reader)

                  Reader
Trait             R1     R2     R3     R4     Mean R1-R4
Mechanics         .88    .93    .91    .83    .96
Fluency           .93    .89    .97    .84    .96
Ideas             .88    .92    .92    .77    .94
Focus             .84    .99    .92    .82    .95
Clarity           .91    .96    .97    .89    .97
Previous Score    .95    .92    .95    .82    .95
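One way to inspect the per-reader pattern in Table 4 is to list, for each reader, which trait correlates most strongly with that reader's holistic scores and how far apart the five trait correlations are. This is only an illustrative sketch using the values printed in Table 4; the names `TABLE_4`, `top_traits`, and `spread` are not from the report.

```python
# Table 4 correlations of holistic scores with trait ratings, columns R1..R4.
TABLE_4 = {
    "Mechanics": [0.88, 0.93, 0.91, 0.83],
    "Fluency":   [0.93, 0.89, 0.97, 0.84],
    "Ideas":     [0.88, 0.92, 0.92, 0.77],
    "Focus":     [0.84, 0.99, 0.92, 0.82],
    "Clarity":   [0.91, 0.96, 0.97, 0.89],
}


def top_traits(table, reader_index):
    """Traits tied for the highest correlation for one reader, alphabetically."""
    best = max(values[reader_index] for values in table.values())
    return sorted(t for t, v in table.items() if v[reader_index] == best)


def spread(table, reader_index):
    """Difference between the largest and smallest trait correlation for one reader."""
    column = [values[reader_index] for values in table.values()]
    return round(max(column) - min(column), 2)


if __name__ == "__main__":
    for i in range(4):
        print(f"R{i + 1}: top = {top_traits(TABLE_4, i)}, spread = {spread(TABLE_4, i)}")
```

The listing makes it easy to see how the top-ranked trait and the spread of correlations vary from reader to reader. With only 15 essays, small differences between correlations of this size should be read cautiously.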
Table 5 shows that readers’ ratings of essay characteristics were very highly correlated
with the mean satisfaction ratings by graduate faculty. All correlations were in the .80s or .90s.
There was no discernible pattern suggesting that any particular essay features were more highly
related than others to faculty satisfaction.
Table 5
Correlations of Mean Faculty Satisfaction Ratings with
Ratings of Essay Traits and with Holistic Scores (by Reader)

                  Reader
Trait             R1     R2     R3     R4     Mean R1-R4
Mechanics         .84    .94    .84    .91    .94
Fluency           .92    .81    .93    .87    .92
Ideas             .83    .83    .86    .91    .88
Focus             .82    .90    .85    .87    .89
Clarity           .86    .89    .91    .93    .92
Previous Score    .90    .92    .93    .86    .95
Discussion
It was envisioned that this study would help to establish the relative importance that
graduate deans/faculty place on each of the individual essay characteristics upon which scoring
will be based. This information was foreseen as potentially useful for establishing the
appropriate relative emphases of particular features in the operational scoring of essays and,
more specifically, for informing the wording/formatting of the final scoring guide and the
training of essay readers.
The results suggest that readers (at least the four typical readers used in this study) have
incorporated one or more of the specific characteristics mentioned in the scoring guide (or some
other co-occurring features) into the global scores they assign to essays. These same specific
features were also shown to correlate very strongly with judgments of satisfaction of writing
quality that were made previously by graduate deans and faculty. One plausible interpretation
of the latter result is that graduate school personnel also base their assessments of writing
quality largely on the same features that have been designated as the bases for scoring the GRE
writing test. If the qualities are not the same, at least they are properties that are highly
related.
Several limitations of this study should be mentioned. First, the conclusions are based
on a relatively small (though reasonably representative) sampling of essays, which necessarily
precluded the inclusion of all subgroups of test takers (e.g., international students). Also,
because the distribution of essays was more nearly rectangular than would typically be observed
in practice, the interreader agreement estimates may be somewhat inflated. Interreader
agreement may also have been somewhat higher than usual because any essays that tended to
generate controversy among readers were not included in the sample. Second, no consideration
was given to the possibility of different points of view among graduate school personnel
regarding the bases on which satisfaction with student writing is based. That is, only “average
satisfaction” over all graduate faculty respondents was appraised. It is possible, though given the
high correlations among features perhaps unlikely, that distinct points of view could have been
identified and that faculty holding each view might have weighed the individual features
differently. Finally, our results are indirect in the sense that we did not actually ask graduate
school personnel for their ratings of specific features of essays, but only for their global
assessments, which we related to judgments of particular characteristics as made by our readers.
Had we observed low correlations among faculty judgments, readers’ holistic ratings, and
readers’ ratings of specific features, several alternative interpretations would have been
plausible. These alternatives are not, however, logically consistent with the high correlations
noted here.
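The caution about a rectangular score distribution can be illustrated with a short simulation. The sketch below is hypothetical and uses no data from the study. It assumes two readers who each rate an essay's true level (2 to 6) with independent noise, and it compares a uniform (rectangular) distribution of true levels with a distribution peaked at the middle. The weights and `noise_sd` are arbitrary illustrative values.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def interreader_r(weights, n_essays=5000, noise_sd=0.7, seed=1):
    """Correlation between two simulated readers for a given true-score mix."""
    rng = random.Random(seed)
    levels = [2, 3, 4, 5, 6]
    # Draw each essay's true level according to the supplied weights.
    true_scores = rng.choices(levels, weights=weights, k=n_essays)
    reader_1 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    reader_2 = [t + rng.gauss(0, noise_sd) for t in true_scores]
    return pearson(reader_1, reader_2)


if __name__ == "__main__":
    rectangular = interreader_r([1, 1, 1, 1, 1])   # equal numbers at each level
    peaked = interreader_r([1, 4, 6, 4, 1])        # most essays near the middle
    print(f"rectangular: {rectangular:.2f}  peaked: {peaked:.2f}")
```

With the same amount of rating noise, the rectangular mix spreads true scores more widely, so the simulated readers should agree more closely in correlational terms than they do for the peaked mix.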
Despite the limitations of the study, we have some confidence in our conclusion about
the particular qualities upon which GRE essay scoring will be based: they have very nearly the
same relative association with (and likely influence on) the scores assigned by trained essay
readers as they do with the degree of satisfaction expressed by graduate faculty. This result
provides, we believe, a good beginning to accumulating the kind of evidence needed to establish
the validity of the GRE writing measure for its intended purpose.
References
Breland, H. M. (1991). A study of gender and performance on Advanced Placement History
Examinations (College Board Report No. 91-4, ETS Research Report No. 91-16). New
York: College Entrance Examination Board.
Breland, H. M., & Jones, R. J. (1982). Perceptions of writing skills (College Board Report No.
82-4, ETS Research Report No. 82-47). New York: College Entrance Examination
Board.
Bridgeman, B., & Carlson, S. B. (1984). Survey of academic writing tasks. Written
Communication, 1, 247-280.
Carlson, S. B., Bridgeman, B. B., Camp, R., & Waanders, J. (1985). Relationship of admission
test scores to writing performances of native and nonnative speakers of English (TOEFL
Research Report 19, ETS RR-85-21). Princeton, NJ: Educational Testing Service.
Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). New
York: American Council on Education/Macmillan Publishing Company, pp. 13-103.
Miller, M. D., & Crocker, L. (1990). Validation methods for direct writing assessment. Applied
Measurement in Education, 3, 285-296.
Moss, P. H., Cole, N. S., & Khampalikit, C. (1982). Comparison of procedures to assess written
language skills at grades 4, 7, and 10. Journal of Educational Measurement, 19, 37-47.
Olson, M. C., & Swadener, M. (1984). Establishing and implementing Colorado’s Writing
Assessment Program. English Education, 14, 208-219.
Powers, D. E., Fowles, M., & Willard, A. (1994). Direct assessment, direct validation? An
example from the assessment of writing. Educational Assessment, 2, 89-100.
Prater, D., & Padia, W. (1983). Developing parallel holistic and analytic scoring guides for
assessing elementary writing samples. Journal of Research and Development in
Education, 17, 20-24.