Correlates of Satisfaction with Graduate School Applicants’ Performance on the GRE Writing Measure

Donald E. Powers
Mary E. Fowles

GRE Board Report No. 93-18R
November 1997

This report presents the findings of a research project funded by and carried out under the auspices of the Graduate Record Examinations Board.

Educational Testing Service, Princeton, NJ 08541

********************

Researchers are encouraged to express freely their professional judgment. Therefore, points of view or opinions stated in Graduate Record Examinations Board Reports do not necessarily represent official Graduate Record Examinations Board position or policy.

The Graduate Record Examinations Board and Educational Testing Service are dedicated to the principle of equal opportunity, and their programs, services, and employment policies are guided by that principle.

EDUCATIONAL TESTING SERVICE, ETS, the ETS logo, GRADUATE RECORD EXAMINATIONS, and GRE are registered trademarks of Educational Testing Service.

Copyright © 1997 by Educational Testing Service. All rights reserved.

Acknowledgments

The authors appreciate the helpful reviews provided by Hunter Breland, Brent Bridgeman, and Mark Farnum; the data analyses conducted by Patricia O’Reilly; and the administrative assistance of Ruth Yoder.

Abstract

The focus of the study was the new GRE writing measure and the proposed GRE scoring guide. The objective was to determine the degree to which the features of student essays upon which scores will be based are the same features that graduate educators use at their institutions to evaluate students’ writing. A sample of essays, which had been previously reviewed and judged by graduate deans and faculty, was rescored several times by trained readers. Each rescoring was based on a specific trait in the GRE scoring guide -- development of ideas, sentence structure, and so on. The influence of each feature on the judgments made by faculty/deans and on the scores assigned by trained readers was compared.
No evidence was uncovered to suggest any differences between graduate deans/faculty and GRE essay readers with respect to the bases on which they judge essay quality.

Correlates of Satisfaction with Graduate School Applicants’ Performance on the GRE Writing Measure

The manner in which test scores or other judgments are assigned to test performances is an integral aspect of test validity. As Messick (1989) has noted, validity is not an intrinsic property of tests themselves, but rather it is a function of how test scores are interpreted and used. To ensure the validity (and maximize the utility) of test scores for a particular purpose, the methods and criteria used to evaluate test performances should reflect the values held by potential users of test scores. This assertion applies both to traditional multiple-choice measures and to assessments for which judgmental scoring is employed, as is the case with the evaluation of writing ability.

At its January 1993 meeting, the GRE Board tentatively approved the addition of a writing measure to the GRE testing program, along with a draft set of integrated criteria (holistic scoring guide) for judging candidates’ essays. Earlier, as part of a survey to determine possible interest in a writing measure, graduate deans and faculty reviewed the proposed scoring guide to determine its relevance to the writing proficiency expected of graduate students at their institutions. Eighty percent (80%) of the respondents indicated that they were either somewhat or very satisfied that the scoring guide addressed the writing skills required of first-year graduate students. In addition, the degree to which graduate deans and faculty were satisfied with the quality of writing exhibited in a sample of student essays corresponded reasonably strongly with the scores that had been assigned by trained essay readers (Powers, Fowles, & Willard, 1994). These results are discussed in more detail below.
Thus, the criteria in the proposed scoring guide, as well as the scores themselves, appear to reflect the values of GRE constituents reasonably well, at least at a global level. The study reported here was designed to determine the degree to which these values are also reflected at a more specific level. The primary objective was to examine the possible influence that particular features of student essays might have on judgments of writing proficiency. By comparing the relative influence of these features on ratings of writing quality as reflected in (a) holistic (global) scores assigned by trained essay readers and (b) the general satisfaction of graduate faculty with the student essays, we hoped to determine with greater precision the extent to which the proposed scoring guide does in fact reflect the values of graduate educators.

Related Research

The influence of numerous specific characteristics of essays could have been studied. Breland and Jones (1982), for example, investigated the influence of 20 essay characteristics on scores assigned for the College Board’s English Composition Achievement Test. They found that certain characteristics of discourse (organization, transition, use of supporting evidence, and originality of ideas) related more highly to the scores awarded by readers than did a variety of syntactic and lexical characteristics (e.g., subject-verb agreement, punctuation, and pronoun usage).

Bridgeman and Carlson (1984) asked graduate faculty in several disciplines to rate the importance of selected features of writing. They found that such traits as the quality of content, the development of ideas, and the extent to which essays addressed the topic were regarded as more important than were such features as vocabulary size, punctuation/spelling, and sentence structure.

Carlson, Bridgeman, Camp, and Waanders (1985) found that holistic scores, discourse-level scores, and sentence-level scores were all very highly related for a sample of non-native undergraduate applicants.
They also found that the scores assigned by English writing experts, English-as-a-second-language specialists, and subject matter professors were all strongly correlated, suggesting that these different groups probably based their evaluations on many of the same factors.

Breland (1991) investigated the relationship of scores on the College Board Advanced Placement history exams to response characteristics that were regarded as directly relevant (e.g., historical content), indirectly relevant (e.g., number of words written and composition quality), and irrelevant (e.g., neatness and handwriting quality). The correlation of these features with scores differed somewhat according to the content of the examination (European vs. U.S. history) and its format (e.g., whether or not it was document based).

A number of researchers (see, for example, Miller & Crocker, 1990, for a summary) have investigated the relationship between holistic scores and more molecular or analytic scores. For instance, Olson and Swadener (1984) found that four analytic scores explained more than 60% of the variation in holistic scores. But Moss, Cole, and Khampalikit (1982) found relatively low correlations between analytic and holistic scores, due in large measure to the low reliabilities of the analytic scores. Prater and Padia (1983) found somewhat higher correlations (in the .70s).

The studies cited here were primarily concerned with the relationships between scores assigned by trained essay readers, not with the features on which test users base their judgments of writing quality.

Survey of Graduate Faculty and Deans

As mentioned earlier, in order to ascertain interest in the assessment of prospective graduate students’ writing skills, graduate school personnel were surveyed at some 115 graduate institutions in October 1992.
Questionnaires were sent to graduate school deans and faculty, along with the essay prompts, proposed scoring guide, and a sample of student essays representing different levels of quality on a 1-to-6 holistic score scale.¹ The survey sample was comprised of graduate department chairs and faculty (about two thirds) and graduate deans (about one third), who represented the sciences (biology, chemistry, computer science, and electrical engineering), the social sciences (economics, political science, and psychology), and the humanities/other (communications, history, English, and fine arts) from each of six major Carnegie classifications of the academic institutions (Research I and II, Doctoral-granting I and II, and Comprehensive I and II). Three survey forms were developed, each containing six different student responses so as to counterbalance essay quality, topic content, and order of presentation, as shown in Table 1. Essays representing the lowest score level possible, “1,” were not included, as these occurred very infrequently (about 1% of the time) in the pilot sample.

¹ These sample essays were obtained earlier from first-year graduate students and third- and fourth-year undergraduates, who had one hour to compose (draft, write, and revise) an essay. Prompts presented a general issue or a complex idea and asked students to develop their views on the issue or idea, using reasons and examples drawn from their reading, observations, or experiences. The essays were then scored twice holistically, according to criteria presented in the scoring guide, at a centralized scoring session. Scorers were university faculty from a variety of disciplines (the sciences, the social sciences, and the humanities) and from a variety of colleges and universities, all trained to apply the same scoring criteria.

Table 1
Order of Presentation of Student Essays in Three Survey Forms

Order of essay   Survey Form A                     Survey Form B                     Survey Form C
in booklet       Content          Holistic score   Content          Holistic score   Content          Holistic score
1st              Science          4                Social Science   5                Humanities       6
2nd              Science          5                Social Science   2                Humanities       3
3rd              Humanities       5                Science          3                Social Science   4
4th              Humanities       2                Science          6                Social Science   5
5th              Social Science   3                Humanities       5                Science          5
6th              Social Science   6                Humanities       4                Science          2

Survey respondents were asked the following question:

How satisfied would you be with the general writing ability of applicants to your program/institution who wrote each [of the following] essays?

Responses were recorded on a scale with the following values: (5) very satisfied, (4) somewhat satisfied, (3) neither satisfied nor dissatisfied, (2) somewhat dissatisfied, (1) very dissatisfied. The scores that were previously assigned to each essay were not revealed to survey respondents.

The three prompts were as follows:

Science

“Most of the environmental problems we face are due to advances in technology, yet further advances in technology are necessary to solve these problems.” To what extent do you agree or disagree with this view? Present one or two examples of environmental problems that either support or refute the above statement and discuss what you think could be solutions to the problems you describe.

Social Science

“A society that tries to maximize both the liberty and equality of its citizens is attempting the impossible. Steps taken to ensure liberty will surely create inequality and steps taken to promote equality will be at the expense of liberty.” To what extent do you agree or disagree with the author’s position that there is tension between liberty and equality? Support your views with specific examples from your knowledge of history and/or current events.
Humanities

Describe a particular theory that has played or still plays a significant role in an area of the humanities that interests you. Using specific examples to support your position, explain how that theory has influenced the field.

Holistic score scale values were described in the draft scoring guide. Each score level from 1 (low) to 6 (high) was introduced with a general statement of proficiency as follows:

6 - Presents a thorough and insightful response to the question and demonstrates mastery of the elements of effective writing.
5 - Presents a well-developed response to the question and demonstrates a strong control of the elements of effective writing.
4 - Presents a clearly competent response to the question and demonstrates adequate control of the elements of writing.
3 - Presents a limited response to the question, either in content or in control of the elements of writing.
2 - Presents a weak response to the question and demonstrates little control of the elements of writing.
1 - Is seriously deficient in writing skills.

In addition to these general characterizations, more detailed characteristics of essays at each level appeared in the scoring guide. For example, an essay in category 6 was described as exhibiting all or most of the following characteristics: demonstrates a substantial understanding of the subject discussed; effectively supports ideas with well-chosen, accurate examples; sustains a well-focused discussion of the subject; expresses ideas with clarity and precision; uses language fluently and effectively, with varied sentence structure and vocabulary appropriate to the subject.
An essay assigned to category 2, on the other hand, was described as having one or more of the following characteristics: may display a poor understanding of the task and/or the subject discussed; may provide little development of ideas and lack relevant examples; may present an unfocused discussion of the topic; may have serious problems in expressing ideas clearly; may have serious and frequent problems in syntax and vocabulary.

Survey Results

A total of 347 responded to the survey, and 231 of these respondents provided ratings of the 18 sample essays included in survey forms A, B, and C. As stated earlier, 80% of the respondents indicated that they were either somewhat or very satisfied that the scoring guide addressed the writing skills required of first-year graduate students. There was, however, also a contrary opinion, as expressed by one faculty member, who felt that the kind of specialized (“arcane”) writing style thought to be typical in the discipline (psychology) was not very well reflected in either the kinds of essay prompts or the type of scoring to be used for the new writing measure.

Faculty and deans also made numerous suggestions about what the measure should (and should not) reflect. Mentioned were such characteristics as fluency, sentence structure, grammar, spelling, punctuation, ease of reading, level of maturity, clarity, expression, style, ability to cite relevant literature, semantics, syntax, “historical imagination,” logic, organization, and parsimony. Many of these characteristics either appear or are implied in the proposed GRE scoring guide. In short, respondents had definite views about what they value in students’ writing.

As mentioned previously, responses also revealed a significant correspondence between levels of satisfaction with student essays and the scores previously assigned to them at the scoring session.
Table 2 shows, by content of essay topic, the percentages of respondents who were (a) very satisfied, (b) either very satisfied or somewhat satisfied, (c) very dissatisfied, and (d) either very dissatisfied or somewhat dissatisfied. Table 2 shows a relatively sharp increase in satisfaction and an equally steep decrease in dissatisfaction as a function of essay quality/score. Thus, essay scores based on the proposed scoring guide exhibit a relatively strong correspondence with overall faculty satisfaction.

To reiterate, the objective of the study described here was to explore the extent to which specific features of essays are related to faculty satisfaction.

Method

Each of the 15 essays rated previously (for satisfaction) by graduate faculty and deans was rescored by four trained readers on each of five qualities, or traits, mentioned in the proposed GRE scoring guide:

(1) development/support of ideas with well-chosen, relevant examples
(2) ability to present/sustain a focused discussion of the topic
(3) ability to express ideas clearly
(4) ability to use language fluently, with varied sentence structure and appropriate vocabulary
(5) ability to apply the elements of effective writing (grammar, punctuation, spelling, and capitalization)

A sixth feature (understanding of the subject discussed) was originally specified in the guide. However, even after training, readers felt that they could not clearly distinguish between understanding of the subject and the development/support of ideas. The two qualities were therefore merged. A 1-to-6 scale, similar to that used for overall scoring, was used for each feature. In addition, scorers were asked to identify any other characteristics that affected, either positively or negatively, their judgments of each essay.

Readers’ training consisted of discussing the presence of each trait in benchmark essays at each score level and then scoring seven “practice” essays. Ratings of specific essay features were obtained during a one-day scoring session.
Holistic ratings were provided by the same readers approximately three weeks later by mail.

Table 2
Percentages of Survey Respondents Who Indicated Various Levels of Satisfaction with Student Performance, by Essay Prompt and Essay Score

                                 Essay Score
Essay Prompt             2     3     4     5     6

Very Satisfied
  Science                -     8    17    25    47
  Humanities             -     4    15    15    48
  Social Science         -     0    14    41    44
  Total                  -     4    15    27    46

Somewhat or Very Satisfied
  Science                -    26    48    71    85
  Humanities             -    28    57    55    87
  Social Science         -     9    30    67    88
  Total                  -    22    45    64    86

Very Dissatisfied
  Science               84     7     3     4     -
  Humanities            71    13     5     2     -
  Social Science        64    25    14     2     -
  Total                 74    15     7     3     -

Somewhat or Very Dissatisfied
  Science               92    49    16    11     -
  Humanities            89    54    20    22     -
  Social Science        85    74    43    14     -
  Total                 89    58    27    15     -

Note. For each prompt, entries are based on 66-82 responses for essays scored as 2, 3, 4, and 6. Because twice as many essays scored as 5 were judged, the number of respondents (145-155) was approximately doubled for these essays. Entries for totals are based on approximately three times these numbers. The total number of respondents who provided ratings was 231.

Source. Powers, D. E., Fowles, M., & Willard, A. (1994).

Results

The first finding, on which all the other results turn, was that the ratings among the five specific essay features were all very highly correlated for each reader. The median correlations among features for the four readers were AS, .85, .89, and .89. The lowest single correlation between any two traits for any reader was .72. The correlations among mean ratings² (averaged over all four readers) of essay features are shown in Table 3. It is clear from these correlations either that these features occurred “hand in glove” in the essays we studied or that readers could not distinguish among the features.
The latter interpretation was reinforced by comments from both the readers and the trainers, who reported that the readers (all highly experienced in holistic scoring procedures) had a difficult time “deholisticizing” themselves, that is, thinking in terms of specific essay features.

Table 3
Correlations Among Mean Ratings of Traits (Over Four Readers)

Variable        M     SD    Mechanics   Fluency   Ideas   Focus
Mechanics      4.1    1.3
Fluency        4.0    1.3      .92
Ideas          4.1    1.5      .89        .94
Focus          4.0    1.5      .90        .90      .92
Clarity        3.9    1.5      .93        .97      .95     .93

The necessary consequence of the strong intercorrelations among essay features was that none of the individual features was differentially related to the holistic scores the readers had independently assigned to the essays (Table 4). No consistent tendency was discernible for any of the features to relate more strongly than others to holistic scores: all of the features correlated about equally strongly with holistic scores. (Table 4 also reveals that the scores assigned when essays were first collected in early tryouts correlated substantially, for each essay reader, with the holistic scores given in the current study -- from .82 to .95.)

² We digress here to emphasize the point that correlations among means are typically substantially higher than those among individual observations. This fact undoubtedly accounts, at least in part, for the very high correlations reported here.

Table 4
Correlations of Holistic Scores with Essay Traits (by Reader)

                           Reader
Trait             R1     R2     R3     R4     Mean R1-R4
Mechanics        .88    .93    .91    .83       .96
Fluency          .93    .89    .97    .84       .96
Ideas            .88    .92    .92    .77       .94
Focus            .84    .99    .92    .82       .95
Clarity          .91    .96    .97    .89       .97
Previous Score   .95    .92    .95    .82       .95

Table 5 shows that readers’ ratings of essay characteristics were very highly correlated with the mean satisfaction ratings by graduate faculty. All correlations were in the .80s or .90s. There was no discernible pattern suggesting that any particular essay features were more highly related than others to faculty satisfaction.
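The point made in footnote 2, that correlations among means tend to exceed correlations among individual observations, can be illustrated with a small simulation. The sketch below is not part of the study's analysis. It assumes a hypothetical model in which every rating equals a latent essay quality plus independent reader error of equal variance, and all names and parameter values (15 essays, four readers, unit error variance) are chosen only for illustration.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_once(rng, n_essays=15, n_readers=4, noise_sd=1.0):
    """Simulate two traits rated by several readers (hypothetical model).

    Returns (mean single-reader correlation between the two traits,
             correlation between the reader-averaged trait ratings).
    """
    trait_a, trait_b = [], []
    for _ in range(n_essays):
        quality = rng.gauss(0.0, 1.0)  # assumed latent essay quality
        trait_a.append([quality + rng.gauss(0.0, noise_sd) for _ in range(n_readers)])
        trait_b.append([quality + rng.gauss(0.0, noise_sd) for _ in range(n_readers)])

    # Correlation computed separately within each reader, then averaged.
    single = mean(
        pearson([row[r] for row in trait_a], [row[r] for row in trait_b])
        for r in range(n_readers)
    )
    # Correlation between the two traits after averaging over readers.
    averaged = pearson([mean(row) for row in trait_a], [mean(row) for row in trait_b])
    return single, averaged


def compare(n_reps=500, seed=0):
    """Average both kinds of correlation over many simulated essay samples."""
    rng = random.Random(seed)
    results = [simulate_once(rng) for _ in range(n_reps)]
    return mean(s for s, _ in results), mean(a for _, a in results)


if __name__ == "__main__":
    single_r, averaged_r = compare()
    print(f"mean single-reader correlation:        {single_r:.2f}")
    print(f"mean correlation of four-reader means: {averaged_r:.2f}")
```

Under these assumed parameters the population correlation between the two traits is .50 for a single reader and .80 for four-reader means, so averaging alone produces a sizable rise. The simulation does not indicate how much of the correlations in Tables 3 through 5 arises in this way.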
Table 5
Correlations of Mean Faculty Satisfaction Ratings with Ratings of Essay Traits and with Holistic Scores (by Reader)

                           Reader
Trait             R1     R2     R3     R4     Mean R1-R4
Mechanics        .84    .94    .84    .91       .94
Fluency          .92    .81    .93    .87       .92
Ideas            .83    .83    .86    .91       .88
Focus            .82    .90    .85    .87       .89
Clarity          .86    .89    .91    .93       .92
Previous Score   .90    .92    .93    .86       .95

Discussion

It was envisioned that this study would help to establish the relative importance that graduate deans/faculty place on each of the individual essay characteristics upon which scoring will be based. This information was foreseen as potentially useful for establishing the appropriate relative emphases of particular features in the operational scoring of essays and, more specifically, for informing the wording/formatting of the final scoring guide and the training of essay readers.

The results suggest that readers (at least the four typical readers used in this study) have incorporated one or more of the specific characteristics mentioned in the scoring guide (or some other co-occurring features) into the global scores they assign to essays. These same specific features were also shown to correlate very strongly with judgments of satisfaction of writing quality that were made previously by graduate deans and faculty. One plausible interpretation of the latter result is that graduate school personnel also base their assessments of writing quality largely on the same features that have been designated as the bases for scoring the GRE writing test. If the qualities are not the same, at least they are properties that are highly related.

Several limitations of this study should be mentioned. First, the conclusions are based on a relatively small (though reasonably representative) sampling of essays, which necessarily precluded the inclusion of all subgroups of test takers (e.g., international students).
Also, because the distribution of essays was more nearly rectangular than would typically be observed in practice, the interreader agreement estimates may be somewhat inflated. Interreader agreement may also have been somewhat higher than usual because any essays that tended to generate controversy among readers were not included in the sample.

Second, no consideration was given to the possibility of different points of view among graduate school personnel regarding the bases on which satisfaction with student writing is based. That is, only “average satisfaction” over all graduate faculty respondents was appraised. It is possible, though given the high correlations among features perhaps unlikely, that distinct points of view could have been identified and that faculty holding each view might have weighed the individual features differently.

Finally, our results are indirect in the sense that we did not actually ask graduate school personnel for their ratings of specific features of essays, but only for their global assessments, which we related to judgments of particular characteristics as made by our readers. Had we observed low correlations among faculty judgments, readers’ holistic ratings, and readers’ ratings of specific features, several alternative interpretations would have been plausible. These alternatives are not, however, logically consistent with the high correlations noted here.

Despite the limitations of the study, we have some confidence in our conclusion about the particular qualities upon which GRE essay scoring will be based: they have very nearly the same relative association with (and likely influence on) the scores assigned by trained essay readers as they do with the degree of satisfaction expressed by graduate faculty. This result provides, we believe, a good beginning to accumulating the kind of evidence needed to establish the validity of the GRE writing measure for its intended purpose.

References

Breland, H. M. (1991).
A study of gender and performance on Advanced Placement History Examinations (College Board Report No. 91-4, ETS Research Report No. 91-16). New York: College Entrance Examination Board.

Breland, H. M., & Jones, R. J. (1982). Perceptions of writing skills (College Board Report No. 82-4, ETS Research Report No. 82-47). New York: College Entrance Examination Board.

Bridgeman, B., & Carlson, S. B. (1984). Survey of academic writing tasks. Written Communication, 1, 247-280.

Carlson, S. B., Bridgeman, B. B., Camp, R., & Waanders, J. (1985). Relationship of admission test scores to writing performances of native and nonnative speakers of English (TOEFL Research Report 19, ETS RR-85-21). Princeton, NJ: Educational Testing Service.

Messick, S. (1989). Validity. In R. L. Linn (Ed.), Educational Measurement (3rd ed.). New York: American Council on Education/Macmillan Publishing Company, pp. 13-103.

Miller, M. D., & Crocker, L. (1990). Validation methods for direct writing assessment. Applied Measurement in Education, 3, 285-296.

Moss, P. H., Cole, N. S., & Khampalikit, C. (1982). Comparison of procedures to assess written language skills at grades 4, 7, and 10. Journal of Educational Measurement, 19, 37-47.

Olson, M. C., & Swadener, M. (1984). Establishing and implementing Colorado’s Writing Assessment Program. English Education, 14, 208-219.

Powers, D. E., Fowles, M., & Willard, A. (1994). Direct assessment, direct validation? An example from the assessment of writing. Educational Assessment, 2, 89-100.

Prater, D., & Padia, W. (1983). Developing parallel holistic and analytic scoring guides for assessing elementary writing samples. Journal of Research and Development in Education, 17, 20-24.