Grade Leniency and Student Evaluation of Instruction
Across Multiple Sections of the Same Course
Arden Miller
Missouri State University and
David Stockburger
United States Air Force Academy
The controversy over grade leniency and student evaluations continues, with arguments that professors can (Greenwald & Gillmore, 1997; Isley & Singh, 2005; Krautmann & Sander, 1999) or cannot (Centra, 2003; Marsh, 1987; Marsh & Roche, 2000) improve their evaluations by giving higher grades. Experimental evidence and social psychology theory support the claim that students with low grades can reduce cognitive dissonance and engage in ego defense by giving low evaluations to teachers who give them low grades (Maurer, 2006).
Much of the research on student evaluations involves diverse and complex analyses (e.g., Marsh, 1987) conducted with little theoretical guidance. To expect that student evaluations would not be influenced by expected grade would contradict a long-standing history of social psychology research on cognitive dissonance, attribution, and ego threat. Failure threatens the ego (Snyder, Stephan, & Rosenfield, 1978; Miller, 1985) and motivates us to find rationales to defend it. One common strategy is to diminish the value of the activity (Miller & Klein, 1989). Similarly, cognitive dissonance theory (Festinger, 1957) predicts that people who experience poor performance but perceive themselves as competent will experience dissonance, which they can reduce by evaluating the instruction negatively. Attribution research (Weiner, 1992) also indicates that among students low in achievement motivation, failure is associated with external attributions for its cause, and the most plausible external attribution is the quality of instruction. While arguments about the degree of influence are reasonable, the position that evaluations are unaffected by grades is inconsistent with existing theory and evidence.
Salmons (1993) provides evidence on the causal direction of the relationship between grades and student ratings of instructors. She had 444 students complete faculty evaluations after 3-4 weeks of classes and again after 13 weeks. Students who expected to receive Fs significantly lowered their evaluations, while students who expected to receive As and Bs significantly raised theirs.
Centra (2003) reports that in a study of 9,194 class averages using the Student Instructional Report, the relationship between expected grades and global ratings was only .2. He goes on to argue that when the variance due to perceived learning outcomes is regressed from the global evaluation, the effect of expected grades is eliminated. But students' best assessment of perceived learning outcomes is their expected grade, and the two are highly correlated. Thus, when perceived learning is regressed from the global evaluations, it is not surprising that suppression effects would eliminate or even reverse the correlation between expected grade and global evaluation.
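The suppression argument can be illustrated with a brief simulation. The sketch below is ours and purely illustrative; the variable names and effect sizes are assumptions, not Centra's data. When perceived learning and expected grade are generated to be highly correlated, residualizing the global evaluation on perceived learning drastically shrinks its correlation with expected grade.

    # Illustrative simulation (assumed effect sizes, not Centra's data):
    # partialling a highly correlated "perceived learning" variable out of
    # global evaluations can all but erase the expected-grade effect.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    learning = rng.normal(size=n)                        # perceived learning
    grade = learning + 0.4 * rng.normal(size=n)          # expected grade, r ~ .93 with learning
    evaluation = 0.5 * grade + 0.5 * rng.normal(size=n)  # global evaluation

    def r(a, b):
        return np.corrcoef(a, b)[0, 1]

    # Residualize evaluation on perceived learning, then re-correlate.
    beta = np.cov(evaluation, learning)[0, 1] / np.var(learning)
    residual = evaluation - beta * learning
    print(f"raw r(grade, eval) = {r(grade, evaluation):.2f}")        # ~ .73
    print(f"after partialling learning = {r(grade, residual):.2f}")  # ~ .14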
In general, there are several reasons why the relationship of expected grade to global evaluations is understated. For example, faculty ratings are on average very high, and this restriction of range attenuates correlations. But the main reason the grade/evaluation effect is often minimized in the research is the confusion between leniency and expected grade. Isley and Singh (2005) demonstrated that it is the difference between the expected grade and the student's GPA that influences student evaluations.
Expected grade is a measure of leniency only to the extent that it reflects departures from the grades typical for the student and the course. The vast majority of variation in expected grades follows individual differences in ability and effort. There is no reason to believe that highly capable students give higher ratings, and thus all of the variance in expected grade that is based on ability differences would be unrelated to student evaluations. Expected grade for the individual is therefore a very poor measure of leniency, and correlations as low as .2 should not be a surprise.
Marsh (1983) has argued that the individual is not the proper unit of analysis because individual-level analysis can produce spurious findings driven by individual differences among students; he argues that these studies should instead use the class as the unit of analysis. We agree, both for that reason and because individual-level analysis can mask significant relationships as well. Individual differences in expectancy attenuate the correlation less when class averages are the unit of analysis. To the extent that the same class average would be expected across all courses, the class average for expected grade is a good measure of grading leniency.
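A minimal simulation makes the aggregation point concrete. Everything below is an assumption chosen for illustration (class sizes, effect sizes): a small leniency effect that is swamped by student-level ability differences at the individual level emerges clearly once classes are the unit of analysis.

    # Illustrative sketch (assumed data-generating process, not our data):
    # class averaging removes student-level noise and reveals the leniency effect.
    import numpy as np

    rng = np.random.default_rng(1)
    n_classes, n_students = 500, 30
    leniency = rng.normal(size=(n_classes, 1))           # per-class grading leniency
    ability = rng.normal(size=(n_classes, n_students))   # per-student ability

    expected_grade = ability + 0.3 * leniency            # grades mostly track ability
    evaluation = 0.3 * leniency + rng.normal(size=(n_classes, n_students))

    def r(a, b):
        return np.corrcoef(a.ravel(), b.ravel())[0, 1]

    print(f"individual level: r = {r(expected_grade, evaluation):.2f}")   # ~ .08
    print(f"class-mean level: r = "
          f"{r(expected_grade.mean(axis=1), evaluation.mean(axis=1)):.2f}")  # ~ .73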
Course quality, not individual attributes, is what we are attempting to assess when we use student evaluations of courses. So it makes little sense to use individual students as the unit of analysis unless we wish to study how individual attributes influence course evaluation. Several studies support the conclusion that when the class is the unit of analysis, expected grade is a more substantial source of bias (Blackhart, Peruche, DeWall, & Joiner, 2006; DuCette & Kenney, 1982; Ellis, Burke, Lomire, & McCormick, 2003).
Blackhart et al. (2006) analyzed 167 psychology classes in a multiple regression with class as the unit of analysis and found that the two most significant predictors of instructor ratings were the average grade given by the instructor and instructor status (TA or rank of faculty). Because of the limited number of classes, their power was limited. Beyond the relationship between grades and global course evaluations, however, was the finding that TAs were rated more highly than ranked faculty. This raises additional validity questions: we must either accept that the least qualified teachers are actually the best teachers, or conclude that student evaluations give us false information about the quality of teaching.
DuCette and Kenney (1982) provided evidence that using the course as the unit of analysis increased the correlation between expected grade and other course ratings. Within specific groupings of classes, these correlations ranged from .23 to .53. Two factors limited the magnitude of these relationships. First, the classes used were all upper-division or graduate courses. Second, over 90% of the students in these classes expected an A or a B, so the correlations between expected grade and global course ratings would be expected to shrink because of the restricted variation in expected grades.
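The range-restriction point is easy to verify numerically. The sketch below uses made-up numbers: the same underlying grade-evaluation relationship yields a visibly smaller correlation once the sample is restricted to students expecting an A or a B.

    # Range-restriction sketch (hypothetical numbers): truncating expected
    # grades to the top of the scale attenuates the observed correlation.
    import numpy as np

    rng = np.random.default_rng(2)
    grade = rng.normal(3.0, 0.8, size=50_000)       # expected grade on a 0-4 scale
    evaluation = 0.5 * grade + rng.normal(size=50_000)

    full = np.corrcoef(grade, evaluation)[0, 1]
    top = grade >= 3.0                              # roughly "A or B" expectations only
    restricted = np.corrcoef(grade[top], evaluation[top])[0, 1]
    print(f"full range r = {full:.2f}, restricted r = {restricted:.2f}")  # ~ .37 vs ~ .23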
Different disciplines and subject areas have very different GPAs, and students therefore carry different grade and workload expectations into different courses. For example, an anatomy instructor giving a 3.0 GPA might be considered lenient, while an education instructor giving a 3.25 GPA might be considered hard. To measure workload and leniency factors validly, correlations should be computed across different teachers of the same course.
Further, different populations take courses in different areas, so population differences between anatomy classes and education classes could create or mask findings as well. Hence, analyzing these correlations within the same discipline and course was expected to strengthen the relationship and offer more valid results.
In the current study, we use data collected over a 20-year period to allow for more powerful analyses conducted within many sections of the same course. We expect the relationships between expected grade and global evaluations to increase when the analysis is kept within the same course, relative to correlations computed across all psychology courses together. We also expect the correlations to be substantially higher than those reported by researchers who have used individual students as their unit of analysis.
Convergent validity is demonstrated when measures correlate with factors that should indicate teaching effectiveness. The presumption behind university requirements for a terminal degree is that better-trained faculty will be more effective teachers. Hence, if student evaluations are a valid measure, better-trained full-time faculty should receive higher ratings than per-course instructors and teaching assistants. However, Blackhart et al.'s (2006) findings appear to contradict this expectation. We therefore investigate the relationship between instructor status (TA, per-course faculty, instructor, or ranked faculty) and student evaluations.
Method
The study was conducted at a large midwestern public university. We used data from 2,846 undergraduate, 244 mixed graduate and undergraduate, and 429 graduate psychology classes taught from 1987 to 2007, all evaluated by students using the same 15-item instrument (see Appendix A). Faculty distributed scan forms at some point no more than two weeks before the end of the semester. A student was assigned to collect the forms and deliver them to the department secretary. The faculty member left the room while students completed the forms.
Classes with fewer than five student responses were not retained in the data pool.
Results
Correlations between relevant evaluation items
Undergraduate courses. Table 1 presents the intercorrelations of the six relevant items in two forms. Using the class average on each item as the unit of analysis, we found overall relationships between class expected grade and class overall rating (r = .41), exams reflecting the material (r = .47), grading fairness (r = .47), and appropriateness of assignments (r = .47).
Analyzing the data within specific psychology courses (e.g., all introductory statistics classes) makes expected grade a better measure of leniency, since courses differ in difficulty and in how much students want to take them. We computed the correlations within the same course for the five courses that are either a core psychology requirement or a general education requirement and then averaged across those five courses. We found higher correlations with grade expectancy: overall quality, mean r = .54; exams reflecting material, r = .63; grading fairness, r = .58; and appropriateness of assignments, r = .54.
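For readers who wish to reproduce this kind of analysis, the sketch below shows one way to compute within-course correlations with pandas. The file name and column names are hypothetical placeholders, not our actual data layout.

    # Sketch of the within-course analysis (hypothetical file and column names).
    import pandas as pd

    # One row per class section, containing class-average ratings.
    df = pd.read_csv("class_means.csv")   # columns: course, expected_grade, items...

    items = ["overall_quality", "exams", "fair_grading", "assignments"]

    # Correlate class-average expected grade with each item within each course,
    # then average the per-course correlations.
    within = df.groupby("course").apply(
        lambda g: g[items].corrwith(g["expected_grade"])
    )
    print(within.mean())   # mean within-course correlation for each item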
Similarly, the correlations between wanting the course and other course ratings shown in
Table 1 increase dramatically when we consider correlations within a specific course.
Graduate courses. Table 2 presents results for a diverse set of graduate classes that typically form part of the graduate requirements in psychology and for two graduate classes that are service courses for education. For the service courses, we computed correlations within each course and averaged them. In the service courses, we saw very high correlations between grade expectancy and overall quality, r = .59; exams reflecting material, r = .49; grading fairness, r = .525; and appropriateness of assignments, r = .51. These correlations were much lower in the other graduate classes: grade expectancy with overall quality, r = .17; exams reflecting material, r = .07; grading fairness, r = .22; and appropriateness of assignments, r = .295.
Instructor status and student ratings
We compared teaching assistants, per-course faculty, instructors, and ranked faculty in undergraduate general education classes, since these courses have the largest proportion of TAs and per-course faculty. An analysis of variance and Scheffe post hoc tests indicated that expected grades were lower for ranked faculty and instructors than for teaching assistants and per-course faculty, F(3, 517) = 18.7, p < .001, MSe = .096 (see Table 4). Overall quality ratings were higher for teaching assistants and per-course faculty than for ranked faculty and instructors, F(3, 517) = 7.11, p < .001, MSe = .30. Teaching assistants were credited with the most appropriate exams, followed by per-course faculty, with lower ratings for instructors and ranked faculty, F(3, 517) = 18.6, p < .001, MSe = .161. Teaching assistants were judged to grade more fairly than per-course faculty, instructors, and ranked faculty, F(3, 517) = 7.14, p < .001, MSe = .095. Assignments and required activities were considered less appropriate in courses taught by ranked faculty relative to the other three groups, F(3, 517) = 16.4, p < .001, MSe = .106.
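An analysis of this kind can be sketched as follows. The data below are simulated placeholders with roughly the reported means, and the group sizes are assumptions chosen only to reproduce the 517 error degrees of freedom; the Scheffe criterion is computed directly from its definition.

    # Sketch of the status comparison (simulated placeholder data).
    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(3)
    groups = {
        "TA":         rng.normal(4.00, 0.55, 60),
        "per_course": rng.normal(3.94, 0.55, 140),
        "instructor": rng.normal(3.70, 0.55, 120),
        "ranked":     rng.normal(3.74, 0.55, 201),
    }

    f, p = stats.f_oneway(*groups.values())
    print(f"F = {f:.2f}, p = {p:.4f}")

    # Scheffe post hoc: means i and j differ when
    # (m_i - m_j)^2 > (k - 1) * F_crit * MSe * (1/n_i + 1/n_j)
    k = len(groups)
    n_total = sum(len(g) for g in groups.values())
    mse = sum(((g - g.mean()) ** 2).sum() for g in groups.values()) / (n_total - k)
    f_crit = stats.f.ppf(0.95, k - 1, n_total - k)

    names = list(groups)
    for i in range(k):
        for j in range(i + 1, k):
            gi, gj = groups[names[i]], groups[names[j]]
            diff = gi.mean() - gj.mean()
            bound = (k - 1) * f_crit * mse * (1 / len(gi) + 1 / len(gj))
            flag = "significant" if diff ** 2 > bound else "ns"
            print(f"{names[i]} vs {names[j]}: diff = {diff:+.2f} ({flag})")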
Average ratings of instructors
Average ratings of instruction are presented in Table 3.
Discussion
The findings support the argument that student evaluations of faculty can be inflated by leniency in grading. This position is supported both by the strong relationships between expected grade and global ratings and by the evidence that greater training and experience are associated with poorer evaluations and lower expected grades.
It is very likely that these correlations are attenuated by the loading of scores at the high end of the scale. Evaluation items generally draw scores at the high end of the 1-5 scale even when they are deliberately constructed to move evaluators away from the extremes. The item "The overall quality of this course was among the top 20% of those I have taken" is conspicuously designed to move respondents away from the top rating. Yet average global ratings remain at about 3.9.
To establish concurrent validity, we must demonstrate that global ratings correlate with academic achievement or other clearly defined educational goals, with class as the unit of analysis. Little attention has been paid to evaluating whether instructor ratings are related to actual rather than perceived achievement. This correlation must also be substantial enough to clearly outweigh the invalid influences on student evaluations; otherwise, inappropriate strategies will be the best way for faculty to obtain higher evaluations.
Evidence suggests that student evaluations are influenced by likability, attractiveness, and dress (Buck & Tiene, 1989; Feeley, 2002; Gurung & Vespia, 2007) in addition to leniency and low demands (Greenwald & Gillmore, 1997). One must even question whether factors like instructor warmth, which relates to student evaluations (Best & Addison, 2000), are really pertinent to the purposes of a college education. With the total invalid variance from numerous factors potentially high, establishing a strong relationship to achievement is imperative.
The perception that course attributes influence teacher evaluations is far more detrimental to the quality of education than the biased evaluations themselves. It is unlikely that good teachers, even demanding ones, will receive evaluations in which most students rate their course poorly, and good teachers rarely lose their positions because of low-quality evaluations.
But Marsh (1987) found that faculty perceive evaluations to be biased by course difficulty (72%), expected grade (68%), and course workload (60%). If one's goal is high merit ratings and teaching awards, and the most significant factor is student evaluations of teaching, then putting easier questions on the test, adding more extra credit, cutting project expectations, letting students off the hook for missed deadlines, and boosting borderline grades are all likely strategies for raising evaluations. The impact is hard to deny. With a much broader base of students going to college, and thus more students with lower high school achievement, we would expect GPAs to decline. The opposite has occurred: GPAs have been rising steadily, and grade inflation is now a common part of the campus vocabulary.
This problem is exacerbated by the high means on most global ratings. If the university average on an item is about 3.9, a rating of 1 has nearly three times the negative impact of the positive effect of a rating of 5. Consequently, an important strategy for improving ratings is to make sure the instructor does not anger anyone. Hence, deadlines get extended, and few students fail even when they do not put in the effort.
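The arithmetic behind this claim is straightforward: with a mean of 3.9 on the 1-5 scale, a single rating of 1 sits 2.9 points below the mean, while a 5 sits only 1.1 above it.

    # Deviations from a typical item mean of 3.9 on the 1-5 scale.
    mean = 3.9
    print(f"below: {mean - 1:.1f}, above: {5 - mean:.1f}, "
          f"ratio: {(mean - 1) / (5 - mean):.1f}")   # 2.9 vs 1.1, ratio ~ 2.6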
No measure of teaching quality will be free from one bias or another. There are two keys to effective teaching evaluation. First, it must be multifaceted, with measures whose biases differ and thus balance one another; no single measure should determine an inordinate amount of the evaluation variance. For example, if syllabi are rated by colleagues and almost all receive the same score, then regardless of its weighting this measure will influence only a small portion of the variance.
Second, effective teachers must be able to earn positive student ratings even when they hold high expectations and do not inflate grades. It is common in these research studies to see means of 4.0 on a 5-point scale, so many excellent teachers will score below average. It is maladaptive to try to raise a 3.9 global rating to a 4.1, because doing so often requires that the instructor try to eliminate ratings of 1, which have an inordinate influence on the mean. In fact, one must wonder whether moving from a 4.5 to a 4.8 might require one to become a less effective teacher. Competing against the norms in this way is likely to lead to grade inflation and permissiveness toward negligent students.
Some researchers (Ellis et al., 2003; Greenwald et al., 1999) argue that student evaluations of instruction should be adjusted on the basis of the grades assigned. However, there are problems with such an approach. The regression weights are likely to differ by course and many other factors; in our research and in DuCette and Kenney's (1982), substantial variation in correlations was found across different sets of courses. Establishing valid adjustments would be problematic at best.
Further, such an approach would punish instructors who happen to get an unusually capable class and give the students the grades they deserve. Student evaluations are not a proper motivational factor in grade assignment, whether the effect is to inflate or to deflate grades.
It would seem nearly impossible to eliminate invalid bias from student ratings of instruction. Yet they do tell us a teacher is ineffective when the majority rate the class poorly. It is the normative, competitive use of student evaluations of teaching that makes them so subject to problematic interpretation, and adjusting evaluations for expected grades only exacerbates this problem.
Can a weak teacher ensure good evaluations by assigning high grades to students? It is highly doubtful. Will a good teacher get damning student evaluations? Again, this is highly unlikely. But can a professor who wants to step up a merit grade or attain a teaching award improve their chances by grading more easily? The evidence suggests that this is in fact quite likely.
One of the detrimental effects of social research has been that, increasingly, perceived quality of service is equated with effective service; the doctor's bedside manner becomes more important than accurate diagnosis. Much of higher education appears to have adopted the model that what you think you have learned matters more than what you have learned. More research must be dedicated to evaluation of teaching in higher education that is not merely a matter of student perception.
References
Best, J. B., & Addison, W. E. (2000). A preliminary study of perceived warmth of professor and student evaluations. Teaching of Psychology, 27, 60-62.
Blackhart, G. C., Peruche, B. M., DeWall, C. N., & Joiner, T. E., Jr. (2006). Factors influencing teaching evaluations in higher education. Teaching of Psychology, 33, 37-39.
Buck, S., & Tiene, D. (1989). The impact of physical attractiveness, gender, and teaching philosophy on teacher evaluations. Journal of Educational Research, 82, 172-177.
Centra, J. A. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44, 495-518.
DuCette, J., & Kenney, J. (1982). Do grading standards affect student evaluations of teaching? Some new evidence on an old question. Journal of Educational Psychology, 74, 308-314.
Ellis, L., Burke, D. M., Lomire, P., & McCormick, D. R. (2003). Student grades and average ratings of instructional quality: The need for adjustment. The Journal of Educational Research, 97, 35-40.
Feeley, T. H. (2002). Evidence of halo effects in student evaluations of communication instruction. Communication Education, 51, 225-236.
Festinger, L. (1957). A theory of cognitive dissonance. Stanford, CA: Stanford University Press.
Greenwald, A. G., & Gillmore, G. M. (1997). Grading leniency is a removable contaminant of student ratings. American Psychologist, 52, 1209-1217.
Gurung, R. A. R., & Vespia, K. M. (2007). Looking good, teaching well? Linking liking, looks, and learning. Teaching of Psychology, 34, 5-10.
Krautmann, A. C., & Sander, W. (1999). Grades and student evaluations of teachers. Economics of Education Review, 18, 59-63.
Marsh, H. W. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for future research. International Journal of Educational Research, 11, 253-288.
Marsh, H. W., & Roche, L. A. (2000). Effects of grading leniency and low workload on students' evaluations of teaching: Popular myth, bias, validity, or innocent bystanders? Journal of Educational Psychology, 92, 202-228.
Miller, A. (1985). A developmental study of the cognitive basis of performance impairment after failure. Journal of Personality and Social Psychology, 49, 529-538.
Miller, A., & Klein, J. (1989). Individual differences in ego value of academic performance and persistence after failure. Contemporary Educational Psychology.
Salmons, S. D. (1993). The relationship between students' grades and their evaluation of instructor performance. Applied H.R.M. Research, 4, 102-114.
Snyder, M. L., Stephan, W. G., & Rosenfield, D. (1978). Attributional egotism. In J. H. Harvey, W. J. Ickes, & R. R. Kidd (Eds.), New directions in attribution research (Vol. 2). Hillsdale, NJ: Erlbaum.
Weiner, B. (1992). Human motivation: Metaphors, theories, and research. Newbury Park, CA: Sage.
Table 1: Correlations among student ratings on relevant items from the evaluation survey, undergraduates only. Below the diagonal: correlations across all classes regardless of course. Above the diagonal: correlations within the same course, averaged for five core or general education courses.

              Overall  Exams  Fair   Assignments  Wanted  Grade
Overall          1     .795   .754      .783       .651    .535
Exams          .724      1    .793      .741       .528    .627
Fair           .717    .736     1       .773       .485    .583
Assignments    .744    .665   .751        1        .525    .543
Wanted         .524    .282   .373      .375         1     .483
Grade          .410    .470   .473      .594       .305      1
Table 2: Correlations among student ratings on relevant items from the evaluation survey for graduate classes. Below the diagonal: correlations across psychology graduate classes. Above the diagonal: correlations within the same course, averaged for the two required service courses for education (n = 94).

              Overall  Exams  Fair   Assignments  Wanted  Grade
Overall          1     .783   .83       .86        .716    .586
Exams          .542      1    .784      .636       .542    .493
Fair           .592    .544     1       .793       .558    .525
Assignments    .670    .394   .627        1        .635    .511
Wanted         .578    .265   .319      .509         1     .390
Grade          .171    .067   .219      .295       .245      1
Table 3: Mean ratings on the evaluation items by course level

                         Evaluation item
Level           Overall  Exams  Fair   Assignments  Wanted  Grade      n
Undergraduate    3.87    4.25   4.39      4.26       3.77    4.20   2,846
Mixed            3.94    4.31   4.43      4.31       4.05    4.37     244
Graduate         3.92    4.15   4.28      4.28       3.79    4.70     429
Table 4: Student evaluations as a function of the status of the course instructor

                      Status of the course instructor
              TA (a)    Per course (b)  Instructor (c)  Ranked (d)
Overall       4.00 cd      3.94 cd         3.70 ab        3.74 ab
Exams         4.47 bcd     4.26 acd        4.05 ab        4.11 ab
Fair          4.57 bcd     4.42 a          4.35 a         4.40 a
Assignments   4.30 d       4.17 d          4.16 d         4.02 abc
Wanted        3.95 cd      3.94 cd         3.62 ab        3.74 ab
Grade         4.30 cd      4.19 cd         4.01 ab        4.04 ab

Note. Letters indicate the instructor categories from which the indicated mean differs significantly by Scheffe test.