Do Course Evaluations Truly Reflect Student Learning? Evidence from an Objectively Graded Post-test

Trinidad Beleche, David Fairris and Mindy Marks*,a

Aug. 9, 2010
Abstract
It is difficult to assess whether course evaluations reflect how much students really
learn from a class because valid measures of learning are rarely available. This
paper makes use of a unique setting in which students receive instructor-specific
grades but also take a common, high-stakes post-test which is centrally graded and
serves as the basis for capturing actual student learning. We match these student-specific measures of learning to student-specific course evaluation scores from
electronic records and a rich set of student-level covariates, including a pre-test
score and other measures of skills prior to entering the course. While small in
magnitude, we find a positive and statistically significant association between our
measure of student learning and course evaluations. The association between
course evaluation scores and two other measures of learning commonly found in
the literature – namely, grade in the course and grade in subsequent courses – are
also estimated and discussed.
JEL-codes: A2, I23, J24
Keywords: Course Evaluations, Student Learning, Grade
a Department of Economics, University of California-Riverside, Riverside, CA 92521, United States
* Corresponding author. Department of Economics, University of California-Riverside, Riverside, CA 92521, United States. Tel.: +1 951 827 4164; fax: +1 951 827 5685. E-mail address: mindy.marks@ucr.edu.
1. Introduction
Student evaluations of courses are the most common evaluation tool in higher education
today. Scores on course evaluation forms are used by many university stakeholders as a
measure of the transmission of knowledge in a course. Since the widespread use of course
evaluations in higher education in the 1960s and 1970s, there has been a flurry of papers
investigating the relationship between evaluation scores and various measures of student
learning (See Cashin (1988; 1995), Cohen (1981), Dowell and Neal (1982), and Clayson
(2009) for comprehensive reviews).
A recent review (Clayson, 2009) concludes that no
general consensus has been reached about the validity of this relationship largely because
of the challenge associated with obtaining valid measures of student learning.
Many researchers have viewed a positive correlation between student grades and scores on
student course evaluations as evidence of greater knowledge transmission, assuming that
grades reflect learning. However, it is not clear that higher course grades necessarily
reflect more learning. The positive association between grades and course evaluations may
also reflect initial student ability and preferences, instructor grading leniency, or even a
favorable meeting time, all of which may translate into higher grades and greater student
satisfaction with the course, but perhaps not greater learning.
What is required is a measure of learning that is independent of instructor-assigned grades.
Grades in subsequent courses have been used by some researchers as an alternative
measure of student learning (see Carrell and West (2008), Ewing and Kochin (2010),
Johnson (2003), Weinberg, Fleisher & Hashimoto (2009), and Yunker and Yunker (2003)).
While the evidence from these studies is mixed, they often find a negative association
between grade in the subsequent class and the quality of the preceding course as proxied
by a summary measure of the preceding instructors’ course evaluations. On the
assumption that future grades are a valid measure of learning in the prior class, this line of
research suggests that course evaluations are at best weakly correlated with the underlying
learning experience of a class.
However, while grade in a subsequent class is arguably a superior proxy for learning than
is grade in the current class, if subsequent classes do not build on knowledge conveyed in
the initial class, the observed relationship between course evaluations and this measure of
learning may be very weak. Additionally, results that use subsequent class grade as a
measure of learning are likely to suffer from selection concerns since not all students enroll
in a subsequent class. Hence, an objective measure of student learning in the specific
course under evaluation remains an important goal for a careful analysis of this issue.
In addition to the absence of an objective measure of student learning, the course
evaluation literature faces a number of difficulties regarding empirical methodology as
well. For example, due to the anonymous nature of course evaluations, student
evaluations are typically available to researchers only as course means, and only for that
subset of students who choose to fill out evaluations. Measures of learning, by contrast,
are available at the individual student level, and for all students in the course.
This may pose a serious measurement problem: if researchers cannot identify which
particular students complete evaluations, evaluation scores and learning outcomes are
necessarily derived from different and potentially dissimilar segments of the class
population (Dowell and Neal, 1982). This does not pose a problem if we can assume that
evaluation scores for the subset of students who fill out evaluation forms reflect the
average experience of the class as a whole, but there is ample research calling this
assumption into question (Palmer, 1978; Clayson, 2007; Kherfi, 2009). Data which allow
the researcher to link individual evaluation scores with individual measures of student
learning would overcome this potential measurement problem.
This paper tests the relationship between student course evaluations and an objective
measure of student learning. The objective measure of learning derives from a unique
setting in which all students must take a high-stakes post-test in a remedial course that is
graded by people other than the faculty teaching the sections. This measure of learning
reflects concepts that relate directly to the course, as opposed to subsequent courses, but
unlike course grade it is not issued by the instructor. Our data contain other attractive
features which allow us to address some of the empirical methodological problems found
in the previous literature. We possess electronic course evaluations that make it possible
for us to match the student’s evaluation response to information contained in official
student records, thereby eliminating the measurement problem cited above. Thus, for each
student we know their demographic information, their score on a pre-test that is conducted
for placement purposes, their scores on various elements of the course evaluation, their
instructor-assigned grade, their score on the post-test, and their course grade.1
1 Carrell and West (2008) is the only other paper of which we are aware that employs an independently graded, common post-test as a measure of learning. They find that classes in the US Air Force Academy with higher than average course evaluations have students who perform better on the post-test. However, their study is limited to comparing learning for the entire class to an average evaluation score that reflects the subsample of students who complete evaluations. There is evidence that students who fill out evaluations are non-representative in terms of academic ability. Since we possess individual-level data on course evaluations we are able to overcome this problem.
Furthermore, this learning measure is available for all students who complete the initial
class as opposed to the subsample of students who continue on to a subsequent class.
The results of our analysis reveal a weak but positive and statistically significant
relationship between an objective measure of student learning and student course
evaluation scores. This finding holds when we control for student-specific demographic
characteristics, ability, and pre-test scores. Additionally, our findings are robust to the
inclusion of instructor and section fixed effects. A one standard deviation increase in
learning is associated with a 0.05 to 0.07 increase in course evaluation scores on a five-point scale, whereas a one standard deviation increase in course grade leads to a 0.09 to
0.12 increase in course evaluations. This suggests that the observed relationship between
grades and course evaluations reflects something more than mere learning. When we use
grades in the subsequent class as a measure of learning, we find a negative, and always
statistically insignificant relationship between course evaluations and this proxy for
learning. This result calls into question the use of performance in future classes as an
accurate measure of student learning.
2. Institutional Description and Data
The four-year public university from which we draw our data has in place an entry-level
skills requirement in one of the core disciplines. Students can fulfill this requirement by
obtaining a sufficiently high score on either one component of the SAT or an appropriate advanced placement test.2 Students who do not meet the above requirements must sit for a
placement exam; we use the numerical score on this test as a pre-test measure. All of the
placement exams for entering freshmen from every campus of the state university system
are graded centrally in the spring of their senior year of high school.3
Students who fail to achieve a score above the predetermined threshold are required to take
and pass a remedial course. Students normally enroll in the remedial class during the fall
of their freshman year. In practice, this requirement must be met by the fall of their second
year or students cannot continue at the university. Students may repeat the class, and so
the same student can appear multiple times in our sample. The data in this paper come
from 1106 students in 97 sections of the remedial program for students who enrolled in the
remedial class in the academic years 2007, 2008 and Fall 2009.
The remedial sections are taught by instructors who are supervised by a senior professor at
the university. Each section is a standalone class with around 20 students. It is common
for an instructor to teach multiple sections. The average instructor in our sample has
taught 2.41 sections over the course of seven quarters. All sections use the same text, a
regulated number of assignments and the same point system for grading. However,
individual instructors assign grades and have control over the content of the sections they
teach, including supplementary materials and the nature of assignments and exams. The
2 Students may also fulfill this requirement by passing a designated course at an accredited community college prior to entering the university.
3 Exam scores range from 2 to 12 with a grade of 6 or less considered failing, and a grade of 8 considered passing. If the student receives a 7, then an additional grader's score is used. The mean student in our sample has a score of 5.4.
instructors have discretion over 700 points available in each section. Although the classes
are somewhat standardized and the students are more or less homogenous across sections,
we observe large variability in section grades. There is more than a two-letter grade
difference between the mean grade in the sections with the highest and lowest average
grades.
All students take a common post-test that is worth 300 points. The students need 730 points out of a total of 1000 to move on to the next course in a required series of classes. The overall grade (lecture points plus post-test score) is not curved (i.e., 800 points is a B- regardless of section or instructor). The post-test is identical in structure to the pre-test, a
fact of which the students are aware. We note that the post-test is modeled on the system
wide placement test which was designed by a team of experts to capture the subject’s
fundamental concepts. Additionally, part of the section grade consists of exercises that
replicate the post-test. The post-test appears to be a valid instrument in as much as further
analysis reveals that students with stronger academic backgrounds perform better on the
exam. The post-test is administered to all students at the same time and place, and after the
deadline for submission of course evaluations. This test is graded by a committee
according to a set diagnostic standard. There is a norming session, followed by mentored
grading; many of the instructors doing this grading also participated in the grading of the
original pre-test. We will use the post-test score conditioned on initial student ability as
our measure of learning.4
4 A student had to have taken the post-test to be included in our sample; thus we drop 185 students who did not take the post-test.
During the time of our study, the university was phasing in electronic evaluations.5 Thus,
during much of the period instructors could choose between electronic evaluations and
paper evaluations. The sample we analyze in this paper is restricted to students whose
instructors used electronic evaluations. To the best of our knowledge this is the first study
to make use of electronic course evaluations. We focus on electronic evaluations because
there is an electronic record, which allows us to match a given evaluation to the
corresponding institutional data for that student. Thus, we limit our analysis to the
subsample of students for which we can observe both learning and course evaluations. Our
data do not contain student identifiers, and so the anonymity of the evaluation process
remains intact.
Column 1 of Table 1 shows summary statistics for those instructors who opted for
traditional paper evaluations, while Column 2 shows summary statistics for those
instructors who chose electronic evaluations. Column 3 contains tests for differences in
the means between these two samples. The test statistics in Column 3 show that
instructors with academically stronger students as measured by their SAT Verbal, ACT
scores, high school GPA, and class grades were more likely to use electronic evaluations.6
However, the comparison also reveals that sections with electronic course evaluations are
generally representative of the overall remedial population with respect to student
demographic characteristics and pre-test scores.
5 The phase-in was completed by Fall 2009, when the only mechanism for course evaluation was electronic.
6 This could be because weaker students who do not regularly attend class have the opportunity to complete an electronic evaluation whereas such students must show up in class to complete a paper evaluation. If instructors fear that weaker students (or those with low expected grades) give poorer evaluations then paper evaluations are a safer strategy.
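The group comparisons reported in Columns 3 and 6 of Table 1 are simple differences in means across samples. The sketch below illustrates one way such a comparison could be run; the synthetic data frame, the column names (hs_gpa, electronic_eval), and the choice of Welch's t-test are illustrative assumptions, not details taken from the paper.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic stand-in for the administrative records; the column names and
# values are illustrative only, not the data used in the paper.
rng = np.random.default_rng(0)
records = pd.DataFrame({
    "hs_gpa": np.clip(rng.normal(3.38, 0.31, 4400), 0.0, 4.0),
    "electronic_eval": rng.random(4400) < 0.37,
})

paper = records.loc[~records["electronic_eval"], "hs_gpa"]
electronic = records.loc[records["electronic_eval"], "hs_gpa"]

# Difference in mean high school GPA between the two instructor samples,
# analogous to the p-values in Column 3 of Table 1 (test variant assumed).
t_stat, p_value = stats.ttest_ind(electronic, paper, equal_var=False)
print(f"High school GPA difference: t = {t_stat:.2f}, p = {p_value:.3f}")
```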
Since we are interested in the relationship between student-level learning and course
evaluations, in addition to having an instructor who elected for electronic course
evaluations (the sample in Column 2) a student had to have completed an electronic course
evaluation to be included in our sample. For the instructors that opted for electronic course
evaluations about 68% of the students completed an evaluation form. Information about
this sample is contained in Column 5 of Table 1. This is the sample used in this paper, and
it contains 1106 students and 33 instructors in 97 sections. Sixty-two percent of the sample consists of first-generation college students, and 58 percent of the sample is female. The average high
school GPA is 3.42, and the average student earned a C+ on the section component and a
C+ on the post-test.
Column 6 tests if there are differences between the sample of students used in this study
(those who complete electronic course evaluations (Column 5)) and students who do not
complete evaluations (Column 4). The students who complete course evaluations are not
representative of the class overall. Female and low-income students are more likely to
complete course evaluations, as are students with higher high school grade point averages.
Students who performed better on the SATs are less likely to complete electronic
evaluations. Importantly, students who complete evaluations earn significantly higher
section grades and perform slightly better on our measure of learning (Post-test 300). The
difference in characteristics and outcomes between students who complete evaluations and
those who do not suggests that aggregate evaluation scores (based on the subsample of
students who complete evaluations) may not accurately reflect the average learning
experience of the entire class. Thus, studies that match aggregate evaluation scores to
aggregate measures of learning (such as average grade) for the class as a whole, likely
suffer from measurement error.7
Table 2 contains information about the evaluation instrument as well as the descriptive
statistics for each of the questions on the instrument. Each evaluation score ranges from 1
(strongly disagree) to 5 (strongly agree). The student may also answer “N/A” to a given
question. Table 2 also shows the number of completed responses for each question after
we exclude the “N/A”s from our sample. Since we are interested in capturing the amount
of learning in the course we will focus on the final question, “The course overall as a
learning experience was excellent” [C6].8 The mean response to this question is a 4.42, but
there is high variability.
3. Empirical Methodology
A common model in the literature takes average grade (or expected grade) as a measure of
learning and relates it to average course evaluation score for an instructor in a particular
section of a course. This model often includes section level covariates and may include
observable characteristics of the instructor – for example, age, gender, or foreign-born
status.
7 Additionally, it should be noted that the estimates in this paper reflect the relationship between learning and course evaluations for the non-random subsample of students who complete course evaluations. Caution should be exercised when mapping these results to the entire student population, as students who complete course evaluations differ from those who do not.
8 Results are robust to using "Instructor was effective as a teacher overall" [B8].
We can estimate a variant of this model (Equation (1)) by utilizing student-level as
opposed to section-level information and a richer set of covariates than is typically found
in the literature.
$Eval_{ijs} = \alpha \, Grade_{ijs} + X_{ijs}' \beta_1 + \varepsilon_{ijs}$   (1)
In our Equation (1), $Eval_{ijs}$ denotes the evaluation score given by student i to instructor j in section s. $X_{ijs}'$ includes student background traits such as age, and indicators for gender, ethnicity, living on campus, first generation and low income. Also included in $X_{ijs}'$ are the following section-specific controls: term, enrollment, course evaluation response rate, withdrawal rate and percent of students repeating the class. We will also include the following student ability measures: high school GPA, SAT or ACT scores, pre-test scores, and indicators for missing SAT, ACT or pre-test scores. $Grade_{ijs}$ denotes the lecture points earned by student i enrolled under instructor j in section s. $Grade_{ijs}$ is the portion of student i's grade that the instructor has control over, and it is normalized by z-scores. The z-score is normed so that zero reflects the mean lecture points earned by students who completed an electronic course evaluation. In this model, $\alpha$ is the parameter of interest.
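As a concrete illustration of how a specification like Equation (1) might be estimated, the sketch below uses the statsmodels formula interface. The column names (eval_c6, lecture_points, completed_eval, and so on) are hypothetical stand-ins, only a subset of the controls described above is listed, and the HC1 robust-variance choice is an assumption; the tables report robust standard errors without naming a variant.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_equation_1(df: pd.DataFrame):
    """Regress the C6 evaluation score on the standardized lecture grade.

    df is assumed to hold one row per student, with hypothetical columns:
    eval_c6, lecture_points, completed_eval, age, female, first_gen,
    low_income, hs_gpa, pretest, response_rate, enrollment.
    """
    # Grade_ijs: z-score of lecture points, normed so that zero is the mean
    # among students who completed an electronic evaluation.
    base = df.loc[df["completed_eval"], "lecture_points"]
    df = df.assign(grade_z=(df["lecture_points"] - base.mean()) / base.std())

    # Equation (1): evaluation score on grade_z plus student and section
    # controls (only a few of those listed in the text are shown here).
    model = smf.ols(
        "eval_c6 ~ grade_z + age + C(female) + C(first_gen) + C(low_income)"
        " + hs_gpa + pretest + response_rate + enrollment",
        data=df,
    )
    return model.fit(cov_type="HC1")  # one heteroskedasticity-robust option
```

The coefficient on grade_z in a regression of this form corresponds to $\alpha$ in Equation (1).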
Given our previous concerns that section grades may reflect more than merely learning in the course, in Equation (2) we replace $Grade_{ijs}$ with a variable which we feel better captures student learning -- namely, $Posttest_{ijs}$. This variable is the z-score of the points earned by student i on the common post-test.

$Eval_{ijs} = \delta \, Posttest_{ijs} + X_{ijs}' \beta_2 + u_{ijs}$   (2)
Recall that the instructor does not administer or grade the post-test; thus, $Posttest_{ijs}$ is an independent, objective measure of learning. $X_{ijs}'$ is the same vector of covariates as in Equation (1). Note that controlling for pre-test scores allows us to net out the learning accumulated by students; the pre-test score is a measure of skills upon entry to, and the post-test score is a measure of skills upon exit from, the class. The estimated coefficient $\delta$ captures the change in student evaluation scores for a one-standard deviation change in our measure of learning, holding all other measurable factors constant. For comparison to the literature, we will also estimate Equation (2) using grade in the required subsequent course as a proxy for learning.
We can employ instructor fixed effects in these models as well. There is ample evidence
that instructor attributes can influence course evaluations (Felton, Koper, Mitchell &
Stinson, 2006; Fischer, 2006; McPherson, 2006). The instructor fixed effects control for
unmeasured instructor characteristics that may be correlated with both the proxies for
learning and $Eval_{ijs}$, and which thereby lead to biased estimates. In these specifications, the coefficient of interest captures the average within-instructor impact of student learning on student course evaluations.
Finally, since we have student-specific evaluation data, we can also control for additional
omitted factors by including section fixed effects (which subsumes instructor fixed
effects). While numerous papers have employed instructor fixed effects, to the best of our
knowledge no paper has been able to include section fixed effects. The section fixed
effects control for any additional section-level characteristics that may bias the previous
estimates.9 For instance, early morning sections might be associated with both lower course evaluations and reduced student learning, leading to a positive bias in the absence of explicit controls for the timing of sections. One can imagine other uncaptured section-specific features (like a class clown) leading to negative bias in the estimated effects; thus,
the overall direction of the bias is unclear. In these specifications, the coefficient of
interest captures the average within-section impact of student learning on student course
evaluations.
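To make the fixed-effects ladder concrete, the sketch below shows how Equation (2) might move from instructor to section fixed effects within the same framework. The column names (posttest_points, instructor_id, section_id) are assumptions; section-level controls are omitted from these specifications because the section dummies would absorb them.

```python
import pandas as pd
import statsmodels.formula.api as smf

def estimate_equation_2(df: pd.DataFrame):
    """Sketch of Equation (2) with instructor and then section fixed effects."""
    # Posttest_ijs: z-score of post-test points, normed on students who
    # completed an electronic evaluation (column names are assumptions).
    base = df.loc[df["completed_eval"], "posttest_points"]
    df = df.assign(posttest_z=(df["posttest_points"] - base.mean()) / base.std())

    student_controls = "age + C(female) + C(first_gen) + hs_gpa + pretest"

    # Instructor fixed effects: the coefficient on posttest_z is identified
    # from differences across students taught by the same instructor.
    fe_instructor = smf.ols(
        f"eval_c6 ~ posttest_z + {student_controls} + C(instructor_id)", data=df
    ).fit(cov_type="HC1")

    # Section fixed effects subsume the instructor effects; identification now
    # comes only from within-section differences across students.
    fe_section = smf.ols(
        f"eval_c6 ~ posttest_z + {student_controls} + C(section_id)", data=df
    ).fit(cov_type="HC1")

    return fe_instructor.params["posttest_z"], fe_section.params["posttest_z"]
```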
4. Results
Table 3 presents regression results where the dependent variable is the numerical score on
the specific question of the evaluation asking whether “the course overall as a learning
experience was excellent.” Column 1 shows the estimated impact of instructor-assigned
grade on the course evaluation score, with a rich set of student demographic and section-level controls. Students who receive higher section grades give significantly higher course
evaluations. More specifically, a one-standard deviation increase in section grade
translates into a 0.11 point increase in the course evaluation score. Column 2 adds student
ability measures, including the student’s pre-test scores. This allows us to eliminate
potential bias due to differences in students’ starting skills. Including student ability
controls does not alter the relationship between section grade and course evaluations.
9 We have estimated Equations (1) and (2) using ordered probits as well. However, we are concerned about the reliability of the standard errors when section fixed effects are included in this analysis. The results under the ordered probit model lead to a similar conclusion in that both coefficients of interest (Grade and Post-test) are positive and statistically significant. The evaluation scores are also skewed towards the right of the distribution, with only 12 percent of the students giving a score of three or less. We have also estimated all models with an indicator variable that takes a value of one if the student gave the highest evaluation score and zero otherwise. The overall pattern of results is unchanged under this specification of the dependent variable.
In this regression, as for most of the regression results that follow, the only control
variables that are statistically significant are an indicator for Asian, which enters
negatively, and the section response rate, which enters positively. There is never a statistically
significant relationship between gender, first generation college student, or student ability
measures and course evaluation scores.
Column 3 adds instructor fixed effects to the specification in Column 2. The inclusion of
instructor fixed effects controls for any time invariant instructor traits (such as grading
standards or language skills) that may impact both section grade and course evaluations.
Adding a full set of instructor fixed effects slightly reduces the magnitude of
the estimated coefficient but does not affect its statistical significance. These results are
consistent with instructor grading standards biasing slightly upward the observed
relationship between learning and course evaluations.
Column 4 adds section fixed effects thereby subsuming the instructor fixed effects. The
section fixed effects control for any additional section-level characteristics (such as time of
day the class meets or peer effects) that may affect the previous estimates. In essence these
regressions ask, “Within a given section do students who receive a higher section grade
give higher course evaluations?” Including the section fixed effects further reduces the
relationship between section grade and course evaluations. However, even after
controlling for these additional factors a robust and statistically significant relationship
persists.
Overall, the results in Columns (1-4) confirm what is found in the existing literature—
namely, that there exists a strong and positive relationship between measures of course
grades and course evaluations. An increase of one standard deviation in the section grade
(almost one full letter grade) increases the expected course evaluation score by about 1/7 of
a standard deviation, and each additional full letter grade received in the course is
associated with a 0.09 to 0.12 increase in the average course evaluation score.
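The "about 1/7 of a standard deviation" figure follows from a simple conversion of the Table 3 coefficients using the dispersion of the C6 responses reported in Table 2; a worked version is below (the use of the Column 2 point estimate is illustrative).

```python
# Convert the grade coefficient into evaluation-score standard deviations.
coef_grade = 0.118   # Table 3, Column 2: evaluation points per 1 s.d. of grade
sd_eval_c6 = 0.84    # Table 2: s.d. of the C6 "learning experience" responses

effect_in_sd = coef_grade / sd_eval_c6
print(f"{effect_in_sd:.2f} evaluation s.d. per 1 s.d. of section grade")
# prints roughly 0.14, i.e. about one-seventh of a standard deviation
```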
There are obvious concerns associated with using student grades as a proxy for learning.
Instructors who present more challenging material in a course may impart more knowledge
to students, but students may also receive below average grades because of the difficulty of
the material. Additionally, course grades arguably reflect more than just learning; they
also capture elements of effort such as class attendance and the follow through to complete
assignments.
In Columns (5-8) of Table 3 we replicate our earlier regression results, but with an
independent measure of learning instead of course grade as the key explanatory variable. In
these regressions, we use grade on the common, independently graded post-test (measured
in z-scores) as a measure of learning. Column 5 shows the association between scores on
the post-test and course evaluations with student demographics and section control
variables only. The association between performance on the post-test and instructor
evaluation scores is positive but only weakly statistically significant.
The Column 6 results include student initial abilities, including pre-test scores, which
render the coefficient of interest a measure of the impact of net learning on course
evaluation scores. This adjustment strengthens our findings in that the estimated
coefficient becomes slightly larger in magnitude and is statistically significant at the 5
percent level. The results suggest that models focusing on levels rather than gains in
learning are misspecified. A one-standard deviation increase in net learning translates into a 0.06
point increase in course evaluation score. Note that the point estimates are half the size of
those in Column 2 where section grade is used as a proxy for learning. Thus, a sizeable
portion of the relationship between grade and course evaluations reflects something other
than performance on the comprehensive post-test.
In Column 7 and Column 8 we add instructor and section fixed effects, respectively, and
the magnitude of the estimated effect is essentially unchanged, which indicates that there
are not important instructor-level or section-level omitted factors that bias our estimate of
the relationship between learning and course evaluations. The results in Column 8 imply
that if two students in the same section are otherwise identical, the student who acquires
more knowledge will, on average, give a higher course evaluation score. While these
results suggest that it is possible to infer information about knowledge acquired from
course evaluation results, a certain amount of caution is required because the estimated
relationships are very small. A one standard deviation increase in learning (as measured
by the post-test) is associated with less than one-tenth of a standard deviation increase in
course evaluation.
The results in Table 3 reveal a statistically significant relationship between grade in class
and course evaluation scores, and between learning and evaluation scores as well.
However, the estimated magnitude of the relationship between section grades and course
evaluations is almost double that of the estimated relationship between learning and course
evaluations. Arguably, grades capture something about a course and its value to students
more than the direct learning it imparts.10
A common story in the literature is that the observed association between grades and
evaluations reflects in part the fact that students reward instructors who are easy graders or
who demand less work (see, for example, Aigner and Thum (1986), Babcock
(forthcoming), Greenwald and Gillmore (1997), and Isely and Singh (2005)). However, the
change in the quantitative impact going from the results in Column 2 to those in Column 3
of Table 3 indicates that surprisingly little of the estimated cross-instructor relationship
between grades and evaluation scores is accounted for by time-invariant instructor
characteristics. This finding suggests the need for a more nuanced explanation for why
students translate high grades into high course evaluation scores, and one that rests on
students doing so within a class far more than across classes.
Clayson (2009) cites evidence that students are poor judges of their own knowledge.
Perhaps students interpret a relatively high class grade, especially when awarded in
10 When both current grade and our measure of learning are included in the model, the effect of current grade continues to dominate. In our preferred specification we find that both coefficients remain positive and statistically significant but decrease slightly in magnitude (see Column 9 of Table 3). The interpretation of this regression is unclear, however, owing to the multicollinearity induced by the fact that grades and learning are significantly interrelated. We note that learning remains statistically significant even after conditioning on current grade.
comparison to others in a specific class, as a valid measure of learning and acknowledge
this “learning” on the course evaluation form. This explanation is, of course, pure
speculation at this point, but it does not take away from our conclusion that course
evaluations do indeed capture student learning.
In Table 4, we invoke “grade in the subsequent class” as a proxy for learning, and explore
the relationship between this measure of learning and course evaluation scores. Although
in this particular case the subsequent course is required for graduation, subsequent course
grades are available for only 919 of the students in our sample.11 The smaller sample size
is due to the fact that students separate from the university before moving on to the next
class in the series, or they postpone taking the subsequent class beyond the latest period for
which we have data. We find that students who continue on to the required subsequent
class are academically stronger in terms of standardized tests and class performance.
Columns (1-4) of Table 4 replicate the results from Table 3 (Columns 5-8) with post-test as
the measure of learning for the subsample of students who continue on to the subsequent
class. The results mirror those of the full sample – there is a positive and statistically
significant relationship between learning, as measured by performance on the post-test, and
course evaluations.
The regression results with grade in the subsequent class as a measure of learning can be
found in Columns (5-8) of Table 4. The results show a slight negative, albeit statistically
11 The sample includes students who failed and retook the initial class before moving on to the subsequent class. For these students their most recent course evaluation score was used. However, we find similar results if these students are excluded from the analysis.
insignificant relationship between learning as proxied by subsequent course grade and
course evaluation scores. That is, a student’s performance in the subsequent class is
negatively related to how highly he or she evaluated the preceding course. These findings
are similar to those of other researchers who have used grade in subsequent class as a
measure of learning.12 Given the validity of our preferred measure of learning, the results
in Table 4 suggest that grade in subsequent courses is a poor proxy for learning in the
current class. If the subsequent class emphasizes different material, one might expect little
correlation between the learning in the preceding course and performance in the
subsequent course.
5. Robustness Checks
In this section, we conduct a series of robustness checks on the specification in Table 3.
We begin by limiting the sample to the 891 students who took the course in the fall of their
freshman year. More advanced freshmen might select into course sections based on
knowledge of instructor or peer characteristics. For instance, if groups of hard-working
students enroll in a class section together, this could induce an artificial relationship
between learning and instructor evaluation if conscientious students give higher
evaluations. The results for this sample are shown in Panel A of Table 5. The coefficient
estimates are very similar to those in Table 3 pointing to limited concerns about selection.
However, statistical significance is weakened, arguably due to the reduced sample size.
12 Carrell and West (2008) in a study of Air Force Academy cadets and Johnson (2003) focusing on Duke students find that course evaluations are insignificant (and often negative) predictors of follow-on student achievement. Weinberg et al. (2009) have data on economics students at the Ohio State University and show that grades in subsequent classes have a positive impact on course evaluations in the preceding class, but a statistically insignificant impact once current grades are controlled for. Yunker and Yunker (2003) focus on accounting students and find a statistically significant negative association between introductory courses with high evaluations and performance in the subsequent class.
As an additional robustness check we limit the sample to students who had an incentive to
perform well on the post-test. By university regulations, students must pass this class in
their first year. The 183 students with very high scores (above 600 points) on the section
portion of the grade may be somewhat unconcerned about their performance on the post-test; they are almost guaranteed a passing grade in the class as they only need 130 points out of 300 on their post-test in order to pass the course. We also dropped 22 students with very low scores (less than 430 points out of a total of 700 points) in the classroom portion
of the total grade. These students cannot earn enough points on the post-test to pass the
class. By focusing on students who have a greater incentive to take the post-test seriously
we reduce the measurement error associated with the use of the post-test as a measure of
learning. The results from this subsample are shown in Panel B of Table 5. The
coefficient estimates on the variable that captures learning are slightly larger, consistent
with measurement error causing attenuation bias. For instance, in the specification that
includes the full set of fixed effects (Column 4) a one standard deviation increase in
performance on the post-test is associated with a 0.07 higher score on the evaluation
instrument. This compares to a 0.06 higher score for the full sample (in Table 3).
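The Panel B sample restriction follows directly from the point system described in Section 2: with 730 of 1,000 points needed to pass and at most 300 points available on the post-test, a student with fewer than 430 lecture points cannot pass, while a student with more than 600 lecture points needs only 130 post-test points. A minimal sketch of the restriction, with an assumed column name, is below.

```python
import pandas as pd

PASS_TOTAL = 730    # points (out of 1000) required to pass the course
POSTTEST_MAX = 300  # points available on the common post-test

def margin_of_passing_sample(df: pd.DataFrame) -> pd.DataFrame:
    """Keep students whose pass/fail status still hinges on the post-test.

    'lecture_points' (0-700, the instructor-controlled portion of the grade)
    is an assumed column name.
    """
    lower = PASS_TOTAL - POSTTEST_MAX  # 430: below this, passing is impossible
    upper = 600                        # above this, only 130 of 300 points are needed
    return df.loc[df["lecture_points"].between(lower, upper)]
```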
6. Which Evaluation Questions Best Capture Learning?
Given that institutions are often limited to course evaluations as the primary measure of
course quality, it is worth investigating which elements, if any, of a typical course
evaluation form provide the most information about the learning acquired in a class.
Because the university is concerned about the knowledge gain for all students, not just
those who fill out evaluations, for this analysis we include all students in sections with
electronic evaluations, irrespective of whether or not the student completed an
evaluation.13 For all students who took the post-test, we estimate the following regression
to generate a measure of learning in a given section.
$Posttest_{ij} = \lambda_j + X_{ij}' \gamma + \varepsilon_{ij}, \qquad j = 1, \ldots, 82$   (3)
$Posttest_{ij}$ is the numerical score on the post-test for student i in section j, $\lambda_j$ is a set of section fixed effects, and $X_{ij}'$ is a vector of student characteristics and ability measures. This regression will generate 82 coefficients, $\lambda_j$, $j = 1, \ldots, 82$, each one representing the average amount of gains in the post-test acquired in each section. We refer to them as learning measures.14 The estimated coefficients on the $\lambda$'s tell us, conditional on the initial ability and demographic characteristics of the students, in which sections students learn the most as reflected by their performance on the independently graded post-test.
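A sketch of how the section learning measures and the correlations in Table 6 might be computed is given below. The column names (posttest_points, section_id, eval_c6) and the small set of controls are assumptions, and the 45 percent response-rate screen described in footnote 14 is omitted for brevity.

```python
import pandas as pd
import statsmodels.formula.api as smf

def section_learning_vs_evaluations(df: pd.DataFrame) -> float:
    """Estimate Equation (3) and correlate the lambdas with mean C6 scores."""
    # Post-test points on a full set of section dummies (no intercept) plus
    # student characteristics and ability controls; the section coefficients
    # are the ability-adjusted learning measures (the lambdas).
    fit = smf.ols(
        "posttest_points ~ C(section_id) - 1 + age + C(female) + hs_gpa + pretest",
        data=df,
    ).fit()

    sections = sorted(df["section_id"].unique())
    lambdas = pd.Series(
        [fit.params[f"C(section_id)[{s}]"] for s in sections], index=sections
    )

    # Mean section-level score on question C6, matched to the learning measure,
    # as plotted in Figure 1 and correlated in Table 6.
    mean_eval = df.groupby("section_id")["eval_c6"].mean().reindex(sections)
    return lambdas.corr(mean_eval)
```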
In Figure 1, we plot these estimated learning coefficients against the average course
evaluation scores in each section using the question (C6) that refers to the quality of the
overall class experience. The average student in the “best” section performed 25 points
better on the post-test than the average student in the “worst” section. In those sections
where more learning occurred, as measured by ability-adjusted performance on the post-test, students awarded the higher scores on question C6 (the correlation coefficient is 0.19).

13 We also conducted the analysis with only those students who filled out electronic evaluations. The overall pattern of results is similar.

14 Because we do not want the response of a few students to reflect learning for the entire class we restrict our sample to sections with response rates greater than 45 percent in this analysis. When we include all sections, the correlations are systematically weaker. For instance, the correlation coefficient is 0.16 and 0.06 for elements C6 and B8, respectively. However, the relative rankings of the elements of the course evaluation in terms of information about estimated learning remain unchanged.
While this is true on average, one can clearly see from the figure that greater knowledge
transmission does not translate into higher evaluations in every case. While there is an
apparent relationship between low course evaluations (those with average scores below 4)
and poor learning, the overall relationship between course learning and course evaluations
is weak.
Indeed, the section with the lowest course evaluation score witnessed similar
amounts of estimated learning gains as the sections with the highest course evaluations.
Thus, it would not be prudent to rely solely on course evaluations as a means of gauging
student learning.
The correlation coefficients between the estimated post-test learning measures, $\lambda$, and
mean section-level course evaluation scores for all the elements of the course evaluations
are found in Table 6. In every case, there exists a positive relationship, suggesting that
sections with high performance on the post-test do, on average, receive slightly higher
course evaluations. The questions that reflect learning in the course, as opposed to
characteristics of the instructor, seem to better reflect estimated post-test results. The
questions that solicit information about the clarity of the instructor, the use of supplemental
material and overall course experience are the strongest predictors of estimated learning
gains. Elements of the course evaluation that reflect the syllabus, the relationship between
the exams and the course material and the instructor’s respect for his students and
helpfulness convey little information about how much learning was acquired in the course
section.
7. Conclusion
This paper utilizes a measure of individual student learning not found thus far in the
literature to explore the relationship between course evaluation scores and the amount of
student learning. We have data from a unique setting in which students take a pre-test
placement exam and a post-test exit exam which is common to all students and is graded
by a team of instructors instead of the instructor of record for the course. We treat the
score on this post-test as an independent measure of student learning, conditional on the
pre-test score and other ability measures.
In addition to a new, more objective measure of student learning, we possess electronic
course evaluations which are available at the individual student level as opposed to the
section or class level. We show that students who choose to fill out course evaluations are
not representative of the overall class as they tend to be academically stronger students. To
overcome the potential econometric problems found in the larger literature which utilizes
more aggregate level data, we investigate how an individual student’s learning maps to his
course evaluation as opposed to mapping student-level learning to the average evaluation
score for his class.
Our results reveal that course evaluations do indeed reflect student learning. In
specifications that use the common post-test as a measure of learning, there is a
consistently positive and statistically significant relationship between individual student
learning and course evaluations. The relationship between learning and course
evaluations is strengthened by ability controls and is robust to the inclusion of instructor
and section fixed effects. While the estimated relationship is positive and statistically
significant, the quantitative association is in fact quite small, suggesting perhaps that it
would be prudent for institutions wishing to capture the extent of knowledge transmission
in the classroom to explore measures beyond student evaluations of the course.
In addition to an objective measure of learning, we possess, at the individual student level,
grade in the course as assigned by the course instructor, which is one of the most common
measures of learning found in the literature. When section grades are used as a proxy for
learning, we find a quantitatively stronger relationship to course evaluation scores than is
found for our more objective measure of learning. This suggests that, with regard to its
impact on course evaluations, grade in the course reflects more than mere learning. We
also possess grade in a subsequent course – another measure of learning found in the
literature. Our results indicate a negative relationship between course evaluations and
subsequent course grade. This finding likely reflects the fact that grade in the subsequent
class is a much weaker measure of learning than the post-test which reflects cumulative
learning in the current class.
Finally, we show that some questions on a typical course evaluation form better reflect
student learning than do others. The elements that explicitly ask about learning or the
clarity and effectiveness of the instructor are more strongly correlated with estimated
learning gains for the class. Questions asking about the value of the class as a whole
provide more information on learning than those that refer to instructor-student
interactions.
Acknowledgements
We gratefully acknowledge Jorge Agüero, Philip Babcock, Marc Law, and participants of
the 2010 Western International Economic Association meetings for their valuable
comments. Thanks also go to the director of the remedial program and Junelyn Peeples for
assistance in collecting the data.
References
Aigner, D. J., & Thum, F.D. (1986). On Student Evaluation of Teaching Ability. Journal
of Economic Education, 17(4), 243-265.
Babcock, P. (forthcoming). Real Costs of Nominal Grade Inflation? New Evidence from Student Course Evaluations. Economic Inquiry.
Bedard, K., & Kuhn, P. (2008). Where Class Size Really Matters: Class Size and Student Ratings of Instructor Effectiveness. Economics of Education Review, 27(3), 253-265.
Cashin, W. E. (1988). Student Ratings of Teachings: A Summary of the Research. IDEA
Paper 20. Manhattan, KS: Kansas State University, Center for Faculty Evaluation
and Development.
Cashin, W. E. (1995). Student Ratings of Teaching: The Research Revisited. IDEA Paper
32. Manhattan, KS: Kansas State University, Center for Faculty Evaluation and
Development.
Carrell, S. E., & West, J. E. (2008). Does Professor Quality Matter? Evidence from
Random Assignment of Students to Professors. Working Paper 14081. Cambridge,
MA: National Bureau of Economic Research.
Clayson, D. E. (2007). Conceptual and Statistical Problems of Using Between-Class Data
in Educational Research. Journal of Marketing Education, 29(1), 34-38.
Clayson, D. E. (2009). Student Evaluations of Teaching: Are They Related to What
Students Learn?: A Meta-Analysis and Review of the Literature. Journal of
Marketing Education, 31(1), 16-30.
Cohen, P. A. (1981). Student Ratings of Instruction and Student Achievement: A Meta-Analysis of Multi-Section Validity Studies. Review of Educational Research,
51(3), 281-309.
Dowell, D. A. & Neal J. A. (1982). A Selective Review of the Validity of Student Ratings
of Teaching. Journal of Higher Education, 53(1), 51-61.
Ewing, A. & Kochin, L. (2010). Learning, Grades, and Student Evaluations of Teaching
in an Economics Course Sequence. Mimeo.
Felton, J., Koper, P. T., Mitchell, J. B., & Stinson, M. (2006). Attractiveness, Easiness,
and Other Issues: Student Evaluations of Professors on RateMyProfessors.com.
SSRN Working Paper.
Fischer, J. D. (2006). Implications of Recent Research on Student Evaluations of Teaching.
The Montana Professor, 17: 11.
Greenwald, A. G. & Gillmore, G. M. (1997). Grading Leniency is a Removable
Contaminant of Student Ratings. American Psychologist, 52(11), 1209-1217.
Isely, P., & Singh, H. (2005). Do Higher Grades Lead to Favorable Student Evaluations?
Journal of Economic Education, 36(1), 29-42.
Johnson, V. E. (2003). Grade Inflation: A Crisis in College Education. New York:
Springer.
Kherfi, S. (2009). Course Grade and Perceived Instructor Effectiveness When the
Characteristics of Survey Respondents are Observable. American University of Sharjah Working Paper.
McPherson, M. A. (2006). Determinants of How Students Evaluate Teachers.
Journal of Economic Education, 37(1), 3-20.
Palmer, J., Carliner, G., & Romer, T. (1978). Leniency, Learning and Evaluations.
Journal of Educational Psychology, 70(5), 855-863.
Seldin, P. (1993). The Use and Abuse of Student Ratings of Professors. Chronicle of Higher Education, 39(4), A40.
Weinberg, B. A., Fleisher, B. M., & Hashimoto, M. (2009). Evaluating Teaching in
Higher Education. Journal of Economic Education, 40(3), 227-261.
Yunker, P. J., & Yunker, J. A. (2003). Are Student Evaluations of Teaching Valid?
Evidence from an Analytical Business Core Course. Journal of Education for
Business, 78(6), 313-317.
Figure 1. Estimated Learning and Evaluation Instrument [C6]. (Section-level scatter plot: mean evaluation score on question C6, ranging from 3 to 5, plotted against estimated learning, lambda, ranging from roughly -12 to 12, with a fitted line.)
Table 1. Descriptive Statistics.

Columns (4)-(6) are restricted to sections whose faculty chose electronic evaluation (Eval).

| | (1) Faculty Did Not Choose Eval, Mean (S.D.) | (2) Faculty Chose Eval, Mean (S.D.) | (3) t-test p-value | (4) Student Did Not Fill Out Eval, Mean (S.D.) | (5) Student Filled Out Eval, Mean (S.D.) | (6) t-test p-value |
|---|---|---|---|---|---|---|
| Lecture 700 | 540 (63.59) | 547 (56.43) | 0.000 | 535 (61.66) | 553 (52.83) | 0.000 |
| Post-test (300) | 224 (17.91) | 226 (16.54) | 0.000 | 225 (17.03) | 227 (16.27) | 0.018 |
| Total Points | 764 (69.58) | 773 (61.69) | 0.000 | 760 (66.33) | 780 (58.30) | 0.000 |
| Took Next Class | 0.85 (0.36) | 0.85 (0.36) | 0.790 | 0.82 (0.38) | 0.86 (0.35) | 0.053 |
| Subsequent Grade | 2.79 (0.76) | 2.81 (0.76) | 0.465 | 2.66 (0.86) | 2.88 (0.69) | 0.000 |
| Pre-Test | 5.44 (0.81) | 5.45 (0.85) | 0.635 | 5.47 (0.81) | 5.44 (0.86) | 0.6067 |
| SAT Verbal | 455 (71.12) | 462 (68.08) | 0.001 | 470 (66.21) | 458 (68.62) | 0.000 |
| SAT Writing | 460 (67.54) | 466 (65.88) | 0.006 | 470 (64.48) | 464 (66.48) | 0.078 |
| SAT Math | 508 (95.40) | 512 (96.96) | 0.215 | 527 (95.92) | 505 (96.74) | 0.000 |
| ACT | 19 (3.32) | 20 (3.44) | 0.001 | 20 (3.30) | 20 (3.50) | 0.585 |
| High School GPA | 3.35 (0.31) | 3.40 (0.31) | 0.000 | 3.35 (0.29) | 3.42 (0.32) | 0.000 |
| Female | 0.54 (0.50) | 0.54 (0.50) | 0.972 | 0.47 (0.50) | 0.58 (0.49) | 0.000 |
| African American | 0.07 (0.25) | 0.08 (0.27) | 0.225 | 0.08 (0.27) | 0.08 (0.27) | 0.699 |
| Hispanic | 0.36 (0.48) | 0.36 (0.48) | 0.710 | 0.33 (0.47) | 0.37 (0.48) | 0.130 |
| Asian | 0.45 (0.50) | 0.45 (0.50) | 0.934 | 0.46 (0.50) | 0.44 (0.50) | 0.538 |
| Low Income | 0.53 (0.50) | 0.51 (0.50) | 0.166 | 0.47 (0.50) | 0.52 (0.50) | 0.046 |
| First Generation | 0.61 (0.49) | 0.61 (0.49) | 0.991 | 0.58 (0.49) | 0.62 (0.48) | 0.074 |
| On Campus | 0.69 (0.46) | 0.67 (0.47) | 0.155 | 0.70 (0.46) | 0.66 (0.47) | 0.127 |
| Repeating Class | 0.15 (0.36) | 0.12 (0.32) | 0.001 | 0.16 (0.37) | 0.09 (0.29) | 0.000 |
| Section Enrollment | 20 (1.79) | 20 (1.59) | 0.005 | 20 (1.61) | 20 (1.57) | 0.001 |
| Number of Sections | 149 | 97 | | | | |
| Number of Instructors | 77 | 33 | | | | |
| Observations | 2766 | 1634 | | 528 | 1106 | |
| Unique Observations | 2682 | 1611 | | 519 | 1092 | |

Notes: Difference in means statistically significant at 5% level in bold. Standard deviations in parentheses. Eval refers to electronic evaluation. In the sample of students whose instructor did (not) choose electronic evaluations there are 39 (32) students with missing SAT Math scores and 994 (1814) students with ACT scores. In the sample of students who did (not) fill out electronic evaluations there are 27 (12) students with missing SAT Math scores and 646 (348) students with ACT scores. Total Points is the sum of Lecture 700 and Post-test (300). Pre-test score ranges from 1 to 6. Low Income is an indicator variable for parents' income less than or equal to $30,000. First Generation implies that neither of the student's parents completed university. On Campus refers to students who reside in campus housing. Unique Observations refers to the number of students who did not retake the class.
Table 2. Descriptive Statistics about the Evaluation Instrument.

| Instrument | Evaluation Question | N | Mean | S.D. |
|---|---|---|---|---|
| B1 | Instructor was prepared and organized. | 1117 | 4.65 | (0.66) |
| B2 | Instructor used class time effectively. | 1114 | 4.64 | (0.71) |
| B3 | Instructor was clear and understandable. | 1116 | 4.65 | (0.67) |
| B4 | Instructor exhibited enthusiasm for subject and teaching. | 1118 | 4.64 | (0.73) |
| B5 | Instructors respected students; sensitive to and concerned with their progress. | 1116 | 4.65 | (0.73) |
| B6 | Instructor was available and helpful. | 1117 | 4.66 | (0.73) |
| B7 | Instructor was fair in evaluating students. | 1117 | 4.53 | (0.83) |
| B8 | Instructor was effective as a teacher overall. | 1116 | 4.61 | (0.73) |
| C1 | The syllabus clearly explained the structure of the courses. | 1114 | 4.61 | (0.71) |
| C2 | The exams reflected the materials covered during the course. | 1112 | 4.59 | (0.70) |
| C3 | The required readings contributed to my learning. | 1112 | 4.50 | (0.79) |
| C4 | The assignments contributed to my learning. | 1111 | 4.54 | (0.76) |
| C5 | Supplementary materials (e.g. films, slides, videos, guest lectures, web pages, etc.) were informative. | 1106 | 4.34 | (0.95) |
| C6 | The course overall as a learning experience was excellent. | 1106 | 4.42 | (0.84) |

Notes: Standard deviation in parentheses. Evaluation score ranges from 1 (strongly disagree) to 5 (strongly agree). "N/A" responses are excluded.
Table 3. Determinants of "The course overall as a learning experience was excellent." [C6]

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) | (9) |
|---|---|---|---|---|---|---|---|---|---|
| Lecture 700 (a) | 0.113*** (0.030) | 0.118*** (0.031) | 0.104*** (0.037) | 0.093** (0.038) | | | | | 0.084** (0.037) |
| Post-test 300 (a) | | | | | 0.054* (0.027) | 0.064** (0.029) | 0.062** (0.028) | 0.065** (0.029) | 0.055* (0.029) |
| Student Ability (b) | | Yes | Yes | Yes | | Yes | Yes | Yes | Yes |
| Section Characteristics (c) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Instructor Fixed Effects | | | Yes | | | | Yes | | |
| Section Fixed Effects | | | | Yes | | | | Yes | Yes |
| Observations | 1106 | 1106 | 1106 | 1106 | 1106 | 1106 | 1106 | 1106 | 1106 |
| Adjusted R² | 0.047 | 0.059 | 0.088 | 0.192 | 0.046 | 0.059 | 0.083 | 0.093 | 0.096 |

Notes: Robust standard errors in parentheses. * significant at 10%; ** significant at 5%; *** significant at 1%.
a. Lecture 700 and Post-test 300 represent z-scores of the lecture grade and post-test, respectively.
b. Student ability includes: cumulative high school GPA, placement score, SAT Verbal, SAT I Writing, ACT, and indicators for missing SAT, ACT or placement score.
c. Section characteristics include student's age, and indicators for gender, ethnicity, living on campus, first generation, low income, term, enrollment, response rate, withdrawal rate, and percent of students repeating the class.
Table 4. Using Subsequent Grade as a Measure of Learning. (a)
Dependent variable: "The course overall as a learning experience was excellent."

| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| Post-test 300 | 0.047* (0.030) | 0.056* (0.031) | 0.057* (0.031) | 0.060** (0.032) | | | | |
| Subsequent Grade (b) | | | | | -0.018 (0.037) | -0.024 (0.037) | -0.004 (0.037) | -0.020 (0.039) |
| Student Ability | | Yes | Yes | Yes | | Yes | Yes | Yes |
| Section Characteristics | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| Instructor Fixed Effects | | | Yes | | | | Yes | |
| Section Fixed Effects | | | | Yes | | | | Yes |
| Observations | 919 | 919 | 919 | 919 | 919 | 919 | 919 | 919 |
| Adjusted R² | 0.050 | 0.052 | 0.078 | 0.083 | 0.047 | 0.048 | 0.075 | 0.078 |

Notes: Robust standard errors in parentheses. * significant at 10%; ** significant at 5%; *** significant at 1%. See notes to Table 3 for variable descriptions and a list of control variables.
a. Sample is restricted to students who took the subsequent class.
b. Subsequent grade is the grade on a 4-point scale that the student received in the required subsequent class.
Table 5. Robustness Checks on the Determinants of "The course overall as a learning experience was excellent." [C6]

| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Panel A. Fall Only (a) | | | | |
| Post-test 300 | 0.051* (0.029) | 0.062** (0.031) | 0.055* (0.030) | 0.058* (0.031) |
| Observations | 891 | 891 | 891 | 891 |
| Adjusted R² | 0.043 | 0.044 | 0.079 | 0.096 |
| Panel B. Restricted to Students on the Margin of Passing the Class (b) | | | | |
| Post-test 300 | 0.054* (0.032) | 0.071** (0.034) | 0.073** (0.034) | 0.074** (0.035) |
| Observations | 906 | 906 | 906 | 906 |
| Adjusted R² | 0.057 | 0.060 | 0.099 | 0.111 |
| Student Ability | | Yes | Yes | Yes |
| Section Characteristics | Yes | Yes | Yes | Yes |
| Instructor Fixed Effects | | | Yes | |
| Section Fixed Effects | | | | Yes |

Notes: Robust standard errors in parentheses. * significant at 10%; ** significant at 5%; *** significant at 1%. The dependent variable is a numerical score on the response to the question, "The course overall as a learning experience was excellent." [C6] See notes to Table 3 for variable descriptions and a list of control variables.
a. Sample restricted to students who enrolled in the class fall term of their freshman year.
b. Sample restricted to students whose lecture score was between 430 and 600.
Table 6. Validity of the Evaluation Questions.

| | Evaluation Question | Correlation Coefficient |
|---|---|---|
| | Instructor Elements | |
| B1 | Instructor was prepared and organized | 0.14 |
| B2 | Instructor used class time effectively | 0.18 |
| B3 | Instructor was clear and understandable | 0.24 |
| B4 | Instructor exhibited enthusiasm for subject and teaching | 0.12 |
| B5 | Instructor respected students; sensitive to and concerned with their progress | 0.09 |
| B6 | Instructor was available and helpful | 0.10 |
| B7 | Instructor was fair in evaluating students | 0.15 |
| B8 | Instructor was effective as a teacher overall | 0.16 |
| | Course Elements | |
| C1 | The syllabus clearly explained the structure of the courses | 0.04 |
| C2 | The exams reflected the materials covered during the course | 0.08 |
| C3 | The required readings contributed to my learning | 0.11 |
| C4 | The assignments contributed to my learning | 0.16 |
| C5 | Supplementary materials (e.g. films, slides, videos, guest lectures, web pages, etc) were informative | 0.26 |
| C6 | The course overall as a learning experience was excellent | 0.19 |