Committee Report to UNCW Faculty Senate
2010 Ad Hoc Faculty Senate Committee on SPOT Use for RTP
Decisions and Process
Draft, 11/10/2010. Not to be quoted.
Committee Members: Prof. Craig Galbraith (Management; Committee Chair; UNCW RTP Committee member and chairperson, 2005-2010); Prof. John Fischetti (Education; current UNCW RTP Committee member and chairperson); Prof. Barry Wray (Information Systems and Operations Management); Prof. John Taggart (Environmental Studies); Prof. Regina Felix (Foreign Languages and Literatures); Prof. Susan Roberts (Clinical Research, Nursing); Prof. Yaw Chang (Mathematics and Statistics; former member of UNCW RTP Committee)
Charge of Committee
The charge of the committee was:
1) To investigate and report on what empirical research has found regarding the validity of SETEs in measuring teaching effectiveness (if they are not valid, or of low validity, should UNCW be using SETEs in RTP, etc.).
2) To determine what UNCW and UNC documents actually say about using SETEs for purposes of RTP.
3) To determine what should be used to measure teaching effectiveness if SETEs are not valid measures of teaching effectiveness.
4) To provide information about how other universities have attempted to resolve these issues.
5) To develop recommendations for the UNCW Faculty Senate to discuss and possibly vote on.
The committee was purposely composed of tenured (both Associate and Full ranks) and
untenured faculty members, from different departments and Schools at UNCW. The committee
met approximately every two weeks during Fall Semester, 2010.
Findings of Committee: Charge 1
Charge 1: The basic issue investigated by our committee was stated by Wilbert J. McKeachie in
an article for the American Council of Learned Societies in 1996. McKeachie states that, “if
student ratings are part of the data used in personnel decisions, one must have convincing
evidence that they add valid evidence of teaching effectiveness” (McKeachie, 1996, p. 3).
Conclusion of Committee. Recent empirical research indicates that quantified student
evaluation of teaching effectiveness (SETEs) such as UNCW SPOTS are invalid measures
of teaching effectiveness (very low validity) given the high validity standards that should be
expected for purposes of RTP and personnel decisions. This is particularly true for commonly used global measures, such as UNCW Q16.
Support for Conclusion. Based upon current research (see the summary of empirical studies at the end of this report), it is evident that:
a) SETEs have low criterion validity as a measure of teaching effectiveness. In well controlled, multi-section studies, SETEs explain only a relatively small percentage (between 4% and 20%, with many studies indicating less than 9.0%) of the variation in objective measures of teaching effectiveness, such as comprehensive exam performance and performance on standardized student outcome measures (see the illustrative calculation following this list). The vast majority of these studies use both a global measure of SETEs (like UNCW Q16) and a multi-variable average of SETE questions (like summing UNCW Q1 to Q15). Research indicates that different items in the SETEs correlate differently with student learning, and that the global measure (like UNCW Q16) generally has one of the lower correlations.
b) Most of the arguments for using quantitative SETEs (studies that result in higher correlations with student learning) come from early (1975 to 1983) studies of introductory first-year psychology and similar courses that required a common lesson plan and were taught by different graduate students, where one might expect instructor "likability" to better hold the attention of students (as shown in high school studies). More recent studies of courses where instructors have more control over course content (more advanced courses, elective courses, and graduate courses) show a much lower correlation between SETEs and student learning (around 6% of variance explained, with some negative correlations). This suggests that the level of criterion validity of SETEs changes dramatically between levels of instruction.
c) There are very few empirical studies that examine the correlation of SETEs across different classes and departments using a standardized measure of student learning. This is important since most universities compare faculty SETEs across courses, departments, and Schools (quintiles, averages, etc.) for purposes of RTP. The empirical studies that do examine cross-department and cross-course validity of SETEs indicate that around 4% to 5% of student learning is captured by SETEs. This level of explained variance is considered a LOW level of statistical validity (see Cohen, 1981).
d) Recent empirical studies have indicated that SETEs have problems with other types of validity, such as substantive validity and consequential validity. For example, Dowell and Neal (1983) observed early in this debate, "student ratings are inaccurate indicators of student learning and they are best regarded as indices of 'consumer satisfaction' rather than teaching effectiveness" (Dowell & Neal, 1983, p. 462).
e) Recent empirical studies have indicated that SETEs can be significantly manipulated by the individual strategies of the instructor, such as providing treats and cookies during the time SETEs are filled out. There are even websites that suggest ways to manipulate student evaluations. For example, the Chronicle of Higher Education has a blog post titled "Tricks for Boosting Student Evaluations" dedicated to faculty reporting how to manipulate SETEs (http://chronicle.com/blogs/onhiring/tricks-for-boosting-student-evaluations/22033).
f) Recent empirical studies have shown that SETEs can be significantly influenced by the racial and cultural biases of students.
g) Recent empirical studies have shown that SETEs are significantly correlated with personal "likability" characteristics of the instructor, such as attractiveness, charisma, sexiness, and age.
h) Empirical research has shown that SETE results vary depending on whether a course is delivered online or in a traditional, face-to-face format.
i) Empirical research has shown that students completing an online version of a SETE will give different scores than those completing a pencil-and-paper SETE.
j) Empirical research has indicated a "recency effect" or "memory decay" in student evaluations; that is, students will rate an instructor differently on some items during the class period versus two weeks later (ratings generally decrease over time for some items).
k) Some empirical studies have shown that SETEs are positively correlated with grades, thus leading to grade inflation issues. This is called the "grading leniency" hypothesis. Some studies, however, have found no correlation with grades.
l) Recent empirical studies have found that SETEs are negatively correlated with "deep learning"; that is, students who have faculty members with high SETEs perform worse in subsequent, more advanced classes.
m) Recent empirical studies have indicated that the relationship between SETEs and actual student learning is most likely negatively bi-modal; that is, faculty members with average SETEs are the most effective, while faculty members with very high (and very low) SETEs are actually associated with LOWER levels of student learning.
n) Untenured faculty members on the SPOT committee seemed most impacted by SPOT results, and several indicated that they would teach their courses "how it should be properly taught" if SPOTs were eliminated or deemphasized. Several senior members indicated that SPOTs affected their grading (higher grades given), while other senior members indicated that SPOTs had no effect.
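To make the "variance explained" figures in items (a) through (c) concrete, the short calculation below (added to this report purely as an illustration; the correlation values are hypothetical, not taken from any cited study) shows how a correlation coefficient r between SETE scores and an objective learning measure converts to variance explained (r squared) and how that maps onto Cohen's conventional effect-size labels, discussed in the footnote to the research summary later in this report.

```python
# Illustrative only: converting a correlation (r) between SETE scores and an
# objective learning measure into "variance explained" (r squared), and labeling
# it with the conventional effect-size thresholds discussed by Cohen (1969, 1992):
# roughly r = 0.10 small, r = 0.30 medium, r = 0.50 large.

def variance_explained(r: float) -> float:
    """Proportion of variance in the learning measure accounted for by SETEs."""
    return r ** 2

def effect_size_label(r: float) -> str:
    """Rough label using Cohen's conventional thresholds."""
    r = abs(r)
    if r < 0.30:
        return "small (weak) effect"
    if r < 0.50:
        return "medium effect"
    return "large effect"

# Hypothetical correlations spanning the 4%-20% variance-explained range cited above.
for r in (0.10, 0.20, 0.30, 0.45):
    print(f"r = {r:.2f} -> {variance_explained(r):.1%} of variance explained ({effect_size_label(r)})")
```

For example, even the upper end of the reported range (about 20% of variance explained) corresponds to a correlation of only about 0.45, and values below 9% correspond to correlations below the conventional r = 0.30 "medium" threshold.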
Findings of Committee: Charge 2
Charge 2: Current UNCW and UNC Policy regarding SETEs.
Conclusion of Committee. Under UNC Policy, some student evaluation of teaching is required to be used for RTP and personnel decisions; however, a) there is no formal mandate that quantitative SETE data be used (it could be qualitative) or that a global measure (Q16) be used, and b) the weighting of student evaluation versus peer evaluation is up to the faculty and institution (student evaluation could be weighted very low, equal, or very high). In addition, from the 1993 report it is clear that SETEs should NOT simply be compared with other faculty, departments, or schools (averages, quintiles, etc.) and that SETEs are most effectively used to compare changes over time.
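To illustrate that last point, the sketch below contrasts "comparing changes over time" for a single instructor with the cross-sectional averaging the 1993 report cautions against. It is a hypothetical illustration added to this report: the semester labels, item numbers, scores, and 1-5 scale are assumptions, not an existing UNCW report format.

```python
# Hypothetical illustration of "comparing changes over time" for a single instructor,
# rather than comparing the instructor against department or university averages.
# Semester labels, item numbers, scores, and the 1-5 scale are all assumed for this sketch.
instructor_spot_history = {
    "Fall 2008":   {"Q1": 3.9, "Q7": 4.1, "Q15": 3.8},
    "Spring 2009": {"Q1": 4.0, "Q7": 4.2, "Q15": 4.0},
    "Fall 2009":   {"Q1": 4.2, "Q7": 4.3, "Q15": 4.1},
}

for item in ("Q1", "Q7", "Q15"):
    trend = [semester[item] for semester in instructor_spot_history.values()]
    print(f"{item}: {trend} (change over period: {trend[-1] - trend[0]:+.1f})")
```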
Support for Conclusion. Excerpts from UNCW Policy and UNC Policy are provided.
UNCW Policy
Relevant sections of the UNCW Faculty Handbook:
a) The SPOT guidelines document states, "Because numerous studies have indicated that
both peer and student evaluations are necessary for the equitable evaluation of teaching
effectiveness, it is strongly suggested that peer and student evaluations be given similar
emphasis in personnel recommendations."
b) Every faculty member is evaluated by students every semester in all courses (including
summer courses) using the institution-wide Student Perceptions of Teaching (SPOT)
questionnaire. This questionnaire and the instructions for administering it were
developed by the Faculty Senate. The use of the SPOT is mandatory, although some
departments also have additional student evaluation questions which are appended to the
SPOT. The department chairperson or appropriate supervisor receives the results of the
SPOT from the Office of Academic Computing and shares them with the faculty member
every semester. SPOT results are considered, along with other measures and instruments
of evaluation, in annual evaluation; in evaluation for reappointment, promotion, and
tenure; and in post-tenure review.
Relevant sections of the UNCW SPOT Guidelines document (linked in the Faculty Handbook):
a) "It is strongly suggested that peer and student evaluations be given similar emphasis in personnel recommendations."
b) "Individual SPOT results, when combined with qualitative interpretation by the department chairperson and with peer evaluations of teaching, can contribute to measuring an individual's teaching effectiveness and to identification of areas of strength and areas where improvement is possible. Under those conditions, SPOT results are appropriately used for annual merit evaluation summaries, consideration for salary raises, RTP, and post-tenure-review decisions."
UNC Policy
UNC Policy regarding evaluation of teaching has two parts: Section 400.3 and the 1993 policy committee document referenced in Section 400.3.
Section 400.3
September 1993 [This policy has been published in a separate pamphlet, copies of which
are available through General Administration, Office of the Secretary.]
Section 400.3.1.1
Introduction
At the November 1992 meeting of the Board of Governors, questions were raised about
the procedures and criteria for the awarding of tenure and about the evaluation,
recognition, and reward of teaching, particularly in tenure decisions. The chairman of
the board referred the questions and concerns to two standing committees, the Committee
on Personnel and Tenure and the Committee on Educational Planning, Policies, and
Programs. The report entitled Tenure and Teaching in the University of North Carolina,
adopted by the board on September 10, 1993, distilled what was learned by the
committees and recommended additional steps to encourage good teaching within the
university and to see that the quality of teaching continues to be a prime consideration in
tenure decisions.
1. That the Board of Governors, through the President of the University, instruct the Chancellors of each constituent institution to do the following:
c. Review procedures for the evaluation of faculty performance to ensure (1) that student
evaluations and formal methods of peer review are included in teaching evaluation
procedures, (2) that student evaluations are conducted at regular intervals (at least one
semester each year) and on an ongoing basis, (3) that peer review of faculty includes
direct observation of the classroom teaching of new and non-tenured faculty and of
graduate teaching assistants, and (4) that appropriate and timely feedback from
evaluations of performance is provided to those persons being reviewed.
1993 Policy Pamphlet (this document contains the most information regarding teaching evaluation, is incorporated as policy under Section 400.3, and is the basis for the short set of recommendations and instructions under Section 400.3.1.1). The report discussed student evaluations, peer evaluations, and self-evaluation. Key points are:
For Student Evaluations
For Peer Evaluation
For Self-Evaluation of Teaching
Findings of Committee: Charge 3
Charge 3: If SETEs are not valid measures of teaching effectiveness, what should be used to
measure teaching effectiveness?
Conclusion of Committee. The Committee felt that some of the information in the UNCW SPOT form, Q1 to Q15, is useful to instructors for improving their courses. In addition, the written comments were found to be useful. However, there was concern that the format for the written comments is not uniform between departments (nor are there consistent methods to summarize and report the qualitative data), and therefore it would be difficult to compare written student information between departments for RTP purposes under the current process. In addition, the committee investigated whether there are empirical studies that correlate peer evaluations with student learning. While there are several studies that report surveys of faculty opinions on this issue, the committee could not find any well controlled studies that correlated peer evaluations with student learning. Thus the validity issue for peer evaluations remains unknown. Surveys have noted that as SETEs became more dominant, peer evaluations and classroom visits in universities have decreased. Faculty surveys also indicate a general opinion that the widespread use of SETEs, with the subsequent decline of peer evaluations, has led to a decline in education standards (Becker and Watts, 1999).
Findings of Committee: Charge 4
Charge 4: Provide information about other universities' efforts to resolve these issues.
Conclusion of Committee. The Committee examined what other universities are doing.
The Committee found that some universities have reaffirmed SETEs for RTP and
personnel decisions while other universities are limiting the use of SETEs for RTP and
personnel decisions. Many universities also have a specific policy that faculty members have an absolute right to provide comments or explanations regarding the results of SETEs for purposes of RTP. Many universities appear to be using the IDEA Center's SETE system (http://www.theideacenter.org/node/5). The IDEA Center provides a student rating system that incorporates reporting flexibility with department standards. The IDEA method provides support for some aspects of validity and reliability; however, it has not been tested for other aspects of validity, such as correlations with an independent measure of teaching effectiveness like standardized measures of student learning outcomes.
Support for Conclusion. For example:
The University of Minnesota student evaluation does not contain any global question similar to UNCW Q16; policy requires that faculty be able to formally respond to SETE results and that SETEs be used together with peer evaluations.
http://policy.umn.edu/Policies/Education/Education/TEACHINGEVALUATION.html
The University of Wisconsin System's policy states that student evaluation should not substitute for direct peer judgment. The policy also acknowledges that the validity issue is still unknown and that, as more research is done, the policy about using SETEs (or its wording) may change. To quote from the manual,
“Student evaluation of instruction as information used in actions on promotion, retention, or the
awarding of tenure. Each University of Wisconsin System Institution shall adopt such policies
for instructional faculty as will insure: (a) that student evaluation of the instruction of each
faculty member being considered for promotion or tenure shall be undertaken; (b) that the faculty
body which initiates recommendations for promotion or tenure shall consider, in addition to
independent peer judgment of teaching effectiveness, student evaluation data, taking into account
existing limitations in validity and reliability of the evaluation methodology employed. . . “
The UCLA website indicates that global questions are the most likely to reflect personal bias (under the section "Interpreting Quantitative Student Evaluations for Personnel Decisions"):
http://www.oid.ucla.edu/publications/evalofinstruction/eval2#5
Southeast Missouri State gives clear guidelines for how to use student evaluations for tenure and promotion. They use the IDEA as the instrument, and have very specific guidelines about interpretation, statistical significance, class size, and comparison with questions that the department has defined as being important. The process appears to be driven by individual faculty and departments rather than a university-wide form. They specifically state in their guidelines that SETEs should not count for more than 25-33% of a measure of teaching effectiveness. To quote, "Student evaluation of instruction is an anonymous process and is not always compatible with academic rigor. The IDEA Center stresses that student evaluation of instruction should make up no more than 25-33% of the measure of teaching effectiveness."
Recommendations of Committee: Charge 5
Charge 5: Develop recommendations for the UNCW Faculty Senate to discuss and possibly vote on. The following recommendations reflect the consensus of the Committee, based upon its review of the empirical research and of other university processes.
Recommendation 1: Eliminate the global question (Q16) from UNCW SPOTs.
Recommendation 2: For RTP, Q1 to Q15 should not be reported with any department, school, or university averages, quintiles, or categorical statements (such as above average, average, etc.).
Recommendation 3: Q1 to Q15 be reported for RTP only as a comparison over time for that
particular instructor.
Recommendation 4: That some questions on the SETEs be tailored to specific departmental
missions and expectations.
Recommendation 5: That quantified SETEs (UNCW SPOTs) not be weighted more than 25% to 33% of the assessment of teaching effectiveness for purposes of RTP (see the illustrative sketch following this list of recommendations).
Recommendation 6: That the qualitative/written comments from student evaluations be more systematically administered and reported in order for them to be used more effectively in RTP decisions.
Recommendation 7: That the individual faculty member has a right to formally comment,
explain, or respond to student evaluations (either quantitative or qualitative/written
comments) for purposes of RTP, and that these comments, explanations, or responses be
formally included as part of the SETE reporting process.
Recommendation 8: That the process of peer evaluation of teaching be made more uniform across departments and Schools within UNCW in order for it to be used more effectively in RTP decisions.
Recommendation 9: That peer evaluations be weighted at least equally with student evaluations of teaching (quantified questions and the student written comment section) for purposes of RTP.
Recommendation 10: That UNCW investigate using the IDEA Center's evaluation system for student evaluation (http://www.theideacenter.org/node/5). However, no global question should be used, and any quantified SETE process should never be weighted more than 25% to 33% as an indicator of faculty teaching effectiveness.
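As an illustration of how Recommendations 5 and 9 could be operationalized, the sketch below combines component scores into a single teaching-effectiveness assessment with the SETE weight capped at 33% and peer evaluation weighted at least equally. The component names, the 0-1 score scale, and the example weights are assumptions made for this sketch; they are not part of any existing UNCW policy or form.

```python
# Illustrative sketch of the weighting constraints in Recommendations 5 and 9.
# Component names, the 0-1 score scale, and the example weights are hypothetical.

def teaching_effectiveness(sete: float, peer: float, other: float,
                           w_sete: float = 0.25, w_peer: float = 0.40) -> float:
    """Weighted composite of normalized (0-1) component scores."""
    if w_sete > 0.33:
        raise ValueError("Quantified SETEs may not exceed 25-33% of the assessment (Rec. 5)")
    if w_peer < w_sete:
        raise ValueError("Peer evaluation must be weighted at least equal to SETEs (Rec. 9)")
    w_other = 1.0 - w_sete - w_peer  # e.g., self-evaluation, syllabi, course materials
    return w_sete * sete + w_peer * peer + w_other * other

# Example: SETEs at 25%, peer review at 40%, other evidence of teaching at 35%.
print(round(teaching_effectiveness(sete=0.70, peer=0.85, other=0.80), 3))
```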
Short History of Student Evaluations of Teaching Effectiveness (SETEs)
Student evaluations of teaching effectiveness (SETEs) are one of the most highly debated
aspects of modern university life; they also remain one of the most researched topics in the
literature. While the very early history of SETEs remains somewhat uncertain, it is generally
agreed that Herman Remmers' research at Purdue University in the 1920s and 1930s pushed SETEs into the mainstream (e.g., Remmers & Brandenburg, 1927). By the late 1940s, SETEs
were being collected at numerous top universities, including Harvard, Purdue and the University
of Washington. Responding to the rise of student activism in the 1960s and early 1970s, SETEs
quickly became a norm for many universities (McKeachie, 1979, 1996).
While originally implemented to provide student feedback in order to improve teaching, since the 1970s SETEs have become increasingly prevalent in faculty personnel decisions.
Summary of Empirical Studies Investigating Validity Issues for SETEs
During the 1970s and 1980s a number of empirical studies were published that tested the
validity of SETEs. Most often these studies used multiple sections of the same course (typically an introductory course, often taught by graduate students) and correlated SETEs with some measure of student achievement, such as a common final exam.
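To make the multi-section design concrete, the sketch below (added to this report; the section-level numbers are entirely hypothetical, not taken from any cited study) carries out the basic computation used in these studies: correlate section-level mean SETE ratings with section-level mean scores on a common final exam, then square the correlation to obtain the variance explained.

```python
# Illustrative multi-section validity check (hypothetical data, not from any cited study):
# correlate section-level mean SETE ratings with section-level mean scores on a
# common final exam, then report the variance explained (r squared).
from statistics import mean, stdev

# Hypothetical per-section data: (mean SETE rating on a 1-5 scale, mean exam score in %).
sections = [(4.2, 78.0), (3.6, 74.5), (4.8, 76.0), (3.1, 71.0),
            (4.5, 80.5), (3.9, 79.0), (2.8, 73.5), (4.0, 75.0)]

sete_means = [s for s, _ in sections]
exam_means = [e for _, e in sections]
m_s, m_e = mean(sete_means), mean(exam_means)

# Sample Pearson correlation between section mean SETE and section mean exam score.
cov = sum((s - m_s) * (e - m_e) for s, e in sections) / (len(sections) - 1)
r = cov / (stdev(sete_means) * stdev(exam_means))

print(f"r = {r:.2f}; variance explained = {r ** 2:.1%}")
```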
Three important conclusions can be drawn from these early multi-section comparison studies: 1) a relatively low amount of statistical variation in independent and objective measures of teaching effectiveness is explained by SETEs -- depending on the meta-analysis study, between about 4% and 20% (Cohen, 1982, 1983; Dowell & Neal, 1982, 1983; McCallum, 1984) -- with the majority falling into the "weak" category of scale criterion validity suggested by Cohen (1969, 1981) [1]; 2) due to the vast differences in results, including some negative correlations,
there was a common call for continued investigation into the fundamental student rating-learning
outcome link (e.g., Dowell & Neal, 1983). For example, a large percentage of prior research
relies upon data from introductory college courses, taught with textbook-created lesson plans, by graduate students. But as Taylor (2008) notes, an individual instructor's ability to influence course content and learning is most likely to occur in more advanced and elective courses; and 3) there was a growing recognition that the incautious use of SETEs for faculty performance evaluations was fundamentally changing the teaching focus of higher education away from the transmission of knowledge, where society is viewed as the "customer", to a marketing model where the faculty member is viewed more as a salesperson.
As Dowell and Neal (1983) observed early in this
debate, "student ratings are inaccurate indicators of student learning and they are best regarded as
indices of 'consumer satisfaction' rather than teaching effectiveness" (Dowell & Neal, 1983, p.
462).
In spite of these early conclusions, empirical research measuring the actual relationship
between SETEs and student learning outcomes essentially ceased by the mid-1980s, leaving the
criterion validity issue open to vast differences of interpretation. Since this time the majority of
empirical research investigating the “validity” and "reliability" issues of SETEs has shifted more
toward the dimensionality problem of SETEs, including the number and stability of the different
dimensions, as well as the substantive validity and consequential validity of SETEs within
various contexts.
Recent SETE design studies have almost always used a factor-analytic approach with data gathered from survey methodologies or focus groups (e.g., Barnes et al., 2008; Burdsal & Harrison, 2008; Hassan, 2009; Spooren & Mortelmans, 2006) rather than any independent measure of actual learning outcomes or teaching effectiveness.

[1] Cohen (1969) refers to r = 0.10 (1.0% variance explained) as a small effect, r = 0.30 (9.0% variance explained) as a medium effect, and r = 0.50 (25.0% variance explained) as a large effect (see Cohen, 1992). Many researchers have inferred that r < 0.30 (less than 9% variance explained) signifies a "small" effect for purposes of testing scale validity (e.g., Barrett et al., 2009; Hon et al., 2010; Varni et al., 2001; Whitfield et al., 2006).
Even with the validity of student ratings essentially unanswered, and at best accounting for a
relatively small amount of variation in final exam results in multi-section course studies, many
authors claim that SETEs are still valid measures of teaching effectiveness and appropriate for personnel decisions. In fact, several influential authors in the student evaluation literature, such as Marsh (1987), McKeachie (1996), Wachtel (1998), Penny (2003), and Centra (2003),
specifically argue this point, and even suggest that the research agenda needs to be refocused
away from challenging the validity of SETEs to improving SETEs.
Obviously this has not happened, and in the past two decades literally hundreds of articles
have appeared that challenge various validity related aspects of SETEs (e.g., Balam & Shannon,
2010; Campbell & Bozeman, 2008; Davies et al, 2007; Emery et al, 2003; Langbein, 2008;
Pounder, 2007).
These include arguments that student perceptions of teaching are notoriously
subject to various types of manipulation, such as high grades, or the often debated "grading
leniency" hypothesis (e.g., Blackhart et al, 2006), course "easiness" (e.g., Bowling, 2008;
Boysen, 2008; Felton et al, 2004) and even giving treats, such as "chocolate candy" prior to the
evaluation (Youmans & Jee, 2007). Other research has demonstrated that student evaluations of
teaching results are influenced by possible race, gender, and cultural biases (e.g., Anderson &
Smith, 2005; Davies et al, 2007; Smith, 2007; Steward & Phelps, 2000), and the "likability and
popularity" attributes of the instructor, such as personal appearance (e.g., Ambady & Rosenthal,
1993; Atamian & Ganguli, 1993; Buck & Tiene, 1989), stylistic presentations (Abrami et al,
1982; Naftulin et al, 1973), and “sexiness” (e.g., Felton et al, 2004; Riniolo et al, 2006).
While persuasive in their arguments, few, if any, of these more recent empirical efforts
actually correlate SETEs with achievement of student learning outcomes. There are two recent studies that do correlate SETEs with student learning outcomes.
Recent Study 1: Carrell and West (2010) found that instructors receiving higher SETEs tended to excel more at "contemporaneous student achievement" (teaching to the test), but actually "harm the follow-on achievement of their students in more advanced classes" (2010, p. 409); that is, high SETEs are actually associated with lower levels of "deep learning".
Recent Study 2: Using course-specific standardized student learning outcome measures for 1,800 students across 116 course sections with 87 different instructors, Galbraith, Merrill & Kline (2010) found little or no support for the validity of SETEs as a general indicator of teaching effectiveness or student learning. Using both traditional analytical techniques and Bayesian classification modeling, this study showed that student evaluations of teaching effectiveness (SETEs) accounted for less than 6.0% of the variance in standardized student learning outcome achievement when different delivery methods were analyzed together, both for multiple sections of the same course and across all courses in the sample. However, when examining just face-to-face classes, the power of SETEs in explaining student learning outcomes drops significantly.
Since face-to-face instruction allows for greater opportunity to implement
manipulation strategies, such as bringing "treats", as well as more direct student assessment of
instructor "likability" and "charisma", the decrease in explanatory power for face-to-face classes
is perhaps not surprising.
In fact, the underlying structure appears to be non-linear and possibly negatively bimodal, where the most effective instructors are within the middle percentiles of student course ratings, while instructors receiving ratings in the top quintile or the bottom quintile are associated with significantly lower levels of student achievement – in other words, "high" student ratings may, in fact, be associated with lower student learning. This non-linear relationship was seen in both the full sample and the face-to-face sub-sample, for both measures of SETE (see Figure 1 below and the illustrative sketch that follows it). In addition, the empirical data indicate that faculty research productivity is a better predictor of student learning in the classroom than SETEs, even at a "teaching" university (Galbraith & Merrill, 2010).
Figure 1: Prediction of Student Learning by SETE (Course): Trivariate Analysis
[Figure omitted from this draft: predicted student learning (below average, average, above average) plotted against the cumulative percentile instructor SETE ranking (10% to 100%), with separate curves for the full sample and for graduate class predictions. Source: Galbraith, Merrill & Kline, 2010]
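The non-linear pattern summarized in Figure 1 can be illustrated schematically by comparing a straight-line fit with a quadratic fit of a learning-outcome measure on instructor SETE percentile. The sketch below uses hypothetical data and ordinary polynomial least squares; it is not the neural network or Bayesian classification analysis used by Galbraith, Merrill & Kline (2010), only a simple illustration of why a purely linear reading of SETEs can miss an inverted-U structure.

```python
# Schematic illustration only (hypothetical data; not the cited study's method):
# compare a straight-line fit with a quadratic fit of a learning-outcome measure
# on instructor SETE percentile. An inverted-U (or bimodal) pattern shows up as the
# quadratic fit explaining noticeably more variance than the linear fit.
import numpy as np

rng = np.random.default_rng(0)
sete_pct = np.linspace(5, 95, 40)  # instructor SETE percentile (hypothetical)
# Hypothetical inverted-U relationship: mid-percentile instructors do best.
learning = 70 + 10 * np.exp(-((sete_pct - 50) / 25) ** 2) + rng.normal(0, 1.5, 40)

def r_squared(x, y, degree):
    """R^2 of a polynomial least-squares fit of y on x."""
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    ss_res = np.sum((y - pred) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

print(f"linear fit R^2:    {r_squared(sete_pct, learning, 1):.3f}")
print(f"quadratic fit R^2: {r_squared(sete_pct, learning, 2):.3f}")
```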
Selected Bibliography and References to Report
Abrami, P., Leventhal, L., & Perry, R. (1982). Educational seduction. Review of Educational Research, 52, 446-464.
Ambady, N., & Rosenthal, R. (1993). Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness. Journal of Personality and Social Psychology, 64, 431-441.
Anderson, K., & Smith, G. (2005). Students' preconceptions of professors: Benefits and barriers according to ethnicity and gender. Hispanic Journal of Behavioral Sciences, 27(2), 184-201.
Atamian, R. & Ganguli, G. (1993). Teacher popularity and teaching effectiveness: Viewpoint of
accounting students, Journal of Education for Business, 68(3), 163-169.
Balam, E., & Shannon, D. (2010). Student ratings of college teaching: A comparison of faculty and their students. Assessment & Evaluation in Higher Education, 35(2), 209-221.
Barnes, D., Engelland, B., Matherne, C., Martin, W., Orgeron, C., Ring, J., Smith, G., & Williams, Z. (2008). Developing a psychometrically sound measure of collegiate teaching proficiency. College Student Journal, 42(1), 199-213.
Blackhart, G., Peruche, B., DeWall, C. & Joiner, T. (2006). Factors influencing teaching
evaluations in higher education. Teaching of Psychology, 33, 37-39.
Bowling, N. (2008). Does the relationship between student ratings of course easiness and course
quality vary across schools? The role of school academic rankings. Assessment & Evaluation in
Higher Education. 33(4), 455-464.
Boysen, G. (2008). Revenge and student evaluations of teaching. Teaching of Psychology,
35(3), 218-222.
Buck, S., & Tiene, D. (1989). The impact of physical attractiveness, gender, and teaching
philosophy on teacher evaluations. Journal of Educational Research, 82, 172-177.
Burdsal, C., & Harrison. (2008). Further evidence supporting the validity of both a multidimensional profile and an overall evaluation of teaching effectiveness. Assessment & Evaluation in Higher Education, 33(5), 567-576.
Campbell, J., & Bozeman, W. (2008). The value of student ratings: Perceptions of students, teachers, and administrators. Community College Journal of Research and Practice, 32(1), 13-24.
Carrell, S., & West, J. (2010). Does professor quality matter? Evidence from random
assignments of students to professors. Journal of Political Economy 118(3), 409-432.
Centra, J. (1983). Research productivity and teaching effectiveness. Research in Higher Education, 18(4), 379-389.
Centra, J. (2003). Will teachers receive higher student evaluations by giving higher grades and
less course work? Research in Higher Education, 44(5), 495–518.
Cohen, J. (1981). Statistical Power Analysis for the Behavioral Sciences (2nd edition). Lawrence
Erlbaum Associates.
Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.
Cohen, P. (1982). Validity of student ratings in psychology courses: A research synthesis,
Teaching of Psychology, 9(2), 78-82.
Cohen, P. (1983). Comment on a selective review of the validity of student ratings of teaching.
Journal of Higher Education, 54(4), 448-458.
Davies, M., Hirschberg, J., Lye, J. & Johnston, C. (2007). Systematic influences on teaching
evaluations: The Case for Caution. Australian Economic Papers, 46(1), 18-38.
Dowell, D., & Neal, J. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53(1), 51-62.
Dowell, D., & Neal, J. (1983). The validity and accuracy of student ratings of instruction: A
reply to Peter A. Cohen. Journal of Higher Education 54(4), 459-463.
Emery, C., Kramer, T., & Tian, R. (2003). Return to academic standards: A critique of student
evaluations of teaching effectiveness, Quality Assurance in Education, 11(1), 37-46.
England, J. (1996). How evaluations of teaching are used in personnel decisions. Occasional
Paper No. 33. American Council of Learned Societies, University of Michigan, Retrieved from
http://archives.acls.org/op/33_Professonal_Evaluation_of_Teaching.htm
Fant, G. (2010). Tricks for boosting student evaluations. The Chronicle of Higher Education (Online). Article and messages posted to http://chronicle.com/blogPost/Tricks-for-Boosting-Student/22033/ (September 15, 2010).
Felton, J., Mitchell, J., & Stinson, M. (2004). Web-based student evaluations of professors: The
relations between perceived quality, easiness and sexiness. Assessment & Evaluation in Higher
Education, 29(1), 91-108.
Galbraith, C., & Merrill, G. (forthcoming, 2010). Faculty Research Productivity and Standardized Student Learning Outcomes in a University Teaching Environment: A Bayesian Analysis of Relationships. Accepted for publication in Studies in Higher Education.
Galbraith, C., Merrill, G., & Kline, D. (2010). Are Student Evaluations of Teaching
Effectiveness Valid for Measuring Student Learning Outcomes in Business Related Classes? A
Neural Network and Bayesian Analyses (working paper, manuscript under review at Research in Higher Education).
Hattie, J. & Marsh, H. (1996). The relationship between research and teaching. Review of
Educational Research, 66(4), 507-542.
Hassan, K. (2009). Investigating substantive and consequential validity of student ratings of
instruction, Higher Education Research & Development, 28(3), 319-333.
Johnson, I. (2010). Class size and student performance at a public research university: A cross-classified model. Research in Higher Education. Published online. http://www.springerlink.com/content/0l35t1821172j857/fulltext.pdf
Langbein, L. (2008). Management by results: Student evaluation of faculty teaching and the
mis-measurement of performance. Economics of Education Review, 27(4), 417-428.
Lopus, J., & Maxwell, N. (1995). Should we teach microeconomic principles before
macroeconomic principles? Economic Inquiry, 33(2), 336–350.
Marsh, H. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for further research. International Journal of Educational Research, 11(3), 253-388.
Marsh, H., & Hattie, J. (2002). The relationship between research productivity and teaching effectiveness. Journal of Higher Education, 73(5), 603-641.
McCallum, L. (1984). A meta-analysis of course evaluation data and its use in the tenure
decision. Research in Higher Education, 21, 150-158.
McKeachie, W. (1979). Student ratings of faculty: A reprise. Academe, 65(6), 384-397.
McKeachie, W. (1996). Student ratings of teaching. Occasional Paper No. 33. American Council of Learned Societies, University of Michigan. Retrieved from http://archives.acls.org/op/33_Professonal_Evaluation_of_Teaching.htm
Messick, S. (1995). Validation of inferences from persons' responses and performances as
scientific inquiry into score meaning. American Psychologist, 50, 741-749.
Naftulin, D., Ware, J., & Donnelly, F. (1973). The Doctor Fox lecture: A paradigm of educational seduction. Journal of Medical Education, 48, 630-635.
Pascarella, E., & Terenzini, P. (2005). How college affects students: A third decade of research.
San Francisco: Jossey-Bass
Penny, A. (2003). Changing the agenda for research into students' views about university teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399-411.
Pounder, J. (2007). Is student evaluation of teaching worthwhile? An analytical framework for
answering the question. Quality Assurance in Education, 15(2), 178-191.
Remmers, H., & Brandenburg, G. (1927). Experimental data on the Purdue rating scale for
instructors. Educational Administration and Supervision, 13, 519-527.
Riniolo, T., Johnson, K., Sherman, T. & Misso, J. (2006). Hot or not: Do professors perceived as
physically attractive receive higher student evaluations? The Journal of General Psychology,
133(1), 19-35.
Smith, B. (2007). Student ratings of teaching effectiveness: An analysis of end-of-course faculty evaluations. College Student Journal, 41(4), 788-800.
Spooren, P., & Mortelmans, D. (2006). Teacher professionalism and student evaluation of
teaching: Will better teachers receive higher ratings and will better students give higher ratings?
Educational Studies, 32(2), 201-214.
Steward, R., & Phelps, R. (2000). Faculty of color and university students: Rethinking the evaluation of faculty teaching. Journal of the Research Association of Minority Professors, 4(2), 49-56.
Taylor, J. (2008). The teaching-research nexus and the importance of context: A comparative
study of England and Sweden. Compare, 31(1), 53-69.
Wachtel, H. (1998). Student evaluation of college teaching effectiveness: A brief review. Assessment and Evaluation in Higher Education, 23(2), 191-212.
University of Iowa (2010). Qualifications for specific ranks: School of Public Health, Retrieved
from http://www.public-health.uiowa.edu/faculty-staff/faculty/handbook/pdf//AppendixJ.pdf
Youmans, R. & Jee, B. (2007). Fudging the numbers: Distributing chocolate influences student
evaluations of an undergraduate course. Teaching of Psychology, 34(4), 245-247.
Zietz, J., & Cochran, H. (1997). Containing cost without sacrificing achievement: Some
evidence from college-level economics classes. Journal of Education Finance, 23, 177–192.