Draft, 11/10/2010. Not to be quoted.

Committee Report to UNCW Faculty Senate
2010 Ad Hoc Faculty Senate Committee on SPOT Use for RTP Decisions and Process

Committee Members: Prof. Craig Galbraith (Management; Committee Chair; UNCW RTP Committee member and chairperson, 2005-2010); Prof. John Fischetti (Education; current UNCW RTP Committee member and chairperson); Prof. Barry Wray (Information Systems and Operations Management); Prof. John Taggart (Environmental Studies); Prof. Regina Felix (Foreign Languages and Literatures); Prof. Susan Roberts (Clinical Research, Nursing); Prof. Yaw Chang (Mathematics and Statistics; former member of UNCW RTP Committee)

Charge of Committee

The charge of the committee was:
1) To investigate and report on what empirical research has found regarding the validity of SETEs in measuring teaching effectiveness (and, if they are not valid or are of low validity, whether UNCW should be using SETEs in RTP decisions).
2) To determine what UNCW and UNC documents actually say about using SETEs for purposes of RTP.
3) If SETEs are not valid measures of teaching effectiveness, to determine what should be used to measure teaching effectiveness.
4) To provide information about how other universities have attempted to resolve these issues.
5) To develop recommendations for the UNCW Faculty Senate to discuss and possibly vote on.

The committee was purposely composed of tenured (both Associate and Full ranks) and untenured faculty members from different departments and Schools at UNCW. The committee met approximately every two weeks during Fall Semester, 2010.

Findings of Committee: Charge 1

Charge 1: The basic issue investigated by our committee was stated by Wilbert J. McKeachie in an article for the American Council of Learned Societies in 1996. McKeachie states that "if student ratings are part of the data used in personnel decisions, one must have convincing evidence that they add valid evidence of teaching effectiveness" (McKeachie, 1996, p. 3).

Conclusion of Committee. Recent empirical research indicates that quantified student evaluations of teaching effectiveness (SETEs), such as UNCW SPOTs, are invalid measures of teaching effectiveness (very low validity) given the high validity standards that should be expected for purposes of RTP and personnel decisions. This is particularly true for the commonly used global measure, such as UNCW Q16.

Support for Conclusion. Based upon current research (see the summary of empirical studies at the end of this report), it is evident that:

a) SETEs have low criterion validity as a measure of teaching effectiveness. In well-controlled, multi-section studies, SETEs explain only a relatively small percentage (between 4% and 20%, with many studies indicating less than 9.0%) of the variation in objective measures of teaching effectiveness, such as performance on comprehensive exams and on standardized student outcome measures. The vast majority of these studies use both a global measure of SETEs (like UNCW Q16) and a multi-variable average of SETE questions (like summing UNCW Q1 to Q15). Research indicates that different items in the SETEs correlate differently with student learning, and that the global measure (like UNCW Q16) generally has one of the lower correlations.
b) Most of the arguments for using quantitative SETEs (studies that report higher correlations with student learning) come from early studies (roughly 1975 to 1983) of introductory first-year psychology and similar courses that required a common lesson plan and were taught by different graduate students, settings where one might expect instructor "likability" to be better at holding students' attention (as shown in high school studies). More recent studies of courses where instructors have more control over course content (more advanced courses, elective courses, and graduate courses) show a much lower correlation between SETEs and student learning (around 6% of variance explained, with some negative correlations). This suggests that the level of criterion validity of SETEs changes dramatically between levels of instruction.

c) There are very few empirical studies that examine the correlation of SETEs across different classes and departments using a standardized measure of student learning. This is important since most universities compare faculty SETEs across courses, departments, and Schools (quintiles, averages, etc.) for purposes of RTP. The empirical studies that do examine cross-department and cross-course validity of SETEs indicate that around 4% to 5% of student learning is captured by SETEs. This level of explained variance is considered a LOW level of statistical validity (see Cohen, 1981).

d) Recent empirical studies have indicated that SETEs have problems with other types of validity, such as substantive validity and consequential validity. For example, Dowell and Neal (1983) observed early in this debate, "student ratings are inaccurate indicators of student learning and they are best regarded as indices of 'consumer satisfaction' rather than teaching effectiveness" (Dowell & Neal, 1983, p. 462).

e) Recent empirical studies have indicated that SETEs can be significantly manipulated by the individual strategies of the instructor, such as providing treats and cookies during the time SETEs are filled out. There are even websites that suggest ways to manipulate student evaluations. For example, the Chronicle of Higher Education has a blog post titled "Tricks for Boosting Student Evaluations" dedicated to faculty reporting how to manipulate SETEs (http://chronicle.com/blogs/onhiring/tricks-for-boostingstudent-evaluations/22033).

f) Recent empirical studies have shown that SETEs can be significantly influenced by the racial and cultural biases of students.

g) Recent empirical studies have shown that SETEs are significantly correlated with personal "likability" characteristics of the instructor, such as attractiveness, charisma, sexiness, and age.

h) Empirical research has shown that SETE results vary depending on whether a course is delivered online or in a traditional, face-to-face format.

i) Empirical research has shown that students completing an online version of an SETE give different scores than students using a pencil-and-paper SETE.

j) Empirical research has indicated a "recency effect" or "memory decay" in student evaluations; that is, students will rate an instructor differently on some items during the class period versus two weeks later (ratings generally decrease over time for some items).

k) Some empirical studies have shown that SETEs are positively correlated with grades, thus leading to grade inflation issues. This is called the "grading leniency hypothesis." Some studies, however, have found no correlation with grades.
l) Recent empirical studies have found that SETEs are negatively correlated with "deep learning"; that is, students whose faculty members received high SETEs perform worse in subsequent, more advanced classes.

m) Recent empirical studies indicate that the relationship between SETEs and actual student learning is most likely negatively bi-modal; that is, faculty members with average SETEs are the most effective, while faculty members with very high (and very low) SETEs are actually associated with LOWER levels of student learning.

n) Untenured faculty members on the committee seemed most impacted by SPOT results, and several indicated that they would teach their courses "how it should be properly taught" if SPOTs were eliminated or deemphasized. Several senior members indicated that SPOTs affected their grading (higher grades given) while other senior members indicated that SPOTs had no effect.

Findings of Committee: Charge 2

Charge 2: Current UNCW and UNC policy regarding SETEs.

Conclusion of Committee. Under UNC Policy, some student evaluation of teaching is required to be used for RTP and personnel decisions; however, a) there is no formal mandate that quantitative SETE data be used (it could be qualitative) or that a global measure (Q16) be used, and b) the weighting of student evaluation versus peer evaluation is up to the faculty and institution (student evaluation could be weighted very low, equal, or very high). In addition, from the 1993 report it is clear that SETEs should NOT simply be compared with those of other faculty, departments, or schools (averages, quintiles, etc.) and that SETEs are most effectively used to compare changes over time.

Support for Conclusion. Excerpts from UNCW Policy and UNC Policy are provided below.

UNCW Policy

Relevant sections of the UNCW Faculty Handbook:

a) The SPOT guidelines document states, "Because numerous studies have indicated that both peer and student evaluations are necessary for the equitable evaluation of teaching effectiveness, it is strongly suggested that peer and student evaluations be given similar emphasis in personnel recommendations."

b) Every faculty member is evaluated by students every semester in all courses (including summer courses) using the institution-wide Student Perceptions of Teaching (SPOT) questionnaire. This questionnaire and the instructions for administering it were developed by the Faculty Senate. The use of the SPOT is mandatory, although some departments also have additional student evaluation questions which are appended to the SPOT. The department chairperson or appropriate supervisor receives the results of the SPOT from the Office of Academic Computing and shares them with the faculty member every semester. SPOT results are considered, along with other measures and instruments of evaluation, in annual evaluation; in evaluation for reappointment, promotion, and tenure; and in post-tenure review.

Relevant sections of the UNCW SPOT Guidelines document (linked in the Faculty Handbook):

a) It is strongly suggested that peer and student evaluations be given similar emphasis in personnel recommendations.

b) Individual SPOT results, when combined with qualitative interpretation by the department chairperson and with peer evaluations of teaching, can contribute to measuring an individual's teaching effectiveness and to identification of areas of strength and areas where improvement is possible.
Under those conditions, SPOT results are appropriately used for annual merit evaluation summaries, consideration for salary raises, RTP, and post-tenure-review decisions.

UNC Policy

UNC Policy regarding the evaluation of teaching has two parts: Section 400.3 and the 1993 policy committee document referenced in Section 400.3.

Section 400.3, September 1993 [This policy has been published in a separate pamphlet, copies of which are available through General Administration, Office of the Secretary.]

Section 400.3.1.1 Introduction. At the November 1992 meeting of the Board of Governors, questions were raised about the procedures and criteria for the awarding of tenure and about the evaluation, recognition, and reward of teaching, particularly in tenure decisions. The chairman of the board referred the questions and concerns to two standing committees, the Committee on Personnel and Tenure and the Committee on Educational Planning, Policies, and Programs. The report entitled Tenure and Teaching in the University of North Carolina, adopted by the board on September 10, 1993, distilled what was learned by the committees and recommended additional steps to encourage good teaching within the university and to see that the quality of teaching continues to be a prime consideration in tenure decisions.

1. That the Board of Governors, through the President of the University, instruct the Chancellors of each constituent institution to do the following:

c. Review procedures for the evaluation of faculty performance to ensure (1) that student evaluations and formal methods of peer review are included in teaching evaluation procedures, (2) that student evaluations are conducted at regular intervals (at least one semester each year) and on an ongoing basis, (3) that peer review of faculty includes direct observation of the classroom teaching of new and non-tenured faculty and of graduate teaching assistants, and (4) that appropriate and timely feedback from evaluations of performance is provided to those persons being reviewed.

1993 Policy Pamphlet (this document contains the most information regarding teaching evaluation; it is incorporated as policy under Section 400.3 and is the basis for the short set of recommendations and instructions under Section 400.3.1.1). The report discussed student evaluations, peer evaluations, and self-evaluation. Key points are:

For Student Evaluations

For Peer Evaluation

For Self-Evaluation of Teaching

Findings of Committee: Charge 3

Charge 3: If SETEs are not valid measures of teaching effectiveness, what should be used to measure teaching effectiveness?

Conclusion of Committee. The Committee felt that some of the information in the UNCW SPOT form, Q1 to Q15, was useful to instructors for improving their courses. In addition, the written comments were found to be useful. However, there was concern that the format for the written comments was not uniform between departments (nor were there consistent methods to summarize and report the qualitative data), and therefore it would be difficult to compare written student information between departments for RTP purposes under the current process. In addition, the committee investigated whether or not there are empirical studies that correlate peer evaluations with student learning.
While there are several studies that report surveys of faculty opinions on this issue, the committee could not find any well-controlled studies that correlated peer evaluations with student learning. Thus the validity issue for peer evaluations remains unknown. Surveys have noted that as SETEs became more dominant, peer evaluations and classroom visits in universities decreased. Faculty surveys also indicate a general opinion that the widespread use of SETEs, with the subsequent decline of peer evaluations, has led to a decline in education standards (Becker and Watts, 1999).

Findings of Committee: Charge 4

Charge 4: Provide information about other universities' efforts to resolve these issues.

Conclusion of Committee. The Committee examined what other universities are doing. The Committee found that some universities have reaffirmed SETEs for RTP and personnel decisions, while other universities are limiting the use of SETEs for RTP and personnel decisions. Many universities also have a specific policy that faculty members have an absolute right to provide comments or explanations regarding the results of SETEs for purposes of RTP. Many universities appear to be using the IDEA Center's SETE system (http://www.theideacenter.org/node/5). The IDEA Center provides a student rating system that incorporates reporting flexibility with department standards. The IDEA method provides support for some aspects of validity and reliability; however, it has not been tested for other aspects of validity, such as correlation with an independent measure of teaching effectiveness like standardized measures of student learning outcomes.

Support for Conclusion. For example:

The University of Minnesota's student evaluation does not contain any global question similar to UNCW Q16; its policy requires that faculty be able to formally respond to SETE results and that SETEs be used together with peer evaluations. http://policy.umn.edu/Policies/Education/Education/TEACHINGEVALUATION.html

The University of Wisconsin System policy states that student evaluations should not substitute for direct peer judgment. As a matter of policy, it also states that the validity issue is still unknown and that, as more research is done, the policy about using SETEs (or its wording) may change. To quote from the manual: "Student evaluation of instruction as information used in actions on promotion, retention, or the awarding of tenure. Each University of Wisconsin System Institution shall adopt such policies for instructional faculty as will insure: (a) that student evaluation of the instruction of each faculty member being considered for promotion or tenure shall be undertaken; (b) that the faculty body which initiates recommendations for promotion or tenure shall consider, in addition to independent peer judgment of teaching effectiveness, student evaluation data, taking into account existing limitations in validity and reliability of the evaluation methodology employed. . ."

The UCLA website indicates that the global questions are the most likely to reflect personal bias (under the section Interpreting Quantitative Student Evaluations for Personnel Decisions): http://www.oid.ucla.edu/publications/evalofinstruction/eval2#5

Southeast Missouri State gives a clear guideline for how to use student evaluations for tenure and promotion. They use the IDEA instrument and have very specific guidelines about interpretation, statistical significance, class size, and comparison with questions that the department has defined as being important.
The process appears to be driven by individual faculty and departments rather than by a university-wide form. Their guidelines specifically state that SETEs should not count for more than 25-33% of a measure of teaching effectiveness. To quote: "Student evaluation of instruction is an anonymous process and is not always compatible with academic rigor. The IDEA Center stresses that student evaluation of instruction should make up no more than 25-33% of the measure of teaching effectiveness."

Recommendations of Committee: Charge 5

Charge 5: Develop recommendations for the UNCW Faculty Senate to discuss and possibly vote on. The following recommendations are based upon the consensus of the Committee, drawing on its review of empirical research and of other university processes.

Recommendation 1: Eliminate the global question (Q16) from UNCW SPOTs.

Recommendation 2: For RTP, that Q1 to Q15 not be reported with any department, school, or university averages, quintiles, or categorical statements (above average, average, etc.).

Recommendation 3: That Q1 to Q15 be reported for RTP only as a comparison over time for that particular instructor.

Recommendation 4: That some questions on the SETEs be tailored to specific departmental missions and expectations.

Recommendation 5: That quantified SETEs (UNCW SPOTs) not be weighted more than 25% to 33% in the assessment of teaching effectiveness for purposes of RTP.

Recommendation 6: That the qualitative/written comments from student evaluations be more systematically administered and reported in order for them to be used more effectively in RTP decisions.

Recommendation 7: That the individual faculty member have the right to formally comment on, explain, or respond to student evaluations (either quantitative or qualitative/written comments) for purposes of RTP, and that these comments, explanations, or responses be formally included as part of the SETE reporting process.

Recommendation 8: That the process of peer evaluation of teaching be made more uniform between departments and Schools within UNCW in order for peer evaluations to be used more effectively in RTP decisions.

Recommendation 9: That peer evaluations be weighted at least equal to student evaluations of teaching (quantified questions and the student written comment section) for purposes of RTP.

Recommendation 10: That UNCW investigate using the IDEA Center's evaluation system for student evaluation (http://www.theideacenter.org/node/5). However, no global question should be used, and any quantified SETE process should never be weighted more than 25% to 35% as an indicator of faculty teaching effectiveness.

Short History of Student Evaluations of Teaching Effectiveness (SETEs)

Student evaluations of teaching effectiveness (SETEs) are one of the most highly debated aspects of modern university life; they also remain one of the most researched topics in the literature. While the very early history of SETEs remains somewhat uncertain, it is generally agreed that Herman Remmers' research at Purdue University in the 1920s and 1930s pushed SETEs into the mainstream (e.g., Remmers & Brandenburg, 1927). By the late 1940s, SETEs were being collected at numerous top universities, including Harvard, Purdue, and the University of Washington. Responding to the rise of student activism in the 1960s and early 1970s, SETEs quickly became a norm for many universities (McKeachie, 1979, 1996).
While originally implemented to provide student feedback in order to improve teaching, since the 1970s SETEs have become increasingly prevalent in faculty personnel decisions.

Summary of Empirical Studies Investigating Validity Issues for SETEs

During the 1970s and 1980s a number of empirical studies were published that tested the validity of SETEs. Most often these studies used multiple sections of the same course (typically an introductory course, often taught by graduate students) and correlated SETEs with some measure of student achievement such as a common final exam. Three important conclusions can be drawn from these early multi-section comparison studies: 1) a relatively low amount of the statistical variation in independent and objective measures of teaching effectiveness is explained by SETEs -- depending on the meta-analysis, between about 4% and 20% (Cohen, 1982, 1983; Dowell & Neal, 1982, 1983; McCallum, 1984) -- with the majority falling into the "weak" category of scale criterion validity suggested by Cohen (1969, 1981) [see Footnote 1]; 2) due to the vast differences in results, including some negative correlations, there was a common call for continued investigation into the fundamental student rating-learning outcome link (e.g., Dowell & Neal, 1983); for example, a large percentage of prior research relies upon data from introductory college courses, taught with textbook-created lesson plans by graduate students, but as Taylor (2008) notes, an individual instructor's ability to influence course content and learning is most likely to occur in more advanced and elective courses; and 3) there was a growing recognition that the incautious use of SETEs for faculty performance evaluations was fundamentally changing the teaching focus of higher education, away from the transmission of knowledge, in which society is viewed as the "customer," toward a marketing model in which the faculty member is viewed more as a salesperson. As Dowell and Neal (1983) observed early in this debate, "student ratings are inaccurate indicators of student learning and they are best regarded as indices of 'consumer satisfaction' rather than teaching effectiveness" (Dowell & Neal, 1983, p. 462).

In spite of these early conclusions, empirical research measuring the actual relationship between SETEs and student learning outcomes essentially ceased by the mid-1980s, leaving the criterion validity issue open to vast differences of interpretation.

Footnote 1: Cohen (1969) refers to r = 0.10 (1.0% of variance explained) as a small effect, r = 0.30 (9.0% of variance explained) as a medium effect, and r = 0.50 (25.0% of variance explained) as a large effect (see Cohen, 1992). Many researchers have inferred that r < 0.30 (less than 9% of variance explained) signifies a "small" effect for purposes of testing scale validity (e.g., Barrett et al., 2009; Hon et al., 2010; Varni et al., 2001; Whitfield et al., 2006).
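For reference, the effect-size benchmarks in Footnote 1 (and the variance-explained percentages quoted throughout this report) follow from squaring the correlation coefficient r; the display below simply restates those cited values as arithmetic:

\[
r = 0.10 \;\Rightarrow\; r^{2} = 0.01 \;(1.0\%), \qquad
r = 0.30 \;\Rightarrow\; r^{2} = 0.09 \;(9.0\%), \qquad
r = 0.50 \;\Rightarrow\; r^{2} = 0.25 \;(25.0\%).
\]

Read in the other direction, the 4% to 20% of variance reported in the multi-section studies corresponds to correlations of roughly r = 0.20 to r = 0.45 (since \(\sqrt{0.04} = 0.20\) and \(\sqrt{0.20} \approx 0.45\)).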
Since that time, the majority of empirical research investigating the "validity" and "reliability" of SETEs has shifted more toward the dimensionality problem of SETEs, including the number and stability of the different dimensions, as well as the substantive validity and consequential validity of SETEs within various contexts. Recent SETE design studies have almost always used a factor-analytic approach based on data gathered from survey methodologies or focus groups (e.g., Barnes et al., 2008; Burdsal & Harrison, 2008; Hassan, 2009; Spooren & Mortelmans, 2006), not on any independent measure of actual learning outcomes or teaching effectiveness.

Even with the validity of student ratings essentially unanswered, and with SETEs at best accounting for a relatively small amount of variation in final exam results in multi-section course studies, many authors claim that SETEs are still valid measures of teaching effectiveness and appropriate for personnel decisions. In fact, several influential authors in the student evaluation literature, such as Marsh (1987), McKeachie (1996), Wachtel (1998), Penny (2003), and Centra (2003), specifically argue this point, and even suggest that the research agenda needs to be refocused away from challenging the validity of SETEs and toward improving SETEs. Obviously this has not happened, and in the past two decades literally hundreds of articles have appeared that challenge various validity-related aspects of SETEs (e.g., Balam & Shannon, 2010; Campbell & Bozeman, 2008; Davies et al., 2007; Emery et al., 2003; Langbein, 2008; Pounder, 2007). These include arguments that student perceptions of teaching are notoriously subject to various types of manipulation, such as high grades, or the often debated "grading leniency" hypothesis (e.g., Blackhart et al., 2006), course "easiness" (e.g., Bowling, 2008; Boysen, 2008; Felton et al., 2004), and even giving treats, such as "chocolate candy," prior to the evaluation (Youmans & Jee, 2007). Other research has demonstrated that student evaluations of teaching are influenced by possible race, gender, and cultural biases (e.g., Anderson & Smith, 2005; Davies et al., 2007; Smith, 2007; Steward & Phelps, 2000), and by the "likability and popularity" attributes of the instructor, such as personal appearance (e.g., Ambady & Rosenthal, 1993; Atamian & Ganguli, 1993; Buck & Tiene, 1989), stylistic presentations (Abrami et al., 1982; Naftulin et al., 1973), and "sexiness" (e.g., Felton et al., 2004; Riniolo et al., 2006).

While persuasive in their arguments, few, if any, of these more recent empirical efforts actually correlate SETEs with achievement of student learning outcomes. There are, however, two recent studies that do correlate SETEs with student learning outcomes.

Recent Study 1: Carrell and West (2010) found that instructors receiving higher SETEs tended to excel more at "contemporaneous student achievement" (teaching to the test), but actually "harm the follow-on achievement of their students in more advanced classes" (2010, p. 409); that is, high SETEs are actually associated with lower levels of "deep learning."

Recent Study 2: Using course-specific standardized student learning outcome measures for 1,800 students across 116 course sections with 87 different instructors, Galbraith, Merrill & Kline (2010) found little or no support for the validity of SETEs as a general indicator of teaching effectiveness or student learning. Using both traditional analytical techniques and Bayesian classification modeling, this study showed that student evaluations of teaching effectiveness (SETEs) accounted for less than 6.0% of the variance in standardized student learning outcome achievement when different delivery methods are analyzed together, both for multiple sections of the same course and across all courses in the sample. However, when just face-to-face classes are examined, the power of SETEs in explaining student learning outcomes drops significantly.
Since face-to-face instruction allows greater opportunity to implement manipulation strategies, such as bringing "treats," as well as more direct student assessment of instructor "likability" and "charisma," the decrease in explanatory power for face-to-face classes is perhaps not surprising. In fact, the underlying structure appears to be non-linear and possibly negatively bimodal, where the most effective instructors are within the middle percentiles of student course ratings, while instructors receiving ratings in the top quintile or the bottom quintile are associated with significantly lower levels of student achievement; in other words, "high" student ratings may, in fact, be associated with lower student learning. This non-linear relationship was seen in both the full sample and the face-to-face sub-sample, for both SETE measures. See Figure 1 below. In addition, the empirical data indicate that faculty research productivity is a better predictor of student learning in the classroom than SETEs, even at a "teaching" university (Galbraith & Merrill, 2010).

Figure 1: Prediction of Student Learning by SETE (Course): Trivariate Analysis. [The figure plots predicted student learning (below average, average, above average) against the instructor's SETE ranking (cumulative percentile), for the full sample and for graduate class predictions. Source: Galbraith, Merrill & Kline, 2010.]

Selected Bibliography and References to Report

Abrami, P., Leventhal, L., & Perry, R. (1982). Educational seduction. Review of Educational Research, 52, 446-464.

Ambady, N., & Rosenthal, R. (1993). Half a minute: Predicting teacher evaluations from thin slices of nonverbal behavior and physical attractiveness. Journal of Personality and Social Psychology, 64, 431-441.

Anderson, K., & Smith, G. (2005). Students' preconceptions of professors: Benefits and barriers according to ethnicity and gender. Hispanic Journal of Behavioral Sciences, 27(2), 184-201.

Atamian, R., & Ganguli, G. (1993). Teacher popularity and teaching effectiveness: Viewpoint of accounting students. Journal of Education for Business, 68(3), 163-169.

Balam, E., & Shannon, D. (2010). Student ratings of college teaching: A comparison of faculty and their students. Assessment & Evaluation in Higher Education, 35(2), 209-221.

Barnes, D., Engelland, B., Matherne, C., Martin, W., Orgeron, C., Ring, J., Smith, G., & Williams, Z. (2008). Developing a psychometrically sound measure of collegiate teaching proficiency. College Student Journal, 42(1), 199-213.

Blackhart, G., Peruche, B., DeWall, C., & Joiner, T. (2006). Factors influencing teaching evaluations in higher education. Teaching of Psychology, 33, 37-39.

Bowling, N. (2008). Does the relationship between student ratings of course easiness and course quality vary across schools? The role of school academic rankings. Assessment & Evaluation in Higher Education, 33(4), 455-464.

Boysen, G. (2008). Revenge and student evaluations of teaching. Teaching of Psychology, 35(3), 218-222.

Buck, S., & Tiene, D. (1989). The impact of physical attractiveness, gender, and teaching philosophy on teacher evaluations. Journal of Educational Research, 82, 172-177.

Burdsal, C., & Harrison (2008).
Further evidence supporting the validity of both a multidimensional profile and an overall evaluation of teaching effectiveness. Assessment & Evaluation in Higher Education, 33(5), 567-576.

Campbell, J., & Bozeman, W. (2008). The value of student ratings: Perceptions of students, teachers, and administrators. Community College Journal of Research and Practice, 32(1), 13-24.

Carrell, S., & West, J. (2010). Does professor quality matter? Evidence from random assignment of students to professors. Journal of Political Economy, 118(3), 409-432.

Centra, J. (1983). Research productivity and teaching effectiveness. Research in Higher Education, 18(4), 379-389.

Centra, J. (2003). Will teachers receive higher student evaluations by giving higher grades and less course work? Research in Higher Education, 44(5), 495-518.

Cohen, J. (1981). Statistical Power Analysis for the Behavioral Sciences (2nd edition). Lawrence Erlbaum Associates.

Cohen, J. (1992). A power primer. Psychological Bulletin, 112(1), 155-159.

Cohen, P. (1982). Validity of student ratings in psychology courses: A research synthesis. Teaching of Psychology, 9(2), 78-82.

Cohen, P. (1983). Comment on a selective review of the validity of student ratings of teaching. Journal of Higher Education, 54(4), 448-458.

Davies, M., Hirschberg, J., Lye, J., & Johnston, C. (2007). Systematic influences on teaching evaluations: The case for caution. Australian Economic Papers, 46(1), 18-38.

Dowell, D., & Neal, J. (1982). A selective review of the validity of student ratings of teaching. Journal of Higher Education, 53(1), 51-62.

Dowell, D., & Neal, J. (1983). The validity and accuracy of student ratings of instruction: A reply to Peter A. Cohen. Journal of Higher Education, 54(4), 459-463.

Emery, C., Kramer, T., & Tian, R. (2003). Return to academic standards: A critique of student evaluations of teaching effectiveness. Quality Assurance in Education, 11(1), 37-46.

England, J. (1996). How evaluations of teaching are used in personnel decisions. Occasional Paper No. 33. American Council of Learned Societies, University of Michigan. Retrieved from http://archives.acls.org/op/33_Professonal_Evaluation_of_Teaching.htm

Fant, G. (2010). Tricks for boosting student evaluations. The Chronicle of Higher Education (online). Article and messages posted to http://chronicle.com/blogPost/Tricks-for-BoostingStudent/22033/ (September 15, 2010).

Felton, J., Mitchell, J., & Stinson, M. (2004). Web-based student evaluations of professors: The relations between perceived quality, easiness and sexiness. Assessment & Evaluation in Higher Education, 29(1), 91-108.

Galbraith, C., & Merrill, G. (forthcoming, 2010). Faculty research productivity and standardized student learning outcomes in a university teaching environment: A Bayesian analysis of relationships. Accepted for publication at Studies in Higher Education.

Galbraith, C., Merrill, G., & Kline, D. (2010). Are student evaluations of teaching effectiveness valid for measuring student learning outcomes in business related classes? A neural network and Bayesian analyses (working paper, manuscript under review at Research in Higher Education).

Hattie, J., & Marsh, H. (1996). The relationship between research and teaching. Review of Educational Research, 66(4), 507-542.

Hassan, K. (2009). Investigating substantive and consequential validity of student ratings of instruction. Higher Education Research & Development, 28(3), 319-333.

Johnson, I. (2010).
Class size and student performance at a public research university: A cross-classified model. Research in Higher Education. Published online. http://www.springerlink.com/content/0l35t1821172j857/fulltext.pdf

Langbein, L. (2008). Management by results: Student evaluation of faculty teaching and the mis-measurement of performance. Economics of Education Review, 27(4), 417-428.

Lopus, J., & Maxwell, N. (1995). Should we teach microeconomic principles before macroeconomic principles? Economic Inquiry, 33(2), 336-350.

Marsh, H. (1987). Students' evaluations of university teaching: Research findings, methodological issues, and directions for further research. International Journal of Educational Research, 11(3), 253-388.

Marsh, H., & Hattie, J. (2002). The relationship between research productivity and teaching effectiveness. Journal of Higher Education, 73(5), 603-641.

McCallum, L. (1984). A meta-analysis of course evaluation data and its use in the tenure decision. Research in Higher Education, 21, 150-158.

McKeachie, W. (1979). Student ratings of faculty: A reprise. Academe, 65(6), 384-397.

McKeachie, W. (1996). Student ratings of teaching. Occasional Paper No. 33. American Council of Learned Societies, University of Michigan. Retrieved from http://archives.acls.org/op/33_Professonal_Evaluation_of_Teaching.htm

Messick, S. (1995). Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50, 741-749.

Naftulin, D., Ware, J., & Donnelly, F. (1973). The Doctor Fox lecture: A paradigm of educational seduction. Journal of Medical Education, 48, 630-635.

Pascarella, E., & Terenzini, P. (2005). How college affects students: A third decade of research. San Francisco: Jossey-Bass.

Penny, A. (2003). Changing the agenda for research into students' views about university teaching: Four shortcomings of SRT research. Teaching in Higher Education, 8(3), 399-411.

Pounder, J. (2007). Is student evaluation of teaching worthwhile? An analytical framework for answering the question. Quality Assurance in Education, 15(2), 178-191.

Remmers, H., & Brandenburg, G. (1927). Experimental data on the Purdue rating scale for instructors. Educational Administration and Supervision, 13, 519-527.

Riniolo, T., Johnson, K., Sherman, T., & Misso, J. (2006). Hot or not: Do professors perceived as physically attractive receive higher student evaluations? The Journal of General Psychology, 133(1), 19-35.

Smith, B. (2007). Student ratings of teaching effectiveness: An analysis of end-of-course faculty evaluations. College Student Journal, 41(4), 788-800.

Spooren, P., & Mortelmans, D. (2006). Teacher professionalism and student evaluation of teaching: Will better teachers receive higher ratings and will better students give higher ratings? Educational Studies, 32(2), 201-214.

Steward, R., & Phelps, R. (2000). Faculty of color and university students: Rethinking the evaluation of faculty teaching. Journal of the Research Association of Minority Professors, 4(2), 49-56.

Taylor, J. (2008). The teaching-research nexus and the importance of context: A comparative study of England and Sweden. Compare, 31(1), 53-69.

Wachtel, H. (1998). Student evaluation of college teaching effectiveness: A brief review. Assessment and Evaluation in Higher Education, 23(2), 191-212.

University of Iowa (2010).
Qualifications for specific ranks: School of Public Health. Retrieved from http://www.public-health.uiowa.edu/faculty-staff/faculty/handbook/pdf//AppendixJ.pdf

Youmans, R., & Jee, B. (2007). Fudging the numbers: Distributing chocolate influences student evaluations of an undergraduate course. Teaching of Psychology, 34(4), 245-247.

Zietz, J., & Cochran, H. (1997). Containing cost without sacrificing achievement: Some evidence from college-level economics classes. Journal of Education Finance, 23, 177-192.