Erik posted about student evaluations last week and I took the opportunity to bring up the famous Air Force Academy study. (the paper; an accessible blog post about it).

The setup is extraordinary:

Prior to the start of the freshman academic year, students take course placement exams in mathematics, chemistry, and select foreign languages. Scores on these exams are used to place students into the appropriate starting core courses (i.e., remedial math, Calculus I, Calculus II, etc.). Conditional on course placement, the USAFA registrar employs a stratified random assignment algorithm to place students into sections within each course/semester. The algorithm first assigns all female students evenly throughout all offered sections, then places male-recruited athletes, and then assigns all remaining students. Within each group (i.e., female, male athlete, and all remaining males), assignments are random with respect to academic ability and professor. Thus, students throughout their 4 years of study have no ability to choose their professors in required core courses. Faculty members teaching the same course use an identical syllabus and give the same exams during a common testing period. These institutional characteristics assure that there is no self-selection of students into (or out of) courses or toward certain professors.
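The stratified procedure the quote describes is easy to picture in code. Here is a minimal sketch: each stratum is shuffled and then dealt round-robin across sections, so every section gets an even share of each group. The group labels, roster, and function name are my own invention for illustration, not from the paper.

```python
import random

def assign_sections(students, n_sections,
                    strata_order=("female", "male_athlete", "other_male"),
                    seed=0):
    """Sketch of stratified random assignment: within each stratum,
    shuffle the students and deal them round-robin across sections,
    so each stratum is spread evenly and assignment within a stratum
    is random with respect to professor."""
    rng = random.Random(seed)
    sections = [[] for _ in range(n_sections)]
    for stratum in strata_order:
        group = [name for name, g in students if g == stratum]
        rng.shuffle(group)  # random within the stratum
        for i, name in enumerate(group):
            sections[i % n_sections].append(name)
    return sections

# Invented roster: 6 women, 4 male athletes, 10 remaining men, 2 sections.
roster = ([("F%d" % i, "female") for i in range(6)]
          + [("A%d" % i, "male_athlete") for i in range(4)]
          + [("M%d" % i, "other_male") for i in range(10)])
for sec in assign_sections(roster, 2):
    print(len(sec))  # each section gets 3 women, 2 athletes, 5 other men
```

Each section ends up balanced on the observable strata, while which *professor* a given student draws is pure luck.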

They focused on math classes because the grades are highly normalised:

The integrity of our results depends on the percentage of points earned in core courses being a consistent measure of relative achievement across students. The manner in which student scores are determined at USAFA, particularly in the Math Department, allows us to rule out potential mechanisms for our results. Math professors grade only a small proportion of their own students’ exams, vastly reducing the ability of “easy” or “hard” grading professors to affect their students’ scores. All math exams are jointly graded by all professors teaching the course during that semester in “grading parties,” where Professor A grades question 1 and Professor B grades question 2 for all students taking the course. These aspects of grading allow us to rule out the possibility that professors have varying grading standards for equal student performance. Hence, our results are likely driven by the manner in which the course is taught by each professor.

In some core courses at USAFA, 5–10 percent of the overall course grade is earned by professor/section-specific quizzes and/or class participation. However, for the period of our study, the introductory calculus course at USAFA did not allow for any professor-specific assignments or quizzes. Thus, potential “bleeding heart” professors had no discretion to boost grades or to keep their students from failing their courses. For this reason, we present results in this study for the introductory calculus course and follow-on courses that require introductory calculus as a prerequisite.

This is really amazing! It’s pretty close to a highly controlled experiment!

They found a pretty strong effect from instructor quality:

The USAFA’s comprehensive core curriculum provides a unique opportunity to test how introductory course professors affect follow-on course achievement free from selection bias. The estimate of the variance of professor value-added, Var(λj1), is shown in row 2, column 2 of table 4 and indicates that introductory course professors significantly affect follow-on course achievement. The variance in follow-on course value-added is estimated to be 0.0025 (SD = 0.050). The magnitude of this effect is roughly equivalent to that estimated in the contemporaneous course and indicates that a one standard-deviation change in introductory professor quality results in a 0.05-standard-deviation change in follow-on course achievement.
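The two numbers in that quote are the same fact stated two ways, which is worth making explicit: a variance of 0.0025 in professor value-added is a standard deviation of √0.0025 = 0.05, and value-added is measured in student-level standard deviations of achievement. A one-line check:

```python
import math

var_follow_on = 0.0025               # reported variance of professor value-added
sd_follow_on = math.sqrt(var_follow_on)
print(sd_follow_on)                  # 0.05: a one-SD-better professor shifts
                                     # follow-on achievement by 0.05 student SDs
```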

The striking bit (for me) is their examination of student evaluation and grades:

Next, we examine the relationship between student evaluations of professors and student academic achievement as in Weinberg, Hashimoto, and Fleisher (2009). This analysis gives us a unique opportunity to compare the relationship between value-added models (currently used to measure primary and secondary teacher quality) and student evaluations (currently used to measure postsecondary teacher quality).

…In column 1, results for contemporaneous value-added are positive and statistically significant at the .05 level for scores on all six student evaluation questions. In contrast, results in column 2 for follow-on course value-added show that all six coefficients are negative, with three significant at the .05 level and three significant at the .10 level. Since proposals for teacher merit pay are often based on contemporaneous teacher value-added, we examine rank orders between our professor value-added estimates and student evaluation scores. We compute rank orders of career average student evaluation data for the question, “The instructor’s effectiveness in facilitating my learning in the course was,” by professor… As an illustration, the calculus professor in our sample who ranks dead last in deep learning ranks sixth and seventh best in student evaluations and contemporaneous value-added, respectively.
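The rank-order comparison they describe is just a Spearman rank correlation between two orderings of the same professors. A small sketch, using invented scores (the numbers below are not from the paper), shows how a professor can sit at the bottom of one ranking and near the top of the other:

```python
def ranks(xs):
    """Rank positions, 1 = highest score. No tie handling; illustrative only."""
    order = sorted(range(len(xs)), key=lambda i: -xs[i])
    r = [0] * len(xs)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(xs, ys):
    """Spearman rank correlation, rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)),
    valid when there are no ties."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical professors: follow-on ("deep learning") value-added
# versus average student-evaluation score. Invented numbers.
deep = [0.9, 0.5, 0.1, -0.2, -0.6]
evals = [3.1, 4.0, 4.8, 4.5, 5.2]
print(spearman(deep, evals))  # -0.9: the two rankings nearly invert
```

A strongly negative Spearman coefficient is exactly the "profoundly different rank orders" pattern the paper reports.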

Our findings show that introductory calculus professors significantly affect student achievement in both the contemporaneous course being taught and the follow-on related curriculum. However, these methodologies yield very different conclusions regarding which professors are measured as high quality, depending on the outcome of interest used. **We find that less experienced and less qualified professors produce students who perform significantly better in the contemporaneous course being taught, whereas more experienced and highly qualified professors produce students who perform better in the follow-on related curriculum.** Owing to the complexities of the education production function, where both students and faculty engage in optimizing behavior, we can only speculate as to the mechanism by which these effects may operate. Similar to elementary and secondary school teachers, who often have advance knowledge of assessment content in high-stakes testing systems, all professors teaching a given course at USAFA have an advance copy of the exam before it is given. Hence, educators in both settings must choose how much time to allocate to tasks that have great value for raising current scores but may have little value for lasting knowledge.

And the key bit:

**Regardless of how these effects may operate, our results show that student evaluations reward professors who increase achievement in the contemporaneous course being taught, not those who increase deep learning.** Using our various measures of teacher quality to rank-order teachers leads to profoundly different results. Since many U.S. colleges and universities use student evaluations as a measurement of teaching quality for academic promotion and tenure decisions, this finding draws into question the value and accuracy of this practice.

Now, in the comment thread, people pointed out that this result, while strong, was narrow (restricted to fairly low-level math and “adjacent” courses). These facts are related, of course: by focusing on the data with the fewest confounders, they narrowed the scope but increased the strength. However, there have been subsequent studies. For example, “Evaluating students’ evaluations of professors”:

The empirical analysis is based on data for one enrollment cohort of undergraduate students at Bocconi University, an Italian private institution of tertiary education offering degree programs in economics, management, public policy and law. We select the cohort of the 1998/1999 freshmen because it is the only one available where students were randomly allocated to teaching classes for each of their compulsory courses.

The students entering Bocconi in the 1998/1999 academic year were offered 7 different degree programs but only three of them attracted enough students to require the splitting of lectures into more than one class: Management, Economics and Law&Management. Students in these programs were required to take a fixed sequence of compulsory courses that span over the first two years, a good part of their third year and, in a few cases, also their last year.

…

The exam questions were also the same for all students (within degree program), regardless of their classes. Specifically, one of the teachers in each course (normally a senior faculty member) acted as a coordinator, making sure that all classes progressed similarly during the term and addressing problems that might have arisen. The coordinator also prepared the exam paper, which was administered to all classes. Grading was delegated to the individual teachers, each of them marking the papers of the students in his/her own class. The coordinator would check that the distributions were similar across classes but grades were not curved, neither across nor within classes.

They also looked at evaluation/grade correlations:

In this section we investigate the relationship between our measures of teaching effectiveness and the evaluations teachers receive from their students. We concentrate on two core items from the evaluation questionnaires, namely overall teaching quality and the overall clarity of the lectures.

The key bit:

Our benchmark class effects are negatively associated with all the items that we consider, **suggesting that teachers who are more effective in promoting future performance receive worse evaluations from their students. This relationship is statistically significant for all items (but logistics), and is of sizable magnitude.** For example, a one-standard deviation increase in teacher effectiveness reduces the students’ evaluations of overall teaching quality by about 50% of a standard deviation. Such an effect could move a teacher who would otherwise receive a median evaluation down to the 31st percentile of the distribution. Effects of slightly smaller magnitude can be computed for lecturing clarity.
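The "median down to the 31st percentile" figure follows directly from the half-standard-deviation drop, assuming evaluations are roughly normally distributed: a teacher at the 50th percentile pushed down by 0.5 SD lands at Φ(−0.5) ≈ 0.31. Python's standard library can check this:

```python
from statistics import NormalDist

# A teacher at the median of the evaluation distribution sits at z = 0.
# A one-SD increase in effectiveness costs ~0.5 SD of evaluation score,
# moving the teacher to z = -0.5. Where is that in percentile terms?
new_percentile = NormalDist().cdf(-0.5) * 100
print(round(new_percentile))  # 31, matching the paper's figure
```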

Finally, I did peek at a 2017 meta-analysis that supersedes earlier meta-analyses. The abstract:

Student evaluation of teaching (SET) ratings are used to evaluate faculty’s teaching effectiveness based on a widespread belief that students learn more from highly rated professors. The key evidence cited in support of this belief are meta-analyses of multisection studies showing small-to-moderate correlations between SET ratings and student achievement (e.g., Cohen, 1980, 1981; Feldman, 1989). We re-analyzed previously published meta-analyses of the multisection studies and found that their findings were an artifact of small sample sized studies and publication bias. Whereas the small sample sized studies showed large and moderate correlation, the large sample sized studies showed no or only minimal correlation between SET ratings and learning. Our up-to-date meta-analysis of all multisection studies revealed no significant correlations between the SET ratings and learning. These findings suggest that institutions focused on student learning and career success may want to abandon SET ratings as a measure of faculty’s teaching effectiveness.

I mean, it’s brutal:

In combination, our new up-to-date meta-analyses based on nearly 100 multisection studies, as well as our re-analyses of the previous meta-analyses make it clear that the previous reports of “moderate” and “substantial” SET/learning correlations were artifacts of small size study effects. **The best evidence − the meta-analyses of SET/learning correlations when prior learning/ability are taken into account − indicates that the SET/learning correlation is zero.** Contrary to a multitude of reviews, reports, as well as self-help books aimed at new professors (a few of them quoted above), the simple scatterplots as well as more sophisticated meta-analyses methods indicate that students do not learn more from professors who receive higher SET ratings.
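The "small size study effects" mechanism is worth seeing with numbers. A toy illustration, with entirely invented figures: if many small studies report large correlations while a few large studies report near-zero ones (the pattern Uttl et al. describe), then a naive average over studies looks "moderate," while weighting by sample size pulls the pooled estimate toward zero. (Real meta-analysis weights by inverse variance rather than raw n; this is a simplification.)

```python
# Invented toy data: (sample size, observed SET/learning correlation).
studies = [(20, 0.45)] * 8 + [(300, 0.02)] * 2

unweighted = sum(r for n, r in studies) / len(studies)
weighted = sum(n * r for n, r in studies) / sum(n for n, r in studies)
print(round(unweighted, 3), round(weighted, 3))
# The unweighted pooled correlation looks "moderate" (~0.36),
# while the size-weighted one is much closer to zero (~0.11).
```

Add publication bias on top (small null-result studies never published at all) and the naive pooled correlation inflates further, which is the artifact the re-analysis identifies.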

And student evaluations are a substantial part of the Teaching Excellence Framework.

So that’s really bad!

When we throw in gender (and other) biases, it seems clear we have a huge problem.