ASA Against Student Evaluations

Yesterday, in “Active Learning Works But Students Don’t Like It,” I pointed out that student evaluations do not correlate well with teacher effectiveness and may discourage teachers from using more effective but less student-preferred methods of teaching. Coincidentally, the American Sociological Association issued a statement yesterday discouraging the use of student evaluations in tenure and promotion decisions.

SETs are weakly related to other measures of teaching effectiveness and student learning (Boring, Ottoboni, and Stark 2016; Uttl, White, and Gonzalez 2017); they are used in statistically problematic ways (e.g., categorical measures are treated as interval, response rates are ignored, small differences are given undue weight, and distributions are not reported) (Boysen 2015; Stark and Freishtat 2014); and they can be influenced by course characteristics like time of day, subject, class size, and whether the course is required, all of which are unrelated to teaching effectiveness. In addition, in both observational studies and experiments, SETs have been found to be biased against women and people of color (for recent reviews of the literature, see Basow and Martin 2012 and Spooren, Brockx, and Mortelmans 2015).
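The “categorical measures treated as interval” complaint is easy to illustrate with a toy example (the ratings and response rates below are invented, not from any study): two instructors can have identical mean ratings even when their rating distributions say very different things.

```python
from statistics import mean

# Hypothetical 1-5 Likert responses for two instructors (invented data).
# Treating the ordinal scale as interval and comparing only the means
# hides the fact that the two distributions are entirely different.
instructor_a = [3, 3, 3, 3, 3, 3, 3, 3]  # uniformly "neutral"
instructor_b = [1, 1, 1, 1, 5, 5, 5, 5]  # strongly polarized

same_mean = mean(instructor_a) == mean(instructor_b)
print(same_mean)  # True: identical means, wildly different distributions

# Response rates are hidden the same way: a mean computed from 4 of 40
# enrolled students is far noisier than one computed from 35 of 40,
# but a bare average reported without the response rate conceals that.
```

Reporting the full distribution and the response rate, rather than a single mean, avoids both problems.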

Student evaluations mostly evaluate entertainment value but, as my colleague Bryan Caplan notes, given how boring most classes are, entertainment value is worth something! Thus, the ASA notes that student evaluations can be useful as a form of feedback.

Questions on SETs should focus on student experiences, and the instruments should be framed as an opportunity for student feedback, rather than an opportunity for formal ratings of teaching effectiveness.

Student evaluations are on their way out as a tool for tenure and promotion in Canada. I expect the same here, at least on paper. The ASA is big on “holistic” measures of teacher evaluation as a replacement, which strikes me as weaselly. I’d prefer more objective measures such as value-added scores.


Student surveys have had a strong correlation with value-added in K-12. Perhaps it depends on survey wording, framing, and process.


Great. Now, if teachers can start using objective measures to classify students, rather than relying on subjective means to justify their biases, such as essays, assignments, open-ended exam questions, and class participation, we could get somewhere.

So we should use standardized tests that teachers can just teach to?

Like math, reading, and writing? Hate for my kids to learn that.

"So we should use standardized tests that teachers can just teach to?"

Until the students can routinely score highly on the standardized test, then yes, that's what you teach to. When your students can nail those tests, you're free to teach how you want.

Teachers routinely complain about the lack of flexibility that teaching to the test gives them. But US PISA scores are mediocre.

It is a shrewd placebo.

Standard lefty trope on full display here: "The customer is always wrong."

No one is more qualified to judge a teacher than his students. And that, clearly, is the problem.

Is a student capable of measuring what is best for his or her learning? Perhaps, but that is not obviously true. What is more obvious is that because many (perhaps most) students are at university to gain a credential, rather than to learn, their bias is towards anything that makes the path to that credential easier, and they are likely to evaluate professors on that basis.

"... their bias is towards anything that makes the path to that credential easier, and they are likely to evaluate professors on that basis."

I can only offer anecdotal evidence in agreement with this statement...I don't mean to over-generalize, but in my experience, students prefer profs whose classes offer two main things: great lectures and the least amount of work required outside of the lecture hall or classroom. Thus, perhaps that style is rated highly most often by students.

It's obvious why students prefer the least amount of outside work. And long-lecturing profs tend to like to hear themselves talk; therefore, they tend to demand the least amount of in-class participation (something many students fear more than death), and they believe the least amount of outside work is necessary because, in their minds, the students surely picked it all up in those grand lectures.

I think primary and secondary school experience plays into this: there you can be a successful student, for the most part, if you sit down, shut up, do the occasional busy-work worksheet, and pass the occasional test. College students carry that model in their heads over to higher ed.

"Is a student capable of measuring what is best for his or her learning? Perhaps, but that is not obviously true. What is more obvious is that because many (perhaps most) students are at university to gain a credential, rather than to learn, their bias is towards anything that makes the path to that credential easier, and they are likely to evaluate professors on that basis." With that much animosity and cynicism, I hope you are not in a faculty or school administration position and are merely reflecting a subjective memory from your school years.

"... their bias is towards anything that makes the path to that credential easier, and are likely to evaluate professors on that basis."

Well it's not hard to find evidence in favor of his argument.

Here are student evaluations posted for Alex Tabarrok that marked him as Poor/Awful:

"This was a hard class because of the amount of material you needed to study for each test."
"This class is far from an easy A. There's no way to cheat the system and pick out what you need to study for the tests because he doesn't you anything."
"our class average was a 60 % i'm a 3.7 student and I got a C. DON'T TAKE IS POSSIBLE!"
"This professor wrote his own book! He thinks everything he says is common sense to everyone and his tests (2 tests, 1 final; 30%, 30%, 40%) are SO HARD. Avoid if possible!!"

You lost me at "customer".

Students are not "customers". They are not buying a product, or even a service, in the same way you and I order coffee at the local Starbucks.

How so?

If you break it down, say, for undergrad, most of the cost is room and board, fees for student health and amenities, and books. It's pretty easy to evaluate those as a customer. For the remaining quarter or so, you’re paying for a credential that says you know something about math, history, finance, or engineering, the same as you would pay for a credential to fix air conditioners or change tires. You either get the support to learn the skills or you don’t. I agree with Tyler that post-course evaluations reward cool teachers and have all kinds of biases. But students should, after a few months in the workforce, understand whether they were prepared to succeed.

Because the quality of the barista's service and my Americano do not depend on my own prior preparation, motivation, interest, or willingness or ability to do the work. Moreover, we can debate whether college should be "job training" in the same way that air-conditioner repairers or tire changers are credentialed. Maybe for vocational schools, or for specific vocation-based majors that depend on a post-college credential to work in a specific field (e.g., nursing).

Also, consider that in college the students are wrong quite often. That is not only a normal part of learning, but in certain circumstances, may be one goal of a particular lesson or activity.

A lot of services depend on the Customer's preparation & motivation.

Try getting your taxes done without providing organized receipts or your tax statements. And yet every tax accountant can tell you about customers who bring in a rat's nest of unorganized paperwork and then blame them for the poor results.

Ah, but you are ignoring a key difference. My tax person does not hand out a grade for that rat's nest brought in on April 16th.

I would consider the size of your refund to be the Grade. Bring in all your paperwork, neatly organized, and your Grade will tend to be higher.

My law school used a Blind Anonymous Grading System (BAGS for short). Students who weren't happy with their grades on the exam were convinced that the professor had broken into the school office, copied the list of students with their BAGS numbers, and graded according to personal factors not the results reflected on the students' answers on the exam. Some things never change. I should add that the school had a mandatory Bell curve, which meant that only a few students were happy with their grades. Conspiracies are rampant in law school. I don't recall student evaluations, which means we likely didn't do them since so few students were happy.

"My law school used a Blind Anonymous Grading System": why would any institution not use it?

Go one step further: don't have the lecturer of the course set and mark the exam.

It's fun to read, especially for celebrity professors like Alan Dershowitz or the late David Foster Wallace.

It's pretty easy to guesstimate the SAT scores of the different commenters:

1. "Professor Smith's perspective on the Austro-Hungarian Empire as a managerial state challenged by anti-rationalist Romantic nationalism was quite eye-opening to me."

2. "Smith sux. LOL!!!"

What's interesting about RateMyProfessors is how little attention it attracts. As far as I can tell, a modest corps of students made use of it during the period running from 2003 to 2007, then it seems to have fallen off the radar of all but a few.

I have a suspicion that most of the praising reviews are written by Mary Rosh.

Barkley gets trashed. No surprise there.

A dear man of my acquaintance who had 36 years on the faculty of one institution was trashed in his last years. Uniformly trashed. By students who appeared to have substantive complaints about his grading methods and who claimed to have lodged complaints with the department head. I knew the man and had dealt with his students at an earlier point in his career and heard nothing like this from any party. I knew him outside the office as well. Quite reasonable fellow who'd also done short stints of Peace Corps type work abroad (when he was past 50). Can't process the contrast at all.

I had an addled older chemistry professor, but on the whole found it fun.

We did learn not to copy anything from the blackboard until he stood back to look at it and nodded. He too often erased half an equation and began again. The gap from the nod to erasing the board for the next topic did create a race condition. Students scribbling furiously.

Maybe more fun in retrospect than in the moment.

So the teachers union dislikes being evaluated by the customers. What a surprise! They prefer a more "holistic" way, LOL! They are following their American K-12 union brethren!
A value-added method, Alex? Are you serious? You see them agreeing to something like that? They will get a good laugh, though. I have been in academia a couple of decades and have seen lots of teaching evaluations; take out the extremes and they pretty much tell you what kind of teacher the professor is...

I teach engineers at a university. I find the numerical part of the evaluation highly dubious; are the students grading on a consistent scale, and do they all agree on what the question is? Studies show they do in fact rate women and minorities lower when all available evidence shows that the teaching delivered is equivalent.
I would not do away with evaluations, though. I would just do away with the numbers. The students who take the trouble to write out sentences describing what they liked and didn't like can be very honest and perceptive. It is well worth soliciting their opinions; I have learned from what they have told me in their evaluations, especially the negative ones.

Wouldn't the best value added score be bidding to have a seat in a given teacher's class? That might also help solve the problem of adjunct lecturers getting a flat, and rather low, rate for teaching a course.

You'd run into an information deficit problem. Most students will have a professor for only one class. If they've never been taught by a professor before, how are they to judge how much to bid for that seat? They're equipped to make an informed bid after the first few weeks of a course, but that's too late.

I suspect that student evaluations recorded a year after taking the class would be much more accurate.

You'd never get them to fill out the evaluations.
People's memories are highly suspect, particularly after months have passed.
The real value of education may not be revealed until long after graduation.

We are terrible at judging our own learning and the learning of others; our perception is biased, and we don't read the signals correctly.

Learning requires, at a minimum, a relatively permanent change in long-term memory. Much better if that knowledge is organized in a way that allows solving new problems.

We need to develop metacognitive skills in order to be able to understand and monitor the learning process.

The problem with getting rid of student evaluations is that some professors are just really bad teachers and/or don't put any effort into it. Especially assistant profs who think they need to publish more to get tenure and that teaching is something they are better off ignoring. I think these are mostly the ones that complain about evaluations.

Over many years, I had four professors who had no business in a classroom. One was a contract researcher who had been assigned some teaching duties. The other three were holding or were later awarded endowed chairs. I knew another professor who had some performance deficits that were damaging, but I'd tend to put that down to his wife having had a baby who proved to be difficult.

"SETs have been found to be biased against women and people of color"

Of course they have.

There's actually some pretty good, unbiased research looking at that. Here's one recent study:

Although they did just as good a job of teaching, based on test scores, female instructors tended to be rated lower.

In every other job the manager takes responsibility for the performance of employees.

Yes, customer satisfaction surveys are useful, but any competent manager every once in a while does a quality check on the outcome of employees' work. If the employee is not up to the task, out.

Why can't department heads at universities learn a bit from every other services organization?

>In every other job the manager takes responsibility for the performance of employees.

I'd argue that your premises are wrong. At most research institutions, the desired output is high-impact publications. Teaching quality is a minor factor and SETs are just for show.

There are plenty of institutions that are not focused on research and do care a lot about high-quality teaching, yet still use a lot of SETs.

"Research by professors shows that student evaluations don't correlate well with professor "effectiveness" (as defined by professors)."

This sounds like there may be a conflict of interest somewhere. It sounds like an interesting area of research, constructing different types of student evaluations and comparing them. But let's bring in some unbiased adult supervision.

Can we talk about the fact that the preeminent paper on student evaluations is written by someone named Anne Boring?

Can't help but think she fared poorly on a few student evaluations with that name...

Quoting Steve Sailer above: "It's pretty easy to guesstimate the SAT scores of the different commenters", including here.

I find it difficult to extract reliable generalizations from this situation. There were a few cases from my undergrad (a decade-plus ago) of outrageous student behavior. Horrible RateMyProfessors reviews for an Econ professor who really pushed people and gave us some challenging, innovative exercises. Some crybabies had parents call to complain that the work was too hard. The onus should be on the student, but I don't know how to enforce that.

I want to say the problem is tuition, spoiled brats whose parents pay their way, who don't take it seriously, treat it like a resort. Schools invite this behavior by promoting a "safe", therapeutic environment. That seems to be one segment in the model. But at the same time, lots of people pay tuition and then treat that as a skin-in-the-game type situation, focusing on getting as many skills out of the ruinously expensive experience as possible.

The whole system is a mess. Way too many people in colleges. Schools focus on things that have nothing to do with real education, and the development of remunerative, white collar skills. There has to be a better way to certify that a person is worthy of doing professional work. We need white collar trade schools and to reserve college for people who are PhD material, perhaps. Interestingly, the engineering schools, which are basically white collar trade schools, seem mostly free of these problems.

Unlike high schools or grade schools, which give kids standardized tests, colleges (except in professions like law or medicine, which have exams before one can practice) do not have standardized measurements of proficiency.

So, students pick and reward easy profs, missing out on learning some hard things that are necessary for the world of work.

Consulting firms are relying less on grade point or the school from which the student graduates but are instead giving applicants for internships a problem set to work out or a paper to write as part of the interview.

In the end, it is not the student that is the customer, but rather the would be employer.

Case interviews have been around for decades.

Tech firms use white board interviews.

Banks test basic finance stuff, WACC EBITDA etc.

Hedge funds have used in depth quant stuff for their quant teams for a long time.

For undergrad recruiting though, GPA is still important. SATs are often used as well. From bschools we don’t care about GPA, if you’re at a top 5 we won’t even ask. GMAT is still a factor though.

Just what I’ve seen.

GMAT and LSAT rankings become proxies as well particularly for schools that are not top 10.

The point is: students should be measuring their prof by the outcome, because the outcome is what matters. A demonstrable outcome, because grade inflation makes it difficult to compare.

What if the true value outcome is the starting salary? Lifetime earnings?

That’s why kids are there. Not to learn Professor Y’s take on whether Virginia Woolf’s stream of consciousness writing is a reflection of her intersectional queer identity in a patriarchal society.

From what I remember, the recruiting cycle was studied for completely apart from whatever classes one took. Sure, there’s corporate finance, but you don’t need a class to ace a case interview or a banking interview. And programmers weren’t judged on their grades, but rather by their GitHub work while a student and/or what demonstrable projects they completed. A 1600 SAT or 780 GMAT means the kid won’t have any trouble with whatever quant is necessary, aside from the uber nerds in the basement, and they have a rigorous quant test for them anyhow. And those self-select towards PhDs in physics, math, or econ prior to getting a real job, so there’s an actual filter.

The undergrad students are following rational incentives and the schools follow their customers. If the kid can land a six figure job at 22 years of age then the school did its job. And apparently the professors’ real value add is signaling eliteness and exclusivity. Not teaching.

Paging Bryan Caplan.

A prof is successful if the student lands at Goldman. A doctor is successful if the patient lives a healthy lifestyle to 85.

Neither the professor nor the doctor is adding value 95% of the time. The value is in the selection of the student/patient.

TLDR: I’m cynical. Profs aren’t there to teach and students aren’t there to learn. Profs either add prestige to the School or they add nothing. Any idiot can teach college level classes. A video plus a grad student could teach classes. And the market bears this out, with adjunct profs / lecturers being massively overpaid at $40k. They’re probably not worth $30k.

Student evaluations may have staying power for a number of reasons. Some faculty may complain about them, but they are probably easier for faculty to “pass” than serious alternatives. Most customer service surveys have high means, and SETs are no different.

The real problem is that the people being evaluated are the ones involved in devising the means by which they are evaluated (shared governance at universities). And at many universities, the path to true professor accountability is further stymied by the fact that the faculty most involved in academic governance activities are often not the best teachers and have incentives to protect the weaker teachers by setting a low bar for success. SETs provide that low bar that nearly everyone can pass. And with enrollment pressure at many places, there is that much more focus on pleasing students, which makes SETs that much more compelling.

I don’t think almost anyone in any position of authority has believed that SETs are a good measure of learning. The problem is not knowledge. The problem is the strong disincentives to do things differently.

It’s a corollary of what’s been called the implicit low-low contract between students and professors, ie professor expects less from students in exchange for students expecting less from faculty.

"in both observational studies and experiments, SETs have been found to be biased against women and people of color"

Usually, the realist will admit, such arguments consist of special pleading or downright lying. Is this one of the exceptions?

See ED Hirsch. There are chronic problems with observational studies in educational method.

It is human nature to distance yourself from views that cast you in a negative light. Truly bad teachers who just don't have it in them to become good teachers are more likely to ignore/downplay student evals than to acknowledge there is some truth to what the students are saying.

I teach engineering. I can evaluate how well an engineer does engineering if you give me some of his or her project work, and particularly if I can quiz them about that project work. Engineering is a scientific profession; there are several ways to solve many problems, but the best answer should remain the same. I can tell by their answer, and by how they arrived at it, whether they are a competent engineer.

I have taken several teaching courses, but I am not nearly as certain that I can say that there is a right way to teach, or that I can reliably and finely (i.e., at smaller than gross differences in competence) discriminate between the qualities of teaching I see. Different teachers use different approaches for different classes, and I am not opposed to the notion that the best way of teaching depends heavily on what particular strengths that teacher has, the material being taught, and the makeup and size of the student body. Moreover, different students will be receptive to different styles of teaching. Teaching is a craft, not a science. It is hard, even for craftsmen, to evaluate a craftsman.

Evaluation is important, though, because it makes improvement possible. I favor non-numerical evaluation. Ask the students to express what they liked and didn't like, or had difficulty with, in sentences. The same is true for an experienced professor who comes to evaluate your lecture: don't issue scores, make constructive criticisms. As for evaluating a subordinate, I think a department head is better off reading a collection of written opinions than looking at scores. Scores give the false impression of precision because they are numerical. But the sources of those numbers are people who don't agree on what the numbers mean, what the scale should be, or even what the question meant. A number doesn't tell you how much that number should be trusted, what the error bars should be. Qualitative evaluations convey the opinion of the evaluator more precisely than quantitative evaluations, precisely because of the obvious lack of precision in their sentences. The supposed precision of the quantitative evaluation is the real illusion.

The quality of teaching in general is poor; we need more active teaching in our classrooms, absolutely. That does not change the fact that our tools for evaluation remain limited, and imprecise.

"I have taken several teaching courses, but I am not nearly as certain that I can say that there is a right way to teach...."

The reason is pretty simple: Engineers deal with material objects, teaching deals with people. If I handed an engineer a cubic yard of fill dirt that was 20% different from what the specs called for, they'd reject it out of hand. In contrast, a teacher needs to be able to reach a very broad range of people, ranging from geniuses to dyslexics to, let's be honest, idiots. Teaching is akin to asking an engineer to build a bridge out of whatever material you choose to give him from a scrap yard--it can be done, but everyone's going to have a different pile of scrap so every solution is going to be different.

The most useful feedback from my evals have been the open-ended responses where students can communicate freely. I will add that I was always willing to ask students to tell me what to change AND what not to change before completing them.

Active learning works but students don't like it.
Direct Instruction works but teachers don't like it.

The problem with SET forms is that they're completed in a half-assed way by students devoting 15 minutes to them at the end of the term. An academic economist once told me about survey research: it can be quite a challenge to get people to answer your question. Ask them your question and they answer their question. The manner in which the forms are administered gets you haphazard responses which may indicate more about the professor's grading scheme than anything else. At best, they red-flag some things for the other faculty to consider.

Faculty should be reviewing course syllabi, grade distributions, and videotapes of lectures and seminars, and conducting surprise visits. If my own experience as a consumer of professor services is any guide, about 1/4 have some important performance problems and about 1/2 of those have no business in a classroom. For the most part, teachers are adequate. It's the curriculum that's the problem. Instructional institutions are commonly less than the sum of their parts.

MIT has an award for the teacher that students liked best. It has some official name, but it's generally called the "Kiss of Death Award", because teachers that get it before getting tenure practically never get tenure.

My completely unscientific guess would be that students learn best when their teachers are earning a moderate student evaluation; classes should be easy to follow and enjoyable but not so much so that they don't require effort or don't include the parts of the material that aren't inherently entertaining. Whether you should try to increase or decrease student evaluation scores depends on where your teachers are now, and I could easily imagine that K-12 teachers aren't fun enough and college teachers are too fun.

I once had a colleague on the tenure committee tell me that a faculty member with NO rants or negative feedback is probably not doing a good job.

My father told me the same thing when I started managing subcontractors: "If they're not complaining, you're not working them hard enough."

My evals were always good, but I haven't read them in years. No useful content, even when critical.

This is all a bit namby-pamby. When I was a fresher we drove an incompetent lecturer out of the lecture theatre by shouting a lot and throwing things at him. That got us a new lecturer for the next class.

Them wuz the days.

Most students know good teachers from bad. If you think a teacher is goofy, the evaluations will confirm it. What are teachers afraid of? Their own mediocrity? Maybe the students should give more details, but this would require coaching the students on how to fill out evaluations. We used them in private work, and required detailed criticism and examples. Maybe just tweak them. The worst teachers are math teachers. Many students know as much as, and sometimes more than, the teachers when it comes to math.

"Most students know good teachers from bad."

The issue is, they don't. Or, to put it more accurately: Their standards for "good" and "bad" are generally not related to the quality of the teachers. Look at how much time has been spent here discussing how "entertaining" a teacher is. Further, when I was teaching, there was a pretty clear correlation between the students that put the least amount of effort into class (or were even actively disruptive) and negative reviews. Positive reviews didn't correlate; many of the diligent, active students didn't fill out reviews. Basically reviews were little more than a way for students to stick it to a professor if the professor had the audacity to call them out on the students' bad behavior.

The evaluations should be limited to "how interesting/entertaining was the teacher?"
Even that is problematic, though. It depends on what the student is interested in. Too much depends on the rater's characteristics, mood, grievances, etc.

Better method: Ask "what did the teacher suck at and how could she have done better, or sucked less?"

And: "To what extent did the instructor's sex, sexual orientation/identification, race, ethnic affinity, or political or religious views affect your rating?"

And: "To what extent are you a victim of something or other?"

Even better: let the students evaluate to their heart's content, but don't take it very seriously in promotion decisions. Since they are "customers", let them vote with their dollars by enrolling somewhere else. I predict Harvard, Stanford, and MIT won't be losing much sleep over this possibility.

Or use the Hollywood preview comment system. People who watch a preview version indicate briefly what they liked and didn't. It has generally worked well. Directors generally feel the post-preview version was better (at least in commercially relevant ways). The target viewers see things the creators missed, or vice versa.

Another idea.
For Big lecture, foundation type content- the tests are prepared and graded by other than the lecturer. The material to be tested is specified and available through readings or other sources other than the lecture. The lecture should be for inspiration, interest-generation (one of my profs assigned readings, her own classic, still valid, writings mostly, and spent the class lecture reminiscing about the old days, which I personally found a good way to divide the content up. At that point, she didn't remember what she wrote anyway and didn't care anymore (As it turned out, all the recent cutting edge paradigm shifting theories, were obsolete within the next ten years and laughed at now). The grad TA set the tests (under some supervision, possibly), and graded them. Key thing I believe is to separate the teaching/lecture face-time, from the testing, and make sure all tested material is available outside the lectures.
Of course there will those cases where the material is not yet published, but that be exceptional and probably for advanced, already incentivized students, and not large, UG compulsory classes.
In my personal experience, bad teachers (teachers I didn't care for) were people who repeated what was in the book(s), article(s).
Of course, there are some instructiors with unpleasant personalities, who rub you the wrong way and vice versa, or just don't want to be doing what they are doing, but the alternative is worse.
My personal theory is that ratings are just another way to screw the contingent faculty over, but getting the most bitter, disgruntled students to do it. Correct me if I'm wrong, but low rated teachers are at risk of being dismissed, but high rated instructors are not rewarded with more money, better classes, job security, or the other things that tenured folks seem to prize highly.

