Ahem…

April 13, 2012 at 6:50 am in Education, Web/Tech

The study, funded by the William and Flora Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short essays, written by junior high school students and high school sophomores, to the ratings given to the same essays by trained human readers.

The differences, across a number of different brands of automated essay scoring (AES) software and essay types, were minute. “The results demonstrated that over all, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items,” the Akron researchers write, “with equal performance for both source-based and traditional writing genre.”

“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well,” Mark D. Shermis, the dean of the college of education at Akron and the study’s lead author, said in an interview.

Here is more.

A Berman April 13, 2012 at 6:57 am

To the extent anything can be described, it can be programmed. It’s called the Church-Turing thesis.

david April 13, 2012 at 7:18 am

You cannot acquire a human reader and run millions of essays through it until you optimize your way toward a high-scoring essay, though.

As economists since Lucas well know, it is easy to speculate on heuristics that approximate an existing human process. It is not so easy to identify the actual structural heuristics that said humans operate on, however, so relying on speculative heuristics as policy may simply alter human behavior.

e.g., trivially, it is improbable that automated essay scores have a way to check for patently absurd content in the essay rather than language use. But human markers do, and therefore students minimize making absurd claims (subject to ability) even when it might make employing impressive language easier. Thus this disparity in marker perception would not show up in any comparison of automated scorers to an existing field of human scorers.

This doesn’t make it useless – the automated scorer could score 90% of essays, say, with human scorers taking on a random 10%. Still, we should be wary of attempts to game an automated scorer.
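
A minimal sketch of the split david proposes, with hypothetical machine_score and human_score callables standing in for the real graders: the machine grades everything, a human re-grades a random fraction, and essays where the two disagree by more than a point get flagged for full review.

```python
import random

def score_with_audit(essays, machine_score, human_score, audit_rate=0.10, tolerance=1):
    """Machine-score every essay; have a human re-score a random fraction,
    flagging any essay where the two grades differ by more than `tolerance`."""
    scores, flagged = {}, []
    for essay_id, text in essays.items():
        m = machine_score(text)           # machine grades 100% of essays
        scores[essay_id] = m
        if random.random() < audit_rate:  # human audits a random ~10%
            h = human_score(text)
            if abs(h - m) > tolerance:
                flagged.append((essay_id, m, h))
    return scores, flagged
```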

Doc Merlin April 13, 2012 at 8:19 am

“You cannot acquire a human reader and run millions of essays through it until you optimize your way toward a high-scoring essay, though.”

Actually, you can. It’s already been done. Handwriting, length, and a couple of other small things were what mattered almost exclusively.

You forget that the stated reason for including the essay requirement in the SAT (PSAT) was not to better measure ability, but to increase the percentage of women in the top-scoring brackets. This was to get around certain anti-sex-discrimination requirements.

improbable April 13, 2012 at 9:18 am

Doc, do you have a link for this claim about the SAT?

Doc Merlin April 13, 2012 at 9:32 am

Oh, I found one! I knew from being told by the College Board, though, not from reading it on the internet. (My year was the first year they did it for the PSAT.)

http://www.fairtest.org/test-makers-revise-nat-merit-exam-address-gender-bias-fairtest-complaint-will-lead-millions-more-gir

Andrew' April 13, 2012 at 9:57 am

Do you have a link to the claim that we forgot that?

Can we please bring back separate-but-equal in place of Title IX? The sports radio keeps getting pre-empted by women’s “sports.”

Steven Kopits April 13, 2012 at 10:02 am

Fascinating, Doc. A very Tyler-esque link.

Paul N April 13, 2012 at 9:56 am

Exactly, thanks david!

mpowell April 13, 2012 at 4:19 pm

I took the SAT II in writing 15 years ago. The way to get a good score was not to worry about the content of the claims you were making, but to be sure to organize your essay in exactly the way the rubric called for and to use proper grammar throughout. This approach worked quite well for me. Much easier than actually developing a good thesis and defending it, especially if the best argument calls for 2 points instead of 3, or if point 1 has multiple subpoints that are difficult to develop in the appropriate number of paragraphs.

Long story short: I’m not sure how much content matters in certain contexts with human graders.

MD April 13, 2012 at 6:52 pm

Having once had a job where I graded college papers and essay exams, I found that just producing a reasonably well-organized paper with proper grammar is apparently too much for some people.

Anonymous coward April 13, 2012 at 7:30 am

Rather than reflecting positively on the software, I perceive this as reflecting poorly on the median human essay scorer.

Rahul April 13, 2012 at 7:36 am

+1. Many graders “skim” and never really read.

Also, given that (a) the software replicates the standard deviation of human graders and (b) the differences between the various AES engines were minute:

Does this mean that the differences between human graders are minute too? I find that hard to believe.

Anonymous coward April 13, 2012 at 7:50 am

Why is it hard to believe? I don’t know firsthand, but I imagine the scorers are obligated to score according to some standardized set of criteria. Essentially the human scorer is just running the program defined by this set. If it differs slightly between different institutions, I would expect the differences to be trivial.

dan1111 April 13, 2012 at 8:33 am

I think you are right on with this. The only way that large-scale essay grading can be consistent is by coming up with a rigid, artificial set of rules. I doubt these essay-based standardized tests are very useful.

Rahul April 13, 2012 at 10:11 am

The deeper question is what the trade-off is between consistency and the other goals of assessment.

An (eccentric) rule-based assessment strategy based on, say, word count would be highly consistent yet almost entirely irrelevant.

Master of None April 13, 2012 at 7:54 am

Next step is the software that can write the essay, no? Can’t be too far of a leap.

ShardPhoenix April 13, 2012 at 8:36 am

Presumably essays are NP-complete (i.e., relatively easy to check, relatively hard to do).

Samuel April 13, 2012 at 8:47 am

Any argument that this is too computationally expensive for computers, and thus solely in the realm of humans, is really absurd. Programming a computer to do this is highly non-trivial, but saying people are any better at performing computationally hard tasks than computers does not make sense. What exactly are our brains besides highly error-prone computers?

Anonymous coward April 13, 2012 at 8:57 am

They are insanely powerful computers. Any human brain has more raw processing power than the largest existing supercomputer, to say nothing of the interconnect density.

Samuel April 13, 2012 at 10:03 am

The numbers thrown around for the computational power of the brain are very misleading. Susceptibility to errors is significantly higher with the brain, and this greatly reduces the effective computational power. The high numbers also assume that every part of the brain is calculating at full capacity, which is never the reality. If you’re writing an essay, only a few parts of the brain would be heavily recruited for the task.

ShardPhoenix April 14, 2012 at 4:10 am

I know; I was responding to the “not too far a leap” part, not denying the possibility entirely.

Bill April 13, 2012 at 8:00 am

This essay was written by a machine.

Now it is graded by a machine.

Let’s cut out the middleman.

Ryan April 13, 2012 at 8:04 am

We already know we’re making children into robots. I am not at all surprised to learn that we’ve made the teachers into robots as part of the same process.

I had thought that the ruse was exposed by the Postmodernism Generator (http://www.elsewhere.org/pomo/), but I guess this is still news to some people?

Doc Merlin April 13, 2012 at 8:23 am

You are missing the point. The inclusion of essays in standardized tests (at least in the US, with the SAT and PSAT) was just a way to increase female scores so they lined up closer to male scores. It’s just regulatory arbitrage.

Ryan April 13, 2012 at 8:49 am

There are a lot of problems with that theory, Doc.

Doc Merlin April 13, 2012 at 9:29 am

It’s what the College Board told me. (I was in the first year they included an essay in the PSAT.)

Urso April 13, 2012 at 10:22 am

From the article:

“Even the test-makers’ own research admits that the [prior version of the] test underpredicts the performance of females and over-predicts the performance of males”

So (assuming this line is true) it wasn’t about boosting female scores for the sake of boosting female scores, it was about altering the test to more accurately reflect the academic potential of the test-takers.

Doc Merlin April 13, 2012 at 10:52 am

“So (assuming this line is true) it wasn’t about boosting female scores for the sake of boosting female scores, it was about altering the test to more accurately reflect the academic potential of the test-takers.”

1) Yes, I’ve read studies that show that women do better on classwork and class projects and men do better on tests, on average.

2) No, it was to boost female scores to avoid legal trouble (this was part of a settlement for a lawsuit).

buddyglass April 13, 2012 at 1:40 pm

It’s interesting that the % of females was so low considering how the NMSQT score is composed. It’s basically twice your PSAT verbal + PSAT math. Girls generally out-perform boys on the verbal section whereas boys generally out-perform girls on the math section. Doubling the verbal component also serves to juice the number of native English speakers relative to non-native speakers (read: Asians).

Dean April 14, 2012 at 12:28 am

Doc,

The legal trouble existed because the test artificially disadvantaged women. Any steps taken to avoid that legal trouble were taken in order to no longer artificially disadvantage women.

Urso is correct.

andy April 14, 2012 at 2:44 am

The legal trouble existed because the test artificially disadvantaged women. Any steps taken to avoid that legal trouble were taken in order to no longer artificially disadvantage women.

The definition of ‘artificially disadvantaged’ is ‘did worse’?

The Original D April 14, 2012 at 5:51 pm

Well, given that women are more likely to graduate college, it calls into question the value of the SAT to begin with. Maybe men’s scores should be discounted.

Doc Merlin April 13, 2012 at 9:46 am

Anyway, what were the problems? I am curious what you were going to say.

Ryan April 13, 2012 at 10:19 am

You’ll have to excuse me if I remain skeptical of a link from an advocacy group called “FairTest.org.”

Doc Merlin April 13, 2012 at 10:48 am

They were the ones who sued College Board over the tests, Ryan, alleging they were discriminatory.

mkt April 14, 2012 at 1:03 am

“The inclusion of essays in standardized tests (at least in the US, with the SAT and PSAT) was just a way to increase female scores so they lined up closer to male scores. It’s just regulatory arbitrage.”

*Just* a way? Standardized tests were shown to be biased against females; you seem to be saying that taking steps to correct that (such as including a written essay) is a bad thing.

Was the 13th Amendment just a way to raise the income of Blacks?

The written test has had another useful effect: every study that I have seen (and done) shows that the SAT Writing score is a better predictor of college grades than either the SAT Math or SAT Critical Reading (what used to be called the SAT Verbal). Test scores are still not a very good predictor of college performance, but the SAT Writing is better than the SATM and SATV.

Note that the SAT essays are still graded by humans, not computers. I remain skeptical of computerized grading; see Les Perelman’s comment in the IHE article that Tyler linked to.

Mark April 13, 2012 at 8:06 am

The full article isn’t quite so positive as the excerpt would suggest…

Rahul April 13, 2012 at 8:09 am

Have you tried MIT’s SCIgen?

Actually, I wonder if the automated essay scoring software can be fooled by good, yet off-context, essays?

MIchael Foody April 13, 2012 at 8:45 am

This is important and impressive but I think there are several interesting problems.
1) How easy is it to game? I can imagine that the machine grading uses some shortcuts that would be easy to manipulate; a pastiche of GRE vocab words might be able to garner a decent grade in the absence of decent arguments (a toy sketch of such a scorer follows this comment).
2) If it uses some sort of predictive coding to determine grades, I could see it penalizing novel arguments. If you aren’t making the same points as most of the smart kids, it might not be able to distinguish you from the dumb kids.
3) Additionally, there are people who might be hurt by this grading for stylistic reasons. I wonder how a Hemingway essay would fare?
Note that all these criticisms apply to actual human graders to a greater or lesser extent.
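
A toy illustration of the kind of scorer point 1 worries about, with entirely made-up weights and a made-up word list (not any real engine’s scoring rules): it rewards length and fancy vocabulary, so a pastiche of GRE words with no argument at all can outscore a short, reasoned essay.

```python
# Toy surface-feature scorer -- hypothetical weights, purely for illustration.
FANCY_WORDS = {"ubiquitous", "ephemeral", "paradigm", "dichotomy", "salient"}

def toy_score(essay: str) -> float:
    words = essay.lower().split()
    if not words:
        return 0.0
    length_pts = min(len(words) / 100, 3.0)              # up to 3 points for sheer length
    fancy = sum(w.strip(".,;") in FANCY_WORDS for w in words)
    vocab_pts = min(20.0 * fancy / len(words), 3.0)      # up to 3 points for vocabulary density
    return round(length_pts + vocab_pts, 2)              # 0-6 scale

pastiche = "The ubiquitous paradigm is ephemeral and salient. " * 30   # vocabulary, no argument
concise = "Tax cuts widen deficits unless spending falls, so the claim is incomplete."
print(toy_score(pastiche), toy_score(concise))           # the pastiche wins easily
```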

Bill April 13, 2012 at 8:54 am

This essay was machne graded and given a score of: +1

Bill April 13, 2012 at 8:55 am

machine, not machne. Damn machine.

Lou April 13, 2012 at 10:55 am

Of course they can be gamed. When I took the GMAT, which includes a machine-graded essay component, all the prep books gave you the formula for organizing your essays and a list of organizational words (e.g. “first”, “secondly”, “finally”, “in conclusion”) that the computer scans for. No business schools actually factor the essay score into their admission process.

Bill April 13, 2012 at 9:11 am

Along with SAT primer courses, I can’t wait for the new series:

“Writing for Dummies: How to Write for Machine Graded Essays”

Next it will be poetry.

First they took the essay, and I didn’t speak,

Then they took the Poem.

Erik Olson April 13, 2012 at 9:20 am

I’m thrilled that a computer program can produce random numbers from a distribution while preserving the empirical mean and standard deviation of said distribution! Computers are so useful!

I want to know the rank correlation between human and computer grades. What was quoted alone is just sophistry. I see the article explodes this pretty well.

I look forward to the day when software replicates both the mean and standard deviation of human driving ability.

Try reading this interpretation on morning radio:

“For scores on a scale of 1-6, a computer was able to assign a grade of 3, 4, or 5 to virtually all of the tests a human would grade as a 4. However, the computers didn’t do so well at guessing within one point when the scores were restricted to a range of 1 to 4. Imagine, having your test regraded by a computer gave a 25% chance of raising or lowering your score one full point, nearly as unpredictable as having a teacher regrade your exam!”
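
A minimal sketch of Erik’s complaint, assuming NumPy and made-up scores on a 1-6 scale: a “scorer” that merely shuffles the human grades reproduces the mean and standard deviation exactly, yet correlates with the human grades at roughly zero.

```python
import numpy as np

rng = np.random.default_rng(0)
human = rng.integers(1, 7, size=10_000)      # hypothetical human scores on a 1-6 scale
machine = rng.permutation(human)             # "machine" scores: the same values, shuffled

print(human.mean(), machine.mean())          # identical means
print(human.std(), machine.std())            # identical standard deviations
print(np.corrcoef(human, machine)[0, 1])     # correlation near 0 -- no per-essay agreement
```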

Urso April 13, 2012 at 10:16 am

The first paragraph made me laugh out loud

Manoel Galdino April 14, 2012 at 12:40 pm

+1

M April 13, 2012 at 9:29 am

Are there options to get these checkers privately? It would be about as useful as spell check and grammar check, in that it gives you a backstop against certain errors but both over- and under-flags others. This seems quite useful as a writing-training tool rather than just an evaluation tool.

Bill April 13, 2012 at 10:13 am

I think there are readability software programs used by insurance companies to put policies in plain English as required by statute.

I think, though, that some insurance companies set the program in reverse to make the policy less readable.

scott f April 13, 2012 at 9:44 am

I’m thinking Google Translate:

English -> Chinese -> English
This essay was graded by a machine -> 本文分級機 -> Grader in this article

Kal April 13, 2012 at 10:00 am

Right. Given what I know about the current limitations of natural language processing – and the investment required to produce even something like Siri, which only has to deal with one sentence at a time – this seems to me more like an indictment of putting essays on standardized tests. Or, if you think essays are sufficiently important, an indictment of our massive testing regime.

I mean, the program is certainly not grading the insight or originality of the ideas in the essay. So to the extent it’s correlated with human scorers, that means the human scorers are rewarding kids who memorize spellings over kids who can think critically.

Popeye April 13, 2012 at 10:20 am

The correlations between human/computer graders on the 8 test sets ranged from 0.61 to 0.85. Is that good enough? Calling the differences “minute” seems like a stretch.

“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well” — laughably stupid.

JWatts April 13, 2012 at 11:29 am

“The correlations between human/computer graders on the 8 test sets ranged from 0.61 to 0.85. Is that good enough? Calling the differences “minute” seems like a stretch.”

Maybe, but I’d like to know what the correlation was between random human graders. It might not have been any better.

““In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well” — laughably stupid.”

Well, it’s not quite that bad, but yes that phrase doesn’t tell you everything.

Andy April 15, 2012 at 3:27 am

It’s probably comparable to the correlation between human graders. I’m only familiar with the AP English essay grading, but even there, where they spent more time on each essay, it’s common to get two different scores from human graders.

chuck martel April 13, 2012 at 11:43 am

So this is actually a computer-generated blog and the comments are being made by idle computers that owners neglected to shut down when they left for work this morning. In fact, I’m a computer myself and after I get through messing around here I’m going to go over and flame Carpe Diem.

buddyglass April 13, 2012 at 1:35 pm

“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well,”

This is such a meaningless statement. What we really care about is some measure of aggregate error, not whether the mean and standard deviation are the same. I mean, imagine a computer scorer whose scores are the “mirror image” of the human scorer’s marks reflected across the mean. So if the mean is “B” and a human scorer gave a paper an “A” then the computer gives it a “C”. Etc. That scoring algorithm would generate an identical mean and stdev, but the aggregate error would be enormous. Except for those close to the mean, it would score every paper the exact opposite of how a human would.

What we really care about is whether there are any “big mistakes.” Are there any papers that a consensus of humans rates as exemplary that the computer rates as “average” or “below average”? Are there any papers that a consensus of humans rates as “terrible” that a computer rates as “average” or “above average”? Etc.

Should probably look at RMSE (or similar) between computer and humans.
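
A quick check of the mirror-image example above, assuming NumPy and hypothetical 1-6 scores: reflecting every score across the mean preserves the mean and standard deviation exactly, while the per-essay error (RMSE) is roughly twice the standard deviation and the correlation is -1.

```python
import numpy as np

rng = np.random.default_rng(1)
human = rng.integers(1, 7, size=10_000).astype(float)   # hypothetical 1-6 human scores
mirror = 2 * human.mean() - human                        # reflect each score across the mean

print(human.mean(), mirror.mean())                       # means match
print(human.std(), mirror.std())                         # standard deviations match
print(np.sqrt(np.mean((human - mirror) ** 2)))           # RMSE: large aggregate error
print(np.corrcoef(human, mirror)[0, 1])                  # correlation of -1
```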

Max April 13, 2012 at 1:39 pm

But as soon as humans figure out machines are grading them, it will be easier to fake an essay to pass the machines.

MikeDC April 13, 2012 at 2:38 pm

Doubtful. Writing a bunch of random and meaningless stuff into a cohesive, test-passing result actually sounds much more difficult and time-consuming than simply writing a reasonable essay.

Steve Sailer April 13, 2012 at 6:46 pm

The real world implication of this is that the new Writing component of the SAT can be fairly easily gamed by students whose parents pay for a lot of test prep to drill them in the right essay-writing gimmicks. In other words, recent changes in the SAT made in the name of increased fairness and diversity have mostly turned out to be gifts to the Tiger Mothers of the world.

Bradley Gardner April 13, 2012 at 7:04 pm

Someone should sell that software to Kaplan

Jim April 14, 2012 at 12:52 am

Is the moral of the story that computers are smart, or that people are stupid?

Mark Thorson April 14, 2012 at 4:08 pm

There’s a practical application for this technology, if it works, which is reviewing TV and movie scripts for Hollywood executives.

http://www.vulture.com/2012/04/polone-who-reads-movie-and-tv-scripts.html

However, I’m deeply skeptical that this technology works.

The Original D April 14, 2012 at 5:53 pm

How long before they can grade econ tests? How long before they can *teach* econ courses?
