Ahem…

The study, funded by the William and Flora Hewlett Foundation, compared the software-generated ratings given to more than 22,000 short essays, written by junior high school students and high school sophomores, to the ratings given to the same essays by trained human readers.

The differences, across a number of different brands of automated essay scoring software (AES) and essay types, were minute. “The results demonstrated that over all, automated essay scoring was capable of producing scores similar to human scores for extended-response writing items,” the Akron researchers write, “with equal performance for both source-based and traditional writing genre.”

“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well,” Mark D. Shermis, the dean of the college of education at Akron and the study’s lead author, said in an interview.

Here is more.

Comments

To the extent anything can be described, it can be programmed. It's called the Church-Turing thesis.

You cannot acquire a human reader and run millions of essays through it until you optimize your way toward a high-scoring essay, though.

As economists since Lucas well know, it is easy to speculate on heuristics that approximate an existing human process. It is not so easy to identify the actual structural heuristics that said humans operate on, however, so relying on speculative heuristics as policy may simply alter human behavior.

E.g., trivially, it is improbable that automated essay scorers have a way to check for patently absurd content in the essay rather than language use. But human markers do, and therefore students minimize making absurd claims (subject to ability) even when it might make employing impressive language easier. Thus this disparity in marker perception would not show up in any comparison of automated scorers to an existing field of human scorers.

This doesn't make it useless - the automated scorer could score 90% of essays, say, with human scorers taking on a random 10%. Still, we should be wary of attempts to game an automated scorer.
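
This is only a sketch of that hybrid setup, assuming essays arrive as a plain Python list; the function name, the seed, and the routing logic are illustrative assumptions, not anything from the study:

```python
import random

def route_essays(essays, human_fraction=0.10, seed=0):
    """Send a random fraction of essays to human scorers, the rest to the machine.

    Hypothetical sketch of the 90/10 idea above; the names and defaults are
    assumptions for illustration, not part of any real AES system.
    """
    rng = random.Random(seed)
    audit_count = int(len(essays) * human_fraction)
    audit_ids = set(rng.sample(range(len(essays)), audit_count))
    machine_pool = [e for i, e in enumerate(essays) if i not in audit_ids]
    human_pool = [e for i, e in enumerate(essays) if i in audit_ids]
    return machine_pool, human_pool
```

Comparing the machine's scores on the audited tenth against the human scores would then be the natural check for drift or gaming.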

"You cannot acquire a human reader and run millions of essays through it until you optimize your way toward a high-scoring essay, though."

Actually, you can. It's already been done. Handwriting, length, and a couple of other small things were what mattered almost exclusively.

You forget that the stated reason for including the essay requirement in the SAT (PSAT) was not to better measure ability, but to increase the percentage of women in the top scoring brackets. This was to get around certain anti-sex-discrimination requirements.

Doc, do you have a link for this claim about SAT?

Oh, I found one! I knew from being told by the College Board, though, not from reading it on the internet. (My year was the first year they did it for the PSAT.)

http://www.fairtest.org/test-makers-revise-nat-merit-exam-address-gender-bias-fairtest-complaint-will-lead-millions-more-gir

Do you have a link to the claim that we forgot that?

Can we please bring back separate-but-equal in place of Title IX. The sports radio keeps getting pre-empted with women's "sports."

Fascinating, Doc. A very Tyler-esque link.

Exactly, thanks david!

I took the SAT II in writing 15 years ago. The way to get a good score was to not worry about the content of the claims that you were making but just be sure to organize your essay in exactly the way the rubric called for and use proper grammar throughout. This approach worked quite well for me. Much easier than actually developing a good thesis and defending it, especially if the best argument calls for 2 points instead of 3 or if point 1 has multiple subpoints that are difficult to develop in the appropriate number of paragraphs.

Long story short: I'm not sure how much content matters in certain contexts with human graders.

Having once had a job where I graded college papers and essay exams, I found that just producing a reasonably well-organized paper with proper grammar is apparently too much for some people.

Rather than reflecting positively on the software, I perceive this as reflecting poorly on the median human essay scorer.

+1. Many graders "skim" and never really read.

Also, given that (a) the software replicates the standard deviation of human graders and (b) the differences between the various AES engines were minute, does this mean that the differences between human graders are minute too? I find that hard to believe.

Why is it hard to believe? I don't know firsthand, but I imagine the scorers are obligated to score according to some standardized set of criteria. Essentially the human scorer is just running the program defined by this set. If it differs slightly between different institutions, I would expect the differences to be trivial.

I think you are right on with this. The only way that large-scale essay grading can be consistent is by coming up with a rigid, artificial set of rules. I doubt these essay-based standardized tests are very useful.

The deeper question is what the trade-off is between consistency and the other goals of assessment.

An (eccentric) rule-based assessment strategy based on, say, word count would be highly consistent yet almost entirely irrelevant.
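
For illustration only, here is what such an eccentric rule might look like; the cut points are made up, and the point is just that the rule is perfectly repeatable while measuring nothing about the quality of the argument:

```python
def word_count_score(essay: str) -> int:
    """Toy rule-based scorer: map raw word count onto a 1-6 scale.

    Perfectly consistent (the same essay always gets the same score),
    almost entirely irrelevant (content never enters the rule).
    The cut points are arbitrary assumptions for this sketch.
    """
    words = len(essay.split())
    cut_points = [50, 120, 200, 300, 450]  # crossing each one adds a point
    return 1 + sum(words >= c for c in cut_points)
```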

Next step is the software that can write the essay, no? Can't be too far of a leap.

Presumably essays are NP-complete (i.e., relatively easy to check, relatively hard to do).

Any argument that this is too computationally expensive for computers, and thus solely in the realm of humans, is really absurd. Programming a computer to do this is highly non-trivial, but saying people are any better at performing computationally hard tasks than computers does not make sense. What exactly are our brains besides highly error-prone computers?

They are insanely powerful computers. Any human brain has more raw processing power than the largest existing supercomputer, to say nothing of the interconnect density.

The numbers thrown around for the computational power of the brain are very misleading. Susceptibility to errors is significantly higher with the brain, and this greatly reduces the effective computational power. The high numbers also rest on the assumption that every part of the brain is calculating at full capacity, which is never the reality. If you're writing an essay, only a few parts of the brain would be heavily recruited for this task.

I know, I was responding to the "not too far a leap" part, not the possibility entirely.

This essay was written by a machine.

Now it is graded by a machine.

Let's cut out the middleman.

We already know we're making children into robots. I am not at all surprised to learn that we've made the teachers into robots as part of the same process.

I had thought that the ruse was exposed by the Postmodernism Generator (http://www.elsewhere.org/pomo/), but I guess this is still news to some people?

You are missing the point. In standardized tests, the inclusion of essays (at least in the US with the SAT and PSAT) was just a way to increase female scores so they lined up closer to male scores in standardized tests. It's just regulatory arbitrage.

There are a lot of problems with that theory, Doc.

It's what the College Board told me (I was in the first year they included an essay in the PSAT).

Oh, here I found an article that backs up what I remember.
http://www.fairtest.org/test-makers-revise-nat-merit-exam-address-gender-bias-fairtest-complaint-will-lead-millions-more-gir

From the article:

"Even the test-makers' own research admits that the [prior version of the] test underpredicts the performance of females and over-predicts the performance of males"

So (assuming this line is true) it wasn't about boosting female scores for the sake of boosting female scores, it was about altering the test to more accurately reflect the academic potential of the test-takers.

"So (assuming this line is true) it wasn’t about boosting female scores for the sake of boosting female scores, it was about altering the test to more accurately reflect the academic potential of the test-takers."

1) Yes, I've read studies that show that women do better on classwork and class projects and men do better on tests, on average.

2) No, it was to boost female scores to avoid legal trouble (this was part of a settlement of a lawsuit).

It's interesting that the % of females was so low considering how the NMSQT score is composed. It's basically twice your PSAT verbal + PSAT math. Girls generally out-perform boys on the verbal section whereas boys generally out-perform girls on the math section. Doubling the verbal component also serves to juice the number of native English speakers relative to non-native speakers (read: Asians).

Doc,

The legal trouble existed because the test artificially disadvantaged women. Any steps taken to avoid that legal trouble were taken in order to no longer artificially disadvantage women.

Urso is correct.

"The legal trouble existed because the test artificially disadvantaged women. Any steps taken to avoid that legal trouble were taken in order to no longer artificially disadvantage women."

The definition of 'artificially disadvantaged' is 'did worse'?

Well, given that women are more likely to graduate college, it calls into question the value of the SAT to begin with. Maybe men's scores should be discounted.

Anyway, what were the problems? I am curious what you were going to say.

You'll have to excuse me if I remain skeptical of a link from an advocacy group called "FairTest.org."

They were the ones who sued College Board over the tests, Ryan, alleging they were discriminatory.

"the inclusion of essays (at-least in the US with the SAT and PSAT) was just a way to increase female scores so they lined up closer to male scores in standardized tests. Its just regulatory arbitrage."

*Just* a way? Standardized tests were shown to be biased against females; you seem to be saying that taking steps to correct that (such as including a written essay) is a bad thing.

Was the 13th Amendment just a way to raise the income of Blacks?

The written test has had another useful effect: every study that I have seen (and done) shows that the SAT Writing score is a better predictor of college grades than either the SAT Math or SAT Critical Reading (what used to be called the SAT Verbal). Test scores are still not a very good predictor of college performance, but the SAT Writing is better than the SATM and SATV.

Note that the SAT essays are still graded by humans, not computers. I remain skeptical of computerized grading; see Les Perelman's comment in the IHE article that Tyler linked to.

The full article isn't quite so positive as the excerpt would suggest...

Have you tried MIT's SciGen?

Actually, I wonder if the automated essay scoring software can be fooled by good, yet off-context, essays?

This is important and impressive but I think there are several interesting problems.
1) How easy is it to game? I can imagine that the machine grading uses some shortcuts that would be easy to manipulate; a pastiche of GRE vocab words might be able to garner a decent grade in the absence of decent arguments.
2) If it uses some sort of predictive coding to determine grades I could see it penalizing novel arguments. If you aren't making the same points as most of the smart kids it might not be able to distinguish you from the dumb kids.
3) Additionally, there are also people who for stylistic reasons might be hurt by this grading. I wonder how a Hemingway essay would fare?
Note that all these criticisms apply to actual human graders to a greater or lesser extent.

This essay was machne graded and given a score of: +1

machine, not machne. Damn machine.

Of course they can be gamed. When I took the GMAT, which includes a machine-graded essay component, all the prep books gave you the formula for organizing your essays and a list of organizational words (e.g. "first", "secondly", "finally", "in conclusion") that the computer scans for. No business schools actually factor the essay score into their admission process.
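
Purely as an illustration of the kind of scan the prep books describe (the word list and the crude counting are guesses on my part, not the GMAT's actual algorithm):

```python
# Hypothetical transition-word scan; the word list and the idea of simply
# counting matches are assumptions for illustration only.
ORGANIZATIONAL_WORDS = ("first", "secondly", "finally",
                        "furthermore", "moreover", "in conclusion")

def transition_word_count(essay: str) -> int:
    """Count occurrences of stock organizational words in an essay."""
    text = essay.lower()
    return sum(text.count(word) for word in ORGANIZATIONAL_WORDS)
```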

Along with SAT primer courses, I can't wait for the new series:

"Writing for Dummies: How to Write for Machine Graded Essays"

Next it will be poetry.

First they took the essay, and I didn't speak,

Then they took the Poem.

I'm thrilled that a computer program can produce random numbers from a distribution while preserving the empirical mean and standard deviation of said distribution! Computers are so useful!

I want to know the rank correlation between human and computer grades. What was quoted alone is just sophistry. I see the article explodes this pretty well.
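
A sketch of the statistic being asked for here: Spearman rank correlation on paired grades. The scores below are made up, and SciPy is assumed to be available.

```python
from scipy.stats import spearmanr  # assumes SciPy is installed

# Made-up paired grades on a 1-6 scale; not data from the Hewlett study.
human_grades   = [4, 3, 5, 2, 6, 4, 3, 5]
machine_grades = [4, 4, 5, 2, 5, 3, 3, 4]

rho, p_value = spearmanr(human_grades, machine_grades)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```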

I look forward to the day when software replicates both the mean and standard deviation of human driving ability.

Try reading this interpretation on morning radio:

"For scores on a scale of 1-6, a computer was able to assign a grade of 3, 4, or 5 to virtually all of the tests a human would grade as a 4. However, the computers didn't do so well at guessing within one point when the scores were restricted to a range of 1 to 4. Imagine, having your test regraded by a computer gave a 25% chance of raising or lowering your score one full point, nearly as unpredictable as having a teacher regrade your exam!"

The first paragraph made me laugh out loud.

+1

Are there options to get these checkers privately? It would be about as useful as spell and grammar check in that it gives you a backstop against certain errors but both over- and under-specifies others. This seems quite useful as a writing training tool rather than just as an evaluation tool.

I think there are readability software programs used by insurance companies to put policies in plain English as required by statute.

I think, though, that some insurance companies set the program in reverse to make the policy less readable.

I'm thinking Google Translate:

English -> Chinese -> English
This essay was graded by a machine -> 本文分級機 -> Grader in this article

Right. Given what I know about the current limitations of natural language processing - and the investment required to produce even something like Siri, which only has to deal with one sentence at a time - this seems to me more like an indictment of putting essays on standardized tests. Or, if you think essays are sufficiently important, an indictment of our massive testing regime.

I mean, the program is certainly not grading the insight or originality of the ideas in the essay. So to the extent it's correlated with human scorers, that means the human scorers are rewarding kids who memorize spellings over kids who can think critically.

The correlations between human/computer graders on the 8 test sets ranged from 0.61 to 0.85. Is that good enough? Calling the differences "minute" seems like a stretch.

"In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well" -- laughably stupid.

"The correlations between human/computer graders on the 8 test sets ranged from 0.61 to 0.85. Is that good enough? Calling the differences “minute” seems like a stretch."

Maybe, but I'd like to know what the correlation was between random human graders. It might not have been any better.

"“In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well” — laughably stupid."

Well, it's not quite that bad, but yes that phrase doesn't tell you everything.

It's probably comparable to the correlation between human graders. I'm only familiar with the AP English essay grading, but even there, where they spent more time on each essay, it's common to get two different scores from human graders.

So this is actually a computer-generated blog and the comments are being made by idle computers that owners neglected to shut down when they left for work this morning. In fact, I'm a computer myself and after I get through messing around here I'm going to go over and flame Carpe Diem.

"In terms of being able to replicate the mean [ratings] and standard deviation of human readers, the automated scoring engines did remarkably well,"

This is such a meaningless statement. What we really care about is some measure of aggregate error, not whether the mean and standard deviation are the same. I mean, imagine a computer scorer whose scores are the "mirror image" of the human scorer's marks reflected across the mean. So if the mean is "B" and a human scorer gave a paper an "A" then the computer gives it a "C". Etc. That scoring algorithm would generate an identical mean and stdev, but the aggregate error would be enormous. Except for those close to the mean, it would score every paper the exact opposite of how a human would.

What we really care about is whether there are any "big mistakes". Are there any papers that a consensus of humans rate as exemplary that the computer rates as "average" or "below average"? Are there any papers that a consensus of humans rate as "terrible" that a computer rates as "average" or "above average"? Etc.

Should probably look at RMSE (or similar) between computer and humans.
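
A quick numerical check of the mirror-image point, with made-up grades (none of this comes from the study): reflecting every score across the mean preserves the mean and standard deviation exactly, while the per-essay error, measured here as RMSE, is huge on a 1-6 scale.

```python
import math
import statistics

human = [2, 3, 3, 4, 4, 4, 5, 5, 6]      # made-up human grades on a 1-6 scale
mu = statistics.mean(human)
mirrored = [2 * mu - s for s in human]   # reflect each grade across the mean

rmse = math.sqrt(sum((h - m) ** 2 for h, m in zip(human, mirrored)) / len(human))

print(statistics.mean(human), statistics.mean(mirrored))      # identical means
print(statistics.pstdev(human), statistics.pstdev(mirrored))  # identical std devs
print(round(rmse, 2))                                         # ~2.31: large per-essay error
```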

But as soon as humans figure out machines are grading them, it will be easier to fake an essay to pass the machines.

Doubtful. Writing a bunch of random and meaningless stuff into a cohesive, test-passing result actually sounds much more difficult and time-consuming than simply writing a reasonable essay.

The real world implication of this is that the new Writing component of the SAT can be fairly easily gamed by students whose parents pay for a lot of test prep to drill them in the right essay-writing gimmicks. In other words, recent changes in the SAT made in the name of increased fairness and diversity have mostly turned out to be gifts to the Tiger Mothers of the world.

Someone should sell that software to Kaplan

Is the moral of the story that computers are smart, or that people are stupid?

There's a practical application for this technology, if it works, which is reviewing TV and movie scripts for Hollywood executives.

http://www.vulture.com/2012/04/polone-who-reads-movie-and-tv-scripts.html

However, I'm deeply skeptical that this technology works.

How long before they can grade econ tests? How long before they can *teach* econ courses?
