Books for bots Goodhart’s Law edition?

Two employees at the East Lake County Library created a fictional patron called Chuck Finley — entering fake driver’s license and address details into the library system — and then used the account to check out 2,361 books over nine months in 2016, in order to trick the system into believing that the books they loved were being circulated to the library’s patrons, thus rescuing the books from automated purges of low-popularity titles.

Library branch supervisor George Dore was suspended for his role in the episode; he said that he was trying to game the algorithm because he knew that these books would come back into vogue and that his library would have to spend extra money re-purchasing them later. He said that other libraries were doing the same thing.

Data falsification will be one of the biggest stories of the next five years. That is from BoingBoing, via Ted Gioia.

Comments

Isn't it already one of the biggest stories in academia?

Comments for this post are closed

It isn't data falsification that is the story; it is the blitheringly stupid software systems. The people working around the stupidity get fired.

The only solution is to have it work as designed, and document everything. This guy should have done up a list and gotten confirmation signed by his manager.

Or to simply ignore everything. They can't fire us all.

By the way, this is endemic, and I would suggest is the cause of the flat productivity in the last years.

I think you'd have bad employee retention if long-term employees at a library weren't able to vouch for the value of at least some handful of seldom used titles.

Maybe they should raise the threshold slightly and then make it some issue of seniority to do something like what you suggest. First year you get to save two (if you want), and it increases by one title every two years, with additional perks accompanying promotions.

The next five years? I think this has already been happening for quite some time. And on a more massive scale than is known.

He should be applying for a job at the WaPo or NYT.

Or Wells Fargo

He doesn't want to go to a place where he might get caught.

The newest libraries look like airport terminals and have more "customers" for computer internet services than books. In big cities libraries have become daycare facilities for the unemployed, who spend their afternoons playing games or watching cat videos. Since libraries have been, historically, places for the storage of books, why should lack of circulation lead to them being discarded? The librarians should be commended for data falsification in this case. Also, why are libraries so much alike?

"Since libraries have been, historically, places for the storage of books, why should lack of circulation lead to them being discarded? "

This has always been the case. Libraries have a limited amount of storage space and they routinely buy new books. Which means that after a certain period of time (when the shelves hit maximum capacity) they routinely purge the same number of books as they buy every period.

This was a case of two Librarians deciding to choose to preference books they liked and let other books they didn't like be discarded.

I wonder if Fahrenheit 451 made the cut.

"This was a case of two Librarians deciding to choose to preference books they liked and let other books they didn’t like be discarded."

Hasn't this been the responsibility of librarians since the dawn of time? This is not a story of data falsification. This is a story of self-styled technocrats who incorrectly believe that rote algorithms (here, measuring the frequency of circulation) are the be-all and end-all of human society, and the brave individuals smart enough to realize that the technocrats are about 1/10th as clever as they think they are. Vox delenda est.

Whether you think it was good or bad, it was still clearly data falsification. They misled everyone else by falsifying the data regarding circulation numbers for certain books.

Sure it's data falsification, but the bigger question is why they were in a position where they felt they had to falsify data?

Sure it's data falsification but that's a secondary issue, compared to the question of trusting an algorithm over human judgement.

Data falsification (or preventing it) is merely a means to an end; what truly matters is not data falsification but rather the accuracy of the results, the efficiency of the library's operations, etc.

What we don't know is if the librarians' judgements will be better than the algorithms. From the article we don't know how well-tested the algorithm is, how robust and how relevant were the training data that it used, and how accurate the algorithm is under various scenarios. And also how wise the librarians are.

I doubt the motivation was to save these so-called rarely read books. I think the books were chosen so that the scam wouldn't be revealed when actual people tried to check out a Finley book. If I had to guess, the library manager was probably trying to show higher numbers of patrons utilizing the book inventory. When caught, he decided he needed a story that made him seem more noble.

I don't think that's the case, because they only created one patron. However, it's possible that the number of books was significant to their actual circulation numbers if they were a small branch library.

How do you know it was the only one? "Chuck Finley" might have eventually caught someone's eye because it was a pseudonym used extensively by Bruce Campbell's character on Burn Notice.

Thanks for catching that. I think we have a plot line for My Name is Still Bruce.

I was really wondering if that's how they picked the name. It stood out to me when Bruce Campbell used it because of the former major league pitcher Chuck Finley.

In addition, it was almost surely easier to create a small number of super voracious readers than it was to create a large number of normal readers. My suspicion is that the metric being targeted was number of books checked out over a given period. This probably plays into budgeting and staffing decisions made at the town/county level.

"How do you know it was the only one?"

There might have been a few more, but the sheer volume of checkouts means that there was probably only one. Or only one significant fake ID.

"My suspicion is that the metric being targeted was number of books checked out over a given period."

Agreed.

'Data falsification will be one of the biggest stories of the next five years. '

In our dawning post-truth world, this will not matter. After all, any data we don't like is clearly falsified - just ask any climate change denialist. Or young earth creationist. Or Infowars reader.

I am not a bot.

Um, yeah, neither am I! Chuck let's meet up for human things like eating and, like, cuddling? Sleeping? You know, stuff we humans do all the time.

Chuck and Skynet and everyone else here, let's all try to take our pulses, so we can be absolutely sure whether we are bots or not. We must find out.

Ok, Chuck and Skynet - it's your birthday, and someone gives you a calfskin wallet. How do you react?

This fake Skynet is fooling no one.

I'm curious to know which books they were.
What books do librarians like that library patrons do not?

Probably seldom read classics.

BTW, I wonder if libraries have an established canon of great or significant books they keep around even though most of them are seldom read.

If there's anyone I can forgive for falsifying data, it's a librarian trying to save good books from being discarded when they are temporarily out of favor.

I find it abhorrent when everything is a popularity contest, and a short term one at that.

I am frequently astounded at discovering classic books that my local library no longer carries.

Books become public domain after 70 years, so you can download an ebook of the "classics" for free or buy a copy ridiculously cheap. For this reason, I don't find it strange that libraries aren't carrying these books.

70 years after the death of the author, which means it's more like >100 years.

Depends where you live. Some places only protect American authors (or those who obtained rights) for 50 years. Some not at all.

This is old news. Orlando Sentinel 12/30/2016. Librarians claimed they knew better than algorithm. I'd bet librarians were wrong - although I'm assuming some competence by people who chose the software, which may be overly naive.

Probably, at some point, all books that have been in print will have been committed to an accessible electronic record and this issue will disappear.

Hazel Meade asks the question pertinent to determine whether the proposed defense makes sense. A discussion on morality could follow. Everyone expounding on some theory about 'what was really going on' is only signalling, instead of caring about the facts of the matter.

"Data falsification" - true, you are inserting fake data into the pool of big data.

But what about "automated purges of low-popularity titles." We need a term for that. Not the automated purge but this implicit assumption that an algorithm is by default the 'correct' answer to the question of which titles should be purged and which shouldn't.

Why shouldn't the algorithm be assumed to be 'under construction' as it generates suggestions for purges until it is agreed that it is suitable? I think, for example, an algorithm should also look back over the last ten years for requests for books that were 5+ years old in order to try to stop throwing away books whose popularity cycles every few years.

Even better, I think, is Kasparov's idea that the best system combines man with computer. Let the algorithm generate suggestions but let the librarians be able to swap a book out of the kill list if they manually choose another in its place.
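A minimal sketch of that swap rule in Python. Everything here is an assumption for illustration: the function name `swap_on_kill_list`, the kill list as a plain list of titles, and the rule that each rescue must be paid for with exactly one replacement. Nothing in the article says how the library's actual software works.

```python
def swap_on_kill_list(kill_list, catalog, rescue, replacement):
    """Return a new kill list with `rescue` traded for `replacement`.

    Hypothetical rule: a librarian may save one title only by naming
    another title, currently safe, to be purged in its place. The kill
    list therefore never gets shorter, so no extra shelf space is needed.
    """
    if rescue not in kill_list:
        raise ValueError(f"{rescue!r} is not on the kill list")
    if replacement not in catalog:
        raise ValueError(f"{replacement!r} is not in the catalog")
    if replacement in kill_list:
        raise ValueError(f"{replacement!r} is already on the kill list")
    # Keep the original order; only the rescued slot changes.
    return [replacement if title == rescue else title for title in kill_list]


catalog = ["Jane Eyre", "NAFTA Explained", "Airport Thriller", "Diet Fad 2014"]
suggested = ["Jane Eyre", "NAFTA Explained"]  # what the algorithm proposes
final = swap_on_kill_list(suggested, catalog, "Jane Eyre", "Diet Fad 2014")
```

The one-for-one constraint is the point of the design: the librarian overrides which books go, not how many.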

I'm under the impression this is old news and that humans no longer add value to computers in advanced chess. I.e., having a human "help" the computer just hurts the quality of play. The rule of thumb in AI news is that if there's ever a feel-good component to the story where people come out on top, that part is false or soon to become quaint and obsolete...

In related news, though, a newer version of AlphaGo secretly played 60 matches against the world's top players with no losses and no draws (at least until one match ended in a connection timeout). So we have that going for us.

http://www.wsj.com/articles/ai-program-vanquishes-human-players-of-go-in-china-1483601561

I can believe that human judgement won't help the top computers when they're playing chess. I'm willing to bet that human judgement would improve the decisions made by the algorithm being used by the East Lake County Library.

We could even sic IBM, Deep Blue, and Watson on the problem and maybe they could come up with an algorithm that beats the humans.

But only if they have a big pool of relevant training data! If the East Lake County Library's data only go back several years, IBM and Watson won't be able to do squat.

I suspect we're overthinking what was likely an attempt to make the library seem more used, so that these guys could keep their easy jobs.

Training data relevant to what target? I suspect there is not much more to the algorithm in question than looking at aggregated check-outs over some window X and then grabbing the bottom Y% to be disposed of. The only possible addition the librarians could then make would be to extend X because they know of some longer cycle of popularity. And that sounds like a political question. (i.e. "No, Mr. Smith, the school board does not want to spend more shelf space on the Brontës.")
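To make the guess above concrete, here is a rough Python sketch of a "window X, bottom Y%" rule. It is only the commenter's suspicion turned into code, not the library's real system; the function name `purge_candidates`, the `(title, date)` checkout records, and the alphabetical tie-break are all assumptions.

```python
from collections import Counter
from datetime import date, timedelta
import math


def purge_candidates(checkouts, catalog, today, window_days, bottom_fraction):
    """Return the least-circulated titles over the trailing window.

    checkouts: iterable of (title, checkout_date) pairs (assumed format).
    window_days: the look-back window "X", in days.
    bottom_fraction: the share "Y" of the catalog to flag, e.g. 0.5.
    """
    cutoff = today - timedelta(days=window_days)
    counts = Counter(title for title, when in checkouts if when >= cutoff)
    # Titles with no checkouts in the window count as zero.
    # Ties are broken alphabetically so the result is deterministic.
    ranked = sorted(catalog, key=lambda title: (counts[title], title))
    return ranked[: math.floor(len(catalog) * bottom_fraction)]


catalog = ["A", "B", "C", "D"]
checkouts = [
    ("A", date(2016, 6, 1)),
    ("A", date(2016, 7, 1)),
    ("B", date(2016, 7, 1)),
    ("C", date(2010, 1, 1)),  # popular once, years ago
]
today = date(2016, 12, 31)
one_year = purge_candidates(checkouts, catalog, today, 365, 0.5)
ten_years = purge_candidates(checkouts, catalog, today, 3650, 0.5)
```

With a one-year window, title C looks as dead as the never-borrowed D; with a ten-year window its old checkout counts and B is flagged instead. Under this rule the choice of X decides which books survive.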

Of possible interest: http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=7825&context=etd_theses

According to the article, that is exactly the additional information that the librarians are bringing to the decision: just because a book is currently at the bottom of the popularity list doesn't mean it will stay there. A librarian with 30 years of experience is going to have a lot more information than a computer with data that go back, say, 5 years. (The article doesn't say how far back the training data go, nor how far back the librarians go. But good quality databases usually do not have many years of good data in them, especially compared to a 60-year-old professional's experience.)

I don't see where politics comes in. The article doesn't say that the librarians think that the library needs to hold on to more books; they think it should hold on to the right books. Their actions would not result in any more shelf space being needed.

And the algorithm might very well be using the simplistic decision-making that you describe, which unlike say Deep Blue's decision-making can easily be improved upon by a human.

There are also guesses based on data outside lending history. For example, just suppose there are some books about NAFTA. Pretty boring and dull and over 20 years old at this point, but if trade wars light up in the next few years as the Orange King takes the throne, those books might suddenly seem relevant again.

Even if you feed the computer lending history of the last 50 years, that possibility would probably not occur to it but it very well might to even a semi-skilled librarian just mulling over a kill list of unpopular books.

My local library at the corner of my street purged my books years ago. But sometimes when I'm traveling, I'll pop into a random library in whatever part of the world, and there they'll be, waiting on the shelves, all comfortable and homey, and -- I always check -- usually dog-eared and well-read.

The obvious question is what movies were checked out. Chuck Finley must have checked out the Evil Dead a time or two hundred.

Chuck Finley?

Obviously these librarians are a bunch of lefties.

Imagine that history judged 2016 based entirely on clicks. You'd get mostly clickbait, no?

Just think how many stories there are of someone digging through reams of seldom used materials, and then discovering something amazing, leading to some scientific discovery or major artistic work.

I say trust the curators over the technocrats. There should be explicit room for library staff to intervene in the process.

The manipulation of library check-out data is a sub-plot in "Andromeda Klein" by Frank Portman.

"Data falsification will be one of the biggest stories of the next five years."

Hasn't it already been a big story for the past fifty years or more? People have been committing money laundering, theft and embezzlement using falsified records long before "big data" became a trendy topic. In the world of books, it's been well-known for a while that some authors inflate their book sales by having institutions under their control buy their books in huge volumes. If anything, more sophisticated algorithms and forensic analysis of various sorts will, in the next five years, make it easier to catch fraudulent schemes to manipulate data.

Assuming that's what the system wants. A church or political group that buys a particular book en masse in order to force it onto the best seller list may or may not be committing data falsification, depending upon what you're looking for. If you're a publisher, it isn't. You want book sales and it doesn't really matter to you if the people who buy the book read it or not. A hundred books bought and stashed in a closet is ten times better than ten people who buy the book and read it in a 20 hour binge.

If you're a cultural critic or someone trying to get a feel for what is on people's mind, though, it isn't good. 10 million purchases of "L Ron Hubbard was the greatest guy to ever live" doesn't mean Scientologists are the Tea Party of the latter half of the second decade of the 21st century, it just means the cult is burning off some of their tax-exempt cash.

In other words, data falsification implies the data was once good but is now ruined by someone gaming it. But it wasn't. In the past a cultural critic might try to gauge what types of books people were reading by looking at what people were reading in cafes, seeing what people were writing about, maybe talking to bookstore owners, book club members etc. Deciding you will ditch that for just reading the best seller list assumes that data is a 'clean picture' of what people are reading. That assumption may be at fault rather than 'data falsification' done by 'bad people' who ruin what was once a perfectly good data set.

On the other hand, consider Amazon book reviews when the site first launched versus later on. The initial reviews were mostly from readers and could be counted on to some degree, but later on marketers figured out that pushing 'fake reviews' is a good strategy. What was once a valid data set for one purpose then becomes ruined by those trying to game it.

This is a subplot in Connie Willis's excellent "Bellwether."
