Books for bots Goodhart’s Law edition?

Posted on January 6, 2017 at 11:49 am in Books, Web/Tech

Two employees at the East Lake County Library created a fictional patron called Chuck Finley — entering fake driver’s license and address details into the library system — and then used the account to check out 2,361 books over nine months in 2016, in order to trick the system into believing that the books they loved were being circulated to the library’s patrons, thus rescuing the books from automated purges of low-popularity titles.

Library branch supervisor George Dore was suspended for his role in the episode; he said that he was trying to game the algorithm because he knew that these books would come back into vogue and that his library would have to spend extra money re-purchasing them later. He said that other libraries were doing the same thing.

Data falsification will be one of the biggest stories of the next five years.  That is from BoingBoing, via Ted Gioia.

1 Kman January 6, 2017 at 11:52 am

Isn’t it already one of the biggest stories in academia?

2 derek January 6, 2017 at 12:02 pm

It isn’t data falsification that is the story; it is the blitheringly stupid software systems. The people working around the stupidity get fired.

The only solution is to have it work as designed, and document everything. This guy should have done up a list and gotten confirmation signed by his manager.

Or to simply ignore everything. They can’t fire us all.

By the way, this is endemic, and I would suggest it is the cause of the flat productivity of recent years.

3 Troll me January 6, 2017 at 8:37 pm

I think you’d have bad employee retention if long-term employees at a library weren’t able to vouch for the value of at least some handful of seldom used titles.

Maybe they should raise the threshold slightly and then make it some issue of seniority to do something like what you suggest. First year you get to save two (if you want), and it increases by one title every two years, with additional perks accompanying promotions.

4 EverExtruder January 6, 2017 at 12:03 pm

The next five years? I think this has already been happening for quite some time. And on a more massive scale than is known.

5 RichBerger January 6, 2017 at 12:04 pm

He should be applying for a job at the WaPo or NYT.

6 msgkings January 6, 2017 at 12:20 pm

Or Wells Fargo

7 Rich Berger January 6, 2017 at 1:23 pm

He doesn’t want to go to a place where he might get caught.

8 chuck martel January 6, 2017 at 12:05 pm

The newest libraries look like airport terminals and have more “customers” for computer internet services than books. In big cities libraries have become daycare facilities for the unemployed, who spend their afternoons playing games or watching cat videos. Since libraries have been, historically, places for the storage of books, why should lack of circulation lead to them being discarded? The librarians should be commended for data falsification in this case. Also, why are libraries so much alike?

9 JWatts January 6, 2017 at 12:12 pm

“Since libraries have been, historically, places for the storage of books, why should lack of circulation lead to them being discarded? ”

This has always been the case. Libraries have a limited amount of storage space and they routinely buy new books. Which means that after a certain period of time (when the shelves hit maximum capacity) they routinely purge the same number of books as they buy every period.

This was a case of two Librarians deciding to choose to preference books they liked and let other books they didn’t like be discarded.

10 Mark Thorson January 6, 2017 at 1:07 pm

I wonder if Fahrenheit 451 made the cut.

11 Urso January 6, 2017 at 2:40 pm

“This was a case of two Librarians deciding to choose to preference books they liked and let other books they didn’t like be discarded.”

Hasn’t this been the responsibility of librarians since the dawn of time? This is not a story of data falsification. This is a story of self-styled technocrats who incorrectly believe that rote algorithms (here, measuring the frequency of circulation) are the be-all and end-all of human society, and the brave individuals smart enough to realize that the technocrats are about 1/10th as clever as they think they are. Vox delenda est.

12 JWatts January 6, 2017 at 4:11 pm

Whether you think it was good or bad, it was still clearly data falsification. They misled everyone else by falsifying the data regarding circulation numbers for certain books.

13 Urso January 6, 2017 at 4:13 pm

Sure it’s data falsification, but the bigger question is why they were in a position where they felt they had to falsify data?

14 mkt42 January 6, 2017 at 4:28 pm

Sure it’s data falsification but that’s a secondary issue, compared to the question of trusting an algorithm over human judgement.

Data falsification (or preventing it) is merely a means to an end; what truly matters is not data falsification but rather the accuracy of the results, the efficiency of the library’s operations, etc.

What we don’t know is whether the librarians’ judgements will be better than the algorithm’s. From the article we don’t know how well-tested the algorithm is, how robust and how relevant the training data that it used were, and how accurate the algorithm is under various scenarios. And also how wise the librarians are.

15 Yancey Ward January 6, 2017 at 12:12 pm

I doubt the motivation was to save these so-called rarely read books. I think the books were chosen so that the scam wouldn’t be revealed when actual people tried to check out a Finley book. If I had to guess, the library manager was probably trying to show higher numbers of patrons utilizing the book inventory. When caught, he decided he needed a story that made him seem more noble.

16 JWatts January 6, 2017 at 12:16 pm

I don’t think that’s the case, because they only created one patron. However, it’s possible that the number of books was significant to their actual circulation numbers if they were a small branch library.

17 Yancey Ward January 6, 2017 at 12:34 pm

How do you know it was the only one? “Chuck Finley” might have eventually caught someone’s eye because it was a pseudonym used extensively by Bruce Campbell’s character on Burn Notice.

18 Donald Pretari January 6, 2017 at 12:49 pm

Thanks for catching that. I think we have a plot line for My Name is Still Bruce.

19 Yancey Ward January 6, 2017 at 12:39 pm

In addition, it was almost surely easier to create a small number of super voracious readers than it was to create a large number of normal readers. My suspicion is that the metric being targeted was number of books checked out over a given period. This probably plays into budgeting and staffing decisions made at the town/county level.

20 JWatts January 6, 2017 at 1:44 pm

“How do you know it was the only one?”

There might have been a few more, but the sheer volume of checkouts means that there was probably only one. Or only one significant fake ID.

“My suspicion is that the metric being targeted was number of books checked out over a given period.”

Agreed.

21 prior_test2 January 6, 2017 at 12:14 pm

‘Data falsification will be one of the biggest stories of the next five years. ‘

In our dawning post-truth world, this will not matter. After all, any data we don’t like is clearly falsified – just ask any climate change denialist. Or young earth creationist. Or Infowars reader.

22 Chuck Finley January 6, 2017 at 12:15 pm

I am not a bot.

23 Skynet January 6, 2017 at 12:22 pm

Um, yeah, neither am I! Chuck let’s meet up for human things like eating and, like, cuddling? Sleeping? You know, stuff we humans do all the time.

24 Post-Truth Politics January 6, 2017 at 12:24 pm

Chuck and Skynet and everyone else here, let’s all try to take our pulses, so we can be absolutely sure whether we are bots or not. We must find out.

25 RustySynapses January 7, 2017 at 7:21 am

Ok, Chuck and Skynet – it’s your birthday, and someone gives you a calfskin wallet. How do you react?

26 Skynet January 7, 2017 at 1:47 am

This fake Skynet is fooling no one.

27 Hazel Meade January 6, 2017 at 12:16 pm

I’m curious to know which books they were.
What books do librarians like that library patrons do not?

28 Aretino January 6, 2017 at 1:04 pm

Probably seldom read classics.

BTW, I wonder if libraries have an established canon of great or significant books they keep around even though most of them are seldom read.

29 Post-Truth Politics January 6, 2017 at 12:23 pm

If there’s anyone I can forgive for falsifying data, it’s a librarian trying to save good books from being discarded when they are temporarily out of favor.

I find it abhorrent when everything is a popularity contest, and a short term one at that.

I am frequently astounded at discovering classic books that my local library no longer carries.

30 kevin January 6, 2017 at 12:37 pm

Books become public domain after 70 years, so you can download an ebook of the “classics” for free or buy a copy ridiculously cheap. For this reason, I don’t find it strange that libraries aren’t carrying these books.

31 Anderse January 6, 2017 at 1:04 pm

70 years after the death of the author, which means it’s more like >100 years.

32 Troll me January 6, 2017 at 9:03 pm

Depends where you live. Some places only protect American authors (or those who obtained rights) for 50 years. Some not at all.

33 Li Zhi January 6, 2017 at 12:25 pm

This is old news. Orlando Sentinel 12/30/2016. Librarians claimed they knew better than algorithm. I’d bet librarians were wrong – although I’m assuming some competence by people who chose the software, which may be overly naive.

34 chuck martel January 6, 2017 at 12:42 pm

Probably, at some point, all books that have been in print will have been committed to an accessible electronic record and this issue will disappear.

35 Ivo January 6, 2017 at 2:05 pm

Hazel Meade asks the question pertinent to determine whether the proposed defense makes sense. A discussion on morality could follow. Everyone expounding on some theory about ‘what was really going on’ is only signalling, instead of caring about the facts of the matter.

36 Boonton January 6, 2017 at 2:28 pm

“Data falsification ” – true you are inserting fake data into the pool of big data

But what about “automated purges of low-popularity titles.” We need a term for that. Not the automated purge but this implicit assumption that an algorithm is by default the ‘correct’ answer to the question of which titles should be purged and which shouldn’t.

Why shouldn’t the algorithm be assumed to be ‘under construction’ as it generates suggestions for purges until it is agreed that it is suitable? I think, for example, an algorithm should also look back over the last ten years for requests for books that were 5+ years old in order to try to stop throwing away books whose popularity cycles every few years.

Even better, I think, is Kasparov’s idea that the best system combines man with computer. Let the algorithm generate suggestions but let the librarians be able to swap a book out of the kill list if they manually choose another in its place.
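
The swap rule described above could be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `swap_on_kill_list`, its arguments, and the sample titles are all hypothetical, and nothing here reflects the software the library actually uses.

```python
def swap_on_kill_list(kill_list, rescue, substitute, shelf):
    """Return a new kill list with `rescue` replaced by `substitute`.

    Hypothetical helper: the algorithm proposes `kill_list`, and a librarian
    may save one title only by naming another title from the shelf instead,
    so the number of books purged stays the same.
    """
    if rescue not in kill_list:
        raise ValueError(f"{rescue!r} is not on the kill list")
    if substitute not in shelf:
        raise ValueError(f"{substitute!r} is not on the shelf")
    if substitute in kill_list:
        raise ValueError(f"{substitute!r} is already on the kill list")
    return [substitute if title == rescue else title for title in kill_list]


# Made-up shelf and algorithmic suggestion, for illustration only.
shelf = ["Bestseller", "Cookbook", "Jane Eyre", "Local History"]
suggested = ["Jane Eyre", "Local History"]

revised = swap_on_kill_list(suggested, "Jane Eyre", "Cookbook", shelf)
print(revised)
```

Because every rescue has to be paid for with a substitute, the override is recorded openly and the shelf space freed is unchanged, which is the difference between this and inventing a patron.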

37 Lord Action January 6, 2017 at 3:11 pm

I’m under the impression this is old news and that humans no longer add value to computers in advanced chess. I.e., having a human “help” the computer just hurts the quality of play. The rule of thumb in AI news is that if there’s ever a feel-good component to the story where people come out on top, that part is false or soon to become quaint and obsolete…

In related news, though, a newer version of AlphaGo secretly played 60 matches against the world’s top players with no losses and no draws (at least until one match ended in a connection timeout). So we have that going for us.

http://www.wsj.com/articles/ai-program-vanquishes-human-players-of-go-in-china-1483601561

38 mkt42 January 6, 2017 at 4:20 pm

I can believe that human judgement won’t help the top computers when they’re playing chess. I’m willing to bet that human judgement would improve the decisions made by the algorithm being used by the East Lake County Library.

We could even sic IBM, Deep Blue, and Watson on the problem and maybe they could come up with an algorithm that beats the humans.

But only if they have a big pool of relevant training data! If the East Lake County Library’s data only go back several years, IBM and Watson won’t be able to do squat.

39 Lord Action January 6, 2017 at 4:30 pm

I suspect we’re overthinking what was likely an attempt to make the library seem more used, so that these guys could keep their easy jobs.

40 Bernard Guerrero January 6, 2017 at 7:08 pm

Training data relevant to what target? I suspect there is not much more to the algorithm in question than looking at aggregated check-outs over some window X and then grabbing the bottom Y% to be disposed of. The only possible addition the librarians could then make would be to extend X because they know of some longer cycle of popularity. And that sounds like a political question. (i.e. “No, Mr. Smith, the school board does not want to spend more shelf space on the Brontës.”)

Of possible interest: http://scholarworks.sjsu.edu/cgi/viewcontent.cgi?article=7825&context=etd_theses
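
The rule guessed at in this comment (count check-outs over a window X, flag the bottom Y%) might look roughly like the Python below. It is a sketch of that guess, not the library's actual system: `purge_candidates`, its parameters, the tie-breaking rule, and the sample lending log are all assumptions made up for illustration.

```python
from datetime import date, timedelta
import math


def purge_candidates(titles, checkouts, today, window_days, bottom_fraction):
    """Return the least-circulated titles over a trailing window.

    titles          -- titles currently on the shelf
    checkouts       -- (title, checkout_date) pairs from the lending log
    window_days     -- the window X, in days
    bottom_fraction -- the share Y of titles to flag, e.g. 0.5 for 50%
    """
    cutoff = today - timedelta(days=window_days)
    counts = {title: 0 for title in titles}
    for title, when in checkouts:
        if title in counts and cutoff <= when <= today:
            counts[title] += 1
    # Fewest check-outs first; ties broken alphabetically (an assumption).
    ranked = sorted(counts, key=lambda t: (counts[t], t))
    return ranked[: math.floor(len(ranked) * bottom_fraction)]


# Made-up data, for illustration only.
today = date(2017, 1, 6)
shelf = ["Bestseller", "Cookbook", "Jane Eyre", "Local History"]
log = [
    ("Bestseller", date(2016, 11, 1)),
    ("Bestseller", date(2016, 12, 1)),
    ("Cookbook", date(2016, 6, 15)),
    ("Jane Eyre", date(2012, 3, 10)),  # last borrowed almost five years ago
]

one_year = purge_candidates(shelf, log, today, 365, 0.5)
ten_years = purge_candidates(shelf, log, today, 3650, 0.5)

# Two check-outs by an invented patron are enough to move a title off the list.
fake = [("Jane Eyre", date(2016, 9, 1))] * 2
gamed = purge_candidates(shelf, log + fake, today, 365, 0.5)

print(one_year, ten_years, gamed)
```

In this toy data, lengthening X and adding fake check-outs both take "Jane Eyre" off the list, and in both cases another title is flagged in its place, since Y fixes how many books go.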

41 mkt42 January 6, 2017 at 8:08 pm

According to the article, that is exactly the additional information that the librarians are bringing to the decision: just because a book is currently at the bottom of the popularity list doesn’t mean it will stay there. A librarian with 30 years of experience is going to have a lot more information than a computer with data that go back say 5 years. (The article doesn’t say how far back the training data go, nor how far back the librarians go. But good quality databases usually do not have many years of good data in them, especially compared to a 60-year old professional’s experience.)

I don’t see where politics comes in. The article doesn’t say that the librarians think that the library needs to hold on to more books; they think it should hold on to the right books. Their actions would not result in any more shelf space being needed.

And the algorithm might very well be using the simplistic decision-making that you describe, which unlike say Deep Blue’s decision-making can easily be improved upon by a human.

42 Boonton January 7, 2017 at 9:10 am

There are also guesses based on data outside lending history. For example, just suppose there are some books about NAFTA. Pretty boring and dull and over 20 years old at this point, but if trade wars light up in the next few years as the Orange King takes the throne those books might suddenly seem relevant again.

Even if you feed the computer lending history of the last 50 years, that possibility would probably not occur to it but it very well might to even a semi-skilled librarian just mulling over a kill list of unpopular books.

43 Faze January 6, 2017 at 7:15 pm

My local library at the corner of my street purged my books years ago. But sometimes when I’m traveling, I’ll pop into a random library in whatever part of the world, and there they’ll be, waiting on the shelves, all comfortable and homey, and — I always check — usually dog-eared and well-read.

44 NPW January 6, 2017 at 8:40 pm

The obvious question is what movies were checked out. Chuck Finley must have checked out The Evil Dead a time or two hundred.

45 MC January 6, 2017 at 8:42 pm

Chuck Finley?

Obviously these librarians are a bunch of lefties.

46 Troll me January 6, 2017 at 9:13 pm

Imagine that history judged 2016 based entirely on clicks. You’d get mostly clickbait, no?

Just think how many stories there are of someone digging through reams of seldom used materials, and then discovering something amazing, leading to some scientific discovery or major artistic work.

I think we should trust the curators over the technocrats. There should be explicit room for library staff to intervene in the process.

47 JohnBinNH January 6, 2017 at 9:47 pm

The manipulation of library check-out data is a sub-plot in “Andromeda Klein” by Frank Portman.

48 Ricardo January 7, 2017 at 2:11 am

“Data falsification will be one of the biggest stories of the next five years.”

Hasn’t it already been a big story for the past fifty years or more? People have been committing money laundering, theft and embezzlement using falsified records long before “big data” became a trendy topic. In the world of books, it’s been well-known for a while that some authors inflate their book sales by having institutions under their control buy their books in huge volumes. If anything, more sophisticated algorithms and forensic analysis of various sorts will, in the next five years, make it easier to catch fraudulent schemes to manipulate data.

49 Boonton January 7, 2017 at 9:15 am

Assuming that’s what the system wants. A church or political group that buys a particular book en masse in order to force it onto the best seller list may or may not be committing data falsification, depending upon what you’re looking for. If you’re a publisher it isn’t. You want book sales and it doesn’t really matter to you if the people who buy the book read it or not. A hundred books bought and stashed in a closet is ten times better than ten people who buy the book and read it in a 20 hour binge.

If you’re a cultural critic or someone trying to get a feel for what is on people’s minds, though, it isn’t good. 10 million purchases of “L Ron Hubbard was the greatest guy to ever live” doesn’t mean Scientologists are the Tea Party of the latter half of the second decade of the 21st century, it just means the cult is burning off some of its tax-exempt cash.

50 Boonton January 7, 2017 at 9:28 am

In other words, data falsification implies the data was once good but is now ruined by someone gaming it. But it wasn’t. In the past a cultural critic might try to gauge what types of books people were reading by looking at what people were reading in cafes, seeing what people were writing about, maybe talking to bookstore owners, bookclub members etc. Deciding you will ditch that for just reading the best seller list assumes that data is a ‘clean picture’ of what people are reading. That assumption may be at fault rather than ‘data falsification’ done by ‘bad people’ who ruin what was once a perfectly good data set.

On the other hand, consider Amazon book reviews when the site first launched versus later on. The initial reviews were mostly from readers and could be counted on to some degree, but later on marketers figured out that pushing ‘fake reviews’ is a good strategy. What was once a valid data set for one purpose then becomes ruined by those trying to game it.

51 Zach January 9, 2017 at 7:42 pm

This is a subplot in Connie Willis’s excellent “Bellwether.”
