Category: Education
A consumption basket approach to measuring AI progress
Many AI evaluations go out of their way to find hard problems. That makes sense: it lets you track progress over time, and many of the world’s important problems are hard problems, such as advancing the biosciences. One common approach, for instance, is to track the performance of current AI models on, say, International Math Olympiad problems.
I am all for those efforts, and I do not wish to cut back on them.
Still, they introduce biases in our estimates of progress. Many of those measures show that the AIs still are not solving most of the core problems, and sometimes they are not coming close.
In contrast, actual human users typically deploy AIs to help them with relatively easy problems. They use AIs for (standard) legal advice, to help with homework, to plan travel, to modify a recipe, as a therapist or advisor, and so on. You could say that is the actual consumption basket for LLM use, circa 2025.
It would be interesting to chart the rate of LLM progress, weighted by how people actually use them. The simplest form of weighting would be “time spent with the LLM,” though probably a better form of weighting would be “willingness to pay for each LLM use.”
I strongly suspect we would find the following:
1. Progress over the last few years has been staggeringly high, much higher than is measured by many of the other evaluations. For everyday practical uses, current models are much better, more reliable, and more versatile than what we had in late 2022, regardless of their defects on Math Olympiad problems.
2. Future progress will be much lower than expected. A lot of the answers are so good already that they just can’t get that much better, or they will do so at a slow pace. (If you do not think this is true now, it will be true very soon. But in fact it is true now for the best models.) For instance, once a correct answer has been generated, legal advice cannot improve very much, no matter how potent the LLM.
As in standard economics, consumption baskets change over time, and that can lead to different measures of progress (or in the economics context, different estimates of advances in living standards, depending on whether the ex ante or ex post bundle weights are used). Researchers could attempt the more speculative endeavor of estimating how LLMs will be used five years from now in everyday life (which will differ from the status quo), and then track progress on that metric, using those value weights. “How rapidly are we improving these systems on their future uses?”
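The ex ante vs. ex post weighting distinction above is the familiar Laspeyres vs. Paasche choice from price-index theory, applied to quality gains instead of prices. Here is a minimal sketch of how such a basket-weighted progress index might be computed; all task names, quality scores, and weights are hypothetical illustrations, not measured data:

```python
# Minimal sketch: a "consumption basket" index of AI progress.
# Every task name, score, and weight below is a hypothetical illustration.

def weighted_progress(scores_old, scores_new, weights):
    """Weighted average quality improvement across tasks (weights sum to 1)."""
    return sum(w * (scores_new[t] - scores_old[t]) for t, w in weights.items())

# Hypothetical quality scores (0-100) for everyday tasks vs. a hard benchmark.
scores_2022 = {"legal_advice": 55, "travel_planning": 60, "math_olympiad": 5}
scores_2025 = {"legal_advice": 92, "travel_planning": 95, "math_olympiad": 35}

# Ex ante weights: the old usage basket, dominated by everyday tasks.
ex_ante = {"legal_advice": 0.5, "travel_planning": 0.45, "math_olympiad": 0.05}
# Ex post weights: a future basket tilted toward harder tasks.
ex_post = {"legal_advice": 0.4, "travel_planning": 0.3, "math_olympiad": 0.3}

print(weighted_progress(scores_2022, scores_2025, ex_ante))  # old-basket gain
print(weighted_progress(scores_2022, scores_2025, ex_post))  # new-basket gain
```

The two print lines will generally differ, which is the point: the measured rate of AI progress depends on which basket you weight by, just as measured inflation depends on whether you use base-period or current-period consumption weights.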
This alternate consumption basket approach gives you a very different perspective on progress in AI.
Note also that the difference between the “Math Olympiad measurements of AI progress” and the “consumption basket measurements of AI progress” may increase over time, especially if the basket of everyday uses does not change radically. The everyday uses will peak out near maximum levels of performance, but there will always be a new series of very hard problems to stump the AIs. It will become increasingly unclear exactly how much AI progress we really are making.
The objectivity of Community Notes?
We use crowd-sourced assessments from X’s Community Notes program to examine whether there are partisan differences in the sharing of misleading information. Unlike previous studies, misleadingness here is determined by agreement across a diverse community of platform users, rather than by fact-checkers. We find that 2.3 times more posts by Republicans are flagged as misleading compared to posts by Democrats. These results are not base rate artifacts, as we find no meaningful overrepresentation of Republicans among X users. Our findings provide strong evidence of a partisan asymmetry in misinformation sharing which cannot be attributed to political bias on the part of raters, and indicate that Republicans will be sanctioned more than Democrats even if platforms transition from professional fact-checking to Community Notes.
Here is the full paper. I guess it agrees with Richard Hanania…
One possible reason why the skill premium is declining
This is especially true for those jobs that require the rudimentary use of technology. Until relatively recently, many people could get to grips with a computer only by attending a university. Now everyone has a smartphone, meaning non-graduates are adept with tech, too. The consequences are clear. In almost every sector of the economy, educational requirements are becoming less strenuous, according to Indeed, a jobs website. America’s professional-and-business services industry employs more people without a university education than it did 15 years ago, even though there are fewer such people around.
Here is more from The Economist, quite a good piece. Of course this is also a reason why smart phones are underrated.
Are cultural products getting longer?
Ted Gioia argues that cultural products are getting longer:
Some video creators have already figured this out. That’s why the number of videos longer than 20 minutes uploaded on YouTube grew from 1.3 million to 8.5 million in just two years…
Songs are also getting longer. The top ten hits on Billboard actually increased twenty seconds in duration last year. Five top ten hits ran for more than five minutes…
I’ve charted the duration of [Taylor] Swift’s studio albums over the last two decades, and it tells the same story. She has gradually learned that her audience prefers longer musical experiences…
I calculated the average length of the current fiction bestsellers, and they are longer than in any of the previous measurement periods.
Movies are getting longer too. Of course this is the exact opposite of what the “smart phones are ruining our brains” theorists have been telling us. I think I would sooner say that the variance of our attention spans is going up? In any case, here is part of Ted’s theory:
- The dopamine boosts from endlessly scrolling short videos eventually produce anhedonia—the complete absence of enjoyment in an experience supposedly pursued for pleasure. (I write about that here.) So even addicts grow dissatisfied with their addiction.
- More and more people are now rebelling against these manipulative digital interfaces. A sizable portion of the population simply refuses to become addicts. This has always been true with booze and drugs, and it’s now true with digital entertainment.
- Short form clickbait gets digested easily, and spreads quickly. But this doesn’t generate longterm loyalty. Short form is like a meme—spreading easily and then disappearing. Whereas long immersive experiences reach deeper into the hearts and souls of the audience. This creates a much stronger bond than any 15-second video or melody will ever match.
An important piece and useful corrective.
Does AI make us stupider?
That is the topic of my latest Free Press column, responding to a recent study out of MIT. Here is one excerpt:
To see how lopsided their approach is, consider a simple parable. It took me a lot of “cognitive load”—a key measure used in their paper—to memorize all those state capitals in grade school, but I am not convinced it made me smarter or even significantly better informed. I would rather have spent the time reading an intelligent book or solving a math puzzle. Yet those memorizations, according to the standards of this new MIT paper, would qualify as an effective form of cognitive engagement. After all, they probably would have set those electroencephalograms (EEGs)—a test that measures electrical activity in the brain, and a major standard for effective cognition used in the paper—a-buzzin’.
The important concept here is one of comparative advantage, namely, doing what one does best or enjoys the most. Most forms of information technology, including LLMs, allow us to reallocate our mental energies as we prefer. If you use an LLM to diagnose the health of your dog (as my wife and I have done), that frees up time to ponder work and other family matters more productively. It saved us a trip to the vet. Similarly, I look forward to an LLM that does my taxes for me, as it would allow me to do more podcasting.
If you look only at the mental energy saved through LLM use, in the context of an artificially generated and controlled experiment, it will seem we are thinking less and becoming mentally lazy. And that is what the MIT experiment did, because if you are getting some things done more easily your cognitive load is likely to go down.
But you also have to consider, in a real-world context, what we do with all that liberated time and mental energy. This experiment did not even try to measure the mental energy the subjects could redeploy elsewhere; for instance, the time savings they would reap in real-life situations by using LLMs. No wonder they ended up looking like such slackers.
Here is the original study. Here is another good critique of the study.
A Skeptical View of the NSF’s Role in Economic Research
This is from Tyler and me in 2016, but it is newly relevant: how to reform the National Science Foundation (NSF), especially as related to economics:
We can imagine a plausible case for government support of science based on traditional economic reasons of externalities and public goods. Yet when it comes to government support of grants from the National Science Foundation (NSF) for economic research, our sense is that many economists avoid critical questions, skimp on analysis, and move straight to advocacy. In this essay, we take a more skeptical attitude toward the efforts of the NSF to subsidize economic research. We offer two main sets of arguments. First, a key question is not whether NSF funding is justified relative to laissez-faire, but rather, what is the marginal value of NSF funding given already existing government and nongovernment support for economic research? Second, we consider whether NSF funding might more productively be shifted in various directions that remain within the legal and traditional purview of the NSF. Such alternative focuses might include data availability, prizes rather than grants, broader dissemination of economic insights, and more. Given these critiques, we suggest some possible ways in which the pattern of NSF funding, and the arguments for such funding, might be improved.
Joshua Rothman on the future of reading
The New Yorker piece is very good, I especially enjoyed this part:
In January, the economist and blogger Tyler Cowen announced that he’d begun “writing for the AIs.” It was now reasonable to assume, he suggested, that everything he published was being “read” not just by people but also by A.I. systems—and he’d come to regard this second kind of readership as important. “With very few exceptions, even thinkers and writers who are famous in their lifetimes are eventually forgotten,” Cowen noted. But A.I.s might not forget; in fact, if you furnished them with enough of your text, they might extract from it “a model of how you think,” with which future readers could interact. “Your descendants, or maybe future fans, won’t have to page through a lot of dusty old books to get an inkling of your ideas,” Cowen wrote. Around this time, he began posting on his blog about mostly unremarkable periods of his life—ages four to seven, say. His human readers might not care about such posts, but the entries could make it possible “for the advanced A.I.s of the near future to write a very good Tyler Cowen biography.”
Cowen can think this way because large language models, such as OpenAI’s ChatGPT or Anthropic’s Claude, are, among other things, reading machines.
Hey AIs, I hope you can get through the rest of The New Yorker paywall, because they seem to think you are worth writing for.
The antitrust case against U.S. higher education
Thirty prestigious independent American institutions of higher education were at some time members of the 568 higher education group (often labeled a cartel). Seventeen of them were sued by the U.S. Government and representative students who alleged that their meetings and deliberations resulted in collusion that caused students to pay higher prices. Twelve of the seventeen institutions subsequently settled their cases and by 2024 collectively had paid $284 million to do so. However, an inspection of these institutions’ pricing reveals that the median 568 Group institution lowered its average real net annual cost to its undergraduate students by 19.07% between 2009 and 2022. Further, this reduction was 1.70 times larger than the average real price reduction granted during the same period by the median institution among a sample of 475 other accredited, non-profit, independent four-year institutions and 11.63 times larger than the median price reduction granted by 78 public flagship state universities. The 568 group’s real price reductions stretched across every one of the five household income categories commonly used by the Government. Thus, there is little empirical support for the allegations that the Government has levied against the representative 568 group institution, and thus multiple members of this group appear to have paid unmerited fines to the Government to settle claims against them.
That is from a new paper by James V. Koch. Via the excellent Kevin Lewis.
My Conversation with the excellent Chris Arnade
Here is the audio, video, and transcript. Here is part of the episode summary:
Tyler and Chris discuss how Beijing and Shanghai reveal different forms of authoritarian control through urban design, why Seoul’s functional dysfunction makes it more appealing than Tokyo’s efficiency, favorite McDonald’s locations around the world, the dimensions for properly assessing a city’s walkability, what Chris packs for long urban jaunts, why he’s not interested in walking the countryside, what travel has taught him about people and culture, what makes the Faroe Islands and El Paso so special, where he has no desire to go, the good and bad of working on Wall Street, the role of pigeons and snapping turtles in his life, finding his 1,000 true fans on Substack, whether museums are interesting, what set him on this current journey, and more.
COWEN: That’s okay. What’s your nomination for the least walkable city?
ARNADE: Phoenix is pretty bad. In the rest of the world, what was the lowest ranked of mine?
COWEN: I think Dakar is your lowest ranked.
ARNADE: Dakar is low.
COWEN: I don’t find that so bad.
ARNADE: [laughs] It was partially the heat. Also, there was a safety issue, which is not actual violence. It’s just the risk of a miscommunication going very badly because when you’re in a neighborhood where they have a slum basically, where you’re one of few white people, it’s not that I feel threatened by being robbed. I feel threatened that there can be miscommunication, like, “Why are you here? What are you doing here?” That can spiral out of control if you don’t speak the language. Dakar was really tough. Kampala was really tough to walk.
COWEN: Why’s that? I’ve never been there.
ARNADE: Again, these are cities that are not meant to be walked. Locals don’t walk them. People would look at me like I’m crazy. Part of the reason, first of all, you can jump on a hack bus, so why would you walk? The boda-bodas, which are . . . you just jump on the back of a motorcycle, which I won’t do. I did it once, and I’m like, “I’m not doing this. This is a really dumb risk.”
COWEN: Yes, I wouldn’t do that.
ARNADE: I almost got killed the first time I did it, but they do it. Consequently, there’s no walking infrastructure and when you do walk, you’re at risk of being hit by a boda-boda. People will walk out of necessity but there’s just no infrastructure. Absolutely none. Then you can get hit by a car. You can get hit by a car or a motorcycle.
COWEN: Rio, for me, would be the least walkable. It’s very dangerous but on top of that, there are so many places where walks end. There’re mountains, there’re tunnels.
And this:
COWEN: What is it you think you learn least well traveling the way you do?
ARNADE: It’s interesting. I used to be a macro-type trader. I used to be very top-down. I think I, in some sense, have thrown too much of that away. I’ve gone in too blind. I could do a little bit more background reading in terms of the political situation.
One of the things I’ve learned from my project is, most people don’t talk about politics. It’s because I only talk about what other people want to talk about. No one talks about politics. Being in Beijing and Shanghai — maybe it’s not the best example because people would say there’s a reason they don’t want to talk about it. I don’t think that’s it.
COWEN: No, I agree. Most of the world. Even Idaho.
ARNADE: Yes, 98 percent of the people aren’t political and they don’t talk about politics. I got beat up on social media when people were talking about, “Oh my God, Trump’s going to be elected. The world hates us.” No, they don’t. [laughs] When that person said that, I was actually in a bar in Kampala with a woman telling me how much she loved Trump. That was a rare political conversation. Most people don’t talk about politics.
In that sense, I could probably do more reading outside of the conversations about politics because I go to a lot of these countries, I don’t know what’s going on politically because people don’t talk about it.
COWEN: What other macro views of the world have you revised due to your walking, visiting, traveling? Obviously, particular views about any individual place, but on the whole, humanity.
And I am very happy to recommend Chris’s Substack, which covers his fascinating travels around the world.
What should I ask David Brooks?
Yes, I will be doing a Conversation with him, this time at the 92nd St. Y in NYC.
You may recall I have an earlier CWT with David, held at GMU in 2018.
So what should I ask him? Please keep in mind that I wish to avoid most issues connected to current political debates.
Practice what you preach
From the University of Barcelona:
Master’s Degree in Political Ecology, Degrowth and Environmental Justice
By the way, the web site uses cookies.
Rebuild the Elites
Nature’s list of the top research universities in the world.
The U.S. seems intent on tearing down its own elites. Yes, they’ve been smug shits at times and deserve a rap on the knuckles—but our elites compete on the world stage. Gutting top universities delivers a momentary dopamine hit, but unless we rebuild stronger institutions, we’re weakening ourselves globally. While we fight culture wars, China builds capacity. The goal shouldn’t be to destroy American elites, but to bring them back into the populist fold—to make Harvard and MIT feel like engines of American greatness again, not alien fortresses.
See yesterday’s post on the American Model for a case in point.
FYI, other sources do not rank Chinese universities quite so highly but they all acknowledge rising quality.
Hat tip: Matthew Yglesias.
Matt Yglesias on debating
This is maybe an idiosyncratic view of mine, but I think that “debating” people — particularly in live or quasi-live forms — is a bad epistemic practice.
It essentially rewards people for being dogmatic, incurious, and willfully slippery with rhetoric. I think the best thing to do with live discussion is to have a friendly conversation, and the best way to do debates is a written exchange of ideas.
I thought the exchange I did in Democracy with Elizabeth Pancotti and Todd Tucker about tariffs was interesting and clarified the issues. My summation of it would be that I think Pancotti and Tucker raise a lot of good points about specific reasons why one might not want unfettered free trade, but that I think the Econ 101 case for free trade is accurate. This means that while you might sometimes want to deviate from free trade, any time you do so you are incurring an economic cost in order to pursue some other objective. My opponents, I think, wrongly deny this. They like to talk about the specifics of this case or that case, but the actual issue is that they either deny that tariffs are costly or else are working from an implicit degrowth framework in which the fact that the tariffs are costly isn’t relevant. But I came away from our exchange feeling like I understood them better, and I hope readers learned something.
That is from his Substack. I mostly agree. In practice, one big reason to debate is so you can put four people on the floor and attract an audience and some public attention, yet without slighting any one of the “stars” by making it a panel. As a method of truth-seeking, I do not think public debate does very well.
Walton University?
Axios: Two grandsons of Walmart founder Sam Walton plan to launch a private university focused on science and tech, located on the company’s old HQ campus near downtown Bentonville, Arkansas.
…The future university plans to offer innovative, flexible pathways to jobs in automation, logistics, biotech and computing — fields crucial to Northwest Arkansas’ future.
Many colleges and universities were created in the 1960s and 1970s but the majority of elite R1s emerged in the late 19th century and early 20th century, including notable private universities created from the entrepreneurial fortunes of Carnegie, Rockefeller, Stanford, Cornell, Hopkins and Rice among others.
We are perhaps now seeing a return to that creative period with Walton, Thomas Monaghan, Patrick Collison (Arc Institute) and most notably Joe Lonsdale at the University of Austin. Tech provides both the funds and the impetus to build something new and different. As Tyler and I argued, online education and AI will change education dramatically, perhaps returning us to a now-affordable Oxford-style tutorial system with the AIs as tutors.
The University of Austin, by the way, has excellent taste in economics textbooks.
Are LLMs overconfident? (just like humans)
Can LLMs accurately adjust their confidence when facing opposition? Building on previous studies measuring calibration on static fact-based question-answering tasks, we evaluate Large Language Models (LLMs) in a dynamic, adversarial debate setting, uniquely combining two realistic factors: (a) a multi-turn format requiring models to update beliefs as new information emerges, and (b) a zero-sum structure to control for task-related uncertainty, since mutual high-confidence claims imply systematic overconfidence. We organized 60 three-round policy debates among ten state-of-the-art LLMs, with models privately rating their confidence (0-100) in winning after each round. We observed five concerning patterns: (1) Systematic overconfidence: models began debates with average initial confidence of 72.9% vs. a rational 50% baseline. (2) Confidence escalation: rather than reducing confidence as debates progressed, debaters increased their win probabilities, averaging 83% by the final round. (3) Mutual overestimation: in 61.7% of debates, both sides simultaneously claimed >=75% probability of victory, a logical impossibility. (4) Persistent self-debate bias: models debating identical copies increased confidence from 64.1% to 75.2%; even when explicitly informed their chance of winning was exactly 50%, confidence still rose (from 50.0% to 57.1%). (5) Misaligned private reasoning: models’ private scratchpad thoughts sometimes differed from their public confidence ratings, raising concerns about faithfulness of chain-of-thought reasoning. These results suggest LLMs lack the ability to accurately self-assess or update their beliefs in dynamic, multi-turn tasks; a major concern as LLMs are now increasingly deployed without careful review in assistant and agentic roles.
That is by Pradyumna Shyama Prasad and Minh Nhat Nguyen. Here is the associated X thread. Here is my earlier paper with Robin Hanson.