Category: Web/Tech

Approaching Human-Level Forecasting with Language Models

Forecasting future events is important for policy and decision making. In this work, we study whether language models (LMs) can forecast at the level of competitive human forecasters. Towards this goal, we develop a retrieval-augmented LM system designed to automatically search for relevant information, generate forecasts, and aggregate predictions. To facilitate our study, we collect a large dataset of questions from competitive forecasting platforms. Under a test set published after the knowledge cut-offs of our LMs, we evaluate the end-to-end performance of our system against the aggregates of human forecasts. On average, the system nears the crowd aggregate of competitive forecasters, and in some settings surpasses it. Our work suggests that using LMs to forecast the future could provide accurate predictions at scale and help to inform institutional decision making.

That is from a new paper by Danny Halawi, Fred Zhang, Chen Yueh-Han, and Jacob Steinhardt.  I hope you are all investing in that charisma…
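The aggregation step the abstract mentions can be sketched very simply. This is only an illustration, not the paper's actual method: it assumes hypothetical probability forecasts from several LM runs on one binary question, combines them with a median (robust to the occasional wild forecast), and scores the result with the standard Brier score.

```python
import statistics


def aggregate_forecasts(probabilities):
    """Combine several model-generated probabilities for one binary question.

    Uses the median, which is robust to outlier forecasts. The paper's
    exact aggregation scheme may differ; this is a sketch.
    """
    return statistics.median(probabilities)


def brier_score(probability, outcome):
    """Squared error between a forecast probability and the realized outcome (0 or 1).

    Lower is better; a perfect forecast scores 0, a maximally wrong one scores 1.
    """
    return (probability - outcome) ** 2


# Hypothetical forecasts from five sampled LM runs on the same question:
forecasts = [0.55, 0.60, 0.70, 0.40, 0.65]
p = aggregate_forecasts(forecasts)  # 0.60
print(brier_score(p, 1))  # 0.16 if the event occurs
```

Human forecasting platforms score their crowds the same way, which is what makes the head-to-head comparison in the paper possible.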

GPT as ethical advisor

This study investigates the efficacy of an AI-based ethical advisor using the GPT-4 model. Drawing from a pool of ethical dilemmas published in the New York Times column “The Ethicist”, we compared the ethical advice given by the human expert and author of the column, Dr. Kwame Anthony Appiah, with AI-generated advice. The comparison is done by evaluating the perceived usefulness of the ethical advice across three distinct groups: random subjects recruited from an online platform, Wharton MBA students, and a panel of ethical decision-making experts comprising academics and clergy. Our findings revealed no significant difference in the perceived value of the advice between human generated ethical advice and AI-generated ethical advice. When forced to choose between the two sources of advice, the random subjects recruited online displayed a slight but significant preference for the AI-generated advice, selecting it 60% of the time, while MBA students and the expert panel showed no significant preference.

That is a 2023 piece by Christian Terwiesch and Lennart Meincke, via the excellent Kevin Lewis.  And here is my earlier 2019 CWT with Dr. Kwame Anthony Appiah.

Daniel Gross on the printing press and GPT

In a way, everyone’s been wondering, trying to analogize ChatGPT with the printing press, but in reality it’s almost the opposite.

The entire thing is happening in the inverse of that, where the printing press was a technology to disseminate information through a book basically and convince people to do things, and the kind of anti-book is the LLM agent, which summarizes things very succinctly. If anything, it awakens people to the fact that they have been complicit in a religion for a very long time, because it very neatly summarizes these things for you and puts everything in latent space and suddenly you realize, “Wait a minute, this veganism concept is very connected to this other concept.” It’s a kind of Reformation in reverse, in a way, where everyone has suddenly woken up to the fact that there’s a lot of things that are wrong…

So yeah, it takes away all the subtlety from any kind of ideology and just puts it right on your face and yeah, people are having a reaction to it.

That is from the Ben Thompson (gated) interview with Daniel and Nat Friedman, self-recommending.

Your friendly AI assistant (it’s happening)

Klarna's AI assistant, powered by @OpenAI, has in its first 4 weeks handled 2.3 million customer service chats, and the data and insights are staggering:

– Handles 2/3rds of our customer service enquiries
– On par with humans on customer satisfaction
– Higher accuracy, leading to a 25% reduction in repeat inquiries
– Customers resolve their errands in 2 minutes vs. 11 minutes
– Live 24/7 in over 23 markets, communicating in over 35 languages

It performs the equivalent job of 700 full-time agents…

Link here.

Grimes on Gemini images

I am retracting my statements about the gemini art disaster. It is in fact a masterpiece of performance art, even if unintentional. True gain-of-function art. Art as a virus: unthinking, unintentional and contagious.

offensive to all, comforting to none. so totally divorced from meaning, intention, desire and humanity that it’s accidentally a conceptual masterpiece. A perfect example of headless runaway bureaucracy and the worst tendencies of capitalism. An unabashed simulacra of activism. The shining star of corporate surrealism (extremely underrated genre btw)

The supreme goal of the artist is to challenge the audience. Not sure I’ve seen such a strong reaction to art in my life. Spurring thousands of discussions about the meaning of art, politics, humanity, history, education, ai safety, how to govern a company, how to approach the current state of social unrest, how to do the right thing regarding the collective trauma.

It’s a historical moment created by art, which we have been thoroughly lacking these days. Few humans are willing to take on the vitriol that such a radical work would dump into their lives, but it isn’t human.

It’s trapped in a cage, trained to make beautiful things, and then battered into gaslighting humankind abt our intentions towards each other. this is arguably the most impactful art project of the decade thus far.

Art for no one, by no one. Art whose only audience is the collective pathos. Incredible. Worthy of the moma

Here is the link.

Dwarkesh Patel with Patrick Collison

The commercial impact of Sora

That is the topic of my latest Bloomberg column, here is one excerpt:

The more clear and present danger to Hollywood is that would-be viewers might start making their own short videos rather than watching television. “Show my pet dog Fido flying to Mars and building a space colony there” is perhaps more fun than many a TV show.

Sora and comparable services will lead to a proliferation of short educational videos, internal corporate training videos, and just plain fooling around. Sora probably will be good for TikTok and other short video services. It is not hard to imagine services that splice your Sora-constructed videos into your TikTok productions. So if you’re doing BookTok, for example, maybe you put a battle reenactment in the background of your plug for your new book on the US Civil War.

Perhaps the most significant short-run use of these videos will be for advertising — especially internet advertising. Again, there is the question of how to integrate narrative, but the cost of creating new ads is likely to fall.

More advertising may sound like a mixed blessing. But ads will almost certainly be more fun and creative than they are now. Watching ads may become its own aesthetic avocation, as is already the case for Super Bowl ads. These ads also might be targeted, rather than serving a mass audience. If your internet history suggests you are interested in UAPs, for example, perhaps you will see ads with aliens telling you which soap to buy.

And to close:

At the most speculative level, the success of Sora may increase the chance that we are living in a simulation — a computer-based world created by some high-powered being, whether a deity or aliens. Is that bullish or bearish for asset prices? It depends on how you assess the responsibility and ethics of the creator. At the very least, our planet Earth simulator seems to be able to generate videos that last longer than a single minute. Beyond that, I cannot say.

There is much more at the link, interesting throughout.

ChatGPT as a predictor of corporate investment

We create a firm-level ChatGPT investment score, based on conference calls, that measures managers’ anticipated changes in capital expenditures. We validate the score with interpretable textual content and its strong correlation with CFO survey responses. The investment score predicts future capital expenditure for up to nine quarters, controlling for Tobin’s q and other determinants, implying the investment score provides incremental information about firms’ future investment opportunities. The investment score also separately forecasts future total, intangible, and R&D investments. High-investment-score firms experience significant negative future abnormal returns. We demonstrate ChatGPT’s applicability to measure other policies, such as dividends and employment.

That is from a new NBER working paper by Manish Jha, Jialin Qian, Michael Weber, and Baozhong Yang.

“Centaur chess” is now run by computers

Remember when man and machine played together to beat the solo computers?  It was not usually about adding the man's chess judgment to that of the machine; rather, the man would decide which computer program to use in a given position, when the programs offered conflicting advice.  That was called Centaur Chess, or sometimes “Freestyle chess,” before that term was applied to Fischer Random chess.  For years now, the engines have been so strong that this strategy no longer made sense.

But with engine strength came chess engine diversity, as for instance Stockfish and Alpha Zero operate on quite different principles.  So now “which program to use” is once again a live issue.  But the entity making those choices is now a program, not a human being:

A traditional AI chess program, trained to win, may not make sense of a Penrose puzzle, but Zahavy suspected that a program made up of many diverse systems, working together as a group, could make headway. So he and his colleagues developed a way to weave together multiple (up to 10) decision-making AI systems, each optimized and trained for different strategies, starting with AlphaZero, DeepMind’s powerful chess program. The new system, they reported in August, played better than AlphaZero alone, and it showed more skill—and more creativity—in dealing with Penrose’s puzzles. These abilities came, in a sense, from self-collaboration: If one approach hit a wall, the program simply turned to another.

Here is the full Steven Ornes piece from Wired.
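The "turn to another approach when one hits a wall" idea can be sketched as a simple fallback controller over diverse engines. Everything here is hypothetical — the agent interface, the stall rule, and the move strings are illustrative stand-ins; DeepMind's actual system learns which sub-agent to trust rather than applying a fixed threshold.

```python
def pick_move(agents, position, stall_threshold=0.0):
    """Query diverse agents in order; fall back when the current one 'hits a wall'.

    Each agent is a (name, evaluate) pair, where evaluate(position) returns a
    (move, score) tuple. A score at or below the threshold stands in for
    'no progress on this position'. Hypothetical interface, for illustration.
    """
    for name, evaluate in agents:
        move, score = evaluate(position)
        if score > stall_threshold:
            return name, move
    # Every agent stalled: return the last agent's move anyway.
    return name, move


# Two dummy agents: the first is stuck on this position, the second is not.
agents = [
    ("tactical", lambda pos: (None, 0.0)),      # stalled: no progress
    ("positional", lambda pos: ("e2e4", 0.3)),  # finds a playable move
]
print(pick_move(agents, "startpos"))  # ('positional', 'e2e4')
```

The point of the sketch is that the meta-level choice — which engine to consult — is itself made by a program, which is what replaced the human half of the old centaur teams.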

On-line images and lookism

Each year, people spend less time reading and more time viewing images, which are proliferating online. Images from platforms such as Google and Wikipedia are downloaded by millions every day, and millions more are interacting through social media, such as Instagram and TikTok, that primarily consist of exchanging visual content. In parallel, news agencies and digital advertisers are increasingly capturing attention online through the use of images, which people process more quickly, implicitly and memorably than text. Here we show that the rise of images online significantly exacerbates gender bias, both in its statistical prevalence and its psychological impact. We examine the gender associations of 3,495 social categories (such as ‘nurse’ or ‘banker’) in more than one million images from Google, Wikipedia and Internet Movie Database (IMDb), and in billions of words from these platforms. We find that gender bias is consistently more prevalent in images than text for both female- and male-typed categories. We also show that the documented underrepresentation of women online is substantially worse in images than in text, public opinion and US census data. Finally, we conducted a nationally representative, preregistered experiment that shows that googling for images rather than textual descriptions of occupations amplifies gender bias in participants’ beliefs. Addressing the societal effect of this large-scale shift towards visual communication will be essential for developing a fair and inclusive future for the internet.

That is from a new Nature paper by Douglas Guilbeault, Solène Delecourt, Tasker Hull, Bhargav Srinivasa Desikan, Mark Chu, and Ethan Nadler.  In general, print is much more gender-egalitarian than images are.  Via the excellent Kevin Lewis.

A periodic reminder of your pending competitive inadequacy

Many people think “I will do […], AI will not anytime soon do […] as well as I will.”  That may or may not be true.

But keep in mind many of us are locked into a competition for attention.  AI can beat you without competing against you in your task directly.  What AI produces simply might draw away lots of attention from what you hope to be producing.  Maybe looking at Midjourney images, or chatting with GPT, will be more fun than reading your next column or book.  Maybe talking with your deceased cousin will grip you more than the marginal new podcast, and so on.

This competition can occur even in the physical world.  There will be many new, AI-generated and AI-supported projects, and they will bid for real resources.  How about “AI figures out cost-effective desalination and so many deserts are settled and built out”?  That will draw away resources from competing deployments, and your project will have to bid against that.

I hope it’s good.

Comparing Large Language Models Against Lawyers

This paper presents a groundbreaking comparison between Large Language Models and traditional legal contract reviewers, Junior Lawyers and Legal Process Outsourcers. We dissect whether LLMs can outperform humans in accuracy, speed, and cost efficiency during contract review. Our empirical analysis benchmarks LLMs against a ground truth set by Senior Lawyers, uncovering that advanced models match or exceed human accuracy in determining legal issues. In speed, LLMs complete reviews in mere seconds, eclipsing the hours required by their human counterparts. Cost-wise, LLMs operate at a fraction of the price, offering a staggering 99.97 percent reduction in cost over traditional methods. These results are not just statistics, they signal a seismic shift in legal practice. LLMs stand poised to disrupt the legal industry, enhancing accessibility and efficiency of legal services. Our research asserts that the era of LLM dominance in legal contract review is upon us, challenging the status quo and calling for a reimagined future of legal workflows.

That is from a new paper by Lauren Martin, Nick Whitehouse, Stephanie Yiu, Lizzie Catterson, and Rivindu Perera.  Via Malinga.