Nat Friedman discusses Sora, and image wisdom

NF: Yeah, I interviewed the Sora creators yesterday, the day before on stage at an event and it was super interesting to hear their point of view. I think we see Sora as this media production tool, but that’s not their view, that’s a side effect. Their view is that it is a world simulator and that in fact it can sort of simulate any kind of behavior in the world, including going as far as saying, “Let’s create a video with Ben and Daniel and Nat and have them discuss this,” and then see where the conversation goes. And their view is also that Sora today is at GPT-1 scale, not a lot of data, not a lot of compute, and so we should expect absolutely dramatic improvement in the future as they simply scale it up, and thirdly that there’s just a lot more video data than there is text data on the Internet…

And then Andrej Karpathy, I was talking to him the other day too, and he said, “There’s something strange going on-”

[Ben Thompson] And a picture is worth a thousand words by the way, so the number of tokens there is just astronomically larger.

NF: He was exploring this idea that the world model in image and video models might actually be better than in text models. You ask it for a car engine, or someone fixing a carburetor, and the level of detail that can be in there is extraordinary, and maybe we made a mistake by training on the text extracted from Common Crawl. I asked him for his most unhinged research idea. He said what we should do instead is train on pictures of web pages, and when you ask the model a question, it outputs a picture of a web page with the answer, and maybe we’d get way more intelligence and better results from that.

That is from his dialogue with Ben Thompson and Daniel Gross, gated but worth paying for.
