46 Comments
Dec 27, 2023 · edited Dec 27, 2023 · Liked by Dwarkesh Patel

"Presumably, animal drawings in LaTeX are not part of GPT-4’s training corpus."

Why presume? If you simply search for "drawing animals with latex", you find a huge pre-existing literature on how to make animal drawings with raw LaTeX code or libraries like TikZ. LaTeX art is a well-established thing, and I fooled around with it as an undergrad long ago.

Never underestimate how much leakage there may be between training and test sets, and how much memorization can be happening! "Text from the internet" contains much weirder stuff than you think.

But the deeper point is that it's completely impossible to evaluate LLM performance without knowing the training set, which the big companies all refuse to reveal. My favorite paper on this is "Pretraining on the Test Set Is All You Need" (https://arxiv.org/abs/2309.08632), in which the author shows that you can beat all the big LLMs on the benchmarks with a tiny LLM if you train on the test set. It's a brilliant parody, but it has a point: how do we know the big LLMs aren't also doing this accidentally? I wouldn't update my beliefs much on how well LLMs can generalize until the big companies reveal much more about how their models were built.
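
To make this concrete, here's a minimal sketch of the kind of n-gram overlap check people run to look for train/test contamination. It's not from the post or the paper; the whitespace tokenization, n = 8, and the 10% flagging threshold are arbitrary illustrative choices.

```python
# Minimal n-gram contamination check between a training corpus and a
# benchmark test set. Tokenization, n, and the threshold are arbitrary.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_rate(test_example, train_ngrams, n=8):
    test_ngrams = ngrams(test_example, n)
    if not test_ngrams:
        return 0.0
    return len(test_ngrams & train_ngrams) / len(test_ngrams)

train_corpus = ["...training documents go here..."]   # placeholder
test_set = ["...benchmark questions go here..."]      # placeholder

train_ngrams = set()
for doc in train_corpus:
    train_ngrams |= ngrams(doc)

flagged = [ex for ex in test_set if contamination_rate(ex, train_ngrams) > 0.10]
print(f"{len(flagged)} of {len(test_set)} test examples look contaminated")
```

A check like this only catches near-verbatim overlap, of course; paraphrased leakage is much harder to detect, which is part of the problem.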

Just a spectacular piece of writing. Thank you, Dwarkesh, for being a continuing source of enlightenment.

Dec 29, 2023 · Liked by Dwarkesh Patel

"Even taking handwavy scaling curves seriously implies that we’ll need 1e35 FLOPs for an AI that is reliable and smart enough to write a scientific paper (that’s table stakes for the abilities an AI would need to automate further AI research and continue progress once scaling becomes infeasible)"

This is almost the opposite of the truth, and I'm confused why Tamay and Matthew didn't correct this. They say in the report "The Direct Approach yields an upper bound on the training compute to automate a task because there may be more efficient ways of getting a model to automate some task than to train the model to emulate humans performing that task directly."

So your sentence should instead read: "Even taking handwavy scaling curves seriously implies that, without any additional algorithmic progress, simply scaling up existing architectures with more data should result in AI that is reliable and smart enough to write academic papers by 1e35 FLOP." Notice the huge difference between "we need at least X FLOP" and "we need at most X FLOP."

Realistically we'll need far less than 1e35 FLOP because, as they say, there are way more efficient ways to get AI to be good at some task than the way implicitly assumed in this model.

(i.e., train the AI on human demonstrations until it is so damn good at simulating human demonstrations that it can autoregressively generate tens of thousands of tokens before anyone can tell the difference. This is SO UNNECESSARILY HARD. Imagine if you, a human, were being trained to solve hard math problems, but the way we did it was by showing you completed proofs with chunks of the ending cut off and asking you to finish them, with a timer so that you couldn't stop to think: you just had to keep typing as fast as possible, and you couldn't undo mistakes either. Grok how much harder it would be to learn to prove difficult math theorems this way! "Harder" is an understatement; more like "impossible!")
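
For a sense of scale, here's a back-of-the-envelope sketch of what a 1e35 FLOP budget would mean, using the standard C ≈ 6·N·D approximation for training compute and a Chinchilla-style D ≈ 20·N token budget. The numbers are illustrative assumptions, not anything from the report:

```python
import math

# Rough training-compute arithmetic: C ≈ 6 * N * D FLOP, with a
# Chinchilla-style token budget D ≈ 20 * N. Solving 6 * N * (20 * N) = C
# for N gives N = sqrt(C / 120).
C = 1e35                   # hypothetical compute budget in FLOP
N = math.sqrt(C / 120)     # parameters at the compute-optimal point
D = 20 * N                 # training tokens at that point

print(f"params ≈ {N:.1e}, tokens ≈ {D:.1e}")
# ~3e16 parameters and ~6e17 tokens -- vastly more text than any
# plausible human-written corpus contains, hence the data discussion.
print(f"multiple of a ~1e25 FLOP frontier-scale run: {C / 1e25:.0e}x")
```

Whichever way the bound cuts, the exercise at least shows how far 1e35 FLOP sits beyond today's largest training runs.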

Dec 26, 2023 · Liked by Dwarkesh Patel

re: synthetic data

My understanding is that, to the extent LLMs have world models, they are built by establishing statistical relationships between concepts, as represented by each concept's constituent components (tokens). Unless synthetic data provides the model with new information about the relationships between those concepts, I don't understand the mechanism by which it can make LLMs more intelligent.

I think you would need an LLM to "observe" more phenomena about the world and create its own additional training data (akin to Whisper) rather than create completely artificial training data.
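
To spell out the distinction (with hypothetical method names, purely for illustration): pseudo-labeling new real-world observations, Whisper-style, injects information the model didn't already have, whereas sampling from the model's own distribution doesn't.

```python
# Illustrative contrast; `model.annotate` and `model.generate` are
# hypothetical stand-ins, not a real API.

def pseudo_label_real_data(model, raw_observations):
    # Whisper-style: the raw observations (e.g. unlabeled audio scraped
    # from the world) are new information; the model just converts them
    # into trainable (input, label) pairs.
    return [(obs, model.annotate(obs)) for obs in raw_observations]

def purely_synthetic_data(model, prompts):
    # Sampling from the model's own distribution: no new observations
    # enter the pipeline, so at best it reshuffles what the model
    # already "believes".
    return [(prompt, model.generate(prompt)) for prompt in prompts]
```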

Dec 27, 2023 · Liked by Dwarkesh Patel

Great post! You should write more in this style

Great writing, insightful. I enjoyed reading that.

Jan 19 · Liked by Dwarkesh Patel

You should put this on the podcast feed

The point about technological development preceding theory is key - in that it typifies the misconception that anchors most current thought around AI. This is not a novel technology, but a novel science - and science does not develop in the absence of supporting theory or a large body of empirical evidence drawn from systematic experimentation.

The key terms in AI still lack consensus definitions - weighing comparative architectural success in ‘reasoning’ is meaningless given that there is no consensus, in the context of AI, on what reasoning denotes, and, crucially, given a complete disinterest in the organic processes that the principals are seeking to emulate. The fact that clarificatory notions like mechanistic interpretability derive from an unfancied stream of the overall AI epistemology, and that the computer scientists in question are satisfied with the obscurity of even the present systems, unfit as they are to emulate the functions that are the stuff of ASI dreams, demonstrates that the appetite to grasp the foundations of this stuff is just not there (yet). I imagine this is substantially a function of the money to be made in overestimating the near-term attainability of transformative versions of this technology, and partly one of profound idealistic self-deception on the part of CS majors who have decidedly limited conceptions of how organic intelligence works.

Once one bridges to notions like consciousness, the prospect of present methods extending to that kind of success becomes completely fanciful - to credit scaling is to credit the notion that one of the most analysis-impervious concepts in all of the knowledge estate (i.e. consciousness), which we cannot even faintly model in its mechanics or in the mathematical formalisms to which it must presumably abide, will be achieved by comparatively straightforward efforts of iterate-and-scale, in ignorance even of the precise outline of the unknown areas within it. It's unimaginably unlikely.

This article is very interesting, but I'm slightly struggling with the logic.

The first part lays out some detailed arguments for and against expecting scaling to continue. This is a very interesting debate. The data issue still looks to me like a big problem. Feeding LLM output into LLM input feels like it should break: there's no extra information coming in, just reinforcement of what the LLM already thinks. But that's just an intuition.

The second part (the conclusion) basically throws all this reasoning out of the window and just looks at track records, which is a totally reasonable argument but makes the first part seem a bit pointless. I'd be interested to see specific predictions systematically evaluated to expand on how well the hypothesis has held up over the past decade.

What does self-play actually mean? Getting AI to create text and using another AI to label it as good or bad, then training AI on that data.
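
As a heavily simplified sketch of that loop (where `generator` and `critic` are hypothetical stand-ins for real models):

```python
# Self-play-style data loop as described above: one model generates
# text, another scores it, and only high-scoring samples are kept as
# training data. `generator` and `critic` are hypothetical objects.

def self_play_round(generator, critic, prompts, keep_threshold=0.8):
    accepted = []
    for prompt in prompts:
        candidate = generator.generate(prompt)
        score = critic.score(prompt, candidate)  # e.g. 0.0 (bad) to 1.0 (good)
        if score >= keep_threshold:
            accepted.append((prompt, candidate))
    return accepted  # fed back into the next fine-tuning step
```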

What if AI were trained on captcha-like data? Sending tweets and other forms of communication to humans and scoring them for intelligible responses?

I'll register my LLM-scaling-to-ASI skepticism right here and now. A brilliant neural net outsmarting humanity would be like a brilliant neuron outsmarting a brain. It can only work if a part emulates the whole, e.g. via a virtual civilization, perhaps needing physical feedback/friction.

GenAI will be economically and politically transformative, but will not recursively self-improve and become ASI.

I would be interested to break the question down more. Sometimes you say “automate most cognitive labor” as if that were a fixed and singular goal. But there are so many ways of “automating most cognitive labor,” each of which would transform the world to a different degree. Would lawyers be automated away? Doctors? Would we just still have all the same jobs, but everyone has decent assistants? Will physical jobs like folding laundry just be harder to handle than software engineering? Sales? Marketing? Finance?

I know we don’t have the answers to these questions, but the whole post is about grappling with a question that we don’t have the answer to. Would it be easier or harder to answer a question like “how will the advance of AI affect medical practice”?

Nice post, this was a good read!

> “Oh we have 5x less data than we need - we just need a couple of 2x improvements in data efficiency, and we’re golden”

Out of curiosity, did you manage to find any numbers on what the data efficiency doubling rate is?

That 1e35 FLOP number is an *upper bound*, and it is basically a complete shot in the dark. We have no clue what abilities are unlocked at a given perplexity.

On data bottlenecks, I suspect people will discover new stores of data now that they are valuable to language models. Scarcity creates high prices, which in turn alleviate the scarcity.

Can new data come from synthetic data? I think not (except in specific areas), unless we have teams of people filtering the text generated by language models. See my discussion with Gwern here:

https://www.lesswrong.com/posts/vh4Cq6gwBAcPSj8u2/bootstrapping-language-models

The straightforward solution is to steadily increase data quality and quantity using more and more people. Teams of people (with AI assistants) can produce high-quality text and training datasets for the LMs, putting increasing amounts of effort and resources into data quality as the models scale. We can repeat this process for every conceivable task, putting in ever more work to weed out errors and make the models robust.
