Will Sam Altman’s $7 Trillion Plan Rescue AI?

It’s all over the web: Sam Altman has called for a massive investment, and not just any massive investment: he has reportedly called for $5-$10 trillion of investment in AI-chip manufacturing. Sam regularly seems to suggest that Artificial General Intelligence (AGI), reached by supersizing Generative AI, is the way forward (for humanity, no less). What to think of it?

Some picture this as a ‘moonshot’: an audacious goal that creates massive innovation, like the goal that literally got people walking on the moon. In today’s money, the US spent about $150 billion on the moon program, and at its peak it consumed about 4% of the US federal budget: one in every 25 federal dollars was spent on the space race.

Sam’s ‘moonshot’ is about creating AI, specifically the Generative AI that has caught everyone’s attention (pun intended) and imagination. And his ‘moonshot’ is 50(!) times as large as the original ‘moonshot’, the Apollo program, was in its day. It doesn’t seem Sam wants new technology and new breakthroughs from that money, as the Apollo program delivered. Sam’s $7 trillion vision apparently (a detailed plan isn’t available) exists mainly to address a shortage in current production, i.e. the GPUs from companies like NVIDIA. He wants volume. And lots of it.

Now, many have taken shots at his idea, as in: what about the energy cost, the environment, the (skilled) labour? Or: with that kind of money we could solve quite a few other pressing issues, such as climate change. But it seems (Sam is seldom very transparent or clear) that he is convinced all our problems will be solved automatically if we just get AGI, and that scaling the current Generative-AI-driven boom is therefore what is needed. In this view, Sam’s Generative AI route is a silver bullet that will slay all of humanity’s problems.

A most succinct observation came from ‘grand old man of being intelligent about IT’ Grady Booch:

You know what? Grady’s ‘professional estimate’ can actually be supported by observation and some analysis based on OpenAI’s own publications.

Before OpenAI stopped publishing detailed scientific work, they released an influential paper under Computation and Language called Language Models are Few-Shot Learners, in which they reported a lot of data on GPT-3’s performance. Especially notable is that they reported on eight different ‘sizes’ of the GPT-3 architecture, ranging from 125 million to 175 billion parameters:

Language Models are Few-Shot Learners, Computation and Language, 2020

The one we know as “GPT-3” is simply the largest of these, with 175 billion parameters and a 12,288-dimensional embedding (for what these numbers mean, see here or here). What this report makes possible is illustrating how size influences performance, and OpenAI did just that. Here are two examples: scores on the multiple-choice SAT Analogies test on the left, and on predicting a missing last word in a sentence (the LAMBADA benchmark) on the right:

The blue line in the left graph is GPT-3’s zero-shot performance as a function of size. ‘Zero-shot’ means, more or less, that the model did not get any help from humans, so this is its raw performance (for more, see this explanation of zero-shot, one-shot and few-shot, and why it is important to understand these, as they may be misleading when looking at benchmark results).

We ignore the misleading one-shot and few-shot values, because anything other than zero-shot cannot be used to benchmark the kind of behaviour actual users will exhibit.

The paper mentions several times that performance scales ‘smoothly’ with model size. So, it seems from the paper that scaling really works.

There is a problem, though. The graphs above use a logarithmic scale for the x-axis (the number of parameters). Logarithmic scales can be highly misleading, and in this case they are: if you change the axis from logarithmic to linear, quite a different picture emerges. For example, this blue line from the right-hand image above

becomes this blue line when we turn the axis ‘normal’:

That second, ‘reverse-hockeystick’ curve is a more honest representation of what scaling does for LLMs. From the 13-billion-parameter GPT-3 to the 175-billion-parameter GPT-3, the size increased roughly 13-fold (about 1,250%) while the LAMBADA performance increased by about 5%. Visually, it seems that already in 2020 the transformer approach was plateauing, size-wise.
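
To see the effect for yourself, here is a minimal sketch that plots the same kind of curve twice: once on a logarithmic and once on a linear parameter axis. The accuracy values are illustrative stand-ins in the spirit of the paper’s zero-shot LAMBADA curve, not exact readings:

```python
import matplotlib.pyplot as plt

# The eight GPT-3 sizes (parameters) and illustrative zero-shot accuracies,
# roughly in the spirit of the 2020 paper; the exact scores are assumptions.
params = [125e6, 350e6, 760e6, 1.3e9, 2.7e9, 6.7e9, 13e9, 175e9]
accuracy = [43, 54, 60, 64, 67, 70, 72, 76]  # percent

fig, (ax_log, ax_lin) = plt.subplots(1, 2, figsize=(10, 4))

# Left: the paper-style view, parameters on a logarithmic axis.
ax_log.plot(params, accuracy, "o-")
ax_log.set_xscale("log")
ax_log.set_title("Logarithmic x-axis: looks like steady growth")

# Right: the same data on a linear axis, the 'reverse hockeystick'.
ax_lin.plot(params, accuracy, "o-")
ax_lin.set_title("Linear x-axis: growth flattens out")

for ax in (ax_log, ax_lin):
    ax.set_xlabel("parameters")
    ax.set_ylabel("zero-shot accuracy (%)")

plt.tight_layout()
plt.show()
```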

Of course there are many more ways to improve the performance of a model, e.g. by putting a lot of effort into fine-tuning, for instance by creating hand-crafted data sets or using cheap labour in countries like Kenya. That is what OpenAI had been doing between the moment they had their 175-billion-parameter model (probably around late 2019, given that paper) and the public launch of ChatGPT (late 2022).

There is a reason why OpenAI/Microsoft spend most of their effort these days not on improving the LLMs at the core of their products, but on engineering the hell around the limitations they know will not easily disappear by scaling up.

How much scaling do you need for GenAI to become human-level AI?

You can use these numbers to estimate how much scaling you need to get into the neighbourhood of human performance with your ‘approximation of human intelligence by token/pixel prediction’. I’ve done this sketchily by:

  • finding out how many logarithm operations you need to apply to make the curve actually straight, so that a linear extrapolation becomes possible: this turns out to be about three, i.e. log-log-log *);
  • taking the difference between the 13B and 175B models and extrapolating to human performance (e.g. on LAMBADA humans score about 95%, on SAT Analogies about 57%) in that log-log-log setting;
  • calculating back how many parameters you need for that human-level performance (a sketch of this calculation follows below).
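
Here is a minimal sketch of that back-of-the-envelope calculation, assuming natural logarithms, the paper’s zero-shot LAMBADA scores for the 13B and 175B models (roughly 72.5% and 76.2%) and a human score of 95%. The outcome is extremely sensitive to the choice of log base and to which scores you read off the graphs, so treat the printed number as an illustration of how explosively the required size grows rather than a precise reproduction of the figures below:

```python
import math

def triple_log(p):
    """Apply log three times: the 'log-log-log' that roughly straightens the curve."""
    return math.log(math.log(math.log(p)))

def params_for_human_level(p_small, s_small, p_large, s_large, s_human):
    """Linearly extrapolate score vs. triple-log(parameters) between two model
    sizes and solve for the size at which the line reaches the human score."""
    x_small, x_large = triple_log(p_small), triple_log(p_large)
    slope = (s_large - s_small) / (x_large - x_small)
    x_human = x_large + (s_human - s_large) / slope
    # Invert the three logs to get back to a parameter count.
    return math.exp(math.exp(math.exp(x_human)))

# Zero-shot LAMBADA: the 13B model scores ~72.5%, the 175B model ~76.2%,
# humans ~95% (all taken as assumptions for this sketch).
needed = params_for_human_level(13e9, 72.5, 175e9, 76.2, 95.0)
print(f"~{needed:.1e} parameters, ~{needed / 175e9:.1e} times GPT-3")
```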

For SAT Analogies and LAMBADA, the answers are roughly 0.1 and 7.8 quadrillion parameters respectively, or about 660 to 44,000 times the size of GPT-3. Let’s say OpenAI’s investment has been around $10 billion so far. Sam’s $7 trillion gets him roughly 700 times that amount, which sits at the very low end of the above range, and even then it would give him a single model that does well on a single benchmark.
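
Spelled out with the round numbers used above (all of them rough assumptions):

```python
gpt3_params = 175e9                  # parameters in the largest GPT-3

low_end  = 660 * gpt3_params         # ~0.1 quadrillion (SAT Analogies estimate)
high_end = 44_000 * gpt3_params      # ~7.8 quadrillion (LAMBADA estimate)

spent_so_far = 10e9                  # rough OpenAI funding to date
proposed     = 7e12                  # Sam's reported figure

print(f"scale needed: {low_end:.1e} to {high_end:.1e} parameters")
print(f"money multiple: {proposed / spent_so_far:.0f}x what has been spent so far")
```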

It gets a bit hilarious when you use the numbers from the Winogrande benchmark, which is one that is especially hard on GenAI, because to score high on it you really need to understand the text. The test has been designed to require insight, e.g. it contains problems like “Robert woke up at 9:00am while Samuel woke up at 6:00am, so he had less time to get ready for school.”, where the question of course is: who is ‘he’? Easy enough for a human, hard for attention-based next-token prediction systems like GPT.

Of course, these are quick and very dirty calculations. They also ignore all the ‘engineering the hell around limitations’ that is already going on. But it definitely seems Uncle Grady’s ‘professional estimate’ isn’t crazy. The core architecture of the current GenAI hype, Generative AI, the one that needs all those GPUs Sam wants to produce, doesn’t scale the way Sam suggests it does. Throwing $7 trillion at the situation simply to scale won’t solve much, but it will create a few problems of its own.

Sam’s $7 trillion ‘moonshot’ thus isn’t a moonshot. The current engineering-with-GenAI frenzy is already an admission of defeat as far as Generative AI is concerned. OpenAI and friends should already know that with the current resources the current architecture has no chance of getting anywhere near actually ‘intelligent’, and that LLMs will more and more become an element in a larger, non-AI ecosystem (like Python being used behind the scenes to work around GPT’s lack of arithmetic capability). They also must know from the numbers that even $7 trillion isn’t enough to reach their professed goal of human-level AI. But convictions are hardy beasts.
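
To make that ‘Python behind the scenes’ point concrete, here is a toy sketch of the general pattern: an arithmetic question is handed to ordinary code instead of being ‘predicted’ token by token. It is purely illustrative; it is not how any particular product implements its tool use, and the answer() helper and the expression it receives are made up for the example:

```python
import ast
import operator

# Operators we allow in the sandboxed arithmetic evaluator.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv,
       ast.Pow: operator.pow}

def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("only plain arithmetic is allowed")
    return walk(ast.parse(expr, mode="eval"))

def answer(question: str, expression_from_model: str) -> str:
    # In real systems the model decides to emit a tool call; here we simply
    # pretend it produced `expression_from_model` for the question.
    return f"{question} {safe_eval(expression_from_model)}"

print(answer("175e9 * 44000 =", "175e9 * 44000"))
```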

I do expect we will still get some useful (but ultimately dumb) tools out of this, like ‘living documentation’ and ‘summarising’. Combinations of technologies (what Microsoft is doing with Copilot) may be engineering that results in usable tools, in the same way that we got useful stuff from the original moonshot (which of course did not entirely halt: we stayed in space, we just did not go back to the moon for a long while). But how much, and at what (long-term) price? And will the experience many people will have with this technology’s fundamental lack of understanding result in another ‘AI Winter’? It is hard to imagine now, but Meta’s Chief AI Scientist Yann LeCun recently said in an interview that they originally called their technique ‘deep learning’ so they did not need to use the term ‘AI’, the mention of which at that time was a quick and certain way to be shown the door.

It is also good to repeat another perspective here. GenAI doesn’t produce ‘results’ and ‘hallucinations’; it produces ‘approximations’, period. Even its correct answers are approximations of understanding, not understanding itself. We should thus stop labelling the failures as ‘errors’ or ‘hallucinations’ and call them ‘failed approximations’ instead, as the word ‘error’ suggests the default is actual, real ‘understanding’ (even Sam Altman has said these aren’t bugs but features).

*) We do not have these benchmark numbers or model sizes for GPT-4 (or GPT-3.5, for that matter). The exception is Winogrande, for which zero-shot benchmark results can be found in the GPT-4 technical report. These seem to suggest that Winogrande results improved uncommonly strongly from GPT-3 to GPT-4. That might mean that the relation between parameters and performance is different from what my rough estimation assumes, and thus that the numbers could be less bad. It could also mean that this specific benchmark profited from specific fine-tuning. Without OpenAI releasing benchmark numbers and model sizes, we can’t really do a halfway decent rough estimate beyond the 2020 paper used here.

P.S. You might not think it after reading this, but I do find ChatGPT and friends impressive technology.

This article is part of The “ChatGPT and Friends” Collection
