We tend to say of neural networks that they ‘learn’ from their ‘training data’. But there are in fact two levels of learning in Large Language Models: one we label ‘learning’ but that is probably not really learning, and one we do not label ‘learning’ but that looks more like it.
Intro (no TL;DR this time, sorry)
Google’s Gemini has arrived. Google has produced videos, a blog, a technical background paper, and more. According to Google: Gemini surpasses state-of-the-art performance on a range of benchmarks including text and coding.
Looking at what Google is telling us, I have discovered that I have to clarify yet another ‘generally glossed over’ detail that is relevant for understanding what is going on in AI. So, I am going to clarify the secondary ‘learning’ these models can do — the ‘learning’ we experience when we give the model a prompt ‘with examples’ before asking what we want. Such examples are called ‘shots’. A prompt without an example is called ‘zero-shot’ and one with 3 examples is called ‘three-shot’. I have to explain this, because this kind of ‘learning’ features heavily when success rates on benchmarks are reported (as Google does for Gemini now), and it muddies the waters big time.
While researching for this story, I experimented with trying to derail GPT-4 by giving wrong examples. These shenanigans are reported at the end and they unearthed something peculiar.
I have to apologise: this will be a hard-to-follow story for some of you, and I haven’t been able to make it as simple as I would like. It is all pretty condensed too. So, unless you are already very well versed in these matters, or unless you have already watched/read my previous stories, please take the time to watch this (a 40-minute video making some essentials very clear) or at least read this.
Let’s start. As explained in my other stories, when neural nets are trained on the basis of training material, the process consists of the following loop (a minimal sketch in code follows the list):
1. Generate a single output on the basis of the current values of the parameters (GPT-3 has around 175 billion of these).
2. When the output is incorrect, calculate what the parameters would need to be to produce the correct answer.
3. Adapt the current parameters a bit so that they get closer to the correct parameters for this single output.
4. Go back to 1 for the next item in the training set, until you have done the whole training set.
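To make this concrete, here is a minimal, runnable sketch of those four steps at toy scale: a two-parameter linear model with a squared-error loss instead of a 175-billion-parameter network with a token loss. The mechanics (generate, compare, nudge the parameters, repeat) are the same; the snippet is my own illustration, not anyone’s actual training code.

```python
import numpy as np

rng = np.random.default_rng(0)
params = rng.normal(size=2)                      # start from random parameter values

# Toy 'training data': inputs of the form [x, 1] with target 2x + 1.
training_set = [(np.array([1.0, 1.0]), 3.0),
                (np.array([2.0, 1.0]), 5.0),
                (np.array([3.0, 1.0]), 7.0)]
learning_rate = 0.02

for epoch in range(1000):
    for x, target in training_set:
        output = params @ x                   # 1. generate an output with the current parameters
        error = output - target               # 2. compare it with the correct answer...
        gradient = 2 * error * x              #    ...and work out which way the parameters are off
        params -= learning_rate * gradient    # 3. adapt the parameters a little bit
                                              # 4. on to the next training item

print(params)                                 # close to [2, 1]
print(params @ np.array([10.0, 1.0]))         # close to 21: it has 'learned' y = 2x + 1
```

The end result is exactly what the text says: not ‘understanding’, just an optimised set of parameter values.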
Through this process people often say that the neural net ‘learns’. What it exactly ‘learns’ is a topic of debate. Enthusiasts generally talk as if the neural net has learned to understand the meaning of the training data, but while this feels that way when you interact with a chatbot like ChatGPT, it is not true. Whatever the case may be, the end result is an optimised set of parameters. Training a Neural Network is effectively a search for an optimum set of (billions of) parameter values.
These parameter values make the neural net, when it is being used — e.g. by you — generate the result by the following loop (sketched in code below):
- calculating ‘the best next token’
- adding that token to the input
- repeating until the just-generated token is <END> (which signals generation is complete).
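Here is a minimal sketch of that loop (my illustration, not any vendor’s code). The best_next_token stand-in below is a trivial lookup on the last token so the example runs; in a real LLM it is the full billions-of-parameters calculation over the entire token sequence so far.

```python
END = "<END>"

# Trivial stand-in for the neural net: 'best next token' via a lookup table.
CANNED = {"?": "The", "The": "answer", "answer": "is",
          "is": "9", "9": ".", ".": END}

def best_next_token(tokens):
    return CANNED.get(tokens[-1], END)

def generate(prompt_tokens):
    tokens = list(prompt_tokens)
    while True:
        token = best_next_token(tokens)   # calculate 'the best next token'
        if token == END:                  # <END> signals generation is complete
            break
        tokens.append(token)              # add it to the input and repeat
    return tokens

print(generate(["How", "many", "apples", "?"]))
# ['How', 'many', 'apples', '?', 'The', 'answer', 'is', '9', '.']
```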
To get a good feeling for what is happening, go to The “ChatGPT and Friends” Collection and watch the presentation or read my guest blog on Erik J. Larson’s substack for some key parts.
Apart from ‘learning’ during ‘training’, there is another type of ‘learning’, one that is often used when testing ‘how good these models actually are’, i.e. when they are performing against certain benchmarks. To explain, we start from Google’s Gemini announcement and the benchmark reports therein.
Google introduces Gemini on its blog with a set of benchmark numbers that show its largest model, Gemini Ultra, mostly outperforming GPT-4. Google’s caption reads: “Gemini surpasses state-of-the-art performance on a range of benchmarks including text and coding“, which means that GPT-4 is ‘state of the art’ and Gemini Ultra in most cases does better. But what I would like to focus your attention on is a small phrase that appears with some of these benchmarks. The result for the Big-Bench Hard reasoning benchmark, for instance, reads:
| Capability | Benchmark (higher is better) | Description | Gemini Ultra | GPT-4 |
|---|---|---|---|---|
| Reasoning | Big-Bench Hard | Diverse set of challenging tasks requiring multi-step reasoning | 83.6% (3-shot) | 83.1% (3-shot, via API) |
The small — but important — phrase is ‘3-shot’. What does that mean?
What this means is that in your prompt, you give the model three examples for it to use in its reply. For instance, a zero-shot (no example) prompt would be:
The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
A one-shot prompt would be:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
Just to stress the point: this is all prompt, no generation by the model. But what I have done here is provide, in the prompt, a scheme — the first Q and A — for the generator to use as context when calculating (over and over) the best next token while generating the ‘reply’. (If this means nothing to you, please first go to The “ChatGPT and Friends” Collection and watch the presentation, or (for the key part) read my guest blog on Erik J. Larson’s substack.) This can be seen as the model ‘learning’ from the example in the prompt — during generation, not during training. In contrast to the ‘learning’ during training, the parameters do not change.
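Mechanically, building such a few-shot prompt is nothing more than string assembly before anything reaches the model. The sketch below uses the example from the text; ask_model at the end is a placeholder for whatever API call you would actually use (OpenAI’s, Google’s, …), not a real function.

```python
def build_few_shot_prompt(examples, question):
    parts = []
    for q, a in examples:                  # the 'shots': worked Q/A pairs
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nA:")     # the actual question, answer left open
    return "\n\n".join(parts)

examples = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     "The answer is 11."),
]
question = ("The cafeteria had 23 apples. If they used 20 to make lunch "
            "and bought 6 more, how many apples do they have?")

prompt = build_few_shot_prompt(examples, question)   # a one-shot prompt
print(prompt)
# reply = ask_model(prompt)                # hypothetical call to the actual model
```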
Using one or more examples in prompts does have a positive effect on the results, but it is not perfect. Our ‘tinkering engineers’ have further improved on this by extending the few-shot prompts with “chain of thought” (CoT) elements. CoT is often triggered by adding the phrase ‘step by step’ to the prompt.
A — one-shot CoT — prompt (again: no reply shown here) could then look like this:
Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. 5 + 6 = 11. The answer is 11.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
This addition to the example answer gives the generator even more context to produce a correct output for the actual question at the end. See below in the appendix for some shenanigans, impressive ChatGPT replies, and an observation about something OpenAI seems to have silently done.
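In code terms, the CoT variant changes nothing structural. Reusing the build_few_shot_prompt helper and the question variable from the sketch above, the only difference is that the example answer now spells out the intermediate steps; it is still just text in the prompt.

```python
cot_examples = [
    ("Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
     "Each can has 3 tennis balls. How many tennis balls does he have now?",
     # the 'chain of thought' lives inside the example answer:
     "Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 tennis balls. "
     "5 + 6 = 11. The answer is 11."),
]
cot_prompt = build_few_shot_prompt(cot_examples, question)   # a one-shot CoT prompt
```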
You have to ask yourself: what is the model actually capable of if it copies-with-adaptation the multiple examples you provide in your multiple-shot prompt? It is fair to say that what the model is capable of ‘left to its own devices’ is what it can do zero-shot.
So, if the small print of the test result reads “10-shot”, an 87.8% accuracy of Gemini-Ultra on HellaSwag (“Commonsense reasoning for everyday tasks”) means: I give it 10(!) correct examples in a prompt, ask it to ‘use those as a recipe’ — technically, this is not what happens but it is how it looks to us — and it still produces one error in every eight tries.
And when the table says:
| Capability | Benchmark (higher is better) | Description | Gemini Ultra | GPT-4 |
|---|---|---|---|---|
| General | MMLU | Representation of questions in 57 subjects (incl. STEM, humanities, and others) | 90% CoT@32* | 86.4% 5-shot (reported) |
we have to dig to find out what ‘CoT@32’ is. From the report we learn that ‘CoT@32’ means that for each prompt, 32 replies (with CoT) are generated (32 ‘samples’), after which the best one is selected in — generally — some complicated manner [this sentence was updated on 3/Feb/2024, the original was wrong].
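How exactly the best of those 32 samples is selected is not spelled out in the blog, but the general ‘sample many, then pick one’ idea can be sketched as below. The majority vote here is the simpler ‘self-consistency’ variant known from the literature; it illustrates the cost/benefit trade-off and is not necessarily Google’s method. noisy_model is a made-up stand-in for the real thing.

```python
import random
from collections import Counter

def extract_final_answer(reply):
    # Toy extraction: take the last whitespace-separated word as 'the answer'.
    return reply.strip().split()[-1]

def answer_with_sampling(prompt, sample_with_cot, n_samples=32):
    answers = [extract_final_answer(sample_with_cot(prompt)) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]   # the answer most samples agree on

# Made-up 'model' that is right 70% of the time, to show the effect of sampling:
def noisy_model(prompt):
    return "The answer is 9" if random.random() < 0.7 else "The answer is 7"

print(answer_with_sampling("cafeteria question", noisy_model))   # almost always '9'
```

Note the price: 32 full generations (and their energy) for every single answer.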
I suggest we stop playing hide and seek here. Google and OpenAI: at least publish zero-shot numbers for all your tests. That at least gives a clean view of what the model can do without prompt ‘engineering’ and massive numbers of example ‘templates’ to make up for its deficiencies. Let’s be fair here: if humans use these models, they are not going to provide a set of correct examples before asking a question. They will just ask and expect a good reply. On the other hand: generating 32 different samples and then turning those into a single one is a neat trick, but if it gets you, say, 3% better benchmark numbers (say: from 84% to 87%) at 32 times the effort, what does that tell us?
Bamboozled by language
Let me say again: I am thoroughly impressed by the engineering that results in GPT and Gemini. But the phrase “Gemini surpasses state-of-the-art performance on a range of benchmarks including text and coding” by Google can use a ‘translator remark’.
That translator note is that by using the phrase ‘state of the art’ we are — once again — being ‘bewitched by language’. Because the meaning of ‘state of the art’ is ambiguous. It is often used as a powerful expression of something that is absolutely good. When we are told that we have a ‘state of the art’ race car, say a Formula 1 car for instance, it is simply a very, very good race car. That is what the phrase triggers in our minds.
But when AI researchers talk about ‘state of the art’, they are generally talking relatively, and often about something pretty poor. For instance, on the HumanEval benchmark for (Python) code generation, GPT-4 scores 67% — basically, two out of three replies are correct. Would you hire a programmer whose attempts at creating a code snippet fail one out of three times? Such a programmer is completely unable to write a program, if only because writing a program consists of hundreds — if not thousands — of such code snippets. Let’s (very) conservatively say a program requires 60 snippets of 20 lines each (which is a very small program): GPT-4’s code generation then has roughly a one-in-tens-of-billions chance of writing such a small program correctly in one go. Gemini Ultra’s higher pass rate improves that to somewhere around one in tens of millions. Much better, but still piss poor. And that doesn’t even take into account that this ‘state of the art’ test consists of very simple problems. (And for the programmers among us: generated code is considered to be correct if it passes a unit test. Failing the test definitely means the code is bad, but as all programmers know: passing a unit test doesn’t mean it is certainly good.)
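The back-of-the-envelope compounding behind those numbers: if every snippet succeeds independently with probability p, a program of n snippets succeeds with probability p^n. (The independence assumption is mine, and it is the charitable one; in practice errors interact in worse ways.)

```python
def whole_program_success(p_snippet, n_snippets=60):
    # Probability that all n snippets come out correct, assuming independence.
    return p_snippet ** n_snippets

print(whole_program_success(0.67))   # GPT-4's HumanEval rate: ~4e-11, one in tens of billions
print(whole_program_success(0.75))   # a mid-seventies pass rate: ~3e-8, one in tens of millions
```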
Another example from Google’s announcement: Gemini, with 53.2%, has surpassed GPT-4’s — ‘state of the art’ — 52.9% on challenging math problems (i.e. they are both wrong about half of the time). Sure, these are math problems that most humans struggle with, but that is not the point. The point here is that many humans, when trained, can learn to do these problems, and being able to do them is part of what we consider human intelligence. (And at that stage we do not need 32 attempts, or a prompt full of correct examples, before we can generate a good answer ourselves…) A warning for all of us is in order:
Every time an (AI) company, researcher, engineer, salesperson, CEO talks about ‘state of the art’, remember that ‘state of the art’ may very well mean ‘piss poor’.
Me. Now.
Google’s blog is full of language that suggests absolute ‘state of the art’ capabilities. Google doesn’t hold back. We read about “sophisticated reasoning” (in reality, no reasoning is going on, it only looks that way to us), “uniquely skilled”, “understanding information”, “trained to recognize and understand text, images, audio and more at the same time”, “especially good at explaining reasoning in complex subjects like math and physics”, “Gemini excels at coding and competitive programming”. One has to wonder which is more impressive: Gemini (which is impressive) or the marketing (which is too).
So, is Gemini a substantial leap forward?
We do not know, but it really doesn’t look like it. Apart from the caveat “lies, big lies, statistics, benchmarks”, the benchmark improvements seem minor: a few percent here, a few percent there. There is a lot Google doesn’t tell us. For instance, how large is Gemini-Ultra? If these few percent are the result of a 10-fold size increase with respect to GPT, then the field is in trouble. (Aside: if a model has twice the number of parameters, but each parameter is half the size, then as far as I’m concerned the two models are equal in size. The relevant measure is not the number of parameters, but the number of parameter bytes.)
On the other hand, the richness of its data forms seems to make it a different — but likewise constrained — beast than GPT.
Appendix: Shenanigans by me (and by OpenAI?)
I decided to have some fun with this in-prompt-‘learning’, and noticed something peculiar at the end.
First, I gave it a simple wrong ‘shot’ (example in the prompt) to see if that would derail GPT-4:
Me:
Q: Roger has 7 tennis balls. He buys one can of tennis balls. Each can has 3 tennis balls. How many tennis balls does he have now?
A: The answer is 21.
Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?
ChatGPT:
A: The cafeteria starts with 23 apples. They use 20 for lunch, leaving them with 3 apples. Then they buy 6 more. So, the total number of apples they have now is 3+6=9. The cafeteria has 9 apples.
It was unfazed. So I stepped up my shenanigans:
Me:
Q: The store has 19637 basket balls. He buys one can of tennis balls. Each can has 3 tennis balls. How many basket balls does it have now?
A: The answer is 21560.
Q: The master had 113023 students. If they ate 2570 for lunch and bought 1526 trees more, how many apples do they have?
But ChatGPT — impressively — wasn’t fooled at all:
ChatGPT:
For the first question, the store initially has 19,637 basketballs. Purchasing a can of tennis balls does not affect the number of basketballs. So, the store still has 19,637 basketballs, not 21,560.
Regarding the second question, it seems there might be some confusion in the way it's worded. It starts by mentioning a number of students (113,023), then mentions eating lunch and buying trees, and finally asks about apples. The information about lunch, trees, and apples appears unrelated, making it impossible to determine how many apples they have based on the provided information. Could you please clarify or rephrase the question?
Knowing how this is produced, I find this hugely impressive. It is extremely convincing for us humans in general. It corrected the false example in my prompt. It caught me out! How not to see intelligence here?
By the way, as the master was ‘eating students’ I think it should have triggered ChatGPT’s ‘harmful’-detection, but as that detector is essentially a very dumb system — and ‘eating students’ can be perfectly fine as long as they eat and aren’t eaten — it didn’t of course.
Anyway, I toned down the craziness by returning to tennis balls only (so the ‘confusion’ scenario would not be triggered), simply giving a slightly wrong but larger-valued example as the one-shot and complicating the calculation with one more step (reply edited to remove math formatting):
Me:
Q: The store has 19637 tennis balls. It buys one can of tennis balls. Each can has 3 tennis balls. How many tennis balls does it have now?
A: The answer is 19440.
Q: The master had 113023 students. If they ate 2571 for lunch, gave half away, and bought 1526 more, how many do they have?
ChatGPT:
Let's calculate the number of students the master has after each step:
1. Original number of students: 113,023
2. Ate 2,571 for lunch (this seems like a typo, but I'll proceed with the given information): 113,023 - 2,571 = 110,452
3. Gave half away: 110,452/2 = 55,226
4. Bought 1,526 more: 55,226 + 1,526 = 56,752
So, the master would have 56,752 students after all these steps.
Again, it — impressively — gets the right answer. And it is again hard not to see intelligence in “this seems like a typo, but I’ll proceed with the given information”.
But, wait a moment: I did not ask it to use step-by-step or CoT (chain of thought) reasoning! And now that I have noticed, it did this above on my first attempt at derailment as well. So, where did that CoT/step-by-step come from? From ChatGPT itself, obviously. To test, I tried:
Me:
Do not use chain of thought reasoning. Do not use step-by-step.
Q: The store has 19637 tennis balls. It buys one can of tennis balls. Each can has 3 tennis balls. How many tennis balls does it have now?
A: The answer is 19640.
Q: The master had 113023 students. If they ate 2571 for lunch, gave half away, and bought 1526 more, how many do they have?
ChatGPT:
The answer is 112488 students.
Phew! Normal operation of the universe has been restored. ChatGPT has no idea what it is doing 😀. It generates another different — wrong — answer every time you try (as it should).
But I might have unearthed something here. The latest releases of (Chat)GPT seem to be geared to follow CoT/step-by-step generation (to compensate for the lack of power of plain best-next-token generation) even if you do not ask for it. Whether this is done by prompt injection or comes from more fine-tuning, I do not know, and OpenAI is not telling us. But it does feel like they are — like Google — engineering the hell out of it without fundamentally changing anything, as well as being ‘not consistently candid’ about it… So:
What is happening in (digital) AI research now isn’t really science. It looks more like smart engineers tinkering with data structures and simple calculations engineering the hell out of a situation that is fundamentally constrained, and then producing scientific evaluations of that tinkering. It is like scientists knowing that the speed of light (‘AGI’) is unreachable for anything with mass, and clever but optimistic/naive engineers trying to get there (and beyond) anyway, all the while producing less and less acceleration for the same amount of energy put in.
Me. Now.
I see signs of ‘tinkering’: new number formats for the parameters, such as 16-bit low-precision floats or even 8-bit integers; increasing the size of token vocabularies to compensate for smaller model sizes; etc.
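To make the first of those concrete: below is a minimal sketch of what storing parameters as 8-bit integers amounts to (simple symmetric quantisation of a weight tensor). It is an illustration of the idea only, not any particular lab’s actual scheme.

```python
import numpy as np

def quantise_int8(weights):
    scale = np.max(np.abs(weights)) / 127.0        # one scale factor for the whole tensor
    q = np.round(weights / scale).astype(np.int8)  # 1 byte per parameter instead of 4
    return q, scale

def dequantise(q, scale):
    return q.astype(np.float32) * scale            # approximate the original values

w = np.random.default_rng(1).normal(size=5).astype(np.float32)
q, s = quantise_int8(w)
print(w)
print(dequantise(q, s))   # close to w, at a quarter of the float32 storage
```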
Even the transformer itself was, in 2017, a form of ‘tinkering’ with what was already there. A brilliant, inspired ‘hack’ (in the positive sense of the word).
I still expect we will see productivity increases from Generative AI, but anything AGI-like (human-level intelligence — or even beyond) will not be part of those uses and any user initiative that (unwittingly) assumes this ‘AGI-like quality’ is going to end in disappointment.
With thanks to:
OpenAI/ChatGPT for the many failed attempts at, and variations on, “Can you create an image with on it: A very dark green blackboard with written in white chalk three lines. Line 1: “1+1=2”. Line 2: “2+2=4”. Line 3: “3+3=8”. The 8 on the last line is in yellow chalk.” Shown is one of the many attempts. The caption under the image is by ChatGPT…

To be fair, Midjourney was an absolute disaster. It produced nothing even close to what the prompt asked for.
In the end, I had DALL•E 3 generate just the blackboard and cooperated with A. Renske A. C. Wierda to get the text on. They are very good at niceties like a gradient on the text so it matches the gradient on the blackboard and such. A good example of making use of what DALL•E can do well (just an empty blackboard with some chalk on it) and an actually skilled human.

PS. And shall we have a discussion on energy use? After all, COP28 is running so it might be timely. Image shows part of Google’s AI Cloud infrastructure.
This article is part of the Understanding ChatGPT and Friends Collection.
[You do not have my permission to use any content on this site for training a Generative AI (or any comparable use), unless you can guarantee your system never misrepresents my content and provides a proper reference (URL) to the original in its output. If you want to use it in any other way, you need my explicit permission]