Generative AI ‘reasoning models’ don’t reason, even if it seems they do

I do not know much about the actual internals of models like GPT. We know — since Vaswani (2017), yes, it has already been 8 years… — the basics, of course:

LLMs do not work on words; it only seems that way to humans. They work with tokens. In case you do not really know what a ‘token’ is: it is not always a word (or anything that you would recognise as a linguistic element). A token is mostly a variable-length fragment of characters, and these fragments can be glued together to create any text. LLMs work with a limited set of them: GPT3 had a fifty-thousand-token ‘dictionary’ (again a somewhat misleading term, as for humans a dictionary contains ‘meaningful words’). Current token dictionaries may be 100k or 200k in size. Watch this video or read this summary if you want the whole story. Or read OpenAI’s own short explanation and play with the tokeniser there. Confusing tokens with words misleads a lot of people.

Some tokens represent entire words, but if a word is not available as a token it is stitched together from a sequence of multiple tokens. For instance, in GPT3 the word ‘paucity’ was made up of three tokens. In GPT3.5 and GPT4 it is made up of three tokens as well, but split differently, and GPT-4o changes the split again. Grok-1 has ‘paucity’ as a single token.
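If you want to see this for yourself, here is a minimal sketch using tiktoken, OpenAI’s open-source tokeniser library. I am assuming the usual mapping of encoding names to model generations (r50k_base for the GPT3 era, cl100k_base for GPT3.5/GPT4, o200k_base for GPT-4o); whatever splits it prints come from the library, not from me.

```python
# Inspect how different token dictionaries split the same word, using
# OpenAI's open-source tokeniser library 'tiktoken'.
import tiktoken

for name in ["r50k_base", "cl100k_base", "o200k_base"]:
    enc = tiktoken.get_encoding(name)
    ids = enc.encode("paucity")
    pieces = [enc.decode([i]) for i in ids]
    print(f"{name}: {len(ids)} tokens -> {pieces}")
```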

What the model does is break the initial prompt (your input) into tokens and then repeatedly use whatever has come before (both your initial prompt and what it has added to that itself) and produce a new token. So: every newly generated token is added to what is already there until the model signals it is done by providing an ‘end’ token. From my initial EABPM talk in 2023:

This mechanism is called autoregression. So, what these models do is not calculate the best next word; they calculate the — from a human’s perspective meaningless — ‘best’ next token. ‘Best’ here means ‘randomly selected from a set of likely best next tokens’. In the example above: after having produced the p token, the ity token gets a low likelihood score (‘pity’ is a valid word, sure, but that token isn’t statistically very likely given what has come before), and most others out of the tens of thousands of possible tokens, like the x or if tokens, score even lower (strings like ‘px’ or ‘pif’ don’t happen much in the training material, especially not in this context of the other tokens that have gone before).
By increasing the randomness (you can, see my talk — from 2023 but fundamentally still valid) you can get more creative output but also more gibberish.
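To make the mechanism concrete, here is a toy sketch of that loop: score every token in the dictionary given everything so far, sample one of the likely candidates (the randomness), append it, repeat until the end token appears. The score_next_tokens function and the vocabulary are hypothetical placeholders, not any real model.

```python
import random

# Toy autoregression loop: score, sample with some randomness, append, repeat.
# score_next_tokens() is a hypothetical stand-in for the real model.
END = "<END>"

def generate(prompt_tokens, score_next_tokens, vocab, top_k=5, max_steps=200):
    context = list(prompt_tokens)
    for _ in range(max_steps):
        scores = {tok: score_next_tokens(context, tok) for tok in vocab}
        likely = sorted(scores, key=scores.get, reverse=True)[:top_k]
        next_token = random.choice(likely)   # the throw of the dice
        if next_token == END:
            break
        context.append(next_token)
    return context
```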

Basically, a good short definition of an LLM is:

LLMs understand statistical relations between tokens (not necessarily words) and with that they can approximate the result of real understanding to a certain degree (which sometimes can be amazingly high, and sometimes frustratingly low). LLMs are generally embedded in a complex landscape of other systems that partly mask this from being all too visible.

Some of those who think LLMs are a step towards true Artificial General Intelligence (AGI) are convinced that this approximation — if the mechanism is scaled up enough — becomes indistinguishable from real intelligence. Specifically, they think that even logical reasoning — a main weak point of LLMs, as logic runs at the semantic level while tokens are meaningless from a human perspective and sit below even the grammar level — will become possible. The thought sometimes simply is: we humans can do logic and we too employ neural nets, so it must be possible. (This is not the first time such reasoning has been used: when AI really started in the late 1950s, it was thought that since neurons ‘fire’ they are ‘digital’ on/off signals. This is false; neurons are very analog, yet the current ‘neural nets’ tend to be built on that digital perspective.)

But in practice, LLMs doing logic is like trying to do logic based on the ink distribution of a printed logical argument. Here is an example from my 2023 EABPM talk where you can see how token generation has a hard time being logical:

These simple failure examples are now harder to reproduce, as much development has gone into approximating better and hiding the token-generation game inside a complex web of systems. But fundamentally, the weaknesses are still there.

Others think that scaling is not enough and something extra is required. Often this suggestion takes the form that we need ‘neurosymbolic’ systems, where neural nets such as LLMs are integrated with systems that employ ‘symbolic logic’, i.e. the logic we employ when we, for instance, deduce that if A is larger than B and B is larger than C, then A is larger than C. People think this because this kind of logic is extremely hard to do through the approximation of LLMs.

Most people intuitively agree that we may call any mechanism that generates output we find indistinguishable from what known (human) intelligence can generate simply ‘intelligence’. The Turing Test is an example of deciding along those lines (but it fails to test this because it turns out that human intelligence is quite easily fooled). I have said this on several occasions: the most interesting thing we are going to learn during this part of the Digital Revolution is probably about our own intelligence (and its strengths and above all weaknesses).

The essential calculation that happens for each token that gets selected as the next token

At its core, an LLM calculates one ‘next token’ at a time. It does that by calculating a sort of ‘fitness’ score for all possible tokens. These ‘possible tokens’ make up the ‘token dictionary’. So, for a token dictionary of size 50k, the model calculates 50k times (once for each possible token) how good its score is as ‘next token’. This produces a set of ‘best tokens’, from which one is selected with some randomness. That randomness is what makes the LLMs creative. The more randomness you allow (this is called the ‘temperature’), the more creative the model becomes. But the model makers limit this, because if you set the temperature too high, it fails to approximate good grammar. You can see an example in my original talk. They limit this temperature for a reason: regardless of the meaning of what the model puts out, if it fails at basic grammar it becomes far less convincing for us humans. (We can produce good meaning with bad grammar, but the LLMs probably cannot — I never tried.)

One of the potential tokens that can be generated is the <END> token, which signals ‘stop generating’.
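For the technically inclined, here is a minimal sketch of how that ‘fitness score for every possible token, plus randomness’ step can look. The tokens and scores are made up for illustration (a real dictionary has tens or hundreds of thousands of entries).

```python
import math
import random

# Turn raw per-token scores over the dictionary into sampling probabilities.
# 'Temperature' controls the randomness: higher flattens the distribution
# (more creative, more gibberish), lower lets the top-scoring tokens dominate.
def sample_token(logits, temperature=0.8):
    scaled = {tok: s / temperature for tok, s in logits.items()}
    m = max(scaled.values())
    weights = {tok: math.exp(s - m) for tok, s in scaled.items()}   # softmax
    total = sum(weights.values())
    probs = [weights[tok] / total for tok in logits]
    return random.choices(list(logits), weights=probs, k=1)[0]

# Made-up scores for a handful of tokens, including the 'stop' token.
made_up_scores = {"auc": 3.1, "ity": 0.4, "x": -2.0, "if": -2.5, "<END>": -4.0}
print(sample_token(made_up_scores, temperature=0.8))
```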

Note: what people tend to call ‘hallucinations’ aren’t errors at all. They are the ‘perfect’ outcome of the above process and, as calculated, they are the best generation the model could do. The model approximates what the result of understanding would have been without having any understanding itself. It doesn’t know what a good reply is, it knows what a good next token is, and there is a fundamental difference. It produces good ‘best next tokens’, but these may create a ‘hallucination’. Much work has gone into minimising this effect, but as it is fundamental to the architecture it is probably impossible to get rid of it fully (though at the expense of much processing power you can minimise it, but only for certain types of inference — see ‘reasoning models’ below).

Scale factors of ChatGPT and Friends

With the usual digression out of the way, and before we discuss what the ‘reasoning models’ do and what that means, let’s set the stage. Let’s start with a high-level look at a Large Language Model. We go step by step through an overview of the elements and how they can scale. This is the absolute basic idea of an LLM architecture in a diagram:

The green and grey elements in the diagram are data, the yellow element is behaviour. H3 stands for “Helpful, Honest, Harmless”. The reality these days is much more complex, but as a basic mechanism, this works. The key elements that can scale are:

  • The complexity of the model. The transformer works not on words — as is often sloppily explained — but on vectors (sets of numbers) that represent tokens. Each token is represented by a vector, the size of which (say a set of 12,000 values for GPT3) is one dimension and the number of different tokens (say, 50,000 for GPT3) is another dimension.
  • The volume (not the number) of parameters. Training an LLM ends in a large set of bits that stores a fuzzy representation of all the different token patterns that the training has encountered. The number of parameters follows from the complexity of the model (e.g. layers, etc.); their size depends on the precision of the calculations in which they are employed. There has been a move towards ‘more but less precise’ parameters, in part because that is more efficient. It does mean the algorithms have become ‘larger’. So: smaller parameters, but more statistical algorithm. Even a simple measure such as ‘number of parameters’, which is often quoted, is foggy. After all, there is quite a difference between a 32-bit ‘floating point number’ (float32) and — what Gemini is (partly) using — a simple 8-bit value. For now, the best measure to compare model sizes may be the parameter volume (in bytes) in combination with the number of parameters, the latter a measure for algorithm size (see the back-of-the-envelope sketch after this list).
    What has also happened is that a single set of parameters hides multiple separate approximation approaches (the ‘mix of experts’ approach). Such optimisations can make models more efficient, for instance by deciding at some point not to activate several of these experts at all. DeepSeek, for instance, is a model that does this.
  • The volume (and quality) of the Training Material. This must be multiplied by how often this training material is used during training. Here too there are some ‘engineering the hell out of it’ gains, like having models piggyback their training on other models (again something DeepSeek has done).
  • The size of the token dictionary. How many possible tokens there are. GPT3 worked with around 50k. The latest models from OpenAI have a token dictionary of around 200k. The more potential tokens there are, the more fine-grained statistical relations between them are in play.
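To make the ‘parameter volume versus parameter count’ point from the list above concrete, here is a quick back-of-the-envelope sketch. The 175 billion is GPT3’s widely reported parameter count; the precisions are simply the common options.

```python
# Same parameter count, very different parameter volume depending on precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}

def volume_gb(n_params, dtype):
    return n_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"175e9 parameters at {dtype}: {volume_gb(175e9, dtype):.0f} GB")
# float32 -> 700 GB, float16 -> 350 GB, int8 -> 175 GB
```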

When we train the model, trillions of ‘tokens’ (how text is represented in these models — not in a way we humans do by the way, see elsewhere in this series) are turned into billions of (bytes of) parameters. In a way, this can be seen as a form of compression of the training material, but not literally. You cannot decompress it to get the training data, but you can generate new material based on the statistical relations between tokens in the training material. If the parameters represent the training material exceptionally well, you can get memorisation (or data leakage — these are the same) of the training material.

When we use the model, the training material is not available, only the parameters and the input are. That input, starting with your ‘prompt’, is the context for generating the next plausible token. Many tokens may be plausible at any point, and with somewhat of a throw of the dice, one is chosen. This generated token is then added to your existing context (your prompt and everything already generated so far), after which the procedure repeats until the next plausible token is an <END> token, which signals that generation should end, as was shown above. This was the situation when ChatGPT hit the world and blew everyone away.

In essence, this basic mechanism has remained the same and so have the limitations of approximating the results of understanding without actually understanding. But OpenAI, Google, and others have been engineering like crazy to improve on the often poor (but still convincing) results.

The technical improvements beyond 2020’s GPT3

I am not privy to what goes inside OpenAI, Google, etc., but from what I know of LLMs and what we can see, I can make some educated guesses. I am presenting a few here that are (almost certainly) fact. Warning: I am somewhat assuming that you have read/watched my earlier explanations so words like ‘context’ are understood in their LLM-meaning, not the general meaning. You can glance over them, but if I lose you, the necessary background is elsewhere in my writing and presentations.

1. Context size (via the ‘turbo’ models to the large context sizes of today)

The size of the context — how much of what has gone before you still use when selecting the next token — is one of the first things that went up. The OpenAI models with a larger context window were originally called ‘turbo’ models. These were more expensive to run as they required much more GPU memory (the KV-cache) during the production of every next token. I do not know what they did exactly, of course, but it seems to me that they must have implemented a way to compress this into some sort of storage where not all values are saved, but a fixed set that doesn’t grow as the context grows and that is updated after every new token has been added. To me this smells like a solution that effectively passes on a gigantic ‘state’, not unlike the Long Short-Term Memory (LSTM) from the Recurrent Neural Networks (RNNs) that gave birth to the Transformer architecture (Vaswani’s Attention [to context GW] is all you need). There, LSTM was a huge bottleneck for model growth during training, but during generation this should not be a problem. This state may either come from a simple but inspired trick (like the transformer architecture itself), or it may even be driven by a neural net that is itself trained to create the optimal state. Both, I can imagine. Which one? I do not know. But context size has ceased to be the GPU memory bottleneck that it originally was.
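To illustrate only the ‘fixed set that doesn’t grow’ idea (and emphatically not how OpenAI actually did it), here is the simplest possible sketch: a sliding window that evicts the oldest key/value pair whenever a new one is added.

```python
from collections import deque

# Purely illustrative: a fixed-size cache whose oldest entries fall out
# automatically when a new token's key/value pair is added. Real systems
# are far more sophisticated than this.
class FixedSizeKVCache:
    def __init__(self, max_entries):
        self.keys = deque(maxlen=max_entries)
        self.values = deque(maxlen=max_entries)

    def add(self, key_vec, value_vec):
        self.keys.append(key_vec)
        self.values.append(value_vec)

    def state(self):
        # The fixed-size 'state' that gets passed along for the next token.
        return list(self.keys), list(self.values)
```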

2. A bigger token dictionary

This is a small one, but it increases the granularity of available tokens for the next token and with that the number of potential tokens to check at each round. Going from a token dictionary of 50k tokens to one of 200k tokens means that at each step four times as many token scores have to be calculated. We’ll get to the interrelations of the various dimensions and sizes below.

3. Parallel/multi-threaded autoregression

Generating means calculating a set of potential ‘best next tokens’, randomly selecting one of these, adding that token to what has been put in and generated so far, and repeating this until the <END> token is generated. One of the improvements has been to do several different generations in parallel. If you have multiple generations, you can compare them over a longer stretch. One generation thread may come up with an overall higher average fitness score for all its tokens so far, even if the first token of that stretch has a slightly lower fitness. And if you delay outputting until all threads have the same token at a given position, or you prune those threads that run into ‘low fitness score’ territory, you can give your generation a kind of useful memory. Of course, every extra parallel thread increases the number of calculations that are done. In a picture (with, for completeness, the ‘system prompt’ added — and reality will be different from the simple picture here):

Definitely not how it exactly goes, but good enough to present the idea

It is not essential to understand how they calculate ‘token thread’ fitness on top of ‘token fitness’. There are two ways this may have been implemented. One is at token level, which is simple (just have a formula using the token fitness values). But as this article describes it, it is an extension of Chain of Thought, the approach pioneered in prompt engineering of asking the model to go ‘step by step’. Making the model generate text that shows step-by-step reasoning has a minimising effect on the typical errors (confabulations/‘hallucinations’/failed approximations) that LLMs tend to make. E.g. adding ‘do it step by step’ to a simple calculation that went wrong (such as the one shown above) makes it produce fewer errors. These approaches have one thing in common: use the model multiple times to create multiple continuations in parallel, then have some sort of mechanism to select one of all these continuations.
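To show what the simple ‘token level’ variant could look like, here is a minimal sketch that keeps several threads, scores each thread by the average fitness of its tokens so far, and prunes the weakest each round. The expand() function is a hypothetical stand-in for the model proposing (token, fitness) candidates.

```python
# Keep several partial continuations ('threads'), score each by the average
# fitness of its tokens, prune the weakest. expand() is hypothetical.
def run_threads(prompt_tokens, expand, n_threads=4, steps=20):
    threads = [(list(prompt_tokens), [])]        # (tokens so far, fitnesses)
    for _ in range(steps):
        candidates = []
        for tokens, fits in threads:
            for token, fitness in expand(tokens):
                candidates.append((tokens + [token], fits + [fitness]))
        if not candidates:                       # nothing more to expand
            break
        # thread fitness: average token fitness over the whole stretch so far
        candidates.sort(key=lambda c: sum(c[1]) / len(c[1]), reverse=True)
        threads = candidates[:n_threads]         # prune low-fitness threads
    return threads[0][0]                         # best thread wins
```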

For those somewhat knowledgeable about early AI: this introduces heuristics (tree search — the typical 1960s AI approach that ran into the wall of combinatorial explosion) to LLMs. I suspect this tree search has already been built into the models at token-selection level (this may even have happened with GPT4 already; it is such a simple idea and quite easy to implement after all). The Tree of Thoughts is more a prompt engineering approach, but these too can lead to new internals, which brings us to OpenAI’s o1/o3 models, which have been created in an attempt to become better at true/false reasoning.

4. Fragmentation/optimisation of the parameter volume and algorithm: Mix of Experts

The ‘set of best next tokens’ to select from needs to be calculated, and as an efficiency improvement some models (we know it for some; it may have been why GPT4 was so much more efficient than GPT3) have been fragmented into many sub-models, all trained/fine-tuned on different expertises. Using a ‘gating network’, the system determines which fragments should be activated during a computation. Then only a few are activated to calculate the ‘set of best next tokens’, from which a token is randomly added to the result so far, after which autoregression (see above) takes over again. This seems pretty standard optimisation, but apart from some already noted issues from research (like here) I suspect that the gating approach may result in some brittleness. Still, it works. Mix of Experts also sounds a lot better than ‘model fragmentation’, so that works too…
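A minimal sketch of that gating idea follows; the gate_scores() and expert functions are hypothetical placeholders, not any real model’s internals.

```python
# Score all experts for the current context, activate only the top few, and
# combine their token scores weighted by the gate.
def moe_token_scores(context, experts, gate_scores, top_k=2):
    gates = gate_scores(context)                  # one score per expert
    ranked = sorted(range(len(experts)), key=lambda i: gates[i], reverse=True)
    combined = {}
    for i in ranked[:top_k]:                      # only a few experts run
        for token, fitness in experts[i](context).items():
            combined[token] = combined.get(token, 0.0) + gates[i] * fitness
    return combined                               # the 'set of best next tokens'
```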

Which long preparation brings us to the point:

The ‘reasoning models’ … aren’t reasoning at all

All these scalings are nearing a point of exhaustion. There is no more data to train on. Increasing parameter count (algorithm complexity) or parameter volume can be done, but the gain is ever less. However, apart from brute force scaling, a separate trick has been added. This trick has led to the so-called ‘reasoning models’.

As became clear rather quickly, approximating correct grammar from token statistics is relatively easy, approximating meaning/understanding from token statistics is harder, and approximating logical reasoning from token statistics is especially hard. The reasoning-model approach (e.g. OpenAI’s o3) has been to train (fine-tune) the model not so much on the generated final result in one go, but on the (intermediate) form of a step-by-step way to get to a result. The inspiration for this came from people discovering that if you put ‘reason step by step’ in the prompt, the model approximated better, as it started to approximate the form of reasoning steps, which made it less prone to going astray.

So, instead of a model trained to be the best approximation of the form of good results directly, it has (additionally) been trained on the form of each fragmented ‘step’ towards a good result, specifically for domains where this reasoning can help. That is why it will score better on a math olympiad test or the ARC-AGI-1-PUB benchmark, as these are specific domains where analytic steps of a certain form help you to provide a better answer. In some areas the ‘reasoning models’ do worse, though, as Sam Altman has said. There is also a price to pay: reasoning models like GPT-o3 explode in compute use. Which is not weird, given all that extra ‘form’ they have been trained on.
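To make the difference concrete, here are two purely illustrative training records, contrasting ‘final result only’ with ‘the form of intermediate steps’. The record layout is my assumption for illustration, not OpenAI’s actual fine-tuning format.

```python
# Illustrative contrast: training on the final result versus training on the
# form of step-by-step intermediate results.
direct_example = {
    "prompt": "What is 17 * 24?",
    "completion": "408",
}

step_by_step_example = {
    "prompt": "What is 17 * 24? Reason step by step.",
    "completion": (
        "17 * 24 = 17 * 20 + 17 * 4. "
        "17 * 20 = 340 and 17 * 4 = 68. "
        "340 + 68 = 408. The answer is 408."
    ),
}
# Fine-tuning on many records like the second one teaches the model to
# approximate the *form* of such steps, not the reasoning itself.
```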

Fundamentally, the way to describe the ‘reasoning models’ is:

What has been added is a layer of indirection when approximating. Not reasoning.

The indirection doesn’t even need to be ‘reasoning’ at all. It has been so far, as reasoning problems were the issue they were addressing, but theoretically it could be quite something else, though within the constraints of something that has repeatable patterns.

A big downside of indirection is that it is really expensive. Note especially: the extra cost for ‘reasoning models’ is not mostly training cost, it is mostly running cost. If Sam Altman says that even at $200/month and a limit on queries they are making a loss: believe him (this is one of the rare times you should actually believe him). This cost explosion is a real problem. GPT-o3’s remarkably good score on ARC-AGI-1-PUB seems to have come at a cost of $3000/task, where a human does it at $5/task… Another sign that the scaling limit of the current transformer approach is close.

There is another catch as well. For each domain, the form of these ‘reasoning steps’ — and it is that form that is approximated — is different. A model optimised on the form of steps useful for scoring well on a Math Olympiad won’t do well on ARC-AGI-1-PUB. So, we’re entering the realm of narrow solutions like AlphaGo, which plays master-level Go very well (but could be fooled by bad/amateur moves 😀) yet doesn’t play chess and cannot do math problems. However, with the above-mentioned Mix of Experts (i.e. model fragmentation) approach you can of course create an ever larger set of these capabilities, which overall is pretty efficient, and this actually resembles the way humans too have a set of skills that are not all used at all times. The question that remains open (and about which I am not optimistic) is how much this ‘set of narrows’ will approximate ‘general’. And whether the energy/cost payoff is worth it (probably not). The limited improvement of GPT4.5 — which probably employs the Mix of Experts (model fragmentation) efficiency trick quite heavily — compared to GPT4 seems to bear this out.

The main reason I am not optimistic about ‘general’ coming from a large set of such indirect-narrows is the observation that, while creating better approximations of certain domain-specific forms of reasoning steps, at the root we’re still approximating on the basis of token statistics. Just much, much more of it, on more ‘non-understanding’ parts of text. The ‘reasoning models’ still do not understand. In the same way that the general models do not ‘understand’ but ‘approximate’ correct results, the ‘reasoning models’ do not understand but approximate the form of reasoning steps. And really, that still isn’t reasoning, even if it were to become a very good approximation of what reasoning would deliver.

None of these are in any way a fundamental shift, they’re engineering the hell out of a fundamentally constrained approach. And the labels given to these approaches are often prime examples of ‘bewitchment by language’ (cf. Uncle Ludwig).

We should add a question: if the approximation of reasoning is good enough, isn’t that as effective as reasoning? And the answer can be yes. It is yes when the reasoning isn’t sensitive to small errors. It is no when the outliers, the small errors, become essential. But there is a deeper problem: reasoning in itself is a fundamental capacity that is independent of the actual subject. It is thinking in true/false and using that certainty to build edifices. Approximation of the steps of reasoning remains ‘domain-dependent’. It isn’t that ‘free from time and space’ capability. It is for that reason that — even if in a certain domain it is as effective as reasoning — it isn’t reasoning.

In the end, Artificial General Intelligence (AGI) and reasoning remain red herrings. They are mostly irrelevant in the debate, even if ‘narrow’ productive tools can be created this way. The true effect of Generative AI doesn’t have much to do with AGI, with being trustworthy, or with not having hallucinations. It seems to be a set of specific ‘narrow’ reliable uses and a wide area of ‘good enough’ (cheap) results.

It seems a waste of time to have to repeat this over and over: forget AGI, it is a red herring in the current discussion on Generative AI. ‘AGI’ discussions distract us from the actual things Generative AI can do (and is doing), which is not AGI but producing incredible volumes of cheap, not-quite-trustworthy material, amongst which will be some real gold nuggets. A bit like the internet itself or like social media: a lot of low-quality noise, garbage and junk, but with some real jewels sprinkled inside it. A lot of junk exists on YouTube, but also some incredibly good material. So don’t disregard Generative AI, and especially don’t disregard the way it will be able to damage factuality (just like social media has been able to do). Because even the so-called ‘reasoning models’ fundamentally understand zilch. Zippo. Nada. Niets. Nothing. And nothing has even been proposed (let alone implemented) that truly bridges that gap.

Generative AI delivers a new mode of ‘intelligence’, and trying to explain it with only the old categories is never going to work.

If you like this story or think it is a useful contribution, please share it.

This article is part of the ChatGPT and Friends Collection. In case you found any term or phrase here unclear or confusing (e.g. I can understand that most people do not immediately know what a ‘token’ is (and it is relevant to understand that), or that ‘context’ in LLMs is whatever has gone before, in human (prompt) and LLM (reply) generated text, before producing the next token), you can probably find a clear explanation there.

5 comments

  1. The fundamental weakness of your argument is this:

    “reasoning in itself is a fundamental capacity that is independent from the actual subject”

    That is not so. Human reasoning is not independent of actual subject at all.

    Pure logical reasoning is an ideal, and we arrived very late at that, during Euclid’s time in geometry, and it has been used only in the highly specialized math domain since then.

    Everyday human reasoning is based on heuristics, experience, and most important of all, frequent examination of the consequences of your actions.

    What machines need is not “true reasoning”, but the ability to check if they are on track, and if not, restart or search around.


    1. You’re making a good point. However, the logical part (i.e. our discrete reasoning sense) might predate its pure use as well as written history. I agree that we’re almost entirely an estimating species (and use automations for that, which we can call convictions, assumptions, beliefs, etc.); however, there is a difference between actually having that (integrated) very small bit of pure logic (discrete, true/false) and not having it.

      We also are in Uncle Ludwig territory here. “Reasoning” and “pure logic” aren’t the same thing, agreed, (so I must put in a warning, I guess) but in the LLM space “reasoning models” are meant to use logical steps (if then, and such). “Checking if they are on track” begs the question what “checking” entails as you probably very well know, but let’s leave that for another time, shall we?

