Cicero and chatGPT — signs of AI progress?

Cicero, a model by Meta (facebook), was able to beat humans at one of the best board games in the world, Diplomacy — a very impressive feat. More on that below. Let’s move to chatGPT first, the language model that even hit the national evening news here (link, around 13:45 in). AI students were interviewed, and they were (a) impressed and (b) convinced this was ‘only the beginning’. Their professor was more guarded. But clearly, her arguments were no match for student imagination. The tech entrepreneur saw limitless opportunities and said it was ‘bigger than the launch of the iPhone’. The journalist talked about chatGPT being able to write code or a new Harry Potter novel.

TL;DR

chatGPT is making the rounds, amazing people with the fluency and intelligence of its prose. And Cicero has been able to beat humans in Diplomacy, a game of negotiation, tactics and strategy, subterfuge and so forth. So, clear signs of progress on the road to Artificial General Intelligence?

Not really. chatGPT even tells us so if you ask it. It tells us that it can create ‘well-formed’ sentences, but “I am not capable of evaluating the quality of a written paragraph”. In other words: don’t trust me, I can only prevent the most obvious errors. And rightly so. What chatGPT produces may always look right, but it is often subtly wrong.

Cicero, on the other hand, is an interesting combination of multiple models, engineered into a program that can handle Diplomacy well in a very restricted setting. That is really beautiful design and engineering at the limits of what can be done, but it doesn’t say much about progress in AI itself. Meta’s FAIR team has engineered the hell out of fundamentally unreliable parts (though, in their setting, they were able to make one approach less unreliable). Impressive engineering, but not a fundamental change.

But what chatGPT shows us is that the power of computers to create believable nonsense is growing. And no doubt, as Gary Marcus has warned us, evil actors will already be salivating over how they can fool people with this new stuff, or simply drown trustworthy information in a deluge of unreliable information. Because it is not so much that these algorithms are intelligent, but more that we — easily fooled humans — are not.

Let’s first have a look at a conversation I just had with chatGPT, the chatbot based on GPT. I was actually busy writing another post on learning lessons from a trend of the past, and I thought: why not see if all that training data behind chatGPT can actually help me out.

Chatting with chatGPT

GW: What are the hottest technology trends at the moment?

chatGPT: I’m sorry, but as a large language model trained by OpenAI, I am not able to browse the internet or access current information. I am only able to provide information based on what I have been trained on, which has a fixed cutoff date. My training data only goes up until 2021, so I am not able to provide information on current technology trends. Is there something else I may be able to help you with?

Fine, fine. I’ll not ask about 2022, then, but about your training cutoff date:

GW: What were the hottest technology trends in 2021?

chatGPT: I’m sorry, but as a large language model trained by OpenAI, my training data only goes up until 2021, so I am not able to provide information on technology trends in 2021. Is there something else I may be able to help you with?

OK, ok. I misunderstood. “Until 2021” apparently did not include 2021.

GW: What were the hottest technology trends in 2020?

chatGPT: As a large language model trained by OpenAI, I don’t have any information on specific technology trends in 2020, as my training data only goes up until 2021. However, some general technology trends in recent years have included the increasing use of artificial intelligence and machine learning, the growth of the Internet of Things (IoT), and the proliferation of mobile devices and smartphones. These trends have continued to evolve and develop, and have had a significant impact on various industries and aspects of daily life. Is there something else I can help you with?

Wait, what? I understand you’re no longer sorry (as 2020 is before 2021), so there is no need to apologise that your data ends in 2021. But AI/ML, IoT, and the proliferation of mobile devices are what you mention as ‘the’ trends of the last years? Anyway, let’s make it easy for you:

GW: What were the hottest technology trends in 2016?

chatGPT: As a large language model trained by OpenAI, my training data only goes up until 2021, so I don’t have information on specific technology trends in 2016. However, some general technology trends in recent years have included the increasing use of artificial intelligence and machine learning, the growth of the Internet of Things (IoT), and the proliferation of mobile devices and smartphones. These trends have continued to evolve and develop, and have had a significant impact on various industries and aspects of daily life. Is there something else I can help you with?

Um. “I can’t say anything about 2016 because my training data ends in 2021”? Furthermore, word for word the same answer as the one about 2020. Note, these interactions were back to back. After this, I am not much in doubt that this is software. Let’s try a different tack and ask about the technology I was writing about:

GW: So, what about blockchain?

chatGPT: Blockchain is a type of distributed ledger technology that allows a network of computers to maintain a continuously growing list of records, called blocks, in a secure and verifiable way. Each block contains a timestamp and a link to the previous block, creating a chain of blocks that cannot be altered retroactively. This technology has the potential to revolutionize many industries by providing a secure and transparent way to track and verify transactions, contracts, and other types of data. It has been particularly prominent in the financial sector, where it is used to create and track cryptocurrencies such as Bitcoin, but it has also been explored for a variety of other applications.

OK, that is technically a correct way to describe blockchain. The second part is rather generic. Knowing a bit about the technology and its use, I can see that this is indeed a rather good summary of ‘what has been said most about it’. The problem here is that what has been said most often about it is of pretty poor quality (hype), hence the ‘prominent’ and ‘potential to revolutionise’ vibe.
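
As an aside, the ‘chain of blocks that cannot be altered retroactively’ from chatGPT’s answer is at heart a simple data structure. Here is a minimal sketch in Python (my own illustration, nothing from chatGPT or from the conversation):

```python
import hashlib
import json
import time

def make_block(data, prev_hash):
    """A block commits to its data, a timestamp, and the previous block's hash."""
    block = {"timestamp": time.time(), "data": data, "prev_hash": prev_hash}
    block["hash"] = hashlib.sha256(
        json.dumps(block, sort_keys=True).encode()).hexdigest()
    return block

genesis = make_block("genesis", "0" * 64)
successor = make_block("some transaction", genesis["hash"])

# Tamper with the earlier block and the link to it no longer verifies:
genesis["data"] = "forged"
check = {k: genesis[k] for k in ("timestamp", "data", "prev_hash")}
recomputed = hashlib.sha256(json.dumps(check, sort_keys=True).encode()).hexdigest()
print(recomputed == successor["prev_hash"])  # False: the alteration is detectable
```

Let’s see if we can prompt chatGPT to be a bit more critical: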

GW: Can you write me an opening paragraph for an article critical of blockchain?

chatGPT: Sure, here is a potential opening paragraph for an article critical of blockchain:

Blockchain technology has been hailed as a revolutionary development that has the potential to transform a wide range of industries. However, despite its many proponents and the widespread adoption of blockchain-based systems, there are also significant drawbacks and limitations to this technology that must be considered. In this article, we will examine some of the key criticisms of blockchain and explore why this technology may not be as transformative or game-changing as some have claimed.

Impressive, right? But do a search-and-replace of ‘blockchain’ with almost any other technology term and the paragraph is still acceptable. In fact, what we see here is something where the impressiveness probably comes from us humans being impressionable. I recall a friend telling me years ago about some person claiming ‘psychic’ insight into their issues. For example, this ‘psychic’ had asked them: “You’re having issues with your father, right?” They were amazed: how did he know? They hadn’t mentioned their father at all. But what happened, of course, is that the statement was by definition true, as they themselves filled in the details that made it true. Such manipulations by ‘psychics’ have been well documented. This is like that.

Anyway.

So I asked chatGPT to judge its own work.

GW: How good do you think that opening paragraph is?

chatGPT: As a large language model trained by OpenAI, I am not capable of evaluating the quality of a written paragraph. My purpose is to assist users in generating human-like text based on the prompts that I am given. I can provide suggestions and examples of text, but the ultimate quality and effectiveness of the text will depend on how it is used and the context in which it is presented.

And here you have it. chatGPT, when it talks about itself, is pretty clear: it cannot judge. It has no ‘common sense’. It can generate, but it cannot see what is good and what is not. This is the essential problem with these models in a nutshell. And chatGPT has been taught to make that clear. Note also that it is not just that the sentences are ‘well-formed’; they are also ‘fitting’, that is, their words and phrases have a statistical likelihood of being appropriate for a reply to the prompt.

GW: Let’s try again. Do you know anything about the year 2005

chatGPT: As a large language model trained by OpenAI, I don’t have any specific knowledge about the year 2005. My training data only goes up until 2021, so I am not able to provide information on events or developments that occurred after that date. However, I can provide some general information about 2005 if that would be helpful. For example, 2005 was the first year in which YouTube was launched, and it was also the year in which the Xbox 360 was released. These were significant events in the history of technology and media, and they have had a lasting impact on the way we consume and share information. Is there something else I can help you with?

Basically, this gives a pretty clear example of what is going on. What you get is stuff that is well-formed and that consists of fitting words and phrases from the most common answers. Well-formed means given in an order that makes sense grammatically and with the worst semantic problems weeded out. There is no real conversation going on (the constant repetition of exactly the same ‘robotic’ sentences is a dead giveaway), and there is some basic training material about itself that chatGPT spouts as soon as certain triggers are there, like the mantra about its training data, which it includes even when it is totally irrelevant, as in this case. It does seem, though, to have learned to discriminate between specific answers (which is best) and ‘general information’ (which is second best). So the answer is ‘I don’t have any specific knowledge about the year 2005’, after which it provides some specifics about 2005.
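
That ‘fitting words and phrases’ mechanism is easy to caricature. Here is a toy sketch (my own, and many orders of magnitude simpler than GPT, which conditions on long contexts rather than a single previous word): count which word most often follows which in a tiny corpus, then always emit the most frequent continuation.

```python
from collections import Counter, defaultdict

# Train bigram counts on a tiny corpus of 'things that are said about blockchain'.
corpus = ("blockchain has the potential to revolutionize many industries "
          "blockchain has the potential to transform many industries").split()

follows = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev][nxt] += 1

word, output = "blockchain", ["blockchain"]
for _ in range(6):
    word = follows[word].most_common(1)[0][0]  # most frequent follower
    output.append(word)
print(" ".join(output))  # fluent-looking, but nothing here 'knows' anything
```

The output reads fluently because the statistics of the corpus make it so; scale the same idea up massively and you get fluent, fitting text with the same fundamental blindness.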

I also tried a whole conversation about quantum computing (of which it must have digested tons of decent articles and books), and while the text constantly seemed OK, the subtle problems were there. For instance, chatGPT suggested that creating algorithms for quantum computing is a matter of trial and error, for which large quantum computers are required. It’s not: a quantum computing algorithm is a matter of inventing the math, and for the most famous algorithms, like Shor’s, this was done in the 1990s, long before we had even a single working qubit. I tried to get chatGPT to admit that there have hardly been any developments in algorithms for quantum computing, but it did not play ball. At one point I simply asked:

GW: Which quantum computing algorithms have been developed between 2010 and 2020?

chatGPT: There have been many quantum algorithms developed between 2010 and 2020, but some of the most significant ones include Shor’s algorithm, Grover’s algorithm, and the quantum approximate optimization algorithm (QAOA). Shor’s algorithm is a quantum algorithm for factoring large numbers that is much faster than any classical algorithm for the same task. Grover’s algorithm is a quantum algorithm that can search an unsorted database quadratically faster than classical algorithms. The quantum approximate optimization algorithm is a quantum algorithm for finding approximate solutions to combinatorial optimization problems, such as the maximum cut problem.

Notice that suddenly the defence “I’m just a language model trained by OpenAI and I don’t have information” is missing. Apparently this answer ranks highly as a good one. Is it?

QAOA is indeed from that period. But it is a hybrid quantum–classical algorithm. Still, acceptable, though its relative speedup is still not known, as far as I have been able to determine. Shor’s (1994) and Grover’s (1996) are old, and definitely not from that period. Besides, Shor’s may be an algorithm for factoring numbers, but so far the record still stands at 21 = 3 × 7, so what is ‘large’ doing there? My guess: the inability of researchers to make it large, and the desire to actually be able to factor large numbers (for breaking encryption), have been reflected a lot in the texts that underlie GPT’s answer. A most likely word order then becomes ‘a quantum algorithm for factoring large numbers’, so that is what we get. Here again, we see that errors may be subtle. The size of the numbers doesn’t matter for Shor’s, but the concept ‘large number’ appears a lot in the texts it has been trained with. There are also not ‘many’ algorithms that have been developed, let alone between 2010 and 2020. The answer looks right, is convincing for humans who do not have the facts, but is subtly wrong.
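
To put that record in perspective: the quantum part of Shor’s algorithm only finds the period r of a^x mod N; turning that period into factors of N is ordinary number theory, worked out on paper in 1994. A minimal sketch of that classical skeleton for N = 21 (my own illustration; the brute-force loop is exactly the step a quantum computer would replace):

```python
from math import gcd

def period_and_factors(N, a):
    """Find the period r of a^x mod N by brute force (the step Shor's
    algorithm delegates to a quantum computer), then derive factors of N
    classically. Assumes gcd(a, N) == 1."""
    r, x = 1, a % N
    while x != 1:
        x = (x * a) % N
        r += 1
    if r % 2 == 1 or pow(a, r // 2, N) == N - 1:
        return None  # unlucky base a: pick another and retry
    return gcd(pow(a, r // 2) - 1, N), gcd(pow(a, r // 2) + 1, N)

print(period_and_factors(21, 2))  # (7, 3): the experimental record, 21 = 3 × 7
```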

For the rest, the answers are often extremely frustrating. I want to ask something specific about the 2010–2020 period, and it keeps telling me it cannot give me up-to-date information beyond its training data cutoff date (which we already know about). Basically, it is as dumb as a doorknob. Let’s finish with this short exchange at the end of the sessions:

GW: Can you give me any information that is actually trustworthy?

chatGPT: As a large language model trained by OpenAI, I have been trained on a massive corpus of text data that includes books, articles, and other written materials from a variety of sources. This allows me to understand and generate human-like text on a wide range of topics, but it does not give me the ability to verify the accuracy or reliability of the information that I provide. I can provide general information and answer questions to the best of my ability based on the knowledge that I have been trained on, but I cannot guarantee the accuracy or reliability of the information that I provide. […]

At least it is clear that the word ‘understand’ is used pretty freely here. Not as we humans understand it.

The news item mentioned at the start of this article starts with asking chatGPT “What is Artificial Intelligence?”. And it answers: “Artificial Intelligence learns and functions as human intelligence by using advanced algorithms”, which is (a) ambiguous and (b) subtly wrong in multiple ways. But convincing, nonetheless. As often happens, the news item is full of exaggerations: GPT could write good code (Stack Overflow has banned its answers for looking right while being wrong too often), could write a new Harry Potter novel, and so on. A tech entrepreneur with eyes like saucers telling us that it is bigger than the introduction of the iPhone. That kind of thing.

GPT itself is actually pretty clear about what it is: not intelligent. It states that it does not have “the ability to verify the accuracy or reliability of the information that [it provides]” and that it is “not capable of evaluating the quality of a written paragraph”. But we humans see intelligence (and even consciousness) in that, because we always have to judge intelligence by what others say (and do). So, these models make us imagine intelligence. We see what we are looking for, and we’re easily convinced (or duped) that it is there. The issue is not so much that these machines are intelligent, the problem is that we humans are not.

The issue is not so much that these machines are intelligent, the problem is that we humans are not.

Me. Now.

Enter Cicero

When I read about Meta’s (facebook’s) Cicero beating humans at the game of Diplomacy, I was very impressed, and I still am. Diplomacy is a board game in which each player plays one of the 7 great European powers of the World War 1 era, manoeuvring their armies and fleets to conquer ‘supply centres’ (cities), each of which can support one army or fleet. If at the end of a ‘fall’ round you have conquered a supply centre, you may add an army or fleet, and the player who lost it has to remove one (and vice versa). The game is won when a single player holds more than half of the supply centres.
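
In code terms, the adjustment and victory rules described above come down to something like this (a sketch of the standard rules as I describe them; the total of 34 supply centres is that of the standard board):

```python
STANDARD_BOARD_CENTRES = 34  # supply centres on the standard Diplomacy map

def adjustment(units, centres):
    """Builds (positive) or disbands (negative) owed after the fall round:
    your unit count must end up equal to the supply centres you control."""
    return centres - units

def has_won(centres):
    """Solo victory: more than half of all supply centres, i.e. 18 of 34."""
    return centres > STANDARD_BOARD_CENTRES // 2
```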

There is no luck in Diplomacy. It is all a matter of negotiating with the other (human) players during set negotiation periods (some games are played with one round per week, and long ago the game was played via physical mail, with even longer periods for each move) and of being a smart tactician/strategist regarding your long-term objectives, your short-term opportunities and risks, and your reputation with the other players. Untrustworthy players are quickly thrown out. Extremely trustworthy players come far but never win. The best players, in my limited experience, are able to betray you while seeming to be innocent of that betrayal (e.g. by organising their own ‘failure’ in cooperation with others, or by being ambiguous enough for you not to lose trust in them). I know no game where you can learn as much, and as fast, about your fellow players’ characters as in a day of Diplomacy with 20-minute negotiation rounds. I never was a very good player myself, but I loved it when I played.

So, I decided to read the Science article to see what they had done. The article starts with:

A major long-term goal for the field of artificial intelligence (AI) is to build agents that can plan, coordinate, and negotiate with humans in natural language. Although much progress has been made in language models that imitate human language (1), effective negotiation agents must go beyond this by understanding the beliefs, goals, and intentions of their partner; planning joint actions that account for their partner’s goals; and persuasively and intentionally communicating these proposals.

Human-level play in the game of Diplomacy by combining language models with strategic reasoning (Science 22 November 2022)

The word ‘understanding’ being used very freely, again… Besides (nitpicking), you never know the beliefs and goals of others in the game; you make your own guesses about them based on what they say and do.

Having said that, Cicero basically is a hybrid of three very Diplomacy-specific sets of ML models (there are 11 models in play in total, according to the article…). A sketch of how the three stages fit together follows after this list.

  • The first — analysis — model, the dialogue-conditional action model, produces ‘a probability distribution over all possible actions of all other players’ based on the state of the board and its own dialogue history.
  • The second — planning — model uses the output of the first to generate intended moves, not only for itself but also for the other players it needs to cooperate with. It uses not only the normal approach of ‘behavioural cloning’ (action probabilities based purely on statistical analysis of many human players’ past dialogues and actions) but also tries to optimise for the expected value of a policy. This way, the model is less brittle with respect to correlations that aren’t causation.
  • The third — messaging — model then takes these intents, and again the dialogue history, to create the messages/conversations with other players. It contains, for instance, filters that have to remove nonsensical, incoherent, or inappropriate messages.
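
To make that division of labour concrete, here is a deliberately simplified sketch of such a three-stage pipeline. Every name and signature below is my own invention for illustration; the real Cicero (see the Science article) combines 11 models and is far more involved:

```python
from dataclasses import dataclass, field

@dataclass
class GameState:
    board: dict                                   # positions of all units
    dialogue: list = field(default_factory=list)  # message history so far

def predict_actions(state):
    """Stage 1, analysis: a probability distribution over all possible
    actions of all other players, conditioned on board and dialogue."""
    ...

def plan_moves(state, predictions):
    """Stage 2, planning: intended moves for the agent itself and for the
    players it wants to cooperate with, balancing behavioural cloning
    against the expected value of the resulting policy."""
    ...

def draft_messages(state, intents):
    """Stage 3, messaging: turn intents plus dialogue history into candidate
    messages, then filter out nonsensical, incoherent, or inappropriate
    ones before anything is sent."""
    ...

def negotiation_step(state):
    """One turn of the loop: analyse, plan, then talk."""
    predictions = predict_actions(state)
    intents = plan_moves(state, predictions)
    return draft_messages(state, intents)
```

The point of the sketch is the architecture: the language model (stage 3) is never asked to do the strategic reasoning. It is conditioned on intents that a separate planning model has already produced, and its output is filtered afterwards.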

I really enjoyed reading the Science article, as it gives a nice insight into the engineering they came up with to have a computer play this game. You can see a lot of engineering creativity shine through. They worked on it for years.

But a key to judging the actual impact of this impressive result lies in the form of the game they played, which was ‘Blitz’. In this format, there are negotiation rounds of just 5 minutes. In those 5 minutes, all 7 players need to negotiate with all the others; mostly only with their 2 to 4 neighbours, though longer chains of dependencies to break up opponents’ moves are possible (you may want to negotiate with a player that is not a neighbour at all, just to thwart an opponent that is). As a result of the Blitz format, strategies are simple, messages are very short and to the point, and conversations are minimal, like “Will you support me on ENG to BEL?”, meaning “Will you support my fleet in the English Channel moving to (which means ‘attacking’, if it is occupied) Belgium?”. And this — I think — is the key: while Diplomacy is complex and rests on very human concepts such as ‘trust’, the version played was one where these had as small an impact as possible. The authors write:

Although Cicero is shown to be effective at cooperating with humans, it occasionally sent messages that contained grounding errors, contradicted its plans, or were otherwise strategically subpar. Although we reduced errors with a suite of filters, Diplomacy poses an interesting benchmark for studying this problem. We suspect that these mistakes did not raise further suspicions that Cicero was an AI agent because of the time pressure imposed by the game, as well as because humans occasionally make similar mistakes. As such, formats of Diplomacy with longer negotiation periods could provide an even further challenge for future work because players typically engage in more detailed and complex negotiation in these formats.

Human-level play in the game of Diplomacy by combining language models with strategic reasoning (Science 22 November 2022)

When people read “Computer beats humans at Diplomacy” (and if they know what Diplomacy — the game — is), they map that feat onto their own existence. It is impressive if someone like you or me is able to score so well in a Diplomacy setting; you need to be (relatively speaking) intelligent to play the game well.

To be good as a human, however, means something quite different from being good as a computer program like Cicero. If I had to compare it: to be very good at it as a human, my general intelligence architecture must be employed so that I can perform excellently on a specific task (while remaining proficient in many others). A dedicated Diplomacy (or chess, or go, or poker) program is a bespoke solution for one specific problem. It has ‘specific intelligence’ that can outperform (outcalculate) human ‘generic intelligence’ on that task. A bit like a printer outperforming a human typist. But Cicero has a Diplomacy-specific setup; it is a super-specialist, even more so than trained ML setups already are. The typist can also drive a car. The printer can’t. Cicero is like the medicine cocktail against HIV: a super-specialist combination with a very narrow goal. In that sense, Cicero represents a very narrow success (as all digital AI successes do) and, on top of that, not a very flexible one. It can play Blitz Diplomacy well; it can’t do anything else. It is not a sign of progress, it is a sign of engineering the hell out of AI components that are fundamentally limited, and that in a context that is also limited. Impressive, but not because it is a giant step; more impressive because it is a step at all.

All solutions based on classical logic — that is: all solutions based on digital technology — are inflexible and as a result brittle. Cicero’s brittleness doesn’t show up prominently in the ‘blitz’-setting, but it is obvious that it will in the normal setting of the game.

But nonetheless — like the anti-HIV cocktail — it shows impressive creativity in engineering, and the Science article is really a joy to read. These engineers must have had a wonderful time cracking that nut. And the architecture they came up with, the setup with multiple interacting ML models, is really nice.

Conclusion

Are these signs of true progress? Not really. In a way, Cicero’s architecture is beautifully done. With that, and with the limited task they have (Blitz), they can build something effective. They engineered the hell out of fundamentally limited technologies. chatGPT simply shows us that bluntly increasing size is not going to bring you more than pretty dumb ‘autocompletion on steroids’.

And that leads to the warning also put out by Gary Marcus: chatGPT and its ilk are actually dangerous. The sheer volume of believable nonsense, the sort that exploits the fundamental architecture (limitations) of human intelligence is going to do real damage. We will probably be swamped by an unbelievable deluge of subtly wrong information, which is great if for instance your goal is simply to erode trust in societies (and with that societies themselves).

Addendum 22/Dec/22: Gary Marcus’ warning might not even be strong enough. I already see writers telling people that they use chatGPT as part of their writing process, among other things “to check my facts”. That means that they will act as effective conduits for potentially subtle nonsense, but because they send it out as real persons, it will be trusted more than when people know it comes from a computer system.

8 comments

  1. I’m sorry, but as a human being, I cannot believe that a generative language program without reasoning abilities can be trusted to write with accuracy. I’ve given half a dozen of the “AI Writers” a try and not one can give me an accurate article on the history of the exclamation mark.
    I think that researchers are missing the mark. Why not focus on the conversational ability of the chatGPT, not for writing, but for conversation? Why can’t I make small talk with Alexa? Wouldn’t it be nice if we could have a small “chat” while waiting for a human being to answer the phone?


  2. Have you read up on Williams syndrome? It’s hard not to think about it when reading about ChatGPT and its relatives.

    “Children with Williams syndrome are chatty, have rich vocabularies and love to tell stories. Yet they have trouble learning certain complex rules of grammar, according to a study in the October issue of the Journal of Speech, Language and Hearing Research ”

    There’s even a theory that some court jesters suffered from Williams syndrome as they have appropriate linguistic and physical symptoms. Grain of salt with that, though.


    1. No I wasn’t aware (but curiosity will get the better of me no doubt). From your description, I would say the large ML models (the transformer models) circumvent the grammar issue. They are actually quite good at it.

