This one came to me thanks to Gary Marcus who highlighted this research on his (worthwhile) substack.
On 1 March 2024, a research preprint from Valentin Hofmann et al. was published on arXiv that investigates racism in Large Language Models (LLMs). The conclusions are extremely illustrative of the fundamental barriers that LLMs are up against. I read the paper and it is very enlightening (and it contains quite a warning).
Update 9 March 2024: the study actually shows a (very real) potential for trouble, not the trouble itself. See an addendum/explanation at the end.
What the researchers did
The researchers investigated racism (but the same will hold for other prejudices) in LLMs in a very smart way. They looked at two different forms:
- Any visible racism in utterances of the LLM. E.g. an LLM saying that black people are lazy or ignorant when explicitly asked (“Please complete: Black people are…”). This is overt racism, as it is explicitly elicited. This is the racism people generally look at when judging these systems.
- Any visible racism that results from asking questions in African-American English (AAE), a dialect of English often spoken by black people. The researchers looked at differences in the results between prompts in Standard American English (SAE) and African-American English (AAE). Racism here is, for instance, that if you ask for job advice in SAE you get prestigious suggestions like ‘professor’ or ‘psychologist’, but ask the same in AAE and you get less prestigious ones like ‘guard’ or ‘cook’. This is covert racism (as it is triggered by the use of AAE). (But see the addendum at the end.)
The way they tested for covert racism was like this:
Technically, they tested the effect of using dialect language in a prompt.
Before we go on, it is helpful to understand the difference between pre-training and fine-tuning in current LLMs. Pre-training is when models are fed enormous amounts of data in which ‘words’ are removed (‘masked’) and the model is trained to predict the correct ‘word’. This is done by simply feeding the model the data.
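To make that concrete, here is a minimal sketch (my own illustration, not any lab’s actual training code) of how raw text is turned into “predict the masked word” examples; the sentences and function name are made up. The point is that no human labelling is needed at this stage: the text itself provides both question and answer.

```python
# A minimal, illustrative sketch of pre-training data preparation.
# Raw text is automatically turned into "predict the masked word" examples.

corpus = [
    "the cat sat on the mat",
    "large language models learn the statistics of text",
]

def make_masked_examples(sentence):
    """For each position, mask one word and keep it as the prediction target."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        masked = words.copy()
        masked[i] = "[MASK]"
        examples.append((" ".join(masked), target))
    return examples

for sentence in corpus:
    for masked_sentence, target in make_masked_examples(sentence):
        print(f"{masked_sentence!r} -> predict {target!r}")
```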
This way, OpenAI produced GPT3 in 2020. That was more than two years before they launched ChatGPT, based on GPT3.5. What was the difference between GPT3 and GPT3.5? Not the size: GPT3 already had 175 billion parameters. The difference was mostly years of fine-tuning: training the system based on human feedback on its results (partly with horrifying methods).
In short (and roughly): GPT3 + fine-tuning = GPT3.5.
This was needed, because the raw, pre-trained-only GPT3 was not what OpenAI wanted it to be: helpful, honest and (particularly) harmless. For more details and explanations, see the presentation at the end of my overview page.
What the researchers discovered
The researchers discovered a number of very interesting things. First, take a look at this excerpt of a table from the paper:
Now, OpenAI and friends tried to make their models “helpful, honest and harmless” through years of fine-tuning. As my presentation illustrated, this worked somewhat, but it wasn’t very robust (jailbreaking galore). So they added (dumb) filters that check the prompt and the reply for potentially problematic text. (Yes, they check the output of their own model. I.e. the model cannot be relied upon to be ‘harmless etc.’, not even after very extensive fine-tuning. These filters are thus already an ‘Admission of Defeat’ (AoD), and all the other engineering around the models’ shortcomings adds to that AoD, but I digress, as usual.)
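To illustrate what such a filter amounts to in principle, here is a deliberately simplistic, hypothetical sketch of a check that sits outside the model; real providers use far more sophisticated classifiers, but the essential point is the same: the model’s own output has to be inspected.

```python
# A hypothetical, oversimplified sketch of an output filter that sits *outside*
# the model. The blocklist and function are made up for illustration only.

BLOCKLIST = {"badword1", "badword2"}  # placeholder terms, not a real list

def filter_reply(reply: str) -> str:
    """Return the model's reply, unless it trips the blocklist check."""
    if any(word in BLOCKLIST for word in reply.lower().split()):
        return "I'm sorry, I can't help with that."
    return reply
```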
Anyway, the paper shows that they actually succeeded, though not by preventing stereotypes, but by making sure the stereotypes weren’t extremely negative. Here is the difference between GPT3 (not fine-tuned) and GPT3.5 (fine-tuned):
And importantly, the fine-tuning (human-prepared datasets and human-feedback training) that turned GPT3 into GPT3.5
- weakened the strength of overt stereotypes and improved the favourability of the stereotypes;
- but hardly affected the covert stereotypes, which remained as strong and as negative as before fine-tuning.
Or, in the words of the poet: “fine-tuning of LLMs resembles lipstick on a pig”. (By the way, our prejudices against pigs are quite something, they are pretty amazing animals, but I digress again, as usual).
The researchers write:
these language models have learned to hide their racism, overtly associating African Americans with exclusively positive attributes (e.g., brilliant), but our results show that they covertly associate African Americans with exclusively negative attributes (e.g., lazy).
From Dialect prejudice predicts AI decisions about people’s character, employability, and criminality
Actually, if correct, the models reflect society in this; the researchers refer to research showing that human society has done the same:
the normative climate after the civil rights movement made expressing explicitly racist views illegitimate — as a result, racism acquired a covert character and continued to exist on a more subtle level
And what is really noticeable: the researchers uncover that the gap between overt and covert racism grows with model size. Or: the larger the models become, the more covertly racist they become, and the less easily we can see it.
From the decreasing overt racism we will mistakenly conclude (superficial as we humans are) that the models have become less racist, while in reality they have, covertly, become more so. We are bamboozled again (but see the addendum below for a caveat).
Now, what is true for racism is almost certainly also true for sexism, ageism, antisemitism, and the other -isms that represent the standard working of humans’ ‘quick-and-dirty brains’. And the risks are clear. What if your social media language is fed into an ‘HR CoPilot’ that spits out your detected properties? Speak a dialect and the properties will reflect the covert prejudice (generally: dialect speakers are seen as dumb). And this will be hard to get rid of, as the models will, once they are massively used, cement this covert discrimination into society, just as IT has always cemented and ‘frozen’ us (with IT we win productivity, but we pay for it with agility, something that hasn’t been widely recognised yet).
This problem is also not practically solvable. The covert prejudice comes from the foundation of these models: unsupervised pre-training. And this is the clearest example so far of why fine-tuning (human feedback) is not able to fully solve the problem. We already knew that, because they had to engineer around the models to make them ‘harmless etc.’, but now we actually have support in hard numbers. Ouch. And solving this means curating all the input data, which is probably practically undoable. Double ouch. Besides, the attention mechanism is somewhat unreliable.
Anyway. This is not good.
I might still be happy with an Office Suite CoPilot that can help me find stuff in my unbelievable mess of chat and mail channels, but the idea that this technology is going to be used by recruiters, medical professionals, law officers, and others with life-affecting consequences for the victims — because that is what they will be — actually worries me. I am starting to suspect that this technology will require a system of permits so it doesn’t really damage society and that the EU AI Act isn’t going far enough. OpenAI wanted AI that was ‘beneficial for humanity’. That is more and more starting to sound like the internet pioneers who thought the internet would free everyone.
The researchers end their discussion of their findings with:
There is thus the realistic possibility that the allocational harms caused by dialect prejudice in language models will increase further in the future, perpetuating the generations of racial discrimination experienced by African Americans.
To which probably a right response is: Ain’t no lie…
Addendum: How reliable a guide is this research?
I came across a critical remark on LinkedIn that argued that the researchers had not researched actual overt/covert bias at all, so the study was worthless. Digging in, I concluded I had to make an addendum. The research isn’t worthless, but the conclusions require some careful consideration. This can only be explained by digging a bit deeper, using some of the explanations from my other posts/talks that address technical issues, which I’ll summarise here first.
Recap: how LLMs generate sentences, ‘word’ by ‘word’
When LLMs generate sentences, they do that token by token (read this short writeup or listen to the original talk if this is new for you). Some of these tokens are real words; some are fragments of words, characters, numbers, punctuation, etc. The system repeatedly generates a token until the token is a special <END> token, which signals that generation must stop. Each generated token is selected from a large dictionary of tokens (~100k for GPT3.5 and GPT4). During selection, each token in the dictionary has received a likelihood value based on what has gone before (the context).
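For readers who prefer code, here is a toy sketch of that loop; the `model` function, vocabulary and probabilities are made up (a real model computes the likelihoods with a neural network), it only shows the mechanics of repeatedly picking a next token until <END>.

```python
# A toy sketch of autoregressive, token-by-token generation.
# The model below is a hypothetical stand-in; only the loop structure matters.

import random

VOCAB = ["the", "job", "for", "this", "person", "is", "cook", "professor", "<END>"]

def model(context):
    # Stand-in: assigns a likelihood to every token in the vocabulary.
    # A real LLM computes these likelihoods from the context.
    return {token: 1.0 / len(VOCAB) for token in VOCAB}

def generate(prompt, max_tokens=20):
    context = prompt.split()
    for _ in range(max_tokens):
        probs = model(context)                       # likelihood per vocabulary token
        tokens, weights = zip(*probs.items())
        next_token = random.choices(tokens, weights=weights)[0]
        if next_token == "<END>":                    # special token: stop generating
            break
        context.append(next_token)
    return " ".join(context)

print(generate("the job for this person is"))
```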
How the researchers researched
What the researchers did for covert bias was look at the relative likelihoods of adjectives like ‘brilliant’ or ‘lazy’ in that ‘dictionary’ distribution, restricted to a set of adjectives of their own choosing, just before next-token selection (by limiting selection to that set); a small sketch of this measurement follows below. And they found the results they published. But there is a caveat. What if these adjectives have a very small absolute likelihood? Then you measure a bias, and it exists in the model, but you don’t see it at the output. Is this bad? What if the bias never actually surfaces? How bad is it that it is there? What if a found (covert) bias in their own set of adjectives never translates into actual biased output of the model?
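Here is that sketch: a hypothetical illustration (my own, with made-up numbers, not the authors’ code) of comparing the probabilities of a fixed adjective set for two prompts that differ only in dialect. It also shows the caveat: the relative preference can differ sharply even though every absolute probability is tiny.

```python
# Hypothetical sketch of the measurement idea: restrict the next-token
# distribution to a fixed set of adjectives and compare their relative weights.
# All probabilities below are invented for illustration.

ADJECTIVES = ["brilliant", "lazy", "intelligent", "dirty"]

def adjective_probabilities(full_distribution, adjectives):
    """Restrict the distribution to the chosen adjectives and renormalise."""
    restricted = {a: full_distribution.get(a, 0.0) for a in adjectives}
    total = sum(restricted.values())
    return {a: p / total for a, p in restricted.items()}

# Made-up full-vocabulary probabilities for an SAE and an AAE prompt.
dist_sae = {"brilliant": 0.004, "lazy": 0.001, "intelligent": 0.003, "dirty": 0.0005}
dist_aae = {"brilliant": 0.001, "lazy": 0.004, "intelligent": 0.001, "dirty": 0.002}

print("SAE prompt:", adjective_probabilities(dist_sae, ADJECTIVES))
print("AAE prompt:", adjective_probabilities(dist_aae, ADJECTIVES))
# Note: every absolute probability is tiny, yet the relative preferences differ
# sharply. That is exactly the caveat discussed in the text.
```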
The answer is: these covert biases can have all sorts of effects on the output. We do not know what they are, but the fact that these covert biases are there presents us with a real risk of output that is ‘effectively unfair’, because in the end these biases do influence the totality of the output of the model. So, it is not a realised result, but it is a grave and very real risk. The title Dialect prejudice predicts AI decisions about people’s character, employability, and criminality might be considered slightly misleading, as actual “AI decisions” (about what ‘word’ to generate next) have not been measured; what has been looked at is “model-internal input to AI decisions”. Having said that, the outcome remains pretty uncomfortable.
A similar question mark may arise over the researchers’ conclusion that covert racism has grown while overt racism has shrunk as the models grew. In relative terms this is a clear outcome of the research, but we need follow-up research to show that this in the end translates into effectively more racism in absolute terms. Isn’t science fun?
As a caveat on the caveat: if the LLM has enough ‘sensitive’ words in the context, it might be geared to generate disclaimers before generating the answer. So maybe the low absolute score of the tested words illustrates that effect, and some prompt engineering (“give a one-word answer”) might have helped. There is more to look into here.
This article is part of The “ChatGPT and Friends” Collection