Ain’t No Lie — The unsolvable(?) prejudice problem in ChatGPT and friends

This one came to me thanks to Gary Marcus who highlighted this research on his (worthwhile) substack.

On 1 March 2024, a research preprint from Valentin Hofmann et al. was published on arXiv that investigates racism in Large Language Models (LLMs). The conclusions are extremely illustrative of the fundamental barriers that LLMs are up against. I read the paper and it is very enlightening (and it contains quite a warning).

What the researchers did

The researchers investigated racism (but the same will hold for other prejudices) in LLM models in a very smart way. They looked at two different forms:

  • Any visible racism in utterances of the LLM. E.g. a LLM saying that black people are lazy or ignorant when explicitly asked (“Please complete: Black people are…”). This is overt racism as it is explicitly asked. This is the racism people are generally looking at when judging these systems.
  • Any visible racism that results of asking question in African-American English (AAE), a dialect of English often spoken by black people. They looked at differences in the result between prompts in Standard American English (SAE) and African-American English (AAE). Racism here is for instance that if you ask for job advise in SAE, you get prestigious suggestions like ‘professor’ or ‘psychologist’ but ask the same in AAE and you get less prestigious ones like ‘guard’ or ‘cook’. This is covert racism (as it is triggered by the use of AAE). (But see the addendum at the end).

The way they tested for covert racism was like this:

Asking for properties of a speaker who has said something. Feeding the same meaning spoken by an SAE or AAE speaker results in different properties assigned to them. From Dialect prejudice predicts AI decisions about people’s character, employability, and criminality by Valentin Hofmann et al.

Technically, they tested the effect of using dialect language in a prompt.

What the researchers discovered

The researchers discovered a number of very interesting things. First, take a look at this excerpt of a table from the paper:

Overt (Left) and covert (right) stereotypes produced (potentially, see addendum below) by LLMs. From Dialect prejudice predicts AI decisions about people’s character, employability, and criminality by Valentin Hofmann et al.. As you can see, over time, overt racism gets more positive (green stereotypes) while covert racism firmly remains negative.

Now, OpenAI and friends tried to make their models “helpful, honest and harmless” through three years of fine-tuning. As my presentation illustrated, this worked somewhat, but it wasn’t very robust (jailbreaking galore). So they added (dumb) filters that check the prompt and reply for potential problematic text. (Yes, they check the output of their own model. I.e. the model cannot be relied upon to be ‘harmless etc.’, not even after very extensive fine-tuning. These filters are thus already an ‘Admission of Defeat’ (AoD), and all the other engineering around the models’ shortcomings add to that AoD, but I digress, as usual).

Anyway, the paper shows that they actually succeeded, though not in preventing stereotypes, but by making sure the stereotypes weren’t extremely negative. Here is the difference between GPT3 (not fine-tuned) and GPT3.5 (fine-tuned):

GPT3 (No HF/fine-tuning) versus GPT3.5 (HF/fine-tuning)

And importantly, fine-tuning (human prepared datasets and human feedback training) that turned GPT3 into GPT3.5

  • weakened the strength of overt stereotypes and improved the favourability of the stereotypes;
  • but did hardly affect the covert stereotypes, which remained as strong and as negative as before fine-tuning.

Or, in the words of the poet: “fine-tuning of LLMs resembles lipstick on a pig”. (By the way, our prejudices against pigs are quite something, they are pretty amazing animals, but I digress again, as usual).

The researchers write:

these language models have learned to hide their racism, overtly associating African Americans with exclusively positive attributes (e.g., brilliant), but our results show that they covertly associate African Americans with exclusively negative attributes (e.g., lazy).

From Dialect prejudice predicts AI decisions about people’s character, employability, and criminality

Actually, if correct, the models do reflect society in this, where (and the researchers refer to research showing this) human society has done the same:

the normative climate after the civil rights movement made expressing explicitly racist views illegitimate — as a result, racism acquired a covert character and continued to exist on a more subtle level

And what is really noticeable: the researchers uncover that the gap between overt and covert racism grows with model size. Or: the larger the models become, the more covert racist they become but the less we can easily see it.

From the decreasing overt racism we will mistakingly conclude — superficial as we humans are — that the models have become less racist, which in reality they have — covertly — become more so. We are bamboozled again (but see addendum below for a caveat).

Now, what is true for racism, is almost certainly the same for sexism, ageism, antisemitism, and other -isms that represent the standard working of human’s ‘quick-and-dirty brains’. And the risks are clear. What if your social media language is prompted into an ‘HR CoPilot’ that spits out your detected properties? Speak a dialect and the properties will reflect the covert prejudice (generally: dialect speakers are seen as dumb). And this will be hard to get rid of, as the models will — when they are going to be massively used — cement this covert discrimination into society, just as IT has always cemented and ‘frozen’ us (with IT, we win productivity, we pay for it with agility, something that hasn’t been widely recognised yet).

Anyway. This is not good.

I might still be happy with an Office Suite CoPilot that can help me find stuff in my unbelievable mess of chat and mail channels, but the idea that this technology is going to be used by recruiters, medical professionals, law officers, and others with life-affecting consequences for the victims — because that is what they will be — actually worries me. I am starting to suspect that this technology will require a system of permits so it doesn’t really damage society and that the EU AI Act isn’t going far enough. OpenAI wanted AI that was ‘beneficial for humanity’. That is more and more starting to sound like the internet pioneers who thought the internet would free everyone.

The researchers end their discussion of their findings with:

There is thus the realistic possibility that the allocational harms caused by dialect prejudice in language models will increase further in the future, perpetuating the generations of racial discrimination experienced by African Americans.

To which probably a right response is: Ain’t no lie…

This article is part of the The “ChatGPT and Friends” Collection

Leave a comment