Something is (still) rotten in the kingdom of artificial intelligence

Nobody can deny that artificial intelligence (or machine learning, deep learning, or cognitive computing) is booming these days. And, as before (this is in fact the second or third round for AI, depending on how you count), the hype is almost unlimited. But there are serious problems, and I suspect it will not be long before they become undeniable again and we’re back to a more realistic assessment of what the technology is bringing us.

There are roughly four types of problems in the land of AI. Let’s start with an illustration of one of these, where reality doesn’t live up to the hype yet and quite possibly never will.

Gorillas

In 2015, Google had to apologize because the image-recognition software behind its Photos app tagged a picture of two black people as gorillas. Google’s chief architect at the time wrote: “We used to have a problem with people (of all races) being tagged as dogs, for similar reasons. We’re also working on longer-term fixes around both linguistics (words to be careful about in photos of people) and image recognition itself (e.g., better recognition of dark-skinned faces). Lots of work being done and lots still to be done, but we’re very much on it”. Fast-forward three years and it turns out that Google’s quick and dirty ‘linguistic’ fix was to ban the word ‘gorilla’ altogether. This is very indicative of a fundamental problem: in three years Google was not able to fix the issue. Banning the word ‘gorilla’ altogether is not a fix, it is an admission of failure.

And here is a sign that we’re still deep in hype mode: when reporting on this, the Guardian writes: “The failure of the company to develop a more sustainable fix in the following two years highlights the extent to which machine learning technology, which underpins the image recognition feature, is still maturing” (italics mine). This is a sign of hype because it assumes (without reasonable proof) that the problem is solvable at all. And the excuse is by definition limited in time: you cannot keep writing that something is ‘maturing’ indefinitely.

These kinds of problems (and this kind of reporting in the press) fully mirror what happened in the 1960s and 1970s. Many reported successes weren’t real, and the failures were either painted over, not reported, or labelled “still maturing”, implying that there is nothing wrong with the approach itself. I noticed the same vein of reporting in magazines like New Scientist years ago. Good examples were articles listing “the X most interesting scientific and technological breakthroughs”. Such lists generally contained at least one AI-like item, and that item was often also the only one that wasn’t reality yet. Often (just as in the first AI hype period) these were not reported as failures, but as breakthroughs that “just needed to mature” or “had a few outstanding issues to fix”. All the other items in such lists were proven breakthroughs (e.g., a mechanism that demonstrably worked but was still too expensive to produce in volume), but the AI ones were breakthroughs only if you accepted totally unfounded extrapolations. There is a disturbing lack of criticism when AI is being reported on. By the way, I like New Scientist a lot; this just shows that even serious science journalism is not immune to the hype.

For those experienced enough (a nice way of saying ‘old buggers’) to have been on the inside of the first period of AI hype (1960 to 1990), the current hype shows some distressing parallels with the previous one, in which hundreds of billions were spent in vain to get the promised ‘artificial intelligence’. By 1972, Hubert Dreyfus’s book What Computers Can’t Do had already made clear that the approaches were doomed. It is a testament to the strength of the (mistaken) conviction behind the first AI hype that it took another 20 years and billions of dollars, spent by governments (mainly in the US, the EU, and Japan) and private companies (Bill Gates’s Microsoft was a big believer, for instance), for the AI hype to peter out and the (second) ‘winter of AI’ to set in.

Did that mean AI failed completely in the previous century? Not completely, but what definitely did fail were the initial waves and the initial assumptions on which they were built. This current AI hype is seemingly built on different assumptions, so theoretically it might succeed. The question then becomes “Are we closing in on the goal this time?” The answer is again no, for some fundamental reasons I mention below, but the situation is also different: more actual value results from the current efforts.

Great expectations and failure, round 1

With the early rise of computers in the 1950s and 1960s came a fast-growing belief that these wonderful machines, which could calculate so much faster than any human, would give rise to artificial (human-like) intelligence. Early pioneers wrote small-scale programs that were able to beat humans first at tic-tac-toe, then at checkers, and by 1970 we were already reading predictions that computers would be “as smart as humans in a matter of years” and “much smarter a few months later”. Within ten years, the world chess champion would be beaten by a computer.

Computers as intelligent as humans (even if specialist programs could eventually outcalculate human intelligence in microworlds such as chess) failed to materialize decade after decade, until in the 1980s a ‘winter of AI’ slowly set in. I often lament the billions Bill Gates’s Microsoft poured into AI (in which he believed strongly) to produce duds like Microsoft Bob, Microsoft Agent, the infamous Clippy, and lots of vaporware, when it could have fixed its software to be less of a security nightmare and actually work decently (e.g., it is moronic that even in Windows 10 today, if you share a spreadsheet from within Excel, which creates a mail message in Outlook, Outlook is incapable of doing anything else until the window containing the spreadsheet to be sent has been closed. Something NeXTStep, now macOS, could already do 30 years ago, but I digress). [Update Mar 2019: After thirty years, this seems to have been fixed in recent versions of Office.]

But in the end the whole endeavour came to nothing. Expert systems turned out to be costly to build and maintain, and very, very brittle. Machine translation based on grammars (rules) and dictionaries never amounted to much. The actual successes often had one thing in common: they restricted the problem to something computers could manage, a microworld. For instance, using a small subset of English (e.g., Simplified Technical English). Or exploiting the fact that the possible combinations of street names, numbers, postal codes, and city names restricted the search space so well that mail-sorting machines could get by with a measly 70 percent accuracy at actually reading the addresses themselves. It is how a system like Siri still works (poorly, as far as my experience goes) today: it is limited to a small set of ‘microworlds’ in which it can try to create structure from sounds (which is a far cry from an actual natural language interface).
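
To make the ‘microworld’ trick concrete, here is a minimal sketch (my own toy illustration, not how any actual mail-sorting system was built) of why a finite list of valid addresses rescues an unreliable reader: even when the raw character recognition is partly wrong, matching against the known combinations often still yields the right address.

```python
from difflib import get_close_matches

# The 'microworld': the finite set of addresses that actually exist.
# (Toy data; a real sorting centre would load millions of these.)
valid_addresses = [
    "42 Baker Street, 2517 AB The Hague",
    "7 Kalverstraat, 1012 NX Amsterdam",
    "15 Coolsingel, 3012 AA Rotterdam",
]

def resolve(ocr_output):
    """Map noisy OCR output onto the nearest existing address, if any."""
    matches = get_close_matches(ocr_output, valid_addresses, n=1, cutoff=0.6)
    return matches[0] if matches else None

# A '70 percent accurate' reading: several characters are wrong or missing.
print(resolve("42 Bakar Streef, 2517 AB The Hagu"))
# -> '42 Baker Street, 2517 AB The Hague'
```

The intelligence is not in the reading; it is in the drastic restriction of the answer space.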

As mentioned above, the failure was already predicted in the late 1960s by Dreyfus, who — after studying the approaches and ideas of the cognitive science and AI communities — noticed that their approaches were based on misunderstandings already known from analytic philosophy, which had hit those same limitations before. But who listens to philosophers, even if they are of the analytic kind?

Anyway, in 1979 the second, revised edition of Dreyfus’s seminal book was published. It showed that much of the additionally reported progress was again illusory, and it analyzed why. And finally, in 1992 the third and definitive edition was published by MIT Press (MIT having been a hotbed of anti-Dreyfus sentiment in the years before). Twenty years in a field supposedly developing at a speed beyond everything that went before, with computers that had become many orders of magnitude more powerful, and yet the book remained correct; in fact it is correct (and worthwhile to read) to this day.

In the 1992 edition, Dreyfus described an early failure of the then-nascent ‘neural networks’ approach (the approach that underlies many of today’s successes, such as Google beating the human Go champion). The U.S. Department of Defense had commissioned a program that would, based on neural network technology, be able to detect tanks in a forest from photos. The researchers built the program and fed it the first batch of a set of photos showing the same locations, once with a tank and once without. Having trained the neural network this way, the program was able to detect tanks with uncanny precision on photos from the training batch itself. Even more impressive, when they used the rest of the set (the photos that had not been used to train the neural network), the program did extremely well, and the Department of Defense was much impressed. Too soon, though. The researchers went back to the forest to take more pictures and were surprised to find that the neural network was not capable of discriminating between pictures with and without tanks at all.

An investigation provided the answer: it takes time to put tanks in place or remove them. As a result, the original photos with tanks had been taken on a cloudy day and the photos without tanks on a sunny day. The network had effectively been trained to discriminate between a sunny day (sharp shadows) and a cloudy day (hardly any shadows). Fast-forward thirty (!) years and the deep learning neural networks or equivalents from Google run into exactly the same problem. Their neural networks have learned to discriminate, but not enough to discriminate between black people and gorillas, and it is actually not very clear what they discriminate on.
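
The failure mode is easy to reconstruct with synthetic data. Below is a minimal sketch (my own toy example, not the original Department of Defense material): a classifier trained on photos in which ‘tank’ happens to coincide with ‘cloudy’ learns brightness, and collapses the moment that accidental correlation is broken.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def photos(n, tank, cloudy):
    """Toy 'photos' reduced to two features: mean brightness and texture.
    Note that nothing about the tank itself makes it into the features."""
    brightness = rng.normal(0.3 if cloudy else 0.8, 0.05, n)  # cloudy = dark
    texture = rng.normal(0.5, 0.10, n)                        # uninformative
    return np.column_stack([brightness, texture]), np.full(n, int(tank))

# Training set: tank photos taken on a cloudy day, empty-forest photos on a sunny day.
Xa, ya = photos(200, tank=True,  cloudy=True)
Xb, yb = photos(200, tank=False, cloudy=False)
X_train, y_train = np.vstack([Xa, Xb]), np.concatenate([ya, yb])

clf = LogisticRegression().fit(X_train, y_train)
print("photos from the original shoots:", clf.score(X_train, y_train))  # ~1.0, uncanny

# New photos where the accident is broken: tanks in the sun, empty forest under clouds.
Xc, yc = photos(200, tank=True,  cloudy=False)
Xd, yd = photos(200, tank=False, cloudy=True)
X_new, y_new = np.vstack([Xc, Xd]), np.concatenate([yc, yd])
print("new photos from the forest     :", clf.score(X_new, y_new))      # ~0.0, worse than guessing
```

The classifier did exactly what it was asked to do: find whatever separates the two classes in the training data. It just wasn’t the thing anyone cared about.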

And yes, Google can now beat the best people at Go, which is much more difficult than chess (where IBM beat the best people in the 1990s), but Go and chess are still examples of the microworlds of the late 1960s. Even more importantly: these are strictly logical (discrete) microworlds, and people are extremely bad at (discrete) logic. We might be better at it than any other living thing on this planet, but we’re still much better at Frisbee. (This by the way is one of the reasons why rule-based approaches for enterprise/IT architecture fare so poorly: The world in which enterprises/computers must act is not logical at all.)

Great expectations, round 2. Failure ahead?

Neural networks and other forms of statistics require lots of data and ‘deep’ (i.e., many-layered) networks to amount to anything. In the 1960s and 1970s the computers were puny and the data sets were tiny. So, generally, the ‘intelligence’ was based on the direct creation of symbolic rules in expert systems, without a need for statistics on data. These rule-based systems (sometimes disguised as data-driven, such as the Cyc project, which stubbornly plods along with the goal of modeling all facts and thus arriving at intelligence) failed outside of a few microworlds. Cyc, by the way, is proving that even with a million facts and rules, you still have nothing that looks like true intelligence.

Anyway, the first neural networks were ‘thin’: initially two, maybe three layers to create the correlations between input (e.g., a photo) and output (“Tank!”). Such thin networks are very brittle. With more ‘depth’ today, they have more hidden rules (e.g., have a look at this Nvidia study on a 9-layer, 27-million-connection, 250-thousand-parameter neural network for a self-driving car). The statistical correlations are still equivalent to rules, these days just less shallow than mistaking a sunny versus a cloudy day for “tank” versus “no tank”. In fact, while we do not know the actual rules, neural networks on digital computers are still data-driven, rule-based systems in disguise. Smart “psychometrics” systems may still contain meaningless relations, such as a link between liking “I hate Israel” on Facebook and a tendency to also like Nike shoes and Kit Kat.
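
To illustrate the ‘rules in disguise’ point, here is a small sketch (my own, not anything Google or Nvidia has published): train a small neural network on synthetic data, then approximate it with a shallow decision tree and print the rules it effectively encodes. For simple data the extracted rules are embarrassingly plain; deep networks on raw pixels encode vastly more and vastly murkier rules, but the principle is the same.

```python
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Synthetic 'customer' data: age and income decide the label via a simple hidden rule.
X = np.column_stack([rng.uniform(20, 90, 2000),    # age
                     rng.uniform(10, 150, 2000)])  # income (k EUR)
y = ((X[:, 1] > 60) & (X[:, 0] < 55)).astype(int)  # the rule the data actually follows

# A small neural network learns the input/output correlation...
net = MLPClassifier(hidden_layer_sizes=(16, 16), max_iter=2000, random_state=1).fit(X, y)

# ...and a depth-2 tree fitted to the network's *predictions* makes the hidden rule explicit.
surrogate = DecisionTreeClassifier(max_depth=2).fit(X, net.predict(X))
print(export_text(surrogate, feature_names=["age", "income"]))
```

The printed tree recovers thresholds very close to the ones the data was built from: correlations baked into weights, not understanding.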

Now, over the last decade, the amount of data that is available has become huge (at least according to our current view; don’t forget that researchers in the 1960s already talked about the issues that came with trying to program the “very powerful” computers of their day). This has opened up a new resource, data, in this light sometimes called the new “oil”, that can be exploited with statistics. To us logically woefully underpowered humans, it looks like magic and intelligence. These statistical tricks can draw reliable conclusions that are impossible for us to draw, if only because we as human beings are unable to process all that data and calculate those statistics without the help of computers.

But look a bit deeper, and you will often still find shockingly simple rules being unearthed and used. One specialist told me: most of the time, when we do statistical research on the data, we find correlations around some 14 different aspects: ZIP code, age, sex, income, education, and so on. And these predict with decent enough reliability how we should approach customers (they often work as useful proxies, but that comes with risks; see below).
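
To give a hedged impression of what such ‘shockingly simple rules’ look like in practice (synthetic numbers, not the specialist’s actual data): a plain response-rate table over a couple of demographic columns already gives you this kind of rule.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 10_000

customers = pd.DataFrame({
    "zip_region": rng.choice(["north", "south", "centre"], n),
    "age_group":  rng.choice(["<35", "35-55", "55+"], n),
})

# Synthetic 'responded to offer' outcome: the underlying rule is deliberately simple.
base = 0.10
boost = 0.35 * ((customers.zip_region == "centre") & (customers.age_group == "35-55"))
customers["responded"] = (rng.random(n) < (base + boost)).astype(int)

# The kind of analysis that surprisingly often passes for 'AI': a response-rate table.
print(customers.pivot_table(index="zip_region", columns="age_group",
                            values="responded", aggfunc="mean").round(2))
```

The resulting ‘rule’ (approach customers in the centre region between 35 and 55) is useful, statistical, and entirely unglamorous.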

The patterns (rules) unearthed are thus rather simple and have two important properties:

  • They are not perfectly reliable. They all come with a percentage of reliability, as in “90 percent reliable correlation”. That means in many cases they produce erroneous results.
  • The more restricted the range of predictions, the more reliable they are.

So, how bad is it that these methods have reliability issues? Will this statistics-based AI fail as spectacularly as the earlier symbolic wave did? The answer is no. If a method supports your customers well nine out of ten times, that is a good thing, right? Unless, of course, the remaining 10 percent make so much fuss on social media that your brand suffers (“Google labeled me a gorilla!”). At this stage in the game, nobody pays much attention to the outliers, and that is a recipe for disaster.

Remember the financial crisis? You know what a big part of its origin was? Statistics. The quants of the financial sector had created statistical models that gave their users the illusion that they had done away with uncertainty. Their spreadsheets with formulas (1960s computer intelligence) and Monte Carlo simulations (1980s computer intelligence) broke down because of the outliers, unleashing the biggest economic crisis since the 1930s. Given the naivete of many current initiatives, it’s just a matter of time before something similar happens in the new big data space.
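
A minimal sketch of that failure mode (a generic illustration, not a reconstruction of any actual quant model): a Monte Carlo simulation that assumes normally distributed returns looks reassuring right up to the point where the real world turns out to have fat tails.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 1_000_000  # simulated trading days

# The model's world: daily returns are normal with 1% volatility.
modelled = rng.normal(0.0, 0.01, n)

# A fat-tailed 'real' world: Student's t with 3 degrees of freedom, scaled to the same volatility.
real = rng.standard_t(3, n) * 0.01 / np.sqrt(3)

for name, r in [("normal model", modelled), ("fat-tailed reality", real)]:
    var99 = np.percentile(r, 1)   # the 99% one-day Value-at-Risk the model reports
    worst = r.min()               # the outlier that actually breaks the bank
    print(f"{name:20s} 99% VaR {var99:+.3f}   worst day {worst:+.3f}")
```

The one-in-a-hundred numbers look comparable; the outliers do not, and it was the outliers that took the system down.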

But sometimes very small effects are enough for statistical methods to be useful. Say you want to influence the US elections. You do not need to convince everybody of your message (fake or not). You may be able to swing an election by a very worthwhile 0.2 percent by sending negative messages about a candidate to a selected group of his or her supporters. Suppose you can target a specific 22 percent of the electorate, say black people, who support your opponent at 90 percent. You get 50 percent of them to see a message that puts the candidate in a racist or anti-black context. If this suppresses the turnout of 80 percent of those who saw the message by 10 percent, while it increases the turnout of 5 percent of that same group by 60 percent (as they are really angry about the unfairness of the message), then you have just created a roughly 0.2 percent lower poll result for your opponent. A couple of such differences may win you elections. This is not farfetched. Such weaponized information has been used in the US 2016 election and in the Brexit referendum, where very small effects like these have apparently had a big effect on the outcome. And it gets more effective the more ‘micro’ the targeting becomes.
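
To make the arithmetic traceable, here is a back-of-the-envelope sketch. The baseline turnout is my own assumption (the example above leaves it open), so the exact figure shifts with it, but it stays in the order of a few tenths of a percent of the electorate.

```python
# Back-of-the-envelope: net vote shift from micro-targeted negative messaging.
group_share      = 0.22   # share of the electorate that is targeted
opponent_support = 0.90   # how strongly that group supports the opponent
reach            = 0.50   # fraction of the group that sees the message
suppressed_share = 0.80   # fraction of viewers whose turnout drops ...
suppression      = 0.10   # ... by this relative amount
angered_share    = 0.05   # fraction of viewers whose turnout rises ...
boost            = 0.60   # ... by this relative amount
baseline_turnout = 0.55   # ASSUMPTION: not given in the example above

viewers   = group_share * reach
lost      = viewers * suppressed_share * baseline_turnout * suppression
gained    = viewers * angered_share    * baseline_turnout * boost
net_shift = (lost - gained) * opponent_support

print(f"net votes lost by the opponent: {net_shift:.4f} of the electorate")
# About 0.003, i.e. a few tenths of a percent: tiny, and potentially decisive.
```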

That information can sway voters is a given. But, as one of the major players stated: it does not have to be true, it just has to be believed. This is supported by evidence: a recent study from Ohio State University suggests (not proves) that believing fake news may have influenced the outcome of the US 2016 presidential election by encouraging voters to defect. So, the combination of statistics on large data sets and fake news is a ‘weapon’ in the age of information warfare we now find ourselves in. And we, the population, are the target. We, the population, have been lured by ‘free’ and addictive products like Facebook, and we, the population, have not only been set up as a resource to be bought and sold (not physically, but mentally) but also as the instruments by which information warfare is fought. So, how does it feel to be a means to somebody else’s end? Nobody is aware, so nobody cares.

That suggests that something is different this time. Apparently, the law of large numbers makes it possible for even statistically weak or not very reliable technology to have a deciding influence. The first wave of AI was largely ineffective; this new wave actually does have a very big effect on societies. This is especially true in situations of near equilibrium (such as hard-fought elections), where small changes may have huge effects, or in situations where false positives or false negatives are not very damaging. District-based democracies may be more vulnerable because of their winner-takes-all setup, but proportional-representation democracies are not immune. In other words: our statistics-based methods work very well where small changes in averages are the goal. If you want to improve revenue by a few percent, using methods that may alienate a small part of the population but endear a larger part is a good thing, as long as the negative side doesn’t blow up in your face.

Which brings us to some actual limitations I’d like to draw your attention to.

A shortlist of limitations

Fundamental limitations of statistics-based technologies are:

  • Errors in singulars. Statistical methods are not perfect. We saw this already in the gorilla example, but the best example may be Google Translate. It is OK, but far from reliable. Most of the time it takes quite a bit of human intelligence to guess what was actually meant. Small errors have too devastating an effect for it to become as good as it needs to be. So, on average the technology does something useful, but the individual user cannot rely on it. This means that when you are thinking about the role of statistical methods for your organization, you need to stay away from anything where the answer for the individual user must be precise. Advice with legal ramifications (e.g., if you give financial advice and the law sets requirements on the care for the individual customer) is a good example where you need more than a chance of being correct to be useful. And who is to blame when a self-driving car kills a pedestrian?
  • Unwanted prejudice. Neural networks (deep or otherwise) and mathematically equivalent techniques link inputs and outputs, but you do not know how. So, as with the tank example, while you think you are targeting richer people, you may in fact be targeting whiter people. Although whiter people may on average be richer, it is still not ethically acceptable to target whiteness as a proxy for richness. Which, ironically, is why proxies are also often used where direct targeting is illegal or unethical. So, while proxies can work for you, unwanted proxies may work against you (see the sketch after this list).
  • Conservatism. Basing your decisions on data from the past turns you into a natural conservative. So, these technologies will fare more poorly when change is in the air or when change is needed. This is a bit like Henry Ford’s old statement that if you had asked the population about improving transportation, they would have asked for a faster horse, not a car.
  • Spinning out of control. When I worked for BSO Language Technology in the early 1990s, the researchers had created a very simple statistics-based method that outperformed rule-based text indexing (often written in Lisp) at a fraction of a fraction of the cost. It went into production at a Dutch national newspaper. The newspaper was so impressed that it wondered if it could fire its entire indexing staff, but we had to warn them: small errors accumulate and can completely derail the reliability of these statistics-based systems. A good recent example was how Microsoft (really, it should put its efforts into fixing Windows) went live in March 2016 with Tay, a chatbot that was quickly subverted by its users into a racist persona tweeting praise of Adolf Hitler, and was shut down by Microsoft after 16 hours. A year later Microsoft was at it again with Tay’s successor Zo, which was then caught making similar errors, including stating that Windows 8 was spyware, that Windows 10 was no improvement over Windows 7, and that Windows XP was better than Windows 7.
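
To illustrate the proxy problem from the second bullet above, here is a deliberately blunt synthetic sketch (my own, not any real targeting system): even when the sensitive attribute is removed from the data, a correlated feature such as a postcode quietly stands in for it.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(4)
n = 20_000

# Synthetic population: the postcode correlates with both group membership and income.
postcode_a = rng.random(n) < 0.5
group      = np.where(postcode_a, rng.random(n) < 0.8, rng.random(n) < 0.2)  # sensitive attribute
income     = rng.normal(np.where(postcode_a, 60, 40), 10)                    # correlated with postcode
rich       = income > 55                                                     # what we want to target

# The model never sees 'group', only the postcode.
X = postcode_a.reshape(-1, 1).astype(float)
targeted = LogisticRegression().fit(X, rich).predict(X).astype(bool)

print("share of group members among the targeted    :", group[targeted].mean().round(2))
print("share of group members among the non-targeted:", group[~targeted].mean().round(2))
```

Remove the sensitive attribute and the proxy remains; “we don’t even collect that data” is not, by itself, a defense.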

And then there is the growing problem of inefficiency. The first numbers are not encouraging. As reported in New Scientist (issue of 15 June 2019), just training an AI on a serious task may have a carbon footprint five times that of the total lifetime of an average car. The more accurate you want your AI to be, the more energy-consuming the process is. So, if you’re into environmental responsibility, think again before you add frivolous AI-based niceties. This, by the way, is also a fundamental issue: digital logic is inherently brittle (as the gorilla and tank examples illustrate), and we generally compensate by adding much more of it (‘multiplying infrastructure for increased reliability’ and ‘increasing the depth and width of neural networks’ are both expressions of the same type of ‘fix’ for that fundamental brittleness).

So, where does that leave us?

Well, we can clearly see that the amount of data that is available makes more and more statistical methods effective. And it is growing exponentially, so we’ve seen nothing yet, including the interesting side effects this is having on IT infrastructure. Some label it “intelligent”, “deep learning”, or “cognitive computing”, but the fact remains that it is nothing more than (often pretty simple) statistics on huge data sets. It is not ‘intelligent’, nor does it really ‘learn’: the learning is just data-driven, hidden-rules computing, and Dreyfus’s critique that rules have little to do with intelligence still holds. And if someone comes by talking about cognitive computing, please do not listen; the use of such a term is a clear sign that he or she has no clue what he or she is talking about. Don’t forget that already in 1957 researchers produced a program they christened the General Problem Solver, while it definitely wasn’t a ‘general’ problem solver at all.

The most important lessons are:

  • Statistics can be very effective and worthwhile; it’s not nonsense. But …
  • Make sure your plans for analytics do not assume you can do singulars without people in control (analytics-assisted human activity, or AHA).
  • Make sure your plans take the new brittleness of the ‘new AI’ into account (again: you will need people).
  • Make sure your new statistics-based operations are ethical.
  • Make sure you plan for much more storage and compute power close to that storage.
  • Ignore everyone who talks about “cognitive computing” or “the singularity”, and in general everyone who champions new technologies without understanding their limitations. These people are peddling General Problem Solvers, and they’re going to be very expensive to listen to.

Follow-up article on the limits of Quantum Computing here.
Related article on chatbots here.
Review of Erik Larson’s The Myth of Artificial Intelligence and an interview with the author here.
[Addition 2022-10-01] A good appraisal of (Deep) Neural Network / Machine Learning can be found in Gary Marcus’ Deep Learning: A Critical Appraisal.
[Addition 2022-10-01] To illustrate that I am not a nay-sayer: I think Google’s DeepMind should be nominated for the Nobel Prize for the published results of the AlphaFold Protein Folding Machine Learning system (a catalog). Though it might turn out I have been fooled into too much enthusiasm as well.
[Addition 2023] The whole Generative AI explosion in 2022-2023 has given rise to an entire series: The “ChatGPT and Friends” Collection which contains a good explanation of Large Language Models, usable to strategic insight.

This article first appeared in a slightly different form on InfoWorld.

Featured image: CNN Architecture from the mentioned Nvidia study End to End Learning for Self-Driving Cars.

Comments

  1. Hello, Gerben — good article! Your warnings and caveats are very timely and definitely needed.

    “Expert systems turned out to be costly to build and maintain and very, very brittle.” That’s true if the rules that capture the expertise are arcane, and/or written in an arcane language, making them difficult to analyze and change (i.e. “brittle”). I am the creator of an expert system foundation, named XTRAN, whose rules language automates the assessment / analysis, transformation / re-engineering, and translation of many computer languages, as well as of data and text. (That rules language has functional [!] similarities to Lisp, but is easier to learn and much richer in built-in domain knowledge in the form of primitives.)

    What’s unusual about XTRAN is that, because of its domain of expertise, the Expert and the Knowledge Engineer who captures the expertise as rules can be the same person. And we routinely train senior software engineers in that rules language, so there’s a substantial community who can potentially understand a set of rules. That also provides transparency and accountability, compared to the intricate inference webs constructed by machine learning algorithms, which can be virtually impenetrable even with “self-explaining” systems (“it is actually not very clear _what_ they discriminate on”).

    “For instance, using a small subset of English (e.g., Simplified Technical English).” Interesting that you mention that. I am starting to apply XTRAN to natural language prose (e.g., I created a set of “Fog Index” rules in a bit over 2 hours), but of course full natural language is wild and woolly, with lots of rules exceptions and idioms. I am in the process of exploring Simplified Technical English (STE) as a potential opportunity for XTRAN, since it disambiguates English substantially. I also see legal language as a potential opportunity, for similar reasons. As a practicing Expert Witness, I already apply XTRAN to automate forensic analysis of code, data, and text that’s needed in the course of legal work. Being able to apply it to prose such as requirements specifications, documentation, and legal documents could provide much-needed automation during discovery and litigation / prosecution.

    “This by the way is one of the reasons why rule-based approaches for enterprise/IT architecture fare so poorly: The world in which enterprises/computers must act is not logical at all.” I have XTRAN rules that reverse-engineer a system’s architecture from the code itself (the only knowledge source guaranteed to be accurate). I also have XTRAN rules that assess _and remediate_ code quality on a large scale. It’s true that much of IT is chaotic, rife with poor design and poor implementation; in my view, rules-based automation (like XTRAN) is a serious answer to that problem.

    “And we, the population, are the target.” To the extent that such targeting is effective, it is so because of ignorance and the inability of people to think clearly and solve problems. That, in turn, is largely due to the catastrophic collapse of our education system. As an educator myself, I am appalled at the vast wasteland that system has become. What’s even worse, much of the “educational” activity is actually indoctrination instead of education. Clear thinking and considering alternative viewpoints are actively discouraged, to the point of bullying and ostracism. “Diversity” based on external features such as skin color is worshiped, while diversity of thought is punished. In such a climate, the application of AI (especially statistics-based) is fraught with peril.

    “…outperformed rule-based (in Lisp often) text indexing” — as any competent indexer knows, the hard part is deciding what gets indexed and how; you have to think about how the index will be used by the reader. Once that work is done and annotated, however, the actual index creation plus Wiki-style hypertexting can be automated; that’s what XTRAN does to its own User’s Manual (written in HTML), including a KWIC-style index and an “index to the index”. As with so many tasks, the objective is to automate what can be reliably automated and focus human intervention like a laser exactly where it’s needed. So the objective is not necessarily 100% automation (which can sometimes but not usually be achieved), but to automate as much as possible without compromising the integrity of the process being automated.

    I do not view a system like XTRAN as AI itself (although it has a lot of built-in knowledge about its domain). But we and our users often write rules for it that learn from experience, and at that point I think it’s legitimate AI.

