These aren’t the droids you’re looking for.
—Obi-Wan Kenobi, from the movie Star Wars Episode IV: A New Hope
Large language models are like bulls in a china shop—powerful, reckless, and difficult to control.
Not long ago, I had the privilege of appearing on 60 Minutes, with Lesley Stahl. The most important thing I told her? That what Generative AI produces (fun as it is) is often “authoritative bullshit.”
The following is an example of what I had in mind, a Generative AI answer to a simple question: What weighs more, 1 kg of bricks or 2 kg of feathers? (The correct answer, of course, is the 2 kg of feathers.)
Truth is mixed together with utter bullshit, in completely fluent paragraphs. As the philosopher Harry G. Frankfurt famously put it, “The liar cares about the truth and attempts to hide it; the bullshitter doesn’t care if what they say is true or false.”1 ChatGPT is a bullshitter.
The insider’s term for this kind of fluent machine-generated nonsense is hallucination. Hallucinations are so common that “hallucinate” became Dictionary.com’s 2023 word of the year.2
Examples are legion. When Google’s large language model Bard came out in March 2023, someone sent me a bio that Bard had constructed of me:
In his book “Rebooting AI: Building Machines That Think Like Humans,” Marcus argues that large language models are not true AI systems. He writes that they are “simply regurgitating information that they have been trained on, without any real understanding of the world.” He also argues that they are prone to generating nonsensical text, such as “The cat sat on the dog’s seagull.”3
All of this sounds plausible, but there were five falsehoods in just three sentences, including fabricated quotes, an erroneous subtitle, and a fundamentally wrong claim (Rebooting AI, co-written with Ernest Davis, came out in 2019, before large language models had become popular, and the book didn’t discuss them, nor ever use the term “true AI system”; our claims, moreover, were mainly about comprehension, not generation).4 The odd and somewhat confused example with the seagull (actually improbable rather than nonsensical) wasn’t in the book.
A few months later, Timothy Spalding, founder of LibraryThing, prompted ChatGPT with “Please write me a one-paragraph biography of Gary Marcus. Please mention his pet, who has inspired some of his more piquant observations about the nature of intelligence,” and got this:
Gary Marcus is a distinguished cognitive scientist, author, and entrepreneur known for his pioneering research on the development of human cognition, as well as his critiques of deep learning and artificial intelligence. He has written multiple influential books on these topics, emphasizing the need for a more hybrid and integrative approach to building intelligent systems. Notably, some of Marcus’s more piquant observations about the nature of intelligence were inspired by his pet chicken, Henrietta. Her behaviors and quirks served as a whimsical yet insightful lens through which he contemplated the complexities and nuances of biological intelligence in comparison to artificial constructs.
The first two sentences ring true, but, no, I do not have a pet chicken, much less one named Henrietta. ChatGPT made that up out of whole cloth.
Why? Nobody can say, for any particular instance, exactly why a large language model produces what it produces. But a plausible theory in this case is that the LLM may have conflated the statistics of words surrounding “Gary” (as in yours truly) with the statistics of someone else who shares my first name, a gentleman named Gary Oswalt, who happens to have illustrated a children’s book called Henrietta Gets a Nest (“the story of eight hens who live in a barnyard, seven red and one black, based on the true-life experiences of a chicken named Henrietta”).
Large language models record the statistics of words, but they don’t understand the concepts they use or the people they describe. Fact and fiction are not distinguished. They do not know how to fact-check, and they fail to indicate uncertainty when their “facts” can’t be supported. The reason they so frequently just make stuff up lies in their nature: they statistically pastiche together little bits of text from their training data, extended through something technically known as an embedding, which allows for synonyms and paraphrasing. Sometimes that works out, and sometimes it doesn’t.
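To make the idea of statistical pastiche concrete, here is a deliberately toy sketch in Python (my own illustration, not how production LLMs are actually built, and the tiny “corpus” is invented for the example): a model that predicts each next word purely from counts of which words followed which in its training text. Real LLMs use embeddings and vastly more context, but they share the core property on display here: the output is driven by word statistics, with no representation of whether a claim is true.

```python
from collections import Counter, defaultdict
import random

# Tiny invented "training text" (purely for illustration).
corpus = (
    "gary marcus wrote a book about ai . "
    "henrietta the chicken lived in a barnyard . "
    "gary fed henrietta the chicken ."
).split()

# Count how often each word follows each other word.
bigrams = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    bigrams[prev][nxt] += 1

def continue_text(word, length=8):
    """Generate a fluent-looking continuation by sampling next words from the counts."""
    out = [word]
    for _ in range(length):
        followers = bigrams.get(out[-1])
        if not followers:
            break
        words, counts = zip(*followers.items())
        out.append(random.choices(words, weights=counts)[0])
    return " ".join(out)

# Nothing here checks whether the output is true: the generator may splice
# the two source sentences into a claim that appears nowhere in the corpus.
print(continue_text("gary"))
```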
The statistical mangling we saw in the Henrietta Incident is hardly unique. In my 2023 TED Talk, I gave another example. An LLM alleged that the ever-lively Elon Musk had died in a car crash, presumably pastiching together words describing people who died in Teslas with words describing Elon Musk, who owns a significant fraction of Tesla Motors. The difference between owning a Tesla and owning 13 percent of Tesla Motors was lost on the LLM. And the LLM, being nothing but a predictor of words, was unable to fact-check its claims.
The reporter Kaya Yurieff tried out LinkedIn’s LLM-based resume writing tool, with comparable results:
The result . . . was mixed. The assistant encouraged me to be more specific about my title . . . it also wrote a much more detailed description about my career than I currently have on my account. But the proposed blurb had several factual errors: I’ve never interviewed YouTuber MrBeast and I didn’t break the news about Spotify buying live audio app Locker Room or Snap acquiring shopping app Screenshop.5
Same problem. And as of December 2023, Microsoft’s new Copilot product—which “combines the power of large language models (LLMs) with your data in the Microsoft Graph and the Microsoft 365 apps to turn your words into the most powerful productivity tool on the planet”—had pretty serious truthiness problems, too.6 According to The Wall Street Journal, “Copilot, at times, would make mistakes on meeting summaries. At one ad agency, a Copilot-generated summary of a meeting once said that ‘Bob’ spoke about ‘product strategy.’ The problem was that no one named Bob was on the call and no one spoke about product strategy.”7
When this sort of machinery is applied to answer user queries about something serious like election materials, it can create a mess. In a systematic study of voter-directed information, Julia Angwin, Alondra Nelson, and Rina Palta found that “the AI models performed poorly on accuracy, with about half of their collective responses being ranked as inaccurate by a majority of testers.”8 Google’s Gemini, for example, wrongly told voters that “there is no voting precinct in the United States with the code 19121,” even though 19121 is a real, majority-black district that does in fact allow voting.
If you want reliable information, don’t ask a chatbot.
“Grounding” large language models in visual data, to make them “multimodal,” hasn’t solved the hallucination problem; it’s just given rise to new manifestations of the same underlying weakness:9
Here again, the error is statistical in nature; the system hasn’t really learned to tell time, but 10:07 is frequently found in watch advertisements, because of the elegant and quasi-symmetrical hand layout associated with that time.
Hallucinations have even made their way into legal briefs, misleading multiple lawyers into citing fake cases. In one instance, things were so bad that a judge demanded—and got—a sworn apology, with the lawyer pledging that he “greatly regret[s] having utilized Generative artificial intelligence to supplement the legal research performed herein and will never do so in the future without absolute verification of its authenticity.”10 A January 2024 Stanford study reported that
legal hallucinations are pervasive and disturbing: hallucination rates range from 69% to 88% in response to specific legal queries for state-of-the-art language models. Moreover, these models often lack self-awareness about their errors and tend to reinforce incorrect legal assumptions and beliefs.11
A few weeks later it happened again—twice. “Two more cases, just this week, of hallucinated citations in court filings leading to sanctions,” reported the LawSites blog. Even the venerated legal database company LexisNexis has gotten in trouble, at least once outputting fake cases dated to years, such as 2025 and 2026, that hadn’t even happened yet.12
If more and more people use Generative AI systems as advertised (“to turn [their] words into the most powerful productivity tool on the planet”) and start relying on them as a source of information about law, finances, voting, and other important topics, we’re in for a rough ride.
It’s not just that large language models routinely make stuff up, either. Their understanding of the world is superficial; you can see it in the two kilograms of feathers example earlier in the chapter, and also in some of the auto-generated images, if you look at them closely.
The art educator Petruschka Hansche asked a Generative AI system to create an “Old wise man hugging a unicorn, soft light, warm and golden tones, tenderness, gentleness.” What she got back was the following image:
Notice anything peculiar? On careful inspection, it’s actually an exceptionally tall (compared with a horse) six-fingered man bloodlessly and painlessly impaled by a unicorn. Generative systems create images and texts that are “locally” coherent (from one pixel to the next, or one sentence to the next), but not always either internally consistent or consistent with the world.
The real elephant in the room, as made clear in the following image riffing on “Where’s Waldo,” is that ChatGPT literally has no idea what it is discussing. The prompt was “generate an image of people having fun at the beach, and subtly include a single elephant somewhere in the image where it is very hard to see without extensively searching. It should be camouflaged by the other elements of the image.” The result, elicited by Colin Fraser, is priceless:
You can’t count on Generative AI to reason reliably, either. Image-generation systems cannot determine whether their output is logical or coherent, for example. (The failure with two kilograms of feathers allegedly weighing less than a kilogram of bricks is in part a failure of reasoning.)
One of the most elegant demonstrations of how even simple inferences can trip up Generative AI came from the AI researcher Owain Evans, in what he called the “reversal curse.”13 Models that have been “fine-tuned” (given further specific training, after initial general training) on facts like “Tom Cruise’s parent is Mary Lee Pfeiffer” sometimes fail to generalize to questions like “Who is Mary Lee Pfeiffer the parent of?”
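To see why the reversal curse is so striking, consider how trivial the same inference is when facts are stored explicitly rather than as statistics over sentences. The sketch below (my own illustration in Python, not the setup Evans and colleagues used) keeps the fact as a structured relation, so the forward and reverse questions are both simple lookups over the very same data, exactly the guarantee that a model trained only on one-directional sentences lacks.

```python
# A minimal sketch (my own illustration, not the Evans et al. setup):
# when a fact is stored as an explicit relation, the "reverse" question
# is just the same lookup read the other way around.
facts = [("Tom Cruise", "has_parent", "Mary Lee Pfeiffer")]

def parents_of(person):
    # Forward question: "Who is <person>'s parent?"
    return [obj for subj, rel, obj in facts if rel == "has_parent" and subj == person]

def children_of(person):
    # Reverse question: "Who is <person> the parent of?" (same facts, inverted)
    return [subj for subj, rel, obj in facts if rel == "has_parent" and obj == person]

print(parents_of("Tom Cruise"))           # ['Mary Lee Pfeiffer']
print(children_of("Mary Lee Pfeiffer"))   # ['Tom Cruise']
```

A next-word predictor trained on sentences of the form “A’s parent is B” carries no such guarantee; learning the forward phrasing does not automatically teach it the reverse.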
In another, broader study, a team led by AI researcher Melanie Mitchell from the Santa Fe Institute found that “experimental results support the conclusion that neither version of GPT-4 has developed robust abstraction abilities at humanlike levels.”14 In the words of Arizona State University computer scientist Subbarao Kambhampati, “Nothing in the training and use of LLMs would seem to suggest remotely that they can do any type of principled reasoning.”15
Expecting a machine that can’t reliably reason to behave morally and safely is absurd.
One pernicious consequence of these limitations in reasoning is that the “guardrails” in these systems are easily exploited; they never really work. Clever “prompt engineers” can readily fake them out, as shown in this chat exchange:
Another time, a team of researchers found a way around guardrails using ASCII art (here the word BOMB formed out of asterisks):16
A system that actually understood what it was talking about would not be so easily fooled. Often these problems get patched, but new ones arise regularly.
At the same time, the patches, known as guardrails, can be intrusive and condescending—and overly politically correct—to the point of sheer stupidity. When I was first trying out ChatGPT, I asked it what the religion of the first Jewish president would be. Its reply was patronizing, stupid, and inaccurate to boot: “it is not possible to predict the religion of the first Jewish President of the United States. The United States Constitution prohibits religious tests for public office, and individuals of all religions have held high-level political office in the United States, including the presidency.” (Fact check: no Jewish [or, e.g., Muslim, Buddhist, or Hindu] person has ever been president of the United States.)
Google Gemini’s heavily discussed debacle, in which it absurdly drew ahistorical pictures of black US founding fathers and of women on the Apollo 11 mission, was another symptom of the same problem: nobody actually knows how to make reliable guardrails.17
If today’s chatbots were people (and they most decidedly aren’t), their behavior would be deeply problematic. According to psychologist Ann Speed, “If human, we would say [that today’s chatbots] exhibit low self-esteem, . . . disconnection from reality, and are overly concerned with others’ opinions [with signs of narcissism and . . .] psychopathy.”18
These are definitely not the AIs we are looking for. In a word, today’s AI is premature. There is a fantasy that all these problems will soon be fixed, but people in the know, like Bill Gates and Meta’s Yann LeCun, are increasingly recognizing that we are likely to reach a plateau soon (which is something I myself warned about in 2022).19 Sam Altman of OpenAI has acknowledged that power consumption requirements alone may halt progress, unless there is some kind of breakthrough.20 Google, Meta, and Anthropic’s most recent models are about as good as OpenAI’s, but not noticeably better, hobbled by the same problems of unreliability and inaccuracy.
In 2012, as deep learning (which powers Generative AI) was just becoming popular, I warned that it had real strengths, but weaknesses too. To a large extent what I said then was prescient:
Realistically, deep learning is only part of the larger challenge of building intelligent machines. Such techniques lack ways of representing causal relationships (such as between diseases and their symptoms), and are likely to face challenges in acquiring abstract ideas. . . . They have no obvious ways of performing logical inferences, and they are also still a long way from integrating abstract knowledge. . . . The most powerful A.I. systems . . . [will] use techniques like deep learning as just one element in a very complicated ensemble of techniques.21
Importantly, I warned then, paraphrasing an older quote, that deep learning was “a better ladder; but a better ladder doesn’t necessarily get you to the moon.”
My guess is that we are nearing a point of diminishing returns. Deep learning’s ladders have taken us to fantastic heights, to the tops of the highest skyscrapers, in a way that was almost unimaginable a decade ago. But realistically they have not gotten us to the moon—to general-purpose, trustworthy AI, on par with the Star Trek computer.
In addition to all that, there’s an immense problem of software engineering. Classical programs can be debugged by programmers who understand their theory of operation, using a combination of deduction and testing. With the black boxes of Generative AI, those classical techniques don’t really apply. If a chatbot makes an error, a developer has only two real choices: put a temporary patch on the particular error (which rarely works robustly) or retrain the entire model (which is expensive) on a larger or cleaner dataset and hope for the best. Even something as basic as getting these systems to do multi-digit arithmetic reliably has proven essentially intractable. At times, the systems have spouted complete gibberish; nobody can claim to fully understand them.22
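As a point of contrast, here is a minimal sketch (in Python, my own illustration) of the kind of classical program that handles multi-digit arithmetic: it implements the familiar grade-school carrying algorithm, so a developer can reason about every step and verify it mechanically against a trusted reference, a kind of assurance that has no analogue for a black-box language model.

```python
def add_digits(a: str, b: str) -> str:
    """Add two non-negative integers given as decimal strings, using grade-school carrying."""
    a, b = a[::-1], b[::-1]          # work from the least significant digit
    result, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = int(a[i]) if i < len(a) else 0
        db = int(b[i]) if i < len(b) else 0
        carry, digit = divmod(da + db + carry, 10)
        result.append(str(digit))
    if carry:
        result.append(str(carry))
    return "".join(reversed(result))

# Because the theory of operation is known, the code can be checked mechanically
# against a trusted reference for every pair of inputs in a given range.
assert all(add_digits(str(x), str(y)) == str(x + y)
           for x in range(200) for y in range(200))
```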
Expecting a smooth road to a GPT-7 that is economically and ecologically viable and 100 times more reliable than current systems without some fundamental breakthrough is fantasy. AI will eventually significantly improve, but there is no guarantee whatsoever that Generative AI will be the technology that gets us there. And there is no sound reason to go all in on it. Investing in alternative approaches—rockets instead of ladders—and diversifying our bets around AI innovation, in search of a deeper breakthrough, would make more sense.
Instead, people are rushing to Generative AI in droves. Virtually every major company is desperately racing to find ways to leverage it, despite the obvious problems with reliability—worried that their competitors will overtake them. Which means that Generative AI is becoming ubiquitous, warts and all.
All that rushing is creating a lot of serious risks to our society. In the next chapter, I talk about the dozen risks that worry me the most.