Jagged Intelligence

The dangerous unknowns at the heart of LLMs

Melanie Mitchell

IN 2023, A FEW MONTHS after OpenAI released the AI chatbot ChatGPT, Terrence Sejnowski, a neuroscientist and pioneer in the field of neural networks, wrote:

Something is beginning to happen that was not expected even a few years ago. A threshold was reached, as if a space alien suddenly appeared that could communicate with us in an eerily human way. . . . Some aspects of their behavior appear to be intelligent, but if it’s not human intelligence, what is the nature of their intelligence?

Sejnowski’s astonishment at ChatGPT’s language abilities was shared by longtime AI researchers and ordinary people alike. The chatbot could generate fluent natural language. It could answer questions; write essays, poems, and rap lyrics; compose text in the style of famous authors; do students’ homework; and generate convincing peer reviews of scientific papers.

ChatGPT was the first large language model (LLM) chatbot to be easily accessible to the general public, and other companies soon followed with competitors, such as Google’s Gemini, Anthropic’s Claude, Meta’s Llama, and Microsoft’s Copilot. Improved versions have been released every few months. While early versions of LLMs—the systems underlying today’s chatbots—were dismissed as “stochastic parrots” and “autocomplete on steroids,” current ones often give the appearance of understanding language, and the physical and social worlds described by language, in a deep, humanlike way.

AI has become increasingly skillful. It can engage in conversations on seemingly any topic, write complex pieces of code, and generate extraordinarily realistic images and videos, prizewinning art, and chart-topping songs. Large AI systems have recently earned gold medals at the International Mathematical Olympiad, helped humans solve long-standing problems in mathematics and biology, and contributed to major improvements in weather prediction and drug design, among other achievements. In 2024, in recognition of this astounding progress, AI researchers were awarded Nobel Prizes in both Physics and Chemistry. An avalanche of tech-company blog posts, breathless media, and assertions from AI experts has communicated to the public an overriding narrative: after decades of unfulfilled promises, true artificial intelligence has finally arrived, and it will change everything about our lives.

THAT LLMS APPEAR to understand language, though, does not mean they actually understand it as humans do. Indeed, while AI boosters have touted the superhuman capabilities of LLMs and their astounding successes, other AI users have noticed, and reported on, their puzzling, unhumanlike failures, which have not gone away as these systems have progressed. How can a system that has exceeded human performance on advanced math problems sometimes fail at simple elementary-school-level problems? Why do these systems answer a question perfectly when it is worded one way but struggle when it is worded in a different but (to a human) equivalent way? How can a system that generates accurate and incisive summaries of books also produce similarly confident and authoritative-sounding summaries of nonexistent titles? How can a system that has been extensively trained to refuse dangerous requests be easily fooled by “prompt engineering” into cheerfully providing the prohibited information?

In general, today’s AI systems perform extremely well until, often unexpectedly, they don’t. They are inconsistent, lack a sense of when they should be confident or uncertain about their answers, are susceptible to manipulative prompting, and struggle with tasks that differ sufficiently from their training data. For example, one study performed by AI researchers at Apple showed that adding irrelevant information to simple word problems in a widely used mathematics benchmark test caused several AI models to perform dramatically worse than they did when given the problems without the extraneous information. Here’s one example, in which the remark about smaller kiwis is the irrelevant addition: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday, but 5 of them were a bit smaller than average. How many kiwis does Oliver have?” In 2025, researchers from Apple found that all of the models they tested performed substantially worse on such variations than on the original problems.
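
To see why the added detail is irrelevant, it may help to work the kiwi problem through. The short Python sketch below uses only the numbers stated in the problem; the variable names are illustrative, and the "distracted" answer is a hypothetical mistake of the kind the added detail invites, not a result reported in this essay.

```python
# Quantities stated in the word problem.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# The remark that 5 kiwis were smaller than average describes their size,
# not their count, so it does not enter the sum.
smaller_than_average = 5

correct_total = friday + saturday + sunday

# Hypothetical distracted answer: treating the irrelevant detail as a subtraction.
distracted_total = correct_total - smaller_than_average

print(correct_total)     # 190
print(distracted_total)  # 185
```

The correct total is the same with or without the extra clause, which is what makes a drop in performance on such variations so telling.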

A new term has been coined to describe AI in its current form: “jagged intelligence.” The term captures the fact that the landscape of AI capabilities is profoundly uneven: the tools demonstrate excellent abilities on certain problems but surprising failures on other similar problems. For humans, one kind of skill can often predict abilities at similar skills; this is not the case in the jagged landscape of AI. Last fall, Ilya Sutskever, a cofounder of OpenAI, argued that there are no easy fixes to this problem: “These models somehow just generalize dramatically worse than people. It’s a very fundamental thing.”

THE INCONSISTENT INTELLIGENCE of these AI systems arguably results from the intrinsic limitations of LLMs, which are neural networks that have been trained to mimic human language. Neural networks are computer programs whose workings are, roughly speaking, inspired by brains, and they are the basis of most modern AI applications, including image and speech recognition, language translation, game-playing programs, and chatbots. (LLMs are far from being the only kind of AI system, though informal use of the term “AI” often refers to LLMs.) A neural network simulates idealized neurons, groups of which are arranged in layers, like the floors in a tall building. The input to the network—say, the pixels making up a photo of a dog—is given at the bottom layer and repeatedly transformed by mathematical operations that are applied at each layer, with the results fed to the next layer in the stack. Finally, the last, highest layer encodes the output of the network—for example, the dog’s breed.
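
The layered flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not how any production system is built: the weights, the three-number "image," and the choice of tanh as the per-neuron operation are all made up for illustration.

```python
import math

def layer(inputs, weights, biases):
    """Apply one layer: a weighted sum per neuron, then a tanh squashing."""
    return [
        math.tanh(sum(w * x for w, x in zip(neuron_weights, inputs)) + b)
        for neuron_weights, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the input through each layer in order, bottom to top."""
    activations = inputs
    for weights, biases in layers:
        activations = layer(activations, weights, biases)
    return activations

# Made-up numbers for illustration only: 3 inputs -> 2 hidden neurons -> 2 outputs.
toy_layers = [
    ([[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]], [0.0, 0.1]),
    ([[1.0, -1.0], [-0.7, 0.4]], [0.0, 0.0]),
]
pixels = [0.9, 0.1, 0.4]
output = forward(pixels, toy_layers)
print(output)  # two numbers, one per output neuron
```

In a real image classifier, each output number would correspond to a category such as a dog breed, and the stack would have many more layers and vastly more weights.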

The LLMs underlying chatbots such as ChatGPT or Claude are a special kind of neural network called a “transformer,” which is designed to process sequences, such as words in a text or conversation. (The words are each encoded and input to the LLM as a vector—a list of numbers.) At each training iteration, the LLM is given as input a sequence of text with the last word missing. Its goal is to predict what the next word in the sequence will be. The LLM’s output is a probability for each word in its vocabulary, representing that word’s likelihood of being next in the sequence.

For example, given the sequence “The professor gave the student a lower grade than he…,” some words (for example, grapes, under, apocalypse) should have a very low probability of being the next word, while others (such as deserved or expected) should have a much higher probability. When given the text, the LLM generates a guess, and its parameters (or “weights”—strengths of connections between neurons) are modified to make the actual next word in the sequence have a higher probability. These modifications are made by a separate training program (often called an “optimizer”), using an algorithm called “back-propagation,” which computes which weights should be modified, and by how much, given the error made by the LLM in predicting the next word.
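
A minimal sketch of this prediction-and-adjustment loop follows. It makes two simplifying assumptions that should be read loudly: the vocabulary has only five words with hypothetical starting scores, and the update is applied directly to those scores, whereas in a real LLM back-propagation carries the same error signal down through the weights of every layer.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores.values())
    exps = {word: math.exp(s - peak) for word, s in scores.items()}
    total = sum(exps.values())
    return {word: e / total for word, e in exps.items()}

def training_step(scores, actual_next_word, learning_rate=1.0):
    """One gradient step on the cross-entropy loss, applied to the scores."""
    probs = softmax(scores)
    updated = {}
    for word, score in scores.items():
        target = 1.0 if word == actual_next_word else 0.0
        # Gradient of the loss with respect to this score is (probability - target).
        updated[word] = score - learning_rate * (probs[word] - target)
    return updated

# Hypothetical scores for "The professor gave the student a lower grade than he…"
scores = {"deserved": 1.0, "expected": 1.2, "grapes": -2.0, "under": -1.5, "apocalypse": -3.0}
before = softmax(scores)
after = softmax(training_step(scores, actual_next_word="deserved"))
print(round(before["deserved"], 3), round(after["deserved"], 3))
```

After the step, the probability assigned to the actual next word is higher than before, which is the whole of the training objective; everything else an LLM appears to know is a by-product of repeating this adjustment an enormous number of times.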

To learn how to construct the meaning of words in a given sequence, LLMs are trained on vast bodies of text. LLMs are called “large” because of the sheer number of their parameters. OpenAI’s GPT-3 had about 175 billion parameters; the sizes of more recent OpenAI models were not revealed by the company, but are likely substantially larger. Such a large network needs a comparably large volume of text to train it: web pages, books, transcripts of online videos—whatever the LLM’s trainers can get their hands on. The combination of enormous numbers of parameters and vast amounts of data enables LLMs to become experts at generating humanlike language. LLM designers have found that, in general, as model size and training corpus are scaled up, the language-generation performance of the model improves dramatically.

This big-data approach is how an LLM learns the general structure and patterns of human language. It takes additional training to turn an LLM into a conversational agent or a system that can generate computer code or an entity that can reason its way through hard math problems. The system needs to be trained on the back-and-forth of conversation, on how to follow a human’s instructions, on how to structure a computer program. And, as disasters with early chatbots have shown, it needs to be trained to be friendly and encouraging to its users, to refuse to reveal dangerous information, to suppress toxic text, and to avoid hallucinating in its replies. It needs, in a phrase coined by the OpenAI competitor Anthropic, to be “helpful, honest, and harmless.”

Much of this extra training comes not from text on the internet but from human workers who generate feedback to the system. For example, a human worker might be asked to choose the best response from among several LLM-generated responses to a given prompt; the human’s preferences when it comes to these kinds of choices are then used for additional LLM training. (AI companies employ thousands of human workers, often in low-income countries, to perform such training, sometimes at great emotional cost to those humans who have to sift through horrific words, images, and videos to ensure “harmlessness.”) This extra training and development means that ChatGPT and similar programs are not simply language models. They’re highly complex software systems.

THE COMPLEXITY AND black box nature of these AI systems have given rise to a fierce debate over what, exactly, the systems have learned from all this training, what their true capabilities are, and whether it makes sense to speak of them as being intelligent (the term artificial intelligence notwithstanding). It is clear that LLMs, trained with the objective of next-word prediction, have grasped the syntax of language, but is language alone sufficient to impart meaning? Does this extensive training somehow imbue LLMs with an understanding of the world?

Many AI researchers and industry advocates express no doubt that it does. Geoffrey Hinton, an AI pioneer and Nobel laureate, said back in 2023 that he thought AI systems would “be much more intelligent than us in the future,” and AI systems are routinely described today as having attained “PhD-level intelligence.” While Sutskever, now the CEO of an AI company called Safe Superintelligence, is keenly aware of these systems’ limitations, he has also argued that being good at predicting the next word in a string of text requires an understanding of the world, and that such an understanding has indeed emerged in AI systems. In a 2023 interview, Sutskever said that when a neural network is being trained to accurately predict the next words of texts, it is “learning a world model. . . . What the neural network is learning is more and more aspects of the world, of people, of the human conditions, their hopes, dreams, and motivations.”

Other prominent AI researchers, though, disagree. The neural network pioneer and Turing Award winner Yann LeCun, writing with the philosopher Jacob Browning, put it this way in a 2022 essay in the journal Noema: “A system trained on language alone will never approximate human intelligence, even if trained from now until the heat death of the universe.”

Why won’t training on language alone suffice? Because human development and learning are entirely unlike the training process of LLMs. We humans, like most other biological intelligences, are active seekers of information, not passive predictors of next tokens. Unlike LLMs, we are curious, constantly intervening in the world, driven to gain both rewarding experiences and understanding. And, perhaps most important, unlike LLMs we are embodied creatures with a sense of self, a sense of others, and (at least for most of us) a profound caring about the consequences of our actions. An LLM has no body, of course, no conception of itself as a “self,” and no self-generated desires or motivations.

Most in the AI field have treated these factors—embodiment, intrinsic drives, and engagement with the world—as irrelevant to intelligence and therefore to training machines to think. The success of language-trained AI models has encouraged this belief that intelligence, in a kind of pristine and rational form, is something that can be sifted away from the messy world of bodies, emotions, and caring. And it is true that an enormous amount of information about the physical and social worlds is captured in language, at least at the vast scale used for training today’s AI. But the jaggedness of AI models—their unhumanlike errors and lack of consistency (not just making mistakes, but making them in bizarre and unpredictable ways), their failures of generalization, their absence of grounded connection to truth or falsity—has shown that whatever “world models,” if any, such systems may have, those internal models are not like ours.

EVEN THOUGH AI systems’ intelligence is not human, these systems can perform tasks that greatly assist or even replace human intelligence in some domains. Investigating the nature of their intelligence or, less controversially, their capabilities is crucial to determining what we can trust them to do, how we need to supervise them, and what their economic and societal impacts will be. Unfortunately, the current methods we use to test these systems’ capabilities are, at best, deeply flawed.

In the AI industry and research communities, the coin of the realm for testing AI is benchmark performance. AI benchmarks are collections of test items that, taken together, are intended to evaluate a particular capability. For example, a grade school math benchmark known as GSM-8K, which consists of over eight thousand elementary-school-level word problems, has been popular for testing the basic mathematical reasoning abilities of AI systems. A more advanced math benchmark is called FrontierMath, made up of hundreds of math problems ranging from the undergraduate level to the research level. Other well-known benchmarks attempt to evaluate machines on reading comprehension, commonsense reasoning, image classification, and so on.

The incentives for excelling on benchmarks are strong in AI research and in the AI industry. Presentations at prestigious AI conferences often focus on how the authors’ proposed new methods perform against particular benchmarks; a substantial increase in “state-of-the-art performance” can lead to a paper being accepted. A tech company’s announcements about its release of a new AI system almost always feature its system’s performance against certain benchmarks, usually those on which the new system outperforms its competitors.

Performance on benchmarks is also what companies and the media point to when making claims about AI “surpassing humans.” The journal Nature reported that AI systems “very nearly match or exceed human performance in tasks including reading comprehension, image classification and competition-level mathematics.” OpenAI reported that “today’s frontier models are already approaching the quality of work produced by industry experts” on several “economically valuable” tasks. OpenAI’s CEO, Sam Altman, has claimed that ChatGPT is “a better diagnostician than most doctors in the world.” But such comparisons between humans and models are based on benchmarks, not on extended assessments in real-life settings. And benchmark performance rarely predicts an AI system’s actual capabilities in the real world.

Indeed, despite the ubiquitous use of such benchmarks for assessing AI capabilities, there is a sense in the AI research community that, in the words of one recent paper, “benchmarking is broken.” So why might an AI system’s performance on benchmarks oversell its real-world capabilities? One widespread problem is what’s called “data contamination”: an AI system might have had some of the benchmark items, or very similar kinds of questions, in its training data, making the test one of memorization rather than of reasoning. OpenAI’s GPT-4, for instance, performed exceptionally well on a computer programming benchmark when it was answering questions published before 2021, when the model’s training was stopped. On problems published after 2021, GPT-4’s performance declined sharply—a strong indication that the earlier problems were in the system’s training data.
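
One simple way to probe for contamination, in the spirit of the before-and-after comparison just described, is to split a benchmark by publication date and compare accuracy on either side of the training cutoff. The sketch below is only an illustration: the record format, the function names, and the four records are invented, and the cutoff date is a placeholder rather than any model's documented value.

```python
from datetime import date

def accuracy(results):
    """Fraction of problems solved; None when there are no problems."""
    if not results:
        return None
    return sum(1 for r in results if r["solved"]) / len(results)

def split_by_cutoff(results, training_cutoff):
    """Separate problems published before the cutoff from those published on or after it."""
    earlier = [r for r in results if r["published"] < training_cutoff]
    later = [r for r in results if r["published"] >= training_cutoff]
    return earlier, later

# Synthetic records, invented purely to exercise the functions.
results = [
    {"published": date(2020, 3, 1), "solved": True},
    {"published": date(2020, 9, 15), "solved": True},
    {"published": date(2022, 1, 10), "solved": False},
    {"published": date(2022, 6, 5), "solved": True},
]
earlier, later = split_by_cutoff(results, training_cutoff=date(2021, 1, 1))
gap = accuracy(earlier) - accuracy(later)
print(accuracy(earlier), accuracy(later), gap)
```

A large gap does not prove memorization on its own, since newer problems might simply be harder, but it is the kind of signal that should prompt a closer look.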

A second issue is that, in most cases, benchmark performance is measured only in terms of accuracy—the fraction of questions or problems the AI system answers or solves correctly. However, accuracy alone does not reveal anything about the system’s robustness and generality (for example, its ability to address simple variations on the benchmark questions or to adapt to minor rewording of the prompts used). Several studies have shown that AI systems tend to be brittle in the face of such variations, a clear illustration of the jaggedness of these systems’ abilities.
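
The gap between accuracy and robustness can be made concrete with a small sketch. Here a hypothetical "robust accuracy" counts an item as solved only if the system answers the original wording and every reworded variant correctly. The metric name, the data layout, and all of the answers are assumptions invented for illustration.

```python
def plain_accuracy(answers, key):
    """Fraction of items whose original wording is answered correctly."""
    return sum(answers[item]["original"] == key[item] for item in key) / len(key)

def robust_accuracy(answers, key):
    """Fraction of items answered correctly on the original AND every variant."""
    def all_correct(item):
        responses = [answers[item]["original"], *answers[item]["variants"]]
        return all(r == key[item] for r in responses)
    return sum(all_correct(item) for item in key) / len(key)

# Invented answers for three hypothetical benchmark items.
key = {"q1": 190, "q2": 12, "q3": 7}
answers = {
    "q1": {"original": 190, "variants": [185, 190]},
    "q2": {"original": 12, "variants": [12, 12]},
    "q3": {"original": 7, "variants": [7, 3]},
}
print(plain_accuracy(answers, key))
print(robust_accuracy(answers, key))
```

In this toy case the system scores perfectly on the original questions yet is fully consistent on only one item in three, a difference that a single accuracy number would hide.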

Another worry is that the systems are getting correct answers for the wrong reasons. For example, a 2018 paper reported that a neural network trained on images of skin lesions was highly accurate in classifying these lesions as benign or malignant. But it turned out that the network was more likely to classify a lesion as malignant if the image contained a ruler, since these measuring devices often appeared in images of malignant lesions but not benign ones. The network was basing its answers, in part, on a “spurious association,” one that the humans doing the training did not initially notice.

The phenomenon of AI systems being right for the wrong reasons has long been an issue in assessing their capabilities. Cognitive science scholars have pointed out that similar issues arise in both developmental and comparative psychology when researchers are attempting to understand the cognitive capabilities of “alien intelligences,” like babies and animals. In these fields, researchers spend years learning to carefully design control experiments to deal with the benchmark issues I’ve discussed above. Such rigorous methodologies have not yet been generally adopted by AI researchers.

There is one final concern about AI benchmarking: it has become common to assess the capabilities of AI systems on tests that were originally designed for humans, such as IQ tests or standardized tests like the bar exam. But such tests, meant to assess human abilities, have unstated assumptions built into them—for example, the assumption that humans have not memorized large portions of the internet or thousands of books during their lifetimes. Such assumptions are not necessarily valid for state-of-the-art LLMs, so any predictive value these tests have for human success may not be applicable to LLMs.

IT IS NOTABLE that the practice of evaluating AI systems on intelligence tests designed for humans implicitly accepts, and reinforces, the metaphorical framing of modern AI systems as individual intelligent agents. To many of us, it seems natural to think of these systems as analogous to individual humans, with their own distinct “personalities.” But some scholars have challenged this framing, arguing that AI systems should instead be thought of as cultural and social technologies. The idea here is that LLMs are a new kind of technology, akin to writing, the printing press, the library, markets, bureaucracy, and the internet, all technologies that allow humans to access information that has been accumulated and processed by collective human societies and cultures, and all technologies that enable large-scale coordination. In this view, AI will alter society in transformative ways, but as with other such technologies, its impacts will develop over time instead of hitting all at once. For this reason, some have called AI a “normal technology” in terms of the likely trajectory of its effect on society.

If today’s AI bots can be viewed as “normal” cultural and social technologies rather than as intelligent (or superintelligent) agents, why do most of us think of them as the latter? In part, this framing is due to AI systems’ fluent language interfaces, first-person façades, and other anthropomorphic touches. But another factor is the term artificial intelligence itself, a label that could be called AI’s original sin. When the field first began to take shape in the 1950s, there was disagreement about what to call it. John McCarthy, the computer scientist who was one of AI’s founders, pushed for the fledgling field’s name to be “artificial intelligence,” whereas his cofounders Herbert Simon and Allen Newell argued for “complex information processing”—a nonanthropomorphic phrase that is more evocative of cultural and social technologies than of smart, agentic machines.

As documented by the cognitive linguist and philosopher George Lakoff and his collaborators, humans tend to frame abstract concepts as metaphors. This is especially true when we struggle with how to conceptualize something new, be it the internet, social media, or AI. Our chosen metaphors can strongly affect how we interact with these technologies, as well as how we legislate and regulate them. Today, users often view chatbots, which interact with them using the first-person “I,” as companions, therapists, or even romantic partners, roles that cannot be played by a system understood to be more like a library or bureaucracy. The recent rise of AI “agents,” often positioned as “personal assistants,” has only solidified the metaphor of AI as a kind of individual mind.

This metaphor has also channeled scientific inquiry down certain paths: if you are a scientist operating under this metaphor, it may seem perfectly natural to give an LLM an IQ test. Moreover, you might think it informative to put LLMs through personality assessments, or to use them as substitutes for humans in psychological experiments, or even to ponder how we should treat these systems in a moral sense. In the legal sphere, this metaphor is used to defend training AI on copyrighted materials. For example, in arguing against AI copyright lawsuits, Satya Nadella, the current CEO of Microsoft, made this analogy explicit: “If I read a set of textbooks and I create new knowledge, is that fair use? . . . If everything is just copyright, then I shouldn’t be reading textbooks and learning because that would be copyright infringement.” One couldn’t make such a case for fair use if, say, Google’s Gemini were considered a library rather than a version of a person.

Similarly, the way we conceptualize AI shapes our approach to regulation. If we think of AI as a novel technological tool, its applications might be regulated like those of new medical devices or automobile designs. If, however, we see it as an intelligent agent that may go “rogue,” with a humanlike impetus for power that poses a possible threat to our very existence, this will call for a different, and more aggressive, kind of regulation, such as limiting the computing power used to train AI models and requiring “compelling evidence that AI systems will not autonomously deceive humans.”

THE RAPID DEPLOYMENT of AI systems into all aspects of our lives means that these questions about the nature of their intelligence and their real-world capabilities are no longer in the realm of philosophy and idle academic discussion, but are now playing out in real time. Yet it is unclear whether we are much closer to answering these questions, or making accurate predictions about AI’s future. Back in 1965, Herbert Simon predicted that “machines will be capable, within twenty years, of doing any work that a man can do.” That turned out to be incorrect. In 2016, Hinton said, “People should stop training radiologists now—it’s just completely obvious within five years deep learning [i.e., training neural networks] is going to do better than radiologists.” Ten years later, AI has not replaced any radiologists—in fact, there is actually a dire shortage of human radiologists. But the predictions persist. In January of this year, Dario Amodei, the CEO of Anthropic, predicted that “we might be 6–12 months away from models doing all of what software engineers do end-to-end.” Sam Altman of OpenAI has said that by 2030, AI will replace 40 percent of human jobs.

There are (at least) two problems with such predictions. The first is that they are based on AI performance on benchmarks, which, as we’ve seen, has a poor record of predicting success in the real world. The second is that AI is tested on “tasks,” such as classifying medical images, writing computer code according to given specifications, or generating ad copy for real estate listings. But human jobs are not simply collections of independent fixed tasks; most jobs require the jobholder to understand how different tasks relate to one another, to adapt to change on the fly, and, more generally, to be flexible based on the open-ended nature of the real world. As the information-and-society scholars Sayash Kapoor and Arvind Narayanan write, “The easier a task is to measure via benchmarks, the less likely it is to represent the kind of complex, contextual work that defines professional practice. By focusing heavily on capability benchmarks to inform our understanding of AI progress, the AI community consistently overestimates the real-world impact of the technology.” Because we can’t set up “sufficiently convincing simulacra of the messy complexity of the world,” they argue, it is impossible to forecast whether AI will actually be able to automate particular jobs away.

Understanding the nature of AI’s intelligence, its capabilities and weaknesses, its trustworthiness, and its most pressing dangers is, then, a formidable challenge. AI researchers, including myself, are still struggling to design effective evaluation methods, conceive insightful metaphors, and smooth out the jagged terrain of these systems’ skills. But it is also a challenge that must be met, as must the even more formidable problem of finding a way for our society to collectively decide—just like it does for all “normal” but transformative technologies—what AI should be used for and what we actually want the nature of its intelligence to be.

Melanie Mitchell is a scientist at the Santa Fe Institute, working at the intersection of artificial intelligence, cognitive science, and complex systems. She has written or edited six books, including Artificial Intelligence: A Guide for Thinking Humans.