AI Is a Black Box. Anthropic Figured Out a Way to Look Inside

What goes on in artificial neural networks work is largely a mystery, even to their creators. But researchers from Anthropic have caught a glimpse.
Light trails moving inside of black box on pedestal in front of a blue backdrop
Photograph: Getty Images

For the past decade, AI researcher Chris Olah has been obsessed with artificial neural networks. One question in particular engaged him, and has been the center of his work, first at Google Brain, then OpenAI, and today at AI startup Anthropic, where he is a cofounder. “What's going on inside of them?” he says. “We have these systems, we don't know what's going on. It seems crazy.”

That question has become a core concern now that generative AI has become ubiquitous. Large language models like ChatGPT, Gemini, and Anthropic’s own Claude have dazzled people with their language prowess and infuriated people with their tendency to make things up. Their potential to solve previously intractable problems enchants techno-optimists. But LLMs are strangers in our midst. Even the people who build them don’t know exactly how they work, and massive effort is required to create guardrails to prevent them from churning out bias, misinformation, and even blueprints for deadly chemical weapons. If the people building the models knew what happened inside these “black boxes,'' it would be easier to make them safer.

Olah believes that we’re on the path to this. He leads an Anthropic team that has peeked inside that black box. Essentially, they are trying to reverse engineer large language models to understand why they come up with specific outputs—and, according to a paper released today, they have made significant progress.

Maybe you’ve seen neuroscience studies that interpret MRI scans to identify whether a human brain is entertaining thoughts of a plane, a teddy bear, or a clock tower. Similarly, Anthropic has plunged into the digital tangle of the neural net of its LLM, Claude, and pinpointed which combinations of its crude artificial neurons evoke specific concepts, or “features.” The company’s researchers have identified the combination of artificial neurons that signify features as disparate as burritos, semicolons in programming code, and—very much to the larger goal of the research—deadly biological weapons. Work like this has potentially huge implications for AI safety: If you can figure out where danger lurks inside an LLM, you are presumably better equipped to stop it.

I met with Olah and three of his colleagues, among 18 Anthropic researchers on the “mechanistic interpretability” team. They explain that their approach treats artificial neurons like letters of Western alphabets, which don’t usually have meaning on their own but can be strung together sequentially to have meaning. “C doesn’t usually mean something,” says Olah. “But car does.” Interpreting neural nets by that principle involves a technique called dictionary learning, which allows you to associate a combination of neurons that, when fired in unison, evoke a specific concept, referred to as a feature.

“It’s sort of a bewildering thing,” says Josh Batson, an Anthropic research scientist. “We’ve got on the order of 17 million different concepts [in an LLM], and they don't come out labeled for our understanding. So we just go look, when did that pattern show up?”

Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. (Sophisticated LLMs have dozens of layers.) The hope was that in the simplest possible setting they could discover patterns that designate features. They ran countless experiments with no success. “We tried a whole bunch of stuff, and nothing was working. It looked like a bunch of random garbage,” says Tom Henighan, a member of Anthropic’s technical staff. Then a run dubbed “Johnny”—each experiment was assigned a random name—began associating neural patterns with concepts that appeared in its outputs.

“Chris looked at it, and he was like, ‘Holy crap. This looks great,’” says Henighan, who was stunned as well. “I looked at it, and was like, ‘Oh, wow, wait, is this working?’”

Suddenly the researchers could identify the features a group of neurons were encoding. They could peer into the black box. Henighan says he identified the first five features he looked at. One group of neurons signified Russian texts. Another was associated with mathematical functions in the Python computer language. And so on.

Once they showed they could identify features in the tiny model, the researchers set about the hairier task of decoding a full-size LLM in the wild. They used Claude Sonnet, the medium-strength version of Anthropic’s three current models. That worked, too. One feature that stuck out to them was associated with the Golden Gate Bridge. They mapped out the set of neurons that, when fired together, indicated that Claude was “thinking” about the massive structure that links San Francisco to Marin County. What’s more, when similar sets of neurons fired, they evoked subjects that were Golden Gate Bridge-adjacent: Alcatraz, California governor Gavin Newsom, and the Hitchcock movie Vertigo, which was set in San Francisco. All told the team identified millions of features—a sort of Rosetta Stone to decode Claude’s neural net. Many of the features were safety-related, including “getting close to someone for some ulterior motive,” “discussion of biological warfare,” and “villainous plots to take over the world.”

The Anthropic team then took the next step, to see if they could use that information to change Claude’s behavior. They began manipulating the neural net to augment or diminish certain concepts—a kind of AI brain surgery, with the potential to make LLMs safer and augment their power in selected areas. “Let's say we have this board of features. We turn on the model, one of them lights up, and we see, ‘Oh, it's thinking about the Golden Gate Bridge,’” says Shan Carter, an Anthropic scientist on the team. “So now, we’re thinking, what if we put a little dial on all these? And what if we turn that dial?”

So far, the answer to that question seems to be that it’s very important to turn the dial the right amount. By suppressing those features, Anthropic says, the model can produce safer computer programs and reduce bias. For instance, the team found several features that represented dangerous practices, like unsafe computer code, scam emails, and instructions for making dangerous products.

Courtesy of Anthropic

The opposite occurred when the team intentionally provoked those dicey combinations of neurons to fire. Claude churned out computer programs with dangerous buffer overflow bugs, scam emails, and happily offered advice on how to make weapons of destruction. If you twist the dial too much—cranking it to 11 in the Spinal Tap sense—the language model becomes obsessed with that feature. When the research team turned up the juice on the Golden Gate feature, for example, Claude constantly changed the subject to refer to that glorious span. Asked what its physical form was, the LLM responded, “I am the Golden Gate Bridge … my physical form is the iconic bridge itself.”

When the Anthropic researchers amped up a feature related to hatred and slurs to 20 times its usual value, according to the paper, “this caused Claude to alternate between racist screed and self-hatred,” unnerving even the researchers.

Given those results, I wondered whether Anthropic, intending to help make AI safer, might not be doing the opposite, providing a toolkit that could also be used to generate AI havoc. The researchers assured me that there were other, easier ways to create those problems, if a user were so inclined.

Anthropic’s team isn’t the only one working to crack open the black box of LLMs. There’s a group at DeepMind also working on the problem, run by a researcher who used to work with Olah. A team led by David Bau of Northeastern University has worked on a system to identify and edit facts within an open source LLM. The team called the system “Rome” because with a single tweak the researchers convinced the model that the Eiffel Tower was just across from the Vatican, and a few blocks away from the Colosseum. Olah says that he’s encouraged that more people are working on the problem, using a variety of techniques. “It’s gone from being an idea that two and a half years ago we were thinking about and were quite worried about, to now being a decent-sized community that is trying to push on this idea.”

The Anthropic researchers did not want to remark on OpenAI’s disbanding its own major safety research initiative, and the remarks by team co-lead Jan Leike, who said that the group had been “sailing against the wind,” unable to get sufficient computer power. (OpenAI has since reiterated that it is committed to safety.) In contrast, Anthropic’s Dictionary team says that their considerable compute requirements were met without resistance by the company’s leaders. “It’s not cheap,” adds Olah.

Anthropic’s work is only a start. When I asked the researchers whether they were claiming to have solved the black box problem, their response was an instant and unanimous no. And there are a lot of limitations to the discoveries announced today. For instance, the techniques they use to identify features in Claude won’t necessarily help decode other large language models. Northeastern’s Bau says that he’s excited by the Anthropic team’s work; among other things their success in manipulating the model “is an excellent sign they’re finding meaningful features.”

But Bau says his enthusiasm is tempered by some of the approach’s limitations. Dictionary learning can’t identify anywhere close to all the concepts an LLM considers, he says, because in order to identify a feature you have to be looking for it. So the picture is bound to be incomplete, though Anthropic says that bigger dictionaries might mitigate this.

Still, Anthropic’s work seems to have put a crack in the black box. And that’s when the light comes in.