AI: a glimpse into the black box

Danny de Kruijk
Product Lead
May 21, 2024

Over the past decade, AI researcher Chris Olah (Anthropic) has focused entirely on artificial neural networks. One question has been central: “What is happening inside these systems?” That question is more relevant than ever now that generative AI is everywhere. Large language models like ChatGPT, Gemini, and Anthropic's own Claude have amazed people with their language skills, but have also frustrated them with their tendency to make things up. Understanding what goes on inside these “black boxes” would make it easier to make them safer.

The mystery of neural networks

Olah leads a team at Anthropic that has managed to get a glimpse into this black box. They're trying to reverse engineer large language models to understand why they're generating specific outputs. According to a recently published paper, they have made significant progress.

Much as neuroscientists interpret MRI scans to identify what someone is thinking, Anthropic has delved into the digital network of its LLM, Claude. The team has identified combinations of artificial neurons that evoke specific concepts, from burritos to programming code to lethal biological weapons. This work has potentially huge implications for AI safety.

The process of mechanistic interpretability

I spoke with Olah and three of his colleagues, part of a team of 18 researchers at Anthropic. Their approach treats artificial neurons like letters of the Western alphabet: on their own they usually carry no meaning, but in combination they do. This technique, called dictionary learning, allows them to associate combinations of neurons with specific concepts.

Josh Batson, a researcher at Anthropic, explains: “We have around 17 million different concepts in an LLM, and they don't come labeled for our understanding. So we're just watching when that pattern pops up.”
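In Anthropic's published work, this dictionary learning is done by training a sparse autoencoder on the model's internal activations. The snippet below is only an illustrative PyTorch sketch of that idea, not Anthropic's actual code: it learns a dictionary of feature directions whose sparse combinations reconstruct activation vectors. The sizes and the stand-in `acts` tensor are made up for the example.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    """Toy dictionary learner: reconstruct model activations as a sparse
    combination of learned feature directions."""
    def __init__(self, d_model: int, n_features: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, n_features)
        self.decoder = nn.Linear(n_features, d_model)

    def forward(self, activations: torch.Tensor):
        # How strongly each dictionary entry ("feature") fires on this input.
        features = torch.relu(self.encoder(activations))
        reconstruction = self.decoder(features)
        return reconstruction, features

def sae_loss(activations, reconstruction, features, l1_coeff=1e-3):
    # Reconstruction error plus an L1 penalty that keeps most features
    # silent, so each activation is explained by only a few directions.
    mse = (reconstruction - activations).pow(2).mean()
    sparsity = features.abs().mean()
    return mse + l1_coeff * sparsity

# Stand-in usage: in a real run, `acts` would be activations collected from
# the language model over a large text corpus, inside a training loop.
sae = SparseAutoencoder(d_model=512, n_features=4096)
acts = torch.randn(64, 512)
reconstruction, features = sae(acts)
loss = sae_loss(acts, reconstruction, features)
loss.backward()
```

The L1 penalty is what makes the dictionary interpretable: because only a handful of features are allowed to be active at once, each one tends to specialize in a recognizable concept.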

First successes and challenges

Last year, the team began experimenting with a tiny model that uses only a single layer of neurons. The aim was to discover, in the simplest possible setting, the patterns of activity that correspond to features. After numerous failed experiments, a run nicknamed “Johnny” began associating neural patterns with concepts that appeared in the outputs.

“Chris looked at it and said, 'Holy crap. This looks great,'” says Tom Henighan, a member of Anthropic's technical staff. Suddenly the researchers could identify the features that a group of neurons encoded. They could see into the black box.

Expansion to larger models

After demonstrating that they could identify features in the small model, the researchers set out to decode a full-fledged LLM. They used Claude Sonnet, the medium-strength version of Anthropic's three current models. This worked too. One feature that stood out was associated with the Golden Gate Bridge: they mapped the set of neurons that, when firing together, indicated that Claude was “thinking” about the bridge.

The team identified millions of features, including safety-relevant ones such as “getting close to someone with a hidden motive” and “discussing biological warfare.”
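How does a pile of unlabeled features turn into something like “the Golden Gate Bridge feature”? In practice, a feature is interpreted by looking at the inputs on which it fires most strongly. Reusing the toy sparse autoencoder sketched above, a hypothetical helper for that inspection step might look like this (the function and its inputs are illustrative, not Anthropic's tooling):

```python
import torch

def top_activating_examples(sae, activation_batch, texts, feature_idx, k=5):
    """Rank text snippets by how strongly one learned feature fires on them.

    `activation_batch` is assumed to hold one model activation vector per
    snippet in `texts`; `feature_idx` is the index of the feature being
    inspected (say, a hypothetical bridge-related feature).
    """
    with torch.no_grad():
        _, features = sae(activation_batch)      # (n_texts, n_features)
        scores = features[:, feature_idx]
    top = torch.topk(scores, k=min(k, len(texts)))
    return [(texts[i], scores[i].item()) for i in top.indices.tolist()]
```

If the top-scoring snippets all mention the same thing, that shared theme becomes the feature's label.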

Manipulating the neural network

The next step was to see whether they could use that information to change Claude's behavior. They began manipulating the neural network to reinforce or suppress certain concepts. This kind of AI brain surgery has the potential to make LLMs safer and to increase their power in selected areas.
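Conceptually, reinforcing or suppressing a concept means editing the model's activations along the corresponding feature's direction during the forward pass. The sketch below, built on the same toy setup as before, is an assumption about the mechanics rather than Anthropic's implementation:

```python
import torch

def steer_feature(sae, activations, feature_idx, scale):
    """Amplify (scale > 1) or suppress (scale < 1, down to 0) one learned
    feature in a batch of model activations.

    Illustrative sketch only: in a real setup the edited activations would
    be patched back into the model's forward pass so its outputs change.
    """
    with torch.no_grad():
        _, features = sae(activations)                    # (batch, n_features)
        # Each feature owns one decoder column; that column is the direction
        # in activation space that the feature writes to.
        direction = sae.decoder.weight[:, feature_idx]    # (d_model,)
        delta = (scale - 1.0) * features[:, feature_idx]  # (batch,)
        return activations + delta.unsqueeze(-1) * direction
```

Setting the scale to zero removes a feature's contribution entirely, while a large scale pushes the model toward that concept.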

For example, by suppressing certain features, the model can be made to produce safer computer code and exhibit less bias. Conversely, when the team deliberately activated dangerous combinations of neurons, Claude produced unsafe code and scam emails.

Risks and ethical considerations

The researchers assured me that there are other, easier ways to cause such problems if a user wanted to. Still, their work raises ethical questions: could the same toolkit also be used to create AI chaos?

Anthropic isn't the only team trying to open the black box of LLMs. A group at DeepMind, led by a researcher who previously worked with Olah, is tackling the same problem, and a team at Northeastern University has developed a system for identifying and editing facts within an open-source LLM.

This is just the beginning

Anthropic's work is just a start. While its techniques for identifying features in Claude won't necessarily help decode other large language models, the team has taken an important step. Its success in manipulating the model is an encouraging sign that the features it finds are meaningful.

While there are limitations, such as the fact that dictionary learning cannot identify all concepts an LLM considers, it appears that Anthropic has made a crack in the black box. And that's when the light comes in.

This article offers a glimpse into the fascinating world of AI research and efforts to unravel the mysteries of neural networks. Anthropic's work marks an important step towards safer and more understandable AI systems.