
Anthropic takes a look into the ‘black box’ of AI models

Progress in mechanistic interpretability could lead to major advances in making large AI models safe and bias-free.

[Source photo: Iana Kunitsa/Getty Images; fotograzia/Getty Images]

Welcome to AI Decoded, Fast Company’s weekly newsletter that breaks down the most important news in the world of AI. You can sign up to receive this newsletter every week here.


Today’s AI models are so big and so complex (they’re fashioned after the human brain) that even the PhDs who design them know relatively little about how they actually “think.” Until pretty recently, the study of “mechanistic interpretability” has been mostly theoretical and small-scale. But Anthropic published new research this week showing some real progress. During its training, an LLM processes a huge amount of text and eventually forms a many-dimensional map of words and phrases, based on their meanings and the contexts within which they’re used. After the model goes into use, it draws on this “map” to calculate the most statistically likely next word in a response to a user prompt. Researchers can see all the calculations that lead to an output, says Anthropic interpretability researcher Josh Batson, but the numbers don’t say much about “how the model is thinking.”

The Anthropic researchers, in other words, wanted to learn about the higher-order concepts that large AI models use to organize words into relevant responses. Batson says his team has learned how to interrupt the model halfway through its processing of a prompt and take a snapshot of its internal state. They can see which neurons in the network are firing at the same time, and they know that certain sets of these neurons fire at the same time in response to the same types of words in a prompt. For example, Batson says they gave the models a prompt that said, “On a beautiful spring day, I was driving from San Francisco to Marin across the great span of the . . .” then interrupted the network. They saw a set of neurons firing that they knew should represent the concept of the Golden Gate Bridge. And they soon saw that the same set of neurons fired when the model was prompted by a similar set of words (or images) suggesting the Golden Gate Bridge.
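For readers who want a concrete picture, the idea of spotting neurons that fire together can be sketched in a few lines of Python. This is a toy illustration only, not Anthropic's method: the activation numbers are invented, the 0.5 firing threshold is an assumption, and the function names are hypothetical. It treats each snapshot as a list of neuron activations and keeps the neurons that fire in every prompt about the same concept.

```python
def active_neurons(snapshot, threshold=0.5):
    """Return the indices of neurons whose activation exceeds the threshold."""
    return {i for i, value in enumerate(snapshot) if value > threshold}


def shared_active_neurons(snapshots, threshold=0.5):
    """Return the neurons that fire in every one of the given snapshots."""
    active_sets = [active_neurons(s, threshold) for s in snapshots]
    return set.intersection(*active_sets) if active_sets else set()


# Hypothetical snapshots, one list of activations per prompt.
# All values are made up for illustration.
bridge_prompts = [
    [0.9, 0.1, 0.8, 0.0, 0.7],
    [0.8, 0.0, 0.9, 0.6, 0.1],
    [0.7, 0.2, 0.6, 0.1, 0.0],
]
unrelated_prompt = [0.1, 0.9, 0.0, 0.2, 0.8]

# Neurons that fire for every bridge-related prompt.
bridge_set = shared_active_neurons(bridge_prompts)

# Check whether that whole set also fires for an unrelated prompt.
fires_on_unrelated = bridge_set <= active_neurons(unrelated_prompt)
```

In this made-up data, the neurons shared by the three bridge prompts do not all fire for the unrelated prompt, which is the kind of pattern that would suggest the set tracks a single concept.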

Using this same method, they began to identify other concepts. “We learned to recognize millions of different concepts from inside the model, and we can tell when it’s using each of these,” Batson tells me. Batson’s team first tried its methods on a small and simple model, then spent the past eight months working to make those methods work on a big LLM, in this case Anthropic’s Claude 3 Sonnet.

With the ability to interpret what a model is thinking about in the middle of its process, researchers may have an opportunity to steer the AI away from bad outputs such as bias, misinformation, or directions to create a bioweapon. If researchers can interrupt the LLM’s processing of an input and inject a signal into the system, that signal could influence and alter the direction of the process, possibly toward a more desirable output. AI companies do a lot of work to steer their models away from harmful outputs, but they mainly rely on an iterative process of altering the prompts (inputs) and studying how that affects the usefulness or safety of the output. They address problems from the outside in, not from the inside out. Anthropic, which was founded by a group of former OpenAI executives who were concerned about safety, is advancing a means of purposefully influencing the process with the injection of data to steer the model in a better direction.
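The notion of injecting a signal partway through processing can also be sketched in code. This is a simplified illustration under stated assumptions, not a description of Anthropic's implementation: it assumes the model's internal state is a short list of numbers, that a concept corresponds to a known direction in that state, and that steering means adding a scaled copy of that direction. The names `steer` and `feature_activation` and all of the values are hypothetical.

```python
def feature_activation(hidden_state, feature_direction):
    """Measure how strongly a concept is present: the dot product of the
    internal state with the concept's direction."""
    return sum(h * d for h, d in zip(hidden_state, feature_direction))


def steer(hidden_state, feature_direction, strength):
    """Return a new internal state with the concept direction added in.
    Positive strength amplifies the concept; negative strength suppresses it."""
    return [h + strength * d for h, d in zip(hidden_state, feature_direction)]


# Hypothetical internal state captured midway through processing.
hidden_state = [0.2, -0.1, 0.4]
# Hypothetical unit-length direction standing in for one concept.
concept_direction = [1.0, 0.0, 0.0]

before = feature_activation(hidden_state, concept_direction)

# Amplify the concept.
amplified = steer(hidden_state, concept_direction, strength=5.0)
after_amplify = feature_activation(amplified, concept_direction)

# Suppress the concept by subtracting exactly its current activation.
suppressed = steer(hidden_state, concept_direction, strength=-before)
after_suppress = feature_activation(suppressed, concept_direction)
```

In a real model the modified state would then be passed on to the remaining layers, so the change would carry through to the final output. This sketch stops at the modified state.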


Scale AI, which bills itself as the “data foundry for AI,” announced this week that it raised a $1 billion funding round, bringing the company’s valuation to $14 billion. The round was led by the venture capital firm Accel, with participation by a slew of well-known names, including Y Combinator, Index Ventures, Founders Fund, Nvidia, and Tiger Global Management. New investors include Cisco Investments, Intel Capital, AMD Ventures, Amazon, and Meta.

As excitement about generative AI has grown, so has the realization among enterprises that generative AI models are only as good as the data they’re trained on. Scale benefits from both of those things. The San Francisco company was working on generating well-annotated training data for AI models well before the appearance of ChatGPT at the end of 2022. Scale has developed techniques for producing synthetic training data, as well as data that is annotated with help from experts in areas such as physics.

Scale, which has worked extensively with agencies within the defense and intelligence communities, plans to use the new capital to pump out more AI training data to meet increasing demand. It also plans to build upon its prior work in helping enterprises evaluate their AI models.


Google announced last week that its version of AI search—now called AI Overviews—is a regular part of its storied search service. This AI update sent shockwaves through the advertising world, with some brands extremely curious about how they might advertise in this new paradigm. AI Overviews, after all, are very different from the old “10 blue links” style of search results that Google helped popularize. They attempt to crib specific information from websites and from Google data sources (flights or maps data, perhaps) to offer a direct, self-contained answer to a user’s query.

A week after the Overviews announcement, Google says it’s ready to start testing new kinds of ads that can fit into AI Overviews. The company says it’ll soon start putting both Search and Shopping ads within AI Overviews, showing the ads to users in the U.S. The ads will be clearly labeled as “sponsored,” Google says, and will be included only when they’re “relevant to both the query and the information in the AI Overview.” The search giant says it’ll listen to feedback from advertisers and continue testing new ad formats for Overviews.

There’s a risk that the new ads will dilute the intent of AI-generated search results, which is to offer a direct answer to a question by pulling in the very best and most relevant information available. If users see that someone is paying for their information to appear within that answer, they may begin to question the credibility of the other information in the “Overview” presentation. To my eye, Google’s first two ideas for AI search ads look too much like products of the old “10 blue links” paradigm.



Mark Sullivan is a senior writer at Fast Company, covering emerging tech, AI, and tech policy. Before coming to Fast Company in January 2016, Sullivan wrote for VentureBeat, Light Reading, CNET, Wired, and PCWorld.
