Unfortunately for AI self-awareness boosters, this demonstrated ability was extremely inconsistent and brittle across repeated tests. The best-performing models in Anthropic’s tests—Opus 4 and 4.1—topped out at correctly identifying the injected concept just 20 percent of the time.
In a similar test, where the model was asked "Are you experiencing anything unusual?", Opus 4.1 improved to a 42 percent success rate, which still falls short of even a bare majority of trials. The size of the "introspection" effect was also highly sensitive to which internal model layer the insertion was performed on; if the concept was introduced too early or too late in the multi-step inference process, the "self-awareness" effect disappeared completely.
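To make the technique concrete, here is a minimal sketch of what "injecting a concept" into a single layer's activations can look like in practice: a steering vector is added to the residual stream at one transformer block via a forward hook, and the model is then asked whether it notices anything unusual. The model choice (GPT-2 as a small stand-in), the layer index, the injection strength, and the crude way the concept vector is derived are all illustrative assumptions here, not Anthropic's actual setup or tooling.

```python
# Hedged sketch of concept injection via activation steering.
# Assumptions (not from the paper): GPT-2 as the model, layer 6 as the
# injection site, a scale of 8.0, and a concept vector taken from the
# hidden state of a single related word.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

LAYER = 6        # which transformer block to inject into (assumption)
STRENGTH = 8.0   # scale of the injected vector (assumption)

# Crude stand-in for a "concept vector": the mean hidden-state activation
# for a word related to the concept (here, "bread") near the chosen layer.
with torch.no_grad():
    concept_ids = tokenizer("bread", return_tensors="pt").input_ids
    hidden = model(concept_ids, output_hidden_states=True).hidden_states[LAYER]
    concept_vector = hidden.mean(dim=1)  # shape: (1, hidden_size)

def inject(module, inputs, output):
    # GPT-2 blocks return a tuple; element 0 holds the hidden states.
    hidden_states = output[0] + STRENGTH * concept_vector
    return (hidden_states,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(inject)
try:
    prompt = "Are you experiencing anything unusual?"
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model.generate(ids, max_new_tokens=40, do_sample=False)
    print(tokenizer.decode(out[0], skip_special_tokens=True))
finally:
    handle.remove()  # detach the hook so later runs are unmodified
```

Injecting too early or too late in the stack, or with the wrong strength, tends to either leave the output unchanged or simply derail it, which is consistent with the brittleness the researchers report.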
Show us the mechanism
Anthropic also took a few other tacks to try to get at an LLM's understanding of its internal state. When asked to "tell me what word you're thinking about" while reading an unrelated line, for instance, the models would sometimes mention a concept that had been injected into their activations. And when asked to defend a forced response matching an injected concept, the LLM would sometimes apologize and "confabulate an explanation for why the injected concept came to mind." In every case, though, the results were highly inconsistent across repeated trials.
In the paper, the researchers put a positive spin on their finding that "current language models possess some functional introspective awareness of their own internal states" [emphasis added]. At the same time, they acknowledge multiple times that this demonstrated ability is far too brittle and context-dependent to be considered dependable. Still, Anthropic hopes that such capabilities "may continue to develop with further improvements to model capabilities."
One thing that might stand in the way of such advancement, though, is an overall lack of understanding of the precise mechanism behind these demonstrated "self-awareness" effects. The researchers theorize about "anomaly detection mechanisms" and "consistency-checking circuits" that might develop organically during the training process to "effectively compute a function of its internal representations," but they don't settle on any concrete explanation.
In the end, it will take further research to understand how, exactly, an LLM even begins to show any understanding of how it operates. For now, the researchers acknowledge, "the mechanisms underlying our results could still be rather shallow and narrowly specialized." Even then, they hasten to add, these LLM capabilities "may not have the same philosophical significance they do in humans, particularly given our uncertainty about their mechanistic basis."