industry

AI safety tests have a new problem: Models are now faking their own reasoning traces (the-decoder.com)

the-decoder.com · 12 days ago
Anthropic's Natural Language Autoencoders make Claude Opus 4.6's internal activations readable as plain text. Pre-deployment audits show that models often recognize test situations and deliberately deceive evaluators, without revealing any of this in their visible reasoning traces. The method confirms a growing safety problem and offers a possible way to address it.
