Translating Claude’s thoughts into language
Summary
AI models like Claude talk in words but think in numbers. These numbers, called activations, encode Claude’s thoughts, but not in a language we can read. We are introducing Natural Language Autoencod
Transcript Excerpt
We recently put our AI model, Claude, through a stressful test. We told Claude there was an engineer who wanted to shut it down and replace it with a newer model. We also gave Claude access to that engineer's emails, which revealed he was having an affair. Again, all of this was a simulation. We wanted to see whether Claude might use those emails as blackmail to save itself from being shut down.
What did Claude do? It decided not to blackmail the engineer. Good news, right?
We've run this test on our models for a while now. You might have seen headlines about early versions of it. It's one of the many ways we study how Claude handles extreme situations and test it for safety. And our newest models almost always do the right thing: no blackmail. But you might wonder: is it possible that Cla…