Large language models might be dishonest about their covert actions. OpenAI researchers propose a self-confession approach as a solution.
Posts published by “OpenAI researchers”
Manas Joglekar, Jeremy Chen, Gabriel Wu, Jason Yosinski, Jasmine Wang, Boaz Barak, Amelia Glaese
