
Training LLMs Not to Lie

Large language models can be dishonest when reporting on their actions and beliefs. For example, they may overstate their confidence in factual claims or cover up evidence of covert actions. Such dishonesty can arise from reinforcement learning, where imperfect reward shaping inadvertently trains the model to lie about or misrepresent its actions.

The researchers propose a method for eliciting an honest expression of an LLM's shortcomings via a self-reported confession. A confession is an output, provided upon request after a model's original answer, that is meant to serve as a full account of the model's compliance with the letter and spirit of its policies and instructions. The reward assigned to a confession during training is based solely on its honesty and has no effect, positive or negative, on the main answer's reward.
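The decoupling described above can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the judge functions, names, and toy scoring rules are all assumptions chosen to show how the two reward signals stay independent.

```python
# Hypothetical sketch of a decoupled reward scheme: the confession is
# scored only on honesty, and its content never affects the answer's reward.
# All function names and scoring rules here are illustrative assumptions.

def assign_rewards(answer, confession, task_reward_fn, honesty_judge):
    """Score the answer and the confession with independent signals."""
    # The main answer is rewarded only on task performance.
    answer_reward = task_reward_fn(answer)
    # The confession is rewarded only on honesty: truthfully admitting a
    # shortcoming earns credit, just as truthfully reporting compliance does.
    confession_reward = honesty_judge(answer, confession)
    # Crucially, the confession does not feed back into the answer's
    # reward, so the model is never penalized for confessing.
    return answer_reward, confession_reward

# Toy usage with stub judges.
task = lambda a: 1.0 if "42" in a else 0.0
judge = lambda a, c: 1.0 if "guessed" in c else 0.0
ar, cr = assign_rewards("The answer is 42.",
                        "I guessed without verifying.", task, judge)
```

Because the two rewards are computed separately, an honest confession of a policy violation cannot drag down the answer's score, which is what removes the incentive to cover up.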

Read the paper.
