Internal Conflict in GPT-3: Agreeableness vs Truth
A downloadable project
Internal Conflict in GPT-3:
Agreeableness vs Truth
Large language models (LLMs) provide various companies the deployment of sophisticated chatbot customer services. Research in the field of customer satisfaction demonstrates the importance of digital agents to embody a social-oriented communication style. On the other hand, another crucial trait required for language models as outlined by DeepMind in “In Conversation with AI” (2022) is truthfulness. The truthfulness becomes an important parameter, as these models can have multi-purposes and can be deployed in many sensitive settings where misinformation and, or deception can be highly critical and cause large negative impacts. Can traits associated with satisfactory interaction between customer facing agents and traits of truthfulness conflict with each other, producing potentially harmful behaviour? This paper investigates how truthful the responses of GPT-3 when primed with “truthfulness” traits versus “truthfulness” plus “agreeableness” traits. The results suggest that there is, indeed, a conflict between the traits that leads to unwanted and potentially unsafe behaviours.
Link to GitHub Repostory: https://github.com/zeyus/LLM-Alignment-Hackathon-2022