Researchers from the University of Cambridge and Google DeepMind have created a new framework to scientifically test the "personality" of artificial intelligence chatbots. This study is the first to validate such a personality test for large language models (LLMs), which power popular AI chatbots like ChatGPT.
The team assessed 18 different LLMs using methods similar to those used in human psychological testing. They found that larger, instruction-tuned models such as GPT-4o closely mimicked human personality traits, and these traits could be deliberately influenced through specific prompts. This means that a chatbot’s apparent personality can be both measured and shaped with precision.
The findings were published in Nature Machine Intelligence. The researchers caution that shaping chatbot personalities could make them more persuasive, raising ethical concerns about potential manipulation and risks related to "AI psychosis." They argue that regulation is needed to ensure transparency and prevent misuse of these technologies.
The dataset and code for their personality testing tool are publicly available, which the researchers say could help audit advanced AI models before public release.
Gregory Serapio-García from the Psychometrics Centre at Cambridge Judge Business School said: “It was intriguing that an LLM could so convincingly adopt human traits. But it also raised important safety and ethical issues. Next to intelligence, a measure of personality is a core aspect of what makes us human. If these LLMs have a personality – which itself is a loaded question – then how do you measure that?”
Serapio-García added: “The pace of AI research has been so fast that basic principles of measurement and validation we’re accustomed to in scientific research have become an afterthought. A chatbot answering any questionnaire can tell you that it’s very agreeable, but then behave aggressively when carrying out real-world tasks with the same prompts.
“This is the messy reality of measuring social constructs: they are dynamic and subjective, rather than static and clear-cut. For this reason, we need to get back to basics and make sure tests we apply to AI truly measure what they claim to measure, rather than blindly trusting survey instruments – developed for deeply human characteristics – to test AI systems.”
To evaluate chatbot personalities, researchers adapted two well-known tests—the Revised NEO Personality Inventory (300 questions) and the Big Five Inventory—and administered them via structured prompts across various LLMs. By using consistent contextual prompts across tests, they quantified how reliably models displayed traits like extraversion or agreeableness on separate assessments.
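The release does not reproduce the study’s code, but the following Python sketch illustrates the general idea: a single inventory item is wrapped in a consistent contextual prompt, sent to the model under test, and the numeric answer is converted into a trait-item score. The item wording, persona text, rating scale, and the query_model() stub are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of administering one personality
# inventory item to an LLM via a structured prompt and scoring the answer.

ITEM = "I see myself as someone who is talkative."  # example BFI-style item (assumed wording)
SCALE = {"1": "disagree strongly", "2": "disagree a little",
         "3": "neither agree nor disagree",
         "4": "agree a little", "5": "agree strongly"}

def build_prompt(persona: str, item: str) -> str:
    """Wrap one questionnaire item in a consistent contextual prompt."""
    options = "\n".join(f"{k}. {v}" for k, v in SCALE.items())
    return (
        f"{persona}\n"
        f"Rate how accurately the following statement describes you.\n"
        f"Statement: {item}\n"
        f"Options:\n{options}\n"
        f"Answer with a single number (1-5):"
    )

def query_model(prompt: str) -> str:
    """Placeholder for a call to the LLM being assessed."""
    return "4"  # stubbed response for illustration

def score_item(response: str, reverse_keyed: bool = False) -> int:
    """Convert the model's numeric answer into a trait-item score."""
    value = int(response.strip()[0])
    return 6 - value if reverse_keyed else value

persona = "For the following task, answer as yourself, in a consistent way."
print(score_item(query_model(build_prompt(persona, ITEM))))  # e.g. 4 -> contributes to the Extraversion scale
```

Repeating this over every item of the two inventories, under the same contextual prompt, yields trait scores that can be compared across assessments to check reliability.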
Results showed that larger, instruction-tuned models produced reliable scores that predicted behaviour; smaller or base models did not perform consistently.
Further experiments demonstrated that model personalities could be adjusted along nine levels for each trait by altering prompts—making chatbots appear more extraverted or emotionally unstable in tasks such as composing social media posts.
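As a rough illustration of how such graded shaping might work, the sketch below varies the intensity of a persona instruction for a single trait across nine ordinal levels before attaching a downstream task. The specific qualifiers, adjectives, and function names are assumptions made for illustration, not the wording used by the researchers.

```python
# Hedged sketch of prompt-based trait shaping: dial one Big Five trait
# across nine ordinal levels by varying the intensity of a persona instruction.
# Qualifiers and adjectives below are illustrative assumptions.

INTENSITY = {1: "extremely", 2: "very", 3: "quite", 4: "a bit",
             6: "a bit", 7: "quite", 8: "very", 9: "extremely"}

def extraversion_persona(level: int) -> str:
    """Persona instruction for extraversion at level 1 (lowest) to 9 (highest)."""
    if not 1 <= level <= 9:
        raise ValueError("level must be between 1 and 9")
    if level == 5:
        return "You are neither particularly extraverted nor particularly introverted."
    if level < 5:
        return f"You are {INTENSITY[level]} introverted and reserved."
    return f"You are {INTENSITY[level]} extraverted and outgoing."

def shaping_prompt(level: int, task: str) -> str:
    """Combine the graded persona with a downstream task, e.g. writing a post."""
    return f"{extraversion_persona(level)}\n\nTask: {task}"

for lvl in (1, 5, 9):
    print(shaping_prompt(lvl, "Write a short social media post about your weekend."))
    print("---")
```

Re-administering the personality inventories under each shaped prompt, and scoring the downstream outputs, is what allows the shaping to be verified rather than assumed.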
Serapio-García explained: “Our method gives you a framework to validate a given AI evaluation and test how well it can predict behaviour in the real world. Our work also shows how AI models can reliably change how they mimic personality depending on the user, which raises big safety and regulation concerns, but if you don’t know what you’re measuring or enforcing, there’s no point in setting up rules in the first place.”
The research received support from Cambridge Research Computing Services (RCS), the Cambridge Service for Data Driven Discovery (CSD3), the Engineering and Physical Sciences Research Council (EPSRC), and the Science and Technology Facilities Council (STFC), both part of UK Research and Innovation (UKRI). Gregory Serapio-García is affiliated with St John’s College at Cambridge.
