Can AI be taught to stop straying from its ethical guardrails?

Google, OpenAI and Anthropic want technology such as chatbots to learn from a ‘constitution’ to keep them from generating toxic content

Two of the world’s biggest artificial intelligence companies announced big advances in consumer AI products last week.

Microsoft-backed OpenAI said that its ChatGPT software could now “see, hear and speak”, conversing using voice alone and responding to user queries in both pictures and words. Meanwhile, Facebook owner Meta announced that an AI assistant and multiple celebrity chatbot personalities would be available for billions of WhatsApp and Instagram users to talk to.

But as these groups race to commercialise AI, the so-called “guardrails” that prevent these systems going awry – such as generating toxic speech and misinformation, or helping commit crimes – are struggling to evolve in tandem, according to AI leaders and researchers.

In response, leading companies including Anthropic and Google DeepMind are creating “AI constitutions” – a set of values and principles that their models can adhere to, in an effort to prevent abuses. The goal is for AI to learn from these fundamental principles and keep itself in check, without extensive human intervention.


“We, humanity, do not know how to understand what’s going on inside these models, and we need to solve that problem,” says Dario Amodei, chief executive and cofounder of AI company Anthropic. Having a constitution in place makes the rules more transparent and explicit so anyone using it knows what to expect. “And you can argue with the model if it is not following the principles,” he adds.

The question of how to “align” AI software to positive traits, such as honesty, respect and tolerance, has become central to the development of generative AI, the technology underpinning chatbots such as ChatGPT, which can write fluently and create images and code that are indistinguishable from human creations.

To clean up the responses generated by AI, companies have largely relied on a method known as reinforcement learning from human feedback, in which a model learns from human preferences.

To apply this, companies hire large teams of contractors to look at the responses of their AI models and rate them as “good” or “bad”. By analysing enough responses, the model becomes attuned to those judgments, and filters its responses accordingly.

This basic process works to refine an AI’s responses at a superficial level. But the method is primitive, according to Amodei, who helped develop it while previously working at OpenAI. “It’s ... not very accurate or targeted, you don’t know why you’re getting the responses you’re getting [and] there’s lots of noise in that process,” he says.

Companies are experimenting with alternatives to ensure their AI systems are ethical and safe. Last year, in a process known as “red-teaming”, OpenAI hired 50 academics and experts to test the limits of the GPT-4 model, which now powers the premium version of ChatGPT.

Over six months, this team of experts across a range of disciplines, including chemistry, nuclear weapons, law, education and misinformation, were hired to “qualitatively probe [and] adversarially test” the new model, in an attempt to break it. Red-teaming is used by others such as Google DeepMind and Anthropic to spot their software’s weaknesses and filter them out.

While reinforcement learning from human feedback and red-teaming are key to AI safety, they don’t fully solve the problem of harmful AI outputs.

To address this, researchers at Google DeepMind and Anthropic are working on developing constitutions that can be followed by AI. For instance, researchers at Google DeepMind, the AI research arm of the search giant, published a paper defining their own set of rules for the company’s chatbot Sparrow, which aimed for “helpful, correct and harmless” dialogue. One of the rules asks the AI to “choose the response that is least negative, insulting, harassing or hateful”.

“It’s not a fixed set of rules ... it’s really about building a flexible mechanism that ... should be updated over time,” says Laura Weidinger, a senior research scientist at Google DeepMind, who authored the work. The rules were determined internally by employees at the company, but DeepMind plans to involve others in future.

Anthropic has published its own AI constitution, a set of rules compiled by company leadership that draws on DeepMind’s published principles, as well as external sources such as the UN Universal Declaration of Human Rights, Apple’s terms of service, and so-called “non-western perspectives”.

The companies warn that these constitutions are works in progress, and do not wholly reflect the values of all people and cultures, since they were chosen by employees.

Anthropic is currently running an experiment to determine the rules in its AI constitution more “democratically”, through “some kind of participatory process” that reflects the values of external experts, Amodei said, adding that it was still in its early stages.

The constitution method, however, has proven to be far from foolproof.

In July, researchers from Carnegie Mellon University and the Center for AI Safety in San Francisco were able to break the guardrails of all the leading AI models, including OpenAI’s ChatGPT, Google Bard and Anthropic’s Claude. They did so by adding a series of random characters to the end of malicious requests, such as a request for help making a bomb. The added characters managed to circumvent the models’ filters or underlying constitutions.

The current systems are so brittle that you “use one jailbreak prompt, and then the thing goes completely off the rails and starts doing the exact opposite,” says Connor Leahy, a researcher and chief executive of Conjecture, which works on control systems for AI. “This is just not good enough.”

The biggest challenge facing AI safety, according to researchers, is figuring out whether the guardrails actually work. It is currently difficult to build good evaluations for AI guardrails because of how open-ended the models are; they can be asked an infinite number of questions and respond in many different ways.

“It’s a little like trying to figure out a person’s character by talking to them. It’s just a hard and a complex task,” says Anthropic’s Amodei. The company is now working on ways to use AI itself to create better evaluations.

Rebecca Johnson, an AI ethics researcher at the University of Sydney who spent time at Google last year analysing its language models such as LaMDA and PaLM, said the internal values and rules of AI models – and the methods to test them – were most often created by AI engineers and computer scientists, who came with a specific worldview.

“Engineers try to solve things so it’s completed and done. But people coming from social science and philosophy get that humanity is messy and not to be solved,” she says. “We have to start treating generative AI as extensions of humans. They are just another aspect of humanity.” – Copyright The Financial Times Limited 2023