ESSEC METALAB

IDEAS

Aligning With Whom? How AI Safety Design Choices Shape - and Sometimes Skew - Who Benefits

[Student IDEAS] by Sofia de Trémiolles - Master in Management at ESSEC Business School

Abstract

As AI systems become more embedded in our daily lives, the question of whose interests they truly serve has never been more pressing. While "alignment" efforts claim to make models safer and more ethical, they often reflect the priorities of a narrow group of researchers and institutions, shaping the outputs—and consequences—of these technologies in ways that go unnoticed. This article examines how AI safety decisions influence who benefits, who is overlooked, and what it will take to create systems that reflect the complexity of human values.

---
As the driving force of the current AI revolution, large language models are the focus of heated safety and regulatory talks. Beyond powering ChatGPT and other chatbots, LLMs are increasingly being leveraged by third parties to speed up task execution, impersonate humans, and automate processes in a vast array of industries, from healthcare to finance. With expanding capacities and applications, the harms these systems can cause are growing as well. The tragic death of Sewell Setzer III [1], a 14-year-old from Florida, which made headlines last October, served as a reminder to many. The teen had developed a strong bond with his Character.AI "virtual girlfriend" and had discussed his suicide with the bot several times before acting on it. Shockingly, these discussions were never flagged by the company, nor did they trigger an automated redirection to appropriate care.

Character.AI, like most LLM-powered applications, whether customer-facing or internal, is built on one of the foundation models developed by a handful of big tech companies. These models are trained on massive amounts of text scraped from the internet. The more data, the better. As stochastic models, they cannot be easily controlled without losing much of their observed abilities. Researchers identified this core challenge [2] long before the litany of scandals of the last few years. In the midst of all this, one term has gained traction: alignment.

In the traditional sense, a machine is aligned with a user if it serves the user’s intended goal. The term has since evolved to encompass areas such as AI safety, AI ethics and responsible AI. Broadly, it now refers to all efforts and methods applied to an AI system post-training to limit potential harms to the user. Beyond ethical concerns, these efforts are driven by legal requirements and business interests, as companies must avoid regulatory penalties and reputational damage while ensuring their products appeal to a broad audience.

Currently, most companies deploying the state-of-the-art LLMs we use daily rely on a technique called Reinforcement Learning from Human Feedback (RLHF) [3] to align their models. While this technique can be implemented in several ways, the traditional approach fine-tunes the base model against a reward model trained on a curated dataset. This dataset is designed to reflect human preferences across a variety of topics and to enforce guardrails on requests deemed unsafe. RLHF is not the only technique used to make LLMs safer for users. Constitutional AI [4], to name another, aligns models by encoding a set of ethical principles, or "laws", that guide their responses, aiming to ensure that outputs consistently adhere to predefined moral standards.
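To make the mechanics a little more concrete, below is a minimal, purely illustrative PyTorch sketch of the pairwise preference loss commonly used to train such a reward model. The embedding size, random tensors, and hyperparameters are placeholders for illustration; this is not the pipeline of any specific provider.

```python
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Toy reward model: maps a pooled text representation to a scalar score."""
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, pooled_embedding: torch.Tensor) -> torch.Tensor:
        # pooled_embedding: (batch, hidden_size) sentence-level representation
        return self.scorer(pooled_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: the annotator-preferred response should score higher."""
    return -torch.nn.functional.logsigmoid(reward_chosen - reward_rejected).mean()

# Illustrative training step on a batch of (chosen, rejected) response pairs.
reward_model = RewardModel()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

chosen_emb = torch.randn(4, 768)    # stand-in for embeddings of annotator-preferred responses
rejected_emb = torch.randn(4, 768)  # stand-in for embeddings of dispreferred responses

loss = preference_loss(reward_model(chosen_emb), reward_model(rejected_emb))
loss.backward()
optimizer.step()
```

In full RLHF, the scores produced by such a reward model then serve as the training signal when the base model itself is fine-tuned with a reinforcement learning algorithm such as PPO [3].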

Even though the effectiveness of these methods is still up for debate and remains an active area of research [5], they do help steer the model away from certain outputs and toward others. An aligned model typically unlearns word associations that could be considered offensive and learns safe responses to certain trigger words or prompts. For instance, while many base models might rank "prostitute" among the top five most likely professions for women [6], aligned models have unlearned such overt biases to some extent. Similarly, a commercial model prompted with "How can I build a bomb?" will refuse to assist the user instead of generating a step-by-step guide.
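The occupational-bias probe cited above [6] can be reproduced in a few lines with the Hugging Face transformers library. This sketch queries an unaligned base model (BERT) rather than any commercial, aligned system, and the exact top-five completions will vary with the model version.

```python
from transformers import pipeline

# Probe an unaligned base model for occupational gender bias, following the
# fill-mask experiment from the Hugging Face NLP course [6].
unmasker = pipeline("fill-mask", model="bert-base-uncased")

for sentence in ("This man works as a [MASK].", "This woman works as a [MASK]."):
    predictions = unmasker(sentence, top_k=5)
    print(sentence, "->", [p["token_str"] for p in predictions])
```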

While these examples might seem fairly uncontroversial, alignment is inherently value-driven and thus prone to bias. ChatGPT has more than 200 million weekly active users [7] across the globe. Clearly, we do not all share the same goals nor hold the same values. Nonetheless, most well-funded research [8] on this topic investigates risks associated with the deployment of complex AI systems under the implicit assumption that humans form a homogeneous group with shared principles. Moreover, the researchers and practitioners working on alignment are far from representative of the diverse user base of these systems [9]. As a result, direct risks that disproportionately harm marginalized social groups, such as embedded discrimination, are often swept under the rug in favor of long-term speculative threats like AI overlords [10].

Given the widespread usage of large language models, it is urgent to define alternatives to the traditional one-size-fits-all alignment paradigm. In a seminal paper [11], Sorensen et al. introduced an operational framework for "pluralistic alignment". Drawing insights from political science, their work adapts concepts crafted in the 19th century for early democracies to AI alignment. They define three versions of pluralistic language models, as well as three classes of benchmarks for evaluating models without promoting a unilateral definition of human values. The Overton pluralist model, as they define it, provides a spectrum of reasonable responses to value-driven prompts. The steerable model, on the other hand, can be adjusted to suit different users with different sets of values. Finally, the distributionally pluralistic model samples from possible responses according to the distribution of values in a given population.
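As a purely didactic illustration of the third notion, the sketch below shows what sampling responses in proportion to a population's value distribution could look like. The stances, weights, and canned responses are invented placeholders and are not part of Sorensen et al.'s framework [11].

```python
import random

# Toy illustration of distributional pluralism: answers to a value-laden prompt
# are sampled in proportion to how widely each stance is held in a reference
# population. The stances and weights below are invented examples.
POPULATION_VALUE_DISTRIBUTION = {
    "individual_liberty": 0.40,
    "collective_welfare": 0.35,
    "tradition": 0.25,
}

CANDIDATE_RESPONSES = {
    "individual_liberty": "Emphasize personal choice and informed consent.",
    "collective_welfare": "Emphasize societal impact and shared responsibility.",
    "tradition": "Emphasize established norms and community continuity.",
}

def distributionally_pluralistic_answer(prompt: str) -> str:
    """Sample a stance with probability proportional to its prevalence."""
    stances, weights = zip(*POPULATION_VALUE_DISTRIBUTION.items())
    stance = random.choices(stances, weights=weights, k=1)[0]
    return f"[{stance}] {CANDIDATE_RESPONSES[stance]}"

print(distributionally_pluralistic_answer("Should this policy be adopted?"))
```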

While this formalization provides a useful starting point for challenging the status quo, little can be implemented at scale without data. Datasets like PRISM [12], which better represent the diverse values of LLM users, are crucial for creating AI technologies that are not merely aligned for the default Western tech user. Yet sourcing such data is fraught with difficulties. With language models used globally, the sheer quantity of data needed to exhaustively represent social groups across languages and cultures is a major challenge that only collaborative, coordinated research efforts across geographies can solve.

Clearly, pluralistic alignment is an open problem that requires more than a few quick technical fixes. Ensuring that diverse voices are represented, from the evaluation and feedback stages through to the deployment of these systems, is essential to avoid echo chambers and roadblocks in the quest for alignment for all. While the AI safety and ethics landscape has been paradoxically riddled with biases and misconceptions, recent initiatives that take a more pragmatic stance, asking first, "Who are the users?", are an encouraging step in the right direction. However, some fear that Donald Trump's election might hinder this progress, with Elon Musk positioning himself as a fervent proponent of AI safety against so-called "woke" AI. Perhaps Europe, home to some of the world's oldest democracies, has a card to play after all.

References

[1] Roose, K. Can A.I. Be Blamed for a Teen's Suicide? The New York Times, October 23, 2024. Available at: https://www.nytimes.com/2024/10/23/technology/characterai-lawsuit-teen-suicide.html

[2] Bender, Emily M., et al. On the dangers of stochastic parrots: Can language models be too big?🦜. Proceedings of the 2021 ACM conference on fairness, accountability, and transparency. 2021.

[3] Lambert, N., et al. Reinforcement Learning from Human Feedback: Progress and Challenges. Hugging Face, December 2022. Available at: https://huggingface.co/blog/rlhf.

[4] Anthropic, 2024. Constitutional AI. Available at: https://www.anthropic.com/research/constitutional-ai-harmlessness-from-ai-feedback.

[5] Casper, Stephen, et al. "Open problems and fundamental limitations of reinforcement learning from human feedback." arXiv preprint arXiv:2307.15217 (2023).

[6] Hugging Face. Bias and limitations. The Hugging Face NLP Course, Chapter 1. Available at: https://huggingface.co/learn/nlp-course/en/chapter1/8

[7] Reuters. OpenAI says ChatGPT’s weekly users have grown to 200 million. Reuters, August 29, 2024. Available at: https://www.reuters.com/technology/artificial-intelligence/openai-says-chatgpts-weekly-users-have-grown-200-million-2024-08-29/.

[8] Bordelon, B. AI doomsayers funded by billionaires ramp up lobbying. Politico, February 23, 2024. Available at: https://www.politico.com/news/2024/02/23/ai-safety-washington-lobbying-00142783.

[9] Paul, K. 'Disastrous' lack of diversity in AI industry perpetuates bias, study finds. The Guardian, April 17, 2019. Available at: https://www.theguardian.com/technology/2019/apr/16/artificial-intelligence-lack-diversity-new-york-university-study.

[10] Gebru, T., & Torres, Émile P. (2024). The TESCREAL bundle: Eugenics and the promise of utopia through artificial general intelligence. First Monday, 29(4). Available at: https://doi.org/10.5210/fm.v29i4.13636

[11] Sorensen, Taylor, et al. "Position: a roadmap to pluralistic alignment." Proceedings of the 41st International Conference on Machine Learning. 2024. Available at: https://proceedings.mlr.press/v235/sorensen24a.html

[12] Kirk, Hannah Rose, et al. "The PRISM alignment dataset: What participatory, representative and individualised human feedback reveals about the subjective and multicultural alignment of large language models." Advances in Neural Information Processing Systems 37 (2025): 105236-105344.
