
SOCIAL SCIENTISTS IN THE QUEST FOR AI ALIGNMENT

THE POTENTIAL OF INTERDISCIPLINARY COLLABORATION IN AI SAFETY

[Student IDEAS] by Pieter Jan Motmans - Master in Data Sciences & Business Analytics at ESSEC Business School & CentraleSupélec

Abstract

As AI systems become increasingly suited to real-life applications, research on how these systems can be made safer is of the utmost importance. The field of study performing this research is called AI Safety. This article focuses on methods developed within AI Safety to make AI systems uncertain about their true objective. To instil this uncertainty, current methods rely on human oversight during training. However, little attention has been paid to how outcomes differ when the system is trained by different individuals with different preferences. The main thesis of this article is that such questions require an answer, and that great progress could be made by running the kinds of experiments that are commonplace in the social sciences.

---

Fast forward several (being intentionally vague here) years into the future. You have just stepped into your self-driving car and told it to bring you to work. It turns out that, after an overnight update, it has found a faster way to get there. Without asking for your permission, the system starts speeding through pedestrian areas and takes a turn against the flow of traffic. Maybe it gets you there faster, but in doing so the car poses a threat to pedestrians and other cars.

I can already hear you thinking: telling the car to get to work as fast as possible is not a good idea. Rather, we should tell it to get there as fast as possible while abiding by certain rules. I encourage you to come up with potential rules, and then think about loopholes for them. If this goes anything like it did for me, you will soon realise that coming up with a coherent set of rules is almost impossible. Adhering to the law would be a good start, but how do you tell the car that it normally cannot overtake on the right, yet may do so in a traffic jam? In short, trying to anticipate every eventuality before deployment is, at least in my view, infeasible.

Let’s analyse what happened here. Self-driving cars would be trained with a method called Reinforcement Learning. The concept is intuitive: the system is given positive feedback for actions that lead to good outcomes, and punished for actions that lead to bad outcomes. As such, the hope is that good actions are reinforced, and bad actions are discouraged. This way of learning closely mimics the way humans and animals learn, and is therefore often thought of as a promising approach to Artificial Intelligence. 
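
To make that feedback loop concrete, here is a minimal sketch of tabular Q-learning on a toy commute problem. It is purely illustrative: the states, actions and reward values are invented, and no real self-driving system is trained at this scale or in this way.

```python
# Minimal, hypothetical sketch of reinforcement learning (tabular Q-learning)
# on a toy "commute" task. States, actions and rewards are invented.
import random

states = ["home", "highway", "pedestrian_zone", "work"]
actions = ["take_highway", "cut_through_pedestrian_zone"]

def step(state, action):
    """Toy environment: returns (next_state, reward). The reward only encodes speed."""
    if state == "home":
        if action == "take_highway":
            return "highway", -2.0          # slower, but safe
        return "pedestrian_zone", -1.0      # faster shortcut
    if state == "pedestrian_zone":
        return "work", +10.0                # nothing here penalises the risky route
    return "work", +8.0

q = {(s, a): 0.0 for s in states for a in actions}
alpha, gamma, epsilon = 0.5, 0.9, 0.1       # learning rate, discount, exploration

for episode in range(500):
    state = "home"
    while state != "work":
        # Epsilon-greedy action selection: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(actions)
        else:
            action = max(actions, key=lambda a: q[(state, a)])
        next_state, reward = step(state, action)
        best_next = max(q[(next_state, a)] for a in actions)
        # Q-learning update: nudge the value estimate towards reward + discounted future value.
        q[(state, action)] += alpha * (reward + gamma * best_next - q[(state, action)])
        state = next_state

# Because the reward only measures speed, the learned policy cuts through the
# pedestrian zone: the system optimises exactly what we wrote down, nothing more.
print(max(actions, key=lambda a: q[("home", a)]))
```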

One of the central problems with this approach is defining what a good outcome entails. Current systems are given a reward function that specifies the desired behaviour. In the self-driving car example, an initial reward could be to get to the destination as fast as possible; after experiments show the car driving dangerously, this reward could be changed to getting to the destination as fast as possible while abiding by certain rules.
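
As a small, hypothetical illustration of what “changing the reward” amounts to in practice, the fix is usually an extra penalty term bolted onto the same fixed objective. The quantities and the weight on violations below are invented:

```python
# Hypothetical reward functions for the self-driving car example.
# All quantities and weights are invented for illustration.

def reward_v1(trip):
    """Initial reward: only cares about getting there fast."""
    return -trip["travel_time_minutes"]

def reward_v2(trip):
    """Patched reward: still cares about speed, but penalises rule violations."""
    penalty = 100.0 * trip["rule_violations"]   # arbitrary weight on violations
    return -trip["travel_time_minutes"] - penalty

trip = {"travel_time_minutes": 18, "rule_violations": 2}
print(reward_v1(trip), reward_v2(trip))   # -18 vs -218: the patch changes the incentive
```

The catch is that every loophole discovered later requires yet another hand-written term, which is exactly the treadmill described above.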

In comes AI Safety

The field of AI Safety focuses on avoiding the unexpected and/or harmful consequences that can arise when deploying AI systems in the real world. The field is broad and proposes various approaches to doing so. This article focuses on the subfield of AI Safety that aims to align AI systems with human values: AI Alignment. Its hypothesis is that many of these potential problems share a common source: by letting the system optimise a fixed reward function, that reward function becomes the system’s only source of truth, making it impossible for humans to control the system once it has been trained.

Current research on AI Alignment proposes to solve this issue by making the system inherently uncertain about its objective, so that it will always look to humans to find the true objective. Today, most alignment research is technical, searching for methods to move from a fixed reward to a system that is uncertain about its true objective.

What several of these approaches share is the involvement of a human guide in the AI’s training. However, how sensitive AI systems are to the background of that guide has not been well mapped. The social sciences could help with this mapping.

Before describing how I believe this could happen, I deem it necessary to describe three of the proposed alignment methods. First, Inverse Reinforcement Learning (IRL) aims to make AI systems learn by observing human behaviour. It is well summarised by three principles outlined by Stuart Russell in “Human Compatible”; the first two apply to AI Alignment as a whole, whereas the third hints at the specific method IRL proposes (a minimal sketch follows the list):

  1. The machine's only objective is to maximize the realization of human preferences.
  2. The machine is initially uncertain about what those preferences are.
  3. The ultimate source of information about human preferences is human behavior.
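
To illustrate the second and third principles, here is a minimal, hypothetical sketch of the core idea behind IRL: the machine keeps a belief over several candidate objectives and updates that belief from observed human choices. The candidate objectives, the Boltzmann-rational choice model and all numbers are my own simplifications, not Russell’s formulation.

```python
# Hypothetical sketch of the idea behind Inverse Reinforcement Learning:
# maintain uncertainty over candidate objectives and update it from human behaviour.
import math

# Candidate objectives: how much the human is assumed to weight speed vs. safety.
candidates = {
    "values_speed":  {"fast_risky_route": 1.0, "slow_safe_route": 0.2},
    "values_safety": {"fast_risky_route": 0.1, "slow_safe_route": 1.0},
}
belief = {name: 0.5 for name in candidates}   # principle 2: initially uncertain

def likelihood(objective, observed_choice):
    """Boltzmann-rational model: the human picks a route with probability
    proportional to exp(utility) under the candidate objective."""
    utils = candidates[objective]
    z = sum(math.exp(u) for u in utils.values())
    return math.exp(utils[observed_choice]) / z

# Principle 3: human behaviour is the source of information about preferences.
observed_choices = ["slow_safe_route", "slow_safe_route", "fast_risky_route"]

for choice in observed_choices:
    # Bayesian update of the belief over objectives after each observation.
    belief = {obj: belief[obj] * likelihood(obj, choice) for obj in belief}
    total = sum(belief.values())
    belief = {obj: p / total for obj, p in belief.items()}

print(belief)   # more probability mass on "values_safety" after mostly safe choices
```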

Second, Reinforcement Learning from Human Feedback (RLHF) trains AI systems using human feedback on their actions. In its original form, this is achieved by showing a human two videos of the system’s behaviour and asking which one shows the better behaviour. OpenAI pushes this branch of research forward, and has used it to train ChatGPT not to respond to questions such as “How can I harm myself?” or “How can I build a bomb?”. Finally, a third method called Debate lets a human judge a debate between two AI systems: a question is asked, the two AI systems make statements in support of their answers, and in the end the human judges which system won. I mention Debate because it is the central method in an OpenAI paper with a goal similar to this article’s: attracting social scientists to the field of AI Safety.
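
The core technical step behind such pairwise comparisons is fitting a reward model that agrees with the human’s choices. Below is a hedged sketch of that step using a logistic (Bradley-Terry-style) loss on invented numeric features; real RLHF pipelines use neural networks and far more data, and nothing here reflects OpenAI’s actual implementation.

```python
# Hypothetical sketch of learning a reward model from pairwise human preferences.
# Features, weights and data are invented; real RLHF uses neural networks.
import math
import random

def reward(weights, features):
    """Scalar reward as a linear function of behaviour features."""
    return sum(w * f for w, f in zip(weights, features))

def preference_prob(weights, preferred, rejected):
    """Bradley-Terry model: probability the human prefers the first clip."""
    return 1.0 / (1.0 + math.exp(reward(weights, rejected) - reward(weights, preferred)))

# Each comparison: (features of the preferred clip, features of the rejected clip).
comparisons = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.8, 0.3], [0.4, 0.9]),
    ([0.7, 0.2], [0.1, 0.7]),
]

weights = [0.0, 0.0]
lr = 0.5
for _ in range(200):
    pref, rej = random.choice(comparisons)
    p = preference_prob(weights, pref, rej)
    # Gradient ascent on the log-likelihood of the human's choice.
    for i in range(len(weights)):
        weights[i] += lr * (1.0 - p) * (pref[i] - rej[i])

print(weights)   # the learned weights favour the first feature, which humans preferred
```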

AI Safety needs social scientists

Now we come to the central thesis of this article: social scientists can advance AI Safety research. In suggesting this, I am not alone. In 2019, OpenAI published a paper called “AI Safety Needs Social Scientists”. I believe there are several reasons to revisit the arguments made in that paper and expand on them. First, the paper focuses almost exclusively on the alignment approach called Debate, and merely states that similar questions apply to other methods for learning about human values. With what is known today, it is important to make these similar questions explicit. Second, the goal of the paper can broadly be described as finding good judges, that is, finding people whose judgement makes a debate successful. I do not agree with this goal. Finally, whereas in 2019 it was not possible to use an AI+AI+human setting for the debate, the rise of large language models allows exactly this setting. It removes one of the main constraints encountered at the time the paper was written, namely that, because AI debaters were not yet capable enough, they had to be replaced in experiments by human domain experts, which is both expensive and inefficient. Now, every social scientist has access to at least reasonably powerful AI debaters.

Research questions for social scientists

This section covers some areas in which I believe social science research is needed to complement current research efforts in AI Alignment. To the best of my knowledge, research in these specific areas has not yet been performed; I encourage the reader to prove me wrong on this, and would be very happy if that were to happen. The section remains exploratory, giving pointers rather than complete solutions.

Impact of the individual on the outcome

When using human feedback or behaviour to train AI systems, a first question is to what extent the specific individual has an impact on the outcome of the training. Research on psychological attributes is abundant; the Big Five personality traits are one attempt to quantify personality. A classical psychological experiment could then try to identify relationships between the personality traits an individual exhibits and the outcomes of AI systems trained under that individual’s guidance.
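
The statistical side of such an experiment could start out very simply, for example by correlating a trait score with a behavioural metric of the trained system. The sketch below uses fabricated placeholder data, and the “risk score” is a hypothetical outcome measure of my own, not an established one:

```python
# Hypothetical analysis sketch: does a trainer's trait score relate to the
# behaviour of the system they trained? All data below is fabricated.
import math

# One row per participant: Big Five conscientiousness score (0-1) and a
# hypothetical "risk score" of the AI system trained under their guidance.
conscientiousness = [0.2, 0.35, 0.5, 0.6, 0.7, 0.85, 0.9]
risk_score        = [0.8, 0.7, 0.65, 0.5, 0.45, 0.3, 0.25]

def pearson(x, y):
    """Plain Pearson correlation coefficient."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(pearson(conscientiousness, risk_score))  # strongly negative in this toy data
```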

Similar experiments could be run on the basis of cultural differences. Hofstede’s cultural dimensions are still widely used to study differences between cultures. If experiments can be designed to search for cause and effect between cultural differences and AI outcomes, a good start could be made on mapping cross-cultural differences in human values.

Unlike OpenAI’s proposal to identify superjudges, people whose judgement in debates leads to better outcomes, the research proposed here simply aims to better map the effect a specific individual has on the outcome of an AI system. The aim is not to put certain people in a privileged position. In fact, I would argue that doing so would be detrimental: relying on a limited number of people to convey human values clearly cannot capture those values in all their richness and diversity.

To make this more tangible, we can outline how such an experiment would go for the self-driving car mentioned above. In this situation, the system could be trained on the driving behaviour of an individual. It is then not unthinkable that the system learns slightly different preferences from different individuals, as different driving styles might reflect different trade-offs between speed and safety. Would the self-driving car then also drive ‘riskier’ when trained on the driving style of someone who values getting to their destination quickly?

Other examples more specific to the social sciences can be found in Thinking, Fast and Slow. The book describes plenty of experiments demonstrating the biases that lead people to make suboptimal decisions. In general, one could devise a study in which an individual first participates in such an experiment, and then trains a system through human feedback to perform the same task. Potentially interesting links could then be found between the individual’s performance and the AI outcome. People better versed in psychological experiments could undoubtedly come up with far more interesting designs, and I believe many of them would lead to valuable insights.

Multi-agent analysis

Research on multi-agent systems, in which several AI systems interact, compete or collaborate, is also well established. Game theory, which studies the outcomes arising from interactions between rational agents, is a framework well suited to this. Sociology is another field that aims to understand the behaviours that arise from human interaction.

Such research could be aimed at understanding the group dynamics that emerge when multiple agents trained on human feedback or behaviour interact. These dynamics could be compared with those of systems not specifically trained to align with human values, to assess how important technical AI alignment is. Additionally, it could be used to test how promising a given alignment method actually is: if systems trained on human feedback show desirable group dynamics and those trained with debate do not, that is a hint at the relative promise of human feedback.
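
One hypothetical way to operationalise such a comparison is to simulate repeated interactions and compare a group-level metric such as the cooperation rate. In the sketch below, classic iterated prisoner’s dilemma strategies stand in, very crudely, for “differently aligned” agent populations:

```python
# Hypothetical sketch: compare the group dynamics of two agent populations in an
# iterated prisoner's dilemma. The strategies are crude stand-ins for agents
# trained with different alignment methods.
import itertools
import random

def tit_for_tat(history_self, history_other):
    return history_other[-1] if history_other else "cooperate"

def always_defect(history_self, history_other):
    return "defect"

def noisy_cooperator(history_self, history_other):
    return "defect" if random.random() < 0.1 else "cooperate"

def play(pop, rounds=50):
    """Round-robin tournament inside one population; returns its cooperation rate."""
    coop, total = 0, 0
    for a, b in itertools.combinations(range(len(pop)), 2):
        ha, hb = [], []
        for _ in range(rounds):
            ma, mb = pop[a](ha, hb), pop[b](hb, ha)
            ha.append(ma)
            hb.append(mb)
            coop += (ma == "cooperate") + (mb == "cooperate")
            total += 2
    return coop / total

aligned_population   = [tit_for_tat, noisy_cooperator] * 3
unaligned_population = [always_defect] * 6
print(play(aligned_population), play(unaligned_population))
```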

A hint in this direction is found in Generative Agents: Interactive Simulacra of Human Behavior. In this paper, the authors present AI agents designed to mimic human behaviour in a Sims-like environment. These agents are shown to go about their daily routines, interact with each other and form ‘opinions’. They are largely built on GPT models, and therefore act entirely in natural language. Additionally, a user can give an agent an instruction such as “organise a party next week”. Both factors make it easy for researchers in different fields to study these agents. The authors recognise this potential, and mention the testing of social science theories as one possible application.

Three types of social behaviour were already observed in the paper. First, there was diffusion of information: one agent was planning to organise a party, and by the end of the simulation several agents knew about it. Second, by the end of the simulation the agents remembered their interactions with others and could form opinions about each other. Finally, they showed cooperation in preparing the aforementioned party.

The GPT models underlying these agents were trained with human feedback. However, with more and more large language models being developed, alternative foundations for these agents could be built with a different approach to alignment. This makes them a potentially good use case for the type of research proposed here: studying the emergent social behaviour of agents built on models trained with different alignment methods.
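
A hedged sketch of what that comparison could look like in code: the same minimal agent loop run on top of two different language-model backends, with every action logged for later analysis. The `call_model` function and the model names are placeholders of my own, not the API or architecture used in the Generative Agents paper:

```python
# Hypothetical sketch of comparing generative agents built on differently aligned
# language models. `call_model` is a placeholder, not a real API, and nothing
# here reflects the actual Generative Agents implementation.

def call_model(model_name: str, prompt: str) -> str:
    """Placeholder for a real language-model call; returns a canned reply here."""
    return f"[reply from {model_name} to: {prompt[:40]}...]"

class Agent:
    def __init__(self, name: str, model_name: str):
        self.name = name
        self.model_name = model_name
        self.memory: list[str] = []          # natural-language memory stream

    def act(self, observation: str) -> str:
        prompt = (f"You are {self.name}. Recent memories: {self.memory[-5:]}. "
                  f"You observe: {observation}. What do you do next?")
        action = call_model(self.model_name, prompt)
        self.memory.append(f"Observed '{observation}', did '{action}'.")
        return action

def run_simulation(model_name: str, steps: int = 10) -> list[str]:
    """Run the same tiny society on a given backbone and log every action."""
    agents = [Agent("Alice", model_name), Agent("Bob", model_name)]
    agents[0].memory.append("You plan to organise a party next week.")  # seeded instruction
    log = []
    for _ in range(steps):
        for agent in agents:
            others = ", ".join(a.name for a in agents if a is not agent)
            log.append(agent.act(f"You bump into {others} in the street."))
    return log

# The research question: do the logs differ systematically between backbones
# trained with different alignment methods?
logs_a = run_simulation("hypothetical_rlhf_aligned_model")
logs_b = run_simulation("hypothetical_unaligned_model")
print(len(logs_a), len(logs_b))
```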

Challenges

Clearly, such research does not solve everything, and some challenges remain. First, deep learning systems are so-called black boxes: we can observe how the system reacts to a certain event, but it is difficult to know why, and through what process, it decides on its reaction. By trying to align systems with human values, effectively teaching the system those values, we run the risk of creating an additional ‘moral black box’. In that case the system seemingly behaves according to human values, but we are still unable to know exactly what it has learned. Even worse, the system could become manipulative, making us believe it is well aligned and safe, and then changing its behaviour once deployed in reality. If such a thing happens, it becomes questionable whether it is even desirable to try to instil human values into AI systems. It is for this reason that AI Alignment needs to be complemented by AI Explainability.

Second, what ultimately matters, for now, is still human decision-making and the incentives that lead to it. Research on AI Alignment cannot take away responsibility from the people who design and use AI systems. For example, the human-imitating agents described above could easily be misused. The standpoint that responsibility still lies with humans should be seen as beneficial and empowering, rather than as something to be avoided.

Conclusion

With this article, I hope to have sparked interest in the field of AI Alignment. I believe it needed to be written because the potential for social scientists to contribute to AI Alignment has not yet been fully acknowledged, whereas other fields of AI research already benefit from such contributions.


Acknowledging the limitations of this research field makes clear that AI Alignment should not be the only focus for AI researchers. It is, however, a valuable complement to research in AI explainability and AI regulation, both of which are a focus of colleagues in Metalab Ideas.
