How Can We Teach Artificial Intelligence Systems To Act In Accordance With Human Goals and Values?

How can we teach artificial intelligence systems to act in accordance with human goals and values? Many researchers interact with AI systems to teach them human values, using techniques like inverse reinforcement learning (IRL). In theory, with IRL, an AI system can learn what humans value and how to best assist them by observing human behavior and receiving human feedback. But human behavior doesn’t always reflect human values, and human feedback is often biased. We say we want healthy food when we’re relaxed, but then we demand greasy food when we’re stressed.
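To make the difficulty concrete, here is a minimal, purely illustrative sketch of the inference at the heart of IRL: given a record of choices, find the reward weights that best explain them under a noisy (“Boltzmann-rational”) choice model. The meals, features, and numbers below are invented for the example, but the failure mode is the general one: if stress drives the observed choices, the inferred “values” absorb the stress.

```python
import math
from itertools import product

# Toy "meals" with two features: (healthiness, appeal-when-stressed).
# All numbers are invented for illustration.
MEALS = {"salad": (1.0, 0.2), "burger": (0.1, 1.0)}

def choice_prob(weights, meal, beta=3.0):
    """Boltzmann-rational choice: P(meal) is proportional to exp(beta * reward)."""
    def reward(m):
        health, stress_appeal = MEALS[m]
        return weights[0] * health + weights[1] * stress_appeal
    numerator = math.exp(beta * reward(meal))
    denominator = sum(math.exp(beta * reward(m)) for m in MEALS)
    return numerator / denominator

def infer_weights(observed_choices):
    """Grid-search maximum-likelihood estimate of the hidden reward weights."""
    grid = [w / 10 for w in range(11)]
    best, best_ll = None, -float("inf")
    for w_health, w_stress in product(grid, grid):
        ll = sum(math.log(choice_prob((w_health, w_stress), c)) for c in observed_choices)
        if ll > best_ll:
            best, best_ll = (w_health, w_stress), ll
    return best

# A person who values health but, when stressed, picks the burger anyway.
observed = ["salad", "burger", "burger", "burger", "salad", "burger"]
print(infer_weights(observed))  # the estimate leans toward "values greasy food"
```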

Not only do we often fail to live according to our values, but many of our values contradict each other. We value getting eight hours of sleep, for example, but we regularly sleep less because we also value working hard, caring for our children, and maintaining healthy relationships. AI systems may be able to learn a lot by observing humans, but because of our inconsistencies, some researchers worry that systems trained with IRL will be fundamentally unable to distinguish between value-aligned and misaligned behavior. This could become especially dangerous as AI systems become more powerful: inferring the wrong values or goals from observing humans could lead these systems to adopt harmful behavior.

Distinguishing Biases and Values

Owain Evans, a researcher at the Future of Humanity Institute, and Andreas Stuhlmüller, president of the research non-profit Ought, have explored the limitations of IRL in teaching human values to AI systems. Their research exposes how cognitive biases make it difficult for AIs to learn human preferences through interactive learning. Evans elaborates: “We want an agent to pursue some set of goals, and we want that set of goals to coincide with human goals. The question then is, if the agent just gets to watch humans and try to work out their goals from their behavior, how much are biases a problem there?”

In some cases, AIs will be able to understand patterns of common biases. Evans and Stuhlmüller discuss the psychological literature on biases in their paper, Learning the Preferences of Ignorant, Inconsistent Agents, and in their online book, agentmodels.org. An example of a common pattern discussed in agentmodels.org is “time inconsistency.” Time inconsistency is the idea that people’s values and goals change depending on when you ask them. In other words, “there is an inconsistency between what you prefer your future self to do and what your future self prefers to do.” Examples of time inconsistency are everywhere. For one, most people value waking up early and exercising if you ask them before bed. But come morning, when it’s cold and dark out and they didn’t get those eight hours of sleep, they often value the comfort of their sheets and the virtues of relaxation.
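agentmodels.org models agents like this using hyperbolic discounting, in which rewards that are close in time loom disproportionately large. The sketch below, with invented reward values and an invented discount rate, shows the flip: the same agent prefers exercise when asked the night before, and the warm bed when 7am actually arrives.

```python
def hyperbolic_value(reward, delay_hours, k=4.0):
    """Hyperbolic discounting: immediate rewards loom large, delayed ones shrink."""
    return reward / (1.0 + k * delay_hours)

# Two options the agent can compare at different points in time:
#   "sleep in": a small reward enjoyed at 7am
#   "exercise": a larger reward (feeling great) enjoyed at 8am
SLEEP_REWARD, EXERCISE_REWARD = 2.0, 5.0

def preferred_option(hours_until_7am):
    sleep = hyperbolic_value(SLEEP_REWARD, hours_until_7am)
    exercise = hyperbolic_value(EXERCISE_REWARD, hours_until_7am + 1.0)
    return "exercise" if exercise > sleep else "sleep in"

print(preferred_option(hours_until_7am=8.0))  # the night before: "exercise"
print(preferred_option(hours_until_7am=0.0))  # at 7am: "sleep in"
```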

From waking up early to avoiding alcohol, eating healthy, and saving money, humans tend to expect more from their future selves than their future selves are willing to do. With systematic, predictable patterns like time inconsistency, IRL could make progress with AI systems. But often our biases aren’t so clear. According to Evans, deciphering which actions coincide with someone’s values and which actions spring from biases is difficult or even impossible in general. “Suppose you promised to clean the house but you get a last-minute offer to party with a friend and you can’t resist,” he suggests. “Is this a bias, or your value of living for the moment? This is a problem for using only inverse reinforcement learning to train an AI — how would it decide what are biases and values?”
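The identifiability problem Evans describes can be seen in a toy calculation (all numbers invented). Below, two hypotheses assign exactly the same probability to the observed choice: one in which the person simply values the party more, and one in which they value the kept promise more but act under a present bias. The behavior alone cannot tell them apart.

```python
import math

def softmax_choice_prob(utilities, chosen, beta=2.0):
    """Probability of the chosen action under noisy (Boltzmann) rationality."""
    z = sum(math.exp(beta * u) for u in utilities.values())
    return math.exp(beta * utilities[chosen]) / z

# Observed behavior: the person skipped cleaning and went to the party.
# Hypothesis A: no bias, they simply value fun more than kept promises.
hypothesis_a = {"party": 3.0, "clean house": 1.0}

# Hypothesis B: they value the kept promise more, but a present bias inflates
# the immediate reward of the party and shrinks the delayed payoff of cleaning.
true_values_b = {"party": 1.0, "clean house": 3.0}
present_bias = {"party": 3.0, "clean house": 1.0 / 3.0}
hypothesis_b = {a: true_values_b[a] * present_bias[a] for a in true_values_b}

p_a = softmax_choice_prob(hypothesis_a, "party")
p_b = softmax_choice_prob(hypothesis_b, "party")
print(p_a, p_b)  # identical likelihoods: behavior alone cannot separate the two
```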

Understanding Human Values and the Risks of AI Value Misalignment

Despite this conundrum, understanding human values and preferences is essential for AI systems, and developers have a very practical interest in training their machines to learn these preferences. Already today, popular websites use AI to learn human preferences. With YouTube and Amazon, for instance, machine-learning algorithms observe your behavior and predict what you will want next. But while these recommendations are often useful, they have unintended consequences.
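As a toy sketch of the proxy objective at work (not any platform's actual system; the item names and scores are invented), the snippet below ranks candidates purely by predicted engagement. Nothing in the objective represents what the viewer would endorse on reflection, which is the gap the following example exposes.

```python
# Hypothetical catalogue: each item has a predicted watch-time score learned
# from past behavior (the numbers here are invented).
predicted_watch_minutes = {
    "calm explainer": 4.0,
    "sensational take": 11.0,
    "even more extreme take": 16.0,
}

def recommend(candidates, top_k=2):
    """Rank purely by predicted engagement, the only thing the proxy objective sees."""
    return sorted(candidates, key=candidates.get, reverse=True)[:top_k]

print(recommend(predicted_watch_minutes))
# ['even more extreme take', 'sensational take']; nothing in the objective asks
# whether this is what the viewer would endorse on reflection.
```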

Consider the case of Zeynep Tufekci, an associate professor at the School of Information and Library Science at the University of North Carolina. After watching videos of Trump rallies to learn more about his voter appeal, Tufekci began seeing white nationalist propaganda and Holocaust denial videos on her “autoplay” queue. She soon realized that YouTube’s algorithm, optimized to keep users engaged, predictably suggests more extreme content as users watch more videos. This led her to call the website “The Great Radicalizer.” This value misalignment in YouTube algorithms foreshadows the dangers of interactive learning with more advanced AI systems. Instead of optimizing advanced AI systems to appeal to our short-term desires and attractions to extremes, designers must be able to optimize them to understand our deeper values and enhance our lives.

Evans suggests that we will want AI systems that can reason through our decisions better than we can, recognize when we are making biased decisions, and “help us better pursue our long-term preferences.” However, this will sometimes mean suggesting strategies that seem counterintuitive or even undesirable at first. For example, an AI might suggest a longer, stress-free route to a first date that the anxious driver rejects in favor of a faster, riskier route. To help humans understand such suggestions, Evans and Stuhlmüller have researched how AIs can reason in ways humans can follow and even improve on human reasoning, using techniques like “amplification” and “factored cognition.”

Techniques for Making AI Reasoning Transparent and Aligned

One method, called “amplification” and proposed by Paul Christiano, involves humans using AIs to help them think more deeply about decisions. Evans explains: “You want a system that does exactly the same kind of thinking that we would, but it’s able to do it faster, more efficiently, maybe more reliably. But it should be a kind of thinking that if you broke it down into small steps, humans could understand and follow.” This approach aims to make AI reasoning transparent and aligned with human understanding.
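As a minimal sketch of the decompose, delegate, and recombine pattern behind amplification (the word-counting task and helper functions are invented purely for illustration), the snippet below breaks a question into subquestions small enough to check by hand, answers each separately, and combines the results.

```python
# The "model" here is a stub that can only answer very small questions directly.
def answer_small_question(question):
    # e.g. "count words: The cat sat" -> 3
    return len(question.split(":", 1)[1].split())

def decompose(question):
    """The overseer breaks the big question into small, checkable subquestions."""
    _, text = question.split(":", 1)
    return [f"count words:{sentence}" for sentence in text.split(".") if sentence.strip()]

def amplified_answer(question):
    """Amplification loop: decompose, delegate each piece, combine the results.
    Every intermediate step is small enough for a human to inspect."""
    subquestions = decompose(question)
    subanswers = [answer_small_question(q) for q in subquestions]
    return sum(subanswers)

print(amplified_answer("count words in all sentences: The cat sat. It purred loudly. Then it slept."))
# 3 + 3 + 3 = 9, with every intermediate step human-checkable
```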

Another related concept is “factored cognition,” which involves breaking complex tasks into small, understandable steps. Evans notes that while sometimes humans can break down their reasoning, often we rely on intuition—making it difficult for AI to replicate or explain our decision-making processes. These methods are part of a broader effort to develop AI systems that can reason in ways that are not only effective but also interpretable by humans, thereby reducing risks of misalignment and enhancing trust.
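One way to picture factored cognition (again a toy sketch, not Ought's actual implementation): give each worker a strict context budget so that no single step requires holding the whole problem in mind, then assemble the answer from those bounded steps.

```python
# Factored cognition sketch: each worker gets one small subtask and a strict
# context budget, and never sees the rest of the problem.
CONTEXT_BUDGET = 40  # max characters a single worker is allowed to see

def worker(subtask):
    """A bounded worker: refuses anything too large to handle in one step."""
    if len(subtask) > CONTEXT_BUDGET:
        raise ValueError("subtask too large; decompose it further")
    # Trivially "solve" the subtask by extracting the number it mentions.
    return next(tok for tok in subtask.split() if tok.isdigit())

def solve(task_numbers):
    """Split a big 'add all of these' task into one-number subtasks, then combine."""
    subtasks = [f"report the number {n}" for n in task_numbers]
    partial_results = [int(worker(s)) for s in subtasks]
    return sum(partial_results)

print(solve([12, 7, 30]))  # 49, assembled from steps no worker saw in full
```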

Conclusion: Towards Ethical and Informed AI Development

Developing AI systems that truly align with human values and long-term goals requires grappling with human biases, understanding human psychology, and building transparent reasoning methods. Techniques such as amplification and factored cognition are promising paths toward more interpretable, better-aligned AI, but further research is needed to understand their limitations. As AI systems become more integrated into daily life, developers must prioritize understanding human values, recognizing biases, and building robust, transparent reasoning frameworks so that AI acts in humanity’s best interest. Responsible AI development means creating systems capable of the nuanced understanding, reasoning, and adjustment needed to support human well-being, guarding against the harms of misaligned artificial intelligence.

References

  • Christiano, P., Shlegeris, B., & Amodei, D. (2018). Supervising strong learners by amplifying weak experts. arXiv preprint arXiv:1810.08575.
  • Evans, O., Stuhlmüller, A., & Goodman, N. D. (2016). Learning the Preferences of Ignorant, Inconsistent Agents. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16).
  • Evans, O., Stuhlmüller, A., Salvatier, J., & Filan, D. (2017). Modeling Agents with Probabilistic Programs. Retrieved from https://agentmodels.org
  • Tufekci, Z. (2018). YouTube, the Great Radicalizer. The New York Times.
  • Russell, S., & Norvig, P. (2020). Artificial Intelligence: A Modern Approach (4th ed.). Pearson.
  • Sunstein, C. R. (2015). The ethics of nudging. Yale Journal on Regulation, 32(2), 413–450.
  • Everett, J. A., & Chadwick, K. (2020). Aligning AI with human values. Communications of the ACM, 63(12), 32–34.
  • Future of Humanity Institute. (2021). Challenges in AI alignment research. Oxford University. Retrieved from https://www.fhi.ox.ac.uk/research/ai-alignment/
  • Yudkowsky, E. (2004). Coherent Extrapolated Volition. The Singularity Institute for Artificial Intelligence.
  • Miggins, J., & Dowe, D. (2018). Interpretable machine learning for human-centered AI. Journal of Machine Learning Research, 19, 1-25.