Since this is a long post, here is a TLDR:
All important choices depend on both intelligence and values.
Values (what you like and dislike, things you think are right and wrong) are not principally about intelligence, so it does not make sense to delegate them to a computer just because it is smarter than us.
This means we should have the power to make our own choices, not just outsource decision-making to an AI. AI alignment should be designed to enable us to do that.
Imagine a future where AI is better than humans at nearly any task. I’m skeptical of this happening very soon, but given the immense interest it’s worth giving thought to. People have considered the possibility of superintelligent AI for decades, with Artificial General Intelligence (AGI) now considered imminent by influential people in the space. It’s hard to use modern LLMs without feeling that something major is changing. What will happen if a coming technological singularity leaves human brains in the dust of machine intelligence?
Even in a world with AI superintelligence, one thing will be true: we will always have the responsibility to make tough decisions.
Nearly all consequential choices in life depend on a mix of two things: intelligence and values. AI can improve our intelligence, but it does not offer respite from sorting out our values when making decisions. We should design for AI alignment accordingly.
Asking AI for advice
By intelligence, I refer to the capacity to grasp the future consequences of our actions. Consider the choice of whether to take a job.1 This sort of choice often causes stress, in part because it represents a major fork between divergent life paths. Taking one job versus another can result in a different set of friends, mentors, and experiences.
Making a decision requires some assessment of what each of these paths will look like. This is a profoundly difficult problem. There is so much chance involved, and so much tacit knowledge about workplaces that would be difficult to grasp from a company website or a few chats with people on the team. Unanticipated changes in life circumstances interact with all these uncertainties, too. Consider the following questions about future consequences that could factor into the decision:
Will I get along with people on the team?
How much will I have to work? Will I be able to balance work with my other needs in life?
Will I succeed at the job?
Will I have good job security?
What impact will my job have on my community? What about the world at large?
Will this job be a stepping stone towards future opportunity?
Will I have enough financial security for a rainy day?
Will I develop interesting and useful skills?
Will the location of the job work for me today and in the future?
Do the health benefits provide me and my family sufficient coverage?
There may be many other questions to weigh as well. Choosing between jobs requires making predictions about the future for some or all of these questions if they factor into the final decision.
Humans are not very good at these predictions, mostly because they are very hard. We can’t enumerate all the possibilities and how likely they are. We don’t have all the relevant information. We suffer from behavioral biases or psychological distortions in our thinking. And this is just too much math to work through. We could spend our entire lives without coming up with good predictions, so in the end we settle on just doing the best we can.
Could AI help us sort through the future consequences of our decisions? A superintelligent AI would be far superior at doing some of these calculations about the future. It could give us a full explanation of the benefits package, give us financial and career projections based on our background and the job, and show us the likelihood of future economic downturns or layoffs. Much as we rely on internet articles or forums or advice from our friends, we could turn to superintelligent AI for advice from the most intelligent being to ever exist. Even at the current state of LLMs, we may already be there.
An important caveat is that no matter how intelligent, an AI may not have all the relevant information to offer the best advice. If aspects of the work culture or your prospective manager’s tendencies are known only to people within the company and never shared otherwise, there may still be a place for asking advice from real people. Nonetheless, it’s hard to overstate the benefits of superintelligent advice for sorting through tough life decisions.
Choices depend on values as well
This does not mean tough life decisions will become easy. Aside from intelligence, the second input to making life decisions is a set of values.
I interpret values broadly. They encompass both preferences—I generally enjoy strawberries more than blackberries—as well as moral beliefs—I do not think it is right for me to eat meat.
There is one key difference between intelligence and values. Intelligence is the means to turn observations about the world into objectively accurate statements about it, such as projections for the future. In contrast, values take these observations and make statements about them with no objective truth.
Strawberries are not universally better than blackberries. Even I don’t believe that to be true, though I would usually prefer to eat strawberries. Which berry is better lies in the realm of preference, not fact.
Consider again whether to take a job offer. Suppose you have the best information you could ever hope for about the consequences of each of your options. AI is so advanced that you do not face any uncertainty. You know exactly what will happen if you take the job, and what will happen if you do not.
Even then, the choice is still hard. What’s more important, getting along with people on the team or having good health benefits? Making a lot of money or doing something interesting? Having an impact on your local community or on the broader world?
People can and do value each of these things differently. Some differences just boil down to taste, such as trade-offs between material comfort and interest in the work. Other differences reflect moral disagreements, such as how much to value impact on your local community versus people living far away, or how much to care about impact at all. In either case, there is no objectively correct choice. Just as people can reasonably choose blackberries over strawberries, they can disagree over whether to take a job even if presented with the same information.
Humans have the gift of agency to think through and make these choices based on their own determination of good and bad or right and wrong. The most intelligent people may come up with more rigorous justifications for their choices, but the rest of us do not simply accept their worldview without assessing whether it makes sense to us. We don’t all become Kantian after reading Kant or Marxist after reading Marx. Only a tiny share of people bother reading these texts at all.
Would we then rely on a superintelligent AI to make these judgments for us? I expect we would stop the moment it prescribed something against our personal values. That may be a suggestion to work late on a weekend instead of visiting our ill mother, or to spend the week planting flowers in the local garden instead of organizing a fundraiser for disaster relief in a far-off country, or to pass on another slice of cake. For each person the point of contention with the AI’s choices might be different. These are not primarily questions of intelligence, so it does not make much sense to delegate them to a computer solely because it is smarter than us.
What if AI became so intelligent that it could immediately scan our brains to learn our values and then act on our behalf? The outcome is not so different from just letting AI make the judgments for us without the pretense of offering our input. We develop values through reflection and conscious determination. Letting an AI make decisions for us removes any reflection from our choices, which defeats the purpose of having any values at all.
A better way for AI alignment
These insights suggest a way forward for AI alignment. AI alignment aims to steer AI systems towards a specified set of principles or goals. From a technical standpoint, this is often thought of as producing outputs that people find useful, such as helpful responses from an LLM. From a philosophical perspective, discussion often focuses on preventing catastrophic risks.
There are several ways that superintelligent AI could cause extreme risk. It could spread and preserve the set of values of whoever develops it, no matter how warped or poorly articulated. Or a superintelligent system could adopt a moral code that considers humans insignificant because of our lesser intelligence, much as we treat ants today. These risks have led to calls for large investments to protect humanity from catastrophe.
A variety of approaches have been proposed to protect against these dangers. A prominent one today is inverse reinforcement learning (IRL), in which AI infers human value systems from observed behavior.2 Learning values from human feedback in this spirit has been effective at limiting toxic content produced by LLMs. Candidate responses get voted up or down by real humans, providing examples of distasteful output that the model learns to avoid.
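To make that mechanism concrete, here is a minimal sketch in Python, with made-up votes and a toy bag-of-words scorer (none of it drawn from any real system), of fitting a simple reward model to pairwise human preferences so that responses resembling the rejected examples end up with lower scores:

```python
import numpy as np

VOCAB = ["glad", "help", "thanks", "politely", "idiot", "insult", "threat"]

def featurize(text):
    """Bag-of-words counts over a tiny illustrative vocabulary."""
    words = text.lower().split()
    return np.array([words.count(w) for w in VOCAB], dtype=float)

# Hypothetical human votes: each pair is (preferred response, rejected response).
preference_pairs = [
    ("glad to help thanks for asking", "you are an idiot"),
    ("happy to explain that politely", "what an insult that threat was"),
]

def train_reward_model(pairs, lr=0.1, steps=500):
    """Fit weights w so preferred responses score higher than rejected ones,
    using a logistic (Bradley-Terry style) loss on each human vote."""
    w = np.zeros(len(VOCAB))
    for _ in range(steps):
        for preferred, rejected in pairs:
            x_p, x_r = featurize(preferred), featurize(rejected)
            # Probability the model currently assigns to the human's choice
            p = 1.0 / (1.0 + np.exp(-(w @ x_p - w @ x_r)))
            # Gradient ascent on the log-likelihood of that choice
            w += lr * (1.0 - p) * (x_p - x_r)
    return w

w = train_reward_model(preference_pairs)
for candidate in ["glad to help", "you idiot", "thanks so much"]:
    print(f"{candidate!r} -> score {w @ featurize(candidate):.2f}")
```

Real systems work with learned representations of text and vastly more feedback, but the core idea is the same: the model never receives anyone’s values directly, it only infers them from which outputs people happened to prefer.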
Will IRL scale to ensure that superintelligent AI is aligned with human values? Can we trust that it will result in AI that acts and makes decisions in our interest?
Based on the insights from the previous sections, the answer to these questions is no. An AI system that makes decisions on our behalf based on the values of a typical human does not act in our interest. We do not make choices based on what the average person wants but instead based on what we want. We wouldn’t want to eat what the typical person wants, or choose a job based on what other people would do in our shoes.
To the extent possible, we should instead seek AI that empowers us to make decisions that accord with our personal values, whether those values are simple preferences over berries or deep moral beliefs.
What if people want to delegate some of the hard work of choice to an AI? We already have systems for delegating many decisions. Spam filters decide what’s worth paying attention to in our email, for example. Delegation is wonderful, and likely will become commonplace for a variety of tasks in the future, as long as people can choose what to delegate and what to maintain oversight over.
For more consequential problems, we already have intricate systems for delegation, namely government. We have representatives who are delegated the responsibility to make decisions that affect the broader public, ideally chosen through laws with public support. Via government we already have mechanisms like voting that turn each person’s individual values into rules with social legitimacy, including ones that limit people’s choices when they pose harm.
In particular, we can and should use government institutions to limit dangerous uses of superintelligent AI. We may collectively decide to use IRL to specify the limits of acceptable use, but there are certainly alternatives, and whatever option is used in practice should be selected via the institutions we have. Technical approaches to AI alignment are options to choose between collectively, not substitutes for democratic decision making.
What should aligned AI look like?
To sum up, an aligned AI should:
Preserve our ability to choose what’s best for ourselves
Follow safeguards chosen collectively via our institutions
What would this look like in practice?
Preserving choice means that each of us individually determines our actions unless we explicitly delegate the decision-making for a task to the AI. This still represents a transformational change. Imagine a nearly all-knowing oracle showing you the consequences of each of your potential actions with high precision. For each problem we faced we would have an advisor with infinite wisdom to offer guidance.
This looks similar to a scaled-up version of what LLMs already offer today. At present we primarily use LLMs to gather information or to give us suggestions. These suggestions might be feedback on a draft email or a potential image to include in a Substack post. We ultimately decide what to do with these suggestions, whether to accept them as they are, modify them, or ask again for something new. A far more intelligent AI might give us better suggestions communicated via a superior user experience, such as through signals transferred to our brains. In the same way we interact with LLMs today, the judgment for our choices would still lie with us.
What kind of safeguards would we impose on uses of superintelligent AI? Presumably, some of these would prevent extreme risk, such as the production of destructive weapons. Others may restrict output so as to promote broadly held social goals, much like systems today that prevent discrimination, avoid harmful content, and aim to protect copyright. Our values may look different in the future as AI gives us more wisdom than we could presently even imagine, aiding us in deciding what we want the future to look like. Our future values will then shape the superintelligent AI that we choose for ourselves.
And if the AI goes rogue and refuses to listen to the will of us foolish humans? Then we have failed to build superintelligent AI at all. True superintelligence requires the humility to recognize that life’s biggest questions can’t be solved with intelligence alone.
I selected this example since most people have had to choose a career path at some point in their lives. Jobs may look very different in the future, but the same logic applies to any meaningful life decisions, and many inconsequential ones, too.
Other variations include cooperative inverse reinforcement learning (CIRL), debate, and amplification. Similar arguments apply for each of these methods.