The project of Friendly AI

September 19, 2009

What is the problem and project of Friendly AI? This issue is rather confused, so I’ll outline the motivation and break the problem into its two main components.

Much of the power of technology manifests as predictable tools we create. Predictability comes in many forms: a bowl is expected not to leak liquids, an excavator is expected to be useful for digging holes, a written note is expected to bring back forgotten memories. Tools can be trusted to deliver their predictable effects, and so can be safely designed to wield great power. An A-bomb, based on its design, is trusted not to blow up spontaneously, and software in the banks is trusted to correctly keep track of everyone’s accounts.

Humans can inventively reason about a lot of things, but our ability to correctly anticipate the effects of detailed plans is pretty limited. When designing a bridge, it is not enough to pick its shape and materials and estimate intuitively whether it’ll stand a load: this mode of operation will yield an unpredictable result, one that can’t be trusted. To get better at designing predictable tools, we invent more tools targeted at helping in this task.

Computers can be used to implement huge calculations, if the problem statement can be entered explicitly. For example, you can program the material and mechanical laws into an engineering application, enter a building plan, and have the computer predict what’s going to happen to it, or what parameters should be used in the construction so that the outcome is as required. That’s power outside the human mind, directed by the correct laws, and targeted at a formally specified problem.

The process of decision-making has two aspects: prediction (factual estimation) and valuation (moral estimation). To be selected, a plan has to both be feasible and lead to good consequences. It is possible to implement a nuclear winter, but people don’t want that to happen. So far, people have been fairly successful at designing powerful mental tools for prediction (think physics, not futurism), but outside narrow domains, the resulting plans always have to be “manually” morally evaluated by people before a decision can proceed. We can design powerful tools to augment only half of the decision process; the other half remains hopelessly in the domain of the human brain.

Let’s say we built an AI, a tool capable of planning in any domain, that is also capable of estimating the desirability of plans, and so can make decisions autonomously. If this AI is considered independently of its goals, it’s like an engineering application with a random building plan: it can powerfully produce a solution, but it’s not a solution to the problem anyone needs solving. If you can specify a problem, but don’t have the AI, nothing happens. If you have the AI but give it a random goal, it solves a random problem, with all its power of precision and autonomy. The AI algorithm is essential when you do have the ability to specify the problem, but it’s a separate issue from specifying the problem statement that comes from human nature.
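As a rough illustration of this separation (a toy sketch, not a proposal for how an AI would work), the planner below is the same in both runs and only the goal changes. All names here (choose_plan, predict, intended_value, arbitrary_value) and the beam numbers are hypothetical, made up for the example.

```python
from typing import Callable, Iterable, TypeVar

Plan = TypeVar("Plan")
Outcome = TypeVar("Outcome")


def choose_plan(
    plans: Iterable[Plan],
    predict: Callable[[Plan], Outcome],
    value: Callable[[Outcome], float],
) -> Plan:
    """Return the plan whose predicted outcome scores highest under `value`."""
    return max(plans, key=lambda plan: value(predict(plan)))


def predict(thickness_cm: int) -> dict:
    # Factual estimation: made-up "laws" relating beam thickness to load and cost.
    return {"max_load_tons": 10 * thickness_cm, "cost": 3 * thickness_cm ** 2}


def intended_value(outcome: dict) -> float:
    # The problem someone actually needs solved:
    # carry at least 50 tons, as cheaply as possible.
    if outcome["max_load_tons"] < 50:
        return float("-inf")
    return -outcome["cost"]


def arbitrary_value(outcome: dict) -> float:
    # A goal nobody asked for: make the beam as expensive as possible.
    return outcome["cost"]


plans = range(1, 11)  # candidate thicknesses in cm
best_intended = choose_plan(plans, predict, intended_value)
best_arbitrary = choose_plan(plans, predict, arbitrary_value)
print(best_intended, best_arbitrary)
```

The search and the prediction are identical in both calls; the planner pursues whichever goal it is handed with the same precision. In this toy case the intended goal is three lines long, which is exactly what cannot be said of human values.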

I tentatively identify Friendly AI as an autonomous decision-making tool that is powerful at what it does and can be trusted not only with factual estimation, but also with moral estimation. You don’t have to manually check what it deems desirable, just as you don’t have to manually check how a calculator arrived at each specific result, to be confident that the result is correct.

What is the difficulty then? Why can’t we program human values into a computer, just like a building plan, to be computed in higher resolution? The answer is that we can’t explicitly see our values. We can use them, with varying levels of success, but we can’t write them down and cast the whole of human preference in explicit form. Any direct attempt to do so will end up as a crude caricature that breaks in situations not at all difficult to find. A moral machine would need to work with human values, but human programmers can’t enter them. Neither can they do in their heads what a machine would be able to do given a formal problem statement, because the problem statement is too big for humans to handle. It could exist in a computer explicitly, but it can’t be entered there by programmers.

So, here is the barrier: the problem statement (human values) resides in the structure of the human mind, but the strong power of inference doesn’t, while the strong power of inference (potentially) exists in computers outside human minds, where the problem statement can’t be manually transmitted. Creating Friendly AI requires these components to meet in the same system, but it can’t be done in the way other kinds of programming are done.

On the surface, the problem of Friendly AI seems to be about engineering an algorithm capable of powerful planning that is guaranteed by design to follow a clearly defined goal system. But the deeper problem seems to be extracting that goal system from humanity, seeing values in the messy detail of a given physical system.

Technically understanding a more or less arbitrary physical artifact as an instance of a goal-directed algorithm is a problem much more general than constructing a specific algorithm. To see human values in detail, we need the basic paradigm of what values are, as a property of physical processes. Here we seem to be at a pre-Newtonian stage: there is no “mass” or “force” in the description of preference (but there is a lot of existing science to throw at this problem).

The project of understanding arbitrary physical systems as formal goal-directed agents is (1) more general than designing a specific goal-directed AI, so that the solution to the latter may not even meaningfully contribute; (2) a necessary component of any successful FAI design; (3) safer than designing an AI, which, given arbitrary goals, is a very dangerous thing to have around; and (4) possibly even a source of answers to some fundamental conceptual questions in AI design, allowing that project to be completed.