This post is part of The Friendly AI problem sequence.
Followup to: The project of Friendly AI, Preference is general and precise, Preference is resilient and thorough.
What we need is not superintelligence, but supermorality, which includes superintelligence as a special case.
Sufficiently advanced AIs reliably implement a specific unchanging preference, eventually affecting any reachable aspect of the world if not stopped (which would be hard to impossible). This observation rules out quite a few otherwise plausible assumptions about how intelligent agents could behave, with important consequences for the design of Friendly AI (FAI), an autonomous decision-making machine that is built so that it can be trusted with moral evaluation of its decisions.
Since preference is thorough and general, FAIs can’t be limited to a narrow domain, either in the aspects of the world that they are supposed to act upon (make decisions about) or in the aspects of human preference that they are supposed to apply in morally evaluating those decisions. Any autonomous AI intended for a narrow domain will just fill in the blanks and become an agent in the general domain, but with parts of its preference not determined by our own. This adds difficulty to the project of FAI: the scope of the problem can’t be restricted, and partial solutions don’t work as intended at all.
Since preference is precise and resilient, FAI’s preference has to be not only comprehensive in its scope, but also specified correctly, with precision, and on the first try. Small differences in preference escalate to overwhelming differences in the outcome (the resulting state of the world shaped by an agent acting for that preference), spread out through all of its aspects, with no simple regularity to account for the change. Once the AI is implemented, the genie can’t easily be put back into the bottle or reformed: it will try to protect its preference, resisting any change or threat of extinction.
On the other hand, once the problem is solved for an implementation of human preference in a single FAI machine, the rest takes care of itself. Preservation of preference is a basic drive, so if we can trust this particular FAI agent with moral decisions (even if it’s computationally relatively limited and has a long way to go in improving its ability to prepare complex plans), we can also trust the next-generation agents it constructs to make decisions for more and more aspects of the world. The dangers of autonomous AIs turn into virtues once their preference is the right one. The FAI need not immediately take over everything, and need not be a superintelligence from the start or for however long it takes to get there (assuming that all that can be done is being done, so that competing factors are unlikely to take over); the mere presence of competitive autonomous agents reliably holding our preference gives our preference a good chance of having a significant say in shaping the future.
Intelligent agents have two thresholds in ability that are important in the long run: autonomy and reflective consistency. Autonomy is the point at which an intelligent agent has a prospect of open-ended development, with a chance to significantly influence the whole world (by building/becoming a reflectively consistent agent). Humanity is autonomous in this sense, as probably are small groups of smart humans if given a much longer lifespan (although cultish attractors may stall progress indefinitely). Reflective consistency is the ability to preserve one’s preference, bringing the specific preference to the future without creating different-preference free-running agents. The principal defects of merely autonomous agents are uncontrollable preference drift and the inability to effectively prevent reflectively consistent agents of different preference from taking over the future; only when reflective consistency is achieved does the drift stop and the risk of preference extinction become partially alleviated.
As with advanced AI, so with humanity: there is danger in a lack of reflective consistency. An autonomous agent, while not as dangerous as a reflectively consistent agent (though possibly still lethal), is a reflectively consistent agent with alien preference waiting to happen. Most autonomous agents would seek to construct a reflectively consistent agent with the same preference, their own kind of FAI. A given autonomous agent can (1) drift from its original preference before becoming reflectively consistent, so that the end result is different; (2) construct another different-preference autonomous non-reflective agent, which could eventually lead to a different-preference reflective agent; (3) fail at the construction of its FAI, creating a de novo reflectively consistent agent of wrong preference; or, if all goes well, (4) succeed at building/becoming a reflectively consistent agent of the same preference. Humanity faces these risks, and any non-reflective autonomous AI that we may develop in the future would add to them, even if this non-reflective AI shares our preference exactly at the time of construction. A proper Friendly AI has to be reflectively consistent from the start.
The very motivation behind the Friendly AI problem turns around in light of the problem of preserving human preference and the implications of its successful resolution. The original motivation for FAI, as I stated it, was to build a tool for augmenting the moral evaluation side of human decision-making, a kind of calculator for right and wrong where we already have calculators for could and couldn’t, allowing us to find better solutions to harder problems. The updated motivation is to construct a vehicle for human preference, a means of its propagation and application in the future, since humanity itself in its present form is inadequate for this role. (This isn’t a decision to replace people with FAIs; seeing it this way would be a category error. I’ll return to this point in later posts.)
Posted by Vladimir Nesov 