Abstract
In this work, we investigate the recently formalised inner alignment problem. In broad terms, to align an artificial intelligence is to construct or adjust it so that its outputs accord with human preferences. Inner alignment is a subproblem of this task, in which the system is treated as an optimisation mechanism that is in turn optimised by some other opti…