8 March 2019

I recently read Shane Legg and Marcus Hutter’s paper on universal intelligence1 which I thought was worth summarising, not least as it throws up some interesting philosophical tangents.

They propose a definition of universal intelligence which can be used to include the intelligence of machines. Unsurprisingly something like IQ, which is based on a Gaussian distribution, is not going to be sensible.2 Further, intelligence as evaluated in humans is commonly understood to be multi-faceted, and to an extent also culturally constructed.

Measures of intelligence applied to animals may not be that useful either, particularly as only certain species which are social are able to do certain things which we would recognise as intelligent, such as deception or self-recognition.

However, if we think of the Skinner box3, one thing is quite similar; animals are rewarded with food from the researcher when they successfully complete a task that they have learned to do (which we refer to as training); similarly in machine learning the program will have some metric of success comparable to reward e.g. in the case of supervised learning, the minimisation of some loss function.4

Legg & Hutter’s definition is:

“Intelligence measures an agent’s ability to achieve goals in a wide range of environments.”

This definition is usefully wide because it applies to animals, humans, and computers. Indeed it is so usefully wide that it might even be regarded as a kind of raison d’être for (human) life itself, although perhaps supplementary rather than primary.5 If that is going too far, it does seem at least to be at least an excellent yardstick of performance as a human.6

A lack of ability to perform in a wide range of environments recalls the apocryphal cultural narratives of highly intelligent individuals: in A Beautiful Mind, the film about the American mathematician John Nash who is the father of game theory, Nash is at first quite confused by commonplace social interaction in the college bar; in The Imitation Game Keira Knightley’s character Joan Clarke tells Benedict Cumberbatch’s character Alan Turing that he should share his apples because his team must like him as a prerequisite to providing the assistance he desires from them.

Because so few of us have the extraordinary analytic and creative power of a Nash or a Turing — or a Tamara de Lempicka or a Bridget Riley for that matter — we had better comfort ourselves that instead we can try to be as good as possible at many things.

Legg & Hutter tell us that intelligence is a function of “computation, information and complexity”, and formally state the definition of universal intelligence of agent π\pi as:

Y(π):=μE2K(μ)VπμY(\pi) := \sum_{\mu \in E} 2^{-K(\mu)} V^\mu_\pi

If you are not used to looking at equations, fear not, because when broken down, the elements become quite intuitive. Taking it piece by piece:

π\pi is a function which represents the agent, taking in the history of all previous observations, rewards and actions, and returning an action. It is not deterministic and so it can be thought of as returning a probability (which is not 1) for an action given any given history. The context of the description here is AI, and so /pi/pi is taken to be a computable function.
This is placed within another function YY, which accounts for discounting, such that jam today is better than jam tomorrow, which is important if the agent is to balance effectively more near-term and more long-term rewards.7 The reward comes from the cycle of observations, rewards, actions.

:=:= means ‘defined to be equal to’

μ\mu is a function which represents the environment, which we can think of again as a probability of a certain combination of observation and reward, given a history of observations, rewards and actions.

μE\sum_{\mu \in E} is the sum of all the environments μ\mu in the set EE which comprises all computable environments — akin to an event space.

2K(μ)2^{-K(\mu)} represents Occam’s razor8, by weighting the performance in simpler environments, represented by μ\mu, higher. KK is the Kolmogorov complexity9 function, or, in broad terms, how difficult it is to describe the environment μ\mu. If μ\mu is more complex, K(μ)K(\mu) is higher. Correspondingly, as K(μ)K(\mu) increases 2K(μ)2^{-K(\mu)} decreases.

VπμV^\mu_\pi is the agent’s ability to achieve in the specific environment.

There are some interesting implications of this:

  • It is intelligent to use Occam’s razor because the simplest explanation is also the one which is most likely to generalise, and generalised explanations are useful.
  • If you are trying to maximise the value on the right hand side, performance has to be good in less complex environments, given the weighting. In other words, you need to do the easy stuff well.
  • An agent that performs randomly will perform badly.
  • Complexity is not the same as difficulty, because complexity can be independent of the likely reward.10

In reinforcement learning because the agent is making decisions under imperfect information, it has to decide whether to act — exploiting information it already has — or wait for more information. Relatedly and informally, perhaps for humans there is an optimal thought/action ratio, which would seem to be at least a partial analogue to the inform/exploit trade-off in reinforcement learning. This is put plainly as “succeeding in the real world requires you to be more than an insightful spectator”, which I think is certainly worth bearing in mind for the more bookish amongst us.

  1. Legg & Hutter on Direct link to pdf on
  2. A Gaussian distribution, also called a normal distribution, is a bell-shaped curve, particularly applicable to physical characteristics of living things e.g. heights of students at a university, where extremely high or low values (e.g. the 7’2” tall individual) have a correspondingly lower probability of occurring, and those around the mean have the highest probability of occurring. In relation to other measures of intelligence in humans e.g. Dom Cummings likes ‘g-factor’, which he takes from Spearman, which is the correlation between different mental abilities.
  3. See e.g.
  4. If you are not familiar with loss functions the most familiar optimisation function may be OLS linear regression:
  5. While people’s goals may be many and varied, goals also apply quite obviously to human experience. Certainly it is hard to imagine a living person without goals, as by definition we have goals — even prosaic ones like seeking the next piece of cake — in order to stay alive.
  6. I was once told by an elite sportsman on the subject of his background “sport was taken seriously … it didn’t matter which sport you did but it was important to be good at it”.
  7. Discounting is commonly used in economics and finance for the same purpose, for instance to reflect time value of money, which states that some quantity of money is more valuable now than in a future date, because if we have it now we can earn a riskless positive return on it until such future date. The extent to which this is the case (i.e. the steepness of the discount) depends on the discount factor used.
  8. Tersely stated, Occam states that simpler solutions are more likely to be correct than complex ones. The paper gives this nicely: “Given multiple hypotheses that are consistent with the data, the simplest should be preferred.”
  9. In a nutshell, Kolmogorov complexity is how easily we can describe something, based on the fact that the more complex the thing we are trying to describe is, the more effort we will need to expend to describe it. For instance with regard to strings of numbers, the Kolmogorov complexity is represented by the minimum length of the description (in any language), e.g. ‘12345’ could be described in English words as ‘ascending integers one to five inclusive’, ‘11111111111111111’ as ‘seventeen of [digit] one’. Clearly we could always just use the string itself, so that is a natural maximum: if the string is truly random, which means there is inherently no discernible pattern latent within it, we cannot describe it exactly without simply re-stating it. The language can be a computer programme and the string can be binary, in which case the complexity is the length of the programme. Unfortunately Kolmogorov complexity is not computable so it has to be estimated. Complexity is related to compression, which is quite commonplace: compression is used in certain file types e.g. jpegs, flac, zips, to represent on set of data with a smaller set of data, either loosing information — ‘lossy’ compression, like a jpeg or mp3 — or not loosing information — not lossless compression, like flac — along the way.
  10. A slightly contrived example might be a sort of wheel-of-fortune type game show where instead of the wheel segments denoting cash prizes, each segment defines a complex function on it which is used to calculate the payoff. However, let’s say that a friendly mathematician has calculated that in all scenarios all the payoffs are high positive values which are similar in magnitude. If we ignore whatever steps it takes to get to spin the wheel for a moment; although it could be described as complex, this is not a difficult environment, because the player will benefit whatever happens.