Something amazing has happened. A couple of years ago, we closely followed the progress of AlphaGo, the distributed DeepMind algorithm which defeated Lee Sedol in four out of five games of the ancient board game Go. This has since been defeated by a variant, AlphaGo Zero, which is considerably stronger — after 20 days of training, it was able to win 100 out of 100 games against the version of AlphaGo which played against Sedol. After a further 20 days of training, it won 89 out of 100 games against a stronger instantiation of AlphaGo, namely the one which defeated the world champion Ke.
However, its superiority over the previous algorithm isn’t the most interesting aspect. What makes it interesting is that, unlike AlphaGo which both trained on human games and made use of hardcoded features (such as ‘liberties’), AlphaGo Zero is remarkably simple:
- The algorithm has no external inputs, learning only from games against itself;
- The input to the neural network is just 17 inputs, namely the parity of the turn, the indicator functions of white stones for the last 8 positions, and the indicator functions of black stones for the last 8 positions. (Storing the immediate history is a necessity due to ko rules.)
- Instead of separate policy and value networks, the algorithm uses only one neural network;
- Monte Carlo rollouts are ditched in favour of a feedback loop where the tree search evolves together with the network.
Read the Nature paper for more details. AlphaGo Zero was trained on just four tensor processing units (TPUs), which are fast hardware implementations of fixed-point limited-precision linear algebra. This is much more efficient (but less numerically precise) than a GPU, which is in turn much more efficient (but less flexible) than a CPU.