**Note:** the video is in French; recently, Dave Greene added an English translation of the subtitles.

* the video is part of Cabaret’s *Passe-Science* series. You may enjoy some of his other videos, including an explanation of the P vs NP problem and a reduction of Boolean satisfiability to the 3-colourability of planar graphs.

In related news, Michael Simkin recently created a wonderfully anachronistic self-propagator entitled **Remini**: it uses the same single-channel/slow-salvo construction mechanism as the 0E0P metacell, but it is built from oscillatory components instead of static ones. That is to say, it implements modern ideas using components available in the 1970s.

The project involved slmake together with a suite of additional tools developed by Simkin. There isn’t a video of this machine self-replicating, so you’d need to download a program such as Golly in order to watch it running.

For further reading, I recommend (in order):

- The wiki entry (under construction) for the 0E0P metacell;
- An article unveiling various simpler examples of self-constructing circuitry;
- The slmake repository;
- A tutorial on effective use of slmake;
- A challenge thread proposing another contraption, that no-one has yet built. This would require the use of slmake followed by some ‘DNA-splicing’ to interleave the construction recipe with extra operations.

]]>

- number of logic gates (smaller is better);
- depth of circuit (lower is better).

One of the best pieces of software out there is Berkeley’s ABC tool. It represents a circuit in a form called an AIG (AND-inverter graph), which is a directed acyclic graph of 2-input AND gates and 1-input NOT gates (the latter of which are considered to be free). Then, it performs a variety of rounds of local optimisations, such as:

- searching for 4-input 1-output subcircuits and ‘rewriting’ them by replacing with equivalent subcircuits of fewer logic gates;
- searching for subcircuits that can be ‘refactored’ as compositions of smaller subcircuits;
- ‘balancing’ the graph to minimise the circuit depth.

In 2011, Nan Li and Elena Dubrova wrote an article which demonstrated significant improvements by including a selection of 5-input 1-output replacements. Instead of restricting to AIGs, the authors allowed elementary XOR gates in the graph as well, which (in the presence of costless 1-input inverters) has the elegant effect that every 2-input Boolean gate has unit cost.

There are exactly 2^32 = 4294967296 Boolean functions with 5 inputs and 1 output, so it would be infeasible in practice to directly store optimal circuits for all of them. However, up to inverting the inputs, permuting the inputs, and negating the outputs, there are only 616126 equivalence classes (called *NPN classes*, for ‘negate-permute-negate’). The authors cherry-picked approximately 1000 of those, and used a *Boolean matcher* to sequentially test a given subcircuit against each of these classes in turn. Doing so for all 616126 equivalence classes would soon get rather slow…

Earlier, in 2005, Donald Knuth wrote a collection of computer programs to find the lowest-cost implementations of all 616126 NPN classes of 5-input 1-output functions. Instead of Boolean matching, Knuth’s approach was to ‘canonise’ functions: find the lexicographically smallest truth table which is NPN-equivalent to a given function, and use that as the representative for the NPN class. The serious advantage is that lookup only takes constant time, by using the canonical truth table as a key into a hashtable.

To avoid a full brute-force search, Knuth cleverly approached the problem by induction: try to describe a larger circuit (implementing a harder function) in terms of smaller circuits (implementing easier functions). He separated the inductive step into three cases:

- **Top-down**: If we can compute A in n gates and B in m gates, then f(A, B) can be computed in n + m + 1 gates, where f is an arbitrary gate.
- **Bottom-up**: If we can compute C(x1, x2, x3, x4, x5) in n gates, then we can compute C(f(x1, x2), x2, x3, x4, x5) in n + 1 gates, and C(f(x1, x2), g(x1, x2), x3, x4, x5) in n + 2 gates.
- **Special**: Anything not of the above form. By assuming that it’s not of either of the previous cases, the possible structure of such a circuit can be constrained considerably, reducing the size of the brute-force search.

Eventually, he had solved all but 6 NPN classes of functions (each of which he knew required either 11 or 12 gates). By some extra computational horsepower, he eventually solved these last holdouts, finding that all but one could be managed in 11 gates, and therefore the last one required exactly 12.

One slight impasse from a usability perspective is that the above results were separated across several databases (for top-down and bottom-up steps), text files (for the majority of the special chains), and even in the README file (for the last 6 NPN classes). As such, I realised that it’s worth organising Knuth’s results into a more convenient form.

This was the motivation behind **optimal5**: a database I created with two aims:

- Consolidating Knuth’s research into a uniform database;
- Making function canonisation as efficient as possible, allowing fast lookup;

The first of these tasks was routine — it just involved tracing the inductive constructions (including keeping track of permutations and negations of intermediate results) and ‘unpacking’ them into complete normalised circuits. It was rather laborious owing to the piecemeal structure of the inductive proof, but not particularly noteworthy.

The second of these tasks was both much more mathematically interesting and challenging. In Knuth’s original code, a function is canonised by iterating through all 3840 (2^5 . 5!) permutations and negations of the inputs, negating the output if necessary to ensure the circuit is zero-preserving, and taking the lexicographic minimum over all of those choices.
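To make the brute-force canonisation concrete, here is a minimal Python sketch of the idea for small numbers of inputs. It is not Knuth's code and uses none of his bitwise tricks: the function name `npn_canonise`, the truth-table encoding (bit *x* of the integer is the output on input *x*), and the use of plain integer comparison as the 'lexicographic' order are all assumptions made for illustration.

```python
from itertools import permutations

def npn_canonise(tt, n):
    """Return the smallest zero-preserving truth table NPN-equivalent to tt.

    tt is an integer whose bit x holds f(x) for an n-input function f.
    """
    size = 1 << n
    full = (1 << size) - 1
    best = None
    for perm in permutations(range(n)):      # n! input permutations
        for neg in range(size):              # 2^n input negation masks
            g = 0
            for x in range(size):
                # Route bit i of x to bit perm[i], then negate the chosen inputs.
                y = 0
                for i in range(n):
                    if (x >> i) & 1:
                        y |= 1 << perm[i]
                y ^= neg
                if (tt >> y) & 1:
                    g |= 1 << x
            if g & 1:
                # Negate the output so that g(0, ..., 0) = 0 (zero-preserving).
                g ^= full
            if best is None or g < best:
                best = g
    return best

# Counting distinct canonical forms counts the NPN classes.
classes_3 = {npn_canonise(t, 3) for t in range(256)}
print(len(classes_3))
```

For 3 inputs this visits 48 input transformations per function and finds 14 classes; for 5 inputs the same loop would visit the 3840 transformations mentioned above, which is exactly the cost worth avoiding.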

But 3840 is quite a large number, so even with Knuth’s very streamlined bitwise tricks, it still took a whole **10 microseconds** to canonise a function. After Admiral Grace Hopper’s unforgettable lecture about nanoseconds and microseconds and what length of wire would be hung around my neck per microsecond, I wanted to improve upon that.

If all of this discussion about 5-input 1-output Boolean functions is rather abstract, imagine a 5-dimensional hypercube such as the one below, which is deservedly the logo for the project:

A 5-input 1-output Boolean function corresponds to a way to colour the vertices of this hypercube red and green. Two such functions are NPN-equivalent if you can rotate/reflect one hypercube, and possibly alternate the colours, to make it identical to the other. (And if 5-dimensional hypercubes are too difficult to visualise, just visualise 3-dimensional cubes instead — this simplification doesn’t actually ruin any of the intuition.)

This 5-dimensional (resp. 3-dimensional) hypercube has 10 faces (resp. 6). So we can systematically place each one of those face-down, and look at just the 16 vertices (resp. 4) on the top face, and find out the top face’s canonical form by looking it up in a 2^16-element lookup table. So we’ve made 10 lookups so far, one for each face.

Now, a canonical hypercube must have a canonical top face, so we can discard whichever subset of those 10 orientations (in most cases, it will be 9 out of 10) don’t achieve the lexicographical minimum, and concentrate only on the others. At that point we could do an exhaustive search over 384 permutations, instead of 3840, and save ourselves a factor of 10 in most cases (and gain nothing for very highly symmetric functions, such as the parity function). If I recall correctly, this gave an improvement to about **1.6 microseconds**. Much better, but I’d still prefer not to have Admiral Hopper suspend half a kilometre of conducting wire around my neck, thereby necessitating even more mathematics:

Of course, there’s no point traversing all 384 permutations, since you know that (once you’ve made the top face lexicographically minimal) only the elements in the stabiliser subgroup of the top face have any chance of resulting in the lexicographically smallest colouring of the entire cube. So we can instead traverse this subgroup. I decided to ask on MathOverflow whether anyone knew how to solve the Travelling Salesman Problem efficiently on a Cayley graph, but they didn’t, so I implemented the Held-Karp algorithm instead. Specifically, I opted for:

- If the stabiliser has at most 24 elements, use the optimal Hamiltonian path determined by Held-Karp;
- Otherwise (and this case is sufficiently rare that it doesn’t matter that it’s slower), just traverse all 384 elements as before.

Being far too lazy to manually write code for all 75 subgroups that arise in this manner, I instead wrote a much shorter program to generate this code on my behalf. (If you’re wondering whence the constant 1984 arises, it’s the smallest modulus such that all 222 canonical 4-input functions have distinct residues; this is a rudimentary example of perfect hashing.)
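The modulus trick can be sketched in a few lines of Python. The helper name `smallest_perfect_modulus` and the toy keys below are hypothetical; the actual 222 canonical truth tables are not reproduced here, so this does not attempt to re-derive the constant 1984.

```python
def smallest_perfect_modulus(keys):
    """Smallest m such that all keys have distinct residues modulo m."""
    distinct = set(keys)
    m = len(distinct)            # fewer than len(distinct) slots can never work
    while len({k % m for k in distinct}) < len(distinct):
        m += 1
    return m

def build_table(keys):
    """Place each key in slot (key % m); no collisions occur by construction."""
    m = smallest_perfect_modulus(keys)
    table = [None] * m
    for k in keys:
        table[k % m] = k
    return table

toy_keys = [1, 4, 7]             # all congruent mod 3, so m = 3 fails
print(smallest_perfect_modulus(toy_keys), build_table(toy_keys))
```

A lookup is then a single modulo followed by an array access, at the cost of some empty slots in the table.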

By this point, it took a total of **686 nanoseconds** on average to canonise a function, look up the circuit in the hashtable, transform that circuit back to the original function, and check the result.

Using the profiler *perf* I was able to see that the canonisation was no longer the bottleneck, and the other things were taking the lion’s share of the time. Satisfied with the algorithm, I slightly rewrote parts of the implementation to make it faster (e.g. fixed-size data structures instead of std::vectors for representing circuits), and slashed the total time down to **308 nanoseconds**.

Observing that the hashtable lookup itself was taking much of the time, Tom Rokicki helpfully suggested replacing the std::unordered_map with a custom implementation of a hashtable (ideally using perfect hashing, as with the Hamiltonian path lookup, or a ‘semi-perfect’ compromise). Back-of-the-envelope calculations suggested that such a hashtable would end up being very sparse, with lots of empty space, annihilating much of the memory advantage of only storing one representative per NPN equivalence class.

Then finally I did something that required ε of effort to accomplish: I simply searched the Internet for the fastest hashtable I could find, swapped the std::unordered_map with this fancy ‘flat hashmap’, and crossed my fingers. The result? **209 nanoseconds**. The performance profile is now sufficiently uniform, with no clear bottlenecks or obvious sources of algorithmic inefficiency, that I’m happy to leave it there and not try to squeeze out any extra performance. Moreover, 60 metres of wire isn’t nearly as uncomfortable as the three kilometres we started with…

I was having a discussion with Rajan Troll, who wondered whether some multi-output rewriting steps could be useful. A back-of-the-envelope calculation (taking the leading term of the Polya enumeration formula and discarding the other terms) suggests that there are about 1.4 million NPPN* classes of 4-input 2-output functions.

*the two *outputs* can be freely permuted, as well as the four inputs, ergo the extra P. (I suppose that if I had multiple interchangeable inputs and outputs, whatever that means, I would be an APPG.)

Since using 4-input 2-output rewriting could enable logic sharing (where two different computations share intermediate results), there seems to be a significant amount of utility in embarking on a Knuth-style search for optimal 4-input 2-output (as opposed to 5-input 1-output) circuits.

I’ve started working on that now, including having written a script to enumerate all of the possible shapes of optimal n-input 1-output Boolean chains. This is sufficient, since any 4-input 2-output circuit can be decomposed into a 4-input 1-output chain (computing one of the outputs) and an n-input 1-output chain (computing the other output), where the second chain’s inputs may include intermediate values from the first chain.

Updates to follow as events warrant…

]]>- ρ = sqrt(–2 log(U))
- θ = 2π V
- X = ρ cos(θ)
- Y = ρ sin(θ)

Then it follows that X and Y are independent standard Gaussian distributions. On a computer, where independent uniform distributions are easy to sample (using a pseudo-random number generator), this enables one to produce Gaussian samples.
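As a quick sanity check, the four formulas above translate directly into Python. This is only a sketch: it assumes U and V are independent uniform samples on the unit interval, and it uses `1 - random()` so that the logarithm never sees zero.

```python
import math
import random

def box_muller(rng):
    """Turn two independent uniform samples into two Gaussian samples."""
    u = 1.0 - rng.random()       # lies in (0, 1], which avoids log(0)
    v = rng.random()
    rho = math.sqrt(-2.0 * math.log(u))
    theta = 2.0 * math.pi * v
    return rho * math.cos(theta), rho * math.sin(theta)

rng = random.Random(12345)       # fixed seed, purely for reproducibility
pairs = [box_muller(rng) for _ in range(20000)]
xs = [x for x, _ in pairs]
ys = [y for _, y in pairs]

mean = sum(xs) / len(xs)
var = sum((x - mean) ** 2 for x in xs) / len(xs)
cov = sum(x * y for x, y in zip(xs, ys)) / len(xs)
print(mean, var, cov)            # should be close to 0, 1 and 0 respectively
```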

As the joint probability density function of a pair of independent uniform distributions is shaped like a box, it is thus entirely reasonable to coin the term ‘Müller’ to refer to the shape of the joint probability density function of a pair of independent standard Gaussians.

It transpires that it’s even easier to manufacture a uniform distribution from a collection of independent standard Gaussian distributions. In particular, if W, X, Y, and Z are independent standard Gaussians, then we can produce a uniform distribution using a rational function:

$$\frac{W^2 + X^2}{W^2 + X^2 + Y^2 + Z^2}$$

The boring way to prove this is to note that this is the ratio of an exponential distribution over the sum of itself and another independent identically-distributed exponential distribution. But is there a deeper reason? Observing that the function is homogeneous of degree 0, it is equivalent to the following claim:

Take a random point on the unit sphere in 4-dimensional space (according to its Haar measure), and orthogonally project onto a 2-dimensional linear subspace. Then the squared length of the projection is uniformly distributed in the interval [0, 1].
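The claim is easy to test numerically. The sketch below assumes the rational function in question is (W² + X²) / (W² + X² + Y² + Z²), which is what both the exponential-ratio description and the projection description suggest; normalising a Gaussian vector gives a uniformly random point on the sphere, so this ratio is exactly the squared length of the projection onto the first two coordinates.

```python
import random

def squared_projection(rng):
    """Squared length of the projection of a random point of S^3 onto a 2-plane."""
    w, x, y, z = (rng.gauss(0.0, 1.0) for _ in range(4))
    top = w * w + x * x
    return top / (top + y * y + z * z)

rng = random.Random(2024)        # fixed seed, purely for reproducibility
values = [squared_projection(rng) for _ in range(20000)]

mean = sum(values) / len(values)
var = sum((v - mean) ** 2 for v in values) / len(values)
frac_below_quarter = sum(v < 0.25 for v in values) / len(values)
print(mean, var, frac_below_quarter)
```

A uniform distribution on [0, 1] has mean 1/2 and variance 1/12, and a quarter of its mass lies below 0.25, so the three printed numbers should land near those values.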

This has a very natural interpretation in quantum theory (which seems to be a special case of a theorem by Bill Wootters, according to this article by Scott Aaronson arguing why quantum theory is more elegant over the complex numbers as opposed to the reals or quaternions):

Take a random qubit. The probability p of measuring zero in the computational basis is uniformly distributed in the interval [0, 1].

Discarding the irrelevant phase factor, qubits can be viewed as elements of S² rather than S³. (This quotient map is the Hopf fibration, whose discrete analogues we discussed earlier). Here’s a picture of the Bloch sphere, taken from my 2014 essay on quantum computation:

Then, the observation reduces to the following result first proved by Archimedes:

Take a random point on the unit sphere (in 3-dimensional space). Its z-coordinate is uniformly distributed.

Equivalently, if you take any slice containing a sphere and its bounding cylinder, the areas of the curved surfaces agree precisely:

There are certainly more applications of Archimedes’ theorem on the 2-sphere, such as the problem mentioned at the beginning of Poncelet’s Porism: the Socratic Dialogue. But what about the statement involving the 3-sphere (i.e. the preimage of Archimedes’ theorem under the Hopf fibration), or the construction of a uniform distribution from four independent standard Gaussians?

]]>

- Afshani, Freksen, Kamma, and Larsen proved a lower bound of Ω(*n log n*) on the circuit complexity of integer multiplication, **conditional on a conjecture in network coding**.
- Harvey and van der Hoeven published an algorithm for large integer multiplication, establishing an unconditional upper bound of O(*n log n*). This is only marginally faster than the O(*n log n log log n*) Schönhage–Strassen algorithm, overtaking it only for unimaginably large numbers, but is of great theoretical interest because it coincides with the conjectural lower bound. (The authors also showed that the same complexity can be achieved by a multi-tape Turing machine.)

Essentially all modern integer multiplication algorithms are recursive in nature, and the computational complexity depends on the number of levels of recursion together with computational complexity of each level. To summarise:

In practice, it is common to mix-and-match these algorithms: using FFT-based algorithms (typically Schönhage–Strassen) near the root of the recursion, and switching to Toom-Cook at lower levels, before finally falling back on hardware multiplication at the leaves. This new Harvey–Hoeven algorithm is only suitable for *really* large integers, and switches to older algorithms (in the manner described) for numbers with fewer than 2^(1729^12) binary digits.

A refinement of the algorithm reduces that to 2^(9^12) = 2^282429536481 binary digits, but that is still much much larger than any number that could be practically stored, even storing one digit per atom in the observable universe.
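The arithmetic behind that comparison can be checked directly. The figure of roughly 10^80 atoms in the observable universe is a commonly quoted estimate and is an assumption here, not something taken from the paper.

```python
import math

refined_exponent = 9 ** 12            # threshold is 2^(9^12) binary digits
original_exponent = 1729 ** 12        # threshold is 2^(1729^12) binary digits

# Assumed estimate: about 10^80 atoms, i.e. about 2^266 of them.
atoms_log2 = 80 * math.log2(10)

print(refined_exponent)               # exponent of the refined threshold
print(round(atoms_log2, 1))           # log2 of the assumed atom count
print(refined_exponent / atoms_log2)  # ratio of the two exponents
```

The comparison is between exponents: the refined threshold is 2 raised to a twelve-digit number, whereas the atom count is 2 raised to a number below 270.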

]]>So, what exactly is happening?

- At 0:06, the organism begins to sequentially construct four identical copies of itself.
- At 0:14, the original organism self-destructs to leave room for its offspring.
- At 0:16, each of the four children begins to sequentially construct copies of itself. By 0:18, there are eight organisms.
- By 0:24, there are a total of thirteen organisms.
- At 0:27, the four from the previous generation self-destruct, followed shortly by the eight outermost organisms.
- By 0:34, the apoptosis of the outermost organisms finishes, leaving behind a clean isolated copy indistinguishable from the original cell.

How does it work? Why did the cells suddenly choose to die, and how did the middle cell know that it was due to survive? And how does this relate to multicellular life?

**Update, 2019-05-12:** Here’s a high-definition video of the construction of the south-east daughter machine:

The field of *artificial life* is often ascribed to Christopher Langton’s self-replicating loops. We have discussed these previously. A sequence of simple LOGO-like instructions circulates in an ensheathed loop. This information is executed 4 times to construct another copy of the loop (taking advantage of the symmetry of the daughter loop), and then the same tape is copied into the daughter loop:

If we quantify the number of times the loop’s instruction tape is utilised, we can represent it as the formal sum 4E + 1C (where ‘E’ represents one tape execution and ‘C’ represents tape copying).

However, there’s more. If the loop were only able to produce one child, the number of fertile loops would remain bounded (at 1), and it is disputed whether such bounded-fecundity ‘*linear propagators*’ are actually true self-replicators. Note that at the end of the animation above, the loop has extended a new pseudopodium upwards, and will begin constructing a second offspring.

This continues for each of the sides of the parent loop, thereby giving an overall tape utility of 4(4E + 1C) = 16E + 4C. Note that the inner ‘4E’ comes from the fourfold symmetry of the daughter loop, whereas the outer factor of 4 comes from the fourfold symmetry of the parent loop.

Anyway, after a while, the colony of self-replicating loops resembles this:

Five years after Langton’s loops were invented, John Byl removed the inner sheath of the loop to result in a more minimalistic self-replicator, with only 4 tape cells surrounded by 8 sheath cells:

Moreover, the underlying rule is simpler: only 6 states instead of 8. This comes at the expense of reduced flexibility; whereas one could build a larger Langton’s loop by increasing each side-length by *n* and inserting *n* ‘move forward’ instructions into the loop, there is no way to construct a Byl loop with any other genome.

Nor does it stop with Byl. In 1993, Chou and Reggia removed the outer sheath from the loop by adding two more states (returning to 8, same as Langton). The loops, which are barely recognisable as such, are only 6 cells in size: half of Byl’s loop and an order of magnitude smaller than Langton’s.

If minimality were the only concern, all of these examples would be blown out of the water by Edward Fredkin’s single-cell replicator in the 2-state XOR rule. However, *every* configuration in that rule replicates, including a photograph of Fredkin, so it is hard to claim that this is self-directed.

The inspiration for Langton’s loop was an earlier (1968) 8-state cellular automaton by E. F. Codd (the inventor of the relational database). Codd’s cellular automaton was designed to support universal computers augmented with universal construction capabilities: unlike Langton’s loops, the instruction tape can program the machine to build any configuration of quiescent cells, not just a simple copy of itself.

It took until 2010 before Codd’s machine was actually built, with some slight corrections, by Tim Hutton. It is *massive*:

Codd’s cellular automaton itself was borne out of a bet in a pub, where Codd challenged a friend that he could create a self-replicating computer in a cellular automaton with fewer states than von Neumann’s original 29-state cellular automaton.

For an *n*-state *k*-neighbour cellular automaton, there are $n^m$ different rules, where $m \le n^k$ is the number of distinct neighbourhoods that can occur. (We get equality in the case of asymmetric rules, but for rules with symmetries the count is more complex and depends on the Polya Enumeration Theorem.) Consequently, we can concretely define the ‘complexity’ of the rule (in bits) to be $m \log_2 n$.

For instance, Langton’s, Codd’s and Chou-Reggia’s cellular automata all have a complexity of 25056 bits, whereas Nobili’s 32-state adaptation of von Neumann’s original 29-state rule has a complexity of 167772160 bits. Conway’s two-state rule, by comparison, has only 18 bits of complexity.
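These figures can be reproduced with a few lines of Python, taking the complexity to be *m* log₂ *n*, i.e. the logarithm of the number of possible rules when there are *m* distinct neighbourhoods. The symmetry assumptions below are mine, chosen because they match the quoted numbers: the 8-state rules are treated as 5-cell von Neumann neighbourhoods counted up to rotation, Nobili's rule as fully asymmetric, and Conway's rule as depending only on the centre cell and the number of live neighbours.

```python
import math

def rotation_classes(n):
    """Neighbourhoods of a centre cell plus 4 neighbours, counted up to rotation.

    Burnside's lemma on the cyclic group of order 4 acting on the 4 outer
    cells: the identity fixes n^4 colourings, the half-turn fixes n^2, and
    each quarter-turn fixes n. The centre cell contributes a factor of n.
    """
    return n * (n ** 4 + n ** 2 + 2 * n) // 4

def complexity_bits(n, m):
    """log2 of the number n^m of rules on m neighbourhoods with n states."""
    return m * math.log2(n)

langton = complexity_bits(8, rotation_classes(8))   # rotationally symmetric
nobili = complexity_bits(32, 32 ** 5)               # asymmetric, 5 cells
conway = complexity_bits(2, 2 * 9)                  # centre state x count 0..8

print(langton, nobili, conway)
```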

We can plot the population count (including the tape) of different self-replicating machines on one axis, and the complexity of the rule on the other axis. Interestingly, qualitative categories of replicator such as ‘universal constructor’, ‘loop’, and ‘parity-rule replicator’ form visually distinct clusters in the space:

Near the top of the plot are two rough hypothetical designs of replicators which have never been built:

- Conway’s original blueprint for a universal constructor in his 2-state 9-neighbour cellular automaton, as described in *Winning Ways* and *The Recursive Universe*;
- An estimate of how large a self-replicating machine would need to be in Edwin Roger Banks’ ‘*Banks-IV*’ cellular automaton, described in his 1971 PhD thesis.

The third point from the top (Codd’s 1968 self-replicating computer) also fell into this category, until Tim Hutton actually constructed the behemoth. This has been estimated to take 1000 years to replicate, which is why it is firmly above the threshold of ‘full simulation is beyond present computational capabilities’.

Everything else in this plot has been explicitly built and simulated for at least one full cycle of replication. Immediately below Codd’s machine, for instance, is Devore’s machine (built by Hightower in 1992), which is much more efficient and can be simulated within a reasonable time. The other patterns form clusters in the plot:

- On the right-hand side of the plot is a cluster of self-replicating machines in von Neumann cellular automata, along with Renato Nobili’s and Tim Hutton’s modifications of the rule.
- The green points in the centre at the bottom are loop-like replicators. As well as Langton’s loops, this includes evolvable variants by Sayama and Oros + Nehaniv.
- The bottom-left cluster comprises trivial parity-rule replicators which have no tape and are passively copied by the underlying rule.

The yellow points on the left edge are self-propagating configurations which move by universal construction, but are not replicators in the strictest sense. They are all bounded-fecundity self-constructors, and with the exception of Greene 2013, they do not even copy their own tapes.

Finally, we have the new organism (shown in white on the left-hand side of the log-log plot, immediately below the threshold of practicality). Suitably programmed, this is a parity-rule replicator, and a loop-like replicator, and a universal constructor. It is also the first unbounded-fecundity replicator in Conway’s 2-state cellular automaton.

If we look again at the video:

we can see that, macroscopically, it copies itself in all four directions similar to Langton’s loops. The circuitry is designed such that each new child is placed in the same orientation and phase as the parent. Moreover, we see that the organism is programmed to self-destruct — either before or after constructing up to four children.

Whether or not it self-destructs prematurely depends on what *signals* it has received from its neighbours. Effectively, the machine receives a signal (a positive integer between 1 and 7, inclusive) from each of the (up to four) neighbours, and a 0 from any empty spaces if there are fewer than four neighbours. It then combines the four input signals (a, b, c, d) into a single index (each signal fits in three bits, so there are 8^4 = 4096 possible combinations), and uses it to index into a 4096-element lookup table to retrieve a value between 0 and 7 (the new ‘state’ of the machine). If 0, it immediately self-destructs without constructing any children; if nonzero, it constructs a daughter machine in each vacant space. Finally, it broadcasts the new state as a signal to all four neighbours, before self-destructing anyway.

In doing so, this loop-like replicator behaves as a single cell in any 8-state 4-neighbour cellular automaton; the rule is specified by the lookup table inside the replicator. We call this construct a **metacell** because it emulates a single cell in a (8-state 4-neighbour) cellular automaton using a large collection of cells in the underlying (2-state 9-neighbour) cellular automaton.
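At the level of the emulated rule, the behaviour described above can be sketched in Python as follows. This models only the abstract 8-state 4-neighbour update, not the construction or self-destruction machinery. The packing of the four signals into an index (`a + 8*b + 64*c + 512*d`) and the example parity table are assumptions for illustration, not the metacell's actual encoding.

```python
OFFSETS = [(0, 1), (1, 0), (0, -1), (-1, 0)]   # the four neighbours

def signal_index(a, b, c, d):
    """Hypothetical packing of four 3-bit signals into an index below 4096."""
    return a + 8 * b + 64 * c + 512 * d

def make_parity_table():
    """Example rule: new state is 1 if an odd number of neighbours is nonzero."""
    table = [0] * 4096
    for idx in range(4096):
        signals = [(idx >> (3 * k)) & 7 for k in range(4)]
        table[idx] = sum(1 for s in signals if s != 0) % 2
    return table

def step(cells, table):
    """cells maps (x, y) to a state in 1..7; empty space is state 0."""
    # Only sites next to an existing machine can receive a nonzero signal.
    candidates = {(x + dx, y + dy) for (x, y) in cells for dx, dy in OFFSETS}
    nxt = {}
    for (x, y) in candidates:
        a, b, c, d = (cells.get((x + dx, y + dy), 0) for dx, dy in OFFSETS)
        state = table[signal_index(a, b, c, d)]
        if state:                # state 0 means the site ends up empty
            nxt[(x, y)] = state
    return nxt

table = make_parity_table()
gen0 = {(0, 0): 1}
gen1 = step(gen0, table)
gen2 = step(gen1, table)
print(sorted(gen1), sorted(gen2))
```

With this parity table, a single cell becomes four copies at distance 1 and then four copies at distance 2, which is the replication behaviour typical of parity rules.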

This is not the first metacell (David Bell’s Unit Life Cell being the first example), but it is unique in having a 0-population ground state. As such, unlike the Unit Life Cell (which requires the entire plane to be tiled with infinitely many copies), any finite pattern in the emulated rule can be realised as a finite pattern in the underlying rule.

Interestingly, every 2-state 9-neighbour cellular automaton can be emulated at half the speed as an 8-state 4-neighbour cellular automaton. As such, we can ‘import’ any pattern from any such cellular automaton into Conway’s rule, thereby obtaining the first examples of:

- a parity-rule replicator (by emulating Fredkin, HighLife, or ThighLife);
- a reflectorless rotating oscillator;
- a spaceship made of perpetually colliding copies of smaller spaceships;

or even the metacell itself, recursively, obtaining an infinite sequence of exponentially larger and slower copies thereof (as if the existing metacell isn’t already too large and slow!).

To simplify the process of ‘metafying’ a pattern from an arbitrary isotropic 2-state 9-neighbour cellular automaton, I have included a Python script; this programs the metacell for the desired rule and assembles many copies (one for each cell in the original pattern) thereof into an equivalent pattern ‘writ large’.

Next time, we shall discuss in greater detail how the metacell itself was built. Until then, you may want to read Dave Greene’s recent article about some of the technology involved.

]]>An infinite strip of width can similarly be 6-coloured in a relatively simple way.

Interestingly, it has been shown that any tile-based 6-colouring of the plane is *critical* in the sense that the maximum diameter of any tile must be equal to the minimum separation between similarly-coloured tiles; there is no room for manoeuvre. Moreover, this means that it is insufficient to simply specify the colours of the tiles themselves; it is necessary to also colour the (measure-0) vertices and edges where they meet!

More updates as events warrant…

]]>…and Endre Szemeredi of regularity lemma fame…

…and Louis Nirenberg…

…and, last but certainly not least, enjoyed sparkling Riesling in a Bavarian brewery with Michael Atiyah:

He proceeded to summon several of us into a room, wherein he posed a rather interesting problem and offered a reward for its solution:

Consider *n* distinct points, $x_1, \dots, x_n$, in the three-dimensional unit ball. Let the ray (half-line) from $x_i$ through $x_j$ meet the boundary of the ball at $z_{ij}$, viewed as a complex number on the Riemann sphere. We define the monic polynomials $p_i(z) = \prod_{j \neq i} (z - z_{ij})$, whose roots are given by the projections of the remaining points onto the sphere.

Prove that these *n* polynomials are linearly independent.

If we consider the determinant of the matrix *M* formed by the coefficients of these polynomials, we get a degree-½*n*(*n−*1) homogeneous polynomial in the *n*(*n−*1) roots. This determinant can be seen to be invariant under adding a constant to all roots, but it is not scale-invariant because the degree is nonzero. This can be amended by dividing by a normalising constant, yielding a rational function δ:

Note that δ is not only scale- and translation-invariant, but also is invariant under simultaneously replacing all roots by their reciprocals. This means that δ is invariant under the entirety of the Möbius group, which corresponds naturally to the group of orientation-preserving projective transformations fixing the unit ball. Since δ is dimensionless, it is reasonable to conjecture the following stronger problem:

Prove that |δ| ≥ 1.

Apparently an acquaintance of Atiyah proved this for up to 4 points by symbolic manipulation in a computer algebra package, and experimentally verified that it appears to hold in much higher dimensions.

Interestingly, if one of the points is on the boundary of the unit ball, it can be seen that deleting it does not alter the value of δ. (Hint: since we have so much invariance, it suffices to check this at the point 0.) This allowed Atiyah to strengthen the problem even further:

Prove that, if we leave the points in-place and gradually shrink the ball until one of the points lies on the boundary, the value |δ| does not increase.

Atiyah circulated this problem to as many mathematicians as he could, offering a bottle of champagne and an invitation to the next HLF as a reward for anyone who could solve it. I was perplexed that Atiyah, who is a ‘theory-builder’ rather than a ‘problem-solver’ (e.g. Erdős), would be interested in a problem that, whilst being elegant, seemingly bears no connection to serious research mathematics. I wondered whether he was following in the footsteps of Littlewood, who used to take disguised versions of the Riemann hypothesis and give them to PhD students as research problems.

Of course, I didn’t know at the time which great problem Atiyah had reduced to this lemma. Last year, however, he gave a talk at Cambridge presenting a proof of this geometrical inequality. I wasn’t at the talk, but apparently it involved expressing the logarithm of |δ| (possibly negated) as the **von Neumann entropy** of some system, and proving the strongest version of the conjecture as a corollary of entropy being non-decreasing.

On Monday morning, however, Atiyah will be presenting a proof of the Riemann hypothesis in a 45-minute talk at the Heidelberg Laureate Forum, three years after he presented this problem to us. The abstract of the forthcoming talk mentions that it builds upon work by von Neumann, which is tantalisingly consistent with my prediction that his ‘points in a ball’ conjecture was merely the remaining lemma required to solve a huge unsolved problem!

Anyway, in 60 hours’ time, number theory will be revolutionised. Let’s hope that his proof generalises easily to GRH as well, so that we can enjoy a deterministic primality test faster than AKS.

]]>For reasons that shall soon become clear, I found myself faced with the task of sorting a list of 12 objects.

Usually one would choose an algorithm such as *quicksort* or *Timsort*. Conventional comparison-based sorting algorithms operate by comparing pairs of objects, and are otherwise unrestricted: the choices of objects to compare can depend on the results of previous comparisons.

A *sorting network* is a much more restricted sorting algorithm, where the only allowed operation is the compare-exchange instruction CMPX(*i*, *j*). This compares objects in positions *i* and *j*, swapping them if they are in the wrong order, and revealing no information. Here are the best known sorting networks on 9 and 12 elements, photographed from *The Art of Computer Programming* by Donald Knuth:

So, with straight-line code of 39 CMPX instructions it is possible to sort a collection of 12 objects without any need for loops, conditional branching, or any other form of control flow. This is especially useful when programming a GPU, where control flow is toxic for performance.

I proceeded to transcribe the sorting network from the above diagram into CUDA code. As a mere mortal, I was not totally convinced that I’d copied it flawlessly, so I built a test to verify the correctness of the transcribed network. Preferring to do this in a high-level language such as Python, I resorted to my usual trick of writing a single file which is valid in two languages and incorporating it into the source code by means of one innocuous line: **#include "sorting_network.py"**

(If you think this is bad, people have done much worse…)

Examining the Python component of the code, you may notice that it only tests the 2^12 different binary sequences, rather than the 12! different totally ordered sets. It is a general property of comparator networks that it suffices to only test binary sequences to prove that the network can sort arbitrary sequences; this is known as the 0-1 principle.
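As a minimal sketch of such a test (this is not the original `sorting_network.py`; the 4-input network `NETWORK_4` and all function names below are chosen purely for illustration), the 0-1 principle can be exercised like this:

```python
from itertools import permutations, product

# Illustrative 5-comparator network on 4 inputs (not the 12-input one from Knuth).
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def run_network(network, values):
    """Apply each CMPX(i, j) in turn, swapping whenever the pair is out of order."""
    v = list(values)
    for i, j in network:
        if v[i] > v[j]:
            v[i], v[j] = v[j], v[i]
    return v


def sorts_all_binary(network, n):
    """Check the 2^n binary inputs, which suffices by the 0-1 principle."""
    return all(run_network(network, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


def sorts_all_permutations(network, n):
    """Exhaustive check over all n! orderings, for comparison."""
    return all(run_network(network, perm) == list(range(n))
               for perm in permutations(range(n)))


print(sorts_all_binary(NETWORK_4, 4), sorts_all_permutations(NETWORK_4, 4))
```

For 12 inputs the binary check visits 4096 sequences, whereas the exhaustive check would visit 479001600 permutations.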

What is the minimum number of CMPX gates necessary to sort *n* objects? And what is the minimum circuit depth? The naive algorithm of bubble sort shows that a gate-count of O(*n*^2) and a circuit depth of O(*n*) are both attainable. Conversely, the gate-count must be at least the binary logarithm of *n*! (as with any comparison-based sorting algorithm), which gives a lower bound of Ω(*n log n*) for the gate-count and, since each layer contains at most ½*n* comparators, Ω(*log n*) for the depth.
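To put numbers on these bounds, here is a small sketch (the function names are illustrative) that computes the information-theoretic lower bound ⌈log2(*n*!)⌉ exactly, alongside the *n*(*n* − 1)/2 comparators of a bubble-sort network:

```python
import math


def min_comparisons(n):
    """Ceiling of log2(n!), computed exactly with integer arithmetic."""
    # For any integer m >= 1, ceil(log2(m)) equals (m - 1).bit_length().
    return (math.factorial(n) - 1).bit_length()


def bubble_sort_comparators(n):
    """Number of compare-exchange gates in a bubble-sort network."""
    return n * (n - 1) // 2


for n in (4, 12):
    print(n, min_comparisons(n), bubble_sort_comparators(n))
```

For *n* = 12 the lower bound is 29 comparators, against 39 in the best known network and 66 for bubble sort.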

Batcher found a recursive construction of sorting networks with a depth of ½*k*(*k*+1), where *k* is the ceiling of the binary logarithm of *n*, and each layer has ½*n* comparators. This is achieved by firstly Batcher-sorting the initial and final halves of the sequence, followed by interleaving them (diagram by User:Bitonic from Wikipedia):

The correctness of the algorithm follows from the aforementioned 0-1 principle. By the inductive hypothesis, it suffices to examine the rightmost blue box and suppose that the two halves of the input are correctly sorted, in which case the input would resemble:

[n/2 – m zeroes] [m ones] | [l zeroes] [n/2 – l ones]

The only ‘cross-lane’ operations are the comparators in the brown box. If l is no greater than m, the result of this is the following:

[n/2 – m zeroes] [m – l ones] [l zeroes] | [n/2 ones]

and otherwise we get the complementary arrangement:

[n/2 zeroes] | [m ones] [l – m zeroes] [n/2 – l ones]

Concentrating only on the non-constant half, our task is reduced to the simpler problem of sorting a binary sequence which switches at most twice between a run of zeroes and a run of ones. We can split the effect of the pink box into two modules: one which reverses one of the two halves (we get to decide which half!), followed by one which behaves identically to a brown box. Observe that, as before, one of the two halves of the pink box must therefore be constant, and the other must again be a binary sequence which switches at most twice. By induction, the result follows.

Owing to the low depth, simplicity, and efficiency, Batcher’s bitonic mergesort is often used for sorting large lists on GPUs.
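The construction can be sketched in a few lines of Python. This is an illustrative generator for power-of-two sizes, with the ‘brown’ and ‘pink’ boxes named after the diagram; it is not taken from any particular GPU library, and the function names are my own:

```python
from itertools import product


def bitonic_layers(n):
    """Layers of comparators (i, j) with i < j, for n a power of two."""
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    layers = []
    size = 2
    while size <= n:
        # 'Brown box': within each block, compare position t with its mirror image.
        layers.append([(b + t, b + size - 1 - t)
                       for b in range(0, n, size)
                       for t in range(size // 2)])
        # 'Pink boxes': compare position t with t + span, halving span each layer.
        span = size // 4
        while span >= 1:
            layers.append([(b + t, b + t + span)
                           for b in range(0, n, 2 * span)
                           for t in range(span)])
            span //= 2
        size *= 2
    return layers


def apply_layers(layers, values):
    """Run the network on a sequence, returning the result as a list."""
    v = list(values)
    for layer in layers:
        for i, j in layer:
            if v[i] > v[j]:
                v[i], v[j] = v[j], v[i]
    return v


def sorts_all_binary(layers, n):
    """0-1 principle check over all 2^n binary inputs."""
    return all(apply_layers(layers, bits) == sorted(bits)
               for bits in product((0, 1), repeat=n))


layers = bitonic_layers(16)
print(len(layers), sum(len(layer) for layer in layers))  # 10 80
```

Every comparator within a layer touches a distinct pair of lanes, so each layer can be executed fully in parallel.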

But is the bitonic mergesort optimal? The circuit above takes 80 comparators to sort 16 inputs, whereas the best circuit in Knuth takes only 60 comparators (again with a depth of 10). It’s not even optimal for depth, as the next page of Knuth has a 61-comparator sorting network with a depth of 9.

What about asymptotics? The bitonic mergesort gives an upper bound on the depth of O((*log n*)^2) and basic information theory gives a lower bound of Ω(*log n*).

The next surprise was when Szemerédi, Komlós and Ajtai proved that the lower bound is tight: they exhibited a construction of sorting networks of optimal depth O(*log n*). As you can imagine from Szemerédi’s background in combinatorics and extremal graph theory, the construction relies on a family of graphs called *expanders*.

A simplified version of the construction (by Paterson, 1990) is described here. The original paper provides explicit constants, showing that a depth ~ 6100 log(*n*) is possible, compared with ~ ½ log(*n*)^2 for Batcher’s bitonic mergesort. In other words, the threshold for switching from bitonic mergesort to Paterson’s variant of AKS occurs when *n* is approximately 2^12200.

A further improvement by Chvátal reduces the asymptotic constant from 6100 to 1830, and actually provides an explicit (non-asymptotic) bound: provided *n* ≥ 2^78, there is a sorting network of depth 1830 log(*n*) − 58657. This brings the crossover point down to exactly *n* = 2^3627. As Knuth remarked, this is still far greater than the number of atoms in the observable universe, so the practical utility of the AKS sorting algorithm is questionable.
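The crossover points can be checked with a few lines of arithmetic. The sketch below assumes *n* = 2^*k*, compares Chvátal’s bound against the exact Batcher depth ½*k*(*k*+1), and compares Paterson’s estimate against the approximation ½*k*^2 used above; the function names are illustrative:

```python
def batcher_depth(k):
    """Depth of Batcher's bitonic mergesort on 2^k inputs."""
    return k * (k + 1) // 2


def chvatal_depth(k):
    """Chvatal's explicit bound, valid for k >= 78."""
    return 1830 * k - 58657


def chvatal_crossover(min_k=78):
    """Smallest k >= min_k at which Chvatal's bound is no deeper than Batcher's."""
    k = min_k
    while chvatal_depth(k) > batcher_depth(k):
        k += 1
    return k


def paterson_crossover():
    """Smallest k >= 1 with 6100 k <= k^2 / 2, using the approximation from the text."""
    k = 1
    while 2 * 6100 * k > k * k:
        k += 1
    return k


print(chvatal_crossover())   # 3627
print(paterson_crossover())  # 12200
```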

Interestingly, this is not the first time there has been an asymptotically impressive algorithm named AKS after its authors: a trio from the Indian Institute of Technology Kanpur {Agrawal, Kayal, Saxena} found the first unconditional deterministic polynomial-time algorithm for testing whether an *n*-digit number is prime. This O(*n*^(6+o(1))) algorithm tends not to be used in practice, because everyone believes the Generalised Riemann Hypothesis and its implication that the O(*n*^(4+o(1))) *deterministic Miller-Rabin algorithm* is correct.
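For the curious, a deterministic Miller-Rabin test might be sketched as follows. The bound on the bases, 2(ln *N*)^2, is Bach’s result and is not stated above; it guarantees correctness only under GRH. The function name is illustrative:

```python
import math


def is_prime_miller(N):
    """Deterministic Miller-Rabin; its correctness for composites assumes GRH."""
    if N < 2:
        return False
    for p in (2, 3):
        if N % p == 0:
            return N == p
    # Write N - 1 = d * 2^s with d odd.
    d, s = N - 1, 0
    while d % 2 == 0:
        d //= 2
        s += 1
    # Under GRH (Bach's bound), testing every base up to 2 (ln N)^2 suffices.
    limit = min(N - 2, math.floor(2 * math.log(N) ** 2))
    for a in range(2, limit + 1):
        x = pow(a, d, N)
        if x in (1, N - 1):
            continue
        for _ in range(s - 1):
            x = x * x % N
            if x == N - 1:
                break
        else:
            return False  # a is a witness that N is composite
    return True


print([N for N in range(2, 60) if is_prime_miller(N)])
```

A verdict of ‘composite’ is always correct; only the verdict of ‘prime’ leans on GRH.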

- For Northern Ireland to have dual EU/UK status;
- For there to be a 10-mile ‘trade buffer zone’ between Northern Ireland and the Republic of Ireland.

The second is more interesting from a mathematical perspective: the 10-mile buffer zone means that (the closures of) Northern Ireland and the Republic of Ireland are disjoint compact subsets of a normal topological space. By Urysohn’s Lemma, this means that there exists a continuous function *f* from Ireland to the interval [0, 1] such that *f* is identically 0 on Northern Ireland and identically 1 on the Republic of Ireland.

The proof of this proceeds as follows:

- By taking closures, assume without loss of generality that NI and ROI are both closed and disjoint (the interior 10-mile buffer zone is not considered to belong to either).
- Define U(1) and V(0) to be the complements of ROI and NI, respectively. These are overlapping open sets, whose intersection is the buffer zone.
- For each n = 1, 2, 3, …:
    - For each dyadic rational r in (0, 1) with denominator 2^n and odd numerator:
        - Let q = r − 2^(−n) and s = r + 2^(−n), so q < r < s are adjacent;
        - By appealing to the normality of Ireland, let U(r) and V(r) be two disjoint open sets containing the complements of V(q) and U(s), respectively.
- Now we have disjoint open sets U(r) and V(r) for each dyadic rational r, such that the U(r) form an ascending chain of nested spaces.
- Define f(x) = inf {r : x ∈ U(r)} (where the infimum of an empty set is taken to be 1).

With this interpolating function *f*, it is easy to take convex combinations of EU and UK standards. For example, a road sign at a point x must be stated in ‘lengths per hour’, where one length is exactly 1 + 0.609344(1 – f(x)) kilometres.
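As a tongue-in-cheek sketch of that conversion (the function names are illustrative, and the value of f(x) is simply passed in rather than computed from any real map):

```python
def length_in_km(f_x):
    """Size of one local 'length' in kilometres, given the value f(x) in [0, 1]."""
    if not 0.0 <= f_x <= 1.0:
        raise ValueError("f(x) must lie in [0, 1]")
    return 1 + 0.609344 * (1 - f_x)


def sign_value(speed_kmh, f_x):
    """Number to print on a road sign, in lengths per hour."""
    return speed_kmh / length_in_km(f_x)


# f = 0 in Northern Ireland (miles), f = 1 in the Republic (kilometres).
for f_x in (0.0, 0.5, 1.0):
    print(f_x, length_in_km(f_x), round(sign_value(100, f_x), 1))
```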

]]>In other news, following on from Aubrey de Grey’s 5-chromatic unit-distance graph, there has been an effort to study the algebraic structure of the graphs. Specifically, viewing the vertices as points in the complex plane, one can ask what number fields contain the vertices of the unit-distance graphs.

In particular, it was noted that both Moser’s spindle and Golomb’s graph, the smallest examples of 4-chromatic unit-distance graphs, lie in the ring , where is a complex number with real part and absolute value 1. Ed Pegg Jr produced a beautiful demonstration of this:

Philip Gibbs showed that the entire ring, and consequently all graphs therein, can be coloured by a homomorphism to a four-element group. Consequently, Ed Pegg’s hope that the large unit-distance graph above is 5-chromatic was doomed to fail — but that is not too much of a worry now that we have de Grey’s 5-chromatic graph.

Several of Marijn Heule’s 5-chromatic graphs lie in . Apparently both this ring and have homomorphic 5-colourings, so we cannot find a 6-chromatic unit-distance graph lying in either of these rings.

Incidentally, the record for the smallest known 5-chromatic unit-distance graph is a 610-vertex example, again due to Heule:

]]>