Over the past few weeks, I’ve been investigating *Boolean optimisation*. That is to say, given some circuit of logic gates that implements a particular *n*-input *m*-output function, find a more efficient circuit that implements the same function. In practical applications, ‘more efficient’ is a multi-objective optimisation problem, with the two highest priorities generally being:

- number of logic gates (smaller is better);
- depth of circuit (lower is better).

One of the best pieces of software out there is Berkeley’s ABC tool. It represents a circuit in a form called an AIG (AND-inverter graph), which is a directed acyclic graph of 2-input AND gates and 1-input NOT gates (the latter of which are considered to be free). Then, it performs a variety of rounds of local optimisations, such as:

- searching for 4-input 1-output subcircuits and ‘rewriting’ them by replacing with equivalent subcircuits of fewer logic gates;
- searching for subcircuits that can be ‘refactored’ as compositions of smaller subcircuits;
- ‘balancing’ the graph to minimise the circuit depth.

In 2011, Nan Li and Elena Dubrova wrote an article which demonstrated significant improvements by including a selection of 5-input 1-output replacements. Instead of restricting to AIGs, the authors allowed elementary XOR gates in the graph as well, which (in the presence of costless 1-input inverters) has the elegant effect that every 2-input Boolean gate has unit cost.
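The unit-cost claim is easy to check by enumeration. The following is a minimal sketch, not code from the paper or from ABC; the helper names (`gate_table`, `unit_cost_tables`, `depends_on_both`) are made up for this illustration. It builds every gate obtainable from a single AND or XOR with free inverters on the inputs and the output, and compares that set with the 2-input functions that genuinely depend on both inputs.

```python
from itertools import product

def gate_table(fn):
    """Pack a 2-input gate into a 4-bit truth table (bit index = 2*a + b)."""
    return sum(fn(a, b) << (2 * a + b) for a, b in product((0, 1), repeat=2))

def unit_cost_tables():
    """Tables reachable from one AND or XOR gate with free inverters everywhere."""
    reachable = set()
    for base in (lambda a, b: a & b, lambda a, b: a ^ b):
        for na, nb, nout in product((0, 1), repeat=3):
            # Optionally invert each input and the output of the base gate.
            reachable.add(gate_table(lambda a, b: base(a ^ na, b ^ nb) ^ nout))
    return reachable

def depends_on_both(table):
    """True if the gate's output is sensitive to each of its two inputs."""
    bit = lambda a, b: (table >> (2 * a + b)) & 1
    dep_a = any(bit(0, b) != bit(1, b) for b in (0, 1))
    dep_b = any(bit(a, 0) != bit(a, 1) for a in (0, 1))
    return dep_a and dep_b

# The 16 two-input functions minus the constants and the single-input ones.
nontrivial = {t for t in range(16) if depends_on_both(t)}
```

Here `unit_cost_tables()` and `nontrivial` come out as the same ten tables: eight AND-like gates and two XOR-like gates.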

There are exactly 2^32 = 4294967296 Boolean functions with 5 inputs and 1 output, so it would be infeasible in practice to directly store optimal circuits for all of them. However, up to inverting the inputs, permuting the inputs, and negating the output, there are only 616126 equivalence classes (called *NPN classes*, for ‘negate-permute-negate’). The authors cherry-picked approximately 1000 of those, and used a *Boolean matcher* to sequentially test a given subcircuit against each of these classes in turn. Doing so for all 616126 equivalence classes would soon get rather slow…
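For small input counts the NPN classes can be counted directly by marking orbits. This is only a sketch under stated assumptions: truth tables are stored as integers with bit `x` holding the output for input pattern `x`, and the function names are hypothetical. It is practical up to 4 inputs; the 5-input case has 2^32 functions and needs a far more careful approach than this.

```python
from itertools import permutations

def index_maps(n):
    """One index map per input permutation combined with input negations."""
    maps = []
    for perm in permutations(range(n)):
        for neg in range(1 << n):
            m = []
            for x in range(1 << n):
                z = x ^ neg                        # negate the chosen inputs
                y = 0
                for i in range(n):
                    y |= ((z >> i) & 1) << perm[i]  # move input i to slot perm[i]
                m.append(y)
            maps.append(m)
    return maps

def apply_map(table, m):
    """Truth table of g, where g(x) = f(m[x])."""
    return sum(((table >> y) & 1) << x for x, y in enumerate(m))

def count_npn_classes(n):
    size = 1 << n
    mask = (1 << size) - 1
    maps = index_maps(n)
    seen = bytearray(1 << size)
    classes = 0
    for f in range(1 << size):
        if seen[f]:
            continue
        classes += 1                 # f starts a new, previously unseen orbit
        for m in maps:
            g = apply_map(f, m)
            seen[g] = 1
            seen[g ^ mask] = 1       # output negation
    return classes
```

Running `count_npn_classes(n)` for n = 1, 2, 3, 4 gives 2, 4, 14 and 222 classes respectively.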

### Knuth’s exhaustive search

Earlier, in 2005, Donald Knuth wrote a collection of computer programs to find the lowest-cost implementations of all 616126 NPN classes of 5-input 1-output functions. Instead of Boolean matching, Knuth’s approach was to ‘canonise’ functions: find the lexicographically smallest truth table which is NPN-equivalent to a given function, and use that as the representative for the NPN class. The serious advantage is that lookup only takes constant time, by using the canonical truth table as a key into a hashtable.

To avoid a full brute-force search, Knuth cleverly approached the problem by induction: try to describe a larger circuit (implementing a harder function) in terms of smaller circuits (implementing easier functions). He separated the inductive step into three cases:

- **Top-down**: If we can compute A in n gates and B in m gates, then f(A, B) can be computed in n + m + 1 gates, where f is an arbitrary gate.
- **Bottom-up**: If we can compute C(x1, x2, x3, x4, x5) in n gates, then we can compute C(f(x1, x2), x2, x3, x4, x5) in n + 1 gates, and C(f(x1, x2), g(x1, x2), x3, x4, x5) in n + 2 gates.
- **Special**: Anything not of the above form. By assuming that it’s not of either of the previous cases, the possible structure of such a circuit can be constrained considerably, reducing the size of the brute-force search.
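To get a feel for the top-down rule on its own, here is a small sketch for 3-input functions. It is not Knuth's program: it applies only the rule cost(f(A, B)) ≤ cost(A) + cost(B) + 1, so it yields upper bounds in which the two operands never share gates, whereas the full search also needs the bottom-up and special cases. The function name and the level-by-level strategy are choices made for this illustration.

```python
def top_down_costs(n=3):
    """Upper bounds on gate counts using only the top-down combination rule."""
    size = 1 << n
    mask = (1 << size) - 1
    cost = {0: 0, mask: 0}                       # constants are free
    for i in range(n):
        t = sum(1 << x for x in range(size) if (x >> i) & 1)
        cost[t] = 0                              # input x_i
        cost[t ^ mask] = 0                       # its free inverse
    level = 1
    while len(cost) < (1 << size):
        by_cost = {}
        for t, c in cost.items():
            by_cost.setdefault(c, []).append(t)
        found = set()
        for ca in range(level):
            cb = level - 1 - ca                  # ca + cb + 1 == level
            for a in by_cost.get(ca, []):
                for b in by_cost.get(cb, []):
                    # Operands already include all inverses, so AND and XOR
                    # (plus a free output inverter) cover every 2-input gate.
                    for g in (a & b, a ^ b):
                        if g not in cost:
                            found.add(g)
                            found.add(g ^ mask)
        for t in found:
            cost[t] = level
        level += 1
    return cost

costs = top_down_costs(3)
```

With the inputs encoded as 0xAA, 0xCC and 0xF0, the table gives a cost of 1 for the AND of two inputs (0x88), 2 for three-input parity (0x96), and at most 4 for majority (0xE8).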

Eventually, he had solved all but 6 NPN classes of functions (each of which he knew required either 11 or 12 gates). With some extra computational horsepower, he solved these last holdouts, finding that all but one could be managed in 11 gates, and therefore the last one required exactly 12.

### Optimal5: an efficient database of Knuth’s solutions

One slight impasse from a usability perspective is that the above results were separated across several databases (for top-down and bottom-up steps), text files (for the majority of the special chains), and even in the README file (for the last 6 NPN classes). As such, I realised that it’s worth organising Knuth’s results into a more convenient form.

This was the motivation behind **optimal5**: a database I created with two aims:

- Consolidating Knuth’s research into a uniform database;
- Making function canonisation as efficient as possible, allowing fast lookup.

The first of these tasks was routine — it just involved tracing the inductive constructions (including keeping track of permutations and negations of intermediate results) and ‘unpacking’ them into complete normalised circuits. It was rather laborious owing to the piecemeal structure of the inductive proof, but not particularly noteworthy.

The second of these tasks was both more mathematically interesting and more challenging. In Knuth’s original code, a function is canonised by iterating through all 3840 (2^5 × 5!) permutations and negations of the inputs, negating the output if necessary to ensure the function is zero-preserving (that is, it maps the all-zeros input to 0), and taking the lexicographic minimum over all of those choices.
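The brute-force canonisation just described can be sketched in a few lines. This is an illustration rather than Knuth's code or the optimal5 implementation: it assumes bit `x` of a 32-bit integer holds the output for input pattern `x`, and it uses ordinary integer comparison in place of whatever lexicographic convention the original programs use. All names are hypothetical.

```python
from itertools import permutations

N = 5
SIZE = 1 << N                 # 32 truth-table entries
MASK = (1 << SIZE) - 1

def build_index_maps():
    """One index map for each of the 3840 input permutations and negations."""
    maps = []
    for perm in permutations(range(N)):
        for neg in range(SIZE):
            m = []
            for x in range(SIZE):
                z = x ^ neg
                y = 0
                for i in range(N):
                    y |= ((z >> i) & 1) << perm[i]
                m.append(y)
            maps.append(m)
    return maps

INDEX_MAPS = build_index_maps()

def canonise(table):
    """Smallest zero-preserving table that is NPN-equivalent to `table`."""
    best = MASK
    for m in INDEX_MAPS:
        g = 0
        for x, y in enumerate(m):
            g |= ((table >> y) & 1) << x
        if g & 1:             # output is 1 on the all-zeros input: negate it
            g ^= MASK
        if g < best:
            best = g
    return best

AND5 = 1 << 31                # 1 only when all five inputs are 1
OR5 = MASK ^ 1                # 0 only when all five inputs are 0

# A one-entry stand-in for the real database, keyed by canonical table.
gate_counts = {canonise(AND5): 4}
```

Because AND and OR are NPN-equivalent, `gate_counts[canonise(OR5)]` finds the same entry. Each call loops over all 3840 index maps, which is exactly the cost the rest of this post sets out to reduce.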

But 3840 is quite a large number, so even with Knuth’s very streamlined bitwise tricks, it still took a whole **10 microseconds** to canonise a function. After Admiral Grace Hopper’s unforgettable lecture about nanoseconds and microseconds and what length of wire would be hung around my neck per microsecond, I wanted to improve upon that.

If all of this discussion about 5-input 1-output Boolean functions is rather abstract, imagine a 5-dimensional hypercube such as the one below, which is deservedly the logo for the project:

A 5-input 1-output Boolean function corresponds to a way to colour the vertices of this hypercube red and green. Two such functions are NPN-equivalent if you can rotate/reflect one hypercube, and possibly alternate the colours, to make it identical to the other. (And if 5-dimensional hypercubes are too difficult to visualise, just visualise 3-dimensional cubes instead — this simplification doesn’t actually ruin any of the intuition.)

This 5-dimensional (resp. 3-dimensional) hypercube has 10 faces (resp. 6). So we can systematically place each one of those face-down, and look at just the 16 vertices (resp. 4) on the top face, and find out the top face’s canonical form by looking it up in a 2^16-element lookup table. So we’ve made 10 lookups so far, one for each face.
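Reading off a face amounts to fixing one input to 0 or 1 and keeping the remaining 16 outputs. The sketch below shows only that extraction step, with the same bit-per-input-pattern encoding assumed as an illustration; the 2^16-element table of canonical forms is not reproduced here, and the function names are hypothetical.

```python
def cofactor(table, var, value, n=5):
    """Fix input `var` to `value` and return the (n-1)-input truth table."""
    out = 0
    pos = 0
    for x in range(1 << n):
        if ((x >> var) & 1) == value:
            out |= ((table >> x) & 1) << pos   # keep vertices on this face
            pos += 1
    return out

def faces(table, n=5):
    """All 2n faces, keyed by (input index, fixed value)."""
    return {(var, value): cofactor(table, var, value, n)
            for var in range(n) for value in (0, 1)}

# 5-input parity: output 1 when an odd number of inputs are 1.
PARITY5 = sum(1 << x for x in range(32) if bin(x).count("1") % 2)
parity_faces = faces(PARITY5)
```

For parity, every face is 4-input parity (0x6996) or its complement (0x9669), which is why such a symmetric function gains nothing from discarding faces: all ten are equivalent.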

Now, a canonical hypercube must have a canonical top face, so we can discard whichever subset of those 10 orientations (in most cases, it will be 9 out of 10) don’t achieve the lexicographical minimum, and concentrate only on the others. At that point we could do an exhaustive search over 384 permutations, instead of 3840, and save ourselves a factor of 10 in most cases (and gain nothing for very highly symmetric functions, such as the parity function). If I recall correctly, this gave an improvement to about **1.6 microseconds**. Much better, but I’d still prefer not to have Admiral Hopper suspend half a kilometre of conducting wire around my neck, thereby necessitating even more mathematics:

### Hamiltonian paths

Of course, there’s no point traversing all 384 permutations, since you know that (once you’ve made the top face lexicographically minimal) only the elements in the stabiliser subgroup of the top face have any chance of resulting in the lexicographically smallest colouring of the entire cube. So we can instead traverse this subgroup. I decided to ask on MathOverflow whether anyone knew how to solve the Travelling Salesman Problem efficiently on a Cayley graph, but they didn’t, so I implemented the Held-Karp algorithm instead. Specifically, I opted for:

- If the stabiliser has at most 24 elements, use the optimal Hamiltonian path determined by Held-Karp;
- Otherwise (and this case is sufficiently rare that it doesn’t matter that it’s slower), just traverse all 384 elements as before.
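For reference, a generic Held-Karp dynamic programme for the cheapest Hamiltonian path from a fixed start vertex looks roughly like this. It is a sketch, not the optimal5 code generator: the cost matrix is a made-up example, and how the real project weights the moves between group elements is not stated here, so that part is left as an assumption.

```python
from itertools import combinations

def held_karp_path(cost, start=0):
    """Cheapest path from `start` that visits every vertex exactly once."""
    n = len(cost)
    if n == 1:
        return 0, [start]
    others = [v for v in range(n) if v != start]
    # best[(mask, v)] = (cost, predecessor) of the cheapest path that leaves
    # `start`, visits exactly the vertices in `mask`, and ends at v.
    best = {(1 << v, v): (cost[start][v], start) for v in others}
    for size in range(2, n):
        for subset in combinations(others, size):
            mask = sum(1 << v for v in subset)
            for v in subset:
                prev_mask = mask ^ (1 << v)
                best[(mask, v)] = min(
                    (best[(prev_mask, u)][0] + cost[u][v], u)
                    for u in subset if u != v)
    full = sum(1 << v for v in others)
    total, end = min((best[(full, v)][0], v) for v in others)
    # Walk the predecessor links backwards to recover the path.
    path, mask, v = [], full, end
    while v != start:
        path.append(v)
        prev = best[(mask, v)][1]
        mask ^= 1 << v
        v = prev
    path.append(start)
    path.reverse()
    return total, path

# Hypothetical symmetric move costs between four group elements.
example_cost = [
    [0, 1, 4, 3],
    [1, 0, 1, 5],
    [4, 1, 0, 1],
    [3, 5, 1, 0],
]
```

On `example_cost` the cheapest path is 0, 1, 2, 3 with total cost 3. The running time grows as 2^n times n^2, which is why a cut-off such as 24 elements matters, and why it makes sense to precompute the paths once rather than at lookup time.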

Being far too lazy to manually write code for all 75 subgroups that arise in this manner, I instead wrote a much shorter program to generate this code on my behalf. (If you’re wondering whence the constant 1984 arises, it’s the smallest modulus such that all 222 canonical 4-input functions have distinct residues; this is a rudimentary example of perfect hashing.)
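The modulus trick can be sketched as follows. The keys here are a small made-up set rather than the 222 canonical 4-input tables, so this does not attempt to reproduce the constant 1984, which depends on the exact canonical forms used; the function names are hypothetical.

```python
def smallest_perfect_modulus(keys):
    """Smallest m such that all keys have distinct residues modulo m."""
    keys = set(keys)
    m = max(1, len(keys))          # fewer than len(keys) slots cannot work
    while len({k % m for k in keys}) < len(keys):
        m += 1
    return m

def build_table(entries):
    """Place each (key, value) pair in slot key % m; no collisions by design."""
    m = smallest_perfect_modulus(entries)
    table = [None] * m
    for key, value in entries.items():
        table[key % m] = (key, value)
    return table

def lookup(table, key):
    slot = table[key % len(table)]
    # Compare the stored key so that absent keys are rejected.
    return slot[1] if slot is not None and slot[0] == key else None

example = {0: "a", 1: "b", 3: "c", 6: "d", 10: "e", 15: "f"}
table = build_table(example)
```

For these six keys the smallest workable modulus is 8, so the table has two empty slots. Lookup is a single remainder and one comparison, with no probing.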

By this point, it took a total of **686 nanoseconds** on average to canonise a function, look up the circuit in the hashtable, transform that circuit back to the original function, and check the result.

### Further optimisations

Using the profiler *perf* I was able to see that the canonisation was no longer the bottleneck, and the other things were taking the lion’s share of the time. Satisfied with the algorithm, I slightly rewrote parts of the implementation to make it faster (e.g. fixed-size data structures instead of std::vectors for representing circuits), and slashed the total time down to **308 nanoseconds**.

Observing that the hashtable lookup itself was taking much of the time, Tom Rokicki helpfully suggested replacing the std::unordered_map with a custom implementation of a hashtable (ideally using perfect hashing, as with the Hamiltonian path lookup, or a ‘semi-perfect’ compromise). Back-of-the-envelope calculations suggested that such a hashtable would end up being very sparse, with lots of empty space, annihilating much of the memory advantage of only storing one representative per NPN equivalence class.

Then finally I did something that required ε of effort to accomplish: I simply searched the Internet for the fastest hashtable I could find, swapped the std::unordered_map with this fancy ‘flat hashmap’, and crossed my fingers. The result? **209 nanoseconds**. The performance profile is now sufficiently uniform, with no clear bottlenecks or obvious sources of algorithmic inefficiency, that I’m happy to leave it there and not try to squeeze out any extra performance. Moreover, 60 metres of wire isn’t nearly as uncomfortable as the three kilometres we started with…

### Future work

I was having a discussion with Rajan Troll, who wondered whether some multi-output rewriting steps could be useful. A back-of-the-envelope calculation (taking the leading term of the Polya enumeration formula and discarding the other terms) suggests that there are about 1.4 million NPPN* classes of 4-input 2-output functions.

*the two *outputs* can be freely permuted, as well as the four inputs, ergo the extra P. (I suppose that if I had multiple interchangeable inputs and outputs, whatever that means, I would be an APPG.)
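The back-of-the-envelope figure is just the number of functions divided by the size of the symmetry group. The sketch below assumes the group consists of negating each input, permuting the inputs, permuting the outputs and negating each output; the function name is hypothetical.

```python
from math import factorial

def leading_term(n_inputs, n_outputs):
    """Number of functions divided by the order of the symmetry group."""
    functions = 2 ** (n_outputs * 2 ** n_inputs)
    group_order = (2 ** n_inputs * factorial(n_inputs)
                   * factorial(n_outputs) * 2 ** n_outputs)
    return functions / group_order

estimate_4_2 = leading_term(4, 2)   # 2^32 / 3072
estimate_5_1 = leading_term(5, 1)   # 2^32 / 7680
```

For 4 inputs and 2 outputs this gives roughly 1398101, matching the 1.4 million quoted above. For 5 inputs and 1 output it gives roughly 559241, below the true count of 616126, as expected: the discarded terms are all non-negative, so the leading term is a lower bound.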

Since using 4-input 2-output rewriting could enable logic sharing (where two different computations share intermediate results), there seems to be a significant amount of utility in embarking on a Knuth-style search for optimal 4-input 2-output (as opposed to 5-input 1-output) circuits.

I’ve started working on that now, including having written a script to enumerate all of the possible shapes of optimal n-input 1-output Boolean chains. This is sufficient, since any 4-input 2-output circuit can be decomposed into a 4-input 1-output chain (computing one of the outputs) and an n-input 1-output chain (computing the other output), where the second chain’s inputs may include intermediate values from the first chain.

Updates to follow as events warrant…

I’m curious, which circuit(s) took the most gates to implement? And how many gates does it take?

I’m happy to see that there are more and more people preferring the free lab over the proprietary hub.