Quantum circuit optimization with AlphaTensor

The signature tensor
The tensor representation of a circuit relies on the concept of phase polynomials. Consider a circuit with CNOT and T gates alone. Its corresponding unitary matrix U performs the operation \(U\left\vert {x}_{1:N}\right\rangle ={{\mathrm{e}}}^{{\mathrm{i}}\phi ({x}_{1:N})}\left\vert A{x}_{1:N}\right\rangle\), where ϕ(x1:N) is the phase polynomial and A is an invertible matrix that can be implemented with Clifford gates only. For example, applying a T gate to the second qubit after a CNOT gives \((I\otimes T\,){\mathrm{CNOT}}\left\vert x,y\right\rangle ={{\mathrm{e}}}^{{\mathrm{i}}\frac{\uppi }{4}(x\oplus y)}\left\vert x,x\oplus y\right\rangle\), that is, \(\phi (x,y)=\frac{\uppi }{4}(x\oplus y)=\frac{\uppi }{4}(x+y-2xy)\). After cancelling out the terms leading to multiples of 2π, the phase polynomial takes a multilinear form, that is, \(\phi ({x}_{1:N})=\frac{\uppi }{4}\left({\sum }_{i}{a}_{i}{x}_{i}+2{\sum }_{i < j}{b}_{ij}{x}_{i}{x}_{j}+4{\sum }_{i < j < k}{c}_{ijk}{x}_{i}{x}_{j}{x}_{k}\right)\), where ai ∈ {0, …, 7}, bij ∈ {0, …, 3} and cijk ∈ {0, 1}. This maps directly to a circuit of T, CS and CCZ gates: each aixi term is implemented by ai T gates acting on the ith qubit, each bijxixj term corresponds to bij CS gates on qubits i and j, and each non-zero cijkxixjxk term is a CCZ gate on qubits i, j and k. Each of these terms corresponds to a non-Clifford operation only if the corresponding constant (ai, bij or cijk) is odd. Thus, two CNOT+T circuits implement the same unitary up to Clifford gates if and only if they have the same phase polynomial after reducing their coefficients modulo 2. This information about the polynomial can then be captured in a symmetric three-dimensional binary signature tensor \({\mathcal{T}}\) of size N × N × N (refs. 17,19). See Supplementary Section A.1 for more details.
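To make the mapping concrete, the sketch below builds the signature tensor from the parities of the phase-polynomial coefficients, following one common convention consistent with the description above (the function name and dictionary-based interface are illustrative; the precise definition is given in Supplementary Section A.1):

```python
import numpy as np
from itertools import permutations

def signature_tensor(N, a, b, c):
    """Symmetric N x N x N binary signature tensor from the parities of the
    phase-polynomial coefficients.

    a: {i: a_i}, b: {(i, j): b_ij} with i < j, c: {(i, j, k): c_ijk} with i < j < k.
    Only the parity of each coefficient matters.
    """
    T = np.zeros((N, N, N), dtype=np.uint8)
    for i, ai in a.items():                # a_i x_i -> the diagonal entry T[i, i, i]
        T[i, i, i] = ai % 2
    for (i, j), bij in b.items():          # b_ij x_i x_j -> every entry whose index set is {i, j}
        for idx in set(permutations((i, i, j))) | set(permutations((i, j, j))):
            T[idx] = bij % 2
    for (i, j, k), cijk in c.items():      # c_ijk x_i x_j x_k -> all permutations of (i, j, k)
        for idx in permutations((i, j, k)):
            T[idx] = cijk % 2
    return T

# The (I x T) CNOT example above has phase (pi/4)(x + y - 2xy), that is,
# a_0 = a_1 = 1 and b_01 = 3 (i.e. -1 mod 4), whose parity is 1:
T_example = signature_tensor(2, a={0: 1, 1: 1}, b={(0, 1): 1}, c={})
```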
We remark that the signature tensor captures information only about the non-Clifford components of the circuit. Thus, two circuits with the same signature tensor may implement different unitary matrices; however, it is always possible to implement one of the circuits by appending Clifford gates to the output of the other. As this work uses the T-count as the only metric of complexity and Clifford gates incur no cost, we consider two circuits with the same signature tensor to be equivalent for the purposes of T-count optimization.
System description
AlphaTensor-Quantum builds on AlphaTensor, which is in turn based on AlphaZero58. It is implemented over a distributed computing architecture composed of one learner and multiple actors that communicate asynchronously. The learner hosts a neural network and updates its parameters, while the actors play games to generate data for the learner. Whenever the learner updates the network parameters via gradient descent, it sends the updated parameters to the actors. The actors employ Monte Carlo tree search (MCTS), using queries to the network to guide the search, and play games whose actions tend to improve on the policy dictated by the neural network. Played games are sent to the learner and stored in a buffer, where they serve as data points for training the network parameters.
Neural network
The neural network at the core of AlphaTensor-Quantum takes as input the game state and outputs a policy, that is, a probability distribution over the next action to play, and a probability distribution over the estimated return (that is, the sum of rewards from the current state until the end of the game) (Extended Data Fig. 5). The neural network has a key component, the torso, where a set of attention operations59 are applied. In AlphaTensor-Quantum, the torso interleaves two types of layer: axial attention60 and symmetrization layers (Extended Data Fig. 5). Symmetrization layers are inspired by the empirical observation that, for a symmetric input \(X\in {{\mathbb{R}}}^{N\times N}\) (with X = X⊤), preserving the symmetry of the output in-between axial attention layers substantially improves performance. Therefore, after each axial attention layer, we add a symmetrization operation, defined as X ← β ⊙ X + (1 − β) ⊙ X⊤, where ‘⊙’ denotes the element-wise product. For β = 1/2, this makes the output X symmetric. In practice, we found that letting β be a learnable N × N matrix improves performance even further, even though X is no longer symmetric, due to the ability of the network to exchange information between rows and columns after each axial attention layer. (When the input is a tensor, \(X\in {{\mathbb{R}}}^{N\times N\times C}\), we apply the symmetrization operations independently across the dimensions C, using the same matrix β.) We refer to this architecture as symmetrized axial attention. See Supplementary Section C.4 for more details of the full neural network architecture.
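As an illustration, the following is a minimal NumPy sketch of the symmetrization operation only (shapes and names are illustrative; in AlphaTensor-Quantum this is a learnable layer interleaved with axial attention inside the torso):

```python
import numpy as np

def symmetrize(X, beta):
    """Symmetrization layer applied after an axial-attention layer:
    X <- beta * X + (1 - beta) * X^T, mixing each entry with its transpose.

    X:    (N, N, C) activations, channels last.
    beta: (N, N) matrix, shared across the C channels (learnable in practice).
    With beta = 1/2 the output is exactly symmetric in its first two axes.
    """
    beta = beta[..., None]                      # broadcast over channels
    X_t = np.swapaxes(X, 0, 1)                  # transpose the two "qubit" axes
    return beta * X + (1.0 - beta) * X_t

# Minimal usage sketch (sizes are illustrative, not the paper's exact ones):
N, C = 30, 8
X = np.random.randn(N, N, C)
beta = np.full((N, N), 0.5)
Y = symmetrize(X, beta)
assert np.allclose(Y, np.swapaxes(Y, 0, 1))     # symmetric when beta = 1/2
```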
Symmetrized axial attention performs better than the neural network of standard AlphaTensor (which requires attention operations across every pair of modes of the tensor)25. It is also one of the key ingredients that allow AlphaTensor-Quantum to scale up to larger tensors: for N = 30 and the same number of layers, symmetrized axial attention runs about 4 times faster and reduces the memory consumption by a factor of 3. (See also Supplementary Section D.3 for further ablations.) We believe symmetrized axial attention may be useful for other applications beyond AlphaTensor-Quantum.
Gadgetization
The Toffoli gate admits a gadget that implements it using four T gates42 instead of the seven T gates of the ancilla-free implementation. With a state-of-the-art magic-state factory12, which converts a CCZ magic state into two T states, the cost of the Toffoli gadget is equivalent to that of two T gates. Similarly, the CS gate can be built using Clifford operations and a single (CCZ) magic state, so its cost is also equivalent to that of two T gates under the same magic-state factory12.
AlphaTensor-Quantum incorporates gadgetization into TensorGame as follows. Every time an action is played, we check whether it completes a Toffoli gadget when combined with the six previous actions, a CS gadget when combined with the two previous actions, or neither. If it completes a Toffoli, we assign a positive reward of +4, so that the 7 actions jointly incur a reward of −2 (corresponding to 2 T gates), as opposed to the reward of −7 of the un-gadgetized implementation. If it completes a CS, we assign a reward of 0, so that the 3 actions jointly incur a reward of −2. See Fig. 2 for an illustration.
We can check whether a gadget was completed, up to Clifford operations, by simply finding linear dependencies between u(s) and the previous actions (see Supplementary Section C.3 for a justification):
- If s ≥ 7, we check whether the action u(s) completes a Toffoli gadget. Let a ≜ u(s−6), b ≜ u(s−5) and c ≜ u(s−4). If the vectors a, b and c are linearly independent and all the following relationships hold, then we declare that u(s) completes a Toffoli gadget:
$$\begin{array}{l}{\mathbf{u}}^{(s-3)}=\mathbf{a}+\mathbf{b},\qquad {\mathbf{u}}^{(s-2)}=\mathbf{a}+\mathbf{c},\\{\mathbf{u}}^{(s-1)}=\mathbf{a}+\mathbf{b}+\mathbf{c},\qquad {\rm{and}}\qquad {\mathbf{u}}^{(s)}=\mathbf{b}+\mathbf{c}.\end{array}$$
- If s ≥ 3, we check whether the action u(s) completes a CS gadget. Let a ≜ u(s−2) and b ≜ u(s−1). If the vectors a and b are linearly independent and u(s) = a + b, then u(s) completes a CS gadget.
We refer to the relationships above as gadgetization patterns. Each factor (action) may belong to at most one gadget; thus, if the action at step s completes a gadget of any type, we only check for a new CS gadget from step s + 3 onwards, and for a new Toffoli gadget from step s + 7 onwards.
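For concreteness, the following sketch implements the two checks over GF(2); it is illustrative only (helper names are hypothetical, and it does not track which earlier factors are already committed to a gadget, which the bookkeeping described above requires):

```python
import numpy as np

def _independent_mod2(vectors):
    """Check linear independence over GF(2) by Gaussian elimination."""
    rows = np.array(vectors, dtype=np.uint8) % 2
    rank = 0
    for col in range(rows.shape[1]):
        pivot = next((r for r in range(rank, rows.shape[0]) if rows[r, col]), None)
        if pivot is None:
            continue
        rows[[rank, pivot]] = rows[[pivot, rank]]
        for r in range(rows.shape[0]):
            if r != rank and rows[r, col]:
                rows[r] ^= rows[rank]
        rank += 1
    return rank == len(vectors)

def completed_gadget(actions):
    """Return 'toffoli', 'cs' or None depending on whether the last played
    action matches one of the gadgetization patterns above.

    actions: list of binary factors u(1), ..., u(s), each a length-N 0/1 array.
    """
    s = len(actions)
    u = lambda k: np.asarray(actions[k - 1], dtype=np.uint8)  # 1-indexed, as u(s) in the text
    if s >= 7:
        a, b, c = u(s - 6), u(s - 5), u(s - 4)
        if _independent_mod2([a, b, c]) and \
           np.array_equal(u(s - 3), (a + b) % 2) and \
           np.array_equal(u(s - 2), (a + c) % 2) and \
           np.array_equal(u(s - 1), (a + b + c) % 2) and \
           np.array_equal(u(s), (b + c) % 2):
            return 'toffoli'
    if s >= 3:
        a, b = u(s - 2), u(s - 1)
        if _independent_mod2([a, b]) and np.array_equal(u(s), (a + b) % 2):
            return 'cs'
    return None
```

Under the reward scheme described above, the completing action would then receive +4 if this check returns 'toffoli', 0 if it returns 'cs' and −1 otherwise.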
Synthetic demonstrations
Like its predecessor, AlphaTensor-Quantum uses a dataset of randomly generated tensor/factorization pairs to train the neural network. Specifically, the network is trained on a mixture of a supervised loss (which encourages it to imitate the synthetic demonstrations) and a standard RL loss (which encourages it to learn to decompose the target quantum signature tensor). For each experiment, we generate 5 million synthetic demonstrations. To generate each demonstration, we first randomly sample the number of factors R from a uniform distribution in [1, 125] and then sample R binary factors u(r), such that each element has a probability of 0.75 of being set to 0.
After having generated the factors, we randomly overwrite some of them to incorporate the patterns of the Toffoli and CS gadgets, to help the network learn to identify those patterns. For each demonstration, we include at least one gadget with probability 0.9. The number of gadgets is uniformly sampled in [1, 15] (truncated when R is not large enough to accommodate them), and for each gadget we sample the starting index r and the gadget type (Toffoli with probability 0.6 and CS with probability 0.4), overwriting the necessary factors according to the gadget’s pattern. We ensure that gadgets do not overlap within a demonstration.
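A simplified sketch of this generation procedure is given below (parameter names are illustrative; linear independence of the leading factors is not enforced, and gadgets that would overlap or run past the end are simply skipped rather than resampled):

```python
import numpy as np

rng = np.random.default_rng(0)

def synthetic_demonstration(N, max_rank=125, p_zero=0.75,
                            p_has_gadget=0.9, max_gadgets=15):
    """Generate one synthetic list of binary factors following the recipe above."""
    R = rng.integers(1, max_rank + 1)                    # number of factors, uniform in [1, 125]
    factors = (rng.random((R, N)) > p_zero).astype(np.uint8)   # each entry is 0 with probability 0.75

    if rng.random() < p_has_gadget:
        n_gadgets = rng.integers(1, max_gadgets + 1)
        used = np.zeros(R, dtype=bool)                   # factors already inside a gadget
        for _ in range(n_gadgets):
            is_toffoli = rng.random() < 0.6              # otherwise a CS gadget
            length = 7 if is_toffoli else 3
            r = rng.integers(0, R)                       # starting index
            if r + length > R or used[r:r + length].any():
                continue                                 # skip overlapping / out-of-range gadgets
            if is_toffoli:
                a, b, c = factors[r], factors[r + 1], factors[r + 2]
                factors[r + 3] = (a + b) % 2
                factors[r + 4] = (a + c) % 2
                factors[r + 5] = (a + b + c) % 2
                factors[r + 6] = (b + c) % 2
            else:
                factors[r + 2] = (factors[r] + factors[r + 1]) % 2
            used[r:r + length] = True
    return factors
```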
Change of basis
To increase the diversity of the played games, the actors apply a random change of basis to the target tensor \({\mathcal{T}}\) at the beginning of each game. From the quantum computation point of view, a change of basis can be implemented by a Clifford-only circuit, and therefore this procedure does not affect the resulting number of T gates.
A change of basis is represented by a matrix M ∈ {0, 1}N×N with det(M) = 1 (mod 2). We randomly generate 50,000 such basis changes and use them throughout the experiment; that is, the actor first randomly picks a matrix M and then applies the following transformation to the target tensor \({\mathcal{T}}\):
$${\widetilde{{\mathcal{T}}}}_{ijk}=\mathop{\sum }\limits_{a=1}^{N}\mathop{\sum }\limits_{b=1}^{N}\mathop{\sum }\limits_{c=1}^{N}{M}_{ia}{M}_{jb}{M}_{kc}{{\mathcal{T}}}_{abc}.$$
The actor then plays the game in that basis (that is, it aims at finding a decomposition of \(\widetilde{{\mathcal{T}}}\)). After the game has finished, we can map the played actions back to the standard basis by applying the inverse change of basis.
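A sketch of the basis-change step, assuming the tensor and the matrix M are stored as 0/1 NumPy arrays (the rejection-based sampling of M is illustrative):

```python
import numpy as np

def invertible_mod2(M):
    """Check det(M) = 1 (mod 2), i.e. invertibility over GF(2), by Gaussian elimination."""
    A = (M % 2).astype(np.uint8)
    n = A.shape[0]
    for col in range(n):
        pivot = next((r for r in range(col, n) if A[r, col]), None)
        if pivot is None:
            return False
        A[[col, pivot]] = A[[pivot, col]]
        for r in range(n):
            if r != col and A[r, col]:
                A[r] ^= A[col]
    return True

def random_change_of_basis(N, rng):
    """Sample a random invertible matrix over GF(2) by rejection."""
    while True:
        M = rng.integers(0, 2, size=(N, N), dtype=np.uint8)
        if invertible_mod2(M):
            return M

def apply_change_of_basis(T, M):
    """T~_ijk = sum_{a,b,c} M_ia M_jb M_kc T_abc (mod 2)."""
    M, T = M.astype(int), T.astype(int)
    return (np.einsum('ia,jb,kc,abc->ijk', M, M, M, T) % 2).astype(np.uint8)
```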
The diversity injected by the change of basis facilitates the agent’s task because it suffices to find a low-rank decomposition in any basis. Note that gadgets are not affected by the change of basis, as their patterns correspond to linear relationships that are preserved under linear transformations.
Data augmentation
For each game played by the actors, we obtain a new game by swapping two of its actions. Specifically, we swap the last action that is not part of a gadget with a randomly chosen action that is also not part of a gadget. Mathematically, swapping actions is valid because the rank-one terms of a decomposition can be summed in any order. From the quantum computation point of view, it is valid because the considered unitary matrices are diagonal and therefore commute.
Besides the action permutation, we also permute the inputs at acting time every time we query the neural network. In particular, we apply a random permutation to the input tensor and past actions that represent the current state, and then invert this permutation at the output of the network’s policy head. Similarly, at training time, we apply a random permutation to both the input and the policy targets, and train the neural network on this transformed version of the data. In practice, we sample 100 permutations at the beginning of an experiment and use them thereafter.
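A sketch of both augmentations, under the assumption that the state is stored as a tensor plus a list of past binary factors (names are illustrative):

```python
import numpy as np

def swap_augmentation(actions, in_gadget, rng):
    """Create one augmented game by swapping the last non-gadget action with a
    randomly chosen other non-gadget action.

    actions:   list of played factors.
    in_gadget: boolean list, True if the corresponding action belongs to a gadget.
    """
    free = [i for i, g in enumerate(in_gadget) if not g]
    if len(free) < 2:
        return list(actions)
    last = free[-1]
    other = free[rng.integers(0, len(free) - 1)]    # any non-gadget action except the last
    augmented = list(actions)
    augmented[last], augmented[other] = augmented[other], augmented[last]
    return augmented

def permute_state(T, actions, perm):
    """Relabel qubits by a permutation `perm`, applied to the tensor and to the
    past actions; the inverse permutation is applied to the policy output."""
    return T[np.ix_(perm, perm, perm)], [np.asarray(u)[perm] for u in actions]
```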
Sample-based MCTS
In AlphaTensor-Quantum, the action space is huge, and therefore it is not possible to enumerate all the possible actions from a given state. Instead, AlphaTensor-Quantum relies on sample-based MCTS, similarly to sampled AlphaZero61.
Before committing to an action, the agent uses a search tree to explore multiple actions to play. Each node in the tree represents a state, and each edge represents an action. For each state–action pair (s, a), we store a set of statistics: the visit count N(s, a), the action value Q(s, a) and the empirical policy probability \(\widehat{\pi }(s,a)\). The search consists of multiple simulations; each simulation traverses the tree from the root state s0 until a leaf state sL is reached, by recursively selecting in each state s an action a that has high empirical policy probability and high value but has not been frequently explored, that is, we choose a according to
$$\arg \mathop{\max }\limits_{a}Q(s,a)+c(s)\widehat{\uppi }(s,a)\frac{\sqrt{{\sum }_{b}N(s,b)}}{1+N(s,a)},$$
where c(s) is an exploration factor. The MCTS simulation keeps choosing actions until a leaf state sL is reached; when that happens, the network is queried on sL—obtaining the policy probability π(sL, ⋅) and the return probability z(sL). We sample K actions ak from the policy to initialize \(\widehat{\uppi }({s}_{{\mathrm{L}}},a)=\frac{1}{K}{\sum }_{k}{\delta }_{a,{a}_{k}}\), and initialize Q(sL, a) from z(sL) using a risk-seeking value25. After the node sL has been expanded, we update the visit counts and values on the simulated trajectory in a backwards pass61. We simulate 800 trajectories using MCTS; after that, we use the visit counts of the actions at the root of the search tree to form a sample-based improved policy, following ref. 25, and commit to an action sampled from that improved policy.
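For illustration, the selection rule alone can be written as the following sketch (arrays are indexed by the K sampled actions at the node; expansion, backup and the risk-seeking value initialization are omitted):

```python
import numpy as np

def select_action(Q, N_visits, pi_hat, c_explore):
    """Pick the action maximizing
    Q(s, a) + c(s) * pi_hat(s, a) * sqrt(sum_b N(s, b)) / (1 + N(s, a)).

    Q, N_visits, pi_hat: arrays over the sampled actions at this node.
    c_explore: the exploration factor c(s).
    """
    total_visits = N_visits.sum()
    scores = Q + c_explore * pi_hat * np.sqrt(total_visits) / (1.0 + N_visits)
    return int(np.argmax(scores))
```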
Training regime
We train AlphaTensor-Quantum separately for each considered application (that is, one agent for the benchmark circuits, another one for multiplication in finite fields, another one for all the circuits related to binary addition and so on). This approach allows the actors to repeatedly play enough games starting from the tensors of interest (becoming better and better at decomposing those specific targets), while at the same time being able to identify and exploit possible decomposition patterns shared across the circuits within the same application. We find that the experiment variance is very low: a single experiment is enough for AlphaTensor-Quantum to find the best constructions for each application.
We train AlphaTensor-Quantum on a tensor processing unit (TPU) with 256 cores, with a total batch size of 2,048, training for about 1 million iterations (the number of training iterations varies depending on the application, ranging between 80,000 and 3 million iterations). We generally use 3,600 actors per experiment, each running on a standalone TPU (experiments on the benchmark circuits were run with twice as many actors).
The computational cost of running an experiment depends on multiple factors. The main ones are: the number of qubits N, the episode length (that is, the maximum allowed number of steps in TensorGame), the circuit complexity (that is, how many steps it takes to optimally decompose the tensor) and the architecture parameters (for example, the number of layers of the torso). Theoretically, the cost increases roughly cubically with N, linearly with either the episode length (at the beginning of the training algorithm) or the circuit complexity (after a number of training iterations), and linearly with the number of layers of the torso. As an example, the training algorithm for the most costly experiment runs at about 2 steps per second (corresponding to the experiment on unary iteration, with N = 72, episode length of 250 and 4 layers).
Other versions of AlphaTensor-Quantum
We found empirically that the GF(\(2^m\))-mult circuits were particularly challenging to optimize, and they required additional training iterations. To reduce the size of the search space and accelerate the training procedure, we developed a variant of AlphaTensor-Quantum that favours Toffoli gates. Specifically, every time that three consecutive and linearly independent actions (factors) are played, the environment automatically applies the remaining four actions that would complete the Toffoli gadget. We applied this version of AlphaTensor-Quantum to optimize these targets only.
Verification
To verify that the results of AlphaTensor-Quantum are correct, we used the ‘feynver’ tool developed in ref. 44 after each step of the compilation process discussed in Supplementary Section C.1. It uses the sum-over-paths formalism62 to produce a proof of equality and can be applied to any circuit. However, the method is sound but not complete, so for some pairs of circuits it may neither prove nor disprove equality. As the tool runs slowly for larger circuits, we ran it wherever practical (limiting each run to several hours). Each circuit was verified assuming that any intermediate measurements introduced by Hadamard gadgetization always return zero, because our software pipeline does not generate the classically controlled correction circuits (these can be easily constructed but are irrelevant to AlphaTensor-Quantum).
This allowed us to verify inputs up to the point where a tensor is obtained. To verify that the decompositions constructed by AlphaTensor-Quantum were correct, we computed the corresponding tensor and checked that it was equal to the original input tensor. While we did not use the optimized decompositions to produce circuits, they could be verified in the same way as discussed above.
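This check amounts to re-accumulating the rank-one symmetric terms of the decomposition modulo 2 and comparing the result with the target, assuming, as in TensorGame, that each action u(r) contributes a term u(r) ⊗ u(r) ⊗ u(r). A minimal sketch:

```python
import numpy as np

def tensor_from_factors(factors, N):
    """Reconstruct T_ijk = sum_r u_r,i u_r,j u_r,k (mod 2) from a list of binary factors."""
    T = np.zeros((N, N, N), dtype=np.uint8)
    for u in factors:
        u = np.asarray(u, dtype=np.uint8)
        T ^= np.einsum('i,j,k->ijk', u, u, u)   # XOR accumulates the sum modulo 2
    return T

def decomposition_is_correct(factors, target):
    """Compare the reconstructed tensor with the original input tensor."""
    return np.array_equal(tensor_from_factors(factors, target.shape[0]), target)
```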