
Paper deep dive

Opening the AI black box: program synthesis via mechanistic interpretability

Eric J. Michaud, Isaac Liao, Vedang Lad, Ziming Liu, Anish Mudide, Chloe Loughridge, Zifan Carl Guo, Tara Rezaei Kheirkhah, Mateja Vukelic, Max Tegmark

Year: 2024 | Venue: arXiv preprint | Area: Mechanistic Interp. | Type: Empirical | Embeddings: 78

Models: GPT-4 Turbo, RNNs (custom minimal architectures)

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/12/2026, 7:29:20 PM

Summary

MIPS (Mechanistic-Interpretability-based Program Synthesis) is a novel, fully automated method that distills learned algorithms from neural networks (specifically RNNs) into executable Python code. By using integer autoencoders and symbolic regression on hidden state representations, MIPS converts neural networks into finite state machines, offering a complementary approach to LLMs for program synthesis without relying on human-generated training data.

Entities (5)

GPT-4 · large-language-model · 100%
MIPS · method · 100%
RNN · neural-network-architecture · 100%
Mechanistic Interpretability · research-field · 95%
Symbolic Regression · technique · 95%

Relation Signals (4)

MIPS applies Symbolic Regression

confidence 100% · applies Boolean or integer symbolic regression to capture the learned algorithm

MIPS performs Program Synthesis

confidence 100% · MIPS, a novel method for program synthesis

MIPS uses RNN

confidence 100% · MIPS uses an integer autoencoder to convert the RNN into a finite state machine

MIPS complements GPT-4

confidence 90% · find it highly complementary to GPT-4

Cypher Suggestions (2)

Find all methods related to program synthesis · confidence 90% · unvalidated

MATCH (m:Method)-[:PERFORMS]->(t:Task {name: 'Program Synthesis'}) RETURN m

Identify dependencies between techniques and architectures · confidence 85% · unvalidated

MATCH (m:Method)-[:USES]->(a:Architecture) RETURN m.name, a.name

Abstract

We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.

Tags

ai-safety (imported, 100%)
empirical (suggested, 88%)
interpretability (suggested, 80%)
mechanistic-interp (suggested, 92%)

Links


Full Text

78,121 characters extracted from source content.


Opening the AI black box: program synthesis via mechanistic interpretability

Eric J. Michaud* 1,2, Isaac Liao* 1, Vedang Lad* 1, Ziming Liu* 1,2, Anish Mudide 1, Chloe Loughridge 3, Zifan Carl Guo 1, Tara Rezaei Kheirkhah 1, Mateja Vukelić 1, Max Tegmark 1,2

Abstract

We present MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. We test MIPS on a benchmark of 62 algorithmic tasks that can be learned by an RNN and find it highly complementary to GPT-4: MIPS solves 32 of them, including 13 that are not solved by GPT-4 (which also solves 30). MIPS uses an integer autoencoder to convert the RNN into a finite state machine, then applies Boolean or integer symbolic regression to capture the learned algorithm. As opposed to large language models, this program synthesis technique makes no use of (and is therefore not limited by) human training data such as algorithms and code from GitHub. We discuss opportunities and challenges for scaling up this approach to make machine-learned models more interpretable and trustworthy.

1. Introduction

Machine-learned algorithms now outperform traditional human-discovered algorithms on many tasks, from translation to general-purpose reasoning. These learned algorithms tend to be black-box neural networks, and we typically lack a full understanding of how they work. This has motivated the growing field of mechanistic interpretability, aiming to assess and improve their trustworthiness. Major progress has been made in interpreting and understanding smaller models, but this work has involved human effort, which raises questions about whether it can scale to larger models.
This makes it timely to investigate whether mechanistic interpretability can be fully automated (Tegmark & Omohundro, 2023).

(* Equal contribution, alphabetical order. 1 Massachusetts Institute of Technology, Cambridge, MA, USA; 2 Institute for Artificial Intelligence and Fundamental Interactions; 3 Harvard University, Cambridge, MA, USA. Correspondence to: Max Tegmark <tegmark@mit.edu>. Preprint. Under review.)

The goal of the present paper is to take a modest first step in this direction by presenting and testing MIPS (Mechanistic-Interpretability-based Program Synthesis), a fully automated method that can distill simple learned algorithms from neural networks into Python code. The rest of this paper is organized as follows. After reviewing prior work in Section 2, we present our method in Section 3, test it on a benchmark in Section 4, and summarize our conclusions in Section 5.

2. Related Work

Program synthesis is a venerable field dating back to Alonzo Church in 1957; Zhou & Ding (2023) and Odena et al. (2020) provide recent reviews of the field. Large language models (LLMs) have become increasingly good at writing code based on verbal problem descriptions or auto-complete. We instead study the common alternative problem setting known as "programming by example" (PBE), where the desired program is specified by giving examples of input-output pairs (Wu et al., 2023). The aforementioned papers review a wide variety of program synthesis methods, many of which involve some form of search over a space of possible programs. LLMs that synthesize code directly have recently become quite competitive with such search-based approaches (Sobania et al., 2022). Our work provides an alternative search-free approach where the program learning happens during neural network training rather than execution.
Our work builds on the recent progress in mechanistic interpretability (MI) of neural networks (Olah et al., 2020; Cammarata et al., 2020; Wang et al., 2022; Olsson et al., 2022). Much MI work has tried to understand how neural networks represent various types of information, e.g., geographic information (Goh et al., 2021; Gurnee & Tegmark, 2023), truth (Burns et al., 2022; Marks & Tegmark, 2023) and the state of board games (McGrath et al., 2022; Toshniwal et al., 2022; Li et al., 2022). Another major MI thrust has been to understand how neural networks perform algorithmic tasks, e.g., modular arithmetic (Nanda et al., 2023; Liu et al., 2022; Zhong et al., 2023; Quirke et al., 2023), greater-than (Hanna et al., 2023), and greatest-common-divisor (Charton, 2023). Whereas Lindner et al. (2023) automatically convert traditional code into a neural network, we aim to do the opposite. Other recent efforts to automate MI include identifying a sparse subgraph of the network whose units are causally relevant to a behavior of interest (Conmy et al., 2023; Syed et al., 2023) and using LLMs to label internal components of neural networks, for instance neurons (Bills et al., 2023) and features discovered by sparse autoencoders (Cunningham et al., 2023; Bricken et al., 2023).

arXiv:2402.05110v1 [cs.LG] 7 Feb 2024

3. MIPS, our program synthesis algorithm

Figure 1. The pipeline of our program synthesis method. MIPS relies on discovering integer representations and bit representations of hidden states, which enable regression methods to figure out the exact symbolic relations between input, hidden, and output states.

As summarized in Figure 1, MIPS involves the following key steps:

1. Neural network training
2. Neural network simplification
3. Finite state machine extraction
4. Symbolic regression

Step 1 is to train a black-box neural network to learn an algorithm that performs the desired task.
In this paper, we use a recurrent neural network (RNN) of the general form

h_i = f(h_{i-1}, x_i),   (1)
y_i = g(h_i),   (2)

that maps input vectors x_i into output vectors y_i via hidden states h_i. The RNN is defined by the two functions f and g, which are implemented as feed-forward neural networks (MLPs) to allow more model expressivity than for a vanilla RNN. The techniques described below can also be applied to more general neural network architectures.

Step 2 attempts to automatically simplify the learned neural network without reducing its accuracy. Steps 3 and 4 automatically distill this simplified learned algorithm into Python code. When the training data is discrete (consisting of, say, text tokens, integers, or pixel colors), the neural network will be a finite state machine: the activation vectors for each of its neuron layers define finite sets, and the entire working of the network can be defined by look-up tables specifying the update rules for each layer. For our RNN, this means that the space of hidden states h is discrete, so that the functions f and g can be defined by lookup tables. As we will see below, the number of hidden states that MIPS needs to keep track of can often be greatly reduced by clustering them, corresponding to learned representations. After this, the geometry of the cluster centers in the hidden space often reveals that they form either an incomplete multidimensional lattice whose points represent integer tuples, or a set whose cardinality is a power of two, whose points represent Boolean tuples. In both of these cases, the aforementioned lookup tables simply specify integer or Boolean functions, which MIPS attempts to discover via symbolic regression. Below we present an integer autoencoder and a Boolean autoencoder to discover such integer/Boolean representations from arbitrary point sets. We will now describe each of the three steps of MIPS in greater detail.
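The recurrence in Eqs. (1) and (2) amounts to a simple unrolling loop. Here is a minimal plain-Python sketch; `run_rnn` and the toy `f`, `g` are hypothetical stand-ins for the paper's learned MLPs:

```python
def run_rnn(f, g, xs, h0):
    """Unroll h_i = f(h_{i-1}, x_i), y_i = g(h_i) over an input sequence."""
    h, ys = h0, []
    for x in xs:
        h = f(h, x)      # Eq. (1): new hidden state from old state and input
        ys.append(g(h))  # Eq. (2): output read off the hidden state
    return ys

# Toy instance mimicking the benchmark's SumAll task (cumulative sum):
# hidden state = running total, output = that total. Hand-written, not learned.
ys = run_rnn(lambda h, x: h + x, lambda h: h, [1, 2, 3, 4], 0)
```

With `h0 = 0` and inputs `[1, 2, 3, 4]`, this emits the running sums `[1, 3, 6, 10]`.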
3.1. AutoML optimizing for simplicity

We wish to find the simplest RNN that can learn our task, to facilitate subsequent discovery of the algorithm that it has learned. We therefore implement an AutoML-style neural architecture search that tries to minimize network size while achieving perfect test accuracy. The search space is defined by a vector p of five main architecture hyperparameters: the five integers p = (n, w_f, d_f, w_g, d_g), corresponding to the dimensionality of the hidden state h, the width and depth of the f-network, and the width and depth of the g-network, respectively. Both the f- and g-networks have a linear final layer and ReLU activation functions for all previous layers. The hidden state h_0 is initialized to zero.

To define the parameter search space, we define ranges for each parameter. For all tasks, we use n ∈ {1, 2, ..., 128}, w_f ∈ {1, 2, ..., 256}, d_f ∈ {1, 2, 3}, w_g ∈ {1, 2, ..., 256} and d_g ∈ {1, 2, 3}, so the total search space consists of 128 × 256 × 3 × 256 × 3 = 75,497,472 hyperparameter vectors p_i. We order this search space by imposing a strict ordering on the importance of minimizing each hyperparameter: lower d_g is strictly more important than lower d_f, which is strictly more important than lower n, which is strictly more important than lower w_g, which is strictly more important than lower w_f. We aim to find the hyperparameter vector (integer 5-tuple) p_i in the search space which has lowest rank i under this ordering.

We search the space in the following simple manner. We first start at index i = 65,536, which corresponds to parameters (1, 1, 2, 1, 1). For each parameter tuple, we train networks using 5 different seeds. We use the loss function ℓ(x, y) = (1/2) log[1 + (x − y)^2], finding that it led to more stable training than vanilla MSE loss. We train for either 10,000 or 20,000 steps, depending on the task, using the Adam optimizer, a learning rate of 10^-3, and batch size 4096.
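The strict importance ordering above is equivalent to sorting the tuples p lexicographically with the keys reordered as (d_g, d_f, n, w_g, w_f). A small sketch with hypothetical helper names and toy ranges far smaller than the paper's:

```python
from itertools import product

def rank_key(p):
    # p = (n, w_f, d_f, w_g, d_g); return sort keys most-important first,
    # matching the paper's ordering: d_g, then d_f, then n, then w_g, then w_f.
    n, w_f, d_f, w_g, d_g = p
    return (d_g, d_f, n, w_g, w_f)

# Toy search space (the paper uses n up to 128, widths up to 256, depths up to 3):
space = product(range(1, 4), range(1, 4), range(1, 3), range(1, 4), range(1, 3))
ordered = sorted(space, key=rank_key)
```

Under this key a wide, shallow network outranks any deeper one, e.g. `rank_key((3, 3, 1, 3, 1)) < rank_key((1, 1, 2, 1, 1))`, reflecting the priority on minimizing depth before width.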
The test accuracy is evaluated with a batch of 65,536 samples. If no networks achieve 100% test accuracy (on any test batch), we increase i by a factor of 2^{1/4}. We proceed in this manner until we find a network where one of the seeds achieves perfect test accuracy, or until the full range is exhausted. If we find a working network on this upwards sweep, we then perform a binary search using the interval-halving method, starting from the successful i, to find the lowest i where at least one seed achieves perfect test accuracy.

3.2. Auto-simplification

After finding a minimal neural network architecture that can solve a task, the resulting neural network weights typically seem random and un-interpretable. This is because there exist symmetry transformations of the weights that leave the overall input-output behavior of the neural network unchanged. The random initialization of the network has therefore caused random symmetry transformations to be applied to the weights. In other words, the learned network belongs to an equivalence class of neural networks with identical behavior and performance, corresponding to a submanifold of the parameter space. We exploit these symmetry transformations to simplify the neural network into a normal form, which in a sense is the simplest member of its equivalence class. Conversion of objects into a normal/standard form is a common concept in mathematics and physics (for example, conjunctive normal form, wavefunction normalization, reduced row echelon form, and gauge fixing).

Two of our simplification strategies below exploit a symmetry of the RNN hidden state space h. We can always write the MLP g in the form g(h) = G(Uh + c) for some function G.
So if f is affine, i.e., of the form f(h, x) = Wh + Vx + b, then the symmetry transformation

W' ≡ AWA^{-1}, V' = AV, U' = UA^{-1}, h' ≡ Ah, b' = Ab

keeps the RNN in the same form:

h'_i = Ah_i = AWA^{-1}Ah_{i-1} + AVx_i + Ab = W'h'_{i-1} + V'x_i + b',   (3)
y_i = G(Uh_i + c) = G(UA^{-1}h'_i + c) = G(U'h'_i + c).   (4)

We think of neural networks as nails, which can be hit by various auto-normalization hammers. Each hammer is an algorithm that applies transformations to the weights to remove degrees of freedom caused by extra symmetries, or cleans the neural network up in some other way. In this section, we describe five normalizers we use to simplify our trained networks, termed "Whitening", "Jordan normal form", "Toeplitz", "De-bias", and "Quantization". For every neural network, we always apply this sequence of normalizers in that specific order, for consistency. We describe them below and provide additional details in Appendix D.

1. Whitening: Just as we normalize input data to use for training neural networks, we would like activations in the hidden state space h_i to be normalized. To ensure normalization in all directions, we feed the training dataset into the RNN, collect all the hidden states, compute the uncentered covariance matrix C, and then apply a whitening transform h ↦ C^{-1/2}h to the hidden state space so that its new covariance becomes the identity matrix. This operation exists purely to provide better numerical stability to the next step.

2. Jordan normal form: When the function g is affine, we can apply the aforementioned symmetry transformation to try to diagonalize W, so that none of the hidden state dimensions interact with one another. Unfortunately, not all matrices W can be diagonalized, so we use a generalized alternative: the Jordan normal form, which allows elements of the superdiagonal to be either zero or one.
To eliminate complex numbers, we also apply 2×2 unitary transformations to eigenvectors corresponding to conjugate pairs of complex eigenvalues afterward. The aforementioned whitening is now ruined, but it helped make the Jordan normal form calculation more numerically stable.

3. Toeplitz: Once W is in Jordan normal form, we divide it up into Jordan blocks. There is now an additional symmetry, corresponding to multiplying each Jordan block by an upper triangular Toeplitz matrix, and we exploit the Toeplitz matrix that maximally simplifies the aforementioned V-matrix.

4. De-bias: Sometimes W is not full rank, and b has a component in the direction of the nullspace. In this case, the component can be removed, and the bias c can be adjusted to compensate.

5. Quantization: After applying all the previous normalizers, many of the weights may have become close to integers, but not exactly, due to machine precision and training imperfections. Sometimes, depending on the task, all of the weights can become integers. We therefore round any weights that are within ε ≡ 0.01 of an integer to that integer.

Figure 2. These hidden structures can be turned into discrete representations. Left: the hidden states for the bitstring addition task are seen to form four clusters, corresponding to 2 bits: the output bit and the carry bit. Right: the hidden states for the SumLast2 task are seen to form clusters on a 2D lattice corresponding to two integers.
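The change-of-basis symmetry in Eqs. (3) and (4), which the normalizers above exploit, can be checked numerically. A self-contained sketch with hypothetical helper functions, using 2×2 matrices in plain Python:

```python
# Check that transforming (W, V, b, h) by an invertible A leaves the affine
# update f(h, x) = Wh + Vx + b unchanged when expressed in the new basis h' = Ah.

def matvec(M, v):
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def matmul(M, N):
    return [[sum(M[i][k] * N[k][j] for k in range(len(N)))
             for j in range(len(N[0]))] for i in range(len(M))]

def inv2(M):  # inverse of a 2x2 matrix
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def add(u, v):
    return [p + q for p, q in zip(u, v)]

# Arbitrary small example weights and an invertible change of basis A:
W = [[1.0, 2.0], [0.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0]]
b = [0.5, -0.5]
A = [[2.0, 1.0], [1.0, 1.0]]
h, x = [0.3, -0.7], [1.0, 2.0]

# Left side of Eq. (3): original update, then mapped into the new basis.
lhs = matvec(A, add(add(matvec(W, h), matvec(V, x)), b))

# Right side: transformed weights W', V', b' acting directly on h' = Ah.
Wp = matmul(matmul(A, W), inv2(A))
Vp = matmul(A, V)
bp = matvec(A, b)
rhs = add(add(matvec(Wp, matvec(A, h)), matvec(Vp, x)), bp)
```

Both sides agree up to floating-point roundoff, confirming that the transformed network computes the same function in the new coordinates.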
3.3. Boolean and integer autoencoders

As mentioned, our goal is to convert a trained recurrent neural network (RNN) into a maximally simple (Python) program that produces equivalent input-output behavior. This means that if the RNN has 100% accuracy for a given dataset, so should the program, with the added benefit of being more interpretable, precise, and verifiable.

Once trained/written, the greatest difference between a neural network and a program implementing the same finite state machine is that the former is fuzzy and continuous, while the latter is precise and discrete. To convert a neural network to a program, some discretization ("defuzzification") process is needed to extract precise information from seemingly noisy representations. Fortunately, mechanistic interpretability research has shown that neural networks tend to learn meaningful, structured knowledge representations for algorithmic tasks (Liu et al., 2022; Nanda et al., 2023). Previous interpretability efforts typically involved case-by-case manual inspection, and only gained algorithmic understanding at the level of pseudocode at best. We tackle this more ambitious question: can we create an automated method that distills the learned representation and associated algorithms into an equivalent (Python) program?

Since the tasks in our benchmark involve bits and integers, which are already discrete, the only non-discrete parts in a recurrent neural network are its hidden representations. Here we show two cases where hidden states can be discretized: they are (1) a bit representation or (2) a (typically incomplete) integer lattice. Generalizing to the mixed case of bits and integers is straightforward. Figure 2 shows all hidden state activation vectors h_i for all steps with all training examples for two of our tasks. The left panel shows that the 10^4 points h_i form 2^2 = 4 tight clusters, which we interpret as representing 2 bits.
The right panel reveals that the points h_i form an incomplete 2D lattice that we interpret as secretly representing a pair of integers.

Bit representations

The hidden states for the 2 bits in Figure 2 are seen to form a parallelogram. More generally, we find that hidden states encode b bits as 2^b clusters, which in some cases form b-dimensional parallelograms and in other cases look more random. Our algorithm tries all (2^b)! possible assignments of the 2^b clusters to bitstrings of length b and selects the assignment that minimizes the length of the resulting Python program.

Integer lattice

As seen in Figure 2, the learned representation of an integer lattice tends to be both non-square (deformed by a random affine transformation) and sparse (since not all integer tuples occur during training). We thus face the following problem: given (possibly sparse) samples of points h_i from an n-dimensional lattice, how can we reconstruct the integer lattice, in the sense that we figure out which integer tuple each lattice point represents? We call the solution an integer autoencoder, since it compresses any point set into a set of integer tuples k_i from which the original points can be at least approximately recovered as h_i = Ak_i + b, where the matrix A and vector b define the affine transformation. In Appendix A, we present a solution that we call the GCD lattice finder. For the special case n = 1, its core idea is to compute the greatest common divisor of pairwise separations: for example, for the points 1.7, 3.2, 6.2, 7.7, ..., all point separations are divisible by A = 1.5, from which one infers that b = 0.2 and the lattice can be rewritten as 1.5 × {1, 2, 4, 5} + 0.2. For multidimensional lattices, our algorithm uses the GCD of ratios of generalized cell volumes to infer the directions and lengths of the lattice vectors that form the columns of A.
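The 1-D case of the GCD lattice finder can be sketched in a few lines. This is a hypothetical toy (`lattice_1d` and the grid-snapping `resolution` are assumptions for illustration); the paper's actual algorithm in Appendix A is more robust to noise:

```python
import math

def lattice_1d(points, resolution=1000):
    """Recover spacing A and offset b with each point ~ A*k + b, k integer."""
    pts = sorted(points)
    # Snap separations to a fine grid so that integer gcd applies to floats.
    seps = [round((q - p) * resolution) for p, q in zip(pts, pts[1:])]
    g = 0
    for s in seps:
        g = math.gcd(g, s)
    A = g / resolution
    b = pts[0] % A  # offset, assuming points sit exactly on the lattice
    ks = [round((p - b) / A) for p in pts]
    return A, b, ks

# The paper's worked example: separations of 1.7, 3.2, 6.2, 7.7 share GCD 1.5.
A, b, ks = lattice_1d([1.7, 3.2, 6.2, 7.7])
```

For the example points this recovers A = 1.5, b = 0.2, and the integer labels [1, 2, 4, 5], matching the decomposition 1.5 × {1, 2, 4, 5} + 0.2 given above.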
For the special case where the MLP defining the function f is affine or can be accurately approximated as affine, we use a simpler method we term the Linear lattice finder, also described in Appendix B. Here the idea is to exploit that the lattice is simply an affine transformation of a regular integer lattice (the input data), so we can simply "read off" the desired lattice basis vectors from this affine transformation.

Symbolic regression

Once the hidden states h_i have been successfully mapped to Boolean or integer tuples as described above, the functions f and g that specify the learned RNN can be re-expressed as lookup tables, showing their Boolean/integer output tuple for each Boolean/integer input tuple. All that remains is now symbolic regression, i.e., discovering the simplest possible symbolic formulae that define f and g.

Boolean regression: In the case where a function maps bits to a bit, our algorithm determines the following set of correct Boolean formulae and then returns the shortest one. The first candidate formula is the function written in disjunctive normal form, which is always possible. If the Boolean function is symmetric, i.e., invariant under all permutations of its arguments, then we also write it as an integer function of its bit sum.

Integer regression: In the case where a function maps integers to an integer, we try the following two methods:

1. If the function is linear, then we perform simple linear regression, round the resulting coefficients to integers, and simplify, e.g., multiplications by 0 and 1.

2. Otherwise, we use the brute-force symbolic solver from AI Feynman (Udrescu et al., 2020), including the 6 unary operations >, <, ∼, H, D, A and 4 binary operations +, −, ∗, % whose meanings are explained in Appendix C, then convert the simplest discovered formula into Python format.
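The disjunctive-normal-form candidate used in Boolean regression is mechanical: one AND-clause per true row of the lookup table, OR-ed together. A toy sketch (the `dnf_from_table` helper is hypothetical; the paper's generator also handles multi-bit outputs and compares program lengths):

```python
from itertools import product

def dnf_from_table(table, names):
    """Emit a Python boolean expression in DNF for a bit-valued lookup table."""
    clauses = []
    for bits, out in table.items():
        if out:  # one conjunction per input combination that maps to 1
            lits = [n if v else f"(not {n})" for n, v in zip(names, bits)]
            clauses.append("(" + " and ".join(lits) + ")")
    return " or ".join(clauses) if clauses else "False"

# Lookup table for XOR of two bits, as might be extracted from a trained RNN:
xor_table = {bits: bits[0] ^ bits[1] for bits in product((0, 1), repeat=2)}
expr = dnf_from_table(xor_table, ["a", "b"])
```

Here `expr` becomes `"((not a) and b) or (a and (not b))"`, a correct (if not minimal) formula for XOR; picking the shortest among several such candidates is what keeps the emitted programs readable.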
Once symbolic formulas have been separately discovered for each component of the vector-valued functions f and g, we insert them into a template Python program that implements the basic loop over inputs that is inherent in an RNN. We present examples of our auto-generated programs in Figures 3 and 4 and in Appendix F.

4. Results

We will now test the program synthesis abilities of our MIPS algorithm on a benchmark of algorithmic tasks specified by numerical examples. For comparison, we try the same benchmark on GPT-4 Turbo, which is currently (as of January 2024) described by OpenAI as their latest-generation model, with a 128k context window and more capable than the original GPT-4.

4.1. Benchmark

Our benchmark consists of the 62 algorithmic tasks listed in Table 1. They each map one or two integer lists of length 10 or 20 into a new integer list. We refer to integers whose range is limited to {0, 1} as bits. We generated this task list manually, attempting to produce a collection of diverse tasks that would in principle be solvable by an RNN. We also focused on tasks whose known algorithms involved majority, minimum, maximum, and absolute value functions, because we believed they would be more easily learnable than other nonlinear algorithms due to our choice of the ReLU activation for our RNNs. The benchmark training data and project code are available at https://github.com/ejmichaud/neural-verification. The tasks are described in Table 1, with additional details in Appendix E.

Since the focus of our paper is not on whether RNNs can learn algorithms, but on whether learned algorithms can be auto-extracted into Python, we discarded from our benchmark any generated tasks on which our RNN training failed to achieve 100% accuracy. Our benchmark can never show that MIPS outperforms any large language model (LLM): because LLMs are typically trained on GitHub, many LLMs can produce Python code for complicated programming tasks that fall outside of the class we study.
Instead, the question that our MIPS-LLM comparison addresses is whether MIPS complements LLMs by being able to solve some tasks where an LLM fails.

4.2. Evaluation

For both our method and GPT-4 Turbo, a task is considered solved if and only if a Python program is produced that solves the task with 100% accuracy. GPT-4 Turbo is prompted using the "chain-of-thought" approach described below and illustrated in Figure 5.

For a given task, the LLM receives two lists of length 10 sourced from the respective RNN training set. The model is instructed to generate a formula that transforms the elements of list "x" (features) into the elements of list "y" (labels). Subsequently, the model is instructed to translate this formula into Python code. The model is specifically asked to use elements of the aforementioned lists as a test case and print "Success" or "Failure" depending on whether the generated function achieves full accuracy on the test case. An external program extracts a fenced markdown code block from the output, which is saved to a separate file and executed to determine if it successfully completes the task. To improve the chance of success, this GPT-4 Turbo prompting process is repeated three times, requiring only that at least one attempt succeed. We run GPT-4 Turbo using default temperature settings.

Table 1. Benchmark results (1 = solved, 0 = not solved). For tasks with note "see text", please refer to Appendix E.

Task # | #Input Strings | Element Type | Task Description | Task Name | Solved by GPT-4? | Solved by MIPS?
1 | 2 | bit | Binary addition of two bit strings | BinaryAddition | 0 | 1
2 | 2 | int | Ternary addition of two digit strings | Base3Addition | 0 | 0
3 | 2 | int | Base 4 addition of two digit strings | Base4Addition | 0 | 0
4 | 2 | int | Base 5 addition of two digit strings | Base5Addition | 0 | 0
5 | 2 | int | Base 6 addition of two digit strings | Base6Addition | 1 | 0
6 | 2 | int | Base 7 addition of two digit strings | Base7Addition | 0 | 0
7 | 2 | bit | Bitwise XOR | BitwiseXor | 1 | 1
8 | 2 | bit | Bitwise OR | BitwiseOr | 1 | 1
9 | 2 | bit | Bitwise AND | BitwiseAnd | 1 | 1
10 | 1 | bit | Bitwise NOT | BitwiseNot | 1 | 1
11 | 1 | bit | Parity of last 2 bits | ParityLast2 | 1 | 1
12 | 1 | bit | Parity of last 3 bits | ParityLast3 | 0 | 1
13 | 1 | bit | Parity of last 4 bits | ParityLast4 | 0 | 0
14 | 1 | bit | Parity of all bits seen so far | ParityAll | 0 | 1
15 | 1 | bit | Parity of number of zeros seen so far | ParityZeros | 0 | 1
16 | 1 | int | Cumulative number of even numbers | EvensCounter | 0 | 0
17 | 1 | int | Cumulative sum | SumAll | 1 | 1
18 | 1 | int | Sum of last 2 numbers | SumLast2 | 0 | 1
19 | 1 | int | Sum of last 3 numbers | SumLast3 | 0 | 1
20 | 1 | int | Sum of last 4 numbers | SumLast4 | 1 | 1
21 | 1 | int | Sum of last 5 numbers | SumLast5 | 1 | 1
22 | 1 | int | Sum of last 6 numbers | SumLast6 | 1 | 1
23 | 1 | int | Sum of last 7 numbers | SumLast7 | 1 | 1
24 | 1 | int | Current number | CurrentNumber | 1 | 1
25 | 1 | int | Number 1 step back | Prev1 | 1 | 1
26 | 1 | int | Number 2 steps back | Prev2 | 1 | 1
27 | 1 | int | Number 3 steps back | Prev3 | 1 | 1
28 | 1 | int | Number 4 steps back | Prev4 | 1 | 1
29 | 1 | int | Number 5 steps back | Prev5 | 1 | 1
30 | 1 | int | 1 if last two numbers are equal | PreviousEqualsCurrent | 0 | 1
31 | 1 | int | current − previous | DiffLast2 | 0 | 1
32 | 1 | int | abs(current − previous) | AbsDiff | 0 | 1
33 | 1 | int | abs(current) | AbsCurrent | 1 | 1
34 | 1 | int | abs(current) − abs(previous) | DiffAbsValues | 1 | 0
35 | 1 | int | Minimum of numbers seen so far | MinSeen | 1 | 0
36 | 1 | int | Maximum of integers seen so far | MaxSeen | 1 | 0
37 | 1 | int | Integer in 0-1 with highest frequency | Majority01 | 1 | 0
38 | 1 | int | Integer in 0-2 with highest frequency | Majority02 | 0 | 0
39 | 1 | int | Integer in 0-3 with highest frequency | Majority03 | 0 | 0
40 | 1 | int | 1 if even, otherwise 0 | EvensDetector | 1 | 0
41 | 1 | int | 1 if perfect square, otherwise 0 | PerfectSquareDetector | 0 | 0
42 | 1 | bit | 1 if bit string seen so far is a palindrome | BitPalindrome | 1 | 0
43 | 1 | bit | 1 if parentheses balanced so far, else 0 | BalancedParenthesis | 0 | 0
44 | 1 | bit | Number of bits seen so far mod 2 | ParityBitsMod2 | 1 | 0
45 | 1 | bit | 1 if last 3 bits alternate | AlternatingLast3 | 0 | 0
46 | 1 | bit | 1 if last 4 bits alternate | AlternatingLast4 | 1 | 0
47 | 1 | bit | Bit shift to right (same as Prev1) | BitShiftRight | 1 | 1
48 | 2 | bit | Cumulative dot product of bits mod 2 | BitDotProdMod2 | 0 | 1
49 | 1 | bit | Binary division by 3 (see text) | Div3 | 1 | 0
50 | 1 | bit | Binary division by 5 (see text) | Div5 | 0 | 0
51 | 1 | bit | Binary division by 7 (see text) | Div7 | 0 | 0
52 | 1 | int | Cumulative addition modulo 3 | AddMod3 | 1 | 1
53 | 1 | int | Cumulative addition modulo 4 | AddMod4 | 0 | 0
54 | 1 | int | Cumulative addition modulo 5 | AddMod5 | 0 | 0
55 | 1 | int | Cumulative addition modulo 6 | AddMod6 | 0 | 0
56 | 1 | int | Cumulative addition modulo 7 | AddMod7 | 0 | 0
57 | 1 | int | Cumulative addition modulo 8 | AddMod8 | 0 | 0
58 | 1 | int | 1D dithering, 4-bit to 1-bit (see text) | Dithering | 1 | 0
59 | 1 | int | Newton's law, free body (integer input) | NewtonFreebody | 0 | 1
60 | 1 | int | Newton's law of gravity (see text) | NewtonGravity | 0 | 1
61 | 1 | int | Newton's law w. spring (see text) | NewtonSpring | 0 | 1
62 | 2 | int | Newton's law w. magnetic field (see text) | NewtonMagnetic | 0 | 0
Total solved | | | | | 30 | 32

def f(s, t):
    a = 0; b = 0
    ys = []
    for i in range(10):
        c = s[i]; d = t[i]
        next_a = b ^ c ^ d
        next_b = b + c + d > 1
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Figure 3. The generated program for the addition of two binary numbers represented as bit sequences. Note that MIPS rediscovers the "ripple adder", where the variable b above is the carry bit.

4.3. Performance

As seen in Table 1, MIPS is highly complementary to GPT-4 Turbo: MIPS solves 32 of our tasks, including 13 that are not solved by GPT-4 Turbo (which solves 30).

The AutoML process of Section 3.1 discovers networks of varying task-dependent shape and size. Table 2 shows the parameters p discovered for each task. Across our 62 tasks, 16 tasks could be solved by a network with hidden dimension n = 1, and the largest n required was 81. For many tasks, there was an interpretable meaning to the shape of the smallest network we discovered.
For instance, on tasks where the output is the element occurring k steps earlier in the list, we found n = k + 1, since the current element and the previous k elements must be stored for later recall.

We found two main failure modes for MIPS:

1. Noise and non-linearity. The latent space is still close to being a finite state machine, but the non-linearity and/or noise present in the RNN is so dominant that the integer autoencoder fails, e.g., for DiffAbsValues. Humans can stare at the lookup table and regress the symbolic function with their brains, but since the lookup table is not perfect, i.e., has the wrong integer in a few examples, MIPS fails to symbolically regress the function. This can probably be mitigated by learning and generalizing from a training subset with a smaller dynamic range.

2. Continuous computation. A key assumption of MIPS is that RNNs are finite-state machines. However, RNNs can also use continuous variables to represent information: the Majority0X tasks fail for this reason. This can probably be mitigated by identifying and implementing floating-point variables.

Figure 3 shows an example of MIPS rediscovering the "ripple-carry adder" algorithm.

```python
def f(s):
    a = 198; b = -11; c = -3; d = 483; e = 0
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + c + 190
        next_b = b - c - d - e + x + 480
        next_c = b - e + 8
        next_d = -b + e - x + 472
        next_e = a + b - e - 187
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e
        y = -d + 483
        ys.append(y)
    return ys
```

```python
def f(s):
    a = 0; b = 0; c = 0; d = 0; e = 0
    ys = []
    for i in range(20):
        x = s[i]
        next_a = x
        next_b = a
        next_c = b
        next_d = c
        next_e = d
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e
        y = a + b + c + d + e
        ys.append(y)
    return ys
```

Figure 4. Comparison of code generated from an RNN trained on SumLast5, without (top) and with (bottom) normalizers. The whitening normalizer provided numerical stability to the Jordan normal form normalizer, which itself simplified the recurrent portion of the program. The Toeplitz and de-biasing normalizers jointly sparsified the occurrences of x in the program, and the number of terms required to compute y. The quantization normalizer enabled all variables to be represented as integers.

The normalizers significantly simplified some of the resulting programs, as illustrated in Fig. 4, and sometimes made the difference between MIPS failing and succeeding. We found that applying a small L1 weight regularization sometimes facilitated integer autoencoding by axis-aligning the lattice.

5. Conclusions

We have presented MIPS, a novel method for program synthesis based on automated mechanistic interpretability of neural networks trained to perform the desired task, auto-distilling the learned algorithm into Python code. Its essence is to first train a recurrent neural network to learn a clever finite state machine that performs the task, and then automatically figure out how this machine works.

[Figure 5 flowchart: User: "Each row in the table below contains two lists...give me a formula for ..." → GPT: [Response] → User: "Please write a Python program to ..." → GPT: [Response] → extracted code block → success or failure]

Figure 5. We compare MIPS against program synthesis with the large language model GPT-4 Turbo, prompted with a "chain-of-thought" approach. It begins with the user providing a task, followed by the model's response, and culminates in assessing the success or failure of the generated Python code based on its accuracy in processing the provided lists.

5.1. Findings

We found MIPS highly complementary to LLM-based program synthesis with GPT-4 Turbo, with each approach solving many tasks that stumped the other.
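As a sanity check (ours, not from the paper), the normalized SumLast5 program in Figure 4 can be verified against a direct sliding-window sum:

```python
import random

def f(s):  # the normalized SumLast5 program from Figure 4 (bottom)
    a = b = c = d = e = 0
    ys = []
    for i in range(20):
        # a..e act as a shift register holding the last five inputs
        a, b, c, d, e = s[i], a, b, c, d
        ys.append(a + b + c + d + e)
    return ys

s = [random.randrange(10) for _ in range(20)]
for i, y in enumerate(f(s)):
    assert y == sum(s[max(0, i - 4):i + 1])  # sum of the last (up to) five elements
```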
Whereas LLM-based methods have the advantage of drawing upon a vast corpus of human training data, MIPS has the advantage of discovering algorithms from scratch without human hints, with the potential to discover entirely new algorithms. As opposed to genetic programming approaches, MIPS leverages the power of deep learning by exploiting gradient information.

Program synthesis aside, our results shed further light on mechanistic interpretability, specifically on how neural networks represent bits and integers. We found that n integers tend to get encoded linearly in n dimensions, but generically in non-orthogonal directions with an additive offset. This is presumably because there are many more such messy encodings than simple ones, and the messiness can be easily (linearly) decoded. We saw that n bits sometimes get encoded as an n-dimensional parallelogram, but not always, possibly because linear decodability is less helpful when the subsequent bit operations to be performed are nonlinear anyway.

5.2. Outlook

Our work is merely a modest first attempt at mechanistic-interpretability-based program synthesis, and there are many obvious generalizations worth trying in future work. For example:

1. Improvements in training and integer autoencoding (since many of our failed examples failed only just barely)
2. Generalization from RNNs to other architectures such as transformers
3. Generalization from bits and integers to more general extractable data types such as floating-point numbers and various discrete mathematical structures and knowledge representations
4. Scaling to tasks requiring much larger neural networks
5. Automated formal verification of synthesized programs (we perform such verification with Dafny in Appendix F.1 to show that our MIPS-learned ripple adder correctly adds any binary numbers, not merely those in the test set, but such manual work should ideally be fully automated)

LLM-based coding co-pilots are already highly useful for program synthesis tasks based on verbal problem descriptions or auto-complete, and will only get better. MIPS instead tackles program synthesis based on test cases alone. This makes it analogous to symbolic regression (Udrescu et al., 2020; Cranmer, 2023), which has already proven useful for various science and engineering applications (Cranmer et al., 2020; Ma et al., 2022) where one wishes to approximate data relationships with symbolic formulae. The MIPS framework generalizes symbolic regression from feed-forward formulae to programs with loops, which are in principle Turing complete. If this approach can be scaled up, it may enable promising opportunities for making machine-learned algorithms more interpretable, verifiable, and trustworthy.

Broader Impact

Because machine-learned algorithms now outperform traditional human-discovered algorithms on many tasks, there are incentives to deploy them even without a full understanding of how they work and of whether they are biased, unsafe, or otherwise problematic.
The aspirational broader impact motivating this paper is to help automate the process of making AI systems more transparent, robust, and trustworthy, with the ultimate goal of developing provably safe AI systems (Tegmark & Omohundro, 2023).

Acknowledgements

We thank Wes Gurnee, James Liu, and Armaun Sanayei for helpful conversations and suggestions. This work is supported by Erik Otto, Jaan Tallinn, the Rothberg Family Fund for Cognitive Science, the NSF Graduate Research Fellowship (Grant No. 2141064), and IAIFI through NSF grant PHY-2019786.

References

Bills, S., Cammarata, N., Mossing, D., Tillman, H., Gao, L., Goh, G., Sutskever, I., Leike, J., Wu, J., and Saunders, W. Language models can explain neurons in language models. https://openaipublic.blob.core.windows.net/neuron-explainer/paper/index.html, 2023.

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., Lasenby, R., Wu, Y., Kravec, S., Schiefer, N., Maxwell, T., Joseph, N., Hatfield-Dodds, Z., Tamkin, A., Nguyen, K., McLean, B., Burke, J. E., Hume, T., Carter, S., Henighan, T., and Olah, C. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023. https://transformer-circuits.pub/2023/monosemantic-features/index.html.

Burns, C., Ye, H., Klein, D., and Steinhardt, J. Discovering latent knowledge in language models without supervision. arXiv preprint arXiv:2212.03827, 2022.

Cammarata, N., Goh, G., Carter, S., Schubert, L., Petrov, M., and Olah, C. Curve detectors. Distill, 2020. doi: 10.23915/distill.00024.003. https://distill.pub/2020/circuits/curve-detectors.

Charton, F. Can transformers learn the greatest common divisor? arXiv preprint arXiv:2308.15594, 2023.

Conmy, A., Mavor-Parker, A. N., Lynch, A., Heimersheim, S., and Garriga-Alonso, A.
Towards automated circuit discovery for mechanistic interpretability. arXiv preprint arXiv:2304.14997, 2023.

Cranmer, M. Interpretable machine learning for science with PySR and SymbolicRegression.jl. arXiv preprint arXiv:2305.01582, 2023.

Cranmer, M., Sanchez Gonzalez, A., Battaglia, P., Xu, R., Cranmer, K., Spergel, D., and Ho, S. Discovering symbolic models from deep learning with inductive biases. Advances in Neural Information Processing Systems, 33:17429–17442, 2020.

Cunningham, H., Ewart, A., Riggs, L., Huben, R., and Sharkey, L. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

Goh, G., Cammarata, N., Voss, C., Carter, S., Petrov, M., Schubert, L., Radford, A., and Olah, C. Multimodal neurons in artificial neural networks. Distill, 2021. doi: 10.23915/distill.00030. https://distill.pub/2021/multimodal-neurons.

Gu, A. and Dao, T. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.

Gurnee, W. and Tegmark, M. Language models represent space and time. arXiv preprint arXiv:2310.02207, 2023.

Hanna, M., Liu, O., and Variengien, A. How does GPT-2 compute greater-than?: Interpreting mathematical abilities in a pre-trained language model. arXiv preprint arXiv:2305.00586, 2023.

Li, K., Hopkins, A. K., Bau, D., Viégas, F., Pfister, H., and Wattenberg, M. Emergent world representations: Exploring a sequence model trained on a synthetic task. arXiv preprint arXiv:2210.13382, 2022.

Lindner, D., Kramár, J., Farquhar, S., Rahtz, M., McGrath, T., and Mikulik, V. Tracr: Compiled transformers as a laboratory for interpretability. arXiv preprint arXiv:2301.05062, 2023.

Liu, Z., Kitouni, O., Nolte, N., Michaud, E. J., Tegmark, M., and Williams, M. Towards understanding grokking: An effective theory of representation learning. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022.
URL https://openreview.net/forum?id=6at6rB3IZm.

Ma, H., Narayanaswamy, A., Riley, P., and Li, L. Evolving symbolic density functionals. Science Advances, 8(36):eabq0279, 2022.

Marks, S. and Tegmark, M. The geometry of truth: Emergent linear structure in large language model representations of true/false datasets. arXiv preprint arXiv:2310.06824, 2023.

McGrath, T., Kapishnikov, A., Tomašev, N., Pearce, A., Wattenberg, M., Hassabis, D., Kim, B., Paquet, U., and Kramnik, V. Acquisition of chess knowledge in AlphaZero. Proceedings of the National Academy of Sciences, 119(47):e2206625119, 2022.

Nanda, N., Chan, L., Liberum, T., Smith, J., and Steinhardt, J. Progress measures for grokking via mechanistic interpretability. arXiv preprint arXiv:2301.05217, 2023.

Odena, A., Shi, K., Bieber, D., Singh, R., Sutton, C., and Dai, H. Bustle: Bottom-up program synthesis through learning-guided exploration. arXiv preprint arXiv:2007.14381, 2020.

Olah, C., Cammarata, N., Schubert, L., Goh, G., Petrov, M., and Carter, S. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020.

Olsson, C., Elhage, N., Nanda, N., Joseph, N., DasSarma, N., Henighan, T., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Johnston, S., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. In-context learning and induction heads. Transformer Circuits Thread, 2022. https://transformer-circuits.pub/2022/in-context-learning-and-induction-heads/index.html.

Quirke, P. et al. Understanding addition in transformers. arXiv preprint arXiv:2310.13121, 2023.

Sobania, D., Briesch, M., and Rothlauf, F. Choose your programming copilot: A comparison of the program synthesis performance of GitHub Copilot and genetic programming. In Proceedings of the Genetic and Evolutionary Computation Conference, p. 1019–1027, 2022.
Syed, A., Rager, C., and Conmy, A. Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348, 2023.

Tegmark, M. and Omohundro, S. Provably safe systems: the only path to controllable AGI. arXiv preprint arXiv:2309.01933, 2023.

Toshniwal, S., Wiseman, S., Livescu, K., and Gimpel, K. Chess as a testbed for language model state tracking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 36, p. 11385–11393, 2022.

Udrescu, S.-M., Tan, A., Feng, J., Neto, O., Wu, T., and Tegmark, M. AI Feynman 2.0: Pareto-optimal symbolic regression exploiting graph modularity. Advances in Neural Information Processing Systems, 33:4860–4871, 2020.

Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. Interpretability in the wild: a circuit for indirect object identification in GPT-2 small. arXiv preprint arXiv:2211.00593, 2022.

Wu, J., Wei, L., Jiang, Y., Cheung, S.-C., Ren, L., and Xu, C. Programming by example made easy. ACM Transactions on Software Engineering and Methodology, 33(1):1–36, 2023.

Zhong, Z., Liu, Z., Tegmark, M., and Andreas, J. The clock and the pizza: Two stories in mechanistic explanation of neural networks. arXiv preprint arXiv:2306.17844, 2023.

Zhou, B. and Ding, G. Survey of intelligent program synthesis techniques. In International Conference on Algorithms, High Performance Computing, and Artificial Intelligence (AHPCAI 2023), volume 12941, p. 1122–1136. SPIE, 2023.

A. Lattice finding using generalized greatest common divisor (GCD)

Our method often encounters cases where hidden states secretly form an affine transformation of an integer lattice. However, not all lattice points are observed in training samples, so our goal is to recover the hidden integer lattice from sparse observations.

A.1. Problem formulation

Suppose we have a set of lattice points in R^D spanned by D independent basis vectors b_i (i = 1, 2, \dots, D).
Each lattice point j has the position

x_j = \sum_{i=1}^{D} a_{ji} b_i + c,    (5)

where c is a global translation vector and the coefficients a_{ji} are integers. Our problem: given N such data points (x_1, x_2, \dots, x_N), how can we recover the integer coefficients a_{ji} for each data point, as well as b_i and c?

Note that even when the whole lattice is given, there are still degrees of freedom in the solution. For example, c \mapsto c + b_i, a_{ji} \mapsto a_{ji} - 1 remains a solution, and b_i \mapsto \sum_{j=1}^{D} \Lambda_{ij} b_j remains a solution if \Lambda is an integer matrix whose determinant is \pm 1. So our success criterion is: (1) the a_{ji} are integers; (2) the discovered bases and the true bases have the same determinant (the volume of a unit cell). Once a set of bases is found, we can simplify them by minimizing their total norms over valid transformations (\Lambda \in Z^{D \times D}, \det(\Lambda) = \pm 1).

A.2. Regular GCD

As a reminder, given a list of n numbers y_1, y_2, \dots, y_n, a common divisor d is a number such that y_i / d is an integer for all i. The common divisors form the set \{d \mid y_i / d \in Z\}, and the greatest common divisor (GCD) is the largest number in this set. Because

GCD(y_1, \dots, y_n) = GCD(y_1, GCD(y_2, GCD(y_3, \dots))),    (6)

it suffices without loss of generality to consider the case n = 2. A common algorithm to compute the GCD of two numbers is the Euclidean algorithm. We start with two numbers r_0, r_1 with r_0 > r_1, which is step 0. At the k-th step, we perform division-with-remainder to find the quotient q_k and the remainder r_k such that r_{k-2} = q_k r_{k-1} + r_k with |r_{k-1}| > |r_k|.¹ The algorithm eventually produces a zero remainder r_N = 0, and the last non-zero remainder r_{N-1} is the greatest common divisor. For example, GCD(55, 45) = 5, because

55 = 1 \times 45 + 10,
45 = 4 \times 10 + 5,
10 = 2 \times 5 + 0.    (7)

A.3. Generalized GCD in D dimensions

Given a list of n vectors y_1, y_2, \dots, y_n with y_i \in R^D, and assuming that these vectors lie in the lattice described by Eq. (5), we can without loss of generality set c = 0, since we can always redefine the origin.
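The division-with-remainder loop of A.2 is easy to sketch (our illustration, not the paper's implementation); with possibly negative inputs the final remainder is taken up to sign:

```python
def gcd(r0, r1):
    # Euclidean algorithm: repeat division-with-remainder until the
    # remainder vanishes; the last non-zero remainder is the GCD.
    while r1 != 0:
        r0, r1 = r1, r0 - (r0 // r1) * r1
    return abs(r0)

assert gcd(55, 45) == 5  # the worked example in Eq. (7)
```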
In D dimensions, the primitive of interest is the D-dimensional parallelogram: a line segment for D = 1 (one basis vector), a parallelogram for D = 2 (two basis vectors), a parallelepiped for D = 3 (three basis vectors), etc. One can construct a D-dimensional parallelogram by constructing its basis vectors as a linear integer combination of the y_j, i.e.,

q_i = \sum_{j=1}^{n} m_{ij} y_j, \quad m_{ij} \in Z, \quad i = 1, 2, \dots, D.    (8)

The goal of D-dimensional GCD is to find a "minimal" parallelogram, such that its volume (which is \det(q_1, q_2, \dots, q_D)) is the GCD of the volumes of all other possible parallelograms. Once the minimal parallelogram is found,² we can also determine the b_i in Eq. (5), since the b_i are exactly the q_i!

Figure 6. Both the red and the blue basis form a minimal parallelogram (in terms of cell volume), but one can further simplify red to blue by linear combination (simplicity in the sense of small \ell_2 norm).

To find the minimal parallelogram, we need two steps: (1) figure out the unit volume; (2) figure out q_i (i = 1, 2, \dots) whose volume is the unit volume.

Step 1: Compute the unit volume V_0. We first define representative parallelograms as ones where, for all i = 1, 2, \dots, D, the vectors m_i \equiv (m_{i1}, m_{i2}, \dots, m_{iD}) are one-hot, i.e., with only one element being 1 and 0 otherwise. It is easy to show that the volume of any parallelogram is a linear integer combination of the volumes of representative parallelograms, so WLOG we can focus on representative parallelograms. We compute the volumes of all representative parallelograms, which gives a volume array. Since volumes are just scalars, we can get the unit volume V_0 by calling the regular GCD on the volume array.

Step 2: Find a minimal parallelogram (whose volume is the unit volume computed in step 1). Recall that in regular GCD, we are dealing with two numbers (scalars).

¹ We are considering the general case where r_0 and r_1 may be negative. Otherwise the r_k can always be taken positive, hence no need for the absolute value.
To leverage this in the vector case, we need to create scalars out of vectors, and make sure that the vectors share the same linear structure as the scalars, so that we can extend division-with-remainder to vectors. A natural scalar is volume. Now consider two parallelograms P1 and P2 which share D - 1 basis vectors (y_3, \dots, y_{D+1}), but whose last basis vector differs: y_1 for P1 and y_2 for P2. Denote their volumes V_1 and V_2:

V_1 = \det(y_1, y_3, y_4, \dots, y_D),
V_2 = \det(y_2, y_3, y_4, \dots, y_D).    (9)

Since

a V_1 + b V_2 = \det(a y_1 + b y_2, y_3, y_4, \dots, y_D),    (10)

(V_1, V_2) and (y_1, y_2) share the same linear structure. We can simply apply division-with-remainder to V_1 and V_2 as in regular GCD:

V'_1, V'_2 = GCD(V_1, V_2),    (11)

whose quotients in all iterations are saved and transferred to y_1 and y_2:

y'_1, y'_2 = GCD\ with\ predefined\ quotients(y_1, y_2).    (12)

If V_1 = V_0 (which is the condition for a minimal parallelogram), the algorithm terminates and returns (y'_1, y_3, y_4, \dots, y_D). If V_1 > V_0, we repeat step 2 with the new vector list y'_1, y_3, \dots, y_N.

Why can we remove y'_2 for the next iteration? Note that although eventually V'_1 > 0 and V'_2 = 0, typically y'_2 \neq 0. However, since

0 = V'_2 = \det(y'_2, y_3, y_4, \dots, y_D),    (13)

y'_2 is a linear combination of (y_3, \dots, y_D), and hence can be removed from the vector list.

Step 3: Simplification of basis vectors. We want to further simplify the basis vectors, since those obtained in step 2 may have large norms. For example, for D = 2 the standard integer lattice has b_1 = (1, 0) and b_2 = (0, 1), but there are infinitely many possibilities after step 2, as long as pt - sq = \pm 1 for b_1 = (p, q) and b_2 = (s, t), e.g., b_1 = (3, 5) and b_2 = (4, 7). To minimize \ell_2 norms, we choose a basis and project-and-subtract for the other bases.

² There could be many minimal parallelograms, but finding one is sufficient.
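The project-and-subtract simplification just described is essentially classical 2D lattice basis reduction. A sketch (our illustration, not the paper's code), applied to the example basis b_1 = (3, 5), b_2 = (4, 7):

```python
import numpy as np

def reduce_basis(b1, b2):
    """Greedy 2D project-and-subtract: repeatedly subtract an integer
    multiple of the shorter vector from the longer one. The unit-cell
    volume |det| is preserved because each step has determinant +/-1."""
    b1, b2 = np.array(b1, float), np.array(b2, float)
    while True:
        if b1 @ b1 > b2 @ b2:             # keep b1 the shorter vector
            b1, b2 = b2, b1
        m = round((b2 @ b1) / (b1 @ b1))  # nearest-integer projection coefficient
        if m == 0:
            return b1, b2
        b2 = b2 - m * b1

b1, b2 = reduce_basis((3, 5), (4, 7))
assert abs(b1[0] * b2[1] - b1[1] * b2[0]) == 1  # unit-cell volume unchanged
assert b1 @ b1 == 1 and b2 @ b2 == 1            # reduced to unit-norm vectors
```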
Note that: (1) again we are only allowed to subtract integer multiples of the chosen basis; (2) the volume of the parallelogram does not change, since the project-and-subtract matrix has determinant 1 (suppose b_i (i = 2, 3, \dots, D) are projected onto b_1 and subtracted by multiples of b_1; p_* denotes the projection integers):

\begin{pmatrix}
1 & p_{2 \to 1} & p_{3 \to 1} & \cdots & p_{D \to 1} \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{pmatrix}    (14)

We do this iteratively, until no norm can be made shorter via project-and-subtract. Please see Figure 6 for an illustration of how simplification works in a 2D example.

The computational overhead is actually surprisingly small: in typical cases, we only need O(1) calls to GCD.

Dealing with noise. Usually the integer lattice in the hidden space is not perfect, i.e., it is vulnerable to noise. How do we extract integer lattices robustly in the presence of noise? Note that the terminating condition for the GCD algorithm is an exactly zero remainder; we relax this condition to requiring the absolute value of the remainder to be smaller than a threshold \varepsilon_{gcd}. Another issue is that noise can accumulate over the GCD iterations, so we want GCD to converge in a few steps. To achieve this, we select hidden states in a small region containing a fraction p% of the whole data. Since both \varepsilon_{gcd} and p depend on the data and on neural network training, which we do not know a priori, we grid sweep \varepsilon_{gcd} \in [10^{-3}, 1] and p \in (0.1, 100); for each (\varepsilon_{gcd}, p), we obtain an integer lattice and compute its description length, and we select the (\varepsilon_{gcd}, p) which gives the lattice with the smallest description length. The description length includes two parts: the integer descriptions of hidden states, \log(1 + |Z|^2), and the residual of reconstruction, \log(1 + ((AZ + b - X)/\varepsilon_{dl})^2) with \varepsilon_{dl} = 10^{-4}.

B. Linear lattice finder

Although our RNN can represent general nonlinear functions, in the special case when the RNN actually performs linear functions, program synthesis can be much easier.
So if the hidden MLP is linear, we would expect the hidden states to form an integer lattice, because the inputs are integer lattices and the mappings are linear. Effectively the hidden MLP acts as a linear function: h^{(t)} = W_h h^{(t-1)} + W_i x^{(t)} (we neglect the bias term since it is not relevant to finding basis vectors of a lattice). Given an input series x^{(1)}, x^{(2)}, \dots, x^{(t)}, the hidden state h^{(t)} is

h^{(t)} = \sum_{j=1}^{t} W_h^{t-j} W_i x^{(j)}.    (15)

Since the x^{(j)} themselves lie on integer lattices, we can then interpret the following as basis vectors:

W_h^{t-j} W_i, \quad j = 1, 2, \dots, t,    (16)

which are not necessarily independent. For example, for the task of summing up the last two numbers, W_h W_i and W_i are non-zero, independent vectors, while the others W_h^n W_i \approx 0 for n \geq 2; then W_h W_i and W_i are the two basis vectors of the lattice. In general, we measure the norms of all candidate basis vectors and select the k vectors with the highest norms, which are exactly the basis vectors of the hidden lattice.

C. Symbolic regression

The formulation of symbolic regression is that one has data pairs (x_i, y_i), i = 1, 2, \dots, N, with N data samples. The goal is to find a symbolic formula f such that y_i = f(x_i). A function is expressed in reverse Polish notation (RPN); for example, |a| - c is expressed as aAc-, where A stands for the absolute value function. We have three types of variables:

• type-0 operator: input variables and constants.
• type-1 operator (takes in one type-0 to produce one type-0): the operations >, <, ∼, D, A. > means +1; < means -1; ∼ means negating the number; D is a Dirac delta, which outputs 1 only when its input is 0; A is the absolute value function.
• type-2 operator (takes in two type-0 to produce one type-0): the operations +, ∗, -, %. + is addition of two numbers; ∗ is multiplication; - is subtraction; % is the remainder of one number modulo the other.
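A minimal brute-force searcher over these operators (our sketch, not the paper's implementation; it enumerates raw RPN token strings and discards syntactically invalid ones rather than precomputing the templates discussed next, and it caps the search at shorter formulae than the paper's six symbols):

```python
import itertools

UNARY = {'>': lambda a: a + 1, '<': lambda a: a - 1, '~': lambda a: -a,
         'D': lambda a: int(a == 0), 'A': abs}
BINARY = {'+': lambda a, b: a + b, '*': lambda a, b: a * b,
          '-': lambda a, b: a - b, '%': lambda a, b: a % b}
LEAVES = ['x', '1', '2', '3']  # one input variable and a few small constants

def run_rpn(tokens, x):
    """Evaluate a postfix token string; return None if it is invalid."""
    stack = []
    for t in tokens:
        if t in LEAVES:
            stack.append(x if t == 'x' else int(t))
        elif t in UNARY:
            if not stack:
                return None
            stack.append(UNARY[t](stack.pop()))
        else:
            if len(stack) < 2:
                return None
            b, a = stack.pop(), stack.pop()
            if t == '%' and b == 0:
                return None
            stack.append(BINARY[t](a, b))
    return stack[0] if len(stack) == 1 else None

def brute_force(pairs, max_len=3):
    vocab = LEAVES + list(UNARY) + list(BINARY)
    for n in range(1, max_len + 1):
        for toks in itertools.product(vocab, repeat=n):
            if all(run_rpn(toks, x) == y for x, y in pairs):
                return toks
    return None

pairs = [(x, abs(x) - 1) for x in range(-5, 6)]  # data generated by |x| - 1
sol = brute_force(pairs)
assert sol is not None and all(run_rpn(sol, x) == y for x, y in pairs)
```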
Only certain templates (strings over the symbols 0, 1, 2) are syntactically correct; for example, 002 is correct while 02 is not. We iterate over all templates of at most 6 symbols, and for each template we try all variable combinations. Each variable combination corresponds to a symbolic formula f, for which we check whether f(x_i) = y_i on 100 data points. On success, we terminate the brute-force search and return the successful formula. If the brute-force search does not find any correct symbolic formula within the compute budget, we simply return the formula a, so that a program can still be synthesized; it will merely fail to make correct predictions.

D. Neural Network Normalization Algorithms

It is well known that neural networks exhibit a large amount of symmetry: there are many transformations that can be applied to a network without affecting the map y = f(x) that it computes. A classic example is permuting the neurons within layers. In this section, we describe a suite of normalizers that we use to transform our networks into a standard form, such that the algorithms they learn are easier to interpret. We call our five normalizers "Whitening", "Jordan normal form (JNF)", "Toeplitz", "De-bias", and "Quantization".

The main symmetry we focus on is a linear transformation of the hidden space h \mapsto Ah, which requires the following changes to f and g:

f(h, x) = Wh + Vx + b \implies f(h, x) = AWA^{-1} h + AVx + Ab,
g(h) = G(Uh + c) \implies g(h) = G(UA^{-1} h + c),

and is implemented by changing the weights:

W \implies AWA^{-1}, \quad V \implies AV, \quad b \implies Ab, \quad U \implies UA^{-1}.

For this symmetry, we can apply an arbitrary invertible similarity transformation A to W, which is the core idea underlying our normalizers, three of which have their own unique ways of constructing A, as described in the sections below. Most importantly, one of our normalizers exploits A to convert the hidden-to-hidden transformation W into Jordan normal form, in the case where f is linear.
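The invariance under h ↦ Ah can be checked numerically. A small sketch (ours, assuming a purely linear f and an identity readout nonlinearity G):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, T = 4, 2, 1, 6

# random linear RNN: f(h, x) = W h + V x + b,  g(h) = U h + c  (G = identity)
W = rng.normal(size=(n, n)); V = rng.normal(size=(n, d_in)); b = rng.normal(size=n)
U = rng.normal(size=(d_out, n)); c = rng.normal(size=d_out)

def run(W, V, b, U, c, xs):
    h = np.zeros(W.shape[0]); ys = []
    for x in xs:
        h = W @ h + V @ x + b   # hidden update
        ys.append(U @ h + c)    # readout
    return np.array(ys)

# conjugate all weights by an arbitrary invertible A
A = rng.normal(size=(n, n)); Ainv = np.linalg.inv(A)
xs = rng.normal(size=(T, d_in))
y1 = run(W, V, b, U, c, xs)
y2 = run(A @ W @ Ainv, A @ V, A @ b, U @ Ainv, c, xs)
assert np.allclose(y1, y2)  # the input-output map is unchanged
```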
Recent work has shown that large recurrent networks with linear hidden-to-hidden transformations, such as state space models (Gu & Dao, 2023), can perform just as well as transformer-based models in language modeling at large scale. A main advantage of using linear hidden-to-hidden transformations is the possibility of expressing the hidden space in its eigenbasis. This causes the hidden-to-hidden transformation to become diagonal, so that it can be computed more efficiently. In practice, modern state space models assume diagonality, and go further to assume the elements on the diagonal are real; they fix the architecture to be this way during training. Doing so ignores linear hidden-to-hidden transformations that cannot be brought to a real diagonal matrix via diagonalization. Such examples include rotation matrices (whose eigenvalues may be complex) and shift matrices (whose eigenvalues are degenerate and whose eigenvectors are duplicated). A more general form than the diagonal form is the Jordan normal form, which consists of Jordan blocks along the diagonal, each of the form \lambda I + S for an eigenvalue \lambda and the shift matrix S with ones on the superdiagonal and zeros elsewhere. Diagonalization is a special case of the Jordan normal form, and all matrices can be transformed to Jordan normal form. A simple transformation can also be applied to Jordan normal forms that contain pairs of complex generalized eigenvectors, converting them into real matrices. For nonlinear hidden-to-hidden transformations, we compute W as though the nonlinearities had been removed.

D.1. Whitening Transformation

Similar to normalizing the means and variances of a dataset before feeding it into a machine learning model, a good first preprocessing step is to normalize the distribution of hidden states. We therefore choose to apply a whitening transformation to the hidden space.
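A sketch of such a whitening step (ours, not the paper's code; it uses the symmetric inverse square root of the covariance, which is the whitening transformation closest to the identity, matching the criterion described below):

```python
import numpy as np

rng = np.random.default_rng(1)
H = rng.normal(size=(1000, 3)) @ rng.normal(size=(3, 3))  # correlated "hidden states"

# symmetric inverse square root of the covariance matrix
cov = np.cov(H, rowvar=False)
U, s, _ = np.linalg.svd(cov)       # cov is symmetric PSD, so this diagonalizes it
A = U @ np.diag(s ** -0.5) @ U.T   # A cov A^T = I
Hw = H @ A.T                       # whitened hidden states

assert np.allclose(np.cov(Hw, rowvar=False), np.eye(3), atol=1e-8)
```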
To compute the transformation, we compute the covariance matrix of hidden activations across the dataset, and use the singular value decomposition (SVD) of this covariance matrix to find the closest transformation to the identity that brings this covariance matrix to the identity. We ignore any directions with covariance less than \varepsilon = 0.1, which cause more instability when normalized. We then post-apply this transformation to the last linear layer of the hidden-to-hidden transformation and its biases, and pre-apply its inverse to the first layers of the hidden-to-hidden and hidden-to-output transformations. This leaves the net behavior of the network unchanged. The transformations used by the other normalizers operate in a similar manner, by post-applying and pre-applying a transformation and its inverse to the last and first layers that interact with the hidden space.

D.2. Jordan Normal Form Transformation

Critically, the hidden-to-hidden transformations which we would like to convert into Jordan normal form are imperfect because they are learned. Eigenvectors belonging to the same Jordan block must be identical, whereas this will only be approximately true of the learned transformation. The Jordan normal form of a matrix is unstable; consider the matrix

W = \begin{pmatrix} 0 & 1 \\ \delta & 0 \end{pmatrix},

which, when \delta \neq 0, can be transformed into Jordan normal form by

\begin{pmatrix} 0 & 1 \\ \delta & 0 \end{pmatrix} = \underbrace{\begin{pmatrix} 1 & 1 \\ \sqrt{\delta} & -\sqrt{\delta} \end{pmatrix}}_{T} \begin{pmatrix} \sqrt{\delta} & 0 \\ 0 & -\sqrt{\delta} \end{pmatrix} \begin{pmatrix} 1 & 1 \\ \sqrt{\delta} & -\sqrt{\delta} \end{pmatrix}^{-1},    (17)

but when \delta = 0, by

\begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{T} \begin{pmatrix} 0 & 1 \\ 0 & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}^{-1}.    (18)

As we can see, all of the matrices in decomposition (17) are unstable near \delta = 0, so the issue of error thresholding is not only numerical but mathematical in nature. We would like to construct an algorithm which computes the Jordan normal form with an error threshold |\delta| < \varepsilon = 0.7 within which the algorithm will pick the transformation T from Equation (18) instead of from Equation (17).
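Before stating the algorithm, here is a sketch of the thresholded kernel operation it relies on (our illustration; the paper calls this \varepsilon-ker(X) and defines it in the next paragraph):

```python
import numpy as np

def eps_ker(X, eps=0.1):
    """Columns spanning the approximate nullspace of X: right singular
    vectors whose singular values fall below the threshold eps."""
    _, s, Vh = np.linalg.svd(X)   # full_matrices=True: Vh is n x n
    s = np.concatenate([s, np.zeros(Vh.shape[0] - len(s))])  # pad for wide X
    return Vh[s < eps].T

X = np.array([[1.0, 0.0, 0.0],
              [0.0, 0.01, 0.0]])  # one strong direction, one weak, one absent
K = eps_ker(X)
assert K.shape == (3, 2)            # the weak and absent directions
assert np.all(np.abs(X @ K) < 0.1)  # X nearly annihilates them
```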
Our algorithm first computes the eigenvalues \lambda_i, and then iteratively solves for the generalized eigenvectors, which lie in ker((W - \lambda I)^k) for increasing k. The approximation occurs whenever we compute the kernel (of unknown dimension) of a matrix X: we take the SVD of X and treat any singular vectors as part of the nullspace if their singular values are lower than the threshold \varepsilon, calling the result \varepsilon-ker(X). Spaces are always stored in the form of a rectangular matrix F of orthonormal vectors, and their dimension is always the width of the matrix. We build projections using proj(F) = FF^H, where F^H denotes the conjugate transpose of F. We compute kernels ker(X) of known dimension of matrices X by taking the SVD X = V_1 S V_2^H and taking the last singular vectors in V_2^H. We compute column spaces of projectors of known dimension by taking the top singular vectors of the SVD. The steps in our algorithm are as follows:

1. Solve for the eigenvalues \lambda_i of W, and check that eigenvalues within \varepsilon of each other form groups, i.e., that |\lambda_i - \lambda_j| \leq \varepsilon and |\lambda_j - \lambda_k| \leq \varepsilon always implies |\lambda_k - \lambda_i| \leq \varepsilon. Compute the mean eigenvalue of every group.

2. Solve for the approximate kernels of W - \lambda I for each mean eigenvalue \lambda. We denote this operation by \varepsilon-ker(W - \lambda I). We represent these kernels by storing the singular vectors whose singular values are lower than \varepsilon. Also, construct a "corrected matrix" of W - \lambda I for every \lambda by taking the SVD, discarding the low singular values, and multiplying the pruned decomposition back together again.

3. Solve for successive spaces F_k of generalized eigenvectors at increasing depths k along the set of Jordan chains with eigenvalue \lambda, for all \lambda. In other words, find chains of mutually orthogonal vectors which are mapped to zero after exactly k applications of the map W - \lambda I. We first solve for F_0 = ker(W - \lambda I).
Then for k > 0, we first solve for J_k = \varepsilon-ker((I - proj(F_{k-1}))(W - \lambda I)) and deduce the number of chains which reach depth k from the dimension of J_k, and then solve for F_k = col(proj(J_k) - proj(F_0)).

4. Perform a consistency check to verify that the dimensions of F_k always stay the same or decrease with k. Go through the spaces F_k in reverse order, and whenever the dimension of F_k decreases, figure out which direction(s) are not mapped to by applying W - \lambda I to F_{k+1}. Do this by building a projector J from mapping vectors representing F_{k+1} through W - \lambda I, and taking col(proj(F_k) - J). Solve for the Jordan chain by repeatedly applying proj(F_i)(W - \lambda I) for i starting from k - 1 and going all the way down to zero.

5. Concatenate all the Jordan chains together to form the transformation matrix T.

The transformation T consists of generalized eigenvectors which need not be completely real but may also include pairs of generalized eigenvectors that are complex conjugates of each other. Since we do not want the weights of our normalized network to be complex, we also apply a unitary transformation which changes any pair of complex generalized eigenvectors into a pair of real vectors, and the resulting block of W into a multiple of a rotation matrix. As an example, for a real 2×2 matrix W with complex eigenvectors, we have

W = T \begin{pmatrix} a + bi & 0 \\ 0 & a - bi \end{pmatrix} T^{-1} = T' \begin{pmatrix} a & -b \\ b & a \end{pmatrix} (T')^{-1}, \qquad T' = T \cdot \frac{1}{\sqrt{2}} \begin{pmatrix} 1 & i \\ 1 & -i \end{pmatrix}.

D.3. Toeplitz Transformation

Once W is in Jordan normal form, each Jordan block is an upper triangular Toeplitz matrix. Upper-triangular Toeplitz matrices, including Jordan blocks, always commute with each other, because they are all polynomials of the shift matrix (which has ones on the superdiagonal and zeros elsewhere); therefore these transformations leave W unchanged, but still affect V. We split V up into parts operated on by each Jordan block, and use these Toeplitz transformations to reduce the most numerically stable columns of each block of V to one-hot vectors.
The numerical stability of a column vector is determined by the absolute value of the bottom element of that column vector, since its inverse will become the degenerate eigenvalues of the resulting Toeplitz matrix. If no column has a numerical stability above ε = 0.0001, we pick the identity matrix for our Toeplitz transformation.

D.4. De-biasing Transformation

Oftentimes, W is not full rank and has a nontrivial nullspace. The bias b will have some component in the direction of this nullspace; eliminating this component only affects the behavior of the output network g, and the perturbation cannot carry on to the remainder of the sequence via f. Therefore, we eliminate any such component, and compensate accordingly by modifying the bias in the first affine layer of g. We identify the nullspaces by taking an SVD and identifying components whose singular value is less than ε = 0.1.

D.5. Quantization Transformation

After applying all of the previous transformations to the RNN, it is common for many of the weights to become close to zero or some other small integer. Treating this as a sign that the network is attempting to implement discrete operations using integers, we snap any weights and biases that are within a threshold ε = 0.01 of an integer to that integer. For certain simple tasks, this sometimes allows the entire network to become quantized.

E. Supplementary training data details

Here we present additional details on the benchmark tasks marked "see text" in Table 1:

• Div3/5/7: This is a long-division task for binary numbers. The input is a binary number, and the output is that binary number divided by 3, 5, or 7, respectively. The remainder is discarded. For example, 1000011/11 = 0010110 (67/3 = 22). The most significant bits occur first in the sequence.

• Dithering: This is a basic image color quantization task for 1D images.
We map 4-bit images to 1-bit images such that the cumulative sums of pixel brightnesses of the original and dithered images remain as close as possible.

• NewtonGravity: This is an Euler forward-integration task which follows the equations F = input − 1, v ↦ v + F, x ↦ x + v.

• NewtonSpring: This is an Euler forward-integration task which follows the equations F = input − x, v ↦ v + F, x ↦ x + v.

• NewtonMagnetic: This is an Euler forward-integration task which follows the equations F_x = input_1 − v_y, F_y = input_2 + v_x, v ↦ v + F, x ↦ x + v.

F. Generated programs

This section includes all successfully generated Python programs.

Binary-Addition

def f(s, t):
    a = 0; b = 0
    ys = []
    for i in range(10):
        c = s[i]; d = t[i]
        next_a = b ^ c ^ d
        next_b = b + c + d > 1
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Bitwise-Xor

def f(s, t):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]; c = t[i]
        next_a = b ^ c
        a = next_a
        y = a
        ys.append(y)
    return ys

Bitwise-Or

def f(s, t):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]; c = t[i]
        next_a = b + c > 0
        a = next_a
        y = a
        ys.append(y)
    return ys

Bitwise-And

def f(s, t):
    a = 0; b = 1
    ys = []
    for i in range(10):
        c = s[i]; d = t[i]
        next_a = (not a and not b and c and d) or (not a and b and not c and d) or (not a and b and c and not d) or (not a and b and c and d) or (a and not b and c and d) or (a and b and c and d)
        next_b = c + d == 0 or c + d == 2
        a = next_a; b = next_b
        y = a + b > 1
        ys.append(y)
    return ys

Bitwise-Not

def f(s):
    a = 1
    ys = []
    for i in range(10):
        x = s[i]
        next_a = x
        a = next_a
        y = -a + 1
        ys.append(y)
    return ys

Parity-Last2

def f(s):
    a = 0; b = 0
    ys = []
    for i in range(10):
        c = s[i]
        next_a = c
        next_b = a ^ c
        a = next_a; b = next_b
        y = b
        ys.append(y)
    return ys

Parity-Last3

def f(s):
    a = 0; b = 0; c = 0
    ys = []
    for i in range(10):
        d = s[i]
        next_a = d
        next_b = c
        next_c = a
        a = next_a; b = next_b; c = next_c
        y = a ^ b ^ c
        ys.append(y)
    return ys

Parity-All

def f(s):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]
        next_a = a ^ b
        a = next_a
        y = a
        ys.append(y)
    return ys

Parity-Zeros

def f(s):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]
        next_a = a + b == 0 or a + b == 2
        a = next_a
        y = a
        ys.append(y)
    return ys

Sum-All

def f(s):
    a = 884
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x
        a = next_a
        y = -a + 884
        ys.append(y)
    return ys

Sum-Last2

def f(s):
    a = 0; b = 99
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + x + 99
        next_b = -x + 99
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Sum-Last3

def f(s):
    a = 0; b = 198; c = 0
    ys = []
    for i in range(10):
        x = s[i]
        next_a = x
        next_b = -a - x + 198
        next_c = -b + 198
        a = next_a; b = next_b; c = next_c
        y = a + c
        ys.append(y)
    return ys

Sum-Last4

def f(s):
    a = 0; b = 99; c = 0; d = 99
    ys = []
    for i in range(10):
        x = s[i]
        next_a = c
        next_b = -x + 99
        next_c = -b - d + 198
        next_d = b
        a = next_a; b = next_b; c = next_c; d = next_d
        y = a - b - d + 198
        ys.append(y)
    return ys

Sum-Last5

def f(s):
    a = 198; b = -10; c = -2; d = 482; e = 1
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + c + 190
        next_b = b - c - d - e + x + 480
        next_c = b - e + 8
        next_d = -b + e - x + 472
        next_e = a + b - e - 187
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e
        y = -d + 483
        ys.append(y)
    return ys

Sum-Last6

def f(s):
    a = 0; b = 295; c = 99; d = 0; e = 297; f = 99
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -b + 295
        next_b = b - c + f
        next_c = b - c + d - 97
        next_d = -f + 99
        next_e = -a + 297
        next_f = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f
        y = -b + c - e - f + 592
        ys.append(y)
    return ys

Sum-Last7

def f(s):
    a = 297; b = 198; c = 0; d = 99; e = 0; f = -15; g = 0
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -a + d - f + g + 480
        next_b = a - d
        next_c = d + e - 99
        next_d = -c + 99
        next_e = -b + 198
        next_f = -c + f + x
        next_g = x
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f; g = next_g
        y = -d + f + 114
        ys.append(y)
    return ys

Current-Number

def f(s):
    a = 99
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -x + 99
        a = next_a
        y = -a + 99
        ys.append(y)
    return ys

Prev1

def f(s):
    a = 0; b = 99
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + 99
        next_b = -x + 99
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Prev2

def f(s):
    a = 99; b = 0; c = 0
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -x + 99
        next_b = -a + 99
        next_c = b
        a = next_a; b = next_b; c = next_c
        y = c
        ys.append(y)
    return ys

Prev3

def f(s):
    a = 0; b = 0; c = 99; d = 99
    ys = []
    for i in range(10):
        x = s[i]
        next_a = b
        next_b = -c + 99
        next_c = d
        next_d = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d
        y = a
        ys.append(y)
    return ys

Prev4

def f(s):
    a = 0; b = 99; c = 0; d = 99; e = 0
    ys = []
    for i in range(10):
        x = s[i]
        next_a = c
        next_b = -a + 99
        next_c = -d + 99
        next_d = -e + 99
        next_e = x
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e
        y = -b + 99
        ys.append(y)
    return ys

Prev5

def f(s):
    a = 0; b = 0; c = 99; d = 99; e = 99; f = 99
    ys = []
    for i in range(20):
        x = s[i]
        next_a = -c + 99
        next_b = -d + 99
        next_c = -b + 99
        next_d = e
        next_e = f
        next_f = -x + 99
        a = next_a; b = next_b; c = next_c; d = next_d; e = next_e; f = next_f
        y = a
        ys.append(y)
    return ys

Previous-Equals-Current

def f(s):
    a = 0; b = 0
    ys = []
    for i in range(10):
        c = s[i]
        next_a = delta(c - b)  # delta is the Kronecker delta: 1 if its argument is 0, else 0
        next_b = c
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Diff-Last2

def f(s):
    a = 199; b = 100
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -a - b + x + 498
        next_b = a + b - 199
        a = next_a; b = next_b
        y = a - 199
        ys.append(y)
    return ys

Abs-Diff

def f(s):
    a = 100; b = 100
    ys = []
    for i in range(10):
        c = s[i]
        next_a = b
        next_b = c + 100
        a = next_a; b = next_b
        y = abs(b - a)
        ys.append(y)
    return ys

Abs-Current

def f(s):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]
        next_a = abs(b)
        a = next_a
        y = a
        ys.append(y)
    return ys

Bit-Shift-Right

def f(s):
    a = 0; b = 1
    ys = []
    for i in range(10):
        x = s[i]
        next_a = -b + 1
        next_b = -x + 1
        a = next_a; b = next_b
        y = a
        ys.append(y)
    return ys

Bit-Dot-Prod-Mod2

def f(s, t):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]; c = t[i]
        next_a = (not a and b and c) or (a and not b and not c) or (a and not b and c) or (a and b and not c)
        a = next_a
        y = a
        ys.append(y)
    return ys

Add-Mod-3

def f(s):
    a = 0
    ys = []
    for i in range(10):
        b = s[i]
        next_a = (b + a) % 3
        a = next_a
        y = a
        ys.append(y)
    return ys

Newton-Freebody

def f(s):
    a = 82; b = 393
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x
        next_b = -a + b + 82
        a = next_a; b = next_b
        y = -a + b - 311
        ys.append(y)
    return ys

Newton-Gravity

def f(s):
    a = 72; b = 513
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a - x + 1
        next_b = -a + b + x + 71
        a = next_a; b = next_b
        y = b - 513
        ys.append(y)
    return ys

Newton-Spring

def f(s):
    a = 64; b = 57
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a + b - x - 57
        next_b = -a + 121
        a = next_a; b = next_b
        y = -a + 64
        ys.append(y)
    return ys

F.1. Formal Verification

The Dafny programming language is designed so that programs can be formally verified for correctness. The desired behavior of a program can be explicitly specified via preconditions, postconditions, and invariants, which are verified via automated theorem proving. These capabilities make Dafny useful in fields where correctness and safety are crucial.

We leverage Dafny's robust verification capabilities to prove the correctness of the bit addition Python program synthesized by MIPS.
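Before the formal proof, the synthesized Binary-Addition listing can be spot-checked directly in Python. This harness is ours, not from the paper; it assumes least-significant-bit-first input ordering, which matches the Dafny specification's ArrayToBv10 encoding.

```python
# Synthesized Binary-Addition program (Appendix F), spot-checked
# against ordinary integer addition with LSB-first bit lists.
def f(s, t):
    a = 0; b = 0                    # b acts as the carry bit
    ys = []
    for i in range(10):
        c = s[i]; d = t[i]
        next_a = b ^ c ^ d          # sum bit
        next_b = b + c + d > 1      # carry-out
        a = next_a; b = next_b
        ys.append(a)
    return ys

def to_bits(n):
    """10-bit LSB-first representation of n."""
    return [(n >> i) & 1 for i in range(10)]

for x in (0, 1, 67, 511, 1000):
    for y in (0, 3, 255, 900):
        assert f(to_bits(x), to_bits(y)) == to_bits((x + y) % 1024)
```

The check confirms the ripple-carry behavior: the first input position must hold the least significant bit so that the carry propagates forward through the sequence.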
The bit addition Python program was first converted to Dafny, then annotated with specific assertions, preconditions, and postconditions that define the expected behavior of the code. Each annotation in the code was then formally verified by Dafny, ensuring that under all possible valid inputs, the code's output is consistent with the expected behavior. The final postcondition of the method f, ensures BitAddition(s, t) == sresult, shows that the algorithm found by MIPS is indeed equivalent to performing bit addition on length-10 bitvectors in Dafny.

Dafny-Code

function ArrayToBv10(arr: array<bool>): bv10  // Converts boolean array to bitvector
  reads arr
  requires arr.Length == 10
{
  ArrayToBv10Helper(arr, arr.Length - 1)
}

function ArrayToBv10Helper(arr: array<bool>, index: nat): bv10
  reads arr
  requires arr.Length == 10
  requires 0 <= index < arr.Length
  decreases index
  ensures forall i :: 0 <= i < index ==> ((ArrayToBv10Helper(arr, i) >> i) & 1) == (if arr[i] then 1 else 0)
{
  if index == 0 then
    (if arr[0] then 1 else 0) as bv10
  else
    var bit: bv10 := if arr[index] then 1 as bv10 else 0 as bv10;
    (bit << index) + ArrayToBv10Helper(arr, index - 1)
}

method ArrayToSequence(arr: array<bool>) returns (res: seq<bool>)  // Converts boolean array to boolean sequence
  ensures |res| == arr.Length
  ensures forall k :: 0 <= k < arr.Length ==> res[k] == arr[k]
{
  res := [];
  var i := 0;
  while i < arr.Length
    invariant 0 <= i <= arr.Length
    invariant |res| == i
    invariant forall k :: 0 <= k < i ==> res[k] == arr[k]
  {
    res := res + [arr[i]];
    i := i + 1;
  }
}

function isBitSet(x: bv10, bitIndex: nat): bool
  requires bitIndex < 10
  ensures isBitSet(x, bitIndex) <==> (x & (1 << bitIndex)) != 0
{
  (x & (1 << bitIndex)) != 0
}

function Bv10ToSeq(x: bv10): seq<bool>  // Converts bitvector to boolean sequence
  ensures |Bv10ToSeq(x)| == 10
  ensures forall i: nat :: 0 <= i < 10 ==> Bv10ToSeq(x)[i] == isBitSet(x, i)
{
  [isBitSet(x, 0), isBitSet(x, 1), isBitSet(x, 2), isBitSet(x, 3),
   isBitSet(x, 4), isBitSet(x, 5), isBitSet(x, 6), isBitSet(x, 7),
   isBitSet(x, 8), isBitSet(x, 9)]
}

function BoolToInt(a: bool): int
{
  if a then 1 else 0
}

function XOR(a: bool, b: bool): bool
{
  (a || b) && !(a && b)
}

function BitAddition(s: array<bool>, t: array<bool>): seq<bool>  // Performs traditional bit addition
  reads s
  reads t
  requires s.Length == 10 && t.Length == 10
{
  var a: bv10 := ArrayToBv10(s);
  var b: bv10 := ArrayToBv10(t);
  var c: bv10 := a + b;
  Bv10ToSeq(c)
}

method f(s: array<bool>, t: array<bool>) returns (sresult: seq<bool>)  // Generated program for bit addition
  requires s.Length == 10 && t.Length == 10
  ensures |sresult| == 10
  ensures forall i :: 0 <= i && i < |sresult| ==> sresult[i] == ((s[i] != t[i]) != (i > 0 && ((s[i-1] || t[i-1]) && !(sresult[i-1] && (s[i-1] != t[i-1])))))
  ensures BitAddition(s, t) == sresult  // Verification of correctness
{
  var a: bool := false;
  var b: bool := false;
  var result: array<bool> := new bool[10];
  var i: int := 0;

  while i < result.Length
    invariant 0 <= i <= result.Length
    invariant forall j :: 0 <= j < i ==> result[j] == false
  {
    result[i] := false;
    i := i + 1;
  }

  i := 0;

  assert forall j :: 0 <= j < result.Length ==> result[j] == false;

  while i < result.Length
    invariant 0 <= i <= result.Length
    invariant b == (i > 0 && ((s[i-1] || t[i-1]) && !(result[i-1] && (s[i-1] != t[i-1]))))
    invariant forall j :: 0 <= j < i ==> result[j] == ((s[j] != t[j]) != (j > 0 && ((s[j-1] || t[j-1]) && !(result[j-1] && (s[j-1] != t[j-1])))))
  {
    assert b == (i > 0 && ((s[i-1] || t[i-1]) && !(result[i-1] && (s[i-1] != t[i-1]))));
    result[i] := XOR(b, XOR(s[i], t[i]));
    b := BoolToInt(b) + BoolToInt(s[i]) + BoolToInt(t[i]) > 1;
    assert b == ((s[i] || t[i]) && !(result[i] && (s[i] != t[i])));
    i := i + 1;
  }

  sresult := ArrayToSequence(result);
}
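The closed-form postcondition on method f can also be transcribed into plain Python and compared against integer addition. The transcription below is our own illustrative check, not part of the paper.

```python
# Our own Python transcription of the Dafny postcondition on method f:
# each output bit is s[i] XOR t[i] XOR a carry determined by position i-1.
def f_closed_form(s, t):
    result = []
    for i in range(10):
        carry = i > 0 and ((s[i-1] or t[i-1])
                           and not (result[i-1] and (s[i-1] != t[i-1])))
        result.append((s[i] != t[i]) != carry)
    return result

def to_bool_bits(n):
    """10-bit LSB-first boolean representation of n."""
    return [bool((n >> i) & 1) for i in range(10)]

for x in (0, 5, 67, 900, 1023):
    for y in (0, 3, 511, 999):
        assert f_closed_form(to_bool_bits(x), to_bool_bits(y)) == to_bool_bits((x + y) % 1024)
```

The carry expression (s | t) & !(result & (s ^ t)) is equivalent to the usual majority carry (s & t) | (carry_in & (s ^ t)) once result = s ^ t ^ carry_in is substituted, which is exactly what Dafny's loop invariant tracks.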
Table 2. AutoML architecture search results. All networks achieved 100% accuracy on at least one test batch.

Task #  Task Name               n w_f d_f w_g d_g   Train Loss  Test Loss
1       BinaryAddition          21142               0           0
2       Base3Addition           21152               0           0
3       Base4Addition           21152               0           0
4       Base5Addition           21152               0           0
5       Base6Addition           21162               2.45e-09    2.53e-09
6       Base7Addition           211102              2.32e-06    2.31e-06
7       BitwiseXor              11122               0           0
8       BitwiseOr               11111               3.03e-02    3.03e-02
9       BitwiseAnd              11111               3.03e-02    3.03e-02
10      BitwiseNot              11111               0           0
11      ParityLast2             1112292             1.68e-02    1.69e-02
12      ParityLast3             21152               1.62e-04    1.64e-04
13      ParityLast4             311292              3.07e-07    2.99e-07
14      ParityAll               11122               0           0
15      ParityZeros             11122               0           0
16      EvensCounter            411733              8.89e-05    8.88e-05
17      SumAll                  11111               6.09e-08    6.13e-08
18      SumLast2                21111               0           0
19      SumLast3                31111               6.34e-07    6.35e-07
20      SumLast4                41111               2.10e-04    2.11e-04
21      SumLast5                51111               8.86e-03    8.87e-03
22      SumLast6                61111               1.82e-02    1.81e-02
23      SumLast7                71111               3.03e-02    3.01e-02
24      CurrentNumber           11111               0           0
25      Prev1                   21111               0           0
26      Prev2                   31111               0           0
27      Prev3                   41111               0           0
28      Prev4                   51111               2.04e-07    2.05e-07
29      Prev5                   61111               6.00e-05    5.96e-05
30      PreviousEqualsCurrent   21152               6.72e-05    6.61e-05
31      DiffLast2               21111               0           0
32      AbsDiff                 22211               1.84e-07    1.84e-07
33      AbsCurrent              11122               4.51e-08    5.71e-08
34      DiffAbsValues           21142               3.15e-06    2.96e-06
35      MinSeen                 11122               0           0
36      MaxSeen                 11122               1.46e-12    0
37      Majority01              111632              4.03e-03    4.05e-03
38      Majority02              411982              1.64e-04    1.71e-04
39      Majority03              21111323            6.94e-05    6.86e-05
40      EvensDetector           5111632             8.18e-04    8.32e-04
41      PerfectSquareDetector   48111002            1.92e-03    1.97e-03
42      BitPalindrome           1811862             3.81e-05    3.69e-05
43      BalancedParenthesis     111162              7.44e-03    7.10e-03
44      ParityBitsMod2          11111               0           0
45      AlternatingLast3        21132               1.85e-02    1.87e-02
46      AlternatingLast4        21132               8.24e-06    8.09e-06
47      BitShiftRight           21111               0           0
48      BitDotProdMod2          11132               0           0
49      Div3                    211592              6.40e-03    6.43e-03
50      Div5                    411762              1.50e-04    1.55e-04
51      Div7                    4111032             6.65e-04    6.63e-04
52      AddMod3                 1111492             1.02e-03    1.04e-03
53      AddMod4                 211332              1.53e-04    1.44e-04
54      AddMod5                 311432              1.02e-03    1.03e-03
55      AddMod6                 4111082             6.14e-04    6.12e-04
56      AddMod7                 4111992             3.96e-04    4.07e-04
57      AddMod8                 67111342            8.53e-04    8.34e-04
58      Dithering               81111662            7.72e-04    7.75e-04
59      NewtonFreebody          21111               2.61e-07    2.62e-07
60      NewtonGravity           21111               1.81e-07    1.87e-07
61      NewtonSpring            21111               0           0
62      NewtonMagnetic          41111               8.59e-05    8.60e-05
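As a closing illustration, the Newton-Spring listing in Appendix F can be compared against a direct simulation of its defining recurrence F = input − x, v ↦ v + F, x ↦ x + v. The harness is our own, not from the paper; the initial state x = v = 0 is implied by the program's initial values a = 64, b = 57.

```python
# Synthesized Newton-Spring program (Appendix F) versus a direct Euler
# simulation of the stated recurrence F = input - x, v <- v + F, x <- x + v.
def f(s):
    a = 64; b = 57
    ys = []
    for i in range(10):
        x = s[i]
        next_a = a + b - x - 57
        next_b = -a + 121
        a = next_a; b = next_b
        ys.append(-a + 64)
    return ys

def simulate(s):
    x = v = 0
    ys = []
    for u in s:
        F = u - x        # spring force toward the input
        v = v + F        # Euler velocity update
        x = x + v        # Euler position update
        ys.append(x)
    return ys

inputs = [1, 0, 0, 2, 5, 3, 0, 1, 4, 2]
assert f(inputs) == simulate(inputs)
```

The hidden state encodes the physics with an affine offset: a = 64 − x and b = 57 − v + x, so the output −a + 64 recovers the position x at each step.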