Paper deep dive
Unifying Optimization and Dynamics to Parallelize Sequential Computation: A Guide to Parallel Newton Methods for Breaking Sequential Bottlenecks
Xavier Gonzalez
Abstract
Massively parallel hardware (GPUs) and long sequence data have made parallel algorithms essential for machine learning at scale. Yet dynamical systems, like recurrent neural networks and Markov chain Monte Carlo, were thought to suffer from sequential bottlenecks. Recent work showed that dynamical systems can in fact be parallelized across the sequence length by reframing their evaluation as a system of nonlinear equations, which can be solved with Newton's method using a parallel associative scan. However, these parallel Newton methods suffered from several limitations: inefficiency, instability, and a lack of convergence guarantees. This thesis addresses these limitations with methodological and theoretical contributions, drawing particularly from optimization. Methodologically, we develop scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods are faster and more memory efficient, while the trust-region approaches are significantly more stable. Theoretically, we unify many fixed-point methods into our parallel Newton framework, including Picard and Jacobi iterations. We establish a linear convergence rate for these techniques that depends on the method's approximation accuracy and stability. Moreover, we give a precise condition, rooted in dynamical stability, that characterizes when parallelization provably accelerates a dynamical system and when it cannot. Specifically, the sign of the Largest Lyapunov Exponent of a dynamical system determines whether or not parallel Newton methods converge quickly. In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not work. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.
Links
- Source: https://arxiv.org/abs/2603.16850v1
- Canonical: https://arxiv.org/abs/2603.16850v1
Full Text
376,956 characters extracted from source content.
UNIFYING OPTIMIZATION AND DYNAMICS TO PARALLELIZE SEQUENTIAL COMPUTATION: A GUIDE TO PARALLEL NEWTON METHODS FOR BREAKING SEQUENTIAL BOTTLENECKS

A dissertation submitted to the Department of Statistics and the Committee on Graduate Studies of Stanford University in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Xavier Gonzalez, March 2026. arXiv:2603.16850v1 [math.NA] 17 Mar 2026

Abstract

Recurrent neural networks (RNNs) were widely regarded as "inherently sequential" because each hidden state depends on the previous one. This sequential dependency creates a computational bottleneck: evaluating an RNN on a sequence of length T seems to require O(T) time steps, even with unlimited parallel processors. This dissertation challenges the conventional wisdom and develops methods that enable parallel evaluation of nonlinear RNNs with O((log T)^2) computational depth. Moreover, the methods I have developed and studied are very general, and can parallelize the broad class of computations falling under the heading of state space models (SSMs). SSMs include not only nonlinear RNNs but also Markov chain Monte Carlo (MCMC), sampling from diffusion models, and explicit differential equation solvers, among many other applications.

In my PhD, I built on an approach [41, 142] that reformulates RNN evaluation as a fixed-point problem and applies Newton's method to leverage the parallel scan algorithm. However, when I began work on this subject, the community's understanding of this parallel Newton method was hindered by certain limitations. The Newton iterations suffered from a lack of scalability in the state dimension D, a lack of stability in certain applications, and a general lack of understanding of its convergence properties and rates. In this thesis, I address these limitations with methodological and theoretical contributions.
The methodological contributions of this thesis include developing scalable and stable parallel Newton methods, based on quasi-Newton and trust-region approaches. The quasi-Newton methods further accelerate the training of RNNs and use a factor of D less memory. The trust-region approaches are parallelized over the sequence length using a parallel Kalman filter and are significantly more stable than their undamped counterparts. These methods have inspired follow-up work both in nonlinear sequence modeling and in parallelizing MCMC.

The theoretical contributions of this thesis include establishing the convergence rates of parallel Newton methods. Both the Newton and quasi-Newton methods enjoy global convergence in at most T iterations. Moreover, we show that the conditioning of the optimization landscape, as quantified by its Polyak-Łojasiewicz (PL) constant, is determined by the stability of the dynamical system, as quantified by its Largest Lyapunov Exponent (LLE). By doing so, we show that stable (i.e. LLE < 0) dynamics enjoy convergence in O(log T) iterations, while unstable dynamics converge too slowly for parallelization to work.

In sum, this thesis unlocks scalable and stable methods for parallelizing sequential computation, and provides a firm theoretical basis for when such techniques will and will not be effective. This thesis also serves as a guide to parallel Newton methods for researchers who want to write the next chapter in this ongoing story.

Acknowledgments

Thank you to everyone who has made this PhD dissertation possible, and to all who have taught and mentored me over the years.

Thank you to my advisor Scott Linderman for teaching me so much about research, collaboration, mentoring, software engineering, math, statistics, neuroscience, and more. You have created one of the kindest and happiest labs at Stanford, and you are the driving force behind this social and collaborative culture.
Speaking of the Linderman lab, thank you to all the amazing mentors and friends I've met in Scott's lab. I am grateful for the opportunity to work with and learn from such talented researchers.

Thank you to all my collaborators: I have learned so much and had so much fun working with you. Thank you in particular to the four postdocs who especially mentored me on these projects: Andy Warrington, Leo Kozachkov, Kelly Buchanan, and David Zoltowski. Each of you taught me so much in different ways, and I am grateful for your generosity of time and wisdom.

Thank you to my family, friends, and mentors for their support and encouragement. Thank you especially to my parents Javier and Natalya, my sister Natasha, and my girlfriend Tiffany. Your nurturing and support has made everything possible—and your love has brightened my days.

Ultimately, thank you to God—you are the source of all good things. Thank you for the many blessings of my PhD.

Contents

Part I: Introduction and Background
1 Introduction
  1.1 Extended History
  1.2 Outline
2 Background
  2.1 Dynamics: State Space Models
    2.1.1 State Space Models (SSMs)
    2.1.2 Examples of SSMs
    2.1.3 Limitation of SSMs: "Inherently Sequential"
  2.2 Parallel Computing: The Parallel Associative Scan
    2.2.1 The Parallel Scan: A Gentle Introduction
    2.2.2 Parallelizing Linear Dynamical Systems
    2.2.3 Parallelizing Kalman Filtering and Smoothing
    2.2.4 The Difficulties of Parallelizing an SSM in general
  2.3 Numerical Analysis: Newton's Method
    2.3.1 Root-finding
    2.3.2 Optimization
    2.3.3 Fixed-point methods
  2.4 Putting It All Together: Parallel Newton Methods
    2.4.1 Parallel Newton methods: DEER and DeepPCR
    2.4.2 More in-depth derivation
    2.4.3 Limitations of Newton's method

Part II: Methods: Scalable and Stable Parallelization
3 Scalable Parallelization: Quasi-Newton Methods
  3.1 Quasi-DEER: A diagonal approximation
  3.2 Global convergence
  3.3 Experiments and performance of quasi-DEER
    3.3.1 Quasi-DEER for Evaluation
    3.3.2 Quasi-DEER for Training
  3.4 Further development and directions for future work
    3.4.1 Efficiently Estimating the Diagonal of the Jacobian
    3.4.2 Generalizing quasi-DEER to other approximate Jacobians
    3.4.3 Training and the backwards pass
    3.4.4 Initializing the guess for the state trajectory
4 Stable Parallelization: ELK and Trust Region Methods
  4.1 Levenberg-Marquardt and Trust-Region Methods
  4.2 ELK: Evaluating Levenberg-Marquardt with Kalman
  4.3 Dynamics perspective on ELK
  4.4 Experiments and performance of ELK
    4.4.1 Edge of stability: Parallelizing a sine wave
    4.4.2 Chaotic system: Parallelizing the Lorenz96 System
  4.5 Further extensions: scale- and clip-ELK
    4.5.1 Scale-ELK
    4.5.2 Clip-ELK
  4.6 Conclusion

Part III: Theory: Convergence Rates
5 Convergence Rates of Gauss-Newton for Parallelizing Nonlinear SSMs
  5.1 Predictability and the Largest Lyapunov Exponent
  5.2 Polyak-Łojasiewicz and Merit Landscape Conditioning
  5.3 Conditioning depends on dynamical properties
    5.3.1 Merit Function PL Constant is Controlled by the Largest Lyapunov Exponent of Dynamics
    5.3.2 Residual Function Jacobian Inherits the Lipschitzness of the Nonlinear State Space Model
  5.4 Rates of Convergence for Optimizing the Merit Function
    5.4.1 DEER Always Converges Globally at a Linear Rate
    5.4.2 Size of DEER Basin of Quadratic Convergence
  5.5 Experiments
    5.5.1 The Convergence Rate Exhibits a Threshold between Predictable and Chaotic Dynamics
    5.5.2 DEER can converge quickly for predictable trajectories passing through unpredictable regions
    5.5.3 Application: Chaotic Observers
  5.6 Discussion
    5.6.1 Related Work
    5.6.2 Implications
  5.7 Extensions
6 Convergence Rates of Quasi-Newton Methods for Parallelizing SSMs
  6.1 Unifying fixed-point iterations as quasi-DEER methods
    6.1.1 Picard iterations
    6.1.2 Jacobi iterations
    6.1.3 Summary
  6.2 Convergence rates for quasi-DEER
    6.2.1 Convergence rates of fixed-point iterations
    6.2.2 Limitations of this convergence analysis
    6.2.3 Intuitions about rates of convergence
    6.2.4 Summary of Convergence Analysis
  6.3 Performance of the different fixed-point methods
    6.3.1 Case study #1: Solving the group word problem with Newton iterations
    6.3.2 Case study #2: Picard iterations struggle to parallelize RNNs
    6.3.3 Case study #3: Jacobi iterations struggle to parallelize discretized Langevin diffusion
  6.4 Related Work
  6.5 Discussion

Part IV: Conclusion
7 Conclusion and Future Directions
  7.1 Summary of Contributions
  7.2 Future Directions
    7.2.1 Improving parallel Newton methods
    7.2.2 Finding the best applications of parallel Newton methods

Part V: Appendix
A Global Convergence of Parallel Newton Methods
  A.1 Comparison of the two results
  A.2 Corrected version of Theorem 3.6 of Tang et al.
B Predictability and Conditioning
  B.1 Theorem statement and proof
  B.2 Discussion of why small singular values lead to ill-conditioning
  B.3 The dynamical interpretation of the inverse Jacobian
    B.3.1 Connection to semiseparable matrices and Mamba2
  B.4 Framing based on global bounds
  B.5 Discussion of the LLE regularity conditions
  B.6 Controlling the maximum singular value
  B.7 Condition number of the Jacobian
C Discussion of Parallel Chord Methods

Bibliography

Part I: Introduction and Background

The first part of this thesis provides the motivation and background for parallelizing dynamical systems.
We introduce the fundamental problem of sequential computation in deep learning and review the mathematical foundations that enable parallel evaluation of these models.

Figure 1: Parallel Newton methods. With a clever connection between Newton's method and the parallel scan, we can use GPUs to parallelize and therefore accelerate dynamical systems.

1 Introduction

Sequential processes are ubiquitous in statistics and machine learning. Evaluating a recurrent neural network [81], sampling from a diffusion model [100, 209, 214] or with Markov chain Monte Carlo (MCMC) [51, 71], generating from a deep state space model [85, 86, 181, 207], and unrolling layers of a deep neural network [96, 226] all involve sequential computations. Naively, these computations require time proportional to the length of the input or the depth of the architecture, and in some cases they may not take full advantage of massively parallel modern hardware like graphics processing units (GPUs). For example, a computational graph that is a very long chain of individually inexpensive steps will not fully utilize the approximately 10,000 cores of a modern GPU.

This mismatch between the requirements of sequential computation and the design of modern parallel hardware has led to sequential models losing the "hardware lottery" [105] and being replaced by more easily parallelized architectures. This broad story is most clearly exemplified in the transition from recurrent neural networks (RNNs), the dominant sequence modeling architecture prior to 2018, towards attention and the transformer architecture, an embarrassingly parallel approach that powers most of modern AI, including the "generative pretrained transformers" behind ChatGPT and other modern large language models (LLMs) [29]. In fact, as Vaswani et al.
[226] write in the introduction to their landmark paper that introduced the transformer (emphasis added):

> Recurrent neural networks have been firmly established as state of the art approaches in sequence modeling and transduction problems such as language modeling and machine translation. However, their inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples.

Despite significant improvements in RNN computational efficiency, the fundamental constraint of sequential computation remains. In the era of massively parallel hardware like GPUs, and ever longer sequences of data to process, the "inherently sequential" [122, 149, 160, 208, 226] nature of RNNs was viewed as a disqualifying disadvantage in many applications.

Incredibly, however, as introduced in the seminal papers of Danieli et al. [41] and Lim et al. [142], nonlinear RNNs and many other types of "inherently sequential" computations can be parallelized over the sequence length. This parallelization is achieved by recasting the problem of sequential evaluation as a high-dimensional nonlinear equation that can be solved using Newton iterations that are parallelized over the sequence length. However, when these parallel Newton methods were first published, limitations blocked their wider use and adoption. These were standard limitations for Newton's method in general [26, 179, 180], namely:

- A lack of scalability of the method, especially as the state size increased;
- A lack of stability of the convergence of the method in certain applications; and
- A lack of understanding of under what conditions the Newton iterations would converge, and at what rates.

This dissertation helps to resolve these limitations.
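To make the recasting concrete, here is a toy pure-Python sketch of the idea (our own illustration; the function names and the contraction dynamics are hypothetical, not code from the thesis). We evaluate a scalar nonlinear recursion s_t = f(s_{t-1}) by Newton iterations on the stacked residuals s_t - f(s_{t-1}) = 0 for all t jointly: each Newton step linearizes f around the current guessed trajectory and solves the resulting linear recursion, which is done sequentially below but is exactly the step a parallel scan evaluates in O(log T) depth on a GPU.

```python
import math

def f(s):
    # Toy dynamics: a contraction, so the system is stable
    return 0.5 * math.tanh(s)

def fprime(s):
    return 0.5 * (1.0 - math.tanh(s) ** 2)

def sequential_unroll(s0, T):
    s, out = s0, []
    for _ in range(T):
        s = f(s)
        out.append(s)
    return out

def newton_parallel_eval(s0, T, iters):
    """Solve s_t - f(s_{t-1}) = 0 for all t jointly by Newton's method."""
    s = [0.0] * T                      # initial guess for the whole trajectory
    for _ in range(iters):
        prev = [s0] + s[:-1]           # s_{t-1} under the current guess
        A = [fprime(p) for p in prev]  # linearize f around the guess
        b = [f(p) - a * p for a, p in zip(A, prev)]
        # The Newton update is the solution of the *linear* recursion
        # x_t = A_t x_{t-1} + b_t, evaluated sequentially here but
        # computable in O(log T) depth with a parallel scan.
        x, xs = s0, []
        for a_t, b_t in zip(A, b):
            x = a_t * x + b_t
            xs.append(x)
        s = xs
    return s

exact = sequential_unroll(1.0, 16)
approx = newton_parallel_eval(1.0, 16, iters=16)
```

After at most T iterations the two trajectories agree exactly (one new state is pinned down per iteration at worst); for stable dynamics like this contraction, far fewer iterations suffice in practice, which is the source of the speedup.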
Methodologically, we introduce quasi-Newton methods to provide scalable parallelization and trust-region methods to provide stable parallelization. Theoretically, we provide detailed convergence analyses of these methods, including proofs of global convergence and the identification of the stability of the underlying dynamical system as a critical decider of whether or not efficient parallelization is possible. In doing so, we have unlocked scalable parallelization of nonlinear RNNs and a robust theoretical understanding of under what conditions such parallelization is desirable.

Moreover, these methods parallelize not only nonlinear RNNs [40, 61, 80, 142] but can also parallelize a wide range of models called state space models (SSMs). In this thesis, an SSM is a discrete-time dynamical system with state s_t ∈ R^D that evolves over time by a transition function s_t = f_t(s_{t-1}) (see Section 2.1). Examples of chain-like computational graphs involving SSMs include sampling from MCMC [83, 244] or diffusion models [41, 90, 153, 199, 201, 221], solving differential equations with explicit methods [111], and many other diverse applications [78].

Taken together, this thesis lays the foundation for exciting future work in parallelizing a broad range of important primitives, while also more clearly delineating which processes are—and are not—"inherently sequential." This thesis serves as an introduction to parallel Newton methods for researchers eager to contribute to this exciting new field.

1.1 Extended History

The modern development of massively parallel hardware like GPUs and TPUs has created new urgency around the parallelization of sequential processes, contributing to the modern development of these parallel Newton methods [41, 142]. However, this thesis and the parallel Newton methods it surveys build on a long tradition of parallel-in-time computing [66].
In short, as long as there have been parallel computers and long sequences, there has been important work in parallelizing these sequential processes—and the modern massive increase in scale has led to a renaissance of these methods.

While there were of course many earlier efforts at parallel computers, the ILLIAC IV is widely credited as being the first massively parallel computer [15]. The ILLIAC IV was developed in the 1960s and 1970s at the University of Illinois, and was designed to have 256 processors that could carry out computation in parallel. Almost immediately, this novel development in hardware led to novel developments in algorithms. For example, in 1973, Stone [216] explicitly cited the ILLIAC as motivation for his development of a technique to solve tridiagonal systems of equations^1 in parallel. Stone [216] called this method recursive doubling, and it is known today as parallel cyclic reduction or the parallel associative scan. We provide more background on the parallel scan as a general and fundamental primitive in Section 2.2. However, to give a specific example, a canonical application of the parallel scan is as a technique to use T processors to multiply T matrices together in O(log T) computational depth—thus enabling exponential speedups on large parallel machines.

In addition to the development of parallel methodology, the development of the ILLIAC also spurred fundamental work in the theory of what computations could and could not be parallelized. For example, in 1975, Hyafil and Kung [110] and Kung [134] explicitly cited the ILLIAC as motivation for their study of which algorithms and models could be efficiently parallelized. They showed that linear recursions enjoy speedups from parallel processors while nonlinear recursions of rational functions with degree larger than one in general cannot.
These prescient works set the stage for the more general findings of this thesis presented in Part III, where we explicitly link the dynamical properties of the recursion to its parallelizability.

The desire to solve differential equations over long time windows also led to the development of parallel-in-time methods for continuous-time initial value problems (IVPs) [69, 177]. While this dissertation primarily focuses on discrete-time SSMs, there are intimate links between discrete and continuous time, just as there are intimate links between difference and differential equations. In fact, the numerical solving of differential equations almost always eventually reduces to solving some discretization of the ordinary differential equation (ODE). Consequently, it is unsurprising that the ODE parallel-in-time, multiple shooting, and multigrid literature has many of the ingredients of modern parallel Newton methods. For example, in 1989 Bellen and Zennaro [18] suggested a quasi-Newton method for solving differential equations that has almost all the core components of parallel Newton methods—except for the parallel scan.

Indeed, the core ingredients of parallel Newton methods remained scattered throughout the parallel-in-time literature. For example, Horton, Vandewalle, and Worley [106] proposed parallel-in-time solvers for differential equations using the parallel scan—but applied this technique to other fixed-point iterations like Gauss-Seidel and Jacobi, not Newton. This preference for Gauss-Seidel and Jacobi iterations persisted in many strands of the literature. For example, Deshpande et al. [49] provided a theoretical analysis of convergence rates for these parallel-in-time methods.

^1 Which is extremely similar to parallel Newton methods, which in their simplest form solve bidiagonal systems of equations: see Section 2.4.
In discrete time, Naumov [173] showed how evaluating Markov chains could be cast as a system of nonlinear equations and discussed many techniques from numerical analysis for solving them, again focusing on Jacobi and Gauss-Seidel; Song et al. [213] extended this program with many deep learning experiments. A possible explanation for this preference for Jacobi and Gauss-Seidel iterations is the heavier computational cost of a single Newton iteration, especially on less massively parallelized machines. On the other hand, when Newton iterations were suggested—as by Gander and Vandewalle [67] as an interpretation of parareal iterations [145]—the link to parallelization via the parallel scan was omitted.

Therefore, to the best of my knowledge, the full marriage between parallel scans and Newton iterations had to wait until 2023 and the seminal papers of Danieli et al. [41] and Lim et al. [142]—even though all the necessary ingredients had existed in the literature for at least thirty years. A likely factor in the delay was the fracturing of knowledge and motivations across different communities in dynamics, parallel computation, numerical analysis, and machine learning. Chapter 2 brings together the necessary background from all of these disciplines to close this gap and facilitate communication between these different communities.

Undoubtedly the hardware and software ecosystems also played a role in the eventual development of parallel Newton's method. In software, the standardization of autodifferentiation [27, 155, 184] made the computation of Jacobians less burdensome. In hardware, the development of GPUs with thousands of processors and gigabytes of on-device memory made the cost of Newton iterations far less burdensome than it was on the ILLIAC and other earlier parallel machines. The importance of these software and hardware lotteries [105] in the development of algorithms cannot be overstated.
While this thesis provides an introduction to techniques that let us take "inherently sequential" processes and reduce their latency on parallel hardware, we must remain open to the possibility that further developments in hardware, software, and algorithms may lead to yet more radically different approaches in the future.

1.2 Outline

In this introduction, we provided a brief survey of the history of parallel computation and parallel-in-time algorithms. In particular, we discussed how the recent rise of massively parallel processors like GPUs has further stimulated the advancement of parallel algorithms for "inherently sequential" computation. Building on this work, my thesis has developed scalable methods for parallel-in-time computation and a firm theoretical understanding of under what conditions such parallel-in-time computation makes sense. The rest of this thesis is organized as follows:

- Chapter 2 provides fundamental background for understanding parallel Newton methods, tying together dynamics, parallel computing, and numerical analysis.
- Chapter 3 introduces our first method, a quasi-Newton method for scalable parallelization.
- Chapter 4 introduces our second method, a trust-region method for stable parallelization.
- Chapter 5 establishes our theoretical analysis of convergence rates for the Gauss-Newton optimization method for parallelizing SSMs.
- Chapter 6 studies the convergence rates of a wide class of quasi-Newton methods for parallelizing SSMs.
- Chapter 7 concludes by summarizing the contributions of this thesis and discussing promising directions for future work.

The following publications form the basis of this dissertation. Chapters 3 and 4 are based on:

Xavier Gonzalez, Andrew Warrington, Jimmy T.H. Smith, and Scott W. Linderman. "Towards Scalable and Stable Parallelization of Nonlinear RNNs." In Advances in Neural Information Processing Systems (NeurIPS), 2024.
Chapter 5 is based on:

Xavier Gonzalez*, Leo Kozachkov*, David M. Zoltowski, Kenneth L. Clarkson, and Scott W. Linderman. "Predictability Enables Parallelization of Nonlinear State Space Models." In Advances in Neural Information Processing Systems (NeurIPS), 2025.

Chapter 6 is based on:

Xavier Gonzalez*, E. Kelly Buchanan*, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Leo Kozachkov, Christopher Ré, and Scott W. Linderman. "A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems." In Transactions on Machine Learning Research (TMLR), 2026.

Throughout this thesis, we also include material from:

David M. Zoltowski*, Skyler Wu*, Xavier Gonzalez, Leo Kozachkov, and Scott W. Linderman. "Parallelizing MCMC Across the Sequence Length." In Advances in Neural Information Processing Systems (NeurIPS), 2025.

This last paper extends and develops quasi-Newton methods to parallelize Markov chain Monte Carlo across the sequence length.

2 Background

This thesis uses techniques from applied math to parallelize a class of sequential processes known as state space models (SSMs). Therefore, in this chapter, we provide background on three diverse foundational areas—dynamics, parallel computing, and optimization—so that we can bring them together as parallel Newton methods.

2.1 Dynamics: State Space Models

The first topic this thesis brings into play is dynamics. In particular, we study a class of sequential processes called state space models [172]. In this background section, we define state space models, survey their broad use in statistics and machine learning, and discuss why their evaluation was deemed to be "inherently sequential."

2.1.1 State Space Models (SSMs)

A state space model is a discrete-time dynamical system with a fixed state size.
We denote this state by s_t ∈ R^D, where the subscript t denotes the time of the state, and the dimension D denotes the state size. The state evolves according to a dynamics or transition function as

s_{t+1} = f_t(s_t).   (1)

Importantly, state space models satisfy the Markov property: the state at time t+1 depends only on the state at time t, and not on any of the previous states. Informally, the Markov property means that once we know the present, we can forget the past. Our primary consideration in this thesis is how to evaluate (equivalently, "simulate", "unroll", or "roll out") an SSM from an initial condition s_0. We make this goal precise in the following problem statement:

Problem statement (unrolling an SSM): Evaluate the sequence s_{1:T} = (s_1, s_2, ..., s_T) starting from s_0, where s_t follows the SSM dynamics in equation (1).

Figure 2 indicates graphically how, when we unroll the dynamics of an SSM from a known initial condition s_0, we obtain a computational graph that is a chain of sequential dependencies. Throughout this thesis, we will use T to denote the sequence length.

Figure 2: Unrolling an SSM. We shade the initial state s_0 to indicate that we know the initial condition.

Figure 3: Graphical diagram showing the equivalence (based on currying) between an SSM driven by inputs and an autonomous system with time-varying transition dynamics. We shade the inputs u_t to indicate that they are known.

Often, state space models also take an input u_t ∈ R^D at each time step. Thus, the dynamics become

s_{t+1} = f(s_t, u_t).   (2)

However, as illustrated in Figure 3, we can always curry the input into the dynamics function to obtain an equivalent SSM without inputs. Specifically, we define the curried dynamics functions as f_t(s_t) := f(s_t, u_t).
Thus, we can rewrite the SSM with inputs as an SSM without inputs, as in equation (1). While almost all the SSMs we consider in this thesis take inputs, we will often omit them from the notation for simplicity, relying on the fact that we can always curry them into the dynamics functions.

2.1.2 Examples of SSMs

The framework given in equation (1) is extremely general, and many well-known models in statistics and machine learning can be expressed as SSMs. We summarize some important examples in Table 1.

| SSM | State (s_t) | Input (u_t) | Transition (f) |
| --- | --- | --- | --- |
| Linear dynamical systems (LDS) [154] | State | Input | Linear |
| Deep SSMs [75, 84, 208] | Stack of states | Input | Linear |
| Recurrent neural networks (RNNs) [38, 54, 102, 118, 219] | Hidden state | Input | RNN cell |
| MCMC [51, 71, 244] | Current sample | Noise | Transition kernel |
| Sampling from diffusion models [137, 201, 210] | Noisy image | Noise | Denoising function |
| Explicit differential equation solvers [66, 111, 125] | Current state | N/A | Numerical integrator |
| "Recurrent depth" for transformers [47, 70, 116, 197, 230] | Layer activations | Original input | Transformer block |
| State of reinforcement learning (RL) agent [188, 220] | Environment state | Noise | Environment dynamics |
| Gradient descent [26] | Parameter values | N/A | Gradient step |
| The human brain [228] | Neural activity | Sensory input | Synapses |

Table 1: Some illustrative examples of state space models (SSMs).

The first example in Table 1 is the linear dynamical system (LDS), which has linear transition dynamics; that is, the state evolves as

s_{t+1} = A_t s_t + B_t u_t.   (3)

Thus, an LDS is a special case of an SSM (equation (1)), but where the dynamics function f_t is affine.
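To make the unrolling problem and the currying trick concrete, here is a minimal pure-Python sketch (our own illustration; the names `unroll` and `f_ts` are hypothetical, not from the thesis). The LDS of equation (3) is used as the per-step transition, in the scalar case D = 1, with the known input u_t curried into each f_t:

```python
def unroll(f_ts, s0):
    """Evaluate s_{1:T} for an SSM s_{t+1} = f_t(s_t) from a known s_0."""
    s, traj = s0, []
    for f_t in f_ts:        # one curried transition per time step
        s = f_t(s)
        traj.append(s)
    return traj

# LDS special case (scalar D = 1): s_{t+1} = a_t * s_t + b_t * u_t.
# Currying folds the known input u_t into the per-step transition f_t,
# so the driven system becomes an autonomous time-varying one.
a, b = [0.5, 2.0, 1.0], [1.0, 1.0, 1.0]
u = [1.0, 0.0, 3.0]
f_ts = [lambda s, a_t=a_t, b_t=b_t, u_t=u_t: a_t * s + b_t * u_t
        for a_t, b_t, u_t in zip(a, b, u)]

traj = unroll(f_ts, s0=0.0)  # → [1.0, 2.0, 5.0]
```

Note that `unroll` never sees the inputs: after currying, any SSM with inputs presents exactly the interface of equation (1), which is why the rest of the development can drop u_t from the notation.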
These linear dynamical systems have enjoyed a resurgence in machine learning recently as linear RNNs [160, 181] or deep state space models [85, 86, 207]. In these deep learning architectures, the temporal dynamics of each layer are linear, but the output of each layer is passed through a nonlinearity to become the input of the next layer.

Figure 4: A linear Gaussian state space model (LGSSM). The LGSSM consists of latent variables s_t and observed variables o_t. The generative model of the LGSSM consists of dynamics s_{t+1} ~ N(A s_t, Q) and emissions o_{t+1} ~ N(C s_{t+1}, R).

2.1.2.1 Bayesian inference for linear Gaussian SSMs: Kalman filtering and smoothing

We take a brief aside to discuss Bayesian inference in state space models, as the core primitives of Kalman filtering and smoothing are fundamental to our stable parallelization techniques developed in Chapter 4. We begin by noting that we can include many probabilistic models in the SSM framework by incorporating stochastic inputs into our SSM dynamics equation (2). A fundamental probabilistic model is the linear Gaussian state space model (LGSSM), where the latent variables s_t follow linear dynamics with Gaussian noise, and emit observations o_t with linear readouts with Gaussian noise [171, 194]. See Figure 4. In particular, note that the LGSSM is a simple way to make an LDS a probabilistic object: the latent variables s_t are modeled as an LDS.

Two canonical inferential targets in the LGSSM are the filtering distributions, p(s_t | o_{1:t}), and the smoothing distributions, p(s_t | o_{1:T}). The Kalman filter [119] and Rauch-Tung-Striebel (RTS) smoother^1 [189] obtain the filtering and smoothing distributions (respectively) in an LGSSM.
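As a concrete sketch of the filtering recursion, here is a scalar Kalman filter in pure Python (our own illustration with hypothetical parameter values; this is the plain sequential filter, not the parallelized version developed in Chapter 4). Each step predicts through the linear dynamics and then conditions on the new observation:

```python
def kalman_filter_1d(obs, a, q, c, r, m0, p0):
    """Scalar Kalman filter: returns the means and variances of the
    filtering distributions p(s_t | o_{1:t}) for an LGSSM with
    dynamics s_{t+1} ~ N(a * s_t, q) and emissions o_t ~ N(c * s_t, r),
    starting from the prior s_0 ~ N(m0, p0)."""
    m, p = m0, p0
    means, variances = [], []
    for o in obs:
        # Predict one step through the linear dynamics
        m_pred = a * m
        p_pred = a * p * a + q
        # Update with the new observation (k is the Kalman gain)
        k = p_pred * c / (c * p_pred * c + r)
        m = m_pred + k * (o - c * m_pred)
        p = (1.0 - k * c) * p_pred
        means.append(m)
        variances.append(p)
    return means, variances

means, variances = kalman_filter_1d([1.0, 0.5], a=1.0, q=0.1, c=1.0, r=0.5,
                                    m0=0.0, p0=1.0)
```

The loop above has the same chain-of-dependencies structure as unrolling any other SSM, which is exactly why filtering appears "inherently sequential" at first glance.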
The Kalman filter makes a single pass forward in time to get the filtering distributions, while the RTS smoother then makes an additional pass backwards in time to get the smoothing distributions. Thus, these canonical algorithms for Bayesian inference in LGSSMs would also at first glance seem to be inherently sequential.

2.1.3 Limitation of SSMs: "Inherently Sequential"

Indeed, despite the breadth of use of SSMs across statistics and machine learning, it was widely believed that SSMs were "inherently sequential" to evaluate [122, 208, 226]. With longer sequences, this sequential evaluation becomes a computational bottleneck, especially on modern hardware like GPUs and TPUs that thrive on parallelism. As a result, in keeping with the "hardware lottery" [105], many researchers began to turn away from SSMs in favor of more parallelizable approaches.

However, it turns out that there is a simple but effective way to parallelize our first and simplest example of SSMs in Table 1: linear dynamical systems. Therefore, in our next background section on parallel computing, we review the parallel associative scan that allows us to parallelize linear dynamical systems. Ultimately, as we will see in Section 2.4, a clever use of the parallel scan allows us to parallelize SSMs in general, despite their "inherently sequential" nature.

2.2 Parallel Computing: The Parallel Associative Scan

The parallel scan [24, 216], also known as the associative scan and, colloquially, pscan, is a well-known primitive in the parallel computing literature [99, 136, 138]. The core idea of the parallel scan is a divide-and-conquer algorithm. We illustrate this point in the simple example of multiplying a series of matrices together.

1 Occasionally we will call the RTS smoother a "Kalman" smoother for simplicity.
2.2.1 The Parallel Scan: A Gentle Introduction

Simple example: multiplying a sequence of matrices. Consider the following problem: given a series of square matrices A_1, A_2, ..., A_{T−1}, A_T, compute their product^2, A_T A_{T−1} ... A_2 A_1. The simplest way to carry out the matrix multiplication is sequentially: first take A_1, then compute A_2 A_1, then compute A_3 A_2 A_1, and so on. Such an approach takes O(T) time.

A core insight of the parallel scan is that matrix multiplication is closed; that is, if A_s ∈ R^{D×D} and A_t ∈ R^{D×D}, then A_t A_s ∈ R^{D×D}. Thus, matrix products can be computed recursively in pairs, as illustrated in Figure 5.

Figure 5: Parallel scan for matrix multiplication. We illustrate a divide-and-conquer approach to compute the product A_4 A_3 A_2 A_1. Note that this divide-and-conquer approach naturally leads to O(log T) depth.

Because of the divide-and-conquer (binary-tree-like) nature of this approach to multiplying matrices, with O(T) processors, the time needed to get the matrix product is only O(log T). This simple example illustrates the core intuition behind the parallel scan: a closed operation leading to a divide-and-conquer approach that parallelizes a computation so that it takes sublinear time. However, there are two additional details of the parallel associative scan that we should address: arbitrary binary associative operators and closure; and getting intermediate products.

Detail #1: parallel scans for arbitrary binary associative operators. Matrix multiplication is an associative operator, as A_3 (A_2 A_1) = (A_3 A_2) A_1.

2 Note that we have the matrices act via left-multiplication over the sequence length, because this is the most common way to write matrix-vector products.
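Returning to the simple example, the divide-and-conquer product of Figure 5 can be sketched in a few lines. This is a serial emulation of the parallel tree (each level's pairwise products could run concurrently on separate processors); the particular 2×2 matrices are illustrative assumptions, and T is assumed to be a power of two.

```python
# Divide-and-conquer matrix product, as in Figure 5: pair up adjacent
# matrices and multiply, halving the sequence each round (O(log T) depth).

def matmul(A, B):
    """Multiply two 2x2 matrices given as nested lists."""
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def product_tree(mats):
    """Compute A_T ... A_1 by pairwise combination (left-multiplication order)."""
    level = list(mats)
    while len(level) > 1:
        # Combine adjacent pairs: (A_1, A_2) -> A_2 A_1, (A_3, A_4) -> A_4 A_3, ...
        level = [matmul(level[i + 1], level[i]) for i in range(0, len(level), 2)]
    return level[0]

mats = [[[1, 1], [0, 1]], [[2, 0], [0, 2]], [[1, 0], [1, 1]], [[0, 1], [1, 0]]]
P = product_tree(mats)  # equals A_4 A_3 A_2 A_1
```

Because matrix multiplication is associative, combining in pairs gives the same answer as the left-to-right sequential product.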
In general, consider a binary associative operator ⊗, which satisfies q_3 ⊗ (q_2 ⊗ q_1) = (q_3 ⊗ q_2) ⊗ q_1. Now, let us further assume that this binary associative operator is closed:

Definition 2.1 (Closure). A binary associative operator ⊗ is closed over a set S if it satisfies the property:

q_1 ∈ S, q_2 ∈ S ⇒ q_2 ⊗ q_1 ∈ S.  (4)

If ⊗ is closed, then we can again use a parallel scan to compute the cumulative product of the operands.

A wide range of binary associative operators are closed, and can thus be parallelized with the parallel scan. We have already seen that matrix multiplication is such a binary associative operator. An even simpler example amenable to the parallel scan is scalar addition: addition is clearly associative and closed, so the fact that addition of scalars (and vectors) is closed allows cumulative sums to be computed with a divide-and-conquer approach via the parallel scan algorithm. When the binary associative operator is addition, the parallel scan is also known as the prefix sum algorithm.

Detail #2: obtaining the intermediate terms in the product. The parallel scan is meant to be a parallelized implementation of the Scan primitive from functional programming [22]. However, Scan not only returns the final product A_T A_{T−1} ... A_1, as we illustrated in Figure 5, but also all the intermediate terms A_1, A_2 A_1, A_3 A_2 A_1, etc. In fact, the parallel scan provides all the intermediate terms as well.

We again illustrate with our motivating example of matrix multiplication, in particular the setting where T = 8. We will denote the individual matrices as A_1, A_2, A_3, ..., A_8, and their products as A_{s:t}, e.g. A_{5:6} = A_6 A_5. The first phase of the parallel scan is the up-sweep, which takes log(T) iterations and O(T) memory. Crucially, note that we are using O(T) processors in parallel as well.
We start by multiplying adjacent pairs of matrices together. Looking, for example, at Position 8 of Table 2, we go from A_8 to A_{7:8} to A_{5:8} to A_{1:8}.

| Pos. 1 | Pos. 2 | Pos. 3 | Pos. 4 | Pos. 5 | Pos. 6 | Pos. 7 | Pos. 8 |
|---|---|---|---|---|---|---|---|
| A_1 | A_2 | A_3 | A_4 | A_5 | A_6 | A_7 | A_8 |
| A_1 | A_{1:2} | A_3 | A_{3:4} | A_5 | A_{5:6} | A_7 | A_{7:8} |
| A_1 | A_{1:2} | A_3 | A_{1:4} | A_5 | A_{5:6} | A_7 | A_{5:8} |
| A_1 | A_{1:2} | A_3 | A_{1:4} | A_5 | A_{5:6} | A_7 | A_{1:8} |

Table 2: Up-sweep for multiplying A_1, A_2, ..., A_8.

Then, in the down-sweep, we fill in the missing products to obtain all the cumulative products A_{1:t} for 1 ⩽ t ⩽ T. Intuitively, the down-sweep also takes O(log T) iterations, for the same reason that any natural number T can be represented using 1 + log_2(T) digits in binary.

| Pos. 1 | Pos. 2 | Pos. 3 | Pos. 4 | Pos. 5 | Pos. 6 | Pos. 7 | Pos. 8 |
|---|---|---|---|---|---|---|---|
| A_1 | A_{1:2} | A_3 | A_{1:4} | A_5 | A_{1:6} | A_7 | A_{1:8} |
| A_1 | A_{1:2} | A_{1:3} | A_{1:4} | A_{1:5} | A_{1:6} | A_{1:7} | A_{1:8} |

Table 3: Down-sweep for multiplying A_1, A_2, ..., A_8.

Thus, together, the up-sweep and the down-sweep of the parallel scan run in O(log T) time on O(T) processors, and at the end of this algorithm we get all of the intermediate products^3 (the "prefix sums").

3 See the last row of Table 3.

2.2.2 Parallelizing Linear Dynamical Systems

Having digested the fundamentals of the parallel scan, it becomes apparent that composition of affine functions is also a binary associative operator that is closed. Therefore, it is possible to parallelize the roll-out of an LDS evolving according to equation (3) over the sequence length. In more detail, consider the affine function f_i(x) = A_i x + b_i. Notice that the composition of affine functions is also affine, as f_j(f_i(x)) = A_j A_i x + b_j + A_j b_i. Thus, if we represent the operands as ordered pairs (A_i, b_i) and (A_j, b_j), we can write the associative operator ⊗ for the composition of affine functions as

(A_i, b_i) ⊗ (A_j, b_j) = (A_j A_i, b_j + A_j b_i).  (5)

Thus, we observe that in this setting, ⊗ is closed.
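The up-sweep and down-sweep of Tables 2 and 3 can be written once for any closed binary associative operator, and then instantiated with the affine-composition operator of equation (5) to roll out an LDS. The sketch below does so for a scalar LDS (D = 1) with made-up coefficients, emulating each parallel round serially; T is assumed to be a power of two.

```python
# Up-sweep / down-sweep (Blelloch-style) inclusive scan for a generic closed
# binary associative operator `op`, applied to the affine operator of
# equation (5) to evaluate s_t = A_t * s_{t-1} + b_t in O(log T) depth.

def parallel_scan(op, xs):
    x = list(xs)
    n = len(x)  # assumed to be a power of two
    # Up-sweep (Table 2): each round combines adjacent blocks; rounds could
    # run with one processor per combination.
    d = 1
    while d < n:
        for i in range(2 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d *= 2
    # Down-sweep (Table 3): fill in the remaining cumulative products.
    d = n // 4
    while d >= 1:
        for i in range(3 * d - 1, n, 2 * d):
            x[i] = op(x[i - d], x[i])
        d //= 2
    return x  # position t-1 now holds the cumulative product over 1..t

def compose(op_i, op_j):
    """Equation (5): (A_i, b_i) ⊗ (A_j, b_j) = (A_j A_i, b_j + A_j b_i)."""
    A_i, b_i = op_i
    A_j, b_j = op_j
    return (A_j * A_i, b_j + A_j * b_i)

A = [0.5, 2.0, -1.0, 0.25, 1.0, -0.5, 2.0, 0.5]
b = [1.0, 0.0, 3.0, -2.0, 0.5, 1.0, 0.0, -1.0]
s0 = 2.0

# Scan the affine operands, then read off s_t = A_{1:t} * s0 + b_{1:t}.
prefixes = parallel_scan(compose, list(zip(A, b)))
scanned = [A1t * s0 + b1t for A1t, b1t in prefixes]

# Sequential O(T) rollout of the same LDS for comparison.
s, rollout = s0, []
for A_t, b_t in zip(A, b):
    s = A_t * s + b_t
    rollout.append(s)
```

Because ⊗ is associative and closed, the scanned states match the sequential rollout while needing only O(log T) dependent rounds.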
We should also check that ⊗ is associative: we can do so either with elementary algebra or by observing that function composition is associative. This observation that composition of affine functions can be parallelized with the associative scan is what lets us parallelize LDSs. The insight that LDSs could be parallelized with parallel scans led to a revolution in the deep sequence modeling community based on transformer-alternative architectures, enabling the development of both linear RNNs [160, 181] and deep SSMs [85, 207]. These approaches boil down to sequence mixing layers that are LDSs (and therefore parallelizable with the parallel scan), stacked nonlinearly in depth. As we will see throughout this thesis, decomposing nonlinear SSM dynamics into LDSs that can be parallelized with the parallel scan is our fundamental tool for parallelizing arbitrary SSMs.

2.2.3 Parallelizing Kalman Filtering and Smoothing

In the previous section, we showed how to parallelize the evaluation of a linear dynamical system. In this section, we discuss how to also parallelize Bayesian inference (Kalman filtering and smoothing) in probabilistic models based on LDSs, namely linear Gaussian SSMs (which we reviewed in Subsection 2.1.2.1). Both the Kalman filter and the RTS smoother would seem to be inherently sequential algorithms, requiring O(T) time. However, Särkkä and García-Fernández [192] demonstrated that the Kalman filter and RTS smoother can also be parallelized over the sequence length via the construction of custom binary associative operators and a parallel scan.
While we leave the details of this construction to Särkkä and García-Fernández [192], we note that it is intuitively plausible that filtering and smoothing in an LGSSM can be parallelized with a parallel scan because:

• the dynamical backbone is an LDS, for which we have a parallel scan (cf. equation (5));
• since everything is linear and Gaussian, all distributions remain Gaussian, hinting at closure; and
• we can combine p(s_{t′} | s_0, o_{1:t′}) with p(s_t | s_{t′}, o_{t′+1:t}) to obtain p(s_t | s_0, o_{1:t}), suggesting a divide-and-conquer strategy.

These parallel filtering and smoothing algorithms are useful in machine learning, allowing for parallelization of structured variational autoencoders [115, 243]. Similar approaches also work for hidden Markov models [92] and for computing log-normalizing constants [107].

2.2.4 The Difficulties of Parallelizing an SSM in General

The astute reader might note that the composition of functions, i.e. f_1 ∘ f_2, is always a binary associative operator. So, why do we have all these special cases of parallel scans, and not simply one parallel scan for the composition ∘ of arbitrary functions f_i? The reason to have many different parallel scans is precisely the importance of having the binary associative operator be closed. In all the previous examples, the binary associative operator ⊗ satisfies Definition 2.1, letting us easily store combinations of operands q_i ⊗ q_j and so employ a divide-and-conquer technique. While we could consider some gigantic function space F, over which function composition would be closed, the practical question then becomes: how would we store the combinations of operands? If we do not have some compact representation for elements of F, then we cannot use a parallel scan in practice, even though the parallel scan may seem applicable in theory. Nonetheless, we still have the parallel scan for parallelizing LDSs.
When one has a hammer (the parallel scan for LDSs), everything begins to look like a nail. Thus, one might attempt the seemingly hacky approach of taking a nonlinear dynamical system and iteratively

• linearizing the system; then
• evaluating the linearized system in parallel with the parallel scan.

Incredibly, this approach, which motivates this thesis, is not a hack but is rather an instantiation of Newton's method! Therefore, in the next section, we review Newton's method in optimization and numerical analysis generally.

2.3 Numerical Analysis: Newton's Method

Newton's method is one of the most fundamental approaches in root-finding, optimization, and numerical analysis generally [26, 108, 179, 180]. In this background section, we review the fundamentals of Newton's method and other related techniques in root-finding, optimization, and fixed-point methods.

2.3.1 Root-finding

Consider a high-dimensional nonlinear function r(s) : R^P → R^P. A standard problem in numerical analysis is to find a root of such a function, i.e. to find s⋆ for which r(s⋆) = 0. In high dimensions and for a complicated function, it is not immediately obvious how one might find such a zero in an efficient manner. However, if our function is affine, i.e. r(s) = Ms + b, then provided M is invertible there is at least a straightforward way to find the root s⋆, namely s⋆ = −M^{-1} b. Newton's method for root-finding for a differentiable function r is based on the idea of iteratively

• linearizing r around our current guess s^{(i)} to form the affine function r̂^{(i)}; and then
• finding the root of r̂^{(i)}, and making it our new guess s^{(i+1)}.

Figure 6: Newton's method for root-finding. Here we illustrate three iterations of Newton's method for root-finding on the one-dimensional cubic function r(s) = (s − 0.4)^3 + 0.45(s − 0.4).
We observe that each iteration of Newton's method involves linearizing the function to obtain r̂^{(i)}(·) (shown in color) and then finding the zero of this linearization to obtain our next guess.

We show a graphical depiction of Newton's method for a one-dimensional function r(·) : R → R in Figure 6. Let us define the notational shorthand J^{(i)} := (∂r/∂s)(s^{(i)}), where we choose J to stand for the Jacobian matrix (i.e. derivative) of r. With this notation, we see that the first step is given by a first-order Taylor expansion of r around s^{(i)}, i.e.

r̂^{(i)}(s) := r(s^{(i)}) + J^{(i)} (s − s^{(i)}).

Thus, we see that every step of Newton's method for root-finding, where we aim to find the zero of r̂^{(i)}(s), is given by

s^{(i+1)} = s^{(i)} − (J^{(i)})^{-1} r(s^{(i)}).  (6)

Of course, for equation (6) to be valid, we must have J^{(i)} invertible; but for the parallel Newton methods considered in this dissertation, it always will be (see equation (17)). Another limitation of Newton's method immediately visible from equation (6) is the need to store and invert J^{(i)} ∈ R^{P×P}. In particular, the matrix inversion requires O(P^3) floating point operations (FLOPs). While the implementation of the numerical linear algebra can be optimized [77], the overall cost of Newton's method has inspired a broad literature on cheaper, approximate quasi-Newton methods [148, 178, 179]. We build on and contribute to this quasi-Newton literature in Chapter 3.

While Figure 6 shows intuitively why Newton's method can be a powerful technique for root-finding, let us discuss some of its convergence properties further and more formally.

Convergence of Newton's method. Newton's method is known to enjoy quadratic convergence within a basin around the solution s⋆ [108, 179, 180].
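Equation (6) is only a couple of lines of code in one dimension. The sketch below applies it to the cubic of Figure 6 and records the error at each step; the starting guess and iteration count are illustrative assumptions.

```python
# Newton's method for root-finding, equation (6), on the cubic of Figure 6:
# r(s) = (s - 0.4)^3 + 0.45 (s - 0.4), whose unique real root is s* = 0.4.

r = lambda s: (s - 0.4) ** 3 + 0.45 * (s - 0.4)
dr = lambda s: 3 * (s - 0.4) ** 2 + 0.45  # the 1-D "Jacobian" J

s, errors = 1.2, []
for _ in range(8):
    s = s - r(s) / dr(s)         # the Newton update of equation (6)
    errors.append(abs(s - 0.4))  # distance to the root s* = 0.4
```

Consistent with the quadratic convergence discussed next, the error roughly squares at every step until it reaches floating-point precision.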
One way to define quadratic convergence, following the presentation in Nocedal and Wright [179], is via the notion of Q-convergence, which is short for quotient-convergence:

Definition 2.2 (Q-convergence). Consider a sequence of iterates s^{(i)} which converges to a limit s⋆ as i → ∞. Then this sequence Q-converges to s⋆ with order q and with rate of convergence γ if, for all i sufficiently large,

∥e^{(i+1)}∥ ⩽ γ ∥e^{(i)}∥^q,  (7)

where the errors are defined by

e^{(i)} := s^{(i)} − s⋆,  (8)

and ∥·∥ is any valid vector norm.

If the order q = 1, we say that the iterative method enjoys linear convergence, while if q = 2, it enjoys quadratic convergence. In linear convergence, the error satisfies^4 ∥e^{(i)}∥ ⩽ γ^i ∥e^{(0)}∥, indicating that the norm of the error decays exponentially in the number of iterations, with base γ. We must have γ < 1 for linear convergence to converge to a limit. In quadratic convergence, the error satisfies ∥e^{(i)}∥ ⩽ (γ ∥e^{(0)}∥)^{2^i} / γ, indicating that the norm of the error decays doubly exponentially with base γ ∥e^{(0)}∥. Again, however, to actually enjoy decrease with quadratic convergence, we must have γ ∥e^{(0)}∥ < 1, giving rise to the basin of quadratic convergence B_Q given by

B_Q := { s^{(i)} : ∥s^{(i)} − s⋆∥ < 1/γ }.

With this definition, we now provide a simple proof^5 that Newton's method enjoys a quadratic rate in a basin around the solution s⋆.

Proposition 2.3. Say we are trying to find a root of r(s) : R^P → R^P with Newton's method as defined in equation (6). If we assume that J(s) is L-Lipschitz and is always invertible with ∥J(s)^{-1}∥ ⩽ β for all s, then Newton's method converges quadratically in the basin given by

{ s^{(i)} : ∥s^{(i)} − s⋆∥ ⩽ 2/(Lβ) }.

4 Here, we slightly abuse notation to make e^{(0)} the first iterate for which the inequality in (7) holds.

5 Following, e.g., the proof of Proposition 4 of Lu, Zhu, and Hou [153].

Proof.
Subtract s⋆ from both sides of equation (6) to obtain

e^{(i+1)} = e^{(i)} − J(s^{(i)})^{-1} r(s^{(i)}).

Taylor expanding r(·) around s^{(i)}, we get the equality

r(s⋆) = r(s^{(i)}) − J(s^{(i)}) e^{(i)} + R^{(i)},

where the remainder satisfies ∥R^{(i)}∥ ⩽ (L/2) ∥e^{(i)}∥^2. Since r(s⋆) = 0, it follows that

e^{(i+1)} = J(s^{(i)})^{-1} R^{(i)}.

Taking norms on both sides and applying the assumptions, it follows that

∥e^{(i+1)}∥ ⩽ (Lβ/2) ∥e^{(i)}∥^2,

i.e. Newton's method enjoys quadratic convergence in the specified basin. ∎

However, this quadratic convergence of Newton's method only holds locally, i.e. for initial guesses s^{(0)} that are close to the zero s⋆. It is stronger and more helpful to have guarantees for global convergence, i.e. assurances that an iterative solver will converge (and with a specified rate) no matter the initial guess s^{(0)}. Unfortunately, in general, Newton's method does not enjoy global convergence guarantees [179]. We illustrate with a simple example.

Figure 7: Newton's method can globally diverge. A graphical depiction showing how Newton's method for root-finding can globally diverge for a simple function like r(s) = s^{1/3}.

Example 2.4 (Newton's method can diverge: r(s) = s^{1/3}). Consider the standard cube root function r(s) = s^{1/3} defined on all of the real line. At all points s ∈ R, the derivative of this function is given by r′(s) = (1/3) s^{−2/3}. So, plugging into equation (6), it follows that, no matter the initial guess, the Newton iterates follow s^{(i+1)} = −2 s^{(i)}. Consequently, we observe that for any initial guess other than the unique solution s⋆ = 0, Newton's method will diverge for this function, as shown in Figure 7. Note that Proposition 2.3 does not even apply in this setting, because the derivative is neither Lipschitz nor does its inverse have a uniform bound.
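Example 2.4 is easy to check numerically. A sketch, using a sign-aware cube root to stay on the real line (the starting point and iteration count are arbitrary):

```python
# Example 2.4 in code: Newton's method on r(s) = s^(1/3) doubles the error
# and flips its sign at every step (s <- -2s), diverging from any nonzero s.
import math

def cbrt(s):
    """Real cube root, defined for negative inputs as well."""
    return math.copysign(abs(s) ** (1 / 3), s)

def dcbrt(s):
    """Derivative r'(s) = (1/3) s^(-2/3), which is positive for s != 0."""
    return (1 / 3) * abs(s) ** (-2 / 3)

s = 1.0
trajectory = [s]
for _ in range(5):
    s = s - cbrt(s) / dcbrt(s)  # the Newton step collapses to s <- -2s
    trajectory.append(s)
```

Starting from s = 1, the iterates are 1, −2, 4, −8, 16, −32 (up to floating-point error): the sign alternates and the magnitude doubles, exactly as the algebra predicts.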
Studying the convergence rates of parallel Newton methods, as well as their possible instabilities, is a major theme of this thesis.

2.3.2 Optimization

While Newton's method is usually first presented in an introduction to calculus as a method for root-finding (Subsection 2.3.1), it is best known in machine learning in the context of optimization. Say we have an objective function F(s) : R^P → R that is twice differentiable, and we wish to find its minimum, i.e.

s⋆ = argmin_{s ∈ R^P} F(s).

For large dimension P and a complicated objective function F(s), optimization can be very difficult. In fact, the problem of high-dimensional optimization is one of the central problems of machine learning [89, 117, 127, 227]. However, if F(s) is a convex quadratic function, i.e. we can write F(s) = (1/2) sᵀ M s + bᵀ s + c for a positive-definite matrix M, then its unique minimizer is given by s⋆ = −M^{-1} b.

Newton's method for optimization of a twice-differentiable function is directly analogous to Newton's method for root-finding for a differentiable function. In Newton's method for root-finding, we built on the fact that we could solve invertible linear systems, and so for a nonlinear system r(s) = 0, we iteratively linearize r(·) and solve. In Newton's method for optimization, we build on the fact that we have a closed-form solution for the minimum of a convex quadratic, and so for a twice-differentiable function F(·), we iteratively build and minimize the quadratic surrogate of F(·). The quadratic surrogate for F(·) at our current guess s^{(i)} is given by

F̂_i(s) = F(s^{(i)}) + ∇_s F(s^{(i)})ᵀ (s − s^{(i)}) + (1/2) (s − s^{(i)})ᵀ ∇²_s F(s^{(i)}) (s − s^{(i)}),  (9)

where ∇²_s F(s^{(i)}) ∈ R^{P×P} is the Hessian of F(·) evaluated at s^{(i)}.
Therefore, if the Hessian is positive-definite, the minimizer of F̂_i(s), and therefore the formula for the next iterate in Newton's method for optimization, is

s^{(i+1)} := −(∇²_s F(s^{(i)}))^{-1} (∇_s F(s^{(i)}) − ∇²_s F(s^{(i)}) s^{(i)}) = s^{(i)} − (∇²_s F(s^{(i)}))^{-1} ∇_s F(s^{(i)}).

However, we recognize this update for Newton's method for optimization as the same as Newton's method for root-finding in equation (6), where the function whose root we seek is ∇_s F(·) : R^P → R^P. Thus, we see that Newton's method for optimization of a function F(·) is nothing more than Newton's method for root-finding applied to the derivative of F(·). This connection is part of the rich interplay in numerical analysis between root-finding (finding the zero of a function) and optimization (finding the minima of a function) [179]. The fact that Newton's method for optimization of an objective function F(·) is equivalent to Newton's method for root-finding applied to its derivative ∇F(·) makes sense because, for a differentiable function F(s) : R^P → R, its minima lie among its stationary points (the set of points where its derivative satisfies ∇F(s) = 0).

Gauss–Newton method for optimization of sums of squares. However, there are even more connections between root-finding and optimization. If we return to the problem of finding a root of a residual function r(s) : R^P → R^P, we observe that we can form a merit function

L(s) := (1/2) ∥r(s)∥²₂.  (10)

Because L(s) is a sum-of-squares objective, it is greater than or equal to zero, and we observe that L(s⋆) = 0, meaning the root s⋆ of r(·) is also the minimizer of the merit function^6 L(·). By basic calculus, we observe that the gradient and Hessian of L are given by

∇L(s) = Jᵀ r,
∇²L(s) = Jᵀ J + Σ_{i=1}^{P} r_i(s) ∇² r_i(s),

where J(s) := (∂r/∂s)(s).
6 While it is admittedly counterintuitive to desire to "minimize" a "merit function," we follow the naming convention set by the classic textbook of Nocedal and Wright [179].

While we could apply Newton's method for optimization to L, we know from our previous discussion that this would be Newton's method for root-finding applied to the gradient ∇L(s) = Jᵀ(s) r(s), and not Newton's method for root-finding applied to the original residual function r(s). However, a very simple modification called the Gauss–Newton method restores the link between optimization of the sum-of-squares merit function L and root-finding of the residual r. In the Gauss–Newton method, we apply Newton's method, but we approximate the Hessian by Jᵀ J. The Gauss–Newton method is thus a way to get the benefit of second-order methods while only taking one derivative. Moreover, its updates take the form

s^{(i+1)} = s^{(i)} − (Jᵀ J)^{-1} Jᵀ r.

If J is invertible, then the Gauss–Newton updates take the form s^{(i+1)} = s^{(i)} − J^{-1} r, which we again recognize as equation (6), i.e. root-finding for r. Note, therefore, that if J is invertible, then Gauss–Newton as an optimization technique for L is mathematically equivalent to Newton's method for root-finding applied to r. For this reason, another interpretation of the Gauss–Newton method is as linearizing the residual function r(s): that is, each step of the Gauss–Newton method minimizes the quadratic loss

L̂_{s^{(i)}}(s) := (1/2) ∥r(s^{(i)}) + J(s^{(i)}) (s − s^{(i)})∥²₂.

For small residuals, the Newton and Gauss–Newton methods have similar convergence properties (cf. [179]). Importantly, just like Newton's method for root-finding, they can also both diverge globally. For example, take Example 2.4 and turn it into an optimization problem: with objective function F(s) = s^{4/3}, Newton's method for optimization diverges, and with merit function L(s) = s^{2/3}, Gauss–Newton diverges.
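On well-behaved problems, though, the Gauss–Newton update above is only a few lines. A sketch on a made-up two-dimensional residual with a root at (1, 1); the residual function, starting point, and iteration count are illustrative assumptions, and since J is square and invertible here, the update coincides with Newton's method for root-finding, equation (6).

```python
# Gauss-Newton on the residual r(s) = (s0^2 + s1 - 2, s0 + s1^2 - 2),
# which has a root at (1, 1). With square, invertible J, the update
# s <- s - (J^T J)^{-1} J^T r reduces to the Newton step s <- s - J^{-1} r.

def residual(s0, s1):
    return (s0 ** 2 + s1 - 2, s0 + s1 ** 2 - 2)

def jacobian(s0, s1):
    return ((2 * s0, 1.0), (1.0, 2 * s1))

def newton_step(s0, s1):
    r0, r1 = residual(s0, s1)
    (a, b_), (c, d) = jacobian(s0, s1)
    det = a * d - b_ * c
    # Solve J @ delta = r using the closed-form 2x2 inverse.
    d0 = (d * r0 - b_ * r1) / det
    d1 = (-c * r0 + a * r1) / det
    return s0 - d0, s1 - d1

s0, s1 = 1.2, 0.8
for _ in range(20):
    s0, s1 = newton_step(s0, s1)
merit = 0.5 * sum(ri ** 2 for ri in residual(s0, s1))  # L(s) of equation (10)
```

As the iterates converge to the root, the merit function L(s) is driven to zero, illustrating the equivalence between minimizing L and finding a root of r.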
2.3.3 Fixed-point Methods

We can also write each step of Newton's method as the action of an operator A_N(s) : R^P → R^P, i.e.

s^{(i+1)} = A_N(s^{(i)}),
A_N(s^{(i)}) = s^{(i)} − (J^{(i)})^{-1} r(s^{(i)}).

Importantly, note that if s⋆ is a root of r(s), i.e. r(s⋆) = 0, then it follows that A_N(s⋆) = s⋆, i.e. s⋆ is a fixed point of the Newton's method operator A_N. In general, a fixed-point problem aims to find s⋆ satisfying F(s⋆) = s⋆ for some function F(s) : R^P → R^P. Any fixed-point problem can be interpreted as a root-finding problem by defining r(s) = s − F(s) and then asking for s⋆ such that r(s⋆) = 0. Because of all of these connections, Newton's method is also a foundational concept in fixed-point methods and solvers [180]. As we will see in Chapter 6, many different fixed-point methods can be used to parallelize SSMs, including Picard and Jacobi iterations; we discuss these in more detail there.

2.4 Putting It All Together: Parallel Newton Methods

In the previous sections, we reviewed dynamics, parallel computation, and numerical analysis, with the goal of combining these three diverse fields to parallelize the unrolling of state space models. In this section, we combine these three ingredients to show how parallel Newton methods allow for the parallelization of such "inherently sequential" processes.

2.4.1 Parallel Newton Methods: DEER and DeepPCR

Concurrently, Lim et al. [142] developed DEER and Danieli et al. [41] developed DeepPCR, both of which are the same parallel Newton method for the parallelization of SSMs. This section reviews their foundational work. Throughout this thesis, we use the terms "DeepPCR", "DEER", and "parallel Newton methods" interchangeably. The fundamental idea of parallelizing SSMs is to replace sequential evaluation with parallel iterative evaluation.
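As a minimal grounding for the fixed-point view of Subsection 2.3.3, the sketch below solves a toy fixed-point problem two ways: by direct (Picard) iteration s ← F(s), and by Newton's method on the associated root-finding problem r(s) = s − F(s). The choice F(s) = cos(s) and the iteration counts are illustrative assumptions.

```python
# A fixed-point problem F(s*) = s* solved by Picard iteration and by
# Newton's method on r(s) = s - F(s). Here F(s) = cos(s), whose fixed
# point is the Dottie number (about 0.739085).
import math

F = lambda s: math.cos(s)

# Picard iteration: converges linearly, since |F'(s*)| = |sin(s*)| < 1.
s_picard = 0.0
for _ in range(100):
    s_picard = F(s_picard)

# Newton's method on r(s) = s - cos(s), with r'(s) = 1 + sin(s).
s_newton = 0.0
for _ in range(10):
    s_newton = s_newton - (s_newton - math.cos(s_newton)) / (1 + math.sin(s_newton))
```

Both iterations reach the same fixed point, but Newton's method gets there in far fewer steps, previewing the trade-offs among fixed-point methods discussed in Chapter 6.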
We compare these two approaches to evaluating an SSM in Figure 8. Going forward, we will denote the true roll-out from the SSM over the entire trajectory of length T as s⋆ ∈ R^{TD}, i.e. s⋆_1 = f_1(s_0), s⋆_2 = f_2(s⋆_1), and in general s⋆_t = f_t(s⋆_{t−1}).^7 Note that at initialization, s⋆ ≠ s^{(0)}, i.e. we may be initializing in a way that is not faithful at all to the true SSM dynamics.

7 In our discussion of parallel Newton methods, and henceforth in this thesis, we will use bold script for variables of shape TD or TD × TD, and plain script for variables of shape D or D × D. So, bolding will be reserved for variables that extend over the sequence length, while variables at a particular point in time will not be bolded. We follow this convention to distinguish between operations that occur across the sequence length and those that occur at a particular point.

Figure 8: Sequential evaluation versus parallel iterative evaluation. Left: sequential evaluation steps through the sequence. Right: parallel evaluation iterates over the whole sequence, and can converge in fewer steps.
-Nonlinear RNNs can also be parallelized by treating them as the solution of fixed point equation (DEER 2 , Lim et al, ‘24) -Parallelizing nonlinear RNNs can accelerate their evaluation and training by making better use of GPUs -We make parallelizing RNNS scalable using a quasi- Newton method and stable using a trust-region Summary -The residual (what we want to find the root of) is -So, the Jacobian in this particular problem is -Define the Newton step as -The Newton step is the solution of a solve, i.e. find satisfying -Because has block bidiagonal structure, this solve can be evaluated using forward substitution. -The forward substitution gives a simple linear recursion with the initial condition , and for t>1 Background: DEER & Newton’s method . Algorithms . Parallel Scan -Make diagonal approximation. -Quasi-Newton update. -Compute & memory efficient. Stability: ELK -Newton’s method can be stabilized with a trust region. -The resulting penalized objective is the solution to a Kalman smoother 3 and can be parallelized 4 . Figure 2: Sequential evaluation versus parallel iterative evaluation. Left: Sequential evaluation steps through the sequence. Right: Parallel evaluation iterates over the whole sequence, and can converge in fewer steps. 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . !s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . 
!s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . !s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . !s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . !s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . !s (i+1) :=s (i+1) !s (i) !s 40 12 Presentation s 0 s 1 s 2 s 3 s 4 s (0) 1 s (0) 2 s (0) 3 s (0) 4 s (1) 1 s (1) 2 s (1) 3 s (1) 4 +!s (1) +!s (2) +!s (i) J t>1 r(s 1:T ):=[s 1 !f(s 0 ),s 2 !f(s 1 ),...,s T !f(s T!1 )] J(s):= !r !s (s)= ⎛ ⎜ ⎜ ⎜ ⎜ ⎜ ⎝ I D 0... 0 0 ! !f !s (s 1 )I D ...00 . . . . . . . . . . . . . . . 00...I D 0 00...! !f !s (s T!1 )I D ⎞ ⎟ ⎟ ⎟ ⎟ ⎟ ⎠ . 
The Newton step solves $-J(s^{(i)})\,\Delta s = r(s^{(i)})$, which forward substitution reduces to the linear recursion $\Delta s^{(i+1)}_1 = -r_1(s^{(i)})$ and, for $t > 1$, $\Delta s^{(i+1)}_t = \left[\frac{\partial f}{\partial s}(s^{(i)}_{t-1})\right]\Delta s^{(i+1)}_{t-1} - r_t(s^{(i)})$.

For ELK, the trust-region-penalized objective is the log-joint of a linear-Gaussian state space model,
$$\mathcal{L}(s, \lambda) = \sum_{t=1}^{T} \log\mathcal{N}\!\left(s^{(i)}_t \,\middle|\, s_t, \tfrac{1}{\lambda} I_D\right) + \log\mathcal{N}\!\left(s_1 \mid f_1(s_0), I_D\right) + \sum_{t=2}^{T} \log\mathcal{N}\!\left(s_t \,\middle|\, f_t(s^{(i)}_{t-1}) + \left[\tfrac{\partial f_t}{\partial s}(s^{(i)}_{t-1})\right](s_{t-1} - s^{(i)}_{t-1}), I_D\right),$$
so it can be maximized with a parallelized Kalman smoother.
Algorithm 4 DEER
1: procedure DEER(f, s_0, init_guess, tol)
2:   diff ← ∞
3:   states ← init_guess
4:   while diff > tol do
5:     shifted_states ← [s_0, states[:−1]]
6:     fs ← f(shifted_states)
7:     Js ← GetJacobians(f, shifted_states)
8:     bs ← fs − Js @ shifted_states
9:     new_states ← ParallelScan(Js, bs)
10:    diff ← ‖states − new_states‖₁
11:    states ← new_states
12:  end while
13:  return states
14: end procedure
A parallel scan combines the affine maps $(A_t, b_t)$ as a binary tree: the leaves $(A_1, b_1), (A_2, b_2), (A_3, b_3), (A_4, b_4)$ combine pairwise into $(A_2 A_1,\; A_2 b_1 + b_2)$ and $(A_4 A_3,\; A_4 b_3 + b_4)$, which combine into $(A_4 A_3 A_2 A_1,\; A_4 A_3 A_2 b_1 + A_4 A_3 b_2 + A_4 b_3 + b_4)$, giving $\Theta(\log T)$ depth over a length-$T$ sequence.
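The tree combine above is just composition of affine maps. Here is a minimal pure-Python sketch of the binary associative operator, shown with scalar $A_t$ and $b_t$ and made-up values (the codebase applies the same operator inside `jax.lax.associative_scan`):

```python
def combine(left, right):
    # Compose affine maps: apply `left` first, then `right`.
    # (A_r, b_r) ∘ (A_l, b_l) = (A_r A_l, A_r b_l + b_r)
    A_l, b_l = left
    A_r, b_r = right
    return (A_r * A_l, A_r * b_l + b_r)

# Leaves (A_t, b_t) of the tree, encoding s_t = A_t s_{t-1} + b_t.
leaves = [(2.0, 1.0), (3.0, 1.0), (5.0, 1.0), (7.0, 1.0)]

# Combining up the tree in O(log T) depth...
top = combine(combine(leaves[0], leaves[1]), combine(leaves[2], leaves[3]))

# ...matches the fully sequential left-to-right composition.
seq = leaves[0]
for leaf in leaves[1:]:
    seq = combine(seq, leaf)
assert top == seq

# Applying the composed map to s_0 gives s_T directly.
s0 = 0.0
sT = top[0] * s0 + top[1]   # equals the sequential roll-out of the LDS
```

Because `combine` is associative, any bracketing of the leaves gives the same result, which is exactly what lets the scan be evaluated as a balanced tree rather than a chain.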
Figure 4: Using quasi-DEER to learn a time-series classifier with an input sequence length of 18,000. Left: Validation accuracy. Center: Wallclock time per parameter update (quasi-DEER is two times faster). Right: Newton iterations per update.

Evaluating at the Edge of Stability (Stanford University, Linderman Lab)

Figure 8: Evaluating the Lorenz96 system in parallel. (Top two rows): Same format as Figure 7. (Bottom row): Plot of the Lorenz96 trajectory during optimization. DEER methods are noticeably more unstable than ELK methods.

B.6 Background on Parallel Scans

For a more detailed reference on parallel scans, the interested reader should refer to Appendix H of Smith et al. [65] or to Blelloch [7]. In our codebase, we leverage jax.lax.associative_scan with the correct binary associative operator.
The binary associative operator for DEER and quasi-DEER is simply the composition of affine maps, while the binary associative operator for Kalman filtering can be found in Särkkä and García-Fernández [59] and in dynamax [12].

C Additional Background on Newton's Method

In this appendix, we provide additional background on Newton's method and why it is useful for parallelizing nonlinear RNNs.

Newton's method provably enjoys quadratic (very fast) convergence in a basin near the true solution. Moreover, as evidenced by its widespread usage across many domains, Newton's method can exhibit fast convergence in practice. However, a major motivation for this paper is that globally, Newton's method can be unstable and converge slowly. This instability is a major motivation for our development of ELK.

A core insight of Lim et al. [36] is that in the setting of evaluating RNNs, Newton's method can be cast as a parallel scan (called DEER). At each "Newton iteration," DEER linearizes the nonlinear dynamics of the RNN it is evaluating. To the extent that linear approximations are a powerful tool across a wide variety of domains (e.g., Taylor expansions), this linear approximation can be
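Both behaviors of Newton's method, fast local convergence and global failure, already appear in one-dimensional root-finding. A small sketch (the cycling cubic is a standard textbook example, not from the paper):

```python
def newton(r, rprime, x0, iters):
    # Plain undamped Newton iteration for root-finding.
    xs = [x0]
    for _ in range(iters):
        xs.append(xs[-1] - r(xs[-1]) / rprime(xs[-1]))
    return xs

# Local quadratic convergence: r(x) = x^2 - 2, started near sqrt(2).
# The error roughly squares at each step.
xs = newton(lambda x: x * x - 2.0, lambda x: 2.0 * x, 1.5, 5)
assert abs(xs[-1] - 2.0 ** 0.5) < 1e-12

# Global failure: r(x) = x^3 - 2x + 2 started at x0 = 0 falls into the
# 2-cycle 0 -> 1 -> 0 -> ..., so the undamped iteration never converges.
cyc = newton(lambda x: x ** 3 - 2.0 * x + 2.0,
             lambda x: 3.0 * x * x - 2.0, 0.0, 6)
assert cyc == [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
```

Damping or a trust region (as in ELK) is precisely what prevents the second kind of behavior.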
Figure 6: Evaluating the Lorenz96 chaotic system (5 dimensions, F = 8). Top: Maximum absolute difference (MAD) across Newton iterations (left) and wallclock time (right). The DEER methods are unstable, but converge with our resetting heuristic. Bottom: Intermediate trajectories of the first three coordinates. The Lorenz96 dynamics are
$$\frac{dx_i}{dt} = (x_{i+1} - x_{i-2})\,x_{i-1} - x_i + F.$$
Quasi-DEER replaces the dense Jacobian blocks with their diagonals, giving the update
$$\Delta s^{(i+1)}_t = \mathrm{diag}\!\left[\frac{\partial f}{\partial s}(s^{(i)}_{t-1})\right]\Delta s^{(i+1)}_{t-1} - r_t(s^{(i)}).$$

Figure 1: Overview of the parallelizable methods we consider in this paper. We introduce diagonal approximations to improve complexity (quasi-DEER, Section 4.1) and link to Kalman filtering and trust regions to improve stability (ELK, Section 4.2). We combine these ideas in quasi-ELK (Section 4.2).

Table 1: Description of the relative strengths and weaknesses of the five evaluation methods we consider. We include a discussion of this in Section 7.

| Method | Parallel | Work | Memory | Stability |
| --- | --- | --- | --- | --- |
| Sequential | No | O(TD^2) | O(D) | Very high |
| DEER [Lim et al. '24] | Yes | O(TD^3) | O(TD^2) | Low |
| Quasi-DEER | Yes | O(TD) | O(TD) | Low |
| ELK | Yes | O(TD^3) | O(TD^2) | High |
| Quasi-ELK | Yes | O(TD) | O(TD) | Moderate |

By using a parallel scan to evaluate updates from Newton's method, DEER inherits O(TD^2) memory complexity and O(TD^3) computational work [7]. These costs can be prohibitive in practical deep learning settings. The second limitation of DEER is numerical stability, inherited from Newton's method. In general, undamped Newton's method does not provide global convergence guarantees, and in practice it often diverges [49]. We seek to ameliorate both of these weaknesses.

To do this, we leverage two techniques: quasi approximations and trust regions. Quasi approximations are a common adaptation of Newton's method, in which approximate but faster and less memory-intensive updates are used in place of exact "full" Newton steps.
Empirically, these are often observed to expedite convergence in terms of wallclock time, even though more Newton iterations are used. We apply quasi approximations to remove the memory and compute scaling inherited by DEER, also finding accelerated convergence and reduced memory consumption. Second, we leverage a connection between Newton's method with a trust region and Kalman smoothing in sequential models [71]. This allows us to stabilize the Newton iteration by limiting the step size (to the radius of the trust region), preventing large and numerically unstable steps, while still being able to use parallelized Kalman smoothers [59, 12], achieving a parallel runtime that is logarithmic in the sequence length. We refer to DEER accelerated with a quasi approximation as quasi-DEER, and DEER stabilized with trust regions as "Evaluating Levenberg-Marquardt via Kalman" (ELK). We then combine these, yielding a fast and stable algorithm, which we term quasi-ELK.

Crucially, DEER, ELK, and their quasi variants are algorithms for parallelizing any discrete-time nonlinear dynamical system, including stateful architectures such as RNNs, that may or may not include stochasticity. We use "parallel" to refer to the fact that each iteration of our iterative algorithm operates on the entire T-length sequence (and not on each sequence element one at a time).

We outline the key contributions and organization of the paper here. We first introduce background material, particularly focusing on DEER [36], in Sections 2 and 3. We then present three short novel proofs: that DEER is globally convergent; that this convergence is robust to modifications of the linearized dynamics (Proposition 1); and that there is a unique solution with no local minima (Appendices A.1 and A.2).
We then introduce quasi approximations to DEER to improve efficiency (quasi-DEER, Section 4.1), and trust regions to stabilize DEER (ELK, Section 4.2). We also provide an interpretation of how trust regions stabilize the dynamics by damping the eigenvalues of the Jacobians (Section 4.2 and Appendix A.3). We show empirically that quasi-DEER remains accurate, with reduced runtime and memory consumption (Section 6). In regimes where DEER is numerically unstable or converges slowly, we show that ELK and quasi-ELK can enjoy fast, numerically stable convergence. We conclude by discussing the relative strengths and weaknesses of each method, providing guidance on how to select and tune them, and highlighting avenues for future research (Section 7). We provide our code at https://github.com/lindermanlab/elk.

2 Problem Statement

We consider nonlinear Markovian state space models, with the state at time $t$ denoted $s_t \in \mathbb{R}^D$ and nonlinear transition dynamics $f : \mathbb{R}^D \to \mathbb{R}^D$. We denote the full sequence of $T$ states as $s_{1:T} \in \mathbb{R}^{T \times D}$. Note that we will be mainly considering the transition dynamics in this paper, and

Table 1: Summary of the features of the evaluation algorithms.

Proposition 1: DEER and quasi-DEER are globally convergent in at most T iterations. Proof: by induction, noting that $s_0$ is fixed. Corollary: we can reset states later in the sequence and still get convergence (in the case of instability).

Figure 5: Evaluating an AR GRU that generates sine waves. Sequential evaluation is the fastest, with q-ELK being the fastest parallelized method (2x slower than sequential, 6x faster than DEER).
Algorithm 4 ParallelizeRNN
1: procedure ParallelizeRNN(f, s_0, init_guess, tol, method, quasi)
2:   diff ← ∞
3:   states ← init_guess
4:   while diff > tol do
5:     shifted_states ← [s_0, states[:−1]]
6:     fs ← f(shifted_states)
7:     Js ← GetJacobians(f, shifted_states)
8:     if quasi then
9:       Js ← Diag(Js)
10:    bs ← fs − Js · shifted_states
11:    if method = 'deer' then
12:      new_states ← ParallelScan(Js, bs)
13:    else if method = 'elk' then
14:      new_states ← ParallelKalmanFilter(Js, bs, states)
15:    diff ← ‖states − new_states‖₁
16:    states ← new_states
17:  return states
18: end procedure

13 Inference as trust region

13.0.1 Filtering

We apply Kalman-DEER in a fully scalar setting, where each state is a scalar. By Taylor's theorem,
$$x_{t+1} = f_{t+1}(x_t) = f_{t+1}\!\left(x^{(i)}_t + (x_t - x^{(i)}_t)\right) \approx f_{t+1}(x^{(i)}_t) + \frac{df_{t+1}}{dx_t}(x^{(i)}_t)\,(x_t - x^{(i)}_t).$$
So our updates are given by
$$x^{(i+1)}_{t+1} = f_{t+1}(x^{(i)}_t) + \frac{df_{t+1}}{dx_t}(x^{(i)}_t)\,(x^{(i+1)}_t - x^{(i)}_t).$$
Written another way, the updates are given by
$$x^{(i+1)}_{t+1} = \frac{df_{t+1}}{dx_t}(x^{(i)}_t)\,x^{(i+1)}_t + \left[f_{t+1}(x^{(i)}_t) - \frac{df_{t+1}}{dx_t}(x^{(i)}_t)\,x^{(i)}_t\right].$$
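The Taylor linearization underlying these updates can be checked numerically. In this sketch we use a hypothetical scalar dynamics $f(x) = \tanh(0.8x + 0.3)$ (chosen for illustration, not from the paper); halving the displacement roughly quarters the linearization error, as the second-order remainder predicts:

```python
import math

def f(x):                       # hypothetical scalar dynamics
    return math.tanh(0.8 * x + 0.3)

def fprime(x):                  # df/dx by the chain rule
    return 0.8 * (1.0 - math.tanh(0.8 * x + 0.3) ** 2)

x0 = 0.4
errs = []
for h in (0.1, 0.05, 0.025):
    lin = f(x0) + fprime(x0) * h          # first-order Taylor model at x0
    errs.append(abs(f(x0 + h) - lin))

# The remainder scales like h^2, so each error ratio should be near 4.
assert 3.0 < errs[0] / errs[1] < 5.0
assert 3.0 < errs[1] / errs[2] < 5.0
```

This quadratic shrinkage of the linearization error is what makes the linearized update an increasingly good proxy for the true dynamics as the iterates approach the solution.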
Algorithm 1: Pseudocode for the parallelized algorithms (color-coded).

Figure 3: A parallel scan converts a sequential scan into a binary tree.

Figure 1: We introduce diagonal approximations, and trust regions through Kalman filtering, to scale and stabilize DEER.
Figure 8: Comparison of standard sequential evaluation of an SSM (left; complete in exactly T sequential steps) with parallel iterative evaluation of an SSM (right; complete in at most T sequential steps), shown for sequence length T = 4. In the parallel iterative paradigm, we make a guess over the entire sequence, as indicated by the top right row labeled $s^{(0)}_1, s^{(0)}_2, \ldots$.
Using parallel computation over the sequence length, we find an update $\Delta s^{(i)}$ to go from our current guess $s^{(i)}$ for the entire trajectory to our next guess $s^{(i+1)}$. Adapted from Figure 1 of Lim et al. [142].

We need to make updates $s^{(i+1)} = s^{(i)} + \Delta s^{(i)}$ in a way that brings our guesses close to $s^\star$ in a small number of iterations. This desideratum raises an important question about our updates $\Delta s^{(i)}$: How can we compute a useful $\Delta s^{(i)}$ in a way that uses parallel computation over the sequence length? While the addition $s^{(i)} + \Delta s^{(i)}$ is embarrassingly parallel over the sequence length, we will not achieve our goal of parallelizing SSMs over the sequence length if computing the update $\Delta s^{(i)}$ itself requires inherently sequential computation.

Parallel Newton methods offer an ingenious way to compute these updates $\Delta s$ using parallel computation over the sequence length. The core insight is that even though our initial guess $s^{(0)}$ may be completely wrong, and even though we do not use the true roll-out $s^\star$ at any point in the computation, we can still use the SSM dynamics from equation (1) to measure how wrong our current guess is. We measure how wrong our guess is with its residual vector $r(s^{(i)}) \in \mathbb{R}^{TD}$. Each entry $r_t$ of the residual vector is given by the one-step prediction error, i.e.
$$r_t(s^{(i)}) := s^{(i)}_t - f_t(s^{(i)}_{t-1}). \tag{11}$$
Crucially, $r(s^\star) = 0$ because $s^\star$ follows the SSM dynamics, and in fact $s^\star$ is the unique zero of $r(\cdot)$. Thus, by defining the residual in equation (11), we have recast the problem of SSM evaluation as a high-dimensional root-finding problem: starting from an initial guess $s^{(0)}$, find $s^\star \in \mathbb{R}^{TD}$ such that
$$r(s^\star) = 0. \tag{12}$$

2.4 Putting it all together: parallel Newton methods

We discussed exactly this type of problem in our background subsection 2.3.1 on root-finding! As their name suggests, parallel Newton methods solve this high-dimensional nonlinear equation (12) using Newton's method.
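Equation (11) is easy to exercise on a toy scalar SSM. In this sketch (with a hypothetical dynamics function, not from the thesis), the sequential roll-out is by construction the unique zero of the residual, while any other guess has a nonzero residual:

```python
import math

def f(x):                                   # hypothetical scalar dynamics f_t ≡ f
    return math.tanh(0.8 * x + 0.3)

def residual(s, s0):
    # r_t(s) = s_t - f(s_{t-1}): the one-step prediction error of eq. (11).
    prev = [s0] + s[:-1]
    return [s_t - f(p) for s_t, p in zip(s, prev)]

s0, T = 0.0, 10

# True roll-out s* by sequential evaluation.
s_star, x = [], s0
for _ in range(T):
    x = f(x)
    s_star.append(x)

assert all(v == 0.0 for v in residual(s_star, s0))      # r(s*) = 0 exactly
assert any(v != 0.0 for v in residual([0.0] * T, s0))   # other guesses: r ≠ 0
```

Note that evaluating the residual at a guess only requires one application of $f$ per time step, all of which are independent, so measuring "how wrong we are" is itself embarrassingly parallel over the sequence length.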
Moreover, in the specific case of evaluating SSMs, where the residual at each time step is given by equation (11), each Newton update is given by a linear dynamical system and so can be evaluated using a parallel scan. That each step of Newton's method for finding the zero of the residual defined in equation (11) is an LDS comes from the fact that at each step, Newton's method linearizes the residual. To review, at each step of Newton's method for root-finding, we find the root of the linearized residual $\hat{r}^{(i)}(s)$, where each entry of $\hat{r}^{(i)}(s)$ is given by
$$\hat{r}^{(i)}_t(s) = s_t - \underbrace{\left[f_t(s^{(i)}_{t-1}) + A^{(i)}_t\,(s_{t-1} - s^{(i)}_{t-1})\right]}_{\text{linearization of the dynamics function } f_t \text{ at } s^{(i)}_{t-1}}, \tag{13}$$
where throughout this thesis we use the shorthand
$$A^{(i)}_t := \frac{\partial f_t}{\partial s_{t-1}}(s^{(i)}_{t-1}) \in \mathbb{R}^{D \times D}. \tag{14}$$
Since each step of Newton's method involves finding $s^{(i+1)}$ such that $\hat{r}^{(i)}(s^{(i+1)}) = 0$, we see from equation (13) that setting each component of $\hat{r}^{(i)}$ to zero gives rise to the LDS
$$s^{(i+1)}_t = A^{(i)}_t s^{(i+1)}_{t-1} + \underbrace{f_t(s^{(i)}_{t-1}) - A^{(i)}_t s^{(i)}_{t-1}}_{b^{(i)}_t}. \tag{15}$$
But as discussed in subsection 2.2.2, with $O(T)$ processors we can evaluate any LDS in $O(\log T)$ computational depth. Thus, we have shown that on a massively parallel machine like a GPU, we can evaluate each iteration of a parallel Newton method in $O(\log T)$ time. If we can converge in fewer than $O(T/\log T)$ iterations, then for sufficiently long sequence lengths and powerful parallel processors, we would expect to see wallclock speedups from parallelizing SSMs. We summarize the parallel Newton methods in Algorithm 1, and provide a more detailed derivation in the next section.

2.4.2 More in-depth derivation

We provide an alternative derivation of the parallel Newton update in equation (15) to highlight important notions.
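Update (15) can be sketched end-to-end in a few lines. Below, a scalar toy with a hypothetical dynamics $f(x) = \tanh(0.8x + 0.3)$ (chosen for illustration, not from the thesis) runs the parallel Newton iteration; the inner linear recursion is written sequentially for clarity, but it is exactly the LDS a parallel scan would evaluate in $O(\log T)$ depth:

```python
import math

def f(x):                                # hypothetical scalar dynamics
    return math.tanh(0.8 * x + 0.3)

def fprime(x):
    return 0.8 * (1.0 - math.tanh(0.8 * x + 0.3) ** 2)

T, s0 = 64, 0.0

# Ground truth by sequential evaluation.
true, x = [], s0
for _ in range(T):
    x = f(x)
    true.append(x)

s = [0.0] * T                            # initial guess s^(0), deliberately wrong
for i in range(T):                       # Proposition-1 bound: at most T iterations
    prev = [s0] + s[:-1]
    A = [fprime(p) for p in prev]        # A_t^(i), computable for all t in parallel
    b = [f(p) - a * p for p, a in zip(prev, A)]   # b_t^(i) of eq. (15)
    new, xc = [], s0
    for t in range(T):                   # the LDS s_t = A_t s_{t-1} + b_t
        xc = A[t] * xc + b[t]
        new.append(xc)
    done = max(abs(n - g) for n, g in zip(new, s)) < 1e-12
    s = new
    if done:
        break

# The iterates match the sequential roll-out to high accuracy.
assert max(abs(a - b_) for a, b_ in zip(s, true)) < 1e-9
```

Note that the first state becomes exact after one iteration (since $s_0$ is fixed), and by induction state $t$ is exact after at most $t$ iterations, which is the Proposition-1 worst case; for contractive dynamics like this one, convergence is much faster.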
Algorithm 1 Parallel Newton methods for evaluating nonlinear SSMs

    procedure ParallelNewton(f, s_0, initial guess s_{1:T}^{(0)}, tolerance ε)
        for i = 0, 1, ..., T do
            A_{1:T}, b_{1:T} ← LinearizeDynamics(f, s_0, s_{1:T}^{(i)})        ▷ For all t in parallel
            s_{1:T}^{(i+1)} ← EvaluateLDS(A_{1:T}, b_{1:T}, s_0, s_{1:T}^{(i)})  ▷ pscan has O(log T) depth
            if ComputeError(f, s_{1:T}^{(i+1)}) < ε then break
        return s_{1:T}^{(i+1)}

To apply Newton's method for root-finding to the residual used in DEER/DeepPCR (defined coordinate-wise in equation (11)), the update given in equation (6) is

    s^{(i+1)} = s^{(i)} − J(s^{(i)})^{−1} r(s^{(i)}),    (16)

where the second term on the right-hand side is the update Δs^{(i)}, and where the Jacobian matrix J := ∂r/∂s (s) ∈ R^{TD×TD} is a block bidiagonal matrix of the form

    J = [  I_D     0     ...    0      0
          −A_2    I_D    ...    0      0
           ⋮       ⋮      ⋱     ⋮      ⋮
           0      0      ...    I_D    0
           0      0      ...   −A_T    I_D ],    (17)

where the A_t ∈ R^{D×D} are defined as in (14). Importantly, the Jacobian J in equation (17) is always invertible, with all eigenvalues equal to one. Storing and naively inverting the Jacobian is infeasible for large state size D or sequence length T. However, since J(s) is block bidiagonal, we can solve for Δs (i.e. −J(s^{(i)})^{−1} r(s^{(i)})) by forward substitution. This reduces to a linear recursion with the initial condition

    Δs_1^{(i)} = −r_1(s^{(i)}),  and for t > 1,  Δs_t^{(i)} = A_t^{(i)} Δs_{t−1}^{(i)} − r_t(s^{(i)}).    (18)

Plugging equation (18) into equation (6) and simplifying, we again obtain the DEER/DeepPCR update equation (15). We emphasize that the Newton update Δs^{(i)} ∈ R^{TD} is given by Δs^{(i)} = −J(s^{(i)})^{−1} r(s^{(i)}).

The beauty of parallel Newton updates for SSMs is that we can exploit the particular block bidiagonal structure of J (shown in equation (17)) to invert J using a parallel scan. However, it is also worth examining J^{−1}, which is itself a structured matrix.
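The forward-substitution recursion can be verified directly against a dense solve. Below is a small numpy sketch (toy sizes, randomly drawn Jacobian blocks) that builds the block bidiagonal J, runs the linear recursion Δs_1 = −r_1, Δs_t = A_t Δs_{t−1} − r_t, and checks it equals −J^{−1} r:

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 5, 2
As = rng.standard_normal((T, D, D))  # As[t] plays the role of A_{t+1}; As[0] unused
r = rng.standard_normal((T, D))

# dense block bidiagonal J: identity blocks on the diagonal, -A_t just below it
J = np.eye(T * D)
for t in range(1, T):
    J[t * D:(t + 1) * D, (t - 1) * D:t * D] = -As[t]

# forward substitution: the linear recursion for the Newton update
ds = np.zeros((T, D))
ds[0] = -r[0]
for t in range(1, T):
    ds[t] = As[t] @ ds[t - 1] - r[t]

# matches the dense solve: ds = -J^{-1} r
assert np.allclose(ds.ravel(), -np.linalg.solve(J, r.ravel()))
```
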
Using the example of T = 4 to demonstrate, we see that J^{−1} takes the form

    J^{−1} = [  I_D           0        0      0
                A_2           I_D      0      0
                A_3 A_2       A_3      I_D    0
                A_4 A_3 A_2   A_4 A_3  A_4    I_D ].    (19)

What equation (19) is meant to demonstrate is that, in general, J^{−1} is itself block lower triangular, with each block being a product of a sequence of Jacobian matrices. This particular form for J^{−1} makes sense because we know from equation (18) that we can invert J with an LDS, and so applying J^{−1} should simply be equivalent to applying the convolution that this LDS defines. Studying the properties and conditioning of J^{−1} will be crucial to proving convergence rates of parallel Newton methods, which we do in Part I.

We note that, as discussed in Subsection 2.3.2, all of the above can be interpreted as applying the Gauss-Newton method for optimization to a merit function L(s) = (1/2) ∥r(s)∥_2^2, in addition to the provided interpretation as Newton's method for root-finding on r(s). This optimization perspective on parallel Newton methods is foundational for this dissertation, as we use it to develop scalable and stable methods (Part I), as well as to prove convergence rates (Part I).

Finally, we can also view each iteration of equation (16) as a fixed-point iteration [169]. In this way, another perspective on DEER is that it recasts RNNs, and nSSMs in general, in the framework of deep equilibrium models (DEQs) [7, 8]. Fixed-point methods, including Newton iterations [124], are commonly used in the field of multidisciplinary optimization (MDO) in aeronautical engineering [128, 161]. All of these fields have deep connections to DEER, and interesting future work could involve exploring them.

2.4.3 Limitations of Newton's method

Equation (15) is the fundamental update behind Newton's method for parallelizing an SSM. However, it also contains the ingredients behind some critical limitations of "plain vanilla" Newton's method: scalability and stability.
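The block structure of J^{−1} claimed in equation (19) can be checked numerically. This toy numpy sketch (random blocks, T = 4) inverts the dense J and confirms that block (t, k) is the product A_t A_{t−1} ··· A_{k+1} below the diagonal, the identity on it, and zero above it:

```python
import numpy as np

rng = np.random.default_rng(2)
T, D = 4, 2
A = rng.standard_normal((T, D, D))  # A[t] plays the role of A_{t+1}; A[0] unused

J = np.eye(T * D)
for t in range(1, T):
    J[t * D:(t + 1) * D, (t - 1) * D:t * D] = -A[t]

Jinv = np.linalg.inv(J)

# block (t, k) of J^{-1} is A_t A_{t-1} ... A_{k+1} for t > k, I for t == k, 0 for t < k
for t in range(T):
    for k in range(T):
        block = Jinv[t * D:(t + 1) * D, k * D:(k + 1) * D]
        if t < k:
            expected = np.zeros((D, D))
        else:
            expected = np.eye(D)
            for j in range(k + 1, t + 1):
                expected = A[j] @ expected  # left-multiply to build A_t ... A_{k+1}
            # expected == I when t == k (empty product)
        assert np.allclose(block, expected)
```
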
Methodological limitations: scalability and stability.  The difficulty in scaling equation (15) comes from the need to instantiate T Jacobian matrices, each of which is in R^{D×D}. Because the parallel scan must instantiate all of these matrices simultaneously, doing so requires O(TD^2) memory, which can be prohibitive for large state size or long sequence length. Moreover, because the parallel scan involves dense matrix-matrix multiplies, the total computational work is O(TD^3). While the factor of T in the work is divided across parallel processors⁸, the cubic cost in state size can also make the method prohibitively slow for large state size. For these reasons, the update equation (15) is difficult to use in practice at scale, often running out of memory or running too slowly.

The difficulties in stability for equation (15) also come from studying its behavior as a linear dynamical system. In particular, the spectral norm of any matrix A_t^{(i)} measures the maximum amount by which it may increase the size of a vector to which it is applied. So, intuitively, if the spectral norms of too many Jacobian matrices in equation (15) are larger than one, the update equation (15) may be highly unstable, resulting in numerical overflow and slow convergence. These difficulties with stability are common for Newton methods in general; see Example 2.4.

Gaps in theoretical understanding: convergence properties.  Finally, both foundational works of Danieli et al. [41] and Lim et al. [142] explicitly left open the question of the global convergence of the parallel Newton method, i.e., will the method converge regardless of our initial guess s^{(0)}? In general, Newton's method does not enjoy such properties, as we showed in Example 2.4.
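The overflow risk is easy to see even in one dimension. A toy sketch (scalar LDS, float32, hypothetical gains standing in for Jacobian spectral norms): repeated application of a gain below one stays finite, while a gain above one grows geometrically until it overflows single precision.

```python
import numpy as np

def run_lds(a, T=300):
    """Iterate the unforced scalar LDS x <- a * x in float32 for T steps."""
    x = np.float32(1.0)
    with np.errstate(over="ignore"):  # silence the expected overflow warning
        for _ in range(T):
            x = np.float32(a) * x
    return x

assert np.isfinite(run_lds(0.9))  # contracting: |x_T| = 0.9**300, tiny but finite
assert np.isinf(run_lds(1.5))     # expanding: 1.5**300 exceeds the float32 range
```
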
But confidence that the method will robustly and globally converge is important for broad deployment of the method. Moreover, while it is broadly known that Newton's method enjoys quadratic convergence in a basin around its solution [26, 142, 179], it was unclear if anything more could be said specifically about the rates of convergence of parallel Newton methods. In particular, it was unclear if we could generally expect speed-ups from parallelization in arbitrary SSMs, or if there were certain SSMs that benefit from parallelization and other SSMs that are more efficient to evaluate sequentially.

Resolving these scaling and stability limitations of the parallel Newton method (Part I) and providing general theory about its convergence properties (Part I) are the contributions of the rest of this thesis.

⁸ If we have O(T) parallel processors, the O(T) work is done in O(log T) computational depth.

Part I

METHODS: SCALABLE AND STABLE PARALLELIZATION

The second part of this thesis presents its methodological contributions. We develop methods for scalable and stable parallelization of nonlinear SSMs. We achieve scalability using a quasi-Newton method we develop and call quasi-DEER. We achieve stability using a trust region method we develop and call ELK: Evaluating Levenberg-Marquardt with Kalman.

Figure 9: The ungulates. This methods part of this thesis introduces scalable and stable variants of DEER (adding a diagonal Jacobian for scalability yields quasi-DEER; adding a trust region and a Kalman filter for stability yields ELK and quasi-ELK). Broadly, we call these methods "parallel Newton methods." More colloquially, we call these methods "the ungulates," which are large hoofed mammals like deer and elk.
The experiments in this part are based on the code available at: https://github.com/lindermanlab/elk

3 SCALABLE PARALLELIZATION: QUASI-NEWTON METHODS

As we discussed in the Introduction and in the Background (Section 2.4), the parallel Newton methods of Danieli et al. [41] and Lim et al. [142] provide a novel approach to parallelize nonlinear state space models (nSSMs), even though evaluating nSSMs had long been believed to be "inherently sequential." However, it is well known in numerical analysis that Newton's method—while an extremely powerful and fundamental method—has many limitations (see Section 2.3 as well as a textbook treatment in [179]). The common thread throughout this thesis is how we can leverage the vast literature on numerical analysis to extend, improve, and understand parallel Newton methods.

In this chapter, we focus in particular on the limitation of Newton's method with respect to scalability. In general, to find the root s^∗ of a high-dimensional function r(·) : R^P → R^P, Newton's method has updates of the form s^{(i+1)} = s^{(i)} − J^{−1} r. This Newton update is prohibitive for large dimension P because it involves

• computing the derivative J;
• storing the P×P matrix J; and
• inverting this matrix.

All three steps are expensive in either compute or memory. For a parallel Newton method, the dimension of s is TD, where T is the sequence length and D is the state size. Thus, forming a TD×TD matrix is in general intractable. Parallel Newton methods avoid forming J explicitly, instead using the structure (equation (11)) of the one-step prediction error r(·) to cast each step of Newton's method as a linear dynamical system (LDS) (equation (15)):

    s_t^{(i+1)} = A_t^{(i)} s_{t−1}^{(i+1)} + f_t(s_{t−1}^{(i)}) − A_t^{(i)} s_{t−1}^{(i)}.
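Putting the pieces together, here is an end-to-end toy sketch of the DEER iteration (plain numpy, not the thesis's `elk` implementation; the inner LDS is written sequentially for clarity, though it is exactly the scan-parallelizable part). On a small contracting nonlinear SSM, the iterates converge to the sequential rollout in far fewer sweeps than the sequence length:

```python
import numpy as np

def deer(f, jac, s0, T, tol=1e-8):
    """Parallel Newton (DEER) sweeps: linearize, then evaluate the LDS of eq. (15)."""
    D = s0.shape[0]
    s = np.zeros((T, D))                      # arbitrary initial guess s^(0)
    for i in range(T + 1):                    # at most T sweeps are ever needed
        prev = np.concatenate([s0[None], s[:-1]])
        A = np.stack([jac(p) for p in prev])  # Jacobians A_t at the current guess
        b = np.stack([f(p) for p in prev]) - np.einsum("tij,tj->ti", A, prev)
        new, cur = np.zeros_like(s), s0
        for t in range(T):                    # the Newton-step LDS (scan-parallelizable)
            cur = A[t] @ cur + b[t]
            new[t] = cur
        if np.max(np.abs(new - s)) < tol:
            return new, i + 1
        s = new
    return s, T + 1

# toy contracting nonlinear SSM: s_t = tanh(W s_{t-1} + u)
rng = np.random.default_rng(0)
D, T = 2, 64
W = 0.3 * rng.standard_normal((D, D))
u = rng.standard_normal(D)
f = lambda p: np.tanh(W @ p + u)
jac = lambda p: (1.0 - np.tanh(W @ p + u) ** 2)[:, None] * W  # exact Jacobian of f
s0 = rng.standard_normal(D)

# sequential ground truth
s_star, cur = [], s0
for _ in range(T):
    cur = f(cur)
    s_star.append(cur)
s_star = np.stack(s_star)

s_deer, sweeps = deer(f, jac, s0, T)
assert np.allclose(s_deer, s_star, atol=1e-6)
assert sweeps < T  # converges in far fewer sweeps than sequential steps
```
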
However, each A_t ∈ R^{D×D}, and so parallelizing this LDS using a parallel scan results in work that scales as O(TD^3) and a memory requirement that scales as O(TD^2). For large state sizes and sequence lengths, these costs soon become prohibitive.

Fortunately, there exists a wide literature [179] on quasi-Newton methods that use some approximation J̃ for J. In this chapter, we explore ways to scale parallel Newton methods by introducing quasi-Newton methods that are amenable to a parallel scan.

3.1 QUASI-DEER: A DIAGONAL APPROXIMATION

We propose a very simple quasi-Newton approximation we call quasi-DEER¹, where we use the diagonal of the Jacobians, i.e. we use updates of the form

    s_t^{(i+1)} = diag[A_t^{(i)}] s_{t−1}^{(i+1)} + f_t(s_{t−1}^{(i)}) − diag[A_t^{(i)}] s_{t−1}^{(i)}.    (20)

We developed this diagonal approximation because of its compatibility with the parallel scan and because of its lower computational and memory cost (versus dense matrix multiplication). To be compatible with the parallel scan, the operands of the chosen binary operator must crucially remain closed under composition (Definition 2.1). Fortunately, the product of two diagonal matrices is again a diagonal matrix. Moreover, using diagonal matrices is clearly more memory and compute efficient than using dense matrices. Both the memory cost of storing and the computational work of multiplying these diagonal matrices now scale only as O(TD), i.e. linearly with the state size.

However, this quasi-DEER method based on a diagonal approximation of the Jacobian of the dynamics is very different from anything in the standard quasi-Newton literature [179]. Some immediate and natural questions are:

1. will this approach even converge?
2. if this approach does converge, will it converge in few enough iterations to actually be useful?
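The closure property is what makes the diagonal choice scan-friendly. A minimal numpy sketch: quasi-DEER scan elements store only the Jacobian diagonals as vectors, so the pairwise combine is elementwise (O(D) work and memory per combine instead of O(D^3) and O(D^2)), and it agrees with the dense affine composition when the matrices really are diagonal:

```python
import numpy as np

def combine_diag(e1, e2):
    """Compose two affine maps whose linear parts are diagonal (stored as vectors)."""
    a1, b1 = e1
    a2, b2 = e2
    return a2 * a1, a2 * b1 + b2  # diagonal matrices are closed under products

rng = np.random.default_rng(0)
D = 4
a1, b1 = rng.standard_normal(D), rng.standard_normal(D)
a2, b2 = rng.standard_normal(D), rng.standard_normal(D)

# elementwise combine matches the dense composition (A2 A1, A2 b1 + b2)
a, b = combine_diag((a1, b1), (a2, b2))
assert np.allclose(np.diag(a), np.diag(a2) @ np.diag(a1))
assert np.allclose(b, np.diag(a2) @ b1 + b2)
```
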
One response to these questions is to note that while a diagonal approximation is one type of matrix that enjoys an efficient parallel scan, in general any form of approximation Ã_t to A_t would yield a quasi-DEER method² if the class of matrices used for Ã_t is closed under composition and has memory and compute costs that scale linearly in D. We discuss in Chapter 6 how many foundational fixed-point methods can be interpreted as different forms of quasi-DEER for different approximations Ã_t.

Incredibly, however, all such quasi-DEER methods (including the full Newton method DEER and the diagonal approximation) enjoy global convergence. Note that Newton's method in general may fail to converge. This global convergence of parallel Newton methods and all quasi-versions of the form proposed is a special feature of the particular problem of parallelizing nonlinear SSMs (cf. equation (11)). Thus, we can answer our first question: yes, this diagonal approximation in fact converges globally.

Moreover, the diagonal approximation also performs well empirically. We showcase its performance for evaluating and training nonlinear RNNs, including on a benchmark dataset from computational neuroscience. Finally, beyond global convergence, we can provide a bound on how slowly quasi-DEER can converge, which is effectively based on the quality of the approximation for different dynamical systems.

In the rest of this chapter, we will discuss the global convergence of quasi-DEER and all of its variants, as this result is foundational for this thesis and line of work. We will also showcase the experiments showing the empirical usefulness of the method.

¹ Lim et al. [142] names parallel Newton methods DEER, for "Differential Equations as fixed-point itERation."
² i.e., an efficient iterative step that makes use of the parallel scan
We defer a discussion of quasi-DEER convergence rates to Chapter 6 in the "Theory" part of this thesis. We instead conclude this chapter with discussions of further extensions that have been made to quasi-DEER, as well as promising directions for future work.

3.2 GLOBAL CONVERGENCE

In general, Newton's method is not guaranteed to converge (Example 2.4). This general risk of failing to converge led both Danieli et al. [41] and Lim et al. [142] to flag the question of convergence in parallel Newton methods as an important open question, though neither answered this question. In fact, this question of DEER's convergence was answered in 1989 by Bellen and Zennaro [18, Remark 2.1], which we rediscovered in Gonzalez et al. [80, Proposition 1]. Not only is DEER globally convergent, but so are a wide variety of quasi-DEER methods, including the one using the diagonal approximation.

Proposition 3.1. Consider the problem of finding s_{1:T}^⋆ which satisfies s_t^⋆ = f_t(s_{t−1}^⋆) and s_1^⋆ = f_1(s_0), for known dynamics functions {f_t}_{t=1}^T and initial condition s_0. Also consider an iterative method A(·) of the form s_{1:T}^{(i+1)} = A(s_{1:T}^{(i)}), where the action of the operator A(·) can be written as a linear dynamical system over the sequence length, i.e. each application of A takes the form

    s_t^{(i+1)} = Ã_t s_{t−1}^{(i+1)} + f_t(s_{t−1}^{(i)}) − Ã_t s_{t−1}^{(i)},    (21)

for arbitrary matrices {Ã_t}_{t=1}^T. Then updates based on A(·) will converge to s_{1:T}^⋆ in at most T iterations, regardless of the initial guess s_{1:T}^{(0)}.

Proof. The intuition for this proof is that the initial condition s_0 is fixed and known, and that each iteration of A(·) as given by equation (21) makes at least one more term in the sequence correct, while not disturbing any previously correct terms. Formally, we prove this theorem by induction.

Base case: we know the initial condition s_0, as it is fixed and given by assumption.
Induction hypothesis: assume at iteration (i) that s_{1:t_i}^{(i)} = s_{1:t_i}^⋆, i.e. the first t_i terms are correct.

Induction step: we need to show that s_{1:t_i+1}^{(i+1)} = s_{1:t_i+1}^⋆, i.e. that none of the previously correct terms become wrong, and that at least one more term becomes correct. Rewriting equation (21) as

    s_t^{(i+1)} = f_t(s_{t−1}^{(i)}) + Ã_t (s_{t−1}^{(i+1)} − s_{t−1}^{(i)}),    (22)

we see that if s_{t−1} is correct at both iterations (i) and (i+1), i.e. s_{t−1}^{(i)} = s_{t−1}^{(i+1)} = s_{t−1}^⋆, then it must be the case that s_t^{(i+1)} = f_t(s_{t−1}^⋆) = s_t^⋆. Of course, s_0^{(i+1)} = s_0^{(i)} = s_0^⋆ because s_0 is a fixed and known initial condition. So, by the above logic, it follows that if s_{1:t_i}^{(i)} = s_{1:t_i}^⋆, then s_{1:t_i+1}^{(i+1)} = s_{1:t_i+1}^⋆. Since we have shown in the induction step that one more correct term always accrues with each application of A(·), and because of our base case that s_0^{(0)} = s_0^⋆, the result follows from induction. □

Proposition 3.1 is significant and interesting for a number of reasons.

First, Proposition 3.1 answers the question posed by both Danieli et al. [41] and Lim et al. [142]: does DEER converge globally? In general, Newton's method does not enjoy global convergence³, but we show that not only DEER but in fact a wide family of quasi-DEER methods all enjoy global convergence. This special behavior is a result of the special structure of our residual r(·) that arises from parallelizing SSMs (see equation (11)). Proposition 3.1, as stated as Proposition 1 in Gonzalez et al. [80], was the first of its kind for global convergence in the context of parallelizing nonlinear RNNs with Newton iterations. While on the one hand this result was surprising, since Newton's method can in general diverge (Figure 7), this exact result was known in the parallel-in-time literature: see Bellen and Zennaro [18, Remark 2.1] and Gander and Vandewalle [67, Remark 4.7].
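The inductive argument can be watched happening numerically. In this toy numpy sketch we take the extreme case Ã_t = 0 (which reduces equation (21) to Picard iteration) on dynamics with no stability assumption, and check that after i sweeps the first i states are exactly correct, with full convergence in at most T sweeps:

```python
import numpy as np

rng = np.random.default_rng(3)
D, T = 3, 12
W = rng.standard_normal((D, D))  # no stability assumption on the dynamics
f = lambda p: np.tanh(W @ p)
s0 = rng.standard_normal(D)

# ground-truth sequential rollout
s_star, cur = [], s0
for _ in range(T):
    cur = f(cur)
    s_star.append(cur)
s_star = np.stack(s_star)

# the update (21) with arbitrary \tilde A_t; here \tilde A_t = 0 (Picard iteration)
A_tilde = np.zeros((T, D, D))
s = rng.standard_normal((T, D))  # arbitrary initial guess
for i in range(T):
    prev = np.concatenate([s0[None], s[:-1]])
    new, cur = np.zeros_like(s), s0
    for t in range(T):
        cur = A_tilde[t] @ cur + f(prev[t]) - A_tilde[t] @ prev[t]
        new[t] = cur
    s = new
    # the "causal" prefix: after i+1 sweeps the first i+1 states are correct
    assert np.allclose(s[: i + 1], s_star[: i + 1])

assert np.allclose(s, s_star)  # full convergence in at most T sweeps
```
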
These results were also rediscovered in the context of parallelizing sampling from diffusion models (another nonlinear SSM). Notably, Shih et al. [201] proved a special case of Proposition 3.1 for Ã_t = I_D using the same proof-by-induction mechanism. Tang et al. [221] then proved an even stronger result that includes our Proposition 3.1 as a special case. We include an extended discussion of Theorem 3.6 of [221] in Appendix A. All in all, this core result has been rediscovered many times in different communities.

Second, Proposition 3.1 ensures that arbitrary approximations can be used in the computation of the Jacobians A_t = ∂f_t/∂s_{t−1} without damaging the global convergence (though the convergence rate may slow). This guarantee of global convergence extends not only to the diagonal approximation proposed in Gonzalez et al. [80], but also to a stochastic version proposed in Zoltowski et al. [244], which we will discuss further in Subsection 3.4.1. In fact, as Zoltowski et al. [244] demonstrates, Proposition 3.1 ensures global convergence even when the dynamics f_t are not differentiable, as in the Metropolis-Hastings algorithm [37, 95, 165] for Markov chain Monte Carlo (MCMC). Zoltowski et al. [244] shows empirically that updates based on equation (21) work well even for non-differentiable dynamics f_t by using an intelligent choice of surrogate gradient Ã_t.

Finally, the proof by induction of Proposition 3.1 highlights how parallel Newton methods converge in a "causal" manner, i.e. from the start at the initial condition s_0 to the end at s_T.

³ In fact, as Hubbard and Hubbard [108] write in their classic textbook: "no one knows anything about the global behavior of Newton's method."
This arrow of causality has important implications both for the design of parallel Newton variants, as we will see in Chapter 4, and for an interpretation of what parallel Newton methods are doing, as we will see in Chapter 5. Furthermore, this "causal convergence" also yields a useful heuristic when parallelizing systems that are unstable or at the edge of stability: if intermediate computations in the parallel Newton method should ever overflow numerically, they can always be reset to an arbitrary value without damaging global convergence (though of course slowing the rate of convergence). We make great use of this "reset heuristic" in Chapter 4. Finally, this left-to-right convergence also justifies implementing parallel Newton methods with a sliding window [201, 244], where equation (21) is applied to only t_c states at a time. While using t_c < T will increase the number of iterations needed to converge, the memory layout and other architectural features of GPUs can make certain choices of t_c < T result in wallclock speedups compared to naively applying equation (21) over the entire sequence length [201, 244]. Using a sliding window to implement parallel Newton methods is best practice and should always be used.

Having discussed the important implications of the theoretical convergence of parallel Newton methods, we now let the rubber hit the road and ask: does the diagonal approximation in equation (20) work in practice?

3.3 EXPERIMENTS AND PERFORMANCE OF QUASI-DEER

In this section, we showcase a variety of settings where quasi-DEER performs well in the parallel evaluation and training of nonlinear RNNs, specifically using the Gated Recurrent Unit (GRU) [38] as a simple and expressive RNN cell.
3.3.1 Quasi-DEER for Evaluation

To benchmark the speed and memory usage of sequential evaluation, DEER, and quasi-DEER on forward passes of RNNs, we use an experimental design from Lim et al. [142]. The task is to evaluate an untrained GRU across a range of hidden state sizes (D) and sequence lengths (T) on a 16GB V100 GPU; the inputs to the RNN also have dimension D. We evaluate these RNNs using three approaches: sequential evaluation, DEER, and quasi-DEER. For DEER and quasi-DEER, we end the Newton iterations when ∥s^{(i)} − s^{(i−1)}∥_∞ < tol, for some specified tolerance tol. In these experiments, we use a tolerance of tol = 1×10^{−4}. In Figure 10, we show qualitatively that both DEER and quasi-DEER converge with great accuracy to the true sequential rollouts.

Figure 10: The accuracy of evaluating with parallelized methods (DEER and quasi-DEER) as opposed to sequential evaluation. The parallelized methods converge to the correct trace within numerical precision. The hidden state size is D = 4 and the sequence length is T = 10,000.

Having confirmed the accuracy of the parallel Newton methods, we now compare the wall-clock time and memory usage of sequential evaluation, DEER, and quasi-DEER. Results are shown in Figure 11. Both DEER and quasi-DEER are up to twenty times faster than sequential evaluation.
The runtimes are similar between DEER and quasi-DEER for small networks, because although quasi-DEER steps are faster, quasi-DEER takes more iterations to converge. For larger networks, the difference in runtime is more pronounced. We also see that quasi-DEER requires as much as an order of magnitude less memory than DEER, thus allowing application in architectural regimes previously infeasible with DEER.

In Figure 12, we run the timing benchmarks of Section 3.3.1 on a wider range of sequence lengths T and hidden state sizes D, on a larger GPU (a V100 with 32 GB) and with a smaller batch size of 1. In doing so, we highlight the parallel nature of DEER and quasi-DEER, as their wall-clock time scales sublinearly in the sequence length T in smaller (D, T) regimes. However, we note that in the larger regimes considered in our main text and in Lim et al. [142], we often observe linear scaling in the sequence length T for the wall-clock time of DEER and quasi-DEER, even though these algorithms are still faster than sequential evaluation.

Figure 11: Evaluating an untrained GRU. Relative performance of sequential, DEER, and quasi-DEER for evaluating a randomly initialized (and untrained) GRU on (Top Row) wall-clock time, averaged over 20 random seeds, and (Bottom Row) memory, averaged over 3 random seeds. All experiments use a 16GB V100 SMX2 (memory capacity indicated by the black dashed line) and Newton methods were run to convergence. Missing points in each series indicate the GPU ran out of memory. In these settings, quasi-DEER has a runtime commensurate with DEER, but with lower memory consumption. Therefore, quasi-DEER can work at scales where DEER cannot.
Figure 12 shows good evidence that these parallel algorithms are suffering from saturation of the GPU, and would benefit from even more optimized implementations. The parallel scan, given sufficiently many processors, scales as O(log T). As we show in Figure 12, we see this speedup at low model sizes and sequence lengths. Once the processors are saturated, we see a linear increase in the runtime (since the amount of work done is linear), but the scan still makes much more effective use of the GPU, resulting in a constant factor speedup over sequential evaluation at larger model sizes and sequence lengths.

Together, these experiments confirm that quasi-DEER can replicate the performance of DEER, but with a smaller memory footprint.

Figure 12: Evaluating an untrained GRU. Sublinear and linear timing regimes for parallelized algorithms. The above experiments were run on a 32 GB V100 with a batch size of 1. As in Figure 11, we use 20 seeds for timing, 3 seeds for memory, and the dashed black line indicates the memory capacity of the GPU (32 GB). We observe in smaller regimes in D and T that the wall-clock time shows sublinear scaling indicative of the use of parallel algorithms. However, when the GPU becomes saturated, the benefits of parallelization are reduced and we begin to see linear scaling in wall-clock time with T.

3.3.2 Quasi-DEER for Training

We verify that quasi-DEER expedites training nonlinear RNN models. We replicate the third experiment from Lim et al. [142], where a GRU is trained to classify C. elegans phenotypes from the time series of principal components of the worms'
body posture [28]. This task is colloquially known as the "eigenworms" task. With a sequence length of T = 17,984, it is the longest task on the UEA Multivariate Time Series Classification archive, a standard benchmark set for assessing the performance of sequence models on long sequences [6].

Figure 13: Training a GRU with DEER. Comparison of DEER and quasi-DEER during GRU training for the C. elegans time-series classification task (Section 3.3.2). Each time series has length T = 17,984. We show the median and 5-95% interval across a rolling window of 20 training steps. (Left) DEER and quasi-DEER have similar validation accuracy trajectories, indicating similar training dynamics. The sequential trace shown is for 24 hours of training (compared to 11 and 4 hours for the whole DEER and quasi-DEER traces). (Center) Each quasi-DEER training iteration is 2.5 times faster than each DEER training iteration. Sequential training steps took more than 6 seconds each (not pictured). (Right) Each quasi-DEER training iteration requires approximately 2 times more Newton iterations to converge, indicating that each quasi-DEER Newton step is approximately 5 times faster than the corresponding DEER Newton step.

We show results in Figure 13. We see that training under quasi-DEER leads to similar validation accuracy trajectories. However, every quasi-DEER training step is faster by a factor of 2.5, despite performing around 2 times more Newton iterations per training step. This finding highlights how quasi-DEER can improve on DEER when training nonlinear RNNs.
In our experiment, we use the quasi-DEER approximation for the backward pass as well, leading to gradients that are different from DEER in this setting. In this particular experiment, we found that there was very little degradation in performance (Figure 13, left). Nonetheless, in general we recommend modifications to quasi-DEER that also allow for an exact backward pass: see the discussion in Subsection 3.4.3. The RNN used in this experiment is a 5-layer GRU. When we evaluate this architecture in parallel, we evaluate each layer in parallel using (quasi-)DEER. In Figure 13 (right), we report the number of (quasi-)DEER iterations averaged over all layers and batches.

3.4 FURTHER DEVELOPMENT AND DIRECTIONS FOR FUTURE WORK

Since the publication of this diagonal quasi-Newton method at NeurIPS in 2024, there have been many extensions. There are also many interesting avenues for future work. This section highlights additional important ideas and future directions for quasi-Newton methods for parallelizing nSSMs.

3.4.1 Efficiently Estimating the Diagonal of the Jacobian

The diagonal approximation presented in equation (20) uses diag(A_t), the diagonal of the dynamics Jacobian, to be significantly more memory and work efficient. However, an important question is: how does one acquire these diagonals? The simplest approach is to compute the A_t with autodifferentiation, and then take their diagonals. This simple approach still decreases the required work during the parallel scan by a factor of D^2. We use this approach in the "eigenworms" experiment in Figure 13, where we show empirically that this simple approach can still yield substantial speedups.
However, the price of this simplicity is that we do not unlock all of the benefits of quasi-DEER. For example, this approach offers no savings on peak memory utilization. Furthermore, during autodifferentiation, we still require D function calls. An approach that unlocks the full benefits of quasi-DEER is to compute diag(A_t) analytically and implement its closed form directly. We follow this approach in Figure 11 and Figure 12, demonstrating substantial memory savings.

Nonetheless, computing derivatives by hand has many drawbacks, and for sufficiently complex dynamics functions diag(A_t) may not even have an implementable closed form. For this reason, Zoltowski et al. [244] takes a different approach: provide a stochastic estimator of diag(A_t) that requires only O(D) memory and one function call. This approach leverages the Hutchinson estimator for the diagonal of a matrix [13, 109, 237]. Consider a matrix A. The Hutchinson estimator Â for diag(A) is

    Â = v ⊙ Av,    (23)

where each entry of v is an iid draw from a Rademacher random variable, i.e. v_j = 1 with probability 1/2 and v_j = −1 otherwise, and where ⊙ represents elementwise multiplication of two vectors. Â is an unbiased estimator for diag(A), as E[Â] = diag(A).

As presented in equation (23), the Hutchinson estimator Â seems a bit silly: we already knew A, and so could have just taken diag(A) directly. However, say we instead want to find the diagonal of ∂f/∂s(s)—which is exactly what we need to run quasi-DEER—without ever instantiating the D×D matrix (which incurs wasteful memory and compute costs). After sampling the Rademacher variable v, we can compute the matrix-vector product (∂f/∂s(s)) v with a single Jacobian-vector product (JVP), which requires only a single pass through f. A JVP is a standard primitive in automatic differentiation⁴ libraries like JAX [27] and PyTorch [184].
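A minimal numpy sketch of the Hutchinson diagonal estimator (here a plain matrix-vector product stands in for the JVP through f): averaging probes recovers the true diagonal, and a single probe is already exact when the matrix is truly diagonal.

```python
import numpy as np

def hutchinson_diag(matvec, D, rng, num_samples=1):
    """Unbiased estimate of diag(A) using only matrix-vector products (equation (23))."""
    est = np.zeros(D)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=D)  # Rademacher probe vector
        est += v * matvec(v)                 # v ⊙ Av
    return est / num_samples

rng = np.random.default_rng(0)
D = 5
A = rng.standard_normal((D, D))
matvec = lambda v: A @ v                     # stands in for a JVP through f

# averaging many probes converges to the true diagonal (unbiasedness)
est = hutchinson_diag(matvec, D, rng, num_samples=20000)
assert np.allclose(est, np.diag(A), atol=0.1)

# if the Jacobian truly is diagonal, a single probe is exact
d = rng.standard_normal(D)
assert np.allclose(hutchinson_diag(lambda v: d * v, D, rng), d)
```
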
A JVP takes in a function f and a tangent vector v, and returns the product of the Jacobian of f with v. By virtue of the chain rule, if JVPs for a suitable basis of functions are defined in an autodifferentiation library, one can evaluate derivatives for wide classes of functions. In fact, the full Jacobian ∂f/∂s(s) can be obtained from D JVPs, one for each basis vector.

Consequently, the Hutchinson estimator obtains an unbiased estimate of the diagonal of the dynamics Jacobian without ever instantiating the D×D matrix ∂f/∂s(s), and it requires only a single function call. Thus, the Hutchinson estimator has the same memory and compute cost as analytically implementing diag(∂f/∂s(s)); but even when the closed form of diag(∂f/∂s(s)) is difficult to obtain, the Hutchinson estimator can still be computed easily using standard autodifferentiation libraries. The variance of the Hutchinson estimator can be reduced by averaging over more Rademacher random vectors. Moreover, if the Jacobian ∂f/∂s truly is diagonal, then the Hutchinson estimator is exact. In any case, because of Proposition 3.1, we know that substituting an approximate Jacobian based on the Hutchinson estimator will still converge globally. Finally, Zoltowski et al. [244] provide a variety of empirical demonstrations showing the strong performance of the Hutchinson estimator for parallelizing the sampling of complicated, high-dimensional distributions via Markov chain Monte Carlo.

In conclusion, if the desired diagonal is tractable analytically and performance is paramount, implementing the derivative directly may yield the most efficient performance. However, if computing the diagonal is intractable, or unwieldy when prototyping many functions f, the Hutchinson estimator introduced in Zoltowski et al.
[244] allows for the use of autodifferentiation, with comparable memory and compute costs, to obtain a practically useful estimate.

3.4.2 Generalizing quasi-DEER to other approximate Jacobians

As we will formalize in Chapter 6, the closer our approximate dynamics matrices Ã_t are to the true Jacobians A_t, the faster the rate of convergence. Moreover, we know from Proposition 3.1 that any approximate Jacobian will still result in global convergence. Thus, a major direction of future research in quasi-Newton methods is finding other structured matrices that improve expressivity while retaining efficiency.

⁴ See Baydin et al. [9] and Maclaurin [155] for more details on automatic differentiation, more commonly and colloquially known as "autodiff".

Reparameterizing the dynamics to be diagonal. Clearly, if the dynamics are axes-aligned, then the Jacobian is a diagonal matrix and the diagonal approximation is exact. If the dynamics are not axes-aligned, but there exists some coordinate transform on the s_t that makes the dynamics axes-aligned, then we could run quasi-DEER on these reparameterized dynamics to enjoy the efficiency of quasi-DEER with the convergence speed of full DEER. However, even if each matrix individually is diagonalizable, it is not always possible to find a basis in which a set of matrices A_t are simultaneously diagonalizable. Nonetheless, even if we only approximately diagonalize the A_t, we know from Proposition 3.1 that the resulting quasi-DEER will still converge globally, and it may still be much faster than just taking the diagonal approximation. One way to obtain such an approximate joint diagonalization is to take some representative matrix, such as the first Jacobian A_1 or an average of all the Jacobians, and find its eigenbasis. In general, such an eigenbasis is complex-valued.
For reasons still not fully understood, such a complex-valued reparameterization struggles to converge, especially on GPUs, likely indicating an issue with numerical precision.

However, an elegant approach to reparameterization, taken by Zoltowski et al. [244], is to use a real eigenbasis, obtained by symmetrizing the representative matrix before computing its eigenbasis. This approach is particularly well-suited to the context of parallelizing MCMC, because the dynamics Jacobian in Langevin dynamics [139] (a common sampling approach that is the backbone of the MALA MCMC algorithm [20]) is already a real symmetric matrix: it is the Hessian of the log probability of the target distribution p. Zoltowski et al. [244] demonstrate across a wide range of experiments that this reparameterization using a real-valued eigenbasis is a robust, efficient, and effective method for parallelizing MCMC over the sequence length.

A final consideration around reparameterization is its computational cost. Another advantage of reparameterizing in MCMC is that the cost of the eigendecomposition is a fixed, one-time cost for a particular kernel, whereas in the context of parallelizing RNNs, one may have to rediagonalize frequently to account for the change in dynamics across gradient updates.

Using other structured matrices in the parallel scan. In quasi-DEER as presented in equation (20), we used diag(A_t) for our approximate dynamics matrix Ã_t. We chose the diagonal because diagonal matrices are closed under composition (multiplying two diagonal matrices yields another diagonal matrix), and closure of the operation is required to use the parallel scan. Nonetheless, Proposition 3.1 shows that any approximate matrix Ã_t will still result in global convergence.
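The closure property just invoked is what makes the parallel scan possible: composing two elementwise affine maps s ↦ a ⊙ s + b yields another map of the same form, and the composition is associative. A minimal numpy sketch (with hypothetical random coefficients, and a sequential reduction standing in for the actual parallel scan) verifies that reducing the per-step maps with this operator reproduces the sequential recurrence:

```python
import numpy as np

def compose(left, right):
    """Compose diagonal affine maps: apply `left`, then `right`.
    (a1, b1) then (a2, b2) gives s ↦ a2 ⊙ (a1 ⊙ s + b1) + b2."""
    a1, b1 = left
    a2, b2 = right
    return a2 * a1, a2 * b1 + b2

rng = np.random.default_rng(0)
T, D = 16, 4
a = rng.normal(size=(T, D))   # entries of diag(Ã_t)
b = rng.normal(size=(T, D))   # offsets b_t
s0 = rng.normal(size=D)

# sequential evaluation of the linear recurrence s_t = ã_t ⊙ s_{t-1} + b_t
s = s0.copy()
for t in range(T):
    s = a[t] * s + b[t]

# reducing all T maps with `compose`; associativity is what lets a
# parallel scan evaluate this reduction in O(log T) depth
A_tot, b_tot = a[0], b[0]
for t in range(1, T):
    A_tot, b_tot = compose((A_tot, b_tot), (a[t], b[t]))
```

Because `compose` is associative, a library scan primitive (e.g. an associative scan) can evaluate all T prefixes of this reduction in logarithmic depth, which is exactly how the quasi-DEER linear recurrence is parallelized.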
For example, in Chapter 6, we will show that many common fixed-point methods, including Jacobi and Picard iterations, can also be interpreted as versions of quasi-DEER with different types of approximation used to form Ã_t.

Therefore, it is natural to ask what other types of structured matrices Ã_t can be easily computed and are closed under composition. For example, in parallelizing Hamiltonian Monte Carlo (HMC) [21, 174], which includes both position and momentum variables, Zoltowski et al. [244] demonstrated that "diagonal-block" matrices⁵ satisfy these desiderata. Under permutation of the coordinates, these diagonal-block matrices are equivalent to block-diagonal matrices. A benefit of block-diagonal matrices is that they can better utilize the tensor cores of GPUs.

Other possibilities for future work include developing quasi-DEER methods based on parallel scans for other structured matrices, such as low-rank matrices. For example, Terzić et al. [222] developed an efficient parallel scan for permutation matrices, which could be an intriguing option for quasi-DEER in certain settings. Moreover, other matrices such as Householder matrices [23] are not well-suited to parallel scans, but admit a chunkwise parallel form that has achieved great success for language modeling in the DeltaNet architecture [196, 236]. In general, there are many varieties [42, 205] of structured matrices that merit further exploration for parallelizing nonlinear SSMs, whether using parallel scans, chunkwise parallel approaches, or other as-yet-unimagined schemes.

Foregoing autodiff and using Broyden-type methods. A unique aspect of the quasi-DEER methods discussed in this chapter, compared with the broader quasi-Newton literature (cf. [48, 179]), is the manner in which the approximate derivative J̃ is constructed.
In all of the instantiations of quasi-DEER discussed above, we in some way differentiate the residual r(·) at every iteration, and then use an approximation of this derivative to reduce the memory and compute requirements of the parallel scan used to evaluate the resulting LDS. However, much of the quasi-Newton literature, especially the widely used Broyden methods [30, 48, 60, 148], is motivated by avoiding the computational cost of differentiating r(·) itself.⁶ In Broyden methods, an approximation to either J or J⁻¹ is built up over the optimization trajectory using only information gleaned from the trajectory itself (primarily the values s^(i) and r(s^(i))).

As discussed in Dennis Jr and Schnabel [48], building up an approximation J̃ for J is called Broyden's first method or Broyden's good update; building up an approximation G for J⁻¹ is called Broyden's second method or Broyden's bad update. A seeming advantage of the so-called "bad update" is that by approximating J⁻¹ directly, one does not have to bear the cost of the matrix inversion. However, the reason for this colorful nomenclature is the robust observation across practitioners that Broyden's good update tends to outperform Broyden's bad update in application (cf. [48]); Lin, Ye, and Zhang [143] provide theoretical analysis suggesting that the good update is more robust over a wider range of initializations.

⁵ i.e., a block matrix in which every block is a diagonal matrix.
⁶ In contrast, the quasi-DEER methods accept the cost of differentiating r(·), and instead focus on reducing the cost of the next step, which is the parallel scan.

Tang et al. [221] used Broyden's bad update to parallelize the evaluation of nonlinear SSMs (in their chosen setting, sampling from diffusion models).
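To make the distinction concrete, here is a minimal numpy sketch of Broyden's good update on a toy two-dimensional residual (the residual, starting point, and initial Jacobian are hypothetical): the residual is differentiated only once, to form the initial J̃, and afterwards J̃ is corrected with rank-one secant updates built purely from the trajectory.

```python
import numpy as np

def broyden_good(r, s, J, iters=30, tol=1e-10):
    """Root-find r(s) = 0 with Broyden's first ("good") update: a rank-one
    secant correction to J̃, with no further differentiation of r."""
    res = r(s)
    for _ in range(iters):
        ds = np.linalg.solve(J, -res)          # quasi-Newton step
        s = s + ds
        res_new = r(s)
        if np.linalg.norm(res_new) < tol:      # converged
            return s
        # enforce the secant condition J̃_{k+1} ds = res_new - res
        J = J + np.outer(res_new - res - J @ ds, ds) / (ds @ ds)
        res = res_new
    return s

# toy residual with a root at s = (1, 2)
r = lambda s: np.array([s[0] ** 2 - 1.0, s[0] * s[1] - 2.0])
J0 = np.array([[2.4, 0.0],    # Jacobian of r at the starting point (1.2, 1.8)
               [1.8, 1.2]])
root = broyden_good(r, np.array([1.2, 1.8]), J0)
```

In the quasi-DEER setting, s would be the entire stacked trajectory s_{1:T}, so the appeal of a Broyden scheme is precisely that it avoids re-differentiating the dynamics at every Newton iteration.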
Building on this work, future work that leans more deeply into the rich literature on Broyden methods, especially Broyden's good update, could have important implications for parallelizing nonlinear SSMs.

3.4.3 Training and the backwards pass

To train an RNN, we need both the forward pass (which fills in the state trajectory s_{1:T}) and the backward pass (which computes the gradient of some loss function with respect to the RNN parameters θ ∈ R^P, and updates those parameters accordingly). In this chapter, we have primarily focused on how DEER and quasi-DEER let us parallelize the forward pass of an RNN, up to numerical precision. However, we should also discuss how to parallelize the backward pass, and in particular the fact that DEER also has an exact backward pass that is parallelized across the sequence length.

To show this, consider an RNN cell parameterized by θ, i.e. s_t = f_θ(s_{t−1}), where s_t represents the RNN hidden state. Assume we want to train our RNN to minimize some supervised scalar loss L(s_T) that is explicitly a function of the final RNN hidden state, but of course depends recurrently on all of the RNN hidden states s_{1:T} and the RNN cell parameters θ.

In modern deep learning, optimization is done by updating the parameters θ based on some function of the gradient of the loss L with respect to the parameters. This derivative is computed during the "backward pass," i.e. backpropagation or the chain rule. In the context of RNNs, this approach to computing the derivative is also called backpropagation through time (BPTT). This name emphasizes that we are applying the chain rule over dependencies across the sequence length (which, especially in neuroscience applications, can be thought of as time).
Therefore, using the chain rule to compute dL/dθ, it follows that

\frac{dL}{d\theta} = \frac{\partial L}{\partial s_T}(s_T)\,\frac{d s_T}{d\theta}, \tag{24}

\frac{d s_t}{d\theta} = \underbrace{\frac{\partial s_t}{\partial s_{t-1}}}_{A_t}\,\frac{d s_{t-1}}{d\theta} + \frac{\partial f}{\partial \theta}(s_{t-1}), \tag{25}

where the A_t are exactly the dynamics function Jacobians that DEER uses to parallelize the forward pass, and we can compute ∂f/∂θ over all of s_{1:T} in an embarrassingly parallel manner using a map. Moreover, equations (24) and (25) indicate that BPTT is an LDS, which we know how to parallelize using a parallel scan. In more detail, unrolling the recursion in equations (24) and (25), it follows that

\underbrace{\frac{dL}{d\theta}}_{\in\, \mathbb{R}^{1\times P}} = \sum_{t=1}^{T} \underbrace{\frac{\partial L}{\partial s_T}(s_T)}_{\in\, \mathbb{R}^{1\times D}} \cdot \underbrace{\prod_{\tau=T}^{t+1} A_\tau}_{\in\, \mathbb{R}^{D\times D}} \cdot \underbrace{\frac{\partial f}{\partial \theta}(s_{t-1})}_{\in\, \mathbb{R}^{D\times P}}. \tag{26}

We observe that all of the products \prod_{\tau=T}^{t+1} A_\tau can be obtained with a parallel scan, showing that the backward pass can also be parallelized.

To summarize, while DEER may need multiple LDSs to achieve the exact forward pass, once it has converged it has the exact matrices A_t needed for the backward pass, which can then also be parallelized with a single parallel scan. However, this single backward parallel scan would incur all of the memory and compute costs that quasi-DEER sought to avoid. For this reason, in the eigenworms experiment shown in Figure 13, we also use Ã_t = diag(A_t) for the backward pass shown in equation (26). As we run only one parallel scan in the eigenworms experiment, we are using an approximate gradient that is not equal to the true gradient that would arise from either DEER or sequential evaluation. This approach using approximate gradients worked empirically for training in the eigenworms experiment shown in Figure 13. Furthermore, Caillon, Fagnou, and Allauzen [32] used an even more approximate gradient (choosing Ã_t to be a random diagonal matrix), though their language modeling experiments show degraded performance relative to full BPTT.
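As a sanity check on the unrolled form (26), the following numpy sketch (with random matrices standing in for the Jacobians A_t and the partials ∂f/∂θ; all shapes hypothetical) confirms that the sum of suffix products matches the sequential recursion (25). In practice, the suffix products Π A_τ would be computed by one parallel scan rather than the nested loop shown here.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, P = 6, 3, 4
A = rng.normal(size=(T + 1, D, D)) * 0.5     # A_t = ∂f/∂s_{t-1}; A[0] unused
dF = rng.normal(size=(T + 1, D, P))          # ∂f/∂θ evaluated at s_{t-1}
dLds = rng.normal(size=(1, D))               # ∂L/∂s_T, a row vector

# sequential BPTT: ds_t/dθ = A_t ds_{t-1}/dθ + ∂f/∂θ(s_{t-1})   (equation (25))
dsdtheta = np.zeros((D, P))
for t in range(1, T + 1):
    dsdtheta = A[t] @ dsdtheta + dF[t]
grad_seq = dLds @ dsdtheta                   # equation (24)

# unrolled form (equation (26)): suffix products A_T A_{T-1} ... A_{t+1}
grad_unrolled = np.zeros((1, P))
for t in range(1, T + 1):
    prod = np.eye(D)
    for tau in range(T, t, -1):              # leftmost factor is A_T
        prod = prod @ A[tau]
    grad_unrolled += dLds @ prod @ dF[t]
```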
Because approximate gradients alter the training dynamics, we do not recommend training with them in general. We instead recommend the following two alternatives for obtaining exact backward passes in a computationally feasible manner.

First, Farsang and Grosu [61], Danieli et al. [40], and Zattra et al. [240] take the approach of adjusting the RNN architecture to have diagonal⁷ Jacobians, with Danieli et al. [40] demonstrating that this approach scales to strong language modeling performance with 7-billion-parameter models. In such architectures, the quasi-DEER approximate Jacobian is actually exact, i.e. the memory and compute costs of running full DEER are reduced to be linear in D. Furthermore, the backward pass is now exact as well. While restricting the architecture in this way would intuitively seem to reduce its expressivity, precisely and rigorously investigating this intuition is an important avenue for future work. Moreover, just as using richer structured matrices could increase the convergence speed of quasi-DEER (see our discussion in Subsection 3.4.2), so too could they improve the expressivity of RNN architectures.

Second, just as both DEER and quasi-DEER use multiple parallel scans to obtain an exact forward pass, we could also use multiple parallel scans to obtain an exact backward pass for quasi-DEER. In particular, observing equation (26), we can treat s̃_T := ∂L/∂s_T(s_T) ∈ R^D as an initial state⁸ in an LDS whose transition is multiplication by A_t^⊤. We can then implement the resulting quasi-DEER backward update along the lines of equation (20), using diag(A_t) for the transition matrix and vector-Jacobian products (VJPs) of f_t to efficiently compute A_t^⊤ s̃_t^(i) for the transition function.⁹

⁷ Or block-diagonal.
While such an approach provably must converge, assessing its efficacy empirically is an interesting avenue for future work. However, we note that a highly related idea called Highway backpropagation has already been shown to accelerate the training of GRUs for character-level language modeling [58].

In conclusion, DEER enjoys exact forward and backward passes, using multiple parallel scans to achieve the forward pass and a single parallel scan for the backward pass. Quasi-DEER as implemented in Figure 13 enjoys an exact forward pass, but only an approximate backward pass, as it uses only a single parallel scan for the backward pass. Using an exact backward pass is important in general, and it can be achieved either by restricting the architecture to make the Ã_t exact, or by using multiple parallel scans in the backward pass as well. Of course, there are other ways beyond BPTT to train RNNs, including e-prop [17], forward-mode optimization [239], evolutionary methods [191], and zeroth-order methods [33].

3.4.4 Initializing the guess for the state trajectory

An important consideration for parallel Newton methods is how to choose the initial guess for the state trajectory s^(0)_{1:T}. As we saw in Proposition 2.3, Newton's method enjoys quadratic convergence if it is initialized close to the true solution s⋆. However, in general, picking an initial guess that is close to the true trajectory s⋆_{1:T} can be as difficult as finding the true trajectory itself. An exception is when an approximate trajectory is already known, which could arise if we were training an RNN on a single sequence (so the state trajectory does not change much with each training step), or conducting sensitivity analysis in Markov chain Monte Carlo (so each chain is close).

In this chapter, the parallel Newton methods were initialized from all zeros. Consequently, the initial dynamics matrices A^(0)_t are all the same, which can exacerbate instability.
A better approach, used in Part I, is to initialize at random. Probably the best approach presented in the literature comes from Danieli et al. [40], which uses one Jacobi iteration (starting from all zeros) to initialize the states, i.e. s^(0)_t = f(0, u_t), where the u_t are the inputs to the RNN. This Jacobi iteration is embarrassingly parallel over the sequence length, and can provide a good initialization for the parallel Newton methods. Further research into even better initializations may prove fruitful.

⁸ Note the reversal of time for the backward pass.
⁹ We can use a VJP of f_t to compute A_t^⊤ s̃_t^(i) because A_t := ∂f_t/∂s_{t−1}(s⋆_{t−1}).

4 Stable Parallelization: ELK and Trust-Region Methods

Another well-known failure mode of Newton's method is instability: the failure mode in which Newton's method diverges, with the iterates growing in magnitude instead of converging to the solution s⋆ (cf. Figure 7). This exact failure mode of never converging does not directly apply to parallel Newton methods, which are guaranteed by Proposition 3.1 to converge in at most T iterations. However, if parallel Newton methods take too many iterations to converge, they will be slower than sequential evaluation, defeating the goal of parallelization in the first place. As we discuss in this chapter, a failure mode that can slow the convergence of Newton methods, especially in finite precision, is when intermediate iterates s^(i) explode in value. To overcome this slowed convergence caused by instability, we introduce a parallelized trust-region optimizer called ELK: Evaluating Levenberg-Marquardt with Kalman.

4.1 Levenberg-Marquardt and Trust-Region Methods

For the purpose of stabilizing parallel Newton methods, we take the optimization perspective discussed in Subsection 2.3.2, focusing on the merit function L(s) introduced in equation (10).
However, instead of optimizing this merit function with the Gauss-Newton algorithm (GN) (i.e. DEER), we will use the Levenberg-Marquardt (LM) algorithm [140, 159], one of the most standard trust-region approaches. The idea of a trust region is simple, and is depicted in Figure 14, which is adapted from Figure 4.1 of Nocedal and Wright [179]. The core idea of trust-region methods is that the quadratic surrogate minimized by the Gauss-Newton method may only be accurate or helpful in a neighborhood of the current guess s^(i). Thus, trust-region methods require that the next iterate s^(i+1) minimize the merit function L(s) subject to lying in some neighborhood of the current guess s^(i). Trust regions are often used in conjunction with Newton's method to improve numerical stability and convergence. Each Gauss-Newton step solves an unconstrained optimization problem, while each trust-region step solves a constrained optimization problem.

Figure 14: Graphical Depiction of Trust-Region Methods. We show both an undamped Gauss-Newton step (red) and a stabilized trust-region step (blue). The solid lines indicate the contours of the merit function L we want to minimize. The dashed lines indicate the contours of the quadratic surrogate that Gauss-Newton is minimizing on this iteration. The dotted lines indicate the trust region around s^(i); trust-region methods restrict the update to this ball, resulting in this case in an update that reduces the objective. Figure adapted from Nocedal and Wright [179, Figure 4.1].

The Levenberg-Marquardt algorithm in particular is a canonical trust-region method. Let us define the quadratic surrogate that Levenberg-Marquardt minimizes at each iteration (i), as a function of the step ∆s it takes, i.e.
\widetilde{L}_{s^{(i)}}(\Delta s) = \frac{1}{2}\,\big\| r(s^{(i)}) + J(s^{(i)})\,\Delta s \big\|_2^2. \tag{27}

Then, LM uses updates that solve the constrained optimization problem

\min_{\Delta s}\ \widetilde{L}_{s^{(i)}}(\Delta s) \quad \text{subject to} \quad \|\Delta s\|_2 \leq D_{i+1}, \tag{28}

where D_{i+1} is an upper bound on the step size, thus defining our trust region. Note that both the objective \widetilde{L}_{s^{(i)}} and the constraint g(∆s) := ‖∆s‖₂² − D²_{i+1} are convex in ∆s (in fact, both are quadratic). Therefore, by the method of Lagrange multipliers, solving the constrained optimization problem in equation (28) is equivalent to minimizing the Lagrangian

\widehat{L}_{s^{(i)}}(\Delta s) = \widetilde{L}_{s^{(i)}}(\Delta s) + \frac{\lambda_{i+1}}{2}\,\|\Delta s\|_2^2 \tag{29}

over ∆s for some fixed λ_{i+1} ⩾ 0. Note that if λ_{i+1} = 0, then the unconstrained minimizer of \widetilde{L}_{s^{(i)}}(∆s) is inside the trust region.

Since equation (29) is quadratic in ∆s, it follows that

\Delta s_{\mathrm{LM}} = -\big(J^\top J + \lambda I\big)^{-1} J^\top r.

Therefore, we observe that if λ = 0, we recover the typical GN step. On the other hand, if λ is large, the LM update approaches the gradient descent (GD) update with step size 1/λ. Intuitively, LM "regularizes" the update by adding a non-negative term to the diagonal of JᵀJ. This regularization can help stabilize the update when JᵀJ has small eigenvalues, yielding an update of smaller and more manageable magnitude. In fact, this stabilization technique used by LM is exactly analogous to the regularization technique of ridge regression, or ℓ₂ regularization, used in statistical machine learning [93, 94, 103, 132, 224, 231].

However, while it is intuitive why LM can help stabilize the GN updates, it is not immediately obvious how we can parallelize the LM update over the sequence length to achieve our goal of parallelizing the evaluation of nonlinear SSMs. In the next section, we will show how we can parallelize LM updates in our setting via a connection with Kalman smoothing.
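The closed-form LM step and its two limiting regimes can be checked in a few lines of numpy (the residual r and Jacobian J are hypothetical random placeholders):

```python
import numpy as np

def lm_step(J, r, lam):
    """Levenberg–Marquardt step: Δs = -(JᵀJ + λI)⁻¹ Jᵀ r."""
    return -np.linalg.solve(J.T @ J + lam * np.eye(J.shape[1]), J.T @ r)

rng = np.random.default_rng(0)
J = rng.normal(size=(6, 4))   # Jacobian of the residual at the current guess
r = rng.normal(size=6)        # residual at the current guess

# λ = 0 recovers the undamped Gauss–Newton step
gn = -np.linalg.solve(J.T @ J, J.T @ r)

# large λ approaches a gradient-descent step of size 1/λ
lam = 1e8
gd_like = -(J.T @ r) / lam
```

The damping term λI lifts the small eigenvalues of JᵀJ, which is exactly the mechanism (analogous to ridge regression) that keeps the step magnitude manageable.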
4.2 ELK: Evaluating Levenberg-Marquardt with Kalman

There is a rich literature connecting optimization techniques with the problems of filtering and smoothing.¹ In particular, Bell and Cathey [16] and Bell [14] draw connections between the Gauss-Newton method and the iterated extended Kalman filter and smoother [194, 215]. Because Gauss-Newton is unstable, it is natural to use Levenberg-Marquardt [140, 159] to stabilize the filtering/smoothing problem [35, 156, 193].

This connection between optimization and Kalman smoothing hinges on the following point, noted by Särkkä and Svensson [193]: the minimizer of the Lagrangian in equation (29) can be obtained by a Kalman smoother. We emphasize this connection in the following proposition.

Proposition 4.1. Solving for the Levenberg-Marquardt update that minimizes (29) with fixed λ_{i+1} is equivalent to finding the maximum a posteriori (MAP) estimate of s_{1:T} in a linear Gaussian state space model, which can be done in O(log T) time on a sufficiently large parallel machine.

Proof. Expanding the residual and Jacobian functions in (27), we see that, up to an additive constant, the negative Lagrangian can be rewritten as

¹ For background on filtering and smoothing, see our introduction to Bayesian filtering and smoothing in Subsection 2.1.2.1, and Särkkä and Svensson [194] for the standard textbook introduction.

Figure 15: Graphical Diagram of the ELK LGSSM. The latent states s_1, s_2, s_3 evolve from s_0 under the linearized dynamics (A_t, b_t), and each s_t emits the previous iterate s^(i)_t with variance 1/λ. We provide a graphical diagram illustrating how the LM update, in the context of parallelizing nSSMs, is the MAP solution to posterior inference in an appropriately constructed LGSSM. Without any observations (i.e. λ = 0, or equivalently observations with infinite variance), we simply recover the DEER update.
However, by using our previous state s^(i)_{1:T} as our observations, we restrict the dynamics to a trust region.

-\widehat{L}(\Delta s, \lambda_{i+1}) \doteq \log \mathcal{N}\big(s_1 \mid f(s_0), I_D\big) + \sum_{t=1}^{T} \log \mathcal{N}\Big(s^{(i)}_t \,\Big|\, s_t,\ \tfrac{1}{\lambda_{i+1}} I_D\Big) + \sum_{t=2}^{T} \log \mathcal{N}\Big(s_t \,\Big|\, f(s^{(i)}_{t-1}) + \tfrac{\partial f}{\partial s}(s^{(i)}_{t-1})\,(s_{t-1} - s^{(i)}_{t-1}),\ I_D\Big), \tag{30}

where N(x | μ, Σ) denotes the probability density function of the multivariate normal distribution. We recognize (30) as the log joint probability of a linear Gaussian state space model (LGSSM) [194] on (s_1, ..., s_T). Consequently, the dynamics distributions are given by the linearization of f, and the emissions are the previous iteration's states s^(i). The parameter λ_{i+1} sets the precision of the emissions, governing how far the posterior mode deviates from the previous states. We show the graphical diagram of this LGSSM for T = 3 in Figure 15.

The minimizer of (29) is the posterior mode of the LGSSM (30), and can be obtained by Kalman smoothing [194]. As with the linear recursions in DEER, the Kalman smoother can be implemented as a parallel scan that runs in O(log T) time on a machine with O(T) processors [144, 192].

Therefore, we can evaluate an RNN by minimizing the merit function with the Levenberg-Marquardt algorithm. Since each step of LM can be performed by parallel Kalman smoothing, we call this approach Evaluating Levenberg-Marquardt with Kalman (ELK). Note that DEER is a special case of ELK with λ = 0, which can be seen as minimizing the unpenalized linearized objective (27) or, alternatively, as taking a Newton step with an infinitely large trust region. Moreover, under certain conditions, ELK also enjoys global convergence guarantees [179, Thms. 11.7, 11.8].

Quasi-ELK: scalability and stability. As with DEER, we can substitute an approximate Jacobian into the Lagrangian to obtain the quasi-ELK algorithm.
Quasi-ELK enjoys the compute and memory scaling of quasi-DEER, as well as the stability of ELK's trust-region damping. We show empirically in Section 4.4 that while quasi-ELK takes more iterations to converge than ELK, each quasi-ELK iteration is faster, giving overall runtime speedups.

Implementation details. The convergence rate of (quasi-)ELK depends on the trust-region radius D_i (or, alternatively, λ_i). Although there exist methods to set λ_i analytically [179, Algorithm 4.3], these approaches require factorizing ∂r/∂s, which is intractable at scale. Therefore, in practice, we treat λ as a hyperparameter set by a sweep over log-spaced values.

We also use Kalman filtering instead of smoothing, for two main reasons: filtering requires less work and memory, and we also found it to converge in fewer Newton iterations than smoothing. We hypothesize that this faster convergence is related to Proposition 3.1, whose proof shows that the early part of the trace converges first. The traces in parallel Newton iterations converge causally, propagating information from the ground-truth initial condition s_0 to the end of the sequence. Therefore, it makes intuitive sense that a Kalman filter, which is also causal, would have better empirical performance than a Kalman smoother.

Using the Kalman filter also provides an intuitive explanation, based on dynamics instead of optimization, of how ELK calms the instabilities that can arise in DEER. We discuss this connection in the next section.

4.3 Dynamics Perspective on ELK

A complementary perspective on how ELK yields more stable evaluation of nonlinear RNNs is to see how the Kalman filter damps the spectral norms of the Jacobian matrices A_t of the transition dynamics. The spectral norm of a matrix gives the maximum factor by which it can scale an input vector, and so is intuitively related to the stability of a linear dynamical system.
We first provide a high-level overview, and then a more detailed derivation.

Overview. Let A_t be the Jacobians ∂f_t/∂s_{t−1} used in the linear recurrence relations and b_t the offsets. The prediction step of the Kalman filter (ELK) is then the same as DEER. However, after applying the update step in ELK (which imposes the trust region), we obtain a second linear recurrence relation whose linear operator is Γ_t A_t. Here Γ_t is a symmetric positive definite matrix with eigenvalues bounded above by 1/(1+λ). Thus, by the Spectral Theorem, the spectral norm of Γ_t A_t is bounded above by ‖A_t‖₂/(1+λ). Note that larger λ corresponds to more regularization (a smaller trust region), and therefore a smaller effective spectral norm. We recover DEER exactly if λ = 0. Thus, while large spectral norms of the A_t are a cause of the instability of DEER when evaluating unstable dynamical systems, ELK directly attenuates these spectral norms, providing an explanation for why the intermediate iterations of ELK remain stable.

Derivation. We define the dynamics used in Newton iteration (i+1) as

A_t = \frac{\partial f_t}{\partial s_{t-1}}\big(s^{(i)}_{t-1}\big), \qquad b_t = f_t\big(s^{(i)}_{t-1}\big) - \frac{\partial f_t}{\partial s_{t-1}}\big(s^{(i)}_{t-1}\big)\, s^{(i)}_{t-1},

with A_t ∈ R^{D×D} and b_t ∈ R^D. In line with viewing the system as the LDS in (30), we set the process noise to I_D, with the emissions governed by s^{(i+1)}_t ∼ N(s^{(i)}_t, σ² I_D), where σ² controls the size of our trust region, since λ = 1/σ². In the notation of Murphy [170], the predict step is

\mu_{(t+1)|t} = A_{t+1}\,\mu_{t|t} + b_{t+1}, \qquad \Sigma_{(t+1)|t} = A_{t+1}\,\Sigma_{t|t}\,A_{t+1}^\top + I_D.

Meanwhile, the update step is

\mu_{(t+1)|(t+1)} = \mu_{(t+1)|t} + \Sigma_{(t+1)|t}\,\big(\Sigma_{(t+1)|t} + \sigma^2 I_D\big)^{-1}\big(s^{(i)}_{t+1} - \mu_{(t+1)|t}\big),

\Sigma_{(t+1)|(t+1)} = \Sigma_{(t+1)|t} - \Sigma_{(t+1)|t}\,\big(\Sigma_{(t+1)|t} + \sigma^2 I_D\big)^{-1}\,\Sigma_{(t+1)|t}^\top.
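The predict and update steps above can be written as a short numpy sketch (dimensions, dynamics, and noise level all hypothetical). The test also verifies the attenuation property derived just below: the filtered mean follows a linear recurrence with the damped operator Γ_{t+1} A_{t+1}, whose spectral norm is at most ‖A_{t+1}‖₂/(1+λ).

```python
import numpy as np

def elk_filter_step(mu, Sigma, A, b, s_obs, sigma2):
    """One ELK Kalman step: predict through the linearized dynamics
    (process noise I_D), then update against the previous iterate s^(i)."""
    D = mu.shape[0]
    mu_pred = A @ mu + b                                   # predict step
    Sig_pred = A @ Sigma @ A.T + np.eye(D)
    gain = Sig_pred @ np.linalg.inv(Sig_pred + sigma2 * np.eye(D))
    mu_new = mu_pred + gain @ (s_obs - mu_pred)            # update step
    Sig_new = Sig_pred - gain @ Sig_pred.T
    return mu_new, Sig_new

rng = np.random.default_rng(0)
D, sigma2 = 4, 0.5                      # λ = 1/σ² = 2
A = rng.normal(size=(D, D)) * 2.0       # deliberately expansive dynamics
b = rng.normal(size=D)
Sigma = np.eye(D)
mu0 = rng.normal(size=D)
s_obs = rng.normal(size=D)              # previous iterate s^(i)_{t+1}
lam = 1.0 / sigma2

mu_new, Sig_new = elk_filter_step(mu0, Sigma, A, b, s_obs, sigma2)

# attenuation matrix Γ = σ²(AΣAᵀ + (σ²+1)I)⁻¹ damps the effective dynamics
Gamma = sigma2 * np.linalg.inv(A @ Sigma @ A.T + (sigma2 + 1.0) * np.eye(D))
```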
To unpack this further, we first define the attenuation matrix

\Gamma_{t+1} := \sigma^2\,\big(A_{t+1}\,\Sigma_{t|t}\,A_{t+1}^\top + (\sigma^2 + 1)\,I_D\big)^{-1}.

Because Σ_{t|t} is a covariance matrix, it is symmetric positive semidefinite, so A_{t+1} Σ_{t|t} A_{t+1}^⊤ is also symmetric positive semidefinite and all of its eigenvalues are nonnegative. Therefore, all eigenvalues of A_{t+1} Σ_{t|t} A_{t+1}^⊤ + (σ²+1) I_D are greater than or equal to σ²+1.

Consequently, Γ_{t+1} is symmetric positive definite, and by the Spectral Theorem all of its eigenvalues are positive; by the argument above, they are all at most σ²/(1+σ²) < 1. Moreover, since Γ_{t+1} is symmetric positive definite, its eigenvalues are equal to its singular values, and so ‖Γ_{t+1}‖₂ ⩽ 1/(1+λ).

Thus, we observe that the resulting filtering is given by the recurrence relation

\mu_{(t+1)|(t+1)} = \underbrace{\Gamma_{t+1}\,A_{t+1}}_{\text{linear dynamics}}\,\mu_{t|t} + \underbrace{\Gamma_{t+1}\,b_{t+1} + \big(A_{t+1}\Sigma_{t|t}A_{t+1}^\top + I_D\big)\big(A_{t+1}\Sigma_{t|t}A_{t+1}^\top + (\sigma^2+1)\,I_D\big)^{-1} s^{(i)}_{t+1}}_{\text{bias term}},

\Sigma_{(t+1)|(t+1)} = \Gamma_{t+1}\,\big(A_{t+1}\Sigma_{t|t}A_{t+1}^\top + I_D\big).

Given the Σ_{t|t}, we see that the filtered means (the updates for ELK) come from a linear recurrence relation with linear term Γ_{t+1} A_{t+1}. Finally, by the submultiplicativity of norms and our results above, it follows that

\|\Gamma_{t+1}\,A_{t+1}\|_2 \leq \|\Gamma_{t+1}\|_2\,\|A_{t+1}\|_2 \leq \frac{1}{1+\lambda}\,\|A_{t+1}\|_2. \qquad \square

4.4 Experiments and Performance of ELK

Having derived the ELK algorithm and studied its theoretical properties, we now empirically assess its performance in parallelizing dynamical systems at the edge of stability. We examine two dynamical systems: a sine wave and the Lorenz-96 dynamical system [152]. All the experiments in this section were run on a single NVIDIA A100 GPU with 80 GB of onboard memory.

4.4.1 Edge of stability: Parallelizing a sine wave

First, we pretrain an RNN to recapitulate a sine wave.
For our architecture, we use a GRU with hidden states $h_t \in \mathbb{R}^3$ and scalar inputs $x_t \in \mathbb{R}$. At every point $t$ in the sequence, we read out the hidden state $h_t \in \mathbb{R}^3$ and use it to parameterize a mean $\mu_{t+1} \in \mathbb{R}$ and a variance $\sigma^2_{t+1} \in \mathbb{R}_+$. We then sample $x_{t+1} \sim \mathcal{N}(\mu_{t+1}, \sigma^2_{t+1})$; this output $x_{t+1}$ is fed back in as the input to the autoregressive GRU at time step $t+1$ to produce the new hidden state $h_{t+1}$. Crucially, when parallelizing this architecture, the Markovian state $s_t$ must be expanded to include the current sampled output value as well as the current GRU state.

We pretrain this GRU using standard sequential evaluation and backpropagation through time to produce a noisy sine wave of length 10,000. We train the GRU on 1024 traces $x_{1:T}$ generated from a sine wave with amplitude 10 and white noise applied at each time step; the training objective is to minimize the negative log probability of the $x_{1:T}$.

We note that such a system is Markovian with state dimension $D = \dim(h) + \dim(x)$, as together the hidden state $h_t$ and output $x_{t+1}$ determine the next hidden state $h_{t+1}$ and output $x_{t+2}$. Thus, in the notation of equation (1), a hidden state $s_t$ of the Markovian state space model is $s_t = (x_{t+1}, h_t)$. We can therefore apply parallel Newton methods to find the correct trace $s^*$ in a parallelized manner instead of autoregressively.

Initialized AR GRU. We first repeat the analysis in Section 3.3.1 for evaluating a randomly initialized autoregressive GRU. We see in the top left panel of Figure 16 that all four parallel Newton methods converge rapidly and stably to the correct trace, as indicated by a low mean absolute discrepancy (MAD) between the true trace and the generated trace.

Trained AR GRU. We then study a pre-trained GRU that generates a noisy sine wave (see Figure 16, bottom).
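Before turning to the results, the Markovian state expansion described above can be sketched generically. Here `cell` and `readout` are hypothetical stand-ins for the GRU cell and its output head (with a deterministic readout instead of Gaussian sampling, for simplicity):

```python
import numpy as np

def cell(h, x):
    # Stand-in for the GRU cell: any smooth recurrent update works for the sketch.
    return np.tanh(0.5 * h + x + 0.1)

def readout(h):
    # Stand-in for the output head mapping h_{t+1} to the next input x_{t+2}
    # (deterministic here; the thesis samples from a parameterized Gaussian).
    return float(np.sum(h))

def f(s):
    """Markovian transition on the expanded state s_t = (x_{t+1}, h_t):
    one step of the autoregressive loop."""
    x, h = s[0], s[1:]
    h_next = cell(h, x)            # feed the current output back in as input
    x_next = readout(h_next)       # produce the next output
    return np.concatenate([[x_next], h_next])

# Rolling f forward on the expanded state reproduces autoregressive generation.
s = np.zeros(4)                    # D = dim(x) + dim(h) = 1 + 3
expanded = []
for _ in range(5):
    s = f(s)
    expanded.append(s)

# The literal autoregressive loop gives the same trajectory.
h, x = np.zeros(3), 0.0
for _ in range(5):
    h = cell(h, x)
    x = readout(h)
assert np.isclose(expanded[-1][0], x) and np.allclose(expanded[-1][1:], h)
```

The point of the expansion is that `f` is a single Markovian map to which the parallel Newton machinery applies unchanged.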
The linear recurrence relation (18) was numerically unstable in DEER and quasi-DEER. To remedy these instabilities, we take the approach described earlier of setting the unstable parts of the trace to a fixed value (here zero). Doing so ensures convergence, but at the cost of "resetting" the optimization for large swathes of the trace (Figure 16, bottom) and slowing convergence (Figure 16, top right). This finding highlights how the instabilities of DEER, inherited from both the pathologies of Newton's method and the parallel recurrence, can be crippling in even very simple scenarios. While resetting allows for convergence, the resulting convergence is very slow.

We then apply ELK and quasi-ELK, with results shown in the top right and bottom panels of Figure 16. We select the trust region size with a one-dimensional search over log-spaced values between $10^0$ and $10^7$. ELK stabilizes convergence: the evaluation never incurs numerical instabilities or requires heuristics. Crucially, by taking more stable steps (and not needing stabilizing heuristics), ELK and quasi-ELK converge faster than DEER and quasi-DEER. ELK can stabilize and expedite the convergence of DEER, with quasi-ELK faster still in wall-clock time.

However, when run on an A100 GPU with 80 GB onboard memory, all parallel Newton methods (including DEER) are slower than sequential generation, as shown in Table 4. Quasi-ELK is the fastest parallel method, taking 221 milliseconds, compared to 96 milliseconds for sequential evaluation. For comparison, DEER took 1,255 milliseconds. Quasi-ELK therefore still represents a large improvement in runtime over previous parallel methods.

These timing results illustrate several themes of this work. The undamped Newton steps are individually faster because they carry out fewer computations.
The undamped Newton steps compute only a linear recurrence relation, while the trust-region methods compute a full filtering pass.

Figure 16: ELK stabilizes parallel evaluation of an AR GRU. (Top left) The mean absolute difference (MAD) evaluated on the outputs converges rapidly for all four methods on a sequence generated by an untrained AR GRU. (Top right) The MAD for evaluating a trained AR GRU. Undamped DEER variants are unstable and converge slowly (using the reset heuristic). ELK stabilizes and accelerates convergence. (Bottom) The output after 1, 100, 1000, and 2000 Newton iterations. The black dotted line is the true trace. ELK and quasi-ELK converge rapidly, but DEER and quasi-DEER are unstable. The intervals where DEER and quasi-DEER are zero depict the zeroing heuristic.

However, because the undamped Newton methods are numerically unstable, they take dramatically more Newton steps to converge. Similarly, the quasi methods are dramatically faster than their dense counterparts because they replace $O(D^3)$ matrix-matrix multiplication with $O(D)$ diagonal matrix multiplication (the $O(D^3)$ work required by a parallel scan on a dense linear recurrence likely saturates the GPU). We see in Table 4 that individual steps of dense DEER/ELK are (approximately) a factor of 3.5 to 30 times slower per step than their quasi (diagonal) variants, but that they take a factor of 2 to 10 fewer iterations.

Further details on setting $\lambda$. We provide more details on how to set the hyperparameters for ELK in Figure 17.
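The per-step cost gap between the dense and quasi variants comes from the combine operator of the associative scan over affine maps $s \mapsto A s + b$. A minimal sketch, with a sequential `functools.reduce` standing in for the parallel scan (illustrative names, not the thesis implementation):

```python
import numpy as np
from functools import reduce

def combine_dense(e1, e2):
    """Compose affine maps s -> A s + b: O(D^3) matrix-matrix work per combine."""
    (A1, b1), (A2, b2) = e1, e2
    return (A2 @ A1, A2 @ b1 + b2)

def combine_diag(e1, e2):
    """Quasi (diagonal) variant: O(D) elementwise work per combine."""
    (a1, b1), (a2, b2) = e1, e2
    return (a2 * a1, a2 * b1 + b2)

# The same reduction a parallel scan evaluates, written sequentially.
rng = np.random.default_rng(0)
D, T = 3, 6
elems = [(np.diag(rng.normal(size=D)), rng.normal(size=D)) for _ in range(T)]
A_tot, b_tot = reduce(combine_dense, elems)

# Check against the literal recurrence s_t = A_t s_{t-1} + b_t from s_0 = 0:
s = np.zeros(D)
for A, b in elems:
    s = A @ s + b
assert np.allclose(s, b_tot)   # from s_0 = 0, the rollout is the composed offset

# With diagonal A_t, the cheap combine gives the same answer.
a_elems = [(np.diag(A), b) for A, b in elems]
a_tot, bd_tot = reduce(combine_diag, a_elems)
assert np.allclose(bd_tot, b_tot) and np.allclose(a_tot, np.diag(A_tot))
```

Both operators are associative, which is what lets a parallel scan evaluate the whole recurrence in $O(\log T)$ depth; the quasi methods simply make each combine $O(D)$ instead of $O(D^3)$.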
We sweep over the hyperparameter for 15 different input sequences, and plot the median and quartiles of the cost to convergence in terms of Newton iterates and runtime (left column of Figure 17). We see a U-shaped curve: large $\lambda$ takes needlessly small steps, slowing progress; small $\lambda$ results in many resets, slowing convergence. Crucially, there is little variance across individual sequences. These results show that there is a well-behaved dependence on $\lambda$ that can be optimized on a validation set with a simple one-dimensional grid search.

Table 4: Time to evaluate a length $T = 10{,}000$ trained AR GRU using sequential vs. parallelized methods. We note that the dynamax package [144] we used for the parallel Kalman filter implementation in ELK is not optimized for speed, and hence these run times could be further improved.

    Algorithm      Time per Newton step (ms, mean ± std)   Newton steps to convergence   Total time to convergence (ms)
    Sequential     N/A                                     N/A                           96
    DEER           0.282 ± 0.0005                          4449                          1255
    Quasi-DEER     0.087 ± 0.0002                          7383                          642
    ELK            3.600 ± 0.0670                          172                           619
    Quasi-ELK      0.141 ± 0.0004                          1566                          221

We also chart the approximation error against cost for the AR GRU (center and right columns of Figure 17). The approximation error reduces in fewer Newton steps with full DEER than with quasi-DEER, but, crucially, the wall-clock time (the more important of the two metrics) is notably lower across all accuracies for quasi-DEER. This indicates that the more efficient (but approximate) quasi-DEER is broadly preferable to the more expensive (but exact) DEER updates. Furthermore, the stabilized ELK and quasi-ELK are better still. We also show the steps/time to convergence for a range of accuracy thresholds, and see that our methods outperform DEER across the full range of thresholds and metrics.
4.4.2 Chaotic system: Parallelizing the Lorenz-96 system

Having investigated the parallel Newton methods on the edge of stability (a sinusoidal oscillation), we now investigate their performance on a chaotic system. We tackle the parallel evaluation of the classic nonlinear five-dimensional Lorenz-96 system with forcing $F = 8$, which results in chaotic dynamics. We seek to evaluate this system (for $T = 1000$ timesteps) using (quasi-)DEER and (quasi-)ELK. We directly use the Lorenz-96 dynamics as our nonlinear dynamics function $f$; i.e., the architecture/time evolution is the Lorenz-96 ODE system, evaluated using the Dormand-Prince solver [53].

Figure 17: Experiment showing how to set the hyperparameters for (quasi-)ELK on the AR GRU pre-trained to generate a noisy sine wave (Figure 16). The top row plots Newton steps; the bottom row plots wall-clock time. Lower is better in all plots. (Left) Median steps/time to convergence as a function of $\lambda$ over 15 sequences. Quartiles are shaded but are very small. DEER methods are independent of $\lambda$. (Center) Updated version of Figure 16, instead plotting MAD as a function of wall-clock time. (Right) Time to convergence is robust as a function of the convergence threshold $\varepsilon$. Median and quartiles across 15 sequences are shown. DEER methods are nearly constant at the thresholds considered (very slight positive slope). Note that increasing $\lambda$ corresponds to a smaller trust region, and decreasing $\varepsilon$ corresponds to a tighter convergence threshold.

The state is the five-dimensional Lorenz-96 system state.
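The Lorenz-96 vector field itself is simple to state. A sketch with the cyclic-index convention and $F = 8$ (illustrative code; the thesis uses an adaptive Dormand-Prince solver rather than the crude Euler step shown):

```python
import numpy as np

def lorenz96(x, F=8.0):
    """Lorenz-96 vector field dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F,
    with cyclic indexing. An ODE solver applied to this field yields the
    discrete transition f used by the parallel Newton methods."""
    return (np.roll(x, -1) - np.roll(x, 2)) * np.roll(x, 1) - x + F

# x_i = F for all i is a fixed point of the vector field:
assert np.allclose(lorenz96(8.0 * np.ones(5)), 0.0)

# A crude fixed-step Euler rollout from a perturbed equilibrium, for intuition
# only (chaos develops as the perturbation is amplified):
x = 8.0 * np.ones(5)
x[0] += 0.01
for _ in range(100):
    x = x + 0.01 * lorenz96(x)
```

With $F = 8$ the equilibrium is unstable, so even this tiny perturbation eventually produces the chaotic behavior the experiment targets.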
The input is therefore the initial condition of the ODE, and the outputs are the $T \times 5$ subsequent system states. Of course, ODE solvers are also examples of SSMs (see Table 1).

We demonstrate that all the parallelized methods converge to the correct trace, but that (quasi-)ELK is dramatically more stable at intermediate Newton iterations prior to convergence. DEER and ELK methods converge in a comparable number of steps (this makes sense, as DEER is a special case of ELK for $\lambda \to 0$). DEER is faster in wall-clock time because of the extra work done per ELK iteration. However, ELK has stabilized convergence, whereas DEER relies heavily on resetting. Interestingly, the quasi methods are slower by all metrics here, suggesting that the chaotic dynamics may require the more accurate updates. Quasi methods can, however, be implemented to consume notably less memory, and so may be preferable in certain circumstances.

In Figure 18, we report the mean absolute deviation (MAD) of the time series at Newton iteration $(i)$ against the true state sequence. "Iteration" then refers to the number of Newton iterations, i.e., the number of updates applied to the entire state sequence.

Figure 18: Evaluating the Lorenz-96 system in parallel. (Top two rows) Same format as Figure 17. (Bottom row) Lorenz-96 trajectories during optimization, shown at Newton iterations 50, 100, 200, and 500 against the true trace. DEER methods are noticeably more unstable than ELK methods.

We set hyperparameters using 10 different evaluations of the Lorenz-96 system (i.e.,
starting from 10 different initial points).

4.5 Further Extensions: Scale- and Clip-ELK

Since running the experiments for ELK published in our NeurIPS 2024 paper [80], we have developed simpler and more lightweight damping techniques that achieve many of the stabilization benefits of ELK. Zoltowski et al. [244] use these damping techniques to parallelize MCMC chains. We discuss two of these extensions, scale-ELK and clip-ELK, below.

4.5.1 Scale-ELK

Motivated by our demonstration in Section 4.3 that ELK reduces the spectral norms of the Jacobian matrices in the transition dynamics, we recommend a more lightweight version of ELK which we call scale-ELK. Scale-ELK uses a hyperparameter $k \in [0,1]$ (as opposed to the $\lambda \in [0,\infty)$ used by ELK). Scale-ELK uses a linear dynamical system just like DEER, with the dynamics defined as
$$A_t = (1-k)\, \frac{\partial f_t}{\partial s_{t-1}}\big(s^{(i)}_{t-1}\big), \qquad b_t = f_t\big(s^{(i)}_{t-1}\big) - (1-k)\, \frac{\partial f_t}{\partial s_{t-1}}\big(s^{(i)}_{t-1}\big)\, s^{(i)}_{t-1}.$$
Thus, setting $k = 0$ recovers DEER, while setting $k = 1$ recovers a (computationally expensive form of) sequential evaluation. Ideally, $k$ is chosen to keep the spectral norms of $\{A_t\}_{t=1}^T$ below 1. Note that $k_t$ can also be chosen on a timestep-dependent basis. By Proposition 3.1, scale-ELK also enjoys global convergence.

Scale-ELK enjoys two primary benefits over ELK. First, an evaluation of scale-ELK uses fewer FLOPs than ELK, since scale-ELK merely parallelizes an LDS while ELK runs a parallelized Kalman filter. Second, the Kalman filter involves matrix inverses, which risk introducing numerical instability; scale-ELK avoids these complications.

4.5.2 Clip-ELK

However, scale-ELK still has a hyperparameter $k$. Although this hyperparameter can be set using techniques as shown in Figure 17, it would be desirable to have a hyperparameter-free method. We therefore propose clip-ELK, a hyperparameter-free approach that achieves the same goal of a stable LDS.
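Both damping schemes of this section amount to one-line modifications of the DEER linearization. A sketch with illustrative helper names (`A_jac` and `f_val` stand in for the model's Jacobian and dynamics value at the current iterate):

```python
import numpy as np

def scale_elk_dynamics(f_val, A_jac, s_prev, k):
    """Scale-ELK: shrink the Jacobian by (1 - k); b_t is chosen so the affine
    map still passes through (s_prev, f(s_prev)). k = 0 recovers DEER."""
    A = (1.0 - k) * A_jac
    b = f_val - A @ s_prev
    return A, b

def clip_elk_dynamics(f_val, a_diag, s_prev, lo=-1.0, hi=1.0):
    """Clip-ELK (quasi/diagonal setting): clip each diagonal Jacobian entry
    into [lo, hi], so every iteration is a stable LDS by construction."""
    a = np.clip(a_diag, lo, hi)
    b = f_val - a * s_prev
    return a, b

rng = np.random.default_rng(0)
A_jac = rng.normal(size=(3, 3)) * 5.0      # a deliberately large Jacobian
s_prev, f_val = rng.normal(size=(2, 3))

# k = 0: the plain DEER linearization around the current iterate.
A0, b0 = scale_elk_dynamics(f_val, A_jac, s_prev, k=0.0)
assert np.allclose(A0 @ s_prev + b0, f_val)

# k = 1: A_t = 0, so the recurrence just copies f(s_prev) forward,
# a (costly) form of sequential evaluation.
A1, b1 = scale_elk_dynamics(f_val, A_jac, s_prev, k=1.0)
assert np.allclose(A1, 0.0) and np.allclose(b1, f_val)

a, _ = clip_elk_dynamics(f_val, np.diag(A_jac), s_prev)
assert np.all(np.abs(a) <= 1.0)            # stable by construction
```

Either modified $(A_t, b_t)$ pair is then fed to the same parallel scan used by DEER; no Kalman filtering pass is needed.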
Clip-ELK applies to the "quasi" diagonal approximation only, and simply clips each element of $A_t$ (which in this setting is a diagonal matrix) to lie in $[-1, 1]$. Clip-ELK also converges globally by Proposition 3.1. Moreover, by design, it ensures that each iteration of clip-ELK is a stable LDS. Clipping can also be done to some hyperparameter range $[-\underline{\rho}, \bar{\rho}]$ for $\underline{\rho}, \bar{\rho} \leq 1$.

4.6 Conclusion

ELK presents a beautiful connection between dynamics (Kalman filtering and smoothing) and optimization (the Levenberg-Marquardt, or trust-region, methods) to parallelize dynamical systems in a stable way. In the experiments of Section 4.4, we show that the intermediate ELK iterates are much more stable than the DEER iterates. Interestingly, at early iterations (around 100), even though ELK has not recovered the exact trace $s^\star_{1:T}$, Figures 16 and 18 show that it qualitatively appears to capture the right "manifold" of the dynamics. For this reason, ELK could prove very useful for "early stopping" of these parallel Newton methods in the context of parallelizing MCMC chains, or in general whenever the desired output of a procedure is a distribution rather than an exact trajectory.

Nevertheless, as shown in Table 4, even at the edge of stability, all of the parallel Newton methods struggle to achieve parity with, let alone beat, the speed of sequential evaluation for obtaining an exact trajectory. These difficulties raise an important question: are there dynamical systems that cannot be parallelized efficiently? We answer this question in the next part of this thesis, which provides a thorough account of the convergence rates of these parallel Newton methods.

Part III. Theory: Convergence Rates

The third part of this thesis presents its theoretical contributions. We present the first detailed analysis of the convergence rates of these parallel Newton methods.
In particular, we show that the predictability of the dynamics is the primary determinant of the convergence rate of the method. Furthermore, we show how a wide range of fixed-point methods in use for parallelizing sequential computation can be unified in the quasi-DEER framework, and how the quality of the quasi-DEER approximation in this framework affects the convergence rates of different fixed-point methods on different problems.

Figure 19: Predictability enables parallelization. Predictable dynamics yield well-conditioned merit functions, enabling rapid convergence. Unpredictable dynamics produce flat or ill-conditioned merit landscapes, resulting in slow convergence or numerical failure.

5 Convergence Rates of Gauss-Newton for Parallelizing Nonlinear SSMs

The previous chapters developed practical algorithms for parallelizing nonlinear SSMs. A natural question arises: which systems admit efficient parallelization? This chapter establishes a fundamental connection between the dynamics of a system and the difficulty of the resulting optimization problem (minimizing the merit function defined in equation (10)). Our central result is that predictability enables parallelization: systems whose future states can be reliably predicted from past states admit efficient parallel evaluation, while chaotic systems do not.

In particular, we establish a precise relationship between a system's dynamics and the conditioning of its corresponding optimization problem, as measured by its Polyak-Łojasiewicz (PL) constant. We show that the predictability of a system, defined as the degree to which small perturbations in state influence future behavior and quantified by the largest Lyapunov exponent (LLE), determines the number of optimization steps required for evaluation.
For predictable systems, the state trajectory can be computed in at worst $O((\log T)^2)$ time, where $T$ is the sequence length: a major improvement over the conventional sequential approach. One factor of $\log T$ comes from the computational cost of each Gauss-Newton step, which uses a parallel scan. The other factor of $\log T$ comes from the number of Gauss-Newton steps needed to converge, which yields the interpretation that a predictable nonlinear SSM can be thought of as a stack of $O(\log T)$ LDSs. In contrast, chaotic or unpredictable systems exhibit poor conditioning, with the consequence that parallel evaluation converges too slowly to be useful. Importantly, our theoretical analysis shows that predictable systems always yield well-conditioned optimization problems, whereas unpredictable systems lead to severe conditioning degradation. We validate our claims through extensive experiments, providing practical guidance on when nonlinear dynamical systems can be efficiently parallelized. We highlight predictability as a key design principle for parallelizable models.

5.1 Predictability and the Largest Lyapunov Exponent

Predictability is usually defined through its antonym: unpredictability [141, 217]. Unpredictable systems are dynamical systems whose future behavior is highly sensitive to small perturbations. The system's intrinsic sensitivity amplifies small perturbations and leads to massive divergence of trajectories. A common example is a chaotic system, like the weather: a butterfly flapping its wings in Tokyo today can lead to a thunderstorm in Manhattan next month [141, 217]. Given a snapshot of the current atmospheric state, weather models can provide accurate forecasts over short time horizons, typically a few days.
However, predictions degrade rapidly beyond that, as the system's intrinsic sensitivity amplifies small uncertainties in the initial snapshot [151]. By contrast, predictable systems [152, 223] are those in which small perturbations are forgotten: small perturbations are diminished over time, rather than amplified. A familiar example is aviation: a patch of choppy air rarely makes an airplane land at the wrong airport.

The notion of (un)predictability can be formalized through various routes, such as chaos theory [74, 198] and contraction analysis [31, 150]. We provide a definition of predictability in terms of the Largest Lyapunov Exponent (LLE) [186, 217]:

Definition 5.1 (Predictability and Unpredictability). Consider a sequence of Jacobians $A_1, A_2, \ldots, A_T$. We define the associated Largest Lyapunov Exponent (LLE) to be
$$\mathrm{LLE} := \lim_{T \to \infty} \frac{1}{T} \log\big(\| A_T A_{T-1} \cdots A_1 \|\big) = \lambda, \tag{31}$$
where $\|\cdot\|$ is an induced operator norm. If $\lambda < 0$, we say that the nonlinear state space model is predictable at $s_0$. Otherwise, we say it is unpredictable.

Suppose we wish to evaluate a nonlinear SSM (1) from an initial condition $s_0$, but we only have access to an approximate measurement $s'_0$ that differs slightly from the true initial state. If the system is unpredictable ($\lambda > 0$), then the distance between nearby trajectories grows as
$$\|s_t - s'_t\| \sim e^{\lambda t}\, \|s_0 - s'_0\|. \tag{32}$$
Letting $\Delta$ denote the maximum acceptable deviation beyond which we consider the prediction to have failed, the time horizon over which the prediction remains reliable scales as
$$\text{Time to degrade to } \Delta \text{ prediction error} \sim \frac{1}{\lambda} \log \frac{\Delta}{\|s_0 - s'_0\|}. \tag{33}$$
This relationship highlights a key limitation of unpredictable systems: even significant improvements in the accuracy of the initial state estimate yield only logarithmic gains in prediction time.
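For a finite sequence of Jacobians, the limit in equation (31) can be estimated numerically by rescaling the running product at each step to avoid overflow. A sketch (`lle_estimate` is an illustrative helper, not thesis code):

```python
import numpy as np

def lle_estimate(jacobians):
    """Estimate (1/T) log ||A_T ... A_1||, rescaling the running product
    at every step so the accumulated norm never overflows."""
    log_norm = 0.0
    P = np.eye(jacobians[0].shape[0])
    for A in jacobians:
        P = A @ P
        scale = np.linalg.norm(P, 2)   # spectral norm of the running product
        log_norm += np.log(scale)
        P = P / scale
    return log_norm / len(jacobians)

# Contracting dynamics A_t = 0.5 I: LLE = log(0.5) < 0, i.e. predictable.
jacs = [0.5 * np.eye(3) for _ in range(200)]
assert np.isclose(lle_estimate(jacs), np.log(0.5))
```

The rescaling makes the accumulation exact in exact arithmetic: the logs of the per-step scales telescope to $\log \|A_T \cdots A_1\|$.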
The system's inherent sensitivity to initial conditions overwhelms any such improvements. Predictable systems, such as contracting systems, have the opposite property: trajectories initially separated by some distance will eventually converge toward one another (Figure 19), improving prediction accuracy over time.

The sign of $\lambda$ determines the system's qualitative behavior:

• $\lambda < 0$ (predictable): Perturbations decay exponentially. Small errors in initial conditions have diminishing effects on future states. Examples include stable linear systems and contractive nonlinear maps.

• $\lambda = 0$ (marginal): Perturbations neither grow nor decay on average. This is the boundary between predictable and chaotic dynamics.

• $\lambda > 0$ (chaotic): Perturbations grow exponentially. The system exhibits sensitive dependence on initial conditions, the hallmark of chaos. Small errors rapidly amplify, making long-term prediction impossible.

We will show that the predictability of the dynamics directly governs the conditioning of the corresponding merit function
$$\mathcal{L}(s_{1:T}) := \frac{1}{2} \| r(s_{1:T}) \|_2^2. \tag{34}$$
To show this rigorously, in the next section we introduce the Polyak-Łojasiewicz (PL) constant $\mu$ to quantify the conditioning (flatness) of $\mathcal{L}$.

5.2 Polyak-Łojasiewicz and Merit Landscape Conditioning

Chewi and Stromme [36] state that "the Polyak-Łojasiewicz (PL) condition forms the cornerstone of modern non-convex optimization." Also known as gradient dominance, the PL condition [62, 121, 176, 187] is simple: a function $\mathcal{L}(s)$ is $\mu$-PL if it satisfies, for $\mu > 0$,
$$\frac{1}{2} \|\nabla \mathcal{L}(s)\|^2 \geq \mu \big( \mathcal{L}(s) - \mathcal{L}(s^\star) \big) \tag{35}$$
for all $s$. The largest $\mu$ for which equation (35) holds for all $s$ is called the PL constant of $\mathcal{L}$. In general, the PL condition can be difficult to use if the minimum value $\mathcal{L}(s^\star)$ is not known in advance.
However, $\mathcal{L}(s^\star) = 0$ in all applications of parallel Newton methods in this thesis, allowing for further simplification of equation (35). PL is a form of gradient dominance because equation (35) requires that if we are far from the true minimum (i.e., if $\mathcal{L}(s) - \mathcal{L}(s^\star)$ is large), then the gradient must be large as well. The PL constant $\mu$ can therefore be thought of as a measure of the "flatness" of the merit function: as $\mu \to 0$, the magnitude of the gradient becomes smaller and smaller as the merit function landscape becomes flatter and flatter, as shown in Figure 20.

Figure 20: The PL constant $\mu$ captures the flatness of the merit function landscape. We provide a schematic illustrating how a smaller PL constant $\mu$ results in a flatter merit function landscape.

All of the intuition and results about PL functions apply to parallel Newton methods because the merit function defined in equation (10) satisfies equation (35). In fact, this result is known in the literature for general sum-of-squares functions [176]:

Proposition 5.2. The merit function $\mathcal{L}(s)$ defined in equation (10) satisfies equation (35) with
$$\mu := \inf_s \sigma^2_{\min}\big(J(s)\big). \tag{36}$$

Proof. Observe that $\nabla \mathcal{L}(s) = J(s)^\top r(s)$ and $\mathcal{L}(s^*) = 0$. Substituting these expressions into the PL inequality (35), we obtain
$$r^\top J(s)\, J(s)^\top r \geq \mu\, r^\top r.$$
Therefore, if $J$ is full rank, the merit function $\mathcal{L}$ is $\mu$-PL with
$$\mu = \inf_s \lambda_{\min}\big(J(s) J(s)^\top\big) = \inf_s \sigma^2_{\min}\big(J(s)\big). \qquad \square$$

Consequently, the merit function in equation (34), which is minimized by parallel Newton methods, satisfies a number of desirable properties. For example, the merit function is invex, meaning that all stationary points are global minima.
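The inequality at the heart of Proposition 5.2 is just $\|J^\top r\| \geq \sigma_{\min}(J)\, \|r\|$, which is easy to sanity-check numerically (an illustrative sketch, not thesis code):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
J = rng.normal(size=(n, n))
sigma_min = np.linalg.svd(J, compute_uv=False)[-1]
mu = sigma_min ** 2                       # PL constant from Proposition 5.2

# For the merit function L(s) = 0.5 ||r(s)||^2 with gradient J^T r and
# minimum value 0, the PL inequality 0.5 ||grad||^2 >= mu * L reads
# r^T J J^T r >= mu * r^T r, which holds for every residual r:
for _ in range(100):
    r = rng.normal(size=n)
    grad = J.T @ r
    assert 0.5 * grad @ grad >= mu * 0.5 * (r @ r) - 1e-12
```

The bound is tight when $r$ aligns with the singular direction of $J^\top$ associated with $\sigma_{\min}$, which is why the infimum of $\sigma^2_{\min}$ over $s$ is exactly the PL constant.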
In other words, no optimizer of the merit function in equation (34) can get stuck in a local minimum or saddle point, because there are none: there is only the global minimizer $s^\star$. The PL condition implies invexity, but we can also see the invexity of $\mathcal{L}(s_{1:T})$ more directly: its gradient is $\nabla \mathcal{L}(s) = J(s)^\top r(s)$, and $J$ (defined in equation (17)) is always invertible. Therefore, the gradient can only be zero (a stationary point) when the residual is also zero, which occurs only at the true sequential rollout $s^\star_{1:T}$.

Another reason the PL condition is so important is that it is morally designed to be equivalent to a linear convergence rate for gradient descent. To provide this intuition, consider gradient flow on a loss function $\mathcal{L}(s)$, i.e., the time evolution of $\mathcal{L}$ subject to $s$ evolving according to $\dot{s} = -\nabla \mathcal{L}(s)$. Then, if $\mathcal{L}$ is $\mu$-PL with $\mathcal{L}(s^\star) = 0$,
$$\dot{\mathcal{L}} \;\underset{\text{chain rule}}{=}\; \nabla \mathcal{L} \cdot \dot{s} \;\underset{\text{def. of grad. flow}}{=}\; -\|\nabla \mathcal{L}\|^2 \;\underset{\text{PL condition}}{\leq}\; -2\mu \mathcal{L}.$$
Therefore, $\mathcal{L}(t) \leq \mathcal{L}(0) \exp(-2\mu t)$, which is a linear rate for a continuous-time system (i.e., the loss decays exponentially with the number of steps taken). Note that the size of $\mu$ determines the precise convergence rate, with smaller $\mu$ (flatter landscapes) converging more slowly. Converting this argument from gradient flow (continuous time) to gradient descent (discrete steps) is done in Theorem 1 of Karimi, Nutini, and Schmidt [121], and requires only an additional Lipschitzness assumption to account for the discrete step sizes. And, of course, by working backwards from the key desideratum of a linear rate, i.e., that $\mathcal{L}(t) \leq \mathcal{L}(0) \exp(-\gamma t)$ for some $\gamma > 0$, we can also derive the PL condition. Therefore, by showing that the merit function minimized by parallel Newton methods is PL, we show that it morally should achieve a linear rate with gradient descent, albeit with a rate controlled by the flatness $\mu$ of the landscape.
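The discrete-time counterpart can also be checked numerically on a linear least-squares merit function, where gradient descent with step size $1/\ell$ (for smoothness constant $\ell$) satisfies the bound $\mathcal{L}_k \leq (1 - \mu/\ell)^k \mathcal{L}_0$ from Karimi, Nutini, and Schmidt. A sketch under these standard assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
J = rng.normal(size=(n, n))
s_star = rng.normal(size=n)
y = J @ s_star                             # so the minimum loss is exactly 0
loss = lambda s: 0.5 * np.sum((J @ s - y) ** 2)

svals = np.linalg.svd(J, compute_uv=False)
mu, ell = svals[-1] ** 2, svals[0] ** 2    # PL constant and smoothness constant

s = rng.normal(size=n)
L0 = loss(s)
for k in range(1, 51):
    s = s - (1.0 / ell) * (J.T @ (J @ s - y))   # gradient descent, step 1/ell
    # Linear rate guaranteed by the PL condition:
    assert loss(s) <= (1.0 - mu / ell) ** k * L0 + 1e-12
```

For this quadratic the guarantee is easy to verify analytically as well: each error component along a singular direction $\sigma_i$ contracts by $(1 - \sigma_i^2/\ell) \leq (1 - \mu/\ell)$ per step, so the loss contracts at least as fast as the PL bound demands.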
Having introduced the key ingredients (dynamical predictability as quantified by the LLE, and merit function conditioning as quantified by the PL constant), we now combine them in the next section to show how dynamical properties impact the properties of $J$ and $\mathcal{L}$.

5.3 Conditioning Depends on Dynamical Properties

In this section, we provide two results showing how the key quantities of the parallel Newton problem, chiefly the Jacobian $J := \partial r / \partial s_{1:T}$ and the merit function $\mathcal{L}$, are determined by properties of the underlying dynamical system. In particular, in Theorem 5.3 we show that the conditioning of $J$ and $\mathcal{L}$ is determined by the predictability of the dynamics, while in Theorem 5.4 we show that the Lipschitzness of $J$ is controlled by the Lipschitzness of the dynamical Jacobians $A_t$. These two results facilitate the proof and interpretation of convergence rates for various parallel Newton methods.

5.3.1 The Merit Function PL Constant is Controlled by the Largest Lyapunov Exponent of the Dynamics

As stated earlier, the Largest Lyapunov Exponent is a commonly used way to define the (un)predictability of a nonlinear state space model. To proceed, we need to control more carefully how the product of Jacobian matrices in (31) behaves for finite-time products. We assume that there exists a "burn-in" period during which the norm of Jacobian products can transiently differ from the LLE rate. In particular, we assume that
$$\forall t > 1,\ \forall k \geq 0,\ \forall s: \qquad b\, e^{\lambda k} \leq \| A_{t+k-1} A_{t+k-2} \cdots A_t \| \leq a\, e^{\lambda k}, \tag{37}$$
where $a \geq 1$ and $b \leq 1$. The constant $a$ quantifies the potential for transient growth (overshoot) in the norm of Jacobian products before their long-term behavior emerges, while $b$ quantifies the potential for undershoot.

Theorem 5.3. Assume that the LLE regularity condition (37) holds. Then the PL constant $\mu$ satisfies
$$\frac{1}{a} \cdot \frac{e^{\lambda} - 1}{e^{\lambda T} - 1} \;\leq\; \sqrt{\mu} \;\leq\; \min\left\{ \frac{1}{b} \cdot \frac{1}{e^{\lambda (T-1)}},\ 1 \right\}. \tag{38}$$
Proof.
See Appendix B for the full proof and discussion; we provide a brief sketch here. Because $\sigma_{\min}(J) = 1/\sigma_{\max}(J^{-1})$, it suffices to control $\|J^{-1}\|_2$. We can write $J = I - N$, where $N$ is a nilpotent matrix; it follows that $J^{-1} = \sum_{k=0}^{T-1} N^k$. As we discuss further in Appendix B, the matrix powers $N^k$ are intimately related to the dynamics of the system. The upper bound on $\|J^{-1}\|_2$ follows after applying the triangle inequality and the formula for a geometric sum. The lower bound follows from considering $\|N^{T-1}\|_2$. $\square$

Theorem 5.3 is the main result of this chapter, offering a novel connection between the predictability $\lambda$ of a nonlinear state space model and the conditioning $\mu$ of the corresponding merit function, which governs whether the system can be effectively parallelized. If the underlying dynamics are unpredictable ($\lambda > 0$), then the merit function quickly becomes poorly conditioned with increasing $T$, because the denominators of both the lower and upper bounds explode due to the exponentially growing factor. Predictable dynamics ($\lambda < 0$) lead to good conditioning of the optimization problem, and parallel methods based on merit function minimization can be expected to perform well in these cases. Indeed, when $\lambda < 0$, the conditioning of the merit function becomes asymptotically independent of the sequence length $T$, due to the exponentially shrinking factor.

The proof mechanism we have sketched bounds $\|J^{-1}\|_2$ above and below in terms of norms of Jacobian products. We only use the assumption in equation (37) to express those bounds in terms of $\lambda$. As we discuss at length in Appendix B, we can substitute different assumptions for equation (37) to obtain similar results. Theorem 5.3 and its proof should thus be thought of as a framework, into which different assumptions (which may be more or less relevant in different settings) can be plugged to yield specific results.
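The structure in this sketch is easy to probe numerically for scalar dynamics with constant Jacobian $a$ (so $\lambda = \log |a|$), where $J$ is bidiagonal with $-a$ on the subdiagonal (illustrative code, not from the thesis):

```python
import numpy as np

def residual_jacobian(a, T):
    """J = I - N for scalar dynamics s_t = a * s_{t-1} + b_t:
    ones on the diagonal, -a on the first subdiagonal."""
    return np.eye(T) - a * np.eye(T, k=-1)

sigma_min = lambda M: np.linalg.svd(M, compute_uv=False)[-1]

T = 40
stable = sigma_min(residual_jacobian(0.5, T))    # lambda = log 0.5 < 0
chaotic = sigma_min(residual_jacobian(1.5, T))   # lambda = log 1.5 > 0

# Predictable dynamics keep the merit function well conditioned
# (here sigma_min >= 1/2 independent of T, since ||J^{-1}|| <= sum 0.5^k < 2),
# while unpredictable dynamics flatten it exponentially in T:
assert stable > 0.3
assert chaotic < 1e-5
```

The chaotic case collapses because $J^{-1}$ contains the entry $a^{T-1} = 1.5^{39} \approx 7 \times 10^6$, forcing $\sigma_{\min}(J) = 1/\|J^{-1}\|_2$ to be correspondingly tiny, exactly the mechanism behind the lower bound via $\|N^{T-1}\|_2$.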
Why unpredictable systems have excessively flat merit functions. Theorem 5.3 demonstrates that the merit function becomes extremely flat for unpredictable systems and long trajectories. This flatness poses a fundamental challenge for any method that seeks to compute state trajectories by minimizing the merit function. We now provide further intuition for why unpredictability in the system naturally leads to a flat merit landscape.

Suppose that we use an optimizer to minimize the merit function (34) for an unpredictable system until it halts at some precision. Assume that the first state of the optimizer's output following the initial condition is $\varepsilon$-close to the true first state, $\|s_1 - s^*_1\| = \varepsilon$, and that the residuals for all times greater than one are exactly zero; in other words, the optimizer outputs an exact trajectory of the dynamics seeded at $s_1$. Then, since $s^*_1 = f(s_0)$, the squared residual norm is
$$\|r(s)\|^2 = \|s_1 - f(s_0)\|^2 = \|s_1 - s^*_1\|^2 = \varepsilon^2.$$
However, since $s_t$ and $s^*_t$ are by construction both trajectories of an unpredictable system starting from the slightly different initial conditions $s_1$ and $s^*_1$, the distance between them grows exponentially, as a consequence of equation (32). By contrast, predictable systems have errors that shrink exponentially. This shows that changing the initial state $s_1$ by a small amount can lead to a massive change in the trajectory of an unpredictable system, but only a tiny change in the merit function. Geometrically, this corresponds to the merit function landscape of unpredictable systems being excessively flat around the true solution (Figure 19, bottom right panel). Predictable systems do not exhibit such flatness, since small residuals imply small errors. Theorem 5.3 formalizes this idea.
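This intuition can be checked directly on a chaotic one-dimensional example, the logistic map $x \mapsto 4x(1-x)$, whose LLE is $\log 2 > 0$ (an illustrative sketch):

```python
import numpy as np

f = lambda x: 4.0 * x * (1.0 - x)       # chaotic logistic map, LLE = log 2 > 0
T, eps = 60, 1e-8
s0 = 0.3

def rollout(x1):
    """Exact trajectory of the dynamics seeded at x1, so only the first
    residual s_1 - f(s_0) can be nonzero."""
    traj = [x1]
    for _ in range(T - 1):
        traj.append(f(traj[-1]))
    return np.array(traj)

true_traj = rollout(f(s0))
pert_traj = rollout(f(s0) + eps)

residual_norm = abs(pert_traj[0] - f(s0))          # the only nonzero residual
deviation = np.max(np.abs(pert_traj - true_traj))  # trajectory divergence

assert residual_norm <= 2 * eps       # the merit function barely changes...
assert deviation > 1e-2               # ...while the trajectory changes massively
```

A perturbation of $10^{-8}$ in $s_1$ leaves the merit function at roughly $\varepsilon^2/2 \approx 5 \times 10^{-17}$, yet the trajectory itself diverges to $O(1)$ error within a few dozen steps: the flat landscape made concrete.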
5.3.2 The Residual Function Jacobian Inherits the Lipschitzness of the Nonlinear State Space Model

In addition to the parameter μ, which measures the conditioning of the merit function, the difficulty of minimizing the merit function is also influenced by the Lipschitz continuity of its Jacobian J. The following theorem establishes how the Lipschitz continuity of the underlying sequence model induces Lipschitz continuity in J.

Theorem 5.4. If the dynamics of the underlying nonlinear state space model have L-Lipschitz Jacobians, i.e.,

∀t > 1, s, s′ ∈ R^D: ∥A_t(s) − A_t(s′)∥ ⩽ L∥s − s′∥,

then the residual function Jacobian J is also L-Lipschitz, with the same L.

Proof. By assumption, for each t,

∀s, s′ ∈ R^D: ∥A_t(s_t) − A_t(s′_t)∥_2 ⩽ L∥s_t − s′_t∥_2.

Define D_t := A_t(s′_t) − A_t(s_t) and D := J(s′) − J(s). Since D places the blocks D_t along one subdiagonal, we have ∥D∥_2 = max_t ∥D_t∥_2. But each block D_t satisfies the Lipschitz bound ∥D_t∥_2 ⩽ L∥s′_t − s_t∥_2, so

∥D∥_2 = max_t ∥D_t∥_2 ⩽ L max_t ∥s′_t − s_t∥_2 ⩽ L∥s′ − s∥_2.

Hence ∥J(s′) − J(s)∥_2 = ∥D∥_2 ⩽ L∥s′ − s∥_2, so J is L-Lipschitz.

Theorem 5.4 will be important for the analysis in Section 5.4, where we consider convergence rates. Because Gauss-Newton methods rely on iteratively linearizing the dynamics (or, equivalently, the residual), they converge in a single step for linear dynamics (L = 0), and converge more quickly the closer the system is to linear (the closer L is to 0).

5.4 Rates of Convergence for Optimizing the Merit Function

In Section 5.3, we established that the predictability of the nonlinear state space model directly influences the conditioning of the merit function.
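The key step in the proof, that the spectral norm of a block-subdiagonal matrix equals the largest spectral norm among its blocks, can be checked numerically (a small sketch of our own; the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 6, 4
blocks = [rng.normal(size=(D, D)) for _ in range(T - 1)]

# Place the blocks D_t on the first block subdiagonal of a TD x TD matrix,
# mirroring the structure of D = J(s') - J(s).
M = np.zeros((T * D, T * D))
for t, B in enumerate(blocks):
    M[(t + 1) * D:(t + 2) * D, t * D:(t + 1) * D] = B

# M M^T is block diagonal with blocks D_t D_t^T (plus one zero block), so the
# spectral norm of the whole matrix is the largest block spectral norm.
assert np.isclose(np.linalg.norm(M, 2), max(np.linalg.norm(B, 2) for B in blocks))
```

The comment explains why: M M^⊤ is block diagonal, so the singular values of M are exactly the union of the singular values of the blocks.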
This insight is critical for analyzing any optimization method used to compute trajectories via minimization of the merit function. In this section, we apply those results to study the convergence behavior of the Gauss-Newton (DEER) algorithm for the merit function defined in equation (34). We derive worst-case bounds on the number of optimization steps required for convergence. In addition, we present an average-case analysis of DEER that is less conservative than the worst-case bounds and more consistent with empirical observations.

5.4.1 DEER Always Converges Globally at a Linear Rate

Although DEER is based on the Gauss-Newton method, which generally lacks global convergence guarantees, we prove that DEER always converges globally at a linear rate. This result relies on the problem's specific hierarchical structure, which ensures that both the residual function Jacobian J and its inverse are lower block-triangular. In particular, we prove the following theorem:

Theorem 5.5. Let the DEER (Gauss-Newton) updates be given by equation (16), and let s^{(i)} denote the i-th iterate. Let e^{(i)} := s^{(i)} − s* denote the error at iteration i, and assume the regularity condition in equation (37). Then the error converges to zero at a linear rate:

∥e^{(i)}∥_2 ⩽ χ_w β^i ∥e^{(0)}∥_2,

for some constant χ_w ⩾ 1 independent of i, and a convergence rate 0 < β < 1.

Proof. Our general strategy for deriving DEER convergence bounds is to fix a weighted norm ∥e∥_W := ∥W^{1/2} e∥_2, for a symmetric positive definite matrix W. Doing so induces the operator norm ∥J∥_W := ∥W^{1/2} J W^{-1/2}∥_2 such that each DEER step is a contraction in this norm, with contraction factor β ∈ [0, 1). This implies that the DEER error iterates decay to zero at a linear rate, as

∥e^{(i)}∥_W ⩽ β^i ∥e^{(0)}∥_W, (39)

i.e., ∥W^{1/2} e^{(i)}∥_2 ⩽ β^i ∥W^{1/2} e^{(0)}∥_2.
Using the above equation and properties of singular values, it follows that

√λ_min(W) ∥e^{(i)}∥_2 ⩽ β^i √λ_max(W) ∥e^{(0)}∥_2.

Therefore, to convert the linear rate in equation (39) back to standard Euclidean space, we incur an additional multiplicative factor that depends on the conditioning of W^{1/2}:

∥e^{(i)}∥_2 ⩽ χ_w β^i ∥e^{(0)}∥_2, where χ_w := √(λ_max(W)/λ_min(W)). (40)

DEER as a Contraction Mapping. Recall that the DEER (Gauss-Newton) updates are given by

s^{(i+1)} = s^{(i)} − J^{-1}(s^{(i)}) r(s^{(i)}).

Recalling that r(s*) = 0 and subtracting the fixed point s* from both sides, we have

e^{(i+1)} = e^{(i)} − J^{-1}(s^{(i)}) r^{(i)} + J^{-1}(s^{(i)}) r(s*) = e^{(i)} − J^{-1}(s^{(i)}) (r(s^{(i)}) − r(s*)).

By the mean value theorem, this equation can be written as

e^{(i+1)} = (I − J^{-1}(s^{(i)}) B^{(i)}) e^{(i)}, where B^{(i)} := ∫_0^1 J(s* + τ e^{(i)}) dτ.

From this, we can conclude that the DEER iterates will converge (i.e., the error shrinks to zero) if

∥I − J^{-1}(s^{(i)}) B^{(i)}∥_W = ∥J^{-1}(s^{(i)}) (J(s^{(i)}) − B^{(i)})∥_W ⩽ β < 1. (41)

Constructing the Weighted Norm. We choose a diagonal weighted norm, given by

W := Diag(I_D, w^2 I_D, ..., w^{2(T-1)} I_D) ∈ R^{TD×TD}, w > 0. (42)

Under the norm induced by (42) we have

∥J(s^{(i)}) − B^{(i)}∥_W ⩽ 2wρ, (43)
∥J^{-1}(s^{(i)})∥_W ⩽ a (1 − (w e^λ)^T) / (1 − w e^λ), (44)

where ρ upper bounds ∥J∥_2 over all states in the DEER optimization trajectory.
Multiplying (43) and (44) yields

∥J^{-1}(s^{(i)})∥_W ∥J(s^{(i)}) − B^{(i)}∥_W ⩽ 2awρ (1 − (w e^λ)^T) / (1 − w e^λ). (45)

To ensure the right-hand side of (45) does not exceed a prescribed β ∈ [0, 1), choose

w = β / (2ρa + β e^λ). (46)

With this choice, w e^λ < 1 and

2awρ / (1 − w e^λ) = β, (47)

so the geometric series in (44) is convergent and the bound in (45) holds for all T, because

∥J^{-1}(s^{(i)})∥_W ∥J(s^{(i)}) − B^{(i)}∥_W ⩽ 2awρ (1 − (w e^λ)^T) / (1 − w e^λ) = β (1 − (w e^λ)^T) ⩽ β.

This shows that we can always pick a weighted norm in which DEER converges at a linear rate. Converting back into the standard Euclidean norm using (40) and substituting in the condition number of W^{1/2}, one finds that

∥e^{(i)}∥_2 ⩽ ((2ρa + β e^λ)/β)^T β^i ∥e^{(0)}∥_2. (48)

Thus, the DEER error converges to zero at a linear rate.

Theorem 5.5 is unexpected since, in general, Gauss-Newton methods do not enjoy global convergence. The key caveat of this theorem is the multiplicative factor χ_w, which can grow exponentially with the sequence length T. This factor governs the extent of transient error growth before the decay term β^i eventually dominates.

Theorem 5.5 has several useful, practical consequences. First, when the nonlinear state space model is sufficiently contracting (λ is sufficiently negative), χ_w in Theorem 5.5 can be made small, implying that in this case DEER converges with little-to-no overshoot. Theorem 5.5 also lets us establish key worst-case and average-case bounds on the number of steps needed for Gauss-Newton to converge to within a given distance of the solution. In particular, when χ_w does not depend on the sequence length T, Theorem 5.5 implies that Gauss-Newton requires only O((log T)^2) total computational time, with one log factor coming from the parallel scan at each optimization step and the other coming from the total number of optimization steps needed.
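The DEER iteration analyzed above can be sketched in a few lines of numpy (our own toy implementation, with a contractive tanh RNN of our choosing; the real method applies J^{-1} with a parallel associative scan rather than a dense solve):

```python
import numpy as np

rng = np.random.default_rng(1)
T, D = 30, 3
W = 0.4 * rng.normal(size=(D, D)) / np.sqrt(D)   # contractive dynamics (small weights)
s0 = rng.normal(size=D)

def residual(s):
    """r_t(s) = s_t - tanh(W s_{t-1}); s has shape (T, D)."""
    prev = np.vstack([s0, s[:-1]])
    return (s - np.tanh(prev @ W.T)).ravel()

def deer_step(s):
    """One Gauss-Newton (DEER) step: solve J delta = -r with a dense solve."""
    prev = np.vstack([s0, s[:-1]])
    J = np.eye(T * D)
    for t in range(1, T):
        # Jacobian of f(s) = tanh(W s) evaluated at the previous state.
        A_t = (1.0 - np.tanh(W @ prev[t]) ** 2)[:, None] * W
        J[t * D:(t + 1) * D, (t - 1) * D:t * D] = -A_t
    return s + np.linalg.solve(J, -residual(s)).reshape(T, D)

s = rng.normal(size=(T, D))          # random initialization s^{(0)}
for _ in range(T):                   # at most T steps are ever needed
    s = deer_step(s)
print(np.linalg.norm(residual(s)))   # essentially zero after a few iterations
```

Because these dynamics are contractive, the residual collapses to machine precision in far fewer than T iterations, consistent with the linear (and eventually quadratic) rates discussed in this section.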
5.4.2 Size of the DEER Basin of Quadratic Convergence

It is natural that DEER depends on the Lipschitzness of J, since Gauss-Newton converges in one step for linear problems, where L = 0. In Section 5.3, we showed that the conditioning of the merit function, as measured by the PL-constant μ, depends on the stability, or predictability, of the nonlinear dynamics. Thus, the performance of DEER depends on the ratio of the nonlinearity and the stability of the underlying nonlinear state space model. Note that once s is inside the basin of quadratic convergence, it takes O(log log(1/ε)) steps to reach residual ε (effectively a constant number of steps). Because DEER converges so quickly within its basin of quadratic convergence, it is important to understand the size of this basin in terms of the properties of the underlying SSM we are trying to parallelize. We provide such a bound in Theorem 5.6.

We make no claim about the originality of lower bounding the size of the basin of quadratic convergence of Gauss-Newton. In fact, our proof of Theorem 5.6 closely follows the convergence analysis of Newton's method in Section 9.5.3 of Boyd and Vandenberghe [26]. Our contribution is to highlight the elegant way in which the predictability λ and the nonlinearity L of a dynamical system influence an important feature of its merit function's landscape.

Theorem 5.6. Let μ denote the PL-constant of the merit function, which Theorem 5.3 relates to the LLE λ. Let L denote the Lipschitz constant of the Jacobian of the dynamics function A(s). Then 2μ/L lower bounds the radius of the basin of quadratic convergence of DEER; that is, if

∥r(s^{(i)})∥_2 < 2μ/L, (49)

then s^{(i)} is inside the basin of quadratic convergence. In terms of the LLE λ, it follows that if

∥r(s^{(i)})∥_2 < (2/(a^2 L)) ((e^λ − 1)/(e^{λT} − 1))^2,

then s^{(i)} is inside the basin of quadratic convergence.

Proof.
Suppose we are at a point s^{(i)} ∈ R^{TD} (i.e., DEER iterate i), and we want to get to s^{(i+1)}. The change in the trajectory is

Δs^{(i)} := −J(s^{(i)})^{-1} r(s^{(i)})

(where the iteration number will hopefully be clear from context). The merit function is L(s) = (1/2)∥r(s)∥_2^2, so if we can get some control over ∥r(s^{(i)})∥_2, we will be well on our way to proving a quadratic rate of convergence.

First, leveraging the form of the Gauss-Newton update, we can simply "add zero" to write

r(s^{(i+1)}) = r(s^{(i)} + Δs^{(i)}) = r(s^{(i)} + Δs^{(i)}) − r(s^{(i)}) − J(s^{(i)}) Δs^{(i)}.

Next, we can write the difference r(s^{(i)} + Δs^{(i)}) − r(s^{(i)}) as the integral of the Jacobian, i.e.,

r(s^{(i)} + Δs^{(i)}) − r(s^{(i)}) = ∫_0^1 J(s^{(i)} + τ Δs^{(i)}) Δs^{(i)} dτ.

Therefore,

r(s^{(i+1)}) = ∫_0^1 (J(s^{(i)} + τ Δs^{(i)}) − J(s^{(i)})) Δs^{(i)} dτ.

Taking ℓ_2-norms and using the triangle inequality, it follows that

∥r(s^{(i+1)})∥_2 ⩽ ∫_0^1 ∥(J(s^{(i)} + τ Δs^{(i)}) − J(s^{(i)})) Δs^{(i)}∥_2 dτ.

Now, if we assume that J is L-Lipschitz and use the definition of the spectral norm, it follows that

∥(J(s^{(i)} + τ Δs^{(i)}) − J(s^{(i)})) Δs^{(i)}∥_2 ⩽ τ L ∥Δs^{(i)}∥_2^2,

and so taking the integral we obtain

∥r(s^{(i+1)})∥_2 ⩽ (L/2) ∥Δs^{(i)}∥_2^2 = (L/2) r(s^{(i)})^⊤ J(s^{(i)})^{-⊤} J(s^{(i)})^{-1} r(s^{(i)}).

By definition, √μ is a lower bound on all singular values of J(s^{(i)}), for all i. Therefore ∥J(s^{(i)})^{-1}∥_2 ⩽ 1/√μ for all i, and it follows that

∥r(s^{(i+1)})∥_2 ⩽ (L/(2μ)) ∥r(s^{(i)})∥_2^2, (50)

which is the direct analogue of equation (9.33) of Boyd and Vandenberghe [26]. To reiterate, here L is the Lipschitz constant of J, while μ := inf_{i∈N} σ_min^2(J(s^{(i)})).

While (50) is a quadratic convergence result for Gauss-Newton, it is not useful unless ∥r(s^{(i+1)})∥_2 ⩽ ∥r(s^{(i)})∥_2 (i.e., unless backtracking line search would accept the update). However, if ∥r(s^{(i)})∥_2 < 2μ/L, then every step guarantees a reduction in r, because in that case ∥r(s^{(i+1)})∥_2 < ∥r(s^{(i)})∥_2.
Therefore, ∥r(s^{(j)})∥_2 < 2μ/L for all j > i. Thus, we have related the size of the basin of quadratic convergence of Gauss-Newton on the DEER objective to the properties of J.

Note that with linear dynamics, each A_t is constant in s, and so each A_t is 0-Lipschitz. Thus, the basin of quadratic convergence becomes infinite. Intuitively, the less quickly A_t changes with s, the more potent a method DEER becomes.

5.5 Experiments

We conduct experiments to support the theory developed above, demonstrating that predictability enables parallelization of nonlinear SSMs. To illustrate this point, we use Gauss-Newton optimization (aka DEER). Our code is at https://github.com/lindermanlab/predictability_enables_parallelization

5.5.1 The Convergence Rate Exhibits a Threshold between Predictable and Chaotic Dynamics

Figure 21: Threshold phenomenon in DEER convergence based on system predictability. In a family of RNNs, DEER converges quickly for predictable systems and prohibitively slowly for chaotic systems. Left (Theory): We depict Theorem 5.3, illustrating how the conditioning of the optimization problem degrades as T and the LLE (λ) increase. Center (Experiment): We vary λ across the family of RNNs, and observe a striking concordance between the number of DEER optimization steps empirically needed for convergence and our theoretical characterization of the conditioning of the optimization problem. Right: For 20 seeds, each with 50 different values of λ, we plot the relationship between λ and the number of DEER steps needed for convergence at sequence length T = 1000 (gray line in left and center panels).
We observe a sharp increase in the number of optimization steps at precisely the transition between predictability and unpredictability. Theorem 5.3 predicts a sharp phase transition in the conditioning of the merit function at λ = 0, which should be reflected in the number of optimization steps required for convergence. To empirically validate this prediction, we vary both the LLE and the sequence length T within a parametric family of recurrent neural networks (RNNs), and measure the number of steps DEER takes to converge.

We generate mean-field RNNs following Engelken, Wolf, and Abbott [56], scaling standard normal weight matrices by a single parameter that controls their variance and therefore the expected LLE. In more detail, we rolled out trajectories from a mean-field RNN with step size 1 for 20 different random seeds. The dynamics take the form

s_{t+1} = W tanh(s_t) + u_t,

for mild sinusoidal inputs u_t. We have s_t ∈ R^D, where in our experiments D = 100. Note that because of the placement of the saturating nonlinearity, here s_t represents current, not voltage. We draw each entry W_ij iid ∼ N(0, g^2/D), where g is a scalar parameter. We then set W_ii = 0 for all i (no self-coupling of the neurons).

A key point of Engelken, Wolf, and Abbott [56] is that by scaling the single parameter g, the resulting RNN transitions from predictable to chaotic behavior. While Engelken, Wolf, and Abbott [56] compute the full Lyapunov spectrum in the limit D → ∞, for finite D we can compute a very accurate numerical approximation to the LLE. In particular, we use Algorithm 2 to compute the LLE in a numerically stable way. Note that the algorithm nominally depends on the initial unit vector u_0. For this reason, we choose 3 different unit vectors (initialized at random on the unit sphere) and average over the 3 stochastic estimates.
However, in practice we observe that the estimate is very stable with respect to the choice of u_0, and agrees with systems for which the true LLE is known, such as the Hénon and logistic maps.

Algorithm 2: Numerically Stable Computation of the Largest Lyapunov Exponent (LLE)
1: Input: initial unit vector u_0, total iterations T
2: Initialize: LLE ← 0
3: for t = 1 to T do
4:   Compute evolved vector: u_t ← J_t u_{t−1}
5:   Compute stretch factor: λ_t ← ∥u_t∥
6:   Normalize vector: u_t ← u_t / λ_t
7:   Accumulate logarithmic stretch: LLE ← LLE + log λ_t
8: Output: Estimated LLE λ ← LLE / T

In Figure 22, we verify numerically that there is a monotonic relationship between g and the LLE of the resulting system, and that the min-max range over 20 seeds is small. Accordingly, when making Figure 21 (Center), we use the monotonic relationship between g and the LLE from Figure 22 to map the average number of DEER steps (over 20 different seeds) needed for convergence at different values of g to the appropriate value of the LLE. We use 50 values of T from 9 to 9999 (log spaced) to make Figure 21 (Center). We highlight T = 1000 in Figure 21 (Right).

Overall, in Figure 21, we observe a striking correspondence between the conditioning of the optimization problem (represented by −log μ̃, where μ̃ is the lower bound for μ from Theorem 5.3) and the number of steps DEER takes to converge. This relationship holds across the range of LLEs, λ, and sequence lengths, T.

Figure 22: Robust relationship in the mean-field RNN between the variance parameter g and the LLE of the system. For 20 seeds, we observe a robust and non-decreasing relationship between the scalar parameter g and the LLE of the resulting mean-field RNN. The plot is made for 50 different values of g from 0.5 to 2.0 (linearly spaced). We estimate the LLE over a sequence length of T = 9999.
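Algorithm 2 translates directly into a few lines of numpy. Below is a minimal sketch (our own, not the thesis code), sanity-checked on the logistic map, one of the maps mentioned above with a known LLE:

```python
import numpy as np

def largest_lyapunov_exponent(jacobians, u0):
    """Algorithm 2: estimate the LLE from Jacobians J_1, ..., J_T evaluated
    along a trajectory, renormalizing at every step for numerical stability."""
    u = u0 / np.linalg.norm(u0)
    log_stretch = 0.0
    for J in jacobians:
        u = J @ u
        stretch = np.linalg.norm(u)
        u = u / stretch                  # keep u on the unit sphere
        log_stretch += np.log(stretch)   # accumulate log of the stretch factor
    return log_stretch / len(jacobians)

# Sanity check on the logistic map x -> 4x(1-x), whose true LLE is log 2 ~ 0.693.
x, jacs = 0.3, []
for _ in range(10_000):
    jacs.append(np.array([[4.0 * (1.0 - 2.0 * x)]]))  # 1x1 Jacobian f'(x_t)
    x = 4.0 * x * (1.0 - x)
print(largest_lyapunov_exponent(jacs, np.ones(1)))
```

Renormalizing at every step avoids the overflow or underflow that would result from multiplying the Jacobians together directly, which is the numerical-stability point of the algorithm.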
There is a rapid threshold phenomenon around λ = 0, which divides predictable from unpredictable dynamics, precisely as expected from Theorem 5.3. The correspondence between −log μ̃ and the number of optimization steps needed for convergence can be explained by the DEER iterates approaching the basin of quadratic convergence at a linear rate.

Wallclock Time and Other Optimizers. Our findings about the conditioning of the merit landscape apply to any solver. To show the generality of Proposition 3.1, we parallelize the sequential rollout of the mean-field RNN with other optimizers, such as quasi-Newton and gradient descent, and observe that the number of steps these optimizers take to converge also scales with the LLE. We also record wallclock times on an H100, and observe that DEER is faster than sequential evaluation by an order of magnitude in predictable settings, but slower by an order of magnitude in unpredictable settings. We summarize this experiment in Figure 23.

In the top panel of Figure 23, we observe that the number of steps for gradient descent and quasi-DEER to converge also scales monotonically with the LLE, as we expect from Theorem 5.3. DEER (Gauss-Newton) converges in a small number of steps all the way up to the threshold between predictability and unpredictability (λ = 0). Intuitively, the performance of the other optimizers degrades more quickly as unpredictability increases because quasi-Newton and gradient descent use less information about the curvature of the loss landscape. Even though gradient descent was slower to converge in this setting, we only tried gradient descent with a fixed step size.
Figure 23: Convergence rates and wallclock time for many optimizers. We supplement the mean-field RNN experiment by also considering quasi-Newton and gradient descent methods (top), and recording wallclock time, including for sequential evaluation (bottom).

An advantage of a first-order method like gradient descent over a second-order method like Gauss-Newton (DEER) is that the first-order method is embarrassingly parallel (and so, with sufficient parallel processors, the update runs in constant time), while DEER and quasi-DEER use parallel scans (and so the update runs in O(log T) time). Exploring accelerated first-order methods like Adam [127], or particularly Shampoo [89] or SOAP [227] (which are often preferred in recurrent settings like equation (1)), or in general trying to remove the parallel scan, are therefore very interesting directions for future work.

Sequential evaluation of equation (1) can also be thought of as block coordinate descent on the merit function L(s), where the block s_t ∈ R^D is optimized at optimization step t. The optimization of each block is a convex problem: simply minimize ∥s_t − f(s*_{t−1})∥_2^2, or equivalently set s_t = f(s*_{t−1}). As sequential evaluation always takes T steps to converge, we do not include it in the top panel of Figure 23.

In the bottom panel of Figure 23, we also report the wallclock times for these algorithms (our experiments are run on an H100 with 80 GB onboard memory). We observe that the run time of sequential evaluation (green) is effectively constant with respect to λ. In the predictable setting, DEER is an order of magnitude faster than sequential evaluation, while in the unpredictable regime, DEER is 1-2 orders of magnitude slower than sequential evaluation.
The practical importance of using parallel evaluation only in predictable settings is a core takeaway from our theoretical contributions.

We run the experiment in Figure 23 at a smaller scale than the experiment in Figure 21 (Right). In Figure 23, we consider 5 random seeds for 16 values of g equispaced between 0.5 and 2.0. Each wallclock time reported is the average of 5 runs for the same seed. We use a batch size of 1. DEER (Gauss-Newton) and quasi-DEER effectively do not have a step size (they always use a step size of 1), whereas gradient descent requires one. For each value of g, we ran gradient descent with step sizes α ∈ {0.01, 0.1, 0.25, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0}, and then picked the step size α that results in the fastest convergence of gradient descent. For the smallest value g = 0.5, we use α = 0.6; for g = 0.6, we use α = 0.5; and for all other values of g, we use α = 0.25. Future work may investigate more adaptive ways to tune the step size α, or the use of a learning rate schedule.

We use a larger tolerance of L(s)/T ⩽ 10^{-4} to declare convergence than in the rest of the paper (where we use a tolerance of 10^{-10}) because gradient descent often did not converge to the same degree of numerical precision as sequential evaluation, quasi-DEER, or DEER. However, this is a per-time-step average error on the order of 10^{-4}, in a system where D = 100 and each state has current on the order of 1. Nonetheless, it is an interesting direction for future work to investigate how to get gradient descent to converge to greater degrees of numerical precision in these settings, and, more generally, how to improve the performance of all of these parallel sequence evaluators in lower numerical precision.

5.5.2 DEER Can Converge Quickly for Predictable Trajectories Passing through Unpredictable Regions

Figure 24: DEER converges quickly for Langevin dynamics in a two-well potential. (Left) An illustration of the two-well potential state space in D = 2.
We superimpose a contour plot of the potential on a color scheme showing the spectral norm of the dynamics Jacobian (blue indicates stability, red instability). (Center) A trace plot for the y-coordinate. The LLE of the system is −0.0145. (Right) This system, which has negative LLE, enjoys sublinear scaling in the number of DEER iterations needed to converge as a function of the sequence length T. We plot the median number of DEER steps to convergence over 20 seeds.

DEER may still converge quickly even if the system is unpredictable in certain regions. As long as the system is predictable on average, as indicated by a negative LLE, DEER can still converge quickly. This phenomenon is why we framed Theorem 5.3 in terms of the LLE λ and burn-in constants a, as opposed to a weaker result that assumes the system Jacobians have singular values less than one over the entire state space.

To illustrate, we apply DEER to Langevin dynamics in a two-well potential (visualized in Figure 24 for D = 2). The dynamics are stable within each well but unstable in the region between them. Despite this local instability, the system's overall behavior is governed by the time spent in the wells, resulting in a negative LLE and sublinear growth in DEER's convergence steps with sequence length T (Figure 24, right subplot).

We form the two-well potential for our experiment in Section 5.5 as a sum of two quadratic potentials. Concretely, we define the potential φ as the negative log probability of a mixture of two Gaussians, one centered at (0, −1.4) and the other at (0, 1.6), both with diagonal covariance. In Langevin dynamics [65, 139] for a potential φ, the state s_t evolves according to

s_{t+1} = s_t − ε∇φ(s_t) + √(2ε) w_t, (51)

where ε is the step size and w_t iid ∼ N(0, I_D). In our experiments, we use ε = 0.01.¹ Accordingly, the Jacobians of the dynamics (those used in DEER) take the form

A_t = I_D − ε∇²φ(s_t).
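A minimal numpy sketch of the rollout in equation (51) is below (our own illustration: the well centers come from the text, but the unit covariance, seed, and trajectory length are simplifying assumptions of ours):

```python
import numpy as np

# Two-well potential phi(s) = -log(0.5 N(s; m1, I) + 0.5 N(s; m2, I)), with the
# centers from the text; unit covariance is our simplifying assumption.
centers = np.array([[0.0, -1.4], [0.0, 1.6]])

def grad_phi(s):
    """Gradient of the negative log density of the Gaussian mixture:
    grad phi(s) = sum_k p_k(s) (s - m_k), with posterior weights p_k(s)."""
    d2 = ((s - centers) ** 2).sum(axis=1)
    p = np.exp(-0.5 * (d2 - d2.min()))   # shift by the min for numerical stability
    p = p / p.sum()
    return (p[:, None] * (s - centers)).sum(axis=0)

def langevin_rollout(s1, T, eps=0.01, seed=0):
    """Discretized Langevin dynamics, equation (51)."""
    rng = np.random.default_rng(seed)
    traj = [np.asarray(s1, dtype=float)]
    for _ in range(T - 1):
        traj.append(traj[-1] - eps * grad_phi(traj[-1])
                    + np.sqrt(2 * eps) * rng.normal(size=2))
    return np.array(traj)

traj = langevin_rollout(np.zeros(2), 5000)
print(traj.shape, np.abs(traj).max())   # trajectory stays confined near the wells
```

Running DEER on such a rollout would use the Jacobians A_t = I_D − ε∇²φ(s_t); the key qualitative point is that the quadratic tails of φ keep the trajectory confined, so it spends most of its time in the contracting wells.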
As a result, the dynamics are contracting in regions where φ has positive curvature (inside the wells, where the dynamics are robustly oriented towards one of the two basins) and unstable in regions where φ has negative curvature (between the two wells, where the stochastic inputs can strongly influence which basin the trajectory heads towards). We observe that even though there are regions of state space where the dynamics are not contracting, the resulting trajectories have negative LLE. Accordingly, in Figure 24 (Right), we observe that the number of DEER iterations needed for convergence scales sublinearly, as the LLEs of all the intermediate DEER trajectories after initialization are negative. These results demonstrate that if the DEER optimization path remains in contractive regions on average, we can still attain fast convergence rates as the sequence length grows.

Moreover, a further benefit of our theory is demonstrated by our choice of initialization for DEER. Both [142] and [80] exclusively initialized all entries of s^{(0)} to zero. However, such an initialization can be extremely pathological if the region of state space containing 0 is unstable, as is the case for the particular two-well potential we consider. For this reason, we initialize s^{(0)} at random (as iid standard normals).

An important consequence of this experiment is that it shows that there are systems that are not globally contracting that nonetheless enjoy fast rates of convergence with DEER. This fact is important because a globally contractive neural network may not be so interesting or useful for classification, while a locally contracting network could be.

Furthermore, in this experiment we show empirically that Langevin dynamics can have negative LLE (cf. Figure 24). This result suggests that the Metropolis-adjusted Langevin algorithm (MALA), a workhorse of MCMC, may also be predictable in settings of interest, including multimodal distributions. Zoltowski et al. [244] provides even stronger empirical evidence that MALA may be predictable for many target distributions of interest.

¹ Notice that this is a discretization (with time step ε) of the Langevin diffusion SDE ds(t) = −∇φ(s(t)) dt + √2 dw(t), where w(t) is Brownian motion [98].

Figure 25: Additional information about the behavior of DEER when rolling out Langevin dynamics on a two-well potential. (Left) Across 20 random seeds (including different Langevin dynamics trajectories), the LLE of the intermediate DEER iterations becomes negative after the first iteration. (Center) Consequently, the merit function experiences a spike on the very first DEER iteration (following initialization, which was the only trajectory with positive LLE), before trending towards convergence. (Right) As the system spends most of its time in contracting regions, the number of DEER iterations needed for convergence scales sublinearly with the sequence length T. We plot the min-max range for 20 seeds, and observe that even out of 20 seeds, the maximum number of DEER iterations needed to converge at sequence length T = 10,000 is around 35.
5.5.3 Application: Chaotic Observers

Finally, we demonstrate a practical application of our theory in the efficient parallelization of chaotic observers. Observers are commonly used to reconstruct the full state of a system from partial measurements [154, 204]. On nine chaotic flows from the dysts benchmark dataset [73], Table 5 shows that while DEER converges prohibitively slowly on the chaotic systems themselves, it converges rapidly on stable observers of these systems, in accordance with our theory that predictability implies parallelizability.

Table 5: Comparison of system and observer LLEs and number of DEER steps for T = 30,000 and Euler discretization step size Δt = 0.01.

System                    LLE (System)   LLE (Observer)   DEER Steps (System)   DEER Steps (Observer)
ABC                       0.16           −0.08            4243                  3
Chua's Circuit            0.02           −1.37            6971                  4
Kawczynski-Strizhak       0.01           −3.08            29396                 2
Lorenz                    1.02           −6.28            30000                 3
Nosé–Hoover Thermostat    0.02           −0.13            29765                 3
Rössler                   0.01           −0.07            29288                 7
SprottB                   0.20           −0.39            29486                 2
Thomas                    0.01           −3.07            12747                 7
Vallis El Niño            0.58           −2.48            30000                 3

We design observers for these systems using two standard approaches: (1) directly substituting the observation into the observer dynamics, following Pecora and Carroll [185], or (2) incorporating the observation as feedback through a gain matrix, as in Zemouche and Boutayeb [241]. We then apply DEER to compute the trajectories of both the original chaotic systems and their corresponding stable observers. As anticipated by Theorem 5.3, the chaotic systems exhibit slow convergence, often requiring the full sequence length, whereas the stable observers converge rapidly. As with the two-well experiment, we initialize our guess for s^{(0)}_t as iid standard normals.

5.6 Discussion

In this chapter, we provide the first precise characterization of the inherent difficulty of the optimization problem solved by parallel Newton methods. The conditioning of the merit landscape determines whether parallelization will be faster in practice than sequential evaluation.
We show that the conditioning of the optimization problem is governed by the predictability of the underlying dynamics. We translate this insight into worst-case performance guarantees for specific optimizers, including Gauss-Newton (DEER). Our main takeaway is: Predictable dynamics yield well-conditioned merit functions, enabling rapid convergence. Unpredictable dynamics produce flat or ill-conditioned merit landscapes, resulting in slow convergence or numerical failure.

5.6.1 Related Work

While Lim et al. [142] and Danieli et al. [41] introduced parallel Newton methods, they did not prove their global convergence. Proposition 3.1 proves global convergence, though only with worst-case bounds of T optimization steps. These prior works did not address the relationship between system dynamics and conditioning, or establish global linear convergence rates.

Global convergence rates for Gauss-Newton are rare, despite the breadth of the optimization literature [26, 175, 179, 242]. Theorem 5.5 establishes global convergence at a linear rate for Gauss-Newton by leveraging our specific problem structure; similar results have long existed for local linear convergence [180], most famously the Newton-Kantorovich theorem [120].

As discussed in Section 1.1, parallel-in-time methods, including multigrid methods, have a long history. Of particular relevance to this work, Danieli and MacLachlan [39] and De Sterck et al. [46] study the CFL number for determining the usefulness of multigrid systems. More closely connecting the theory and practice of multigrid methods with parallel Newton methods is a very interesting direction for future work. For example, Jiang et al. [114] use multigrid methods to parallelize the evaluation and training of transformers over their layers.
More recently, several works have parallelized diffusion models via fixed-point iteration, including worst-case guarantees of $T$ steps [199, 201, 221] as well as polylogarithmic rates in $T$ [1, 34]. Crucially, prior work has not focused on the merit function, which we can define for any discrete-time dynamical system and optimizer. To our knowledge, no prior work connects the LLE of a dynamical system to the conditioning of the corresponding optimization landscape, as established in Theorem 5.3. In particular, we showed that systems with high unpredictability yield poorly conditioned (i.e., flat) merit functions, linking dynamical instability to optimization difficulty in a geometrically appealing way.

The centrality of parallel sequence modeling architectures like transformers [226], deep SSMs [85, 86, 207], and linear RNNs [236] in modern machine learning underscores the need for our theoretical work. Merrill, Petty, and Sabharwal [163] explored the question of parallelizability through the lens of circuit complexity, analyzing when deep learning models can solve structured tasks in constant depth. Their focus complements ours, and suggests an opportunity for synthesis in future work [149].

5.6.2 Implications

Our work unlocks three key implications for nonlinear state space models:

• identifying predictable systems as excellent candidates for parallelization;
• designing sequence modeling architectures to be predictable if we want to parallelize them; and
• interpreting predictable SSMs as an $O(\log T)$ stack of LDSs, coupled nonlinearly in "depth".

Identifying predictable systems for parallelization. This chapter provides a principled way to determine, a priori, whether optimization-based parallelization of a given model is practical.
In many robotic or control systems, particularly ones that are strongly dissipative, this insight can enable orders-of-magnitude speed-ups on GPUs [12, 45, 59, 113, 129, 190, 206, 218, 225]. For example, Zoltowski et al. [244] develops and leverages quasi-Newton methods to parallelize Markov chain Monte Carlo over the sequence length, attaining order-of-magnitude speed-ups. These speed-ups occurred because the quasi-Newton methods converged quickly in the settings considered. Suggestively, MCMC chains are contractive in many settings [25, 52, 157]. A precise characterization of what makes an MCMC algorithm and target distribution predictable would provide useful guidance for when one should aim to parallelize MCMC over the sequence length. Providing precise theoretical justification for parallelizing MCMC over the sequence length is an exciting avenue for future work.

Designing predictable sequence mixers. Our results impact architecture design. When constructing nonlinear dynamical systems in machine learning, such as novel RNNs, parallelization benefits are maximized when the system is made predictable. Given the large body of work on training stable RNNs [55, 57, 61, 76, 101, 102, 131, 133, 168, 182, 219, 245], many effective techniques already exist for enforcing stability or predictability during training. A common approach is to parameterize the model's weights so that the model is always stable. For example, Farsang and Grosu [61] and Danieli et al. [40] develop nonlinear SSMs and train them with DEER, with Danieli et al. [40] scaling to very strong performance as a 7B parameter language model. Both highlight the fast convergence of DEER, which is a result of the contractivity of their architectures: Farsang and Grosu [61] parameterizes their LrcSSM to be contractive, while Danieli et al. [40] clip the norms of their weight matrices.
Ensuring a negative largest Lyapunov exponent through parameterization guarantees parallelizability for the entire training process, enabling faster and more scalable learning. Our contribution provides a theoretical foundation for why stability is essential in designing efficiently parallelizable nonlinear SSMs.

Interpreting SSMs as logarithmic-depth stacks of LDSs. Finally, our results have implications for the interpretation of stable nSSMs. Because each Gauss–Newton step in DEER is a linear dynamical system (LDS), and because we prove in Theorem 5.5 that DEER converges in $O(\log T)$ steps for a stable nSSM, we can interpret a stable nSSM as being equivalent to a "stack" of $O(\log T)$ LDSs coupled by nonlinearities.

Figure 26: Equivalence between a contractive nSSM and an $O(\log T)$ stack of linear state-space models. Contractivity implies that nonlinear dynamics can be decomposed into a hierarchy of $O(\log T)$ layers of linear SSMs, each of which can be evaluated in $O(\log T)$ time by a parallel scan.

For example, if we have a nonlinear RNN as a sequence mixing layer, we can interpret this single layer with nonlinear dynamics as a hierarchical composition of linear state-space layers (SSMs), or equivalently, linear dynamical system (LDS) layers. Each layer can be evaluated in $O(\log T)$ time with a parallel scan, and the total number of layers required scales as $O(\log T)$. This perspective shows that nonlinear temporal dependencies can be captured through a logarithmic-depth stacking of linear dynamics. Figure 26 provides a schematic illustration of this equivalence.

More explicitly, each iteration of DEER is given by the LDS in equation (15). Therefore, we can interpret each "iteration" $(i)$ of DEER as a sequence-mixing "layer" $(i)$, where the sequence-mixing layer is an input-dependent switching linear dynamical system, like in Mamba [85].
The input to "layer" $(i+1)$ is the state trajectory of the immediately preceding "iteration" or "layer" $(i)$. Because we prove that DEER converges linearly in Theorem 5.5, it follows that a contractive nSSM can be simulated in $O(\log T)$ LDS layers of the form shown in equation (15), assuming the initial error grows polynomially in the sequence length.

5.7 Extensions

In this chapter, we focused primarily on the convergence rates of DEER, showing how the predictability of the dynamics affects the conditioning of $J$. However, as we discussed in Subsection 3.4.2, we can in general use any quasi-Newton method that substitutes an approximate form $\tilde{A}_t$ for the dynamics Jacobians $A_t$. A natural question is: how do such quasi approximations affect the convergence rates of these methods? Empirically, in the results presented in this thesis so far, such quasi approximations appear to slow convergence rates, but can we provide a quantitative and rigorous understanding of the quasi-Newton convergence rates?

In the next chapter, we do just that: we provide an analysis of the convergence rates of quasi-Newton methods, based on a combination of our work in Section 5.3 characterizing the conditioning of $J$ and a convergence rate analysis presented in Proposition 4 of Lu, Zhu, and Hou [153].

6 Convergence Rates of Quasi-Newton Methods for Parallelizing SSMs

In this last main chapter of this thesis, we tie up two loose ends relating to:

• what do other members of the ungulate (quasi-Newton) family for parallelizing nSSMs look like; and
• what are their convergence rates?

In more detail, in Subsection 3.4.2 we discussed how in principle any approximate Jacobians $\tilde{A}_t$ could be substituted in for the dynamics Jacobians $A_t$ in the LDS that comprises each DEER iteration (cf. equation (22)). Any such approximation still converges globally by Proposition 3.1, and forms a rich family of quasi-DEER methods.
A natural question is: what updates do various Jacobian approximations $\tilde{A}_t$ give rise to? We answer this question in Section 6.1 by formulating a unifying framework of quasi-DEER updates, showing in particular that common fixed-point iterations like Jacobi [213] and Picard [201] arise from simple approximations to $\tilde{A}_t$. While the general connections between Picard and Newton iterations and their convergence rates for solving nonlinear equations have long been known by the applied mathematics community [180], our contribution is to make these connections explicit in the setting of parallelizing nSSMs, a problem of central importance in machine learning. This perspective clarifies the properties of each method and delineates their applicability across different problem regimes.

In Section 6.2, we further show the utility of this unifying framework by leveraging it to highlight the core properties controlling the convergence rates of these different methods. We do so by building on a nice decomposition of the convergence rates of Picard iterations proposed in Proposition 4 of Lu, Zhu, and Hou [153]. Our unifying framework shows that this result generalizes immediately over our ungulate family. Furthermore, we build on our work from Chapter 5 to show how the dynamical properties of the underlying nSSMs and the quasi-DEER approximation we use allow for further bounds and deeper analysis of the convergence rates of the different fixed-point methods.

Table 6: Summary of fixed-point iteration schemes as linear dynamical systems. We list the methods by the order of their approximation. While higher-order methods may converge in fewer iterations, each iteration may be more costly. For example, the prefix sum and parallel scan have $O(\log T)$ depth, while a single Jacobi iteration has constant depth. For all the methods, each iteration is an LDS, i.e. they can be written in the form of equation (20) where $\tilde{A}_{t+1}$ is the transition matrix. By Proposition 3.1, these methods are guaranteed to converge in at most $T$ iterations. "Order" refers to the highest number of derivatives taken: Newton and quasi-Newton methods use first derivatives, while Picard and Jacobi methods do not use derivatives of $f_t$.

Fixed-point method | Order | Transition matrix $\tilde{A}_{t+1}$ | Parallelization
Newton | first-order | $\frac{\partial f_{t+1}}{\partial s_t}(s^{(i)}_t)$ | Parallel scan (dense matrix multiplication)
Quasi-Newton | quasi first-order | $\mathrm{diag}\,\frac{\partial f_{t+1}}{\partial s_t}(s^{(i)}_t)$ | Parallel scan (elementwise vector multiplication)
Picard | zeroth-order | $I_D$ | Prefix sum (vector addition)
Jacobi | zeroth-order | $0$ | Map (embarrassingly parallel)

6.1 Unifying Fixed-Point Iterations as Quasi-DEER Methods

In this section, we propose a unifying framework for parallelizing the evaluation of nonlinear SSMs (equation (1)) using linear dynamical systems (LDSs). In Table 6 we show how both the parallel Newton and quasi-Newton methods we have discussed in this paper, as well as Picard and Jacobi iterations, all fit into this unifying framework. Having discussed Newton and quasi-Newton methods at length in Section 2.4 and Chapter 3, we will introduce Picard and Jacobi iteration in this section. Throughout, we will use the fixed-point operator notation $\mathcal{A}(\cdot) : \mathbb{R}^{TD} \mapsto \mathbb{R}^{TD}$ introduced in Subsection 2.3.3.

6.1.1 Picard iterations

Shih et al. [201] uses Picard iteration to parallelize sampling in diffusion models.
In fact, Picard iterations are often used in the context of evaluating differential equations, where

$\dot{s} = g(s, t). \quad (52)$

After Euler discretization with step size $\Delta$, the continuous-time equation (52) becomes the discrete-time recursion,

$s_{t+1} = s_t + g(s_t, t) \cdot \Delta. \quad (53)$

The Picard fixed-point iteration, $s^{(i+1)}_{1:T} = \mathcal{A}_P(s^{(i)}_{1:T})$, is then given by,

$s^{(i+1)}_{t+1} = s_0 + \sum_{\tau=0}^{t} g(s^{(i)}_\tau, \tau) \cdot \Delta. \quad (54)$

Because Picard iterations do not use any derivatives of the discrete-time recursion, we call them zeroth-order fixed-point iterations.

Shih et al. [201] proves by induction that for any dynamical system given by equation (53), the fixed-point iterations given by equation (54) will converge to the true trajectory in at most $T$ iterations. The similarity of that proof and its techniques to Proposition 3.1 raised the question of how Picard and parallel Newton iterations relate to each other. Our first result shows that Picard iterations are in fact a special case of quasi-DEER, where we approximate the Jacobian of the dynamics function by the identity matrix.

Proposition 6.1. The Picard iteration operator $\mathcal{A}_P$ given by equation (54) is a special case of an LDS, equation (22), where the transition matrix is the identity, $\tilde{A}_t = I_D$.

Proof. Define $f_{t+1}(s_t) := s_t + g(s_t, t) \cdot \Delta$. Then, from equation (54) it follows that

$s^{(i+1)}_{t+1} = s^{(i+1)}_t + g(s^{(i)}_t, t) \cdot \Delta = s^{(i+1)}_t - s^{(i)}_t + s^{(i)}_t + g(s^{(i)}_t, t) \cdot \Delta = f_{t+1}(s^{(i)}_t) + (s^{(i+1)}_t - s^{(i)}_t).$

This is exactly of the form of the generic linear recursion shown in equation (20), with $\tilde{A}_t = I_D$.

An important consequence of Proposition 6.1 is that, like Newton and quasi-Newton iterations, Picard iterations can also be cast as an LDS.
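Equation (54) also makes the parallelism of Picard iteration concrete: each sweep is just a cumulative sum of drift terms, i.e. a prefix sum. A minimal numpy sketch, with a toy scalar drift $g$ and step size chosen purely for illustration:

```python
import numpy as np

def picard_sweep(g, traj, dt):
    """One Picard iteration (equation (54)).

    traj[0] holds the fixed initial condition s_0; traj[t] approximates s_t.
    The new trajectory is s_0 plus dt times a cumulative sum of drifts along
    the previous iterate, i.e. a prefix sum with O(log T) parallel depth.
    """
    drift = np.array([g(s, tau) for tau, s in enumerate(traj[:-1])])
    new = np.empty_like(traj)
    new[0] = traj[0]
    new[1:] = traj[0] + dt * np.cumsum(drift)
    return new

# Toy linear ODE s' = -s, Euler-discretized (illustrative assumption).
g = lambda s, tau: -s
dt, T = 0.01, 200

# Ground truth by sequential Euler rollout.
truth = np.empty(T + 1)
truth[0] = 1.0
for t in range(T):
    truth[t + 1] = truth[t] + dt * g(truth[t], t)

# Picard sweeps from an all-zero guess (with the initial condition pinned).
traj = np.zeros(T + 1)
traj[0] = 1.0
for i in range(T):  # converges in at most T sweeps
    traj = picard_sweep(g, traj, dt)

print(np.max(np.abs(traj - truth)))  # tiny
```

In the thesis's setting the states are vectors and the cumulative sum would be evaluated with a parallel prefix-sum primitive (e.g. an associative scan) rather than the sequential `np.cumsum` shown here.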
In Newton iterations, the full Jacobian $\partial f_t / \partial s_{t-1}$ is used in the LDS; in quasi-Newton iterations, the diagonal approximation $\mathrm{diag}[\partial f_t / \partial s_{t-1}]$ is used; and in Picard iterations, the identity $I_D$ is used. The Picard iteration is more compute and memory efficient than even quasi-Newton, but it is also generally a less faithful approximation and takes more iterations to converge, unless the Jacobian is well approximated by the identity.

6.1.2 Jacobi iterations

Yet another seemingly different fixed-point method is Jacobi iteration [180], which was used by Song et al. [213] to accelerate computation in a variety of settings in machine learning, such as feedforward networks with skip connections. Jacobi iterations are also a zeroth-order fixed-point method, and are commonly used to solve systems of multivariate nonlinear equations of the form,

$h_t(s_{1:T}) = 0 \quad \forall t \in 1, \dots, T.$

Rather than solving these equations jointly, the Jacobi fixed-point operator, $s^{(i+1)}_{1:T} = \mathcal{A}_J(s^{(i)}_{1:T})$, solves the following system of $T$ univariate equations in parallel to obtain $s^{(i+1)}_{1:T}$,

$h_t\big(x^{(i)}_1, \dots, x^{(i)}_{t-1}, x_t, x^{(i)}_{t+1}, \dots, x^{(i)}_T\big) = 0 \quad \forall t \in 1, \dots, T. \quad (55)$

Song et al. [213] considers in particular the problem of solving recurrence relations of the form $s_{t+1} = f_{t+1}(s_{1:t})$, and proves that, for such a system, Jacobi iterations converge in at most $T$ iterations. This result is directly analogous to Proposition 3.1. In fact, in the context of iteratively applying LDSs to parallelize Markovian state space models, we prove that Jacobi iterations are a type of degenerate quasi-Newton iteration, where we "approximate" the Jacobian of the dynamics function by zero.

Proposition 6.2. When applied to the Markovian state space model in equation (1), the Jacobi iteration operator $\mathcal{A}_J$ specified by equation (55) is a special case of the common form, equation (22), where $\tilde{A}_{t+1} = 0$.

Proof.
In a Markovian state space model, the recurrence relation always takes the form specified in equation (1), i.e. $s_{t+1} = f_{t+1}(s_t)$. Thus, Jacobi iterations take the simple form $s^{(i+1)}_{t+1} = f_{t+1}(s^{(i)}_t)$. Because $s^{(i+1)}_{t+1}$ does not depend on $s^{(i+1)}_t$, we see that the transition matrix is zero.

6.1.3 Summary

We have shown how several important parallel fixed-point iterations (Newton, quasi-Newton, Picard, and Jacobi) can all be cast as LDSs when deployed for evaluating nonlinear recursions, as summarized in Table 6. The regimes where these different methods excel are therefore dictated by the form of the Jacobians of their dynamics functions: if each $f_{t+1}$ is close to an identity update (as is the case when sampling from a diffusion model with a small discretization parameter), then Picard will excel; if the dynamics are nearly uncoupled across state dimensions, then quasi-Newton using a diagonal approximation will excel; and if the dynamics have multiple dependencies across coordinates and the dimension $D$ is not too large, then Newton iterations will excel. Jacobi iterations are most useful if the dynamics are heavily contracting or predictable, i.e. their largest Lyapunov exponent is strongly negative (Section 5.1). Another interpretation of very contracting dynamics is dynamics that are primarily input driven, i.e. $\partial f_t / \partial s_{t-1} \approx 0$.

An important corollary is that because all of these fixed-point iterations can be cast as LDSs, they are all guaranteed to converge in at most $T$ iterations in all problem settings by Proposition 3.1. However, as we noted above, the precise convergence rates of the different fixed-point methods will be problem dependent.
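To make this summary concrete, the four schemes of Table 6 can be run side by side on a toy contractive nSSM. The dynamics $f$ and all constants below are illustrative assumptions, and each sweep is written as a sequential loop for readability, although every sweep is an LDS that would be evaluated with a scan in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
D, T = 3, 60
W = 0.5 * rng.standard_normal((D, D)) / np.sqrt(D)    # contractive toy dynamics
f = lambda s: np.tanh(W @ s)
A = lambda s: np.diag(1.0 - np.tanh(W @ s) ** 2) @ W  # true Jacobian of f

# The four transition-matrix choices from Table 6.
schemes = {
    "newton": A,
    "quasi-newton": lambda s: np.diag(np.diag(A(s))),
    "picard": lambda s: np.eye(D),
    "jacobi": lambda s: np.zeros((D, D)),
}

def sweep(A_tilde, s0, s_prev):
    """One fixed-point iteration in the generic LDS form of equation (20)."""
    s_new = np.empty_like(s_prev)
    s_new[0] = f(s0)
    for t in range(1, T):
        s_new[t] = f(s_prev[t - 1]) + A_tilde(s_prev[t - 1]) @ (s_new[t - 1] - s_prev[t - 1])
    return s_new

s0 = rng.standard_normal(D)
truth = np.empty((T, D))
s = s0
for t in range(T):
    s = f(s)
    truth[t] = s

iters, errs = {}, {}
for name, A_tilde in schemes.items():
    s_i = np.zeros((T, D))
    for i in range(1, T + 1):  # Proposition 3.1: at most T sweeps needed
        s_i = sweep(A_tilde, s0, s_i)
        if np.max(np.abs(s_i - truth)) < 1e-8:
            break
    iters[name], errs[name] = i, float(np.max(np.abs(s_i - truth)))
print(iters)
```

All four methods recover the sequential rollout within $T$ sweeps; the relative iteration counts reflect how well each $\tilde{A}_t$ matches the true Jacobian for these particular dynamics.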
In the next section, we provide theoretical analysis showing how the difference between the approximate Jacobian $\tilde{A}_t$ of a fixed-point method and the true dynamics Jacobian $\partial f_t / \partial s_{t-1}$ impacts the rate of convergence of different methods on different problems.

6.2 Convergence Rates for Quasi-DEER

In this section, we analyze the convergence properties of the fixed-point methods introduced in Section 6.1. We show that the convergence rate of these fixed-point methods can be understood in terms of how well the transition matrix $\tilde{A}_t$ approximates the true dynamics Jacobian $A_t := \partial f_t / \partial s_{t-1}$ (cf. Table 6) and the stability of the LDS the fixed-point method gives rise to (cf. equation (22)).

To begin, we can substitute our approximations $\tilde{A}_t$ for $A_t$ in the definition of $J$ in equation (17) to obtain an approximate residual Jacobian $\tilde{J}(s_{1:T}) \in \mathbb{R}^{TD \times TD}$ given by

$\tilde{J}(s_{1:T}) := \begin{bmatrix} I_D & 0 & 0 & \cdots & 0 & 0 \\ -\tilde{A}_2(s_1) & I_D & 0 & \cdots & 0 & 0 \\ 0 & -\tilde{A}_3(s_2) & I_D & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & I_D & 0 \\ 0 & 0 & 0 & \cdots & -\tilde{A}_T(s_{T-1}) & I_D \end{bmatrix}. \quad (56)$

The corresponding fixed-point iteration $\mathcal{A}$ takes the form

$\mathcal{A}(s^{(i)}_{1:T}) := s^{(i)}_{1:T} - \tilde{J}(s^{(i)}_{1:T})^{-1} r(s^{(i)}_{1:T}). \quad (57)$

For example, for Jacobi iterations, $\tilde{J}_J(s_{1:T})$ is always the identity matrix $I_{TD}$. For Picard iterations, $\tilde{J}_P(s_{1:T})$ takes the form

$\tilde{J}_P(s_{1:T}) = \begin{bmatrix} I_D & 0 & 0 & \cdots & 0 & 0 \\ -I_D & I_D & 0 & \cdots & 0 & 0 \\ 0 & -I_D & I_D & \cdots & 0 & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & I_D & 0 \\ 0 & 0 & 0 & \cdots & -I_D & I_D \end{bmatrix}. \quad (58)$

Different fixed-point methods $\mathcal{A}$ give rise to different matrices $\tilde{J}$, which impacts their convergence rates.

6.2.1 Convergence rates of fixed-point iterations

In this section, we closely follow the proof of Proposition 4 of Lu, Zhu, and Hou [153] to derive convergence rates for all fixed-point operators discussed in this paper.
Lu, Zhu, and Hou [153] focused on the special case of Picard iterations, but our unifying framework allows us to see that their analysis generalizes immediately. For any of the fixed-point methods discussed in this paper, we can bound the convergence rate of the error $e^{(i)}$ defined in equation (8).

Proposition 6.3 (Proposition 4 of Lu, Zhu, and Hou [153]). Consider a fixed-point solver with updates given by equation (57) for some matrix $\tilde{J}(s^{(i)}_{1:T})$ with form specified by equation (56). Let $L$ be the maximum of the Lipschitz constants of $\partial f_t / \partial s_{t-1}$. Then $\|e^{(i+1)}\|_2$ satisfies

$\|e^{(i+1)}\|_2 \leq \big\|\tilde{J}(s^{(i)}_{1:T})^{-1}\big\|_2 \left( \big\|\tilde{J}(s^{(i)}_{1:T}) - J(s^{(i)}_{1:T})\big\|_2 \, \|e^{(i)}\|_2 + \frac{L}{2} \|e^{(i)}\|_2^2 \right), \quad (59)$

where $\|\cdot\|_2$ denotes the spectral norm of a matrix and the $\ell_2$ norm of a vector.

Proof. Starting from equation (57), we subtract $s^\star_{1:T}$ from both sides to obtain

$e^{(i+1)} = e^{(i)} - \tilde{J}(s^{(i)}_{1:T})^{-1} r(s^{(i)}_{1:T}).$

Next, we Taylor expand $r(\cdot)$ around $s^{(i)}_{1:T}$ to obtain

$r(s^\star_{1:T}) = r(s^{(i)}_{1:T}) - J(s^{(i)}_{1:T}) e^{(i)} + R(e^{(i)}),$

where $R(e^{(i)})$ is the second-order remainder function and has norm bounded by $\|e^{(i)}\|_2^2 / 2$ times the Lipschitz constant of $J(s^{(i)}_{1:T})$, which Theorem 3 of Gonzalez et al. [79] shows is bounded by $L$. Since $r(s^\star_{1:T}) = 0$, it follows that

$e^{(i+1)} = \tilde{J}(s^{(i)}_{1:T})^{-1} \Big( \underbrace{\big(\tilde{J}(s^{(i)}_{1:T}) - J(s^{(i)}_{1:T})\big) e^{(i)}}_{\text{Jacobian mismatch}} + \underbrace{R(e^{(i)})}_{\text{higher-order Taylor remainder}} \Big). \quad (60)$

The result follows by taking norms on both sides and using the triangle inequality.

6.2.2 Limitations of this convergence analysis

Proposition 6.3 only guarantees a decrease in the error when the iterate $s^{(i)}_{1:T}$ is already in a basin of decrease $B_D$ given by

$B_D := \left\{ s_{1:T} : \|e(s_{1:T})\|_2 \leq \frac{2 \big( 1 - \|\tilde{J}(s_{1:T})^{-1}\|_2 \, \|\tilde{J}(s_{1:T}) - J(s_{1:T})\|_2 \big)}{L \, \|\tilde{J}(s_{1:T})^{-1}\|_2} \right\}.$

However, since we know from Proposition 1 of Gonzalez et al.
[80] that all the fixed-point algorithms considered in this paper must eventually converge, we know that the iterates $s^{(i)}_{1:T}$ must all eventually enter this basin of decrease $B_D$ if $B_D \neq \emptyset$. For this reason, Proposition 6.3 provides helpful intuition about which fixed-point algorithms are useful for which dynamical systems.

For example, let us define the basin of linear rate $B_L$ to comprise those $s_{1:T}$ where

$\big\|\tilde{J}(s_{1:T}) - J(s_{1:T})\big\|_2 \, \|e^{(i)}\|_2 > \frac{L}{2} \|e^{(i)}\|_2^2,$

i.e. the expression linear in $\|e^{(i)}\|_2$ on the right side of (59) dominates the expression quadratic in $\|e^{(i)}\|_2$. It follows that $B_L$ is given by

$B_L := \left\{ s_{1:T} : \|e(s_{1:T})\|_2 \leq \frac{2 \, \|\tilde{J}(s_{1:T}) - J(s_{1:T})\|_2}{L} \right\}.$

Therefore, when $s^{(i)}_{1:T} \in B_D \cap B_L$, it follows that the norm of the error is guaranteed to decrease by a factor of $2 \|\tilde{J}(s^{(i)}_{1:T})^{-1}\|_2 \|\tilde{J}(s^{(i)}_{1:T}) - J(s^{(i)}_{1:T})\|_2$. Moreover, as $\|e^{(i)}\|_2$ approaches zero, the guaranteed factor of decrease approaches the value given by equation (61). Nonetheless, we can still extract very interesting intuitions about the convergence rates of different quasi-DEER approximations from Proposition 6.3, as we discuss in the next section.

6.2.3 Intuitions about rates of convergence

Equation (60) shows the error decomposes into two contributions. The first term measures the discrepancy between the chosen linear operator $\tilde{J}$ and the true Jacobian $J$ of the residual. The second term captures the effect of higher-order nonlinearities. Moreover, from equation (59), we see that as $\|e^{(i)}\|_2$ approaches zero, the contribution from the first term, which is linear in $\|e^{(i)}\|_2$, must eventually¹ dominate the contribution from the second term, which is quadratic in $\|e^{(i)}\|_2$.
Typically, we would say the rate of decrease in $\|e^{(i)}\|_2$ approaches an asymptotic linear rate $\gamma$ given by

$\gamma := \big\|\tilde{J}(s^\star_{1:T})^{-1}\big\|_2 \, \big\|\tilde{J}(s^\star_{1:T}) - J(s^\star_{1:T})\big\|_2. \quad (61)$

Discussions of the asymptotic linear rate are subtle in our setting, where all fixed-point methods are guaranteed to converge in $T$ iterations: see our discussion in Appendix C. Nonetheless, the functional form of $\gamma$ provides useful intuition about the convergence rates of different fixed-point methods. In particular, we can study the two factors that make up the functional form of the asymptotic linear rate: $\|\tilde{J}(s^\star_{1:T}) - J(s^\star_{1:T})\|_2$ and $\|\tilde{J}(s^\star_{1:T})^{-1}\|_2$.

6.2.3.1 Intuitions from $\|\tilde{J}(s_{1:T}) - J(s_{1:T})\|_2$

We can control this quantity in terms of the spectral norms of the differences between the approximate and true dynamics Jacobians:

Lemma 6.4. If $\tilde{J}(s_{1:T})$ is given by equation (56) and $J(s_{1:T})$ is given by equation (17), then

$\big\|\tilde{J}(s_{1:T}) - J(s_{1:T})\big\|_2 = \max_{2 \leq t \leq T} \big\|\tilde{A}_t(s_{t-1}) - A_t(s_{t-1})\big\|_2.$

Proof. Plugging in the functional forms of $\tilde{J}(\cdot)$ and $J(\cdot)$, if we define $E_t := A_t(s_{t-1}) - \tilde{A}_t(s_{t-1})$, then

$\tilde{J}(s_{1:T}) - J(s_{1:T}) = \begin{bmatrix} 0 & 0 & \cdots & 0 & 0 \\ E_2 & 0 & \cdots & 0 & 0 \\ 0 & E_3 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & E_T & 0 \end{bmatrix}.$

The spectral norm of a matrix $M$ is equal to the square root of the largest eigenvalue of $M^\top M$. Defining $M := \tilde{J}(s_{1:T}) - J(s_{1:T})$, we see that $M^\top M$ is the block-diagonal matrix $\mathrm{blockdiag}\big(E_2^\top E_2, E_3^\top E_3, \dots, E_T^\top E_T, 0\big)$. Since $M^\top M$ is block diagonal, its eigenvalues are equal to the union of the eigenvalues of each of the blocks $E_t^\top E_t$.

¹ Under strong enough continuity assumptions.
Thus, it follows that the maximum eigenvalue of $M^\top M$ is equal to the maximum of all the eigenvalues of all the matrices $E_t^\top E_t$, and so the maximum singular value of $\tilde{J}(s_{1:T}) - J(s_{1:T})$ is given by $\max_{2 \leq t \leq T} \|\tilde{A}_t(s_{t-1}) - A_t(s_{t-1})\|_2$.

A resulting intuition is that, for the fixed-point methods considered in this paper, the rate of convergence will be faster if the approximate Jacobian $\tilde{A}_t$ is closer to the true dynamics Jacobian $A_t$ in spectral norm. For the purposes of showing the utility of this intuition in experiment, we will use the notation $\mathrm{diff}(\mathcal{A})$ to indicate this Jacobian approximation error, i.e.

$\mathrm{diff}(\mathcal{A}) := \max_{2 \leq t \leq T} \big\|\tilde{A}_t(s_{t-1}) - A_t(s_{t-1})\big\|_2, \quad (62)$

where the sequence length $T$ and dynamical system $f$ should be evident from context.

6.2.3.2 Intuitions from $\|\tilde{J}(s_{1:T})^{-1}\|_2$

Because $\tilde{J}(s_{1:T})$ as defined in equation (56) is a block-bidiagonal matrix, its inverse has a block lower-triangular structure of the form

$\tilde{J}(s_{1:4})^{-1} = \begin{bmatrix} I_D & 0 & 0 & 0 \\ \tilde{A}_2 & I_D & 0 & 0 \\ \tilde{A}_3 \tilde{A}_2 & \tilde{A}_3 & I_D & 0 \\ \tilde{A}_4 \tilde{A}_3 \tilde{A}_2 & \tilde{A}_4 \tilde{A}_3 & \tilde{A}_4 & I_D \end{bmatrix}, \quad (63)$

shown above for $T = 4$. From equation (63), we see that the blocks of $\tilde{J}(s_{1:T})^{-1}$ are products of the transition matrices $\tilde{A}_t$ from the chosen fixed-point method (cf. Table 6). In particular, if the chosen fixed-point method results in an unstable LDS with $\|\tilde{A}_{t+1}\|_2 > 1$ at many points $t$ in the sequence, we see that $\|\tilde{J}(s_{1:T})^{-1}\|_2$ can be much larger than one. In fact, as we saw in Section 5.3, under suitable assumptions, the norm of $\tilde{J}^{-1}$ is related to the dynamical stability of the linear time-varying system with transition matrices $\tilde{A}_t$ arising from the fixed-point iterations. The larger the LLE $\tilde{\lambda}$ of the LDS arising from the fixed-point iteration, the larger the norm of $\tilde{J}^{-1}$ will be.
More precisely, if we apply the regularity conditions in equation (37) to the LDS arising from the fixed-point iteration, then by the techniques used to prove Theorem 5.3 it follows that

$\max\big(1, \, b \, e^{\tilde{\lambda}(T-1)}\big) \leq \|\tilde{J}^{-1}\|_2 \leq a \, \frac{e^{\tilde{\lambda} T} - 1}{e^{\tilde{\lambda}} - 1}.$

Therefore, we observe that the presence of this term $\|\tilde{J}(s_{1:T})^{-1}\|_2$ in $\gamma$ gives rise to the intuition that fixed-point methods resulting in unstable LDSs should have slower rates of convergence. One way to grasp this intuition is that unstable LDSs suffer from numerical blowup, especially for large $T$.

Moreover, in the special cases of Jacobi and Picard iterations, we can compute $\|\tilde{J}(s^{(i)}_{1:T})^{-1}\|_2$ analytically. For Jacobi iterations, $\|\tilde{J}_J^{-1}\|_2 = 1$. For Picard iterations, the expression for $\|\tilde{J}_P^{-1}\|_2$ is more complicated, but it scales as $O(T)$:

Lemma 6.5. Let $\tilde{J}_P$ be as in equation (58). Then

$\|\tilde{J}_P^{-1}\|_2 = \frac{1}{2 \sin\left( \frac{\pi}{2(2T+1)} \right)}.$

By the small-angle approximation for sine, $\|\tilde{J}_P^{-1}\|_2$ scales as $O(T)$.

Proof. Consider

$K := \tilde{J}_P^{-\top} \tilde{J}_P^{-1} = \begin{bmatrix} I_D & I_D & I_D & \cdots & I_D \\ I_D & 2I_D & 2I_D & \cdots & 2I_D \\ I_D & 2I_D & 3I_D & \cdots & 3I_D \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ I_D & 2I_D & 3I_D & \cdots & TI_D \end{bmatrix}.$

We know that $\lambda_{\max}(K)^{1/2} = \|\tilde{J}_P^{-1}\|_2$. Since $K$ is a Kronecker product $M \otimes I_D$, where $M_{i,j} = \min(i,j)$, the spectrum of $K$ is equivalent to the spectrum of $M$ (just with all eigenvalues having multiplicity $D$). Therefore, we seek to find the spectrum of $M \in \mathbb{R}^{T \times T}$.

However, the spectrum of $M$ is known in the literature. For example, Theorem 2.1 of Fonseca [64] shows that if $T \geq 3$, then the eigenvalues $\{\lambda_k\}_{k=0}^{T-1}$ of $M$ are given by

$\lambda_k = \frac{1}{2} \left( 1 - \cos\left( \frac{2k+1}{2T+1} \pi \right) \right)^{-1} = \frac{1}{4} \sin^{-2}\left( \frac{2k+1}{2(2T+1)} \pi \right),$

where the second equality comes from the half-angle formula. We observe that the largest eigenvalue is therefore $\lambda_0$, and so the result follows after we take a square root.
Because $\|\tilde{J}_P^{-1}\|_2 > \|\tilde{J}_J^{-1}\|_2$ for large $T$, the formula for $\gamma$ given by equation (61) yields the following expectation: in settings where the $\tilde{A}_t$ from Picard vs. Jacobi iterations approximate the true dynamics Jacobian $A_t$ equally well, we expect Jacobi iterations to converge more quickly because $\|\tilde{J}_J^{-1}\|_2 < \|\tilde{J}_P^{-1}\|_2$. We now test this hypothesis with a simple simulation designed to show how Proposition 6.3 provides helpful intuition about the convergence rates of different fixed-point methods.

6.2.3.3 How fixed-point stability distinguishes between Jacobi and Picard iterations

We demonstrate the helpfulness of the intuitions stemming from Proposition 6.3 in a simple simulation. We consider the LDS $s_{t+1} = \alpha s_t$, for $s_t \in \mathbb{R}^2$. Because this is an LDS with diagonal dynamics, both the Newton and quasi-Newton iterations considered in this paper converge in one iteration. However, this simulation is useful for comparing Jacobi versus Picard iterations. This comparison is particularly fruitful in light of the formula for $\gamma$ given by equation (61) and Lemma 6.4 because, in this setting,

$\|\tilde{J}_J - J\|_2 = \alpha, \qquad \|\tilde{J}_P - J\|_2 = 1 - \alpha.$

However, $\|\tilde{J}_J^{-1}\|_2 = 1$, while $\|\tilde{J}_P^{-1}\|_2$ scales linearly with $T$. Therefore, when comparing the number of Jacobi iterations needed to converge when the dynamics are multiplication by $\alpha$ to the number of Picard iterations needed to converge when the dynamics are multiplication by $1 - \alpha$, we expect fewer Jacobi iterations to be needed than Picard iterations, as $\gamma_J < \gamma_P$. For $\alpha = 0.5$, when $\|\tilde{J}_J - J\|_2 = \|\tilde{J}_P - J\|_2$, we see that Jacobi iterations converge in far fewer iterations than Picard iterations.
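This simulation is easy to reproduce. A minimal scalar sketch ($T$, the tolerance, and the all-zero initialization are illustrative choices; the thesis's version uses $s_t \in \mathbb{R}^2$):

```python
import numpy as np

def iters_to_converge(alpha, A_tilde, T=100, tol=1e-5):
    """Sweeps of the generic LDS update (equation (20)) for s_{t+1} = alpha * s_t,
    with scalar transition A_tilde: 0.0 gives Jacobi, 1.0 gives Picard."""
    s0 = 1.0
    truth = np.empty(T)
    truth[0] = alpha * s0
    for t in range(1, T):
        truth[t] = alpha * truth[t - 1]
    s = np.zeros(T)
    for i in range(1, T + 1):
        s_new = np.empty(T)
        s_new[0] = alpha * s0
        for t in range(1, T):
            s_new[t] = alpha * s[t - 1] + A_tilde * (s_new[t - 1] - s[t - 1])
        s = s_new
        if np.max(np.abs(s - truth)) < tol:
            return i
    return T

alpha = 0.5
# Same Jacobian mismatch (0.5) in both cases, per Lemma 6.4:
jacobi = iters_to_converge(alpha, A_tilde=0.0)        # dynamics alpha
picard = iters_to_converge(1.0 - alpha, A_tilde=1.0)  # dynamics 1 - alpha
print(jacobi, picard)  # Jacobi needs far fewer sweeps
```

Even though both schemes have the same Jacobian mismatch here, the Picard sweep is an unstable LDS ($\|\tilde{J}_P^{-1}\|_2 = O(T)$), which is what slows it down.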
Moreover, when comparing the behavior of Jacobi for simulating $f_{t+1}(x_t) = \alpha x_t$ with Picard for simulating $f_{t+1}(x_t) = (1 - \alpha) x_t$, we observe that Jacobi iterations always converge faster. However, when comparing for the same value of $\alpha$, we see that Picard can be faster than Jacobi when $\alpha$ is closer to one. This behavior makes sense, because in those settings the true Jacobian $\partial f_{t+1} / \partial x_t$ is closer to $I_D$ than to $0$.

Figure 27: Comparing Picard and Jacobi iterations on a diagonal LDS. For the underlying dynamical system $s_{t+1} = \alpha s_t$, we plot the norm of the error $e^{(i)}$ for Jacobi and Picard iterations for $\alpha \in \{0.2, 0.3, \dots, 0.8\}$, along with the number of iterations each method needs to reach $\|e\| < 10^{-5}$. We denote the empirical slope of $\|e^{(i)}\|_2$ for Jacobi iterations by $\gamma_J$.

Moreover, we observe that in this setting, the error $e^{(i)}_{1:T}$ for Jacobi iteration shows a clear linear convergence rate, as predicted by Proposition 6.3. The slope of the norm of the errors of the Jacobi iterates should be $\log_{10}(\alpha)$ by equation (61) and Lemma 6.4, and in fact those values are exactly the slopes of the lines in Figure 27 (left panel).

6.2.4 Summary of Convergence Analysis

In Proposition 6.3 we present an upper bound on the norm of the error of each fixed-point iterate. As an upper bound, this result cannot always fully predict the precise trajectory of the norm of the error. Nevertheless, we can extract pleasing intuitions from Proposition 6.3. Furthermore, in the following section, we show how the resulting intuitions reflect the empirical behavior of these fixed-point methods in different settings.
Most importantly, we show that the difference in spectral norm between the approximate Jacobians $\tilde{A}_t$ and the true dynamics Jacobians $A_t$ provides a helpful perspective on where each fixed-point method will excel.

6.3 Performance of the Different Fixed-Point Methods

In this section, we consider three empirical case studies that illustrate how the unifying framework and convergence analysis presented in this paper provide guidance about which fixed-point schemes will excel in which settings. This guidance is based on the structure of the Jacobian of $f_{t+1}$ and the relative computational cost of different fixed-point methods. In a nutshell, we pay homage to Einstein and advise: use the simplest approximate Jacobian possible, but no simpler.

To elaborate: simpler approximate Jacobians are less computationally expensive, meaning that each fixed-point iteration is more efficient. So, if the lower-order fixed-point method still converges in a small number of fixed-point iterations, it achieves the sequential roll-out $s^\star$ in faster wall-clock time on GPUs than higher-order fixed-point methods. However, if the higher-order fixed-point method (e.g. Newton or quasi-Newton) converges in far fewer iterations than the lower-order fixed-point method, then the increased computation of the higher-order fixed-point method is worthwhile. As supported by the theoretical analysis in Section 6.2, the number of iterations needed for a fixed-point method to converge is related to the difference in spectral norm between $\tilde{A}_t$ and $A_t := \partial f_{t+1} / \partial s_t$. We support this intuition with the following case studies.
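The quantity $\mathrm{diff}(\cdot)$ from equation (62) is cheap to compute along a trajectory, making it a useful a-priori diagnostic when choosing among the methods. A sketch, with toy dynamics chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 4, 200
W = 0.6 * rng.standard_normal((D, D)) / np.sqrt(D)
f = lambda s: np.tanh(W @ s)
A = lambda s: np.diag(1.0 - np.tanh(W @ s) ** 2) @ W  # true dynamics Jacobian A_t

# Roll out a trajectory, then evaluate diff(.) from equation (62) per scheme.
traj = [rng.standard_normal(D)]
for _ in range(T - 1):
    traj.append(f(traj[-1]))

def diff(A_tilde):
    return max(np.linalg.norm(A_tilde(s) - A(s), 2) for s in traj[:-1])

diffs = {
    "newton": diff(A),  # exact Jacobian: zero approximation error
    "quasi-newton": diff(lambda s: np.diag(np.diag(A(s)))),
    "picard": diff(lambda s: np.eye(D)),
    "jacobi": diff(lambda s: np.zeros((D, D))),
}
print(diffs)
```

A small value of $\mathrm{diff}(\cdot)$ for a given scheme suggests, via Lemma 6.4 and equation (61), that its sweeps will contract quickly, so comparing these entries is a cheap guide to method choice before running any sweeps.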
All the experiments in this section were run on a single H100 GPU with 80GB of onboard memory, and the code is available at https://github.com/lindermanlab/parallelizing_with_lds.

6.3.1 Case study #1: Solving the group word problem with Newton iterations

Newton iterations should outperform quasi-Newton and Picard iterations in settings where the Jacobian of the recursion, f_{t+1}, is not well approximated by its diagonal, the identity matrix, or the zero matrix. One example of such a recursion is the group word problem, which has been used to theoretically and empirically assess the limits of sequential modeling architectures for state-tracking tasks [82, 126, 146, 163, 197]. In the sequence-modeling community, the term "group word problem" is defined as follows.

Definition 6.6 (Group Word Problem). Let G be a finite group and let g_1, g_2, ..., g_T be a sequence of group elements. The group word problem is to evaluate the product g_1 · g_2 ⋯ g_T. Since each g_t ∈ G, the product of these group elements belongs to G as well.

Merrill, Petty, and Sabharwal [163] emphasize that nonlinear RNNs, in both theory and practice, are able to learn the group word problem in arbitrary groups to high accuracy with only a single layer, whereas compositions of popular linear RNNs linked by nonlinearities, such as S4 [86] and Mamba [85],² require a number of layers that grows with T. Other literature has explored the value of matrix-valued states [10, 82]. However, in Proposition 6.7 below, we show that neither nonlinearity nor matrix-valued states are needed to understand or solve the group word problem.
Instead, the problem can be formulated as an LDS with vector-valued states and input-dependent transition matrices.

Proposition 6.7. Let G be a finite group. Then there exists some D ⩽ |G| for which we can represent the group word problem as a time-varying LDS, f_{t+1}(s_t) = A_{t+1} s_t, with states s_t ∈ R^D denoting the running product of group elements and transition matrices A_{t+1} ∈ R^{D×D} that depend on the input g_{t+1}.

Proof. By Cayley's theorem, any finite group G can be embedded in a symmetric group S_D for some D ⩽ |G|. Therefore, by choosing the initial state s_0 ∈ R^D to have D distinct entries (a "vocabulary" of size D), we can use the tabular representation of permutations [4, eq. 1.5.2] to represent an element of S_D as s_t (a permutation of the elements of s_0). We can also choose A_{t+1} ∈ R^{D×D} to be the permutation matrix corresponding to the embedding of g_{t+1} in S_D, since any element of S_D can be represented as a D×D permutation matrix (e.g., see Figure 28B). Consequently, s_t = A_t A_{t−1} ⋯ A_2 A_1 s_0 is an embedding of an element of G in S_D in the tabular representation. In fact, s_t ∈ R^D represents the running product g_1 g_2 ⋯ g_{t−1} g_t, which is precisely the goal of the group word problem.

Though we have cast the group word problem as a time-varying LDS with f_{t+1}(s_t) = A_{t+1} s_t, we can still evaluate this recursion with any of the fixed-point methods described above. Since the dynamics are linear, the Newton iteration corresponds to evaluating the LDS with a parallel scan, and it converges in one iteration. While the other methods would require more iterations to converge, they could still be more efficient in wall-clock time, since they use less memory and compute per iteration. However, we can use the Jacobian approximation error diff(·) (defined in equation (62)) of the different fixed-point methods to get a sense of whether the other fixed-point methods are likely to excel in this setting.
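The construction in the proof of Proposition 6.7 can be sketched in a few lines of NumPy. The group (S_5), word length, and random word below are illustrative; the sketch checks that the LDS roll-out through permutation matrices agrees with composing the permutations directly.

```python
import numpy as np

D, T = 5, 8
rng = np.random.default_rng(0)

def perm_matrix(p):
    """Permutation matrix A with A e_j = e_{p[j]}."""
    A = np.zeros((D, D))
    A[p, np.arange(D)] = 1.0
    return A

word = [rng.permutation(D) for _ in range(T)]   # the word g_1, ..., g_T
s0 = np.arange(1.0, D + 1.0)                    # D distinct "vocabulary" entries

# LDS roll-out s_t = A_{g_t} s_{t-1}; since the dynamics are linear, Newton's
# method evaluates this with a single parallel scan over the A_t.
s = s0.copy()
for g in word:
    s = perm_matrix(g) @ s

# Direct composition of the permutations, applying each g_t in turn.
comp = np.arange(D)
for g in word:
    comp = np.asarray(g)[comp]
```

The final state s is exactly s_0 permuted by the composed word, i.e. s[comp[j]] = s0[j] for every vocabulary slot j.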
The state transition matrices of the group word problem are permutation matrices with spectral norm one, and so diff(A_J) = 1. Furthermore, since with high probability there will be a state transition matrix whose diagonal is all zero, it follows that diff(A_QN) = 1, while diff(A_P) = 2.

² Mamba allows input-dependent dynamics matrices, but they must be diagonal, which prevents a single Mamba layer from implementing the particular LDS in Proposition 6.7, which uses permutation matrices. Merrill, Petty, and Sabharwal [163] also demonstrate that a linear time-varying system with a dense transition matrix can learn the group word problem.

Figure 28: A single Newton iteration solves the S_5 group word problem, whereas the number of iterations required for the other methods increases with sequence length. We consider the task of evaluating the product of S_5 group elements. A: The group word problem can be expressed as an LDS, x_{t+1} = A_{t+1} x_t =: f_{t+1}(x_t), with input-dependent state-transition matrices A_{t+1} ∈ R^{D×D} = ∂f_{t+1}/∂x. B: An example input-dependent transition matrix A_t for the permutation (1 5 2 4 3), in cycle notation. C: For each fixed-point method and a range of sequence lengths T, we compute the median (over ten random seeds) number of fixed-point iterations to converge (top) and the median wall-clock time (bottom). While a single Newton iteration is sufficient to solve the S_5 problem, the number of iterations required for the other methods increases with the sequence length.
Since we would expect to need diff(A) < 1 for a fixed-point method A to be effective, our theoretical analysis in Section 6.2 suggests that none of the fixed-point methods other than Newton will be effective on the group word problem.

We test this hypothesis with a simple experiment simulating the S_5 word problem, a standard problem in the sequence modeling literature [82, 163]. In this setting, Figure 28 shows that quasi-Newton, Picard, and Jacobi iterations require nearly T iterations to converge. On the other hand, we see that Newton's method solves the S_5 word problem with just one fixed-point iteration, as expected since the true dynamics are linear. The speed-up is also apparent in the wall-clock time comparison, where we see that Newton is faster than the other methods, regardless of T.

6.3.2 Case Study #2: Picard iterations struggle to parallelize RNNs

We next consider a task where Picard iterations struggle, while the other fixed-point methods excel. This task is parallelizing recurrent neural networks (RNNs), like the Gated Recurrent Unit or GRU [38]. The GRU recursion shown in Figure 29A is

x_{t+1} = (1 − z_t) ⊙ x_t + z_t ⊙ x̃_t =: f_t(x_t),  where
z_t = σ(Linear([u_t, x_t])),  r_t = σ(Linear([u_t, x_t])),  x̃_t = tanh(Linear([u_t, r_t ⊙ x_t])).

Figure 29: Picard iterations struggle to parallelize RNNs. We evaluate GRUs with random parameter initialization for different sequence lengths T and hidden state sizes D. A: The nonlinear dynamics of a GRU, following Feng et al.
[63], where x_t is the hidden state, u_t is the input, and the notation Linear([·,·]) indicates a linear readout from the concatenation of two vectors. B: A representative Jacobian matrix ∂f_t/∂x from a GRU trajectory, which is not well approximated by the identity matrix. C: For each fixed-point method and a range of sequence lengths T and state sizes D, we compute the median (over ten random seeds) number of fixed-point iterations to converge (top row) and the median wall-clock time (bottom row). Picard iterations take nearly T iterations to converge, while the other fixed-point methods yield order-of-magnitude speed-ups over sequential evaluation.

We show the results of a simple experiment in Figure 29. We evaluate GRUs with random parameter initialization for different hidden dimension sizes D and sequence lengths T, using sequential evaluation as well as fixed-point iterations. This is the same experimental setup as that shown in Figure 11, except this time we are using H100s. As we observe in Panel B of Figure 29, at initialization the Jacobian of the GRU has entries that are fairly small in value (on the order of 0.1). Therefore, it is intuitively plausible that diff(A_J) and diff(A_QN) would both be less than one, while diff(A_P) would be greater than one. To demonstrate the different values of the diff(·) operator for quasi-Newton, Jacobi, and Picard iterations in this setting, we consider the setting D = 8 and T = 1000. For 10 random seeds, we plot a variety of quantities relevant for γ (cf. equation (61)) in Figure 30. We observe that lower values of γ (i.e., faster rates of asymptotic linear convergence) coincide with fewer fixed-point iterations needed in Figure 29.
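To illustrate the mechanism behind this case study without the full GRU machinery, the following sketch runs Jacobi sweeps on a small tanh RNN with small random weights. This is a hypothetical stand-in for the GRU experiment, not the code behind Figure 29; because the dynamics Jacobian has small norm, the sweeps converge in far fewer than T iterations.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 8, 1000
W = 0.1 * rng.standard_normal((D, D))   # small weights -> small dynamics Jacobian
U = rng.standard_normal((T, D))         # inputs
x0 = np.zeros(D)

# Ground-truth sequential roll-out of x_{t+1} = tanh(W x_t + u_t)
x_star = np.empty((T, D))
x = x0
for t in range(T):
    x = np.tanh(W @ x + U[t])
    x_star[t] = x

# Jacobi fixed-point sweeps: every timestep is updated simultaneously from the
# previous iterate, x_t^{(i+1)} = f_t(x_{t-1}^{(i)}).
X = np.zeros((T, D))
for i in range(T):
    prev = np.vstack([x0, X[:-1]])          # lagged sequence (x_0, ..., x_{T-1})
    X_new = np.tanh(prev @ W.T + U)
    delta = np.max(np.abs(X_new - X))       # change between successive sweeps
    X = X_new
    if delta < 1e-8:
        break
n_iters = i + 1
```

With ‖W‖ well below one, the contraction argument of Section 6.2 predicts convergence in a few dozen sweeps rather than the T = 1000 that Picard would need here.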
Figure 30: Understanding the convergence rates in Figure 29. In the setting of the GRU experiment with D = 8 and T = 1000, we plot quantities relevant for understanding the convergence rates of the different methods over 10 random seeds. (Top left.) We plot the spectral norm ‖Ã_{t+1}‖_2 of the approximate Jacobian for the quasi-Newton iterations we consider in this paper, i.e., diag[A_t(s*_{t−1})]. (Top right.) For each of the 10 random seeds, we plot ‖J̃_QN(s*_{1:T})^{−1}‖_2. We observe that these values are always larger than one. (Bottom left.) We plot the difference ‖Ã_{t+1} − A_{t+1}‖_2 between approximate Jacobians and true dynamics Jacobians over all time steps and seeds for quasi-Newton, Jacobi, and Picard iterations. We observe that this difference for Picard iterations is always larger than one, and so we would intuitively expect Picard iteration to be very slow for parallelizing GRUs. This behavior is precisely what we see in Figure 29. (Bottom right.) Across the 10 random seeds, we plot the value of γ for Jacobi and quasi-Newton iterations (Picard would be O(T) and so is not shown). Because ‖J̃_J(s*_{1:T})^{−1}‖_2 = 1, the 10 values of γ_J are equivalent to the maximum values of the differences in the bottom left panel over the 10 random seeds. However, since the top right panel shows that ‖J̃_QN(s*_{1:T})^{−1}‖_2 > 1, the values of γ_QN are larger than the corresponding differences in the bottom left panel. In summary, because the values of γ_J are smaller than the values of γ_QN, we would intuitively expect Jacobi to converge in fewer fixed-point iterations, which is exactly what we observe in Figure 29.
We observe that diff(A_J) and diff(A_QN) are always below one, which corresponds to their fast rates of convergence demonstrated in Figure 29. In contrast, diff(A_P) is always greater than one, which corresponds to the slow rates of convergence of Picard iteration in the experiment depicted in Figure 29. In conclusion, we expect quasi-Newton and Jacobi iterations to join Newton iterations in excelling in this setting, while we would expect Picard iterations to converge prohibitively slowly. This behavior is exactly what we observe in Figure 29.

6.3.3 Case Study #3: Jacobi iterations struggle to parallelize discretized Langevin diffusion

The discretized Langevin dynamics shown in Figure 31A are

x_{t+1} = x_t − ε∇φ(x_t) + √(2ε) w_t =: f(x_t),    ∂f/∂x = I_D − ε∇²φ(x_t).

Figure 31: Jacobi iterations struggle when the dynamics Jacobian is close to the identity. We evaluate Langevin dynamics for a potential φ. A: The nonlinear dynamics of Langevin dynamics for a potential φ and step size ε, where x_t is the state and w_t is Gaussian noise. B: The Jacobian for Langevin dynamics is well-approximated by the identity matrix, especially for the small step size ε = 1×10⁻⁵. C: We evaluate Langevin dynamics for larger dimensions (D from 32 to 256), plotting the median of 10 random seeds.
Jacobi iterations consistently take T steps and are always slower than sequential evaluation, while the other fixed-point methods converge in fewer than T steps and can be faster than sequential evaluation. The missing Newton iteration points indicate that the GPU ran out of memory.

Based on the theoretical analysis presented in Proposition 6.3, we expect that if the Jacobian of the dynamics function is well-approximated by the identity matrix, then Picard should converge relatively quickly and at considerably lower cost, especially when compared to the other zeroth-order method, Jacobi iteration. A canonical example of such a system, where the dynamics are close to the identity, comes from a discretization of Langevin dynamics [65, 139]. Langevin dynamics are a workhorse for MCMC [20] and motivated the development of score-matching methods [211], which are closely related to diffusion models [100, 209, 214]. As we discussed in Subsection 5.5.2, Langevin dynamics follow equation (51), and consequently have a dynamics Jacobian that is well-approximated by the identity matrix for small step sizes ε. More generally, the identity approximation tends to be well-suited to problems where a differential equation is discretized with small step sizes, such as when sampling from diffusion models [104]. In fact, simply by observing the structure of the Jacobian in Panel B of Figure 31, we can see that the diff(·) operator for Newton, quasi-Newton, and Picard iterations in this setting will be close to zero, while diff(A_J) will be close to one. Therefore, based on our analysis in Proposition 6.3, we hypothesize that the other fixed-point methods should dramatically outperform Jacobi iterations in this setting.

We test this hypothesis with a simple experiment shown in Figure 31. We simulate Langevin dynamics on a potential φ given by the negative log probability of a mixture of two anisotropic Gaussians.
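The following sketch illustrates the same effect with a tractable quadratic potential φ(x) = ‖x‖²/2 (so ∇φ(x) = x), a hypothetical stand-in for the Gaussian-mixture potential used in Figure 31; the dimensions and step size are also illustrative. Because the dynamics Jacobian here is (1 − ε)I_D, close to the identity, Picard iterations converge in far fewer than T sweeps.

```python
import numpy as np

rng = np.random.default_rng(0)
D, T, eps = 32, 200, 1e-3
noise = np.sqrt(2 * eps) * rng.standard_normal((T, D))
x0 = np.ones(D)

# Sequential reference roll-out of x_{t+1} = x_t - eps * grad_phi(x_t) + sqrt(2 eps) w_t
x_star = np.empty((T, D))
x = x0
for t in range(T):
    x = x - eps * x + noise[t]
    x_star[t] = x

# Picard sweeps: x_t <- x_0 + sum_{s<=t} (f_s(x_{s-1}) - x_{s-1}),
# a cumulative sum that is itself a (very cheap) parallel scan.
X = np.zeros((T, D))
for i in range(T):
    prev = np.vstack([x0, X[:-1]])                    # lagged sequence
    X_new = x0 + np.cumsum(-eps * prev + noise, axis=0)
    delta = np.max(np.abs(X_new - X))                 # change between sweeps
    X = X_new
    if delta < 1e-10:
        break
n_picard = i + 1
```

Each sweep contracts the error by roughly a factor εT here, so a handful of cumulative sums replaces the T sequential steps.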
In this setting, Picard iterations take far fewer than T iterations to converge and can be faster than sequential evaluation. We note that quasi-Newton iterations, which include information only about the diagonal of the Jacobian of the dynamics, achieve comparable wall-clock time by virtue of taking fewer iterations to converge (though each fixed-point iteration involves more work).

Whether fixed-point iterations are faster than sequential evaluation also depends on memory utilization. For example, Shih et al. [201] and Lu, Zhu, and Hou [153] demonstrated wall-clock speed-ups when using Picard iterations for sampling from a diffusion model by using a "sliding window" to only evaluate chunks of the sequence length where the parallel scan algorithm can fit in memory. As we discuss in Section 3.2, using the sliding window is best practice for parallel Newton methods and should be used in all future work.

6.4 Related Work

In this chapter, we unify prominent fixed-point methods for the parallel evaluation of sequences in the language of linear dynamical systems. While many papers have employed different fixed-point iterations for different problems in machine learning — Lim et al. [142], Danieli et al. [41], and Danieli et al. [40] using Newton iterations; Tang et al. [221] and Gonzalez et al. [80] using quasi-Newton iterations; Shih et al. [201] using Picard iterations; and Song et al. [213] using Jacobi iterations, among other works — to the best of our knowledge no one has explicitly unified these different methods in the language of linear dynamical systems.
General unification of fixed-point methods: parallel-chord methods. While connections between Newton's method and Picard iterations have been made before outside of the machine learning literature, our contribution is the tight coupling of these methods to LDSs in the context of parallel evaluation of nonlinear sequences. Ortega and Rheinboldt [180, Ch. 7] considered the problem of solving a nonlinear equation F(s) = 0. They showed that Newton and Picard iterations are special cases of general iterative methods where each iterate is given by

s^{(i+1)} = s^{(i)} − J̃(s^{(i)})^{−1} F(s^{(i)}),    (64)

for some matrix J̃(s^{(i)}). We discuss the relationship between the unifying frameworks put forward in Ortega and Rheinboldt [180] and in our paper at greater length in Appendix C. The primary difference is that, by focusing on the setting of nonlinear sequence evaluation, we bring into greater focus the role of the Jacobian of the dynamics function. Moreover, by unifying fixed-point iterations in the language of LDSs, we emphasize their parallelizability over the sequence length using the parallel scan [24].

Convergence rates of fixed-point methods. In the context of the analysis of fixed-point methods in general, there is a broad literature [180, 238] on the convergence rates of different fixed-point methods. For example, Ortega and Rheinboldt [180] also proved convergence rates for iterative methods of the form in equation (64). Though their methods have much in common with the proof techniques used to prove Proposition 6.3 of this paper, their results are actually trivial in the setting considered in this paper.
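The family (64) can be instantiated directly, if naively, by building J̃ as a dense T × T matrix for a scalar recursion. The recursion s_t = a·tanh(s_{t−1}) + b_t and all sizes below are hypothetical, and the thesis performs these solves with parallel scans rather than dense linear algebra; the point of the sketch is that swapping the subdiagonal of J̃ switches between Newton, Picard, and Jacobi, and that all three converge within T iterations, as discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)
T, a, s0 = 20, 0.5, 0.3
b = rng.standard_normal(T)

def lagged(s):
    return np.concatenate(([s0], s[:-1]))        # (s_0, s_1, ..., s_{T-1})

def residual(s):
    return s - (a * np.tanh(lagged(s)) + b)      # F(s)_t = s_t - f_t(s_{t-1})

def run(subdiag_fn, n_iters):
    """Update (64) with the approximate-Jacobian subdiagonal given by subdiag_fn."""
    s = np.zeros(T)
    for _ in range(n_iters):
        prev = lagged(s)
        J = np.eye(T) - np.diag(subdiag_fn(prev)[1:], k=-1)
        s = s - np.linalg.solve(J, residual(s))
    return s

# Sequential ground truth
s_star = np.empty(T)
x = s0
for t in range(T):
    x = a * np.tanh(x) + b[t]
    s_star[t] = x

# Newton uses the true derivative f'_t; Picard uses 1; Jacobi uses 0.
newton = run(lambda p: a * (1.0 - np.tanh(p) ** 2), T)
picard = run(lambda p: np.ones_like(p), T)
jacobi = run(lambda p: np.zeros_like(p), T)
```

In practice Newton needs far fewer than T iterations on a contractive recursion like this one; the dense solve is only for exposition.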
Part of the reason³ for the inapplicability of the convergence results of Ortega and Rheinboldt [180] to our paper is that they consider the asymptotic setting, whereas it has been firmly established that, in the particular setting considered in this paper, Jacobi, Picard, quasi-Newton, and Newton iterations all globally converge in at most T iterations [80, 201, 221]. Moving beyond this worst-case analysis, in Chapter 5 we show that the difficulty of parallelizing a dynamical system is directly related to the stability of the system, which can be thought of as the "average" spectral norm of ∂f_{t+1}/∂x_t. Proposition 4 of Lu, Zhu, and Hou [153] develops the foundations of the convergence analysis we present in Proposition 6.3. We extend their work by applying it to a wider variety of fixed-point methods, explicitly bounding many quantities of interest, and demonstrating its relevance in simulation.

Other fixed-point methods: mixing sequential with parallel. In this chapter, we focus on Jacobi, Picard, and Newton iterations because of their prominence [40, 41, 61, 80, 83, 111, 142, 201, 212, 213, 244] and their relationship to LDSs, as listed in Table 6. However, there is a wide literature on iterative solvers [180, 238]. Many of these other methods can also be parallelized over the sequence length, or provide a mixture of parallel and sequential computation. For example, as we discussed in Section 1.1, Naumov [173] and Song et al. [213] consider using Gauss-Seidel iterations to accelerate computations in deep learning. Although Gauss-Seidel iterations reduce to sequential evaluation when applied to Markovian processes, Song et al. [213] also emphasize how the structure of the problem and hardware considerations dictate the optimal mixture of parallel and sequential computation.

³ We elaborate in Appendix C.
Parareal iterations mix parallel and sequential computation by applying parallelization at multiple length scales, and have also been used to parallelize diffusion models [199]. Tang et al. [221] also parallelized diffusion models using both a generalization of Jacobi iterations and Anderson acceleration [2, 229], which they modify into a form of quasi-Newton method.

6.5 Discussion

This work unified a variety of approaches for parallelizing recursions via fixed-point iterations—including zeroth-order methods like Jacobi and Picard iterations as well as first-order methods like Newton and quasi-Newton iterations—under a common framework. In each case, the iterates reduce to evaluating an appropriately constructed linear dynamical system that approximates the nonlinear recursion of interest. Moreover, we have demonstrated how this unifying framework provides insight into which problems in machine learning are likely to benefit from which types of fixed-point iterations. In particular, we demonstrate that the structure of the Jacobian matrix of the dynamics function plays a key role in determining which fixed-point method to use.

For this reason, understanding the structure of the Jacobian of the dynamics function is important for using our framework. Fortunately, there are many problems where the structure of the Jacobian matrix is known in advance. As we showed in Subsection 6.3.1, the group word problem can always be simulated with permutation matrices as its dynamics. As we showed in Subsection 6.3.3, discretized roll-outs of differential equations, used in sampling from diffusion models and rolling out neural ODEs, have ∂f/∂s equal to the identity matrix plus a correction term scaled by the discretization step size. Moreover, as shown in Zoltowski et al. [244], the dynamics of the position and momentum variables in Hamiltonian Monte Carlo (HMC) result in banded matrices.
Furthermore, in sequence modeling, one can design a recurrent neural network to have Jacobians with desired structure, as we discussed in Subsection 3.4.3. Finally, if there is truly no analytic information about the Jacobian in advance, its structure could be probed with finite-difference methods.

Future directions. Clarifying the relationships and properties of these approaches through the lens of linear dynamical systems also suggests promising areas for future study. One clear direction for future work is to explore additional approaches for exploiting problem-specific structure, using our unifying framework to develop new fixed-point iterations. For example, an intermediate between Picard and quasi-Newton methods is a scaled identity approximation, Ã_t = a_t I_D. If we had prior knowledge of the appropriate scaling factors a_t ∈ R, we could avoid computing any Jacobian-vector products. More generally, there exist other groups of structured matrices with compact representations that are closed under composition, such that parallel evaluation of the LDS would be computationally efficient. Examples include permutation matrices, block-diagonal matrices, and block matrices where each sub-block is diagonal, among others. Future work should enumerate these use cases and investigate problem-specific applications where they are appropriate. One example application is more efficient parallelization of the group word problem using a compact representation of permutation matrices, as was done by Terzić et al. [222].

In conclusion, understanding the shared backbone of these fixed-point methods can give practitioners guidance about which methods to use for which problems. As parallel evaluation of seemingly sequential processes becomes increasingly important in machine learning, these insights may provide valuable guidance to the field.
Part IV: CONCLUSION

We conclude with a synthesis of our contributions and discuss future research directions in the parallelization of sequential models.

Figure 32: What unexplored, verdant pastures await the ungulate (parallel Newton) methods?

7 Conclusion and Future Directions

This dissertation has challenged the conventional wisdom that recurrent neural networks and other state space models are "inherently sequential." Through a combination of algorithmic innovation and theoretical analysis, we have demonstrated that predictable state space models can be evaluated efficiently on parallel hardware, with computational depth scaling as O((log T)²) rather than O(T).

Parallel Newton methods are powerful tools to accelerate computation previously believed to be "inherently sequential." This parallelization has the direct benefit of accelerating established methods like nonlinear RNNs [40, 61, 80, 142], Markov chain Monte Carlo [244], and the vast range of important applications of state space models in machine learning broadly (see Table 1). Perhaps even more importantly, parallel Newton methods allow researchers to explore alternative approaches using state space models more quickly, which may enable even more fundamental breakthroughs in the future. In this conclusion, we briefly recapitulate the main contributions of this thesis and highlight important directions for future work on parallel Newton methods.

7.1 Summary of Contributions

This thesis contributes to both the methodology and the theoretical understanding of parallel Newton methods. Part I presents our methodological contributions. We extend parallel Newton methods by making connections to other canonical techniques from numerical analysis.
In particular, we

• improve the scalability of parallel Newton methods by making connections to the quasi-Newton literature (Chapter 3); and
• improve the stability of parallel Newton methods by making connections to the trust-region literature (Chapter 4).

Part II presents our theoretical contributions. Driven by a desire to understand the limits of parallelizability, we conduct an in-depth analysis of the convergence rates of parallel Newton methods. In particular, we

• establish a novel connection between the predictability of the SSM dynamics and the conditioning of the merit function minimized by parallel Newton methods (Chapter 5). This connection allows us to derive convergence rates for DEER (the Gauss-Newton method for parallelizing nSSMs), and leads to the conclusion that we can parallelize predictable dynamics, but should evaluate unpredictable dynamics sequentially. We also
• crystallize a unifying framework that shows how other popular fixed-point methods, like Picard and Jacobi iterations, are also parallel Newton methods with different approaches to approximating the Jacobian (Chapter 6). This unifying framework allows for a general study of the convergence rates of many fixed-point methods, and highlights the settings where different methods excel.

These methodological and theoretical contributions provide a strong foundation for the deployment of parallel Newton methods. However, this research program is just beginning, and so we highlight exciting future directions in the next section.

7.2 Future Directions

We highlight two important directions for future work:

• improving the methodology and implementation of parallel Newton methods; and
• finding the best applications of parallel Newton methods across the wide range of state space models (Table 1).
7.2.1 Improving parallel Newton methods

The growing excitement around parallel computation in machine learning has led to the recent development of parallel Newton methods across many different fields, including parallelizing nonlinear RNNs [40, 61, 80, 142], sampling from diffusion models [41, 153, 199, 201, 221], sampling from MCMC chains [83, 244], and solving differential equations [111]. However, as all of these developments are recent and scattered across different subfields, there is still much work to be done in optimizing and improving these methods, both in terms of algorithmic innovation and efficient implementation.

7.2.1.1 Broadening our use of numerical analysis

A key contribution of Part I of this thesis was extending parallel Newton methods by drawing on the vast literature of numerical analysis, in our case quasi-Newton and trust-region methods. However, we have only scratched the surface of numerical analysis [26, 48, 179, 180]. We hope to begin a wide-ranging research program to import useful techniques from numerical analysis to further improve parallel Newton methods. We discussed many extensions in Section 3.4.

Another example is broadening the range of targets for our parallel Newton methods. In this dissertation, we apply parallel Newton methods only to the goal of rolling out the dynamics in equation (1) from a fixed initial condition s_0. However, instead of considering only initial conditions, we could also consider boundary value problems, where we may know the desired state at both the start (t = 0) and the end (t = T) [5, 123]. Such a boundary value problem arises, for example, in the E-step of a predictive coding network [112]. This simple change adjusts certain aspects of the theory of parallel Newton methods.
For example, in a boundary value problem, it is no longer guaranteed that there is a unique global minimizer, or that the minimizer yields a merit function value of 0. Moreover, each parallel Newton step now requires not one but two parallel scans (one in the forward direction, one in the backward direction), which may enhance the appeal of smoothing-inspired approaches. Broadly speaking, expanding the richness of the problems to which we apply parallel Newton methods will require deeper usage of techniques from numerical analysis, and possibly even further contributions to that field.

7.2.1.2 Efficient implementation on parallel hardware

As we discussed in Section 2.2, a fundamental ingredient of the parallel Newton methods presented in this thesis is the parallel scan. However, there are a host of implementation details for using the parallel scan when programming on accelerated hardware like GPUs [91, 195, 235]. For example, the presence of a general-purpose parallel scan is, as of the time of writing, a major difference between JAX [27] and PyTorch [184], two leading Python libraries for deep learning. JAX has a general-purpose parallel scan (jax.lax.associative_scan) as a fundamental primitive, which allows for the implementation of a wide range of parallel scans. For example, dynamax, a JAX library for probabilistic state space modeling [144], implements the parallel filtering and smoothing algorithms from Särkkä and García-Fernández [192]. In contrast, PyTorch currently has only torch.cumsum, the parallel scan whose binary associative operator is addition,¹ and torch.cumprod (for scalar multiplication). This difference is why we implement the experiments in this dissertation in JAX. The lack of a general-purpose parallel scan in PyTorch has also led to the development of highly optimized, hardware-aware custom CUDA kernels for parallel scans [195].
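For readers unfamiliar with the primitive, the following NumPy sketch shows the binary associative operator that jax.lax.associative_scan would use to evaluate a scalar LDS x_t = a_t x_{t−1} + b_t: each sequence element is the affine map x ↦ ax + b, and composing two such maps is associative. The left fold below applies the operator sequentially for clarity; a parallel scan reorders the same compositions into a tree of depth O(log T).

```python
import numpy as np

def combine(e1, e2):
    """Compose affine maps: first apply (a1, b1), then (a2, b2)."""
    a1, b1 = e1
    a2, b2 = e2
    return (a2 * a1, a2 * b1 + b2)

rng = np.random.default_rng(0)
T = 64
a = rng.uniform(0.5, 1.0, size=T)
b = rng.standard_normal(T)

# Inclusive scan via the associative operator; with x_0 = 0, the offset part
# of each composite map is exactly the state x_t.
acc, x_scan = (1.0, 0.0), []
for t in range(T):
    acc = combine(acc, (a[t], b[t]))
    x_scan.append(acc[1])
x_scan = np.array(x_scan)

# Sequential reference roll-out
x, x_seq = 0.0, np.empty(T)
for t in range(T):
    x = a[t] * x + b[t]
    x_seq[t] = x
```

Passing `combine` (in vectorized form) and the stacked (a, b) pairs to jax.lax.associative_scan yields the same sequence of states with logarithmic depth; this is the primitive that each parallel Newton iteration invokes.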
These custom parallel scans appear most prominently in Mamba [85], a leading SSM for language modeling, and ParaRNN [40], which applies parallel Newton iterations to 7B-parameter nonlinear RNNs to achieve strong language modeling performance. There also exist useful implementations of parallel scans for scalar/diagonal LDSs in PyTorch, such as [135], which we used to implement quasi-Newton iterations in PyTorch in this repo: https://github.com/lindermanlab/elk-torch. Further improvements in the implementation of parallel scans will directly improve the performance of parallel Newton methods.

¹ Although Heinsen [97] shows that clever uses of torch.cumsum can parallelize scalar/diagonal LDSs, of the type that are used in quasi-DEER.

Moreover, in practice, when parallelizing over long sequences (T ≫ D), the memory cost is often dominated by the size of intermediate state representations and the need to unroll computations over multiple fixed-point iterations. Chunking (dividing the sequence into smaller windows) and truncation (limiting the number of fixed-point iterations) are useful strategies to reduce memory usage in these settings [43, 70, 199, 201, 244].

Numerical stability and low precision. A particularly important area for improvement of parallel Newton methods is their numerical stability and, in particular, their ability to handle lower precision. LDS matrices with spectral norm close to or greater than one can cause numerical instabilities in the parallel scan operation [79, 80]. This is especially critical in high-precision tasks or over long sequences, and practitioners should monitor for numerical divergence or the accumulation of floating-point error. In practice, it has been extremely difficult to get parallel Newton methods to work reliably with lower precision than float32.
Unfortunately, the tensor cores of modern GPUs are optimized to work best at lower precision (achieving much higher FLOPs per second in lower precision) [158, 167]. Therefore, improving the robustness of parallel Newton methods in lower precision (algorithmically or in their implementation) is very important for the deployment of these methods, especially as quantization and low precision become increasingly important in industrial AI [50, 72].

Fundamentally different approaches. Finally, we must be open to radically different and possibly transformational approaches to parallelizing over the sequence length. For example, the parallel Newton methods presented in this thesis are predicated on the ease of parallelizing linear dynamical systems with a parallel scan, and the difficulty of directly parallelizing a nonlinear dynamical system with a parallel scan. However, as discussed in Subsection 2.2.4, composition of functions is inherently a binary associative operator; it is difficulties around intermediate storage that prevent us from directly using a parallel scan to parallelize nSSMs. We should be open to the existence of ingenious intermediate representations of compositions of nonlinear functions that remain expressive enough for a broad range of applications. There may also be useful connections to Koopman operator theory [130, 166, 233] that could allow us to (at least approximately) parallelize nonlinear dynamical systems in a constant number of iterations, even when the dynamics are marginal or unpredictable.

We should even be open to eschewing the parallel scan entirely! For example, Tang et al. [221] build up a structured matrix $G$ that approximates $J^{-1}$. Each application of their parallel Newton step (called ParaTAA, where TAA stands for "Triangular Anderson Acceleration") is then simply matrix multiplication by $G$.
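The flavor of such scan-free updates can be sketched in a toy (this is not ParaTAA's construction of $G$, which comes from triangular Anderson acceleration; here we simply truncate the Neumann series of $J^{-1}$ for a scalar linear recurrence): each update is one matrix multiplication, the error contracts by $N^K$ per step, and convergence is exact in $\lceil T/K \rceil$ iterations.

```python
import numpy as np

# Toy sketch: approximate J^{-1} by a FIXED matrix G and iterate
# s <- s - G r(s). The dynamics are the linear recurrence
# s_t = a s_{t-1} + b_t, so J = I - N with N the subdiagonal of a's;
# we take G = sum_{k<K} N^k, the Neumann series truncated to K terms.
T, a, K = 16, 0.5, 4
N = np.diag(np.full(T - 1, a), k=-1)
J = np.eye(T) - N
b = np.random.default_rng(2).normal(size=T)   # inputs (s_0 folded into b[0])

G = sum(np.linalg.matrix_power(N, k) for k in range(K))

s_true = np.linalg.solve(J, b)
s, iters = np.zeros(T), 0
while np.linalg.norm(J @ s - b) > 1e-12 and iters < 20:
    s = s - G @ (J @ s - b)   # one matmul per step, no scan
    iters += 1

# Error maps as e <- (I - G J) e = N^K e, and N^K is nilpotent of
# index ceil(T/K), so convergence is exact in ceil(16/4) = 4 steps.
assert iters == 4
assert np.allclose(s, s_true)
```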
Another approach that could eschew parallel scans is to use conjugate gradient methods [200] to compute the solve $J^{-1} r$. These approaches based on direct matrix multiplication could get around the $O(\log T)$ depth of the parallel scan, as could approaches that truncate a work-inefficient parallel scan early. In short, there is a multitude of innovation yet to be discovered, both algorithmically and in the hardware-aware implementation of parallel Newton methods.

7.2.2 Finding the best applications of parallel Newton methods

Parallel Newton methods parallelize nonlinear SSMs over the sequence length. However, another (simpler) way to leverage parallel compute with SSMs is to evaluate many SSMs simultaneously. Moreover, parallel Newton methods as currently conceived simply evaluate and train SSMs; they do not change the underlying properties of the SSM. Thus, if the SSM itself has unfavorable properties, the parallel Newton method will not fix them. For these reasons, it is important to consider the utility of parallel Newton methods for various SSMs (Table 1). We highlight two important considerations.

7.2.2.1 Latency vs. Throughput

The benefit of parallel Newton methods is that they can decrease the latency of evaluating a single nSSM over its sequence length. Sequential evaluation requires $O(T)$ iterations to get from the start $s_0$ to the finish $s_T$. In contrast, if the SSM is predictable, then parallel Newton methods take $O((\log T)^2)$ iterations to evaluate the chain when there are $T$ processors, thus reducing the latency. However, if we had simply launched $T$ sequential chains simultaneously on such a parallel machine, then each clock tick would generate $T$ new samples $s^{(b)}_t$, where $b$ indexes the batch of $T$ chains.
In contrast, a parallel Newton method run on a single chain would use all $T$ processors, but would produce $T$ samples in $O((\log T)^2)$ time (one factor of $\log T$ for the parallel scan, and another factor of $\log T$ for the number of iterations needed for convergence). Therefore, batching sequential computation actually has better throughput than parallel Newton methods by a factor of $O((\log T)^2)$.

For this reason, parallel Newton methods excel in settings where we care about latency rather than throughput. Examples where latency is important include the training of nonlinear RNNs (where a forward pass must be completed before learning during the backward pass can begin) and sampling from an MCMC chain (where there is an initial burn-in period at the beginning of the chain, before the samples have converged to the target distribution). However, if throughput is more important for your application, then you will be better served by sequential evaluation with a large batch size.

7.2.2.2 Expressive nonlinear RNNs

Even in the setting of training nonlinear RNNs, where decreasing the latency of the forward pass is of vital importance, parallel Newton methods are only as good as the target they are evaluating. In other words, DEER exactly evaluates the forward and backward passes of a nonlinear RNN; if the underlying nonlinear RNN (say a GRU or LSTM) has undesirable properties, parallel Newton methods cannot fix them, because they will replicate those undesirable properties as well.

Stemming from their ability to simulate a Turing machine [202], RNNs have many desirable theoretical properties vis-à-vis transformers, including an improved ability to track state [149, 163, 164, 203] and the ability to express harder complexity classes [162]. However, recurrent architectures currently struggle relative to transformers both during training and evaluation.
During training, recurrent architectures continue to struggle with the problems of vanishing and exploding gradients [101, 102] and the curse of memory [19, 245]. Because recurrent architectures arise from the repeated application of the same cell, small changes in the parameters can result in large changes in performance, resulting in a jagged loss landscape [183, 219]. While research on improving gradient-based optimization of RNNs, like BPTT [232], remains ongoing [239], the future of RNN training might even eschew backpropagation altogether [33, 191], perhaps one day unlocking more biologically plausible learning rules at scale [17, 234].

Moreover, recurrent architectures also struggle with memory-retrieval tasks during in-context learning relative to transformers [3]. The hidden state of a transformer (its KV cache) scales linearly with the sequence length, while the hidden state of an RNN is constant size [88]. Thus, the RNN enforces compression, at the cost of reduced recall over long contexts.

The predictability of an RNN, which we define and discuss in Section 5.1, is an important concept for both the training and deployment of RNNs. Predictable RNNs will also enjoy stable backward passes, thus mitigating any issues from exploding gradients. However, an overly contracting RNN will struggle with recall. Along these lines, works such as Orvieto et al. [181] suggest that the best performing RNNs will have an LLE as close as possible to 0 without becoming chaotic. Nonetheless, as demonstrated in Chapter 3 and explained in Chapter 5, parallel Newton methods can struggle as we approach "the edge of stability" [11]. Therefore, while our theoretical work on predictability can help guide the design of nonlinear RNN architectures, fundamental work remains in the design, training, and orchestration of RNNs towards the goal of achieving human-like intelligence.
Part V

APPENDIX

A GLOBAL CONVERGENCE OF PARALLEL NEWTON METHODS

This appendix contains an extended discussion of the relationship between Proposition 3.1 of this thesis (Proposition 1 of Gonzalez et al. [80]) and Theorem 3.6 of Tang et al. [221]. At its core, Tang et al. [221] contains the fundamental ideas for global convergence with quasi-Newton methods, and their empirical results show that they had a correct understanding of this global convergence. However, to the best of my ability to understand the notation of Tang et al. [221], their Theorem 3.6 is both incorrect as stated and weaker than necessary. To this end, in this section we discuss the different thrusts of Proposition 1 of Gonzalez et al. [80] and Theorem 3.6 of Tang et al. [221], and present a cleaned statement and proof of the latter.

A.1 Comparison of the two results

In terms of superficial differences, one aspect to keep in mind is that Gonzalez et al. [80] focused on RNNs, while Tang et al. [221] focused on diffusion models. Both are nSSMs, and the goal of parallelization is identical. However, the direction of time differs between the two papers: for RNNs, time goes from 0 to $T$, while in sampling from diffusion models the convention is often for time to go backwards. To keep uniformity of notation throughout this thesis, we standardize on time going forwards for parallelization over the sequence length.

Both Gonzalez et al. [80] and Tang et al. [221] use quasi-Newton approaches towards this goal of parallelizing over the sequence length. However, Gonzalez et al. [80] approximate $J$ with $\tilde J$ and then use a parallel scan to invert $\tilde J$. In contrast, Tang et al. [221] use a form of Broyden's "bad" update, i.e., they approximate $J^{-1}$ directly with a matrix $G$. Tang et al. [221] have the very good insight that as long as $G$ satisfies certain conditions, global convergence of their quasi-Newton method (which they call ParaTAA) is guaranteed.

Lightly massaging the notation of Tang et al. [221] into the format of this thesis, their Theorem 3.6 states: Consider a general update rule, where in the $(i)$th iteration the update is
$$s^{(i+1)}_{1:T} = s^{(i)}_{1:T} - G^{(i)} r\big(s^{(i)}_{1:T}\big), \qquad (65)$$
with $G^{(i)}$ being any arbitrary matrix. If for any $j$ where $r^{(i)}_k = 0$ for $k < j$, the matrix $G^{(i)}$ satisfies $G^{(i)}[{:}jD, {:}jD] = I_{jD}$, then the update rule will converge within $T$ steps.

We put in red the part of this statement that is overly strong (i.e., rendering the statement incorrect), and in blue the part that is weaker than necessary. Again, note the effect of time reversal, and the fact that Tang et al. [221] define their residual function to be the negative of our definition in equation (11). In Figure 33, we illustrate how the conditions placed on $G$ in Tang et al. [221] interact with the update rule in equation (65).

Figure 33: Illustration of Theorem 3.6 of Tang et al. [221]. In this illustration, $j = 3$. The portion shaded red must be zero for the proof by induction to work, showing that $G$ cannot be an arbitrary matrix. The portion shaded blue can be nonzero, and the proof by induction will still hold.

In particular, in Figure 33, we see that the blue-shaded blocks can be zero because they are always multiplied against residual entries that are zero. If the red-shaded blocks are not zero, however, there is in general no guarantee that they will only be multiplied against zero entries, and so they can in general undo the causal filling-in effect of this family of induction proofs. Since throughout their paper Tang et al. [221] consider $G$ that are lower triangular, this point is very minor.
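The causal filling-in mechanism behind these induction proofs can be checked numerically (a toy of our own with $D = 1$): the simplest $G$ satisfying the lower-triangular, identity-diagonal-block condition is $G = I$, for which update (65) reduces to the Jacobi sweep $s_t \leftarrow f_t(s_{t-1})$, and the converged prefix grows by at least one index per iteration.

```python
import numpy as np

T = 10
s0 = 0.3
f = lambda s: np.tanh(2.0 * s) + 0.1   # any nonlinear dynamics

# Ground truth by sequential rollout.
s, s_true = s0, []
for _ in range(T):
    s = f(s)
    s_true.append(s)
s_true = np.array(s_true)

def residual(s):
    # r_t = s_t - f(s_{t-1}), with s_0 fixed.
    prev = np.concatenate(([s0], s[:-1]))
    return s - f(prev)

s_par = np.zeros(T)
for i in range(T):
    s_par = s_par - residual(s_par)   # update (65) with G = I, i.e. Jacobi
    # After i+1 sweeps, the first i+1 states are already exact.
    assert np.allclose(s_par[: i + 1], s_true[: i + 1])

assert np.allclose(s_par, s_true)   # converged within T sweeps
```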
Nonetheless, for clarity we reiterate that $G$ cannot be an arbitrary matrix for the proof by induction to hold (though lower triangular would certainly suffice).

Let us give a simple and concrete counterexample showing that if $G$ is an arbitrary matrix, then global convergence of equation (65) is not guaranteed. Let $f(s) = 2s$, and consider $s_0 = 2$. Then, if $T = 2$, it follows that $s^\star_1 = 4$ and $s^\star_2 = 8$. Consider the initialization $s^{(0)}_1 = s^{(0)}_2 = 2$, so that $r^{(0)}_1 = -2$ and $r^{(0)}_2 = -2$. Let the matrix $G^{(i)}$ be determined by the following rule:

• If $r^{(i)}_1 = r^{(i)}_2 = 0$, then $G$ is the identity matrix;
• otherwise, $G = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}$.

This rule satisfies the conditions of Theorem 3.6 of Tang et al. [221]. Applying equation (65), the first update gives $s^{(1)}_1 = 6$, $s^{(1)}_2 = 4$, and the second update gives $s^{(2)}_1 = 12$, $s^{(2)}_2 = 12$. Therefore, we do not get convergence in $T = 2$ iterations.

For an example showing that $G^{(i)}[{:}jD, {:}jD]$ need not be the identity matrix for global convergence to hold, simply consider $G^{(i)} = J^{-1}$, as shown in equation (19). While $J^{-1}$ is block lower triangular, its blocks of the form $J^{-1}[{:}jD, {:}jD]$ are not exclusively identity matrices.

A.2 Corrected version of Theorem 3.6 of Tang et al.

For the purpose of maintaining the literature, we now present a corrected version of Theorem 3.6 of Tang et al. [221]:

Theorem A.1. Consider a general quasi-Newton update of the form in equation (65). Assume $G^{(i)}$ is lower triangular and satisfies
$$G^{(i)}[iD : (i+1)D - 1,\; iD : (i+1)D - 1] = I_D.$$
Then the update rule will converge to $s^\star$ within $T$ steps.

Proof. By induction.

• Induction hypothesis: assume that at iteration $(i)$, we have $s^{(i)}_t = s^\star_t$ for all $t \leqslant i$.
• Base case: at iteration 0, $s_0 = s^\star_0$ by construction.
• Induction step: we also have $r^{(i)}_t = 0$ for all $t \leqslant i$.
So $G^{(i)} r^{(i)}_{1:T}$ has its first $i$ blocks (i.e., $iD$ entries) equal to zero. Moreover, because the $(i+1)$st diagonal block of $G^{(i)}$ is the identity, the $(i+1)$st block entry of $G^{(i)} r^{(i)}_{1:T}$ equals $s^{(i)}_{i+1} - f_{i+1}(s^\star_i)$, so that $s^{(i+1)}_{i+1} = s^\star_{i+1}$. Thus, assuming the induction hypothesis holds at iteration $(i)$, it also holds at iteration $(i+1)$. ∎

Note that all quasi-DEER updates (i.e., those of the form shown in Table 6) satisfy the assumption of Theorem A.1, as $\tilde J$ is lower triangular and has all identities on its block diagonal. Thus, Theorem A.1 is a generalization of Proposition 3.1: Proposition 3.1 discusses only approximations to the dynamics Jacobians $A_t$, while Theorem A.1 allows for approximations to the inverse of the Jacobian of the residual function, i.e., $J^{-1}$.

B PREDICTABILITY AND CONDITIONING

This appendix provides the proof of Theorem 5.3 and an extended discussion of its assumptions and implications.

B.1 Theorem statement and proof

Theorem (Theorem 5.3). Assume that the LLE regularity condition from equation (37) holds. Then if $\lambda \neq 0$, the PL constant $\mu$ of the merit function in (35) satisfies
$$\frac{1}{a} \cdot \frac{e^{\lambda} - 1}{e^{\lambda T} - 1} \;\leqslant\; \sqrt{\mu} \;\leqslant\; \min\left( \frac{1}{b} \cdot \frac{1}{e^{\lambda (T-1)}},\; 1 \right). \qquad (66)$$
If $\lambda = 0$, then the bounds are instead
$$\frac{1}{aT} \;\leqslant\; \sqrt{\mu} \;\leqslant\; \min\left( \frac{1}{b} \sqrt{\frac{2D}{T+1}},\; 1 \right).$$

Proof. Notice that the residual function Jacobian $J$ can be written as the difference of the identity and a $T$-nilpotent matrix $N$:
$$J = I_{TD} - N, \quad \text{with} \quad N^T = 0_{TD}.$$
Because $N$ is nilpotent, the Neumann series for $J^{-1}$ is a finite sum:
$$J^{-1} = (I_{TD} - N)^{-1} = \sum_{k=0}^{T-1} N^k. \qquad (67)$$
Straightforward linear algebra also shows that the norms of the powers of this nilpotent matrix are bounded, $\|N^k\|_2 \leqslant a\, e^{\lambda k}$, which enables one to upper bound the norm of the inverse Jacobian:
$$\|J^{-1}\|_2 \;\leqslant\; \sum_{k=0}^{T-1} \|N^k\|_2 \;\leqslant\; \sum_{k=0}^{T-1} a\, e^{\lambda k} \;=\; a\, \frac{1 - e^{\lambda T}}{1 - e^{\lambda}}. \qquad (68)$$
The powers of $N$ are closely related to the dynamics of the nonlinear state space model; we provide a dynamical interpretation in Section B.2.

To lower bound $\|J^{-1}\|_2$, we observe that, by the SVD, the spectral norm satisfies
$$\|J^{-1}\|_2 = \sup_{\|x\|_2 = 1,\, \|y\|_2 = 1} x^\top J^{-1} y. \qquad (69)$$
We pick two unit vectors $u$ and $v$, both in $\mathbb{R}^{TD}$, that are zero everywhere except where needed to pull out the bottom-left block of $J^{-1}$ (i.e., the only non-zero block of $N^{T-1}$, which equals $A_T A_{T-1} \cdots A_2$). Doing so, we get
$$u^\top J^{-1} v = \tilde u^\top (A_T A_{T-1} \cdots A_2)\, \tilde v,$$
where $\tilde u$ and $\tilde v$ are unit vectors in $\mathbb{R}^D$, equal to the nonzero entries of $u$ and $v$. Because of equation (69), it follows that
$$\tilde u^\top (A_T A_{T-1} \cdots A_2)\, \tilde v \;\leqslant\; \|J^{-1}\|_2, \qquad (70)$$
i.e., we also have a lower bound on $\|J^{-1}\|_2$. Furthermore, choosing $\tilde u$ and $\tilde v$ so that $\tilde u^\top (A_T A_{T-1} \cdots A_2)\, \tilde v = \|A_T A_{T-1} \cdots A_2\|_2$, we can plug this choice into equation (70) to obtain
$$\|A_T A_{T-1} \cdots A_2\|_2 \;\leqslant\; \|J^{-1}\|_2.$$
Applying the regularity condition (37) for $k = T - 1$ and $t = 2$, we obtain
$$b\, e^{\lambda (T-1)} \;\leqslant\; \|J^{-1}\|_2. \qquad (71)$$
Because $\lambda_{\min}(J^\top J) = 1 / \|J^{-1}\|_2^2$, the result for $\lambda \neq 0$ follows by applying equations (68) and (71) at all $s^{(i)}$ along the optimization trajectory. Note that any choice of $\tilde u$ and $\tilde v$ results in a lower bound; we could also have targeted the block identity matrices. It therefore also follows that $1 \leqslant \|J^{-1}\|_2$, and so
$$\max\left( b\, e^{\lambda (T-1)},\; 1 \right) \;\leqslant\; \|J^{-1}\|_2.$$

Finally, let us conclude by considering the case $\lambda = 0$. In this setting, the lower bound on $\sqrt{\mu}$ follows from L'Hôpital's rule. For the upper bound, we again must lower bound $\|J^{-1}\|_2$.
To do so, we leverage the relationship between the spectral and Frobenius norms: for an $n \times n$ matrix $M$,
$$\frac{\|M\|_F}{\sqrt{n}} \;\leqslant\; \|M\|_2 \;\leqslant\; \|M\|_F. \qquad (72)$$
The squared Frobenius norm $\|J^{-1}\|_F^2$ is the sum of the squares of all entries, and it factors over the block structure of the matrix: $\|J^{-1}\|_F^2$ is the sum of the squared Frobenius norms of the blocks. Each block has spectral norm lower bounded by $b$, so each block also has Frobenius norm lower bounded by $b$. Summing over all $T(T+1)/2$ nonzero blocks, it follows that
$$b^2\, \frac{T(T+1)}{2} \;\leqslant\; \|J^{-1}\|_F^2 \quad \text{and} \quad \|J^{-1}\|_F \;\leqslant\; \sqrt{TD}\, \|J^{-1}\|_2.$$
Putting these equations together, it follows that
$$b \sqrt{\frac{T(T+1)}{2}} \;\leqslant\; \sqrt{TD}\, \|J^{-1}\|_2, \quad \text{or} \quad b \sqrt{\frac{T+1}{2D}} \;\leqslant\; \|J^{-1}\|_2,$$
and so the upper bound on $\sqrt{\mu}$ when $\lambda = 0$ follows from taking reciprocals. ∎

The above proof sheds light on how many dynamical system properties fall out of the structure of $J(s)$, which we now discuss further.

B.2 Discussion of why small singular values lead to ill-conditioning

Recall that our goal is to find a lower bound on the smallest singular value of $J(s)$, which we denote by $\sigma_{\min}(J(s))$. This quantity controls the difficulty of optimizing $\mathcal{L}$. For example, the Gauss-Newton update is given by $J(s)^{-1} r(s)$. Recall that $\sigma_{\max}(J(s)^{-1}) = 1/\sigma_{\min}(J(s)) = \|J(s)^{-1}\|_2$, and that an interpretation of the spectral norm $\|J(s)\|_2$ is how much multiplication by $J(s)$ can increase the length of a vector. Therefore, very small values of $\sigma_{\min}(J(s))$ result in large values of $\|J(s)^{-1}\|_2$, which means that $\|J(s)^{-1} r(s)\|_2$ can become extremely large as well, and small perturbations in $r$ can lead to very different Gauss-Newton updates (i.e., the problem is ill-conditioned; cf. Nocedal and Wright [179], Appendix A.1).
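This amplification is easy to see numerically (a scalar toy with $D = 1$ and unstable dynamics $A_t = 2$, our own illustration): $\sigma_{\min}(J)$ collapses geometrically in $T$, and a microscopic perturbation of the residual moves the Gauss-Newton step by a macroscopic amount.

```python
import numpy as np

# J = I - N for the scalar recurrence with constant Jacobian A_t = 2.
# The bottom-left entry of J^{-1} is 2**(T-1), so ||J^{-1}||_2 >= 2048
# and sigma_min(J) = 1 / ||J^{-1}||_2 <= 1/2048.
T, a = 12, 2.0
J = np.eye(T) - np.diag(np.full(T - 1, a), k=-1)

U, S, Vt = np.linalg.svd(J)
sigma_min = S[-1]
assert sigma_min < 1e-3

# Perturb the residual by 1e-6 along the worst-case direction (the
# left singular vector of J for sigma_min): the Gauss-Newton step
# moves by 1e-6 / sigma_min, i.e. by more than 2e-3.
delta_r = 1e-6 * U[:, -1]
delta_step = np.linalg.solve(J, delta_r)
assert np.linalg.norm(delta_step) > 2e-3
```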
Furthermore, we observe that in the $\lambda > 0$ (unpredictable) setting and in the large-$T$ limit, the upper and lower bounds in (66) are tight, as both scale as $e^{-\lambda (T-1)}$. Thus, the upper and lower bounds together ensure that unpredictable dynamics will suffer from degrading conditioning. In contrast, in the $\lambda < 0$ (predictable) setting, the lower bound on $\sqrt{\mu}$ converges to $\frac{1 - e^{\lambda}}{a}$, which is bounded away from zero and independent of the sequence length. Thus, for predictable dynamics, there is a lower bound on $\sigma_{\min}(J)$ or, equivalently, an upper bound on $\sigma_{\max}(J^{-1})$.

B.3 The dynamical interpretation of the inverse Jacobian

As shown in the above proof,
$$J(s)^{-1} = (I_{TD} - N(s))^{-1} = \sum_{k=0}^{T-1} N(s)^k.$$
It is worth noting explicitly that
$$N(s) = \begin{pmatrix} 0 & 0 & \cdots & 0 & 0 \\ A_2 & 0 & \cdots & 0 & 0 \\ 0 & A_3 & \ddots & \vdots & \vdots \\ \vdots & & \ddots & 0 & 0 \\ 0 & 0 & \cdots & A_T & 0 \end{pmatrix}, \quad \text{where } A_t := \frac{\partial f_t}{\partial s_{t-1}}(s_{t-1}), \qquad (73)$$
i.e., $N(s)$ collects the Jacobians of the dynamics function along the first lower block diagonal. Each matrix power $N^k$ therefore collects length-$k$ products along the $k$th lower block diagonal. Thus, multiplication by $J(s)^{-1} = \sum_{k=0}^{T-1} N(s)^k$ recovers running forward a linearized form of the dynamics, which is one of the core insights of DeepPCR and DEER [41, 142].

Concretely, in the setting where $T = 4$, we have
$$N^0 = \begin{pmatrix} I_D & 0 & 0 & 0 \\ 0 & I_D & 0 & 0 \\ 0 & 0 & I_D & 0 \\ 0 & 0 & 0 & I_D \end{pmatrix}, \qquad N = \begin{pmatrix} 0 & 0 & 0 & 0 \\ A_2 & 0 & 0 & 0 \\ 0 & A_3 & 0 & 0 \\ 0 & 0 & A_4 & 0 \end{pmatrix},$$
$$N^2 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ A_3 A_2 & 0 & 0 & 0 \\ 0 & A_4 A_3 & 0 & 0 \end{pmatrix}, \qquad N^3 = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ A_4 A_3 A_2 & 0 & 0 & 0 \end{pmatrix},$$
$$J^{-1} = \begin{pmatrix} I_D & 0 & 0 & 0 \\ A_2 & I_D & 0 & 0 \\ A_3 A_2 & A_3 & I_D & 0 \\ A_4 A_3 A_2 & A_4 A_3 & A_4 & I_D \end{pmatrix}.$$

B.3.1 Connection to semiseparable matrices and Mamba2

Having depicted the structure of $J^{-1}$, we note the connection between $J^{-1}$ in this paper and the attention or sequence mixer matrix $M$ in Dao and Gu [44], which introduced the Mamba2 architecture (see equation 6 or Figure 2 of Dao and Gu [44] for the form of $M$, and compare with $J^{-1}$ above).
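The block structure depicted above is straightforward to verify numerically (a NumPy sanity check with $T = 4$, $D = 2$, and random dynamics Jacobians): the finite Neumann series reproduces the inverse, the diagonal blocks are identities, and the bottom-left block is the full product $A_4 A_3 A_2$.

```python
import numpy as np

rng = np.random.default_rng(3)
T, D = 4, 2
A = {t: rng.normal(size=(D, D)) for t in range(2, T + 1)}  # A_2, A_3, A_4

# Assemble N with A_t on the first lower block diagonal.
N = np.zeros((T * D, T * D))
for t in range(2, T + 1):
    N[(t - 1) * D:t * D, (t - 2) * D:(t - 1) * D] = A[t]

# J^{-1} = sum_{k=0}^{T-1} N^k  (finite Neumann series).
J_inv = np.linalg.inv(np.eye(T * D) - N)
neumann = sum(np.linalg.matrix_power(N, k) for k in range(T))
assert np.allclose(J_inv, neumann)

# Diagonal blocks are identities; the bottom-left block is A_4 A_3 A_2.
assert np.allclose(J_inv[:D, :D], np.eye(D))
assert np.allclose(J_inv[(T - 1) * D:, :D], A[4] @ A[3] @ A[2])
```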
Mamba2 is a deep learning sequence modeling architecture. The sequence mixer in each of its layers has at its core a linear dynamical system. Dao and Gu [44] observe that while a linear dynamical system (LDS) can be evaluated recurrently (sequentially) or in parallel (for example, with a parallel scan), it can also be evaluated by multiplying the inputs to the LDS by the matrix $M$. Since each DEER iteration is also a linear dynamical system, with transition matrices $\{A_t\}_{t=2}^T$, it follows that $M$ in Dao and Gu [44] and $J^{-1}$ in our paper are the same object, and so results about these objects transfer between the two papers. In particular, in the language of Dao and Gu [44], the $J^{-1}$ we consider in this paper is $D$-semiseparable (see their Definition 3.1). Thus, any efficient, hardware-aware algorithms and implementations developed for $D$-semiseparable matrices could also be applied to accelerate each iteration of DEER, though we note that Dao and Gu [44] focus on the 1-semiseparable setting, which they call a state space dual (SSD) layer. In any case, using these connections to accelerate each iteration of DEER and related parallel Newton algorithms from a systems implementation perspective would be an interesting direction for future work.

B.4 Framing based on global bounds

We chose to prove Theorem 5.3 using condition (37) in order to highlight the natural connection between the smallest singular value of $J$ and system stability (as measured by its LLE). However, an assumption with a different framing would be to impose a uniform bound on the spectral norm of the dynamics Jacobian over the entire state space:
$$\sup_{s \in \mathbb{R}^D} \|A(s)\|_2 \;\leqslant\; \rho. \qquad (74)$$
For $\rho < 1$, this assumption corresponds to global contraction of the dynamics [150].
If we replace the LLE regularity condition (37) with the global spectral norm bound (74) in the proof of Theorem 5.3, we obtain that the PL constant is bounded away from zero:
$$\frac{1}{a} \cdot \frac{\rho - 1}{\rho^T - 1} \;\leqslant\; \sqrt{\inf_{s \in \mathbb{R}^{TD}} \sigma^2_{\min}(J(s))}.$$
In particular, if the dynamics are contracting everywhere (i.e., $\rho < 1$), condition (74) guarantees good conditioning of $J$ throughout the entire state space.

B.5 Discussion of the LLE regularity conditions

The LLE regularity conditions in equation (37) highlight the more natural "average case" behavior experienced along actual trajectories $s \in \mathbb{R}^{TD}$. This "average case" behavior is highlighted, for example, by our experiments with the two-well system (cf. Subsection 5.5.2): even though a global upper bound on $\|A_t(s_t)\|_2$ over all of state space would be greater than 1 (i.e., there are unstable regions of state space), we observe fast convergence of DEER because the system as a whole has negative LLE (its trajectories are stable on average).

We also note the pleasing relationship the LLE regularity conditions have with the definition of the LLE given in equation (31). In the LLE regularity conditions (37), the variable $k$ denotes the sequence length under consideration. Taking logs and dividing by $k$, we obtain
$$\frac{\log b}{k} + \lambda \;\leqslant\; \frac{1}{k} \log\left( \| A_{t+k-1} A_{t+k-2} \cdots A_t \| \right) \;\leqslant\; \frac{\log a}{k} + \lambda.$$
Therefore, as $k \to T$, and as $T \to \infty$ (i.e., as we consider longer and longer sequences), the finite-time estimates of the LLE converge to the true LLE $\lambda$.

We observe that as $s^{(i)}$ approaches the true solution $s^\star$, the regularity conditions in equation (37) become increasingly reasonable. Since any successful optimization trajectory must eventually enter a neighborhood of $s^\star$, it is natural to expect these conditions to hold there.
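This convergence of finite-time LLE estimates can be seen in a deterministic toy (our own, echoing the two-well discussion): a scalar system whose steps are locally unstable half the time ($|a_t| = 2$) but contract by $1/8$ the rest of the time has $\lambda = -\log 2 < 0$, and long-window estimates recover it.

```python
import numpy as np

# Alternating expansion and contraction:
# lambda = (log 2 + log(1/8)) / 2 = -log 2 < 0 (predictable on average),
# even though every other step is locally unstable.
log_a = np.tile([np.log(2.0), np.log(1.0 / 8.0)], 1000)
lam = -np.log(2.0)

def finite_time_lle(k):
    # (1/k) log |a_k ... a_1|: the finite-time LLE estimate over a
    # window of length k.
    return log_a[:k].mean()

assert finite_time_lle(1) > 0                     # one step looks unstable
assert abs(finite_time_lle(2000) - lam) < 1e-9    # long windows recover lambda
```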
In fact, rather than requiring the regularity conditions over all of state space or along the entire optimization trajectory, one could alternatively assume that they hold within a neighborhood of $s^\star$ and prove a corresponding version of Theorem 5.3. We now do so, under the additional assumption that $J$ is $L$-Lipschitz.

Theorem B.1. If $J$ is $L$-Lipschitz, then for any ball of radius $R$ around the solution $s^\star$, denoted $B(s^\star, R)$, we have
$$\forall s \in B(s^\star, R): \quad |\sigma_{\min}(J(s)) - \sigma_{\min}(J(s^\star))| \;\leqslant\; LR.$$

Proof. The argument parallels the proof of Theorem 2 in Liu, Zhu, and Belkin [147]. A fact stemming from the reverse triangle inequality is that for any two matrices $A$ and $B$,
$$\sigma_{\min}(A) \;\geqslant\; \sigma_{\min}(B) - \|A - B\|.$$
Applying this with $A = J(s)$ and $B = J(s^\star)$, we obtain
$$\sigma_{\min}(J(s)) \;\geqslant\; \sigma_{\min}(J(s^\star)) - \|J(s) - J(s^\star)\|.$$
If the Jacobian $J(\cdot)$ is $L$-Lipschitz, then
$$\|J(s) - J(s^\star)\| \;\leqslant\; L \|s - s^\star\|.$$
Combining, we get
$$\sigma_{\min}(J(s)) \;\geqslant\; \sigma_{\min}(J(s^\star)) - L\|s - s^\star\| \quad \text{and} \quad \sigma_{\min}(J(s^\star)) \;\geqslant\; \sigma_{\min}(J(s)) - L\|s - s^\star\|,$$
which gives
$$\sigma_{\min}(J(s^\star)) - L\|s - s^\star\| \;\leqslant\; \sigma_{\min}(J(s)) \;\leqslant\; \sigma_{\min}(J(s^\star)) + L\|s - s^\star\|.$$
Ensuring that $\|s - s^\star\| \leqslant R$ completes the proof. ∎

A consequence of Theorem B.1 is that if the system is unpredictable, then there exists a finite ball around $s^\star$ where the conditioning of the merit function landscape is provably bad. As a concrete example, suppose that $\sigma_{\min}(J(s^\star)) = \varepsilon$ and $L = 1$. Then, at best, the PL constant of the loss function inside the ball $B(s^\star, R)$ is $\varepsilon + R$. If $\varepsilon$ is small (bad conditioning), then $R$ can be chosen such that the PL constant inside $B(s^\star, R)$ is also small.

B.6 Controlling the maximum singular value

In our proof of Theorem 5.3, we proved upper and lower bounds for $\sigma_{\min}(J(s))$ that depend on the sequence length $T$. We can also prove upper and lower bounds for $\sigma_{\max}(J(s))$, but these do not depend on the sequence length.
Assuming condition (74), an upper bound on $\sigma_{\max}(J)$ is straightforward to compute via the triangle inequality:
$$\sigma_{\max}(J) = \|J\|_2 = \|I - N\|_2 \;\leqslant\; 1 + \|N\|_2.$$
Recalling the definition of $N$ in (73), it is composed of the $A_t$ along its first lower block diagonal, and so we have
$$\|N(s)\|_2 = \sup_t \|A_t(s_t)\| \quad \text{and} \quad \sup_{s \in \mathbb{R}^{TD}} \|N(s)\|_2 = \sup_{s \in \mathbb{R}^D} \|A(s)\|.$$
Elaborating: for a particular trajectory $s \in \mathbb{R}^{TD}$, $\|N(s)\|_2$ is controlled by the maximum spectral norm of the Jacobians $A_t(s_t)$ along the trajectory. Analogously, $\sup_{s \in \mathbb{R}^{TD}} \|N(s)\|_2$ (the supremum of the spectral norm of $N(s)$ over all possible trajectories $s \in \mathbb{R}^{TD}$, i.e., over the optimization space) is upper bounded by $\sup_{s \in \mathbb{R}^D} \|A(s)\|_2$, the supremum of the spectral norm of the system Jacobians over the state space $\mathbb{R}^D$. Thus, it follows that
$$\sigma_{\max}(J) \;\leqslant\; 1 + \rho. \qquad (75)$$
Importantly, this upper bound on $\sigma_{\max}(J)$ does not scale with the sequence length $T$.

To obtain the lower bound on $\sigma_{\max}(J)$, we notice that $J$ has all ones along its main diagonal, and so, simply using the unit vector $e_1$, we obtain
$$e_1^\top J e_1 = 1 \;\leqslant\; \sigma_{\max}(J). \qquad (76)$$

B.7 Condition number of the Jacobian

Note that the condition number $\kappa$ of a matrix is defined as the ratio of its maximum and minimum singular values:
$$\kappa(J) = \frac{\sigma_{\max}(J)}{\sigma_{\min}(J)}.$$
Because our bounds (75) and (76) on $\sigma_{\max}(J)$ do not scale with the sequence length $T$, the scaling with $T$ of an upper bound on $\kappa(J)$ (the conditioning of the optimization problem) is controlled solely by the bounds on $\sigma_{\min}(J)$ provided in Theorem 5.3. The importance of studying how the conditioning scales with $T$ stems from the fact that we would like to understand whether there are regimes, particularly involving large sequence lengths and parallel computers, where parallel evaluation can be faster than sequential evaluation.
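The claimed scaling of $\kappa(J)$ with $T$ can be checked directly (a scalar toy with constant dynamics, our own illustration): the condition number saturates for contracting dynamics ($\rho = 0.5$) and grows geometrically for expanding dynamics ($\rho = 1.5$).

```python
import numpy as np

def kappa(rho, T):
    # Condition number of J = I - N for the scalar recurrence
    # s_t = rho * s_{t-1} + b_t.
    J = np.eye(T) - np.diag(np.full(T - 1, rho), k=-1)
    s = np.linalg.svd(J, compute_uv=False)
    return s[0] / s[-1]

k_stable = [kappa(0.5, T) for T in (4, 8, 16, 32)]
k_chaotic = [kappa(1.5, T) for T in (4, 8, 16, 32)]

# Contracting: sigma_max <= 1 + rho and sigma_min >= 1 - rho = 0.5,
# so kappa stays bounded (here below 4) at every sequence length.
assert k_stable[-1] < 4
# Expanding: sigma_min <= rho**-(T-1), so kappa grows geometrically in T.
assert k_chaotic[-1] > 1.5 ** 16
```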
C DISCUSSION OF PARALLEL CHORD METHODS

Ortega and Rheinboldt [180] discuss at length iterative methods for solving arbitrary systems of nonlinear equations $F(x) = 0$ using iterations of the form
$$x^{(i+1)} = x^{(i)} - \tilde J(x^{(i)})^{-1} F(x^{(i)}) \qquad (77)$$
for some matrix $\tilde J(x^{(i)})$. In general, $\tilde J$ can be a function of the current iterate $x^{(i)}$ or a fixed, constant matrix. Newton's method corresponds to $\tilde J(x^{(i)}) = J(x^{(i)}) := \frac{\partial F}{\partial x}(x^{(i)})$. When $\tilde J$ is fixed and constant, [180] describe the resulting family of fixed-point iterations as parallel-chord methods. However, we will use this term for all iterative methods with updates of the form in equation (77), which includes both Newton and Picard iterations.

The term "parallel" in this context has nothing to do with applying a parallel scan over the sequence length (the focus of this thesis). Instead, "parallel" in "parallel-chord methods" refers to the way in which Newton's method finds the zero of a function by making a guess for the zero and then forming a chord that is parallel to the function at the current guess (Figure 6). In one dimension the linearization is a line (a chord), while in higher dimensions the linearization is in general a hyperplane. In Newton's method, the chord/hyperplane is tangent to the function at the current guess, while for other parallel-chord methods the approximate linearization is in general not tangent.

The equation $F(x) = 0$ is a fully general way to represent a system of nonlinear equations. However, in this paper, we focus on parallelizing Markovian state space models, as discussed in Chapter 2.

In their treatment of Picard iterations, Ortega and Rheinboldt [180] consider a more general formulation than that presented in Shih et al. [201] or in equation (54). Instead, similar to the definition presented in Appendix C.2.3 of Gu et al.
[87], Ortega and Rheinboldt [180] define Picard iterations in the setting where a linear component of $F$ has been removed; namely, we write
$$F(s) =: \tilde J s - G(s) \qquad (78)$$
for some constant, nonsingular matrix $\tilde J$ and nonlinear function $G(\cdot)$. Note that such a redefinition of $F(\cdot)$ in terms of $\tilde J$ and $G(\cdot)$ is always possible and is not uniquely determined. After making such a redefinition, Ortega and Rheinboldt [180] define a Picard iteration as an update of the form
$$s^{(i+1)} = \tilde J^{-1} G(s^{(i)}). \qquad (79)$$
However, by multiplying both sides of equation (78) by $\tilde J^{-1}$, it follows that
$$\tilde J^{-1} G(s^{(i)}) = s^{(i)} - \tilde J^{-1} F(s^{(i)}),$$
showing that Picard iterations as defined in equation (79) fit into the parallel-chord framework set out in equation (77).

Note that Picard iterations as defined by Shih et al. [201], or in equation (54) of this paper, also fit into the framework of equation (78): in the context of evaluating discretized ODEs, the residual becomes
$$F_{t+1}(s) = x_{t+1} - x_t - \varepsilon g_t(s_t).$$
Thus, in the context of equation (78), the resulting $G_t(s) = \varepsilon g_{t-1}(x_{t-1})$, while the resulting $\tilde J$ operator is given by equation (58). When we plug this $\tilde J$ into equation (77) and simplify, we obtain the linear dynamical system in the "Picard" row of Table 6. In general, the fixed-point methods of the common form given by equation (22) all give rise to $\tilde J \in \mathbb{R}^{TD \times TD}$ matrices of the form shown in equation (56).

Thus, Ortega and Rheinboldt [180] unite Newton and Picard iterations for the general root-finding problem $F(s) = 0$ under the umbrella of parallel-chord methods: iterative updates of the form of equation (77). The framework we provide in Table 6 can be understood as a specialization of parallel-chord methods to the particular problem of sequential evaluation discussed in equation (1).
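The effect of the chord quality on the convergence rate can be illustrated with a toy root-finding problem (our own example, not from [180]): iterating (77) with a fixed $\tilde J$, the error contracts by $\sigma = \rho(I - \tilde J^{-1} J(s^\star))$ per step, so the exact Jacobian ($\sigma = 0$) converges superlinearly while an inexact chord converges linearly.

```python
import numpy as np

# Toy system with root x* = (1, 1): F(x) = (x0^3 - x1, x1 - 1).
F = lambda x: np.array([x[0] ** 3 - x[1], x[1] - 1.0])
J_at = lambda x: np.array([[3.0 * x[0] ** 2, -1.0], [0.0, 1.0]])
x_star = np.array([1.0, 1.0])

def chord_error(Jt, n_iter=6):
    # Parallel-chord iteration (77) with a fixed matrix Jt.
    x = np.array([1.2, 0.7])
    for _ in range(n_iter):
        x = x - np.linalg.solve(Jt, F(x))
    return np.linalg.norm(x - x_star)

Jt_chord = np.array([[4.0, -1.0], [0.0, 1.0]])   # deliberately inexact slope
# sigma = rho(I - Jt^{-1} J(x*)) = 0.25 for this choice of Jt.
sigma = np.abs(np.linalg.eigvals(
    np.eye(2) - np.linalg.solve(Jt_chord, J_at(x_star)))).max()
assert np.isclose(sigma, 0.25)

exact = chord_error(J_at(x_star))   # sigma = 0: Newton-like, superlinear
chord = chord_error(Jt_chord)       # sigma = 0.25: linear convergence

# The inexact chord still converges, just much more slowly.
assert exact < 1e-9 < chord < 1e-3
```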
Nonetheless, we focus on how, in the specific problem of sequential evaluation (a problem of great interest in many areas of machine learning), a wide variety of fixed-point methods become iterative applications of LDSs, allowing them to be parallelized over the sequence length with an associative scan. This perspective on parallelizability is not discussed in Ortega and Rheinboldt [180], because they consider a more general problem.

Ortega and Rheinboldt [180] also discuss, in their Chapters 7 and 10, how the closeness of the "parallel chord" (in higher dimensions, the approximating hyperplane) to the true linearization of the function (Newton's method) affects the number of iterations needed for the parallel-chord method to converge. This analysis is directly analogous to our study of the effect of $\|\tilde{J}(s_{1:T}) - J(s_{1:T})\|_2$ on the rate of convergence of fixed-point methods; see Theorem 6.4. In particular, in their Chapter 10, [180] consider the rates of convergence of fixed-point methods with updates of the form
$$s^{(i+1)} = U(s^{(i)}) \qquad (80)$$
for some function $U(\cdot)$. Ortega and Rheinboldt [180] use the name one-step stationary methods for fixed-point methods with updates of the form of equation (80). For parallel-chord methods of the form given in equation (77), it follows that
$$U(s^{(i)}) = s^{(i)} - \tilde{J}(s^{(i)})^{-1} F(s^{(i)}). \qquad (81)$$
In their Chapters 7 and 10, [180] introduce and study $\sigma(U, F, s^\star)$, which determines the rate of convergence of iterative methods with updates of the form given by equation (80) to the solution $s^\star$ of $F(s) = 0$. They define $\sigma$ as
$$\sigma(U, F, s^\star) := \rho\left( \frac{\partial U}{\partial s}(s^\star) \right), \qquad (82)$$
where $\rho(M)$ denotes the spectral radius of a matrix $M$.
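To make the role of $\sigma$ concrete, here is a minimal numeric sketch (an illustrative toy problem, not taken from the thesis): with $\tilde{J} = \partial F/\partial s(s^\star)$, i.e. Newton's method, $\sigma = 0$, while a fixed chord matrix yields $\sigma > 0$ and hence only a linear rate.

```python
import numpy as np

# Toy problem: F(s) = A s + 0.1 sin(s) (elementwise sine), with root s* = 0,
# so the Jacobian at the root is dF/ds(s*) = A + 0.1 I.
A = np.array([[2.0, 0.3],
              [0.1, 1.5]])
J_star = A + 0.1 * np.eye(2)

def sigma(J_tilde):
    """rho(dU/ds(s*)) = rho(I - J_tilde^{-1} dF/ds(s*)), per equations (81)-(82)."""
    M = np.eye(2) - np.linalg.solve(J_tilde, J_star)
    return np.max(np.abs(np.linalg.eigvals(M)))

print(sigma(J_star))           # Newton: J_tilde = dF/ds(s*), so sigma = 0 (up to roundoff)
print(sigma(2.0 * np.eye(2)))  # a fixed chord: sigma > 0, an R-linear rate
```

Swapping in better or worse choices of `J_tilde` moves $\sigma$ toward 0 or toward 1, matching the iteration-count behavior discussed below.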
In the context of parallel-chord methods where $U(\cdot)$ is given by equation (81), it follows that
$$\frac{\partial U}{\partial s}(s^\star) = I - \tilde{J}(s^\star)^{-1} \frac{\partial F}{\partial s}(s^\star),$$
because $F(s^\star) = 0$. Thus, if $\tilde{J} = \frac{\partial F}{\partial s}(s^\star)$, then $\sigma = 0$. More generally, lower values of $\sigma$ indicate that $\tilde{J}$ is a good approximation of the Jacobian matrix $\partial F/\partial s$ evaluated at the zero $s^\star$ of $F$, while higher values of $\sigma$ indicate that $\tilde{J}$ is a poor approximation of $\partial F/\partial s$.

[180] then use $\sigma$ in their Chapter 10 (in particular, their Theorem 10.1.4) to prove linear rates of convergence¹ for one-step stationary methods within a neighborhood of the solution $s^\star$. Thus, a takeaway from [180] (as paraphrased from Gasilov et al. [68]) is that the closer $\tilde{J}$ is to $\partial F/\partial s$, the fewer iterations are needed for convergence to $s^\star$. This takeaway is extremely similar to our guidance, though we specialize to the particular system of equations given by equation (11) that results from the goal of rolling out the Markov process given by equation (1).

However, in the setting we consider in this paper (using fixed-point iterations of the form of equation (22) to solve nonlinear equations of the form of equation (11)), Theorem 10.1.4 of Ortega and Rheinboldt [180] is actually trivial. By "trivial," we mean that it does not distinguish between the convergence rates of any of the fixed-point iterations we focus on in this paper. To make this point more precise, we review² the notion of root-convergence, more commonly known as R-convergence.

¹ where the rate is given by $\sigma$
² We follow the presentation of Chapter 9 of Ortega and Rheinboldt [180], in particular Definition 9.2.1.

Definition C.1 (R-convergence). Let $A$ be a fixed-point operator with fixed point $s^\star$. Let $C(A, s^\star)$ be the set of all sequences generated by $A$ which converge to $s^\star$.
Then the $R_1$-factors of $A$ at $s^\star$ are given by
$$R_1(A, s^\star) := \sup\left\{ \limsup_{i\to\infty} \|s^{(i)} - s^\star\|^{1/i} \,:\, \{s^{(i)}\}_{i \geqslant 0} \in C(A, s^\star) \right\}. \qquad (83)$$
Intuitively, $R_1(A, s^\star)$ gives the rate of linear convergence of a fixed-point operator $A$ to its fixed point $s^\star$. Theorem 10.1.4 of Ortega and Rheinboldt [180] implies that if $A$ is a one-step stationary method with update given by $U(\cdot)$, then $R_1(A, s^\star) = \sigma(U, F, s^\star)$. Therefore, if $\sigma > 0$, then $\sigma$ is the rate of R-linear convergence of $A$ to $s^\star$, while if $\sigma = 0$, we say that $A$ converges R-superlinearly. It is important to note, however, that these definitions are asymptotic in nature.

The fixed-point iterations considered in this paper, i.e., those following the common form of equation (22), all have $\sigma = 0$, and therefore can be said to converge R-superlinearly.

Proposition C.2. Let $F(s) = 0$ be a nonlinear equation of the form of equation (11) with solution $s^\star$. Let $A$ be a parallel-chord method with fixed point $s^\star$. Then $\sigma(U, F, s^\star) = 0$.

Proof. Both $\frac{\partial F}{\partial s}(s^\star)$ and $\tilde{J}(s^\star)$ are lower-triangular matrices with $D \times D$ identity matrices on their main block-diagonal. In particular, $\tilde{J}^{-1}$ is also a lower-triangular matrix with $D \times D$ identity matrices on its main block-diagonal. Consequently, the product $\tilde{J}^{-1} \frac{\partial F}{\partial s}$ is also a lower-triangular matrix with $D \times D$ identity matrices on its main block-diagonal. As a result, $I - \tilde{J}^{-1} \frac{\partial F}{\partial s}$ is a lower-triangular matrix with all zeros on its main block-diagonal, and so has all of its eigenvalues equal to 0. Consequently, its spectral radius is equal to zero.

It may seem counterintuitive that even Jacobi iterations technically enjoy R-superlinear convergence in the context of parallelizing Markov processes. However, this seemingly strange result stems from the asymptotic nature of Definition C.1 of R-convergence, and the fact that Proposition 1 of Gonzalez et al.
[80] guarantees that all fixed-point iterations of the form given by equation (22) will converge to $s^\star$ in a finite number of iterations ($T$, to be exact). Therefore, for any LDS fixed-point scheme, we always have $\lim_{i\to\infty} \|s^{(i)} - s^\star\| = 0$.

In both Proposition 4 of Lu, Zhu, and Hou [153] and Proposition 6.3 of this paper, we effectively get around this difficulty by considering the spectral norm instead of the spectral radius. The spectral norm always bounds the spectral radius, so by focusing on the spectral radius, Ortega and Rheinboldt [180] could obtain tighter bounds (faster rates of convergence). In our setting, however, the spectral radius cannot distinguish between any of the fixed-point methods, so we instead use the looser bound provided by the spectral norm, which can distinguish between the different fixed-point methods. Note that the core quantities are effectively the same, as the $\gamma$ defined in equation (61) is equal to $\left\| \frac{\partial U}{\partial s}(s^\star) \right\|_2$.

Finally, because all of our fixed-point methods converge in at most $T$ iterations, asymptotic notions of linear convergence are not suitable to fully capture the behavior of these fixed-point methods. For this reason, we use empirical case studies in Section 6.3 to show the efficacy of the intuition, inspired by Proposition 6.3, that the closeness of $\tilde{A}_t$ to $A_t$ impacts the number of iterations needed for $A$ to converge. This empirical approach also highlights how the increased computational cost of higher-order fixed-point methods affects wall-clock time on GPUs.

BIBLIOGRAPHY

[1] Nima Anari, Sinho Chewi, and Thuy-Duong Vuong. “Fast parallel sampling under isoperimetry.” In: Conference on Learning Theory (COLT). 2024.
[2] Donald G Anderson. “Iterative procedures for nonlinear integral equations.” In: Journal of the ACM (JACM) 12.4 (1965), p. 547–560.
[3] Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. “Zoology: Measuring and improving recall in efficient language models.” In: International Conference on Learning Representations (ICLR). 2024.
[4] Michael Artin. Abstract Algebra. 2nd. Pearson, 2011. isbn: 9780132413770.
[5] Uri M Ascher, Robert M Mattheij, and Robert D Russell. Numerical Solution of Boundary Value Problems for Ordinary Differential Equations. SIAM, 1995.
[6] Anthony Bagnall, Hoang Anh Dau, Jason Lines, Michael Flynn, James Large, Aaron Bostrom, Paul Southam, and Eamonn Keogh. “The UEA multivariate time series classification archive, 2018.” In: arXiv preprint arXiv:1811.00075 (2018).
[7] Shaojie Bai. “Equilibrium Approaches to Modern Deep Learning.” PhD thesis. Carnegie Mellon University, 2022.
[8] Shaojie Bai, J Zico Kolter, and Vladlen Koltun. “Deep equilibrium models.” In: Neural Information Processing Systems (NeurIPS). 2019.
[9] Atilim Gunes Baydin, Barak A Pearlmutter, Alexey Andreyevich Radul, and Jeffrey Mark Siskind. “Automatic differentiation in machine learning: a survey.” In: Journal of Machine Learning Research 18.153 (2018), p. 1–43.
[10] Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. “xLSTM: Extended Long Short-Term Memory.” In: Advances in Neural Information Processing Systems (NeurIPS). 2024.
[11] John M Beggs. The cortex and the critical point: understanding the power of emergence. MIT Press, 2022.
[12] Hadi Beik-Mohammadi, Søren Hauberg, Georgios Arvanitidis, Nadia Figueroa, Gerhard Neumann, and Leonel Rozo. “Neural contractive dynamical systems.” In: arXiv preprint arXiv:2401.09352 (2024).
[13] Costas Bekas, Effrosyni Kokiopoulou, and Yousef Saad. “An estimator for the diagonal of a matrix.” In: Applied Numerical Mathematics 57.11-12 (2007), p.
1214–1229.
[14] B. M. Bell. “The iterated Kalman smoother as a Gauss–Newton method.” In: SIAM Journal on Optimization 4.3 (1994), p. 626–636.
[15] C. Gordon Bell and Allen Newell. Computer Structures: Readings and Examples. McGraw-Hill Computer Science Series. New York: McGraw-Hill, 1971.
[16] S. M. Bell and F. W. Cathey. “The iterated Kalman filter update as a Gauss–Newton method.” In: IEEE Transactions on Automatic Control 38.2 (1993), p. 294–297.
[17] Guillaume Bellec, Franz Scherr, Anand Subramoney, Elias Hajek, Darjan Salaj, Robert Legenstein, and Wolfgang Maass. “A solution to the learning dilemma for recurrent networks of spiking neurons.” In: Nature Communications 11.1 (2020), p. 3625.
[18] Alfredo Bellen and Marino Zennaro. “Parallel algorithms for initial-value problems for difference and differential equations.” In: Journal of Computational and Applied Mathematics 25.3 (1989), p. 341–350.
[19] Yoshua Bengio, Patrice Simard, and Paolo Frasconi. “Learning long-term dependencies with gradient descent is difficult.” In: IEEE Transactions on Neural Networks 5.2 (1994), p. 157–166.
[20] Julian Besag. “Comment on “Representations of knowledge in complex systems” by Grenander and Miller.” In: Journal of the Royal Statistical Society: Series B (Methodological) 56.4 (1994), p. 549–581.
[21] Michael Betancourt. “A conceptual introduction to Hamiltonian Monte Carlo.” In: arXiv preprint arXiv:1701.02434 (2017).
[22] Richard Bird. Introduction to Functional Programming using Haskell. 2nd. Prentice Hall Series in Computer Science. Prentice Hall, 1998. isbn: 978-0134843469.
[23] Christian H. Bischof and Charles F. Van Loan. “The WY representation for products of Householder matrices.” In: SIAM Conference on Parallel Processing for Scientific Computing. 1985. url: https://api.semanticscholar.org/CorpusID:36094006.
[24] Guy E. Blelloch. Prefix Sums and Their Applications. Tech. rep. CMU-CS-90-190.
Carnegie Mellon University, School of Computer Science, 1990.
[25] Nawaf Bou-Rabee, Andreas Eberle, and Raphael Zimmer. “Coupling and Convergence for Hamiltonian Monte Carlo.” In: The Annals of Applied Probability 30.3 (June 2020), p. 1209–1250.
[26] Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge, UK: Cambridge University Press, 2004. isbn: 9780521833783.
[27] James Bradbury et al. JAX: composable transformations of Python+NumPy programs. 2018.
[28] André EX Brown, Eviatar I Yemini, Laura J Grundy, Tadas Jucikas, and William R Schafer. “A dictionary of behavioral motifs reveals clusters of genes affecting Caenorhabditis elegans locomotion.” In: Proceedings of the National Academy of Sciences 110.2 (2013), p. 791–796.
[29] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. “Language models are few-shot learners.” In: Advances in Neural Information Processing Systems 33 (2020), p. 1877–1901.
[30] C.G. Broyden. “The convergence of a class of double-rank minimization algorithms.” In: IMA Journal of Applied Mathematics 6.1 (1970), p. 76–90.
[31] F. Bullo. Contraction Theory for Dynamical Systems. 1.2. Kindle Direct Publishing, 2024. isbn: 979-8836646806.
[32] Paul Caillon, Erwan Fagnou, and Alexandre Allauzen. “Fast Training of Recurrent Neural Networks with Stationary State Feedbacks.” In: arXiv preprint arXiv:2503.23104 (2025).
[33] Francois Chaubard and Mykel Kochenderfer. “Scaling Recurrent Neural Networks to a Billion Parameters with Zero-Order Optimization.” In: arXiv preprint arXiv:2505.17852 (2025).
[34] Haoxuan Chen, Yinuo Ren, Lexing Ying, and Grant M. Rotskoff. “Accelerating Diffusion Models with Parallel Sampling: Inference at Sub-Linear Time Complexity.” In: Neural Information Processing Systems (NeurIPS). 2024.
[35] Y. Chen and D. S. Oliver.
“Levenberg–Marquardt forms of the iterative ensemble smoother for efficient history matching and uncertainty quantification.” In: Computational Geosciences 17.4 (2013), p. 689–703.
[36] Sinho Chewi and Austin J. Stromme. “The ballistic limit of the log-Sobolev constant equals the Polyak-Łojasiewicz constant.” In: Annales de l’Institut Henri Poincaré (B) Probabilités et Statistiques (2025). url: https://arxiv.org/abs/2411.11415.
[37] Siddhartha Chib and Edward Greenberg. “Understanding the Metropolis-Hastings algorithm.” In: The American Statistician 49.4 (1995), p. 327–335.
[38] Kyunghyun Cho, Bart van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation.” In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). 2014.
[39] Federico Danieli and Scott MacLachlan. “Multigrid reduction in time for non-linear hyperbolic equations.” In: arXiv preprint arXiv:2104.09404 (2021).
[40] Federico Danieli, Pau Rodriguez, Miguel Sarabia, Xavier Suau, and Luca Zappella. “ParaRNN: Unlocking Parallel Training of Nonlinear RNNs for Large Language Models.” In: International Conference on Learning Representations (ICLR). 2026.
[41] Federico Danieli, Miguel Sarabia, Xavier Suau, Pau Rodríguez, and Luca Zappella. “DeepPCR: Parallelizing Sequential Operations in Neural Networks.” In: Advances in Neural Information Processing Systems (NeurIPS). 2023.
[42] Tri Dao, Beidi Chen, Nimit S Sohoni, Arjun Desai, Michael Poli, Jessica Grogan, Alexander Liu, Aniruddh Rao, Atri Rudra, and Christopher Ré. “Monarch: Expressive structured matrices for efficient and accurate training.” In: International Conference on Machine Learning (ICML). 2022.
[43] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré.
“FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness.” In: Advances in Neural Information Processing Systems (NeurIPS). 2022.
[44] Tri Dao and Albert Gu. “Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality.” In: International Conference on Machine Learning (ICML). 2024.
[45] Alexander Davydov and Francesco Bullo. “Perspectives on contractivity in control, optimization, and learning.” In: IEEE Control Systems Letters (2024).
[46] Hans De Sterck, Stephanie Friedhoff, Oliver A Krzysik, and Scott P MacLachlan. “Multigrid Reduction-In-Time Convergence for Advection Problems: A Fourier Analysis Perspective.” In: Numerical Linear Algebra with Applications 32.1 (2025), e2593.
[47] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. “Universal transformers.” In: International Conference on Learning Representations (ICLR). 2019.
[48] John E Dennis Jr and Robert B Schnabel. Numerical methods for unconstrained optimization and nonlinear equations. SIAM, 1996.
[49] Ashish Deshpande, Sachit Malhotra, MH Schultz, and C Douglas. “A rigorous analysis of time domain parallelism.” In: Parallel Algorithms and Applications 6.1 (1995), p. 53–62.
[50] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. “LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale.” In: Advances in Neural Information Processing Systems 35 (2022), p. 30318–30332.
[51] Persi Diaconis. “The Markov chain Monte Carlo revolution.” In: Bulletin of the American Mathematical Society 46.2 (Apr. 2009), p. 179–205.
[52] Persi Diaconis and David Freedman. “Iterated Random Functions.” In: SIAM Review 41.1 (1999), p. 45–76.
[53] John R Dormand and Peter J Prince. “A family of embedded Runge-Kutta formulae.” In: Journal of Computational and Applied Mathematics 6.1 (1980), p. 19–26.
[54] Jeffrey L Elman.
“Finding structure in time.” In: Cognitive Science 14.2 (1990), p. 179–211.
[55] Rainer Engelken. “Gradient flossing: Improving gradient descent through dynamic control of jacobians.” In: Advances in Neural Information Processing Systems (NeurIPS) (2023).
[56] Rainer Engelken, Fred Wolf, and Larry F Abbott. “Lyapunov spectra of chaotic recurrent neural networks.” In: Physical Review Research 5.4 (2023), p. 043044.
[57] N Benjamin Erichson, Omri Azencot, Alejandro Queiruga, Liam Hodgkinson, and Michael W Mahoney. “Lipschitz recurrent neural networks.” In: arXiv preprint arXiv:2006.12070 (2020).
[58] Erwan Fagnou, Paul Caillon, Blaise Delattre, and Alexandre Allauzen. “Accelerated training through iterative gradient propagation along the residual path.” In: International Conference on Learning Representations (ICLR). 2025.
[59] Fletcher Fan, Bowen Yi, David Rye, Guodong Shi, and Ian R Manchester. “Learning stable Koopman embeddings.” In: 2022 American Control Conference (ACC). IEEE. 2022, p. 2742–2747.
[60] Haw-ren Fang and Yousef Saad. “Two classes of multisecant methods for nonlinear acceleration.” In: Numerical Linear Algebra with Applications 16.3 (2009), p. 197–221.
[61] Mónika Farsang and Radu Grosu. “Scaling Up Liquid-Resistance Liquid-Capacitance Networks for Efficient Sequence Modeling.” In: Advances in Neural Information Processing Systems (NeurIPS). 2025.
[62] Maryam Fazel, Rong Ge, Sham Kakade, and Mehran Mesbahi. “Global convergence of policy gradient methods for the linear quadratic regulator.” In: International Conference on Machine Learning (ICML). 2018.
[63] Leo Feng, Frederick Tung, Mohamed Osama Ahmed, Yoshua Bengio, and Hossein Hajimirsadeghi. “Were RNNs All We Needed?” In: arXiv (2024).
[64] C.M. da Fonseca. “On the eigenvalues of some tridiagonal matrices.” In: Journal of Computational and Applied Mathematics 200.1 (2007), p. 283–286.
[65] Roy Friedman. A Simplified Overview of Langevin Dynamics. Blog post. 2022.
[66] Martin J Gander.
“50 years of time parallel time integration.” In: Multiple Shooting and Time Domain Decomposition Methods: MuS-TDD, Heidelberg, May 6-8, 2013. Springer, 2015, p. 69–113.
[67] Martin J. Gander and Stefan Vandewalle. “Analysis of the parareal time-parallel time-integration method.” In: SIAM Journal on Scientific Computing 29.2 (2007), p. 556–578.
[68] V. A. Gasilov, V. F. Tishkin, A. P. Favorskii, and M. Yu. Shashkov. “The use of the parallel-chord method to solve hydrodynamic difference equations.” In: U.S.S.R. Computational Mathematics and Mathematical Physics 21.3 (1981), p. 178–192. issn: 0041-5553. doi: 10.1016/0041-5553(81)90075-6.
[69] Charles William Gear. “Parallel methods for ordinary differential equations.” In: Calcolo 25.1 (1988), p. 1–20.
[70] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. “Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach.” In: Neural Information Processing Systems (NeurIPS). 2025.
[71] Charles J Geyer. “Introduction to Markov chain Monte Carlo.” In: Handbook of Markov chain Monte Carlo 20116022.45 (2011), p. 22.
[72] Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W Mahoney, and Kurt Keutzer. “A Survey of Quantization Methods for Efficient Neural Network Inference.” In: arXiv preprint arXiv:2103.13630 (2021).
[73] William Gilpin. “Chaos as an interpretable benchmark for forecasting and data-driven modelling.” In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks 1 (NeurIPS Datasets and Benchmarks 2021), December 2021, virtual. Ed. by Joaquin Vanschoren and Sai-Kit Yeung. 2021. url: https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/ec5decca5ed3d6b8079e2e7e7bacc9f2-Abstract-round2.html.
[74] James Gleick. Chaos: Making a new science. Penguin, 2008.
[75] Karan Goel.
“Beyond text: applying deep learning to signal data.” PhD thesis. Stanford, CA, USA: Stanford University, 2024. url: https://purl.stanford.edu/qb603fk1926.
[76] Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. “It’s Raw! Audio Generation with State-Space Models.” In: International Conference on Machine Learning (ICML). 2022.
[77] Gene H Golub and Charles F Van Loan. Matrix Computations. JHU Press, 2013.
[78] Xavier Gonzalez, E. Kelly Buchanan, Hyun Dong Lee, Jerry Weihong Liu, Ke Alexander Wang, David M. Zoltowski, Leo Kozachkov, Christopher Ré, and Scott W. Linderman. “A Unifying Framework for Parallelizing Sequential Models with Linear Dynamical Systems.” In: Transactions on Machine Learning Research (TMLR) (2026). url: https://openreview.net/forum?id=fw6GgAIGur.
[79] Xavier Gonzalez, Leo Kozachkov, David M. Zoltowski, Kenneth L. Clarkson, and Scott W. Linderman. “Predictability Enables Parallelization of Nonlinear State Space Models.” In: Neural Information Processing Systems (NeurIPS). 2025.
[80] Xavier Gonzalez, Andrew Warrington, Jimmy T. H. Smith, and Scott W. Linderman. “Towards Scalable and Stable Parallelization of Nonlinear RNNs.” In: Advances in Neural Information Processing Systems (NeurIPS). 2024.
[81] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. Vol. 1. MIT Press, 2016.
[82] Riccardo Grazzi, Julien Siems, Jörg KH Franke, Arber Zela, Frank Hutter, and Massimiliano Pontil. “Unlocking State-Tracking in Linear RNNs Through Negative Eigenvalues.” In: International Conference on Learning Representations (ICLR). 2025.
[83] Sebastiano Grazzi and Giacomo Zanella. Parallel computations for Metropolis Markov chains with Picard maps. 2025. arXiv: 2506.09762 [stat.CO].
[84] Albert Gu. “Modeling Sequences with Structured State Spaces.” PhD thesis. Stanford University, 2023. url: https://purl.stanford.edu/mb976vf9362.
[85] Albert Gu and Tri Dao.
“Mamba: Linear-Time Sequence Modeling with Selective State Spaces.” In: Conference on Language Modeling (COLM). 2024.
[86] Albert Gu, Karan Goel, and Christopher Ré. “Efficiently Modeling Long Sequences with Structured State Spaces.” In: The International Conference on Learning Representations (ICLR). 2022.
[87] Albert Gu, Isys Johnson, Karan Goel, Khaled Saab, Tri Dao, Atri Rudra, and Christopher Ré. “Combining Recurrent, Convolutional, and Continuous-time Models with Linear State-Space Layers.” In: Advances in Neural Information Processing Systems (NeurIPS). 2021.
[88] Han Guo, Songlin Yang, Tarushii Goel, Eric P Xing, Tri Dao, and Yoon Kim. “Log-linear attention.” In: arXiv preprint arXiv:2506.04761 (2025).
[89] Vineet Gupta, Tomer Koren, and Yoram Singer. “Shampoo: Preconditioned Stochastic Tensor Optimization.” In: International Conference on Machine Learning (ICML). 2018.
[90] Jiaqi Han, Haotian Ye, Puheng Li, Minkai Xu, James Zou, and Stefano Ermon. “CHORDS: Diffusion Sampling Accelerator with Multi-core Hierarchical ODE Solvers.” In: Proceedings of the IEEE/CVF International Conference on Computer Vision. 2025, p. 19386–19395.
[91] Mark Harris, Shubhabrata Sengupta, and John D. Owens. “Parallel Prefix Sum (Scan) with CUDA.” In: GPU Gems 3. Ed. by Hubert Nguyen. Upper Saddle River, NJ: Addison-Wesley Professional, Aug. 2007. Chap. 39, p. 851–876.
[92] Syeda Sakira Hassan, Simo Särkkä, and Ángel F García-Fernández. “Temporal parallelization of inference in hidden Markov models.” In: IEEE Transactions on Signal Processing 69 (2021), p. 4875–4887.
[93] Trevor Hastie. “Ridge regularization: An essential concept in data science.” In: Technometrics 62.4 (2020), p. 426–433.
[94] Trevor Hastie, Robert Tibshirani, Jerome Friedman, et al. The Elements of Statistical Learning. Springer Series in Statistics. New York: Springer, 2009.
[95] W Keith Hastings.
“Monte Carlo sampling methods using Markov chains and their applications.” In: Biometrika 57.1 (1970), p. 97–109.
[96] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep residual learning for image recognition.” In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016, p. 770–778.
[97] Franz A. Heinsen. Efficient Parallelization of a Ubiquitous Sequential Computation. 2023. arXiv: 2311.06281 [cs.DS].
[98] Desmond J. Higham. “An Algorithmic Introduction to Numerical Simulation of Stochastic Differential Equations.” In: SIAM Review 43.3 (2001), p. 525–546.
[99] W Daniel Hillis and Guy L Steele Jr. “Data parallel algorithms.” In: Communications of the ACM 29.12 (1986), p. 1170–1183.
[100] Jonathan Ho, Ajay Jain, and Pieter Abbeel. “Denoising diffusion probabilistic models.” In: Advances in Neural Information Processing Systems (NeurIPS). 2020.
[101] Sepp Hochreiter. “Untersuchungen zu dynamischen neuronalen Netzen.” German. Diploma thesis. Munich, Germany: Technische Universität München, 1991.
[102] Sepp Hochreiter and Jürgen Schmidhuber. “Long short-term memory.” In: Neural Computation 9.8 (1997), p. 1735–1780.
[103] Arthur E Hoerl and Robert W Kennard. “Ridge regression: Biased estimation for nonorthogonal problems.” In: Technometrics 12.1 (1970), p. 55–67.
[104] Peter Holderrieth and Ezra Erives. Introduction to Flow Matching and Diffusion Models. MIT course. 2025.
[105] Sarah Hooker. “The Hardware Lottery.” In: Communications of the ACM 64.12 (2021), p. 58–65.
[106] Graham Horton, Stefan Vandewalle, and P Worley. “An algorithm with polylog parallel complexity for solving parabolic partial differential equations.” In: SIAM Journal on Scientific Computing 16.3 (1995), p. 531–541.
[107] Amber Hu, Henry Smith, and Scott Linderman. “SING: SDE Inference via Natural Gradients.” In: Advances in Neural Information Processing Systems (NeurIPS). 2025.
[108] John H Hubbard and Barbara Burke Hubbard. Vector calculus, linear algebra, and differential forms: a unified approach. Matrix Editions, 2015.
[109] Michael F Hutchinson. “A stochastic estimator of the trace of the influence matrix for Laplacian smoothing splines.” In: Communications in Statistics - Simulation and Computation 18.3 (1989), p. 1059–1076.
[110] L Hyafil and HT Kung. Bounds on the speed-up of parallel evaluation of recurrences. Carnegie Mellon University, Department of Computer Science, 1975.
[111] Casian Iacob, Hassan Razavi, and Simo Särkkä. “A parallel-in-time Newton’s method-based ODE solver.” In: arXiv preprint arXiv:2511.01465 (2025).
[112] Francesco Innocenti. “Towards scaling deep neural networks with predictive coding: theory and practice.” PhD thesis. University of Sussex, Oct. 2025.
[113] Sean Jaffe, Alexander Davydov, Deniz Lapsekili, Ambuj K Singh, and Francesco Bullo. “Learning neural contracting dynamics: Extended linearization and global guarantees.” In: Advances in Neural Information Processing Systems 37 (2024), p. 66204–66225.
[114] Shuai Jiang, Marc Salvado, Eric C Cyr, Alena Kopaničáková, Rolf Krause, and Jacob B Schroder. “Layer-Parallel Training for Transformers.” In: arXiv preprint arXiv:2601.09026 (2026).
[115] Matthew J Johnson, David K Duvenaud, Alex Wiltschko, Ryan P Adams, and Sandeep R Datta. “Composing graphical models with neural networks for structured representations and fast inference.” In: Advances in Neural Information Processing Systems. 2016.
[116] Alexia Jolicoeur-Martineau. “Less is more: Recursive reasoning with tiny networks.” In: arXiv preprint arXiv:2510.04871 (2025).
[117] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. Dec. 2024. url: https://kellerjordan.github.io/posts/muon/.
[118] Michael I Jordan.
Serial order: A parallel distributed processing approach. Tech. rep. ICS Report 8604. Institute for Cognitive Science, University of California, San Diego, 1986.
[119] R. E. Kalman. “A new approach to linear filtering and prediction problems.” In: Journal of Basic Engineering 82.1 (1960), p. 35–45.
[120] L. V. Kantorovich. “Functional analysis and applied mathematics.” In: Uspekhi Matematicheskikh Nauk 3.6 (1948), p. 89–185. In Russian. English translation in: NBS Report 1509, Washington D.C., 1952.
[121] Hamed Karimi, Julie Nutini, and Mark Schmidt. “Linear convergence of gradient and proximal-gradient methods under the Polyak-Łojasiewicz condition.” In: Machine Learning and Knowledge Discovery in Databases: European Conference, ECML PKDD 2016, Riva del Garda, Italy, September 19-23, 2016, Proceedings, Part I 16. Springer. 2016, p. 795–811.
[122] Angelos Katharopoulos, Apoorv Vyas, Nikolaos Pappas, and Francois Fleuret. “Transformers are RNNs: Fast autoregressive transformers with linear attention.” In: International Conference on Machine Learning (ICML). 2020.
[123] Herbert B Keller. Numerical Methods for Two-Point Boundary-Value Problems. Dover, 1968.
[124] Graeme Kennedy and Joaquim RRA Martins. “Parallel solution methods for aerostructural analysis and design optimization.” In: 13th AIAA/ISSMO Multidisciplinary Analysis Optimization Conference. 2010, p. 9308.
[125] Patrick Kidger. “On Neural Differential Equations.” PhD thesis. University of Oxford, 2021. url: https://arxiv.org/abs/2202.02435.
[126] Najoung Kim and Sebastian Schuster. “Entity tracking in language models.” In: arXiv preprint arXiv:2305.02363 (2023).
[127] Diederik P. Kingma and Jimmy Lei Ba. “Adam: A Method for Stochastic Optimization.” In: International Conference on Learning Representations (ICLR). 2015.
[128] Mykel J Kochenderfer and Tim A Wheeler. Algorithms for Optimization. MIT Press, 2026.
[129] J Zico Kolter and Gaurav Manek.
“Learning stable deep dynamics models.” In: Advances in Neural Information Processing Systems 32 (2019).
[130] Bernard O Koopman. “Hamiltonian systems and transformation in Hilbert space.” In: Proceedings of the National Academy of Sciences 17.5 (1931), p. 315–318.
[131] Leo Kozachkov, Michaela Ennis, and Jean-Jacques Slotine. “RNNs of RNNs: Recursive construction of stable assemblies of recurrent neural networks.” In: Advances in Neural Information Processing Systems (NeurIPS). 2022.
[132] Anders Krogh and John Hertz. “A simple weight decay can improve generalization.” In: Advances in Neural Information Processing Systems (NeurIPS) (1991).
[133] Dmitry Krotov. “A new frontier for Hopfield networks.” In: Nature Reviews Physics 5.7 (2023), p. 366–367.
[134] HT Kung. “New algorithms and lower bounds for the parallel evaluation of certain rational expressions and recurrences.” In: Journal of the ACM (JACM) 23.2 (1976), p. 252–261.
[135] Volodymyr Kyrylov. Accelerated Scan. GitHub repository. 2024. url: https://github.com/proger/accelerated-scan.
[136] Richard E Ladner and Michael J Fischer. “Parallel prefix computation.” In: Journal of the ACM (JACM) 27.4 (1980), p. 831–838.
[137] Chieh-Hsin Lai, Yang Song, Dongjun Kim, Yuki Mitsufuji, and Stefano Ermon. “The principles of diffusion models.” In: arXiv preprint arXiv:2510.21890 (2025).
[138] Sivaramakrishnan Lakshmivarahan and Sudarshan K Dhall. Parallel computing using the prefix problem. Oxford University Press, 1994.
[139] Paul Langevin. “On the Theory of Brownian Motion.” In: American Journal of Physics 65.11 (1997). English translation, introduced by D. S. Lemons and translated by A. Gythiel. Original: C. R. Acad. Sci. 146, 530–533 (1908), p. 1079–1081.
[140] Kenneth Levenberg. “A method for the solution of certain non-linear problems in least squares.” In: Quarterly of Applied Mathematics 2 (1944), p. 164–168.
[141] Michael James Lighthill.
“The recently recognized failure of predictability in Newtonian dynamics.” In: Proceedings of the Royal Society of London. A. Mathematical and Physical Sciences 407.1832 (1986), p. 35–50.
[142] Yi Heng Lim, Qi Zhu, Joshua Selfridge, and Muhammad Firmansyah Kasim. “Parallelizing non-linear sequential models over the sequence length.” In: International Conference on Learning Representations (ICLR). 2024.
[143] Dachao Lin, Haishan Ye, and Zhihua Zhang. “Explicit superlinear convergence rates of Broyden’s methods in nonlinear equations.” In: arXiv preprint arXiv:2109.01974 (2021).
[144] Scott W Linderman, Peter Chang, Giles Harper-Donnelly, Aleyna Kara, Xinglong Li, Gerardo Duran-Martin, and Kevin Murphy. “Dynamax: A Python package for probabilistic state space modeling with JAX.” In: Journal of Open Source Software 10.108 (2025), p. 7069.
[145] Jacques-Louis Lions, Yvon Maday, and Gabriel Turinici. “A “parareal” in time discretization of PDE’s.” In: Comptes Rendus de l’Académie des Sciences - Series I - Mathematics 332.7 (2001), p. 661–668.
[146] Bingbin Liu, Jordan T. Ash, Surbhi Goel, Akshay Krishnamurthy, and Cyril Zhang. “Transformers Learn Shortcuts to Automata.” In: Proceedings of the International Conference on Learning Representations (ICLR). 2023.
[147] Chaoyue Liu, Libin Zhu, and Mikhail Belkin. “Loss landscapes and optimization in over-parameterized non-linear systems and neural networks.” In: Applied and Computational Harmonic Analysis 59 (2022), p. 85–116.
[148] Dong C Liu and Jorge Nocedal. “On the limited memory BFGS method for large scale optimization.” In: Mathematical Programming 45.1-3 (1989), p. 503–528.
[149] Yuxi Liu, Konpat Preechakul, Kananart Kuwaranancharoen, and Yutong Bai. “The Serial Scaling Hypothesis.” In: arXiv preprint arXiv:2507.12549 (2025).
[150] Winfried Lohmiller and Jean-Jacques E Slotine. “On contraction analysis for non-linear systems.” In: Automatica 34.6 (1998), p. 683–696.
[151] Edward Lorenz. “Deterministic Nonperiodic Flow.” In: Journal of Atmospheric Sciences 20.2 (1963), p. 130–141.
[152] Edward N Lorenz. “Predictability: A problem partly solved.” In: Proceedings of the Seminar on Predictability. Vol. 1. ECMWF Reading, UK. 1996, p. 1–18.
[153] Jianrong Lu, Zhiyu Zhu, and Junhui Hou. “ParaSolver: A Hierarchical Parallel Integral Solver for Diffusion Models.” In: International Conference on Learning Representations (ICLR). 2025.
[154] David G Luenberger. Introduction to dynamic systems: theory, models, and applications. John Wiley & Sons, 1979.
[155] Dougal Maclaurin. “Modeling, Inference and Optimization with Composable Differentiable Procedures.” PhD thesis. Cambridge, MA, USA: Harvard University, 2016.
[156] J. Mandel, E. Bergou, S. Gürol, S. Gratton, and I. Kasanický. “Hybrid Levenberg–Marquardt and weak-constraint ensemble Kalman smoother method.” In: Nonlinear Processes in Geophysics 23.2 (2016), p. 59–73.
[157] Oren Mangoubi and Aaron Smith. “Mixing of Hamiltonian Monte Carlo on strongly log-concave distributions: Continuous dynamics.” In: The Annals of Applied Probability 31.5 (Oct. 2021), p. 2019–2045.
[158] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo Peng, and Jeffrey S Vetter. “NVIDIA Tensor Core Programmability, Performance & Precision.” In: 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE. 2018, p. 522–531.
[159] Donald W. Marquardt. “An algorithm for least-squares estimation of nonlinear parameters.” In: Journal of the Society for Industrial and Applied Mathematics 11.2 (1963), p. 431–441.
[160] Eric Martin and Chris Cundy. “Parallelizing Linear Recurrent Neural Nets Over Sequence Length.” In: International Conference on Learning Representations (ICLR). 2018.
[161] Joaquim RRA Martins and Andrew B Lambe. “Multidisciplinary design optimization: a survey of architectures.” In: AIAA Journal 51.9 (2013), p.
2049–2075.
[162] William Merrill, Hongjian Jiang, Yanhong Li, and Ashish Sabharwal. “Why Are Linear RNNs More Parallelizable?” In: arXiv preprint arXiv:2603.03612 (2026).
[163] William Merrill, Jackson Petty, and Ashish Sabharwal. “The Illusion of State in State-Space Models.” In: International Conference on Machine Learning (ICML). 2024.
[164] William Merrill and Ashish Sabharwal. “The Parallelism Tradeoff: Limitations of Log-Precision Transformers.” In: Transactions of the Association for Computational Linguistics 11 (2023), p. 531–545.
[165] Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. “Equation of state calculations by fast computing machines.” In: The Journal of Chemical Physics 21.6 (1953), p. 1087–1092.
[166] Igor Mezić. “Spectral properties of dynamical systems, model reduction and decompositions.” In: Nonlinear Dynamics 41.1 (2005), p. 309–325.
[167] Paulius Micikevicius et al. “Mixed Precision Training.” In: International Conference on Learning Representations. 2018.
[168] John Miller and Moritz Hardt. “Stable Recurrent Models.” In: International Conference on Learning Representations (ICLR). 2019.
[169] Sajad Movahedi, Felix Sarnthein, Nicola Muca Cirone, and Antonio Orvieto. “Fixed-point RNNs: From diagonal to dense in a few iterations.” In: Neural Information Processing Systems (NeurIPS). 2025.
[170] Kevin Murphy. Probabilistic Machine Learning. MIT Press, 2022.
[171] Kevin P Murphy. Probabilistic machine learning: Advanced topics. MIT Press, 2023.
[172] Kevin P. Murphy, Scott W. Linderman, et al. State Space Models: A Modern Approach. https://probml.github.io/ssm-book/. 2023.
[173] Maxim Naumov. “Parallel complexity of forward and backward propagation.” In: arXiv preprint arXiv:1712.06577 (2017).
[174] Radford M. Neal. “MCMC using Hamiltonian dynamics.” In: Handbook of Markov Chain Monte Carlo. Chapman & Hall/CRC, 2011.
[175] Yurii Nesterov.
Lectures on Convex Optimization. 2nd. Vol. 137. Springer Optimization and Its Applications. Springer, 2018. doi: 10.1007/978-3-319-91578-1.
[176] Yurii Nesterov and B. T. Polyak. “Cubic regularization of Newton method and its global performance.” In: Mathematical Programming, Series A 108.1 (2006), p. 177–205.
[177] J. Nievergelt. “Parallel methods for integrating ordinary differential equations.” In: Communications of the ACM 7.12 (1964), p. 731–733.
[178] Jorge Nocedal. “Updating quasi-Newton matrices with limited storage.” In: Mathematics of Computation 35.151 (1980), p. 773–782.
[179] Jorge Nocedal and Stephen J. Wright. Numerical Optimization. 2nd ed. Springer, 2006.
[180] James M Ortega and Werner C Rheinboldt. Iterative Solution of Nonlinear Equations in Several Variables. Republished by SIAM in 2000. New York and London: Academic Press, 1970.
[181] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. “Resurrecting Recurrent Neural Networks for Long Sequences.” In: International Conference on Machine Learning (ICML). 2023.
[182] Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, and Soham De. “Resurrecting recurrent neural networks for long sequences.” In: International Conference on Machine Learning (ICML). 2023.
[183] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks.” In: International Conference on Machine Learning. PMLR. 2013, p. 1310–1318.
[184] Adam Paszke et al. “PyTorch: An Imperative Style, High-Performance Deep Learning Library.” In: Advances in Neural Information Processing Systems (NeurIPS) (2019).
[185] Louis M Pecora and Thomas L Carroll. “Synchronization in chaotic systems.” In: Physical Review Letters 64.8 (1990), p. 821.
[186] Arkady Pikovsky and Antonio Politi. Lyapunov exponents: a tool to explore complex dynamics. Cambridge University Press, 2016.
[187] Boris T Polyak. “Gradient methods for the minimisation of functionals.” Russian. In: Zh. Vychisl. Mat. Mat. Fiz. 3.4 (1963), p. 643–653.
[188] Michael Psenka, Michael Rabbat, Aditi Krishnapriyan, Yann LeCun, and Amir Bar. “Parallel Stochastic Gradient-Based Planning for World Models.” In: arXiv preprint arXiv:2602.00475 (2026).
[189] H. E. Rauch, F. Tung, and C. T. Striebel. “Maximum likelihood estimates of linear dynamic systems.” In: AIAA Journal 3.8 (1965), p. 1445–1450.
[190] Max Revay, Ruigang Wang, and Ian R Manchester. “Recurrent equilibrium networks: Flexible dynamic models with guaranteed stability and robustness.” In: IEEE Transactions on Automatic Control 69.5 (2023), p. 2855–2870.
[191] Bidipta Sarkar, Mattie Fellows, Juan Agustin Duque, Alistair Letcher, Antonio León Villares, Anya Sims, Dylan Cope, Jarek Liesen, Lukas Seier, Theo Wolf, et al. “Evolution strategies at the hyperscale.” In: arXiv preprint arXiv:2511.16652 (2025).
[192] Simo Särkkä and Ángel F. García-Fernández. “Temporal Parallelization of Bayesian Smoothers.” In: IEEE Transactions on Automatic Control 66.1 (2021), p. 299–306. doi: 10.1109/TAC.2020.2976316.
[193] Simo Särkkä and Lennart Svensson. “Levenberg-Marquardt and Line-Search Extended Kalman Smoothers.” In: ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, p. 5875–5879.
[194] Simo Särkkä and Lennart Svensson. Bayesian filtering and smoothing. Vol. 17. Cambridge University Press, 2023.
[195] Felix Sarnthein. “Linear Recurrences Accessible to Everyone.” In: ICLR Blogposts. 2025.
[196] Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. “Linear Transformers Are Secretly Fast Weight Programmers.” In: International Conference on Machine Learning (ICML). 2021.
[197] Mark Schöne, Babak Rahmani, Heiner Kremer, Fabian Falck, Hitesh Ballani, and Jannes Gladrow.
“Implicit Language Models are RNNs: Balancing Parallelization and Expressivity.” In: International Conference on Machine Learning (ICML). 2025.
[198] Heinz Georg Schuster and Wolfram Just. Deterministic chaos: an introduction. John Wiley & Sons, 2006.
[199] Nikil Roashan Selvam, Amil Merchant, and Stefano Ermon. “Self-Refining Diffusion Samplers: Enabling Parallelization via Parareal Iterations.” In: Advances in Neural Information Processing Systems (NeurIPS). 2024.
[200] Jonathan Richard Shewchuk. “An introduction to the conjugate gradient method without the agonizing pain.” Tech. rep. Carnegie Mellon University, 1994.
[201] Andy Shih, Suneel Belkhale, Stefano Ermon, Dorsa Sadigh, and Nima Anari. “Parallel Sampling of Diffusion Models.” In: Advances in Neural Information Processing Systems (NeurIPS). 2023.
[202] Hava T Siegelmann and Eduardo D Sontag. “On the computational power of neural nets.” In: Journal of Computer and System Sciences 50.1 (1995), p. 132–150.
[203] Julien Siems, Riccardo Grazzi, Kirill Kalinin, Hitesh Ballani, and Babak Rahmani. “Learning State-Tracking from Code Using Linear RNNs.” In: arXiv preprint arXiv:2602.14814 (2026).
[204] Dan Simon. Optimal state estimation: Kalman, H infinity, and nonlinear approaches. John Wiley & Sons, 2006.
[205] Vikas Sindhwani, Tara Sainath, and Sanjiv Kumar. “Structured transforms for small-footprint deep learning.” In: Advances in Neural Information Processing Systems (NeurIPS). 2015.
[206] Vikas Sindhwani, Stephen Tu, and Mohi Khansari. “Learning contracting vector fields for stable imitation learning.” In: arXiv preprint arXiv:1804.04878 (2018).
[207] Jimmy T.H. Smith, Andrew Warrington, and Scott W. Linderman. “Simplified State Space Layers for Sequence Modeling.” In: International Conference on Learning Representations (ICLR). 2023.
[208] Jimmy Thomas Howard Smith. “Advancing sequence modeling with deep state space methods.” PhD thesis. Stanford University, June 2024. url: https://purl.stanford.edu/gz824mn4488.
[209] Jascha Sohl-Dickstein, Eric Weiss, Niru Maheswaranathan, and Surya Ganguli. “Deep unsupervised learning using nonequilibrium thermodynamics.” In: International Conference on Machine Learning (ICML). 2015.
[210] Yang Song. “Learning to Generate Data by Estimating Gradients of the Data Distribution.” PhD thesis. Stanford University, 2022. url: https://purl.stanford.edu/zy983tp3399.
[211] Yang Song and Stefano Ermon. “Generative Modeling by Estimating Gradients of the Data Distribution.” In: Advances in Neural Information Processing Systems (NeurIPS). 2019.
[212] Yang Song, Chenlin Meng, and Stefano Ermon. “Mintnet: Building invertible neural networks with masked convolutions.” In: Advances in Neural Information Processing Systems (NeurIPS) (2019).
[213] Yang Song, Chenlin Meng, Renjie Liao, and Stefano Ermon. “Accelerating Feedforward Computation via Parallel Nonlinear Equation Solving.” In: International Conference on Machine Learning (ICML). 2021.
[214] Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole. “Score-Based Generative Modeling through Stochastic Differential Equations.” In: International Conference on Learning Representations (ICLR). 2021.
[215] H. W. Sorenson. “Kalman Filtering Techniques.” In: Kalman Filtering: Theory and Application. Ed. by H. W. Sorenson. New York: IEEE Press, 1966, p. 90.
[216] Harold S. Stone. “An Efficient Parallel Algorithm for the Solution of a Tridiagonal Linear System of Equations.” In: Journal of the ACM 20.1 (1973), p. 27–38.
[217] Steven H Strogatz. Nonlinear dynamics and chaos with student solutions manual: With applications to physics, biology, chemistry, and engineering. CRC press, 2018.
[218] Dawei Sun, Susmit Jha, and Chuchu Fan. “Learning certified control using contraction metric.” In: Conference on Robot Learning. PMLR. 2021, p. 1519–1539.
[219] Ilya Sutskever.
“Training recurrent neural networks.” PhD thesis. University of Toronto, 2013.
[220] Richard S Sutton and Andrew G Barto. Reinforcement Learning: An Introduction. Second. MIT press, 2018.
[221] Zhiwei Tang, Jiasheng Tang, Hao Luo, Fan Wang, and Tsung-Hui Chang. “Accelerating Parallel Sampling of Diffusion Models.” In: International Conference on Machine Learning (ICML). 2024.
[222] Aleksandar Terzić, Nicolas Menet, Michael Hersche, Thomas Hoffman, and Abbas Rahimi. “Structure Sparse Transition Matrices to Enable State Tracking in State-Space Models.” In: Advances in Neural Information Processing Systems (NeurIPS). 2025.
[223] Philip Duncan Thompson. “Uncertainty of initial state as a factor in the predictability of large scale atmospheric flow patterns.” In: Tellus 9.3 (1957), p. 275–295.
[224] Andrei N Tikhonov. “Solution of incorrectly formulated problems and the regularization method.” In: Sov Dok 4 (1963), p. 1035–1038.
[225] Hiroyasu Tsukamoto, Soon-Jo Chung, and Jean-Jacques E Slotine. “Contraction theory for nonlinear stability analysis and learning-based control: A tutorial overview.” In: Annual Reviews in Control 52 (2021), p. 135–169.
[226] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is All You Need.” In: Advances in Neural Information Processing Systems (NeurIPS). 2017.
[227] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. “SOAP: Improving and stabilizing shampoo using adam.” In: International Conference on Learning Representations (ICLR). 2025.
[228] Saurabh Vyas, Matthew D. Golub, David Sussillo, and Krishna V. Shenoy. “Computation Through Neural Population Dynamics.” In: Annual Review of Neuroscience 43 (2020), p. 249–275.
[229] Homer F Walker and Peng Ni. “Anderson acceleration for fixed-point iterations.” In: SIAM Journal on Numerical Analysis 49.4 (2011), p. 1715–1735.
[230] Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. “Hierarchical Reasoning Model.” In: arXiv preprint arXiv:2506.21734 (2025).
[231] Ke Alexander Wang, Jiaxin Shi, and Emily B Fox. “Test-time regression: a unifying framework for designing sequence models with associative memory.” In: arXiv preprint arXiv:2501.12352 (2025).
[232] Paul J Werbos. “Backpropagation through time: what it does and how to do it.” In: Proceedings of the IEEE 78.10 (1990), p. 1550–1560.
[233] Matthew O Williams, Ioannis G Kevrekidis, and Clarence W Rowley. “A Data-Driven Approximation of the Koopman Operator: Extending Dynamic Mode Decomposition.” In: Journal of Nonlinear Science 25 (2015), p. 1307–1346.
[234] Ronald J Williams and David Zipser. “A learning algorithm for continually running fully recurrent neural networks.” In: Neural Computation 1.2 (1989), p. 270–280.
[235] Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. “Gated Linear Attention Transformers with Hardware-Efficient Training.” In: International Conference on Machine Learning (ICML). 2024.
[236] Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. “Parallelizing Linear Transformers with the Delta Rule over Sequence Length.” In: Proceedings of NeurIPS. 2024.
[237] Zhewei Yao, Amir Gholami, Sheng Shen, Mustafa Mustafa, Kurt Keutzer, and Michael Mahoney. “Adahessian: An adaptive second order optimizer for machine learning.” In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 35. 2021, p. 10665–10673.
[238] David M. Young. Iterative Solution of Large Linear Systems. Elsevier, 2014. isbn: 978-0-12-773050-9.
[239] Youjing Yu, Rui Xia, Qingxi Ma, Máté Lengyel, and Guillaume Hennequin. “Second-order forward-mode optimization of recurrent neural networks for neuroscience.” In: Neural Information Processing Systems (NeurIPS). 2024.
[240] Riccardo Zattra, Giacomo Baggio, Umberto Casti, Augusto Ferrante, and Francesco Ticozzi. “Context-Selective State Space Models: Feedback is All You Need.” In: arXiv preprint arXiv:2510.14027 (2025).
[241] Ali Zemouche and Mohamed Boutayeb. “Observer design for Lipschitz nonlinear systems: the discrete-time case.” In: IEEE Transactions on Circuits and Systems I: Express Briefs 53.8 (2006), p. 777–781.
[242] Jim Zhao, Sidak Pal Singh, and Aurelien Lucchi. “Theoretical characterisation of the Gauss-Newton conditioning in Neural Networks.” In: Neural Information Processing Systems (NeurIPS). 2024.
[243] Yixiu Zhao and Scott Linderman. “Revisiting structured variational autoencoders.” In: International Conference on Machine Learning (ICML). 2023.
[244] David M. Zoltowski, Skyler Wu, Xavier Gonzalez, Leo Kozachkov, and Scott W. Linderman. “Parallelizing MCMC Across the Sequence Length.” In: Advances in Neural Information Processing Systems (NeurIPS). 2025.
[245] Nicolas Zucchet and Antonio Orvieto. “Recurrent neural networks: vanishing and exploding gradients are not the end of the story.” In: Neural Information Processing Systems (NeurIPS). 2024.