Paper deep dive
Scheduling Parallel Optical Circuit Switches for AI Training
Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin
Abstract
The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading approach. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix $D$ over $s$ parallel OCSes while minimizing the makespan under reconfiguration delay $\delta$. Our algorithm Spectra relies on a three-step approach: Decompose $D$ into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.
Tags
Links
- Source: https://arxiv.org/abs/2603.07373v1
- Canonical: https://arxiv.org/abs/2603.07373v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 12:33:09 AM
Summary
The paper introduces 'Spectra', an algorithm for scheduling AI traffic demand matrices across parallel optical circuit switches (OCSes) while accounting for reconfiguration delays. Spectra employs a three-step approach: decomposing the demand matrix into weighted permutations, scheduling these permutations using a load-aware assignment, and equalizing switch loads via permutation splitting. Evaluation on GPT and MoE workloads shows Spectra significantly reduces makespan compared to state-of-the-art baselines.
Entities (5)
Relation Signals (3)
Spectra → schedules traffic on → Optical Circuit Switches
confidence 98% · Spectra algorithm for scheduling an AI demand matrix D across s parallel switches
Spectra → evaluated on → GPT
confidence 95% · Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing)
Spectra → optimizes → Makespan
confidence 95% · Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan
Cypher Suggestions (2)
Find all algorithms and the metrics they optimize · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:OPTIMIZES]->(m:Metric) RETURN a.name, m.name
List hardware used by specific algorithms · confidence 90% · unvalidated
MATCH (a:Algorithm)-[:SCHEDULES_TRAFFIC_ON]->(h:Hardware) RETURN a.name, h.name
Full Text
51,958 characters extracted from source content.
Scheduling Parallel Optical Circuit Switches for AI Training

Kevin Liang, Litao Qiao, Isaac Keslassy, Bill Lin

K. Liang, L. Qiao and B. Lin are with UC San Diego (e-mails: k3liang, l1qiao, billlin@ucsd.edu). I. Keslassy is with the Technion and UC Berkeley (isaac@technion.ac.il). This work has been submitted to the IEEE for possible publication.

Abstract

The rapid growth of AI training has dramatically increased datacenter traffic demand and energy consumption, which has motivated renewed interest in optical circuit switches (OCSes) as a high-bandwidth, energy-efficient alternative for AI fabrics. Deploying multiple parallel OCSes is a leading approach. However, efficiently scheduling time-varying traffic matrices across parallel optical switches with non-negligible reconfiguration delays remains an open challenge. We consider the problem of scheduling a single AI traffic demand matrix D over s parallel OCSes while minimizing the makespan under reconfiguration delay δ. Our algorithm Spectra relies on a three-step approach: Decompose D into a minimal set of weighted permutations; Schedule these permutations across parallel switches using load-aware assignment; then Equalize the imbalanced loads on the switches via controlled permutation splitting. Evaluated on realistic AI training workloads (GPT model and Qwen MoE expert routing) as well as standard benchmarks, Spectra vastly outperforms a baseline based on state-of-the-art algorithms, reducing schedule makespan by an average factor of 1.4× on GPT AI workloads, 1.9× on MoE AI workloads, and 2.4× on standard benchmarks. Further, the makespans achieved by Spectra consistently approach newly derived lower bounds.

I Introduction

Datacenter networks have had trouble scaling at the speed of large-scale AI training compute, especially lagging behind AI bandwidth and power requirements [1, 2, 3].
Unlike traditional cloud workloads, AI training produces sustained high bisection bandwidth demand with repeated collective communication patterns across a large number of accelerators [4]. Due to its tight iteration-level synchronization constraints, it is extremely sensitive to the worst-case completion time, i.e., the collective makespan, also known as the collective completion time (CCT). These properties stress networks based on conventional electronic packet-switched fabrics, whose power and bandwidth capabilities scale poorly with network size [1]. A significant body of literature has proposed the use of optical circuit switches (OCSes) in datacenters as a compelling alternative [1, 5, 6, 7, 8]. In comparison with electronic packet switches, OCSes provide much higher bandwidth at much lower per-bit energy, both of which are important in meeting the rapidly increasing demands on AI datacenter networks. To increase aggregate capacity, prior work increasingly considers employing parallel OCSes [9, 10, 11, 12, 13]. On the other hand, OCSes suffer from reconfiguration delays that can significantly affect the CCT of AI training collectives. In this paper, we consider the problem of scheduling an AI traffic demand matrix over a set of parallel OCSes with reconfiguration delays, while minimizing the makespan (and therefore the CCT) across the parallel switches.

Related work. The existing literature mostly minimizes schedule length either assuming a single optical switch, or a multi-stage switch architecture that can be abstracted as a single switch [6, 7, 8]. In particular, these existing works focus on scheduling a given traffic demand matrix D onto an OCS by decomposing D into a sequence of configurations of given durations, such that the traffic demands from D are served.
A critical difference with the analogous decomposition problem for electronic switches is that OCSes incur a nontrivial reconfiguration delay δ when the switch configuration has to change, which adds to the total delay (the makespan) in servicing D. Existing works (e.g., [6, 7, 8]) are effective in solving this scheduling problem with reconfiguration delays for the single-OCS case. Less [9] is the most relevant work on scheduling a demand matrix D for s parallel OCSes. It splits D into s sub-matrices using a balancing criterion, then schedules each sub-matrix on a separate OCS. However, Less focuses more on a partial-reconfiguration scheduling algorithm intended for partially-reconfigurable switches. Additional works consider parallel OCSes, but under different assumptions: they (1) achieve distributed scheduling [10], but with lower matching efficiency; (2) use OCSes to route around failures [11]; (3) extend results to a multi-hop schedule [12], which we do not consider in this paper; and (4) focus on integer-weight decompositions [13].

Contributions. We introduce the Spectra algorithm for scheduling an AI demand matrix D across s parallel switches with reconfiguration delay, while minimizing a CCT-based makespan objective. A key idea of Spectra is to cut the problem into three successive sub-problems, each practically solved with polynomial-time algorithms: (i) Decompose D into a set of permutations, (ii) Schedule these permutations across the s switches via load-aware scheduling, then (iii) Equalize the loads on the switches by splitting the durations of some of the permutations on the most-loaded switches and moving part of their load to the least-loaded ones. We rigorously derive new lower bounds on the achievable makespan for arbitrary demand matrices by any parallel OCS scheduling algorithm. We also introduce a new Qwen-57B Mixture of Experts (MoE) workload by measuring its traffic in a 64-GPU cluster.
We then compare Spectra against a Baseline algorithm that extends the sparsity ideas of Less [9], as well as against an Eclipse [6]-based variant of Spectra, using three workloads: (i) a GPT-based AI training model, (ii) the new Qwen MoE measurements, and (iii) a standard benchmark from the literature. Spectra vastly outperforms Baseline, with a schedule that is on average shorter by a factor of 1.4× on GPT AI workloads, 1.9× on MoE AI workloads, and 2.4× on standard benchmarks. It also always outperforms the Eclipse-based variant. More importantly, the makespans achieved by Spectra consistently approach the theoretically derived lower bounds, achieving near-optimal performance in practice with fast polynomial-time algorithms and robust behavior across sparse and dense traffic regimes. Our results show that algorithmic co-design of traffic decomposition and parallel OCS scheduling is critical for improving optical circuit switching for next-generation AI datacenters. (This paper extends our earlier short letter publication [14]. We plan to share the code upon publication.)

Figure 1: Datacenter network topology with s parallel OCSes.

II Problem Formulation

Demand matrix. As Fig. 1 illustrates, we consider a datacenter core of s parallel optical circuit switches (OCSes). Each OCS has n input and output ports, connected to the Top-of-Rack (ToR) switches of n datacenter racks. Periodically, a centralized controller computes a new n×n traffic demand matrix D. Each entry D_ij corresponds to the amount of traffic that needs to be switched from rack i to rack j over the next period.

Switch schedule. Each OCS has a schedule that consists of a sequence of permutations, in each of which every input rack i is connected to a single output rack j. A sequence of k permutations P_1, P_2, …, P_k, with corresponding weights α_1, α_2, …, α_k > 0, is said to cover D if the following is satisfied:

∑_{i=1}^k α_i P_i ≥ D.
(1)

In addition, the optical switch needs a reconfiguration delay δ before each permutation. Thus, the time needed to schedule the k permutations in a single switch is

∑_{i=1}^k (δ + α_i) = k·δ + ∑_{i=1}^k α_i. (2)

Makespan. We now consider the s schedules of the s optical switches. As above, consider some switch h, with h ∈ [1, s]. Assume that each such switch uses a schedule with k^h permutations P_i^h and corresponding weights α_i^h, for i ∈ [1, k^h]. Then, extending Eq. 1, the set of all s switch schedules covers D when

∑_{h=1}^s [ ∑_{i=1}^{k^h} α_i^h P_i^h ] ≥ D. (3)

The makespan of the s parallel switch schedules is the delay of the longest one, i.e., using Eq. 2,

max_{1≤h≤s} [ ∑_{i=1}^{k^h} (δ + α_i^h) ]. (4)

Objective. The goal of this paper is to find all permutations and weights in the s schedules of the s parallel optical switches, so that they manage to cover D (Eq. 3) while minimizing the makespan. The general minimization problem of the above objective is NP-hard, even with s = 1 [15]. Thus, in this paper, we focus on finding a practical algorithm that typically attains a small makespan in practice.

III Algorithms

III-A Overview

In this section, we introduce our Spectra algorithm (short for Scheduling parallel circuit switches for data center traffic). As formally defined in Section II, Spectra finds s schedules of D over s parallel optical switches, such that the schedules cover D while attaining a small makespan. We start by illustrating Spectra's three-step approach using an example, before formally defining each step.

Decompose. The first step in Spectra is to Decompose a demand matrix D into a set of permutations {P_1, P_2, …, P_k} with a corresponding set of weights {α_1, α_2, …, α_k} such that their weighted sum covers D, i.e., ∑_{i=1}^k α_i P_i ≥ D. Consider the demand D shown in Fig. 2. Fig.
3 provides an example of how D can be decomposed into 3 permutations, with the corresponding weights 0.61, 0.3, and 0.1. Fig. 3 also shows that their weighted sum indeed covers D. As detailed later in this section, given at most k nonzero elements in any row or column of D, our Decompose algorithm guarantees that exactly k permutations will be generated to cover D. It minimizes the number of reconfiguration delays needed while also reducing the total duration of the permutations.

Schedule. After the Decompose step, we next Schedule the k permutations across the s switches. This step is equivalent to scheduling k jobs across s machines, with each α_i representing the amount of time that a job (permutation) P_i needs to run, with added consideration for a reconfiguration delay δ each time a machine (switch) needs to be (re)configured to run a job (switch permutation). For this step, we first sort the permutations by non-increasing weights α_i, and then we schedule the corresponding P_i in non-increasing-weight order, each time to the least loaded switch. Fig. 4 illustrates the Schedule process for the 3 permutations from Fig. 3 across s = 2 switches with reconfiguration delay δ = 0.01. Initially, both switches are empty, as shown in Fig. 4. We first schedule the permutation with α_1 = 0.61 to the first switch, so the load of the first switch becomes 0.61 + 0.01 = 0.62. We next schedule the permutation with α_2 = 0.3 to the second switch, since it is empty and hence the least loaded. The load of the second switch becomes 0.3 + 0.01 = 0.31. Finally, we schedule the last permutation with α_3 = 0.1 to the second switch again, since it remains the least loaded of the two switches, making its final load 0.42. The makespan after the Schedule step is max(0.62, 0.42) = 0.62, as shown in Fig. 4(a).

Equalize.
As the final step, we aim to Equalize the loads on the switches by splitting the duration α_i of a permutation (switch configuration) P_i into multiple segments and moving some segments to different switches in order to load-balance the workloads. In particular, our Equalize step iteratively balances the loads between the most loaded switch h_max and the least loaded switch h_min by moving a portion of the longest-duration permutation in h_max to h_min. This is shown in Fig. 4(b), where we split α_1 = 0.61 in the first switch into two parts, an α_1^1 = 0.515 part that remains in the first switch and an α_1^2 = 0.095 part that moves to the second switch, resulting in the final makespan of 0.525, as shown in Fig. 4(b).

Figure 2: Example of demand matrix D.
Figure 3: Example decomposition into k = 3 weighted permutations that cover D.
Figure 4: (a) Scheduling the k = 3 permutations across s = 2 switches. (b) Equalizing the loads among the switches.

Each of the above steps is further detailed below.

III-B Decompose

Support matrix. Given an n×n matrix D, let S ∈ {0,1}^{n×n} be its support matrix, defined by S_ij = 1 if D_ij > 0 and S_ij = 0 otherwise. Let k be the degree of D, corresponding to the maximum number of nonzero elements in any row or column of D. By definition, the support matrix S of D also has degree k, corresponding to a maximum of k 1's in any row or column of S. The following property follows from König's Line Coloring Theorem:

Property 1. k permutations are necessary and sufficient to cover any matrix of degree k.

Decompose intuition. As Fig. 4(a) illustrates, the sum of all the schedule delays includes (1) δ times the number of permutations (in this example, 0.01·3), and (2) the sum of the permutation weights (0.61 + 0.3 + 0.1 = 1.01). Thus, we want to minimize both of these components.
(1) First, by Property 1, k permutations can cover D via some α_1, α_2, …, α_k such that ∑_{i=1}^k α_i P_i ≥ D. In our approach, the goal of minimizing the number of permutations is to later minimize the number of reconfiguration delays. Therefore, our approach is to decompose D into exactly this minimum number k of permutations. (2) However, there may be many possible sets of k permutations that can cover D. Among these, we want to find a set of permutations along with a set of weights α_1, α_2, …, α_k that also minimizes the total duration ∑_{i=1}^k α_i.

Algorithm 1 Decompose
1: Input: D and δ
2: Assign: S = support matrix of D, D_rem = D, S_rem = S, i = 1, A = {} and P = {}
3: while S_rem ≠ ∅ do
4:   P_i ← MWM-Node-Coverage(D_rem, S_rem)
5:   α_i ← min_{(a,b): (P_i)_ab = 1} (D_rem)_ab
6:   P ← P ∪ {P_i}
7:   A ← A ∪ {α_i}
8:   D_rem ← D_rem − α_i P_i
9:   S_rem ← S_rem − (P_i ∩ S_rem)
10:  i ← i + 1
11: Â ← Refine(D, A, P)
12: return (Â, P)

Decompose pseudocode. Alg. 1 shows the pseudocode of our Decompose algorithm. Given D, we first derive its support matrix S. Decompose then iteratively generates a set of k permutations P = {P_1, P_2, …, P_k} that cover S by solving a maximum weight matching (MWM) problem under node coverage constraints [16] in each iteration. At a high level, we track the remaining traffic in D and the remaining uncovered elements in S after each round, denoted as D_rem and S_rem, respectively. They are initially D_rem = D and S_rem = S. Based on the updated D_rem and S_rem in each round, we would like to generate a permutation that satisfies two properties: (1) we want to choose a permutation that is guaranteed to reduce the degree of S_rem in each round.
This way, the algorithm ends after exactly k rounds, generating exactly k permutations, where k is the degree of the initial support matrix S. (2) Second, among the permutations that reduce the degree of S_rem by 1, we want to choose one that reduces the amount of remaining traffic in D_rem as much as possible. Fortunately, this is a well-known optimization problem called MWM under node coverage constraints [17, 16], for which efficient polynomial-time algorithms already exist that satisfy our two desired properties. In this formulation, we identify critical rows and columns in S_rem, which are the rows and columns that have k nonzero entries, where k is the degree of S_rem. To ensure that the degree of S_rem gets reduced by 1, we constrain the matching of every critical row or column to a nonzero entry in that row or column. Among the permutations that satisfy this constrained matching, we use D_rem as weights to optimize the MWM. This constrained MWM problem can be solved in polynomial time using the Hungarian algorithm [16], modified to handle the above coverage constraints. This is shown in Line 4 of Alg. 1. For the generated P_i, we derive the initial weight α_i by taking the minimum entry (D_rem)_ab over all (a, b) where the corresponding entry (P_i)_ab is 1 (Line 5). We then update D_rem and S_rem at the end of each round accordingly (Lines 8-9), and we repeat the while-loop until all 1-elements in S are covered by a permutation in P. Note that S_rem at each round corresponds to the remaining uncovered elements of S up to that round, and not necessarily to the support of D_rem.
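To make the Decompose loop concrete, here is a deliberately tiny Python sketch. It is not the paper's implementation: instead of the constrained maximum-weight matching solved with a modified Hungarian algorithm, it brute-forces all n! permutations (so it only runs for very small n), scores each candidate by how many still-uncovered support entries it hits (breaking ties by remaining traffic), and folds in a greedy refine pass in the spirit of Alg. 2. All names are ours.

```python
import itertools

def decompose(D, tol=1e-12):
    """Toy sketch of Spectra's Decompose step (tiny n only).

    Returns (weights, perms) with sum_i weights[i] * perms[i] >= D
    elementwise; perms are tuples mapping input i to output p[i].
    """
    n = len(D)
    rem = [row[:] for row in D]                                   # D_rem
    supp = {(i, j) for i in range(n) for j in range(n) if D[i][j] > tol}
    weights, perms = [], []
    while supp:                                                   # S_rem != {}
        best_p, best_score = None, (-1, -1.0)
        for p in itertools.permutations(range(n)):
            # proxy for the constrained MWM criterion:
            # (# uncovered support entries hit, remaining traffic carried)
            score = (sum((i, p[i]) in supp for i in range(n)),
                     sum(rem[i][p[i]] for i in range(n)))
            if score > best_score:
                best_score, best_p = score, p
        p = best_p
        # initial weight: smallest remaining demand on a newly covered entry
        alpha = min(rem[i][p[i]] for i in range(n) if (i, p[i]) in supp)
        weights.append(alpha)
        perms.append(p)
        for i in range(n):
            rem[i][p[i]] = max(0.0, rem[i][p[i]] - alpha)
            supp.discard((i, p[i]))
    # refine pass (in the spirit of Alg. 2): close any leftover deficit
    deficit = [row[:] for row in D]
    for w, p in zip(weights, perms):
        for i in range(n):
            deficit[i][p[i]] -= w
    for idx, p in enumerate(perms):
        d = max(deficit[i][p[i]] for i in range(n))
        if d > 0:
            weights[idx] += d
            for i in range(n):
                deficit[i][p[i]] -= d
    return weights, perms
```

On a doubly-stochastic 2×2 matrix of 0.5s, this yields the two obvious permutations with weight 0.5 each, matching Property 1 (degree 2 → exactly 2 permutations); the brute-force greedy is of course no substitute for the constrained Hungarian matching on realistic n.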
Algorithm 2 Subroutine Refine
1: Input: D, A and P
2: Â ← A
3: D_rem ← D − ∑_{i=1}^k α_i P_i
4: for i = 1 to k do
5:   d ← max_{(a,b): (P_i)_ab = 1} (D_rem)_ab
6:   α̂_i ← α̂_i + d
7:   for all (a, b) such that (P_i)_ab = 1 do
8:     (D_rem)_ab ← max(0, (D_rem)_ab − d)
9: return Â

Refine subroutine. After the while-loop in Lines 3-10 of Alg. 1, we are guaranteed that the set of permutations P = {P_1, P_2, …, P_k} covers S, but the weighted sum with the corresponding A = {α_1, α_2, …, α_k} may not necessarily cover D. This is because in Line 5 we take the minimum value in D_rem covered by P_i, while our stopping criterion is only that the initial support matrix S is covered by the set of permutations. The role of the final Refine step in Line 11 of Alg. 1 is to refine the set of weights to ensure that the weighted sum indeed covers D. Given a set of k permutations P = {P_1, P_2, …, P_k} that covers S, we could readily derive a set of new weights Â = {α̂_1, α̂_2, …, α̂_k} such that the weighted sum covers D, by solving the following linear programming (LP) problem:

min ∑_{i=1}^k α̂_i, s.t. ∑_{i=1}^k α̂_i P_i ≥ D. (5)

Instead, starting from the initial weights A = {α_1, α_2, …, α_k} from Lines 3-10, we greedily increase the weights to a new set of weights Â such that the weighted sum covers D (see Alg. 2). We find that in practice this greedy approach works just as well as solving the LP.

III-C Schedule

After the Decompose step, we have a set of permutations P and their corresponding weights A. In the Schedule step, we want to assign each of these permutations to one of the s switches to minimize the makespan (Eq. 4). This is equivalent to the makespan minimization problem on identical parallel machines, for which the Longest Processing Time (LPT) First algorithm [18] is a well-known and effective greedy heuristic.
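Concretely, LPT sorts the jobs (here, permutation weights) in non-increasing order and repeatedly assigns the next job to the least loaded machine; adding the per-assignment reconfiguration delay δ gives the following Python sketch (our own illustrative code, not the paper's):

```python
def schedule_lpt(weights, perms, s, delta):
    """LPT sketch: assign each permutation, in non-increasing weight
    order, to the currently least loaded of the s switches, paying a
    reconfiguration delay delta per assignment. Returns the per-switch
    (weight, permutation) lists and loads; the makespan is max(loads)."""
    loads = [0.0] * s
    assigned = [[] for _ in range(s)]
    for i in sorted(range(len(weights)), key=lambda i: -weights[i]):
        h = min(range(s), key=lambda h: loads[h])   # least loaded switch
        assigned[h].append((weights[i], perms[i]))
        loads[h] += delta + weights[i]
    return assigned, loads
```

On the running example (weights 0.61, 0.3, 0.1 with s = 2 and δ = 0.01), this reproduces the per-switch loads 0.62 and 0.42 of Fig. 4(a), i.e., a makespan of 0.62.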
We adopt it for our Schedule step (Alg. 3).

Algorithm 3 Schedule
1: Input: A, P, s and δ
2: Sort P by non-increasing weights in A such that α_σ(1) ≥ α_σ(2) ≥ ⋯ ≥ α_σ(k)
3: for h = 1 to s do
4:   L_h ← 0, P^h ← {}, A^h ← {}
5: for i = 1 to k do
6:   h* ← argmin_h L_h
7:   P^{h*} ← P^{h*} ∪ {P_σ(i)}
8:   A^{h*} ← A^{h*} ∪ {α_σ(i)}
9:   L_{h*} ← L_{h*} + δ + α_σ(i)
10: S ← {(A^1, P^1), (A^2, P^2), …, (A^s, P^s)}
11: return S

Schedule pseudocode. In this greedy algorithm, we first sort the permutations by non-increasing weights, as shown in Line 2. We then initialize the load L_h of every switch h to 0, with no permutations assigned yet, as shown in Lines 3-4. Then in Lines 5-9, following a non-increasing order of weights, we greedily assign the corresponding permutation to the least loaded switch (Line 6), add the permutation to the set of permutations already assigned to that switch (Lines 7-8), and update the load of that switch, including the reconfiguration delay δ (Line 9). In the end, the procedure returns the schedule S of the s switches in the form of weights and permutations (A^h, P^h) per switch.

III-D Equalize

Finally, Fig. 4(b) illustrates how we can reduce the makespan by redistributing some of the switching load from one switch to another.
Algorithm 4 Equalize
1: Input: S = {(A^1, P^1), (A^2, P^2), …, (A^s, P^s)}, s and δ
2: for h = 1 to s do
3:   L_h ← ∑_{i=1}^{k^h} (δ + α_i^h)
4: cont ← true
5: while cont do
6:   h_max ← argmax_h L_h
7:   h_min ← argmin_h L_h
8:   if (L_{h_max} − L_{h_min}) > δ then
9:     μ ← (L_{h_max} + L_{h_min} + δ)/2
10:    z ← argmax_i α_i^{h_max}
11:    if α_z^{h_max} > (L_{h_max} − μ) then
12:      τ ← L_{h_max} − μ
13:      α_z^{h_max} ← α_z^{h_max} − τ
14:      Copy permutation P_z^{h_max} to P^{h_min}
15:      Add τ as corresponding weight to A^{h_min}
16:      L_{h_max} ← L_{h_max} − τ
17:      L_{h_min} ← L_{h_min} + δ + τ
18:    else
19:      cont ← false
20:  else
21:    cont ← false
22: Ŝ ← updated {(A^1, P^1), (A^2, P^2), …, (A^s, P^s)}
23: return Ŝ

Equalize pseudocode. The pseudocode for this Equalize step is shown in Alg. 4. We first compute the load L_h of each switch after the Schedule step in Lines 2-3. We then iteratively equalize the loads in Lines 5-21 by moving a portion of the load from the most loaded switch h_max to the least loaded switch h_min. Lines 6-7 identify h_max and h_min. If the difference between them is more than δ, we compute the target load μ as the average of the two loads, including the δ delay incurred when a portion of h_max is moved to h_min (Lines 8-9). In the most loaded switch h_max, we identify the index z of the longest-duration permutation in Line 10. If the weight α_z^{h_max} of this permutation is more than the difference between the current and target load of h_max, then we move the difference τ from α_z^{h_max} in h_max to h_min by copying the corresponding permutation P_z^{h_max} to switch h_min and assigning it a weight of τ (Lines 11-15). We then adjust the loads of the two switches accordingly in Lines 16-17. The iterations continue as long as load balancing is possible.
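The Equalize loop can likewise be sketched in Python (again our own illustrative code, operating on per-switch lists of (weight, permutation) pairs such as those the Schedule step produces):

```python
def equalize(assigned, loads, delta):
    """Sketch of Alg. 4: while the gap between the most and least
    loaded switches exceeds delta, split the longest-duration
    permutation on the most loaded switch and move a tau-sized chunk
    of it to the least loaded switch, which pays one extra
    reconfiguration delay delta for the new configuration."""
    s = len(loads)
    while True:
        h_max = max(range(s), key=lambda h: loads[h])
        h_min = min(range(s), key=lambda h: loads[h])
        if loads[h_max] - loads[h_min] <= delta:
            break                        # balanced enough
        mu = (loads[h_max] + loads[h_min] + delta) / 2   # target load
        tau = loads[h_max] - mu
        z = max(range(len(assigned[h_max])),
                key=lambda i: assigned[h_max][i][0])     # longest permutation
        w, p = assigned[h_max][z]
        if w <= tau:
            break                        # cannot split usefully; stop
        assigned[h_max][z] = (w - tau, p)
        assigned[h_min].append((tau, p))                 # same permutation
        loads[h_max] -= tau
        loads[h_min] += delta + tau
    return assigned, loads
```

Starting from the example loads 0.62 and 0.42 with δ = 0.01, one split of α_1 = 0.61 into 0.515 and 0.095 balances both switches at 0.525, matching Fig. 4(b).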
IV Lower Bounds

In this section, we provide new lower bounds on the scheduling makespan of any demand matrix D using s optical switches and a reconfiguration cost δ > 0.

Notations. We index the rows and columns of the n×n matrix D using i ∈ [1, 2n], such that the first n indices cover the n rows and the next n indices correspond to the n columns. We also denote x⁺ ≡ max(x, 0). Given some row or column of D, we provide below several lower bounds using different approaches. Then, the maximum of all these lower bounds over all rows/columns is also a lower bound for the makespan.

Property 2. Assume that an alternative demand matrix D_i, constructed with the same row/column i ∈ [1, 2n] as D and zero everywhere else, yields L lower bounds LB_i^(l), for l ∈ [1, L], on its schedule makespan. Then a lower bound on the schedule makespan for D is

LB = max_{i ∈ [1, 2n]} max_{l ∈ [1, L]} LB_i^(l). (6)

Proof: To schedule D we need to schedule at least its row/column i, so LB_i^(l) is a lower bound on the makespan for D. Moreover, the finite cartesian product of finite sets also has a finite cardinality; and for any finite set of finite numbers, the maximum element exists, is within that set, and is also finite. Thus, the maximum of the lower bounds is a lower bound itself, and it will be one of these lower bounds LB_i^(l). ∎

Lower bound 1. We start by demonstrating a general lower bound for any number of elements per row/column.

Theorem 1 (Lower bound 1). Assume that row/column i has k_i nonzero elements and a total weight w_i. Then the scheduling makespan has lower bound

LB_i^(1) = (w_i + δ·max(k_i, s)) / s. (7)

Proof: Any set of s schedules that we produce needs to schedule at least the k_i nonzero elements of row/column i ∈ {1, …, 2n}. To determine a lower bound, we assume that this is all we need to service, and neglect the other elements in the demand matrix.
We need at least k_i scheduling configurations to schedule k_i elements. Their total weight also needs to be at least w_i. Thus, the sum of the s schedule lengths is at least w_i plus the reconfiguration times, i.e., w_i + k_i·δ. Since we have s parallel schedules, the worst-case schedule duration is at least the average schedule duration, i.e., at least (w_i + k_i·δ)/s. In addition, we need to schedule w_i over the s switches, yielding a w_i/s average delay, and each switch needs at least δ reconfiguration time. Thus another lower bound is w_i/s + δ. Combining both, we get the result. ∎

Example. Assume that in row/column i of a doubly-stochastic demand matrix D, there are k_i = 16 nonzero elements. If s = 4, then our lower bound is LB_i^(1) = (1 + 16·δ)/4 = 1/4 + 4·δ.

Lower bound 2. We now prove a second lower bound for the special case where the number k_i of nonzero elements in row or column i exactly equals s. We start with a useful lemma.

Lemma 1. If row/column i has k_i = s nonzero elements, and there is a total of s + m configurations in the s parallel schedules, then the s parallel schedules will necessarily service at most m element chunks. In other words, at least s − m whole elements will be serviced within a single configuration and will not be spread between configurations.

Proof: We prove the lemma by induction on m. First, if there are no reconfigurations at all (m = 0) and the s nonzero elements are scheduled by the s configurations of the s parallel schedules, then necessarily each element is assigned to a single distinct configuration and a single distinct schedule. That is, they cannot be combined within any schedule in any way. Otherwise, if one of them were cut into two chunks that are serviced by different configurations and/or schedules, then we would have s + 1 chunks serviced within the s schedules.
Thus, by the pigeonhole principle, at least one schedule would service at least two chunks, which would need a reconfiguration. Contradiction. Assume that the statement holds for m ≥ 0, and let's prove it for m + 1 ≥ 1 reconfigurations. (1) Assume that the last reconfiguration of at least one schedule out of the s schedules is followed by (a) an element chunk or (b) a zeroed element. Then by removing the reconfiguration and (a) appending the element chunk to another chunk of this element, or (b) throwing out the zeroed element, we get a valid schedule with m reconfigurations. By the inductive hypothesis, this alternative valid schedule has at most m chunks, and therefore we can conclude that our real schedule has at most m + 1 chunks. (2) Else, if only full nonzero elements follow reconfigurations, either (a) all nonzero elements are full and there are no element chunks, which proves the result, or (b) at least one schedule starts with a nonzero element chunk. Permute this nonzero element chunk with any nonzero full element that follows a reconfiguration, and we get back to case (1). Thus all cases hold for m + 1. ∎

Theorem 2 (Lower bound 2). Assume that row/column i has k_i = s nonzero elements x_1 ≥ x_2 ≥ ⋯ ≥ x_s, and a total weight w_i. Then the scheduling makespan has lower bound

LB_i^(2) = δ + min( x_1, max(x_2, (w_i + 1·δ)/s, x_s + δ), min_{2 ≤ m ≤ s²} max(x_{m+1}, (w_i + m·δ)/s) ). (8)

Proof: To establish the lower bound, we assume again that we only want to schedule the nonzero elements of row/column i ∈ {1, …, 2n}, and neglect all the other elements in D. We can use Lemma 1. With no reconfigurations, all elements are serviced as whole elements (within a single configuration) and do not need to be split into several chunks. Thus the makespan is at least the weight of the largest element x_1, since it cannot be split, together with its configuration cost δ.
Likewise, with m reconfigurations, at least s − m elements will be serviced within a single configuration, and only m elements can be split. In particular, at least one element of weight at least x_{m+1} will be serviced within a single configuration, so the makespan is at least x_{m+1} + δ. Assume that our optimal schedule uses m reconfigurations. The theorem successively considers all possible numbers m of reconfigurations, and the makespan beyond the initial configuration cost of δ: (0) Assume we have no reconfiguration at all. Then we saw that the makespan is at least δ + x_1. Note that adding the average weight w_i/s to δ would also be a lower bound for the makespan (as also shown in Theorem 1 using k_i = s), but since x_1 is the maximum weight, x_1 ≥ w_i/s, so x_1 forms a tighter lower bound. (1) Assume we have exactly m = 1 reconfiguration. Then after the initial configuration cost of δ, the makespan is at least (a) x_2, as we saw; (b) the average schedule length, given that the total schedule length is the sum of the weight w_i and the reconfiguration delay δ, and that there are s schedules; and (c) the weight x_s of the smallest element together with the reconfiguration delay δ. This is because we saw that at most one element is cut. Consider the unique schedule with the reconfiguration delay. If both of its configurations are two chunks of the same cut element, then we can get a better schedule by merging them and removing the reconfiguration, so this cannot happen. So at least one of the two elements in the schedule with the reconfiguration delay is not cut, and thus its weight is at least x_s. To conclude with m = 1: since the makespan is at least each of these three options, it is also at least their maximum. (m) Assume we have m ≥ 2 reconfigurations.
Then, after the initial $\delta$, the makespan is at least (a) $x_{m+1}$, as we saw; and (b) the average schedule length $(w_i + m \cdot \delta)/s$, since the total schedule length is the sum of the weight $w_i$ and the $m$ reconfiguration delays of $\delta$ each, spread over $s$ schedules. Thus it is at least the maximum of these two quantities. Finally, since two chunks of the same element within the same schedule can always be merged, the optimal number of decompositions is no more than $s^2$, and therefore bounded. Moreover, the optimal number of reconfigurations with a minimal makespan could be any $m$, so the minimum of these values forms a lower bound. ∎

Note that when its condition $k_i = s$ holds, the lower bound from Theorem 2 is at least as good as the first one from Theorem 1: $LB_i^{(2)} \ge LB_i^{(1)}$. Furthermore, it is strictly better when not all nonzero elements are equal. Specifically, the first lower bound becomes $LB_i^{(1)} = w_i/s + \delta$. The proof above covers the case of $m = 0$ reconfigurations, and in the other cases the inequality is strict since $\delta + (w_i + m \cdot \delta)/s > \delta + w_i/s$ for $m > 0$. Moreover, in the appendix, we show that when the $n \times n$ demand matrix $D$ is the sum of $k$ random weighted permutations, then for a large enough $n$, it is highly likely that $D$'s degree is $k$, meaning there is at least one row or column with exactly $k$ nonzero elements, and therefore we can apply the lower bounds of Section IV to some row/column $i$ using the parameter $k$. This applies to the benchmark workload in the evaluations.

V Evaluations

V-A Settings

We now evaluate our Spectra algorithm. These evaluations have two goals. First, evaluate Spectra on traffic matrices from AI training workloads. To do so, we use both a relatively sparse and strongly skewed matrix from a GPT model, and introduce a relatively dense and uniform matrix from a model with a Mixture of Experts (MoE).
A second goal is to compare Spectra with previous state-of-the-art algorithms that were designed for similar objectives, and to understand the impact of different characteristics of $D$. To do so, we use a standard benchmark workload that has already been used in the evaluations of several papers.

AI training workload 1: GPT-3B. Our first AI training workload results from a GPT model [19] with 3B parameters. Li et al. [20] use Microsoft's DeepSpeed framework [21] to apply a hybrid parallelism that combines pipeline parallelism (PP), tensor parallelism (TP), and data parallelism (DP). They also adopt the default mapping approach of DeepSpeed on a platform of 32 RTX3090 GPUs. They obtain a $32 \times 32$ total traffic matrix $D$ that is strongly skewed and quite sparse. We then normalize $D$ to make it doubly stochastic, and add a small Gaussian noise of standard deviation $\sigma = 0.3\%$ to its nonzero elements, as in the literature benchmark.

AI training workload 2: MoE. We introduce a new workload with MoE traffic. We implement the Qwen2 MoE model (57B) on Megatron-LM across 64 NVIDIA H100 GPUs, organized as 8 HGX nodes (with 80 GB per GPU). We use 64 experts, with each expert placed on one of the 64 GPUs. We load the weights to model iterations at the end of training rather than at the start. During each training iteration, we track communication traces between experts and accumulate the iteration's traffic into a $64 \times 64$ demand matrix $D$, where each entry is the token count from source to destination.

Workload 3: Benchmark. We use the standard benchmark in the field [7, 6, 9]. It builds a $100 \times 100$ demand matrix $D$ that consists of $m = 16$ random flows per source port, including 4 large flows, which evenly split 70% of the bandwidth, and 12 small flows, which evenly split 30%, to reflect the sparse and skewed nature of datacenter traffic. Each flow corresponds to a permutation matrix, and the sum of the flows yields $D$.
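A minimal sketch of this benchmark construction, as our own NumPy reproduction (the function name, seed handling, and use of `numpy.random` are our assumptions, not the paper's code):

```python
import numpy as np

def benchmark_demand(n=100, num_large=4, num_small=12, seed=0):
    """Sketch of the benchmark demand D: per source port, 4 large
    flows evenly split 70% of the bandwidth and 12 small flows
    evenly split 30%; each flow is a weighted random permutation."""
    rng = np.random.default_rng(seed)
    D = np.zeros((n, n))
    weights = [0.70 / num_large] * num_large + [0.30 / num_small] * num_small
    for w in weights:
        perm = rng.permutation(n)   # one random destination per source
        D[np.arange(n), perm] += w  # add the weighted permutation matrix
    return D

D = benchmark_demand()
# D is doubly stochastic: each port sends and receives one full link
# bandwidth (normalized to 1) in total.
assert np.allclose(D.sum(axis=0), 1.0) and np.allclose(D.sum(axis=1), 1.0)
```

Note that two permutations can collide on the same entry, so a row may end up with fewer than 16 distinct nonzero elements; the appendix quantifies how likely it is that some row or column keeps all 16.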
Finally, the nonzero demands are perturbed with a small Gaussian noise of standard deviation 0.3% of the link bandwidth.

Algorithms. We want to evaluate our Spectra algorithm against the best state-of-the-art optical switch scheduling algorithms. Unfortunately, most algorithms in the literature consider a single optical switch and cannot be applied to scheduling across $s$ parallel optical switches. We compare Spectra against two algorithms:

(1) Baseline. The closest to our work is Less [9], which considers scheduling a demand matrix $D$ across $s$ parallel optical switches by splitting $D$ into $s$ sub-workload matrices $D_1, \dots, D_s$, each $D_i$ to be independently scheduled on switch $i$. The splitting objective is to maximize the sparsity of these sub-workload matrices by minimizing the total number of nonzero elements across $D_1, \dots, D_s$. While Less uses a partial-reconfiguration scheduling algorithm intended for partially-reconfigurable switches, we adopt its sparsity-based approach as our Baseline for comparisons. For an apples-to-apples comparison, we also use our Decompose algorithm to schedule the sub-workload matrices in Baseline, since Less does not provide a similar decomposition algorithm.

(2) Spectra (Eclipse). Since Eclipse [6] is considered the state of the art in decomposition with reconfiguration delays, we also compare against a Spectra variant that uses Eclipse [6] for the Decompose step in place of our decomposition algorithm described in Section I-B.

Implementation and Runs. For the Hungarian algorithm, we implemented an efficient variant, the Jonker–Volgenant algorithm, that runs faster in practice [22, 23]. We conduct our experiments on a computer with a 3.7 GHz AMD Ryzen Threadripper 3970X processor and 128 GB of memory, running Ubuntu 20.04.6 LTS. The runtimes for Spectra range from under 1 ms to 14 ms. (Datacenter switches typically use specialized network processors or ASICs, so Spectra runtimes can be much faster in practice.)
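As an implementation reference (our own aside, not part of the paper's artifact), an off-the-shelf member of the same Jonker–Volgenant family is SciPy's `linear_sum_assignment`, which implements a modified Jonker–Volgenant algorithm following Crouse [23]. A minimal sketch of extracting a maximum-weight permutation from a toy demand matrix, as a decomposition step might do (the matrix values are illustrative, not from the paper):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Toy 3x3 demand matrix (illustrative values only).
D = np.array([[0.0, 0.7, 0.3],
              [0.5, 0.0, 0.5],
              [0.5, 0.3, 0.2]])

# Find the permutation maximizing the total matched weight;
# SciPy uses a modified Jonker-Volgenant method internally.
rows, cols = linear_sum_assignment(D, maximize=True)

# Matched weights along the chosen permutation; the minimum entry
# bounds how much of this permutation can be peeled off D in a
# single switch configuration.
weights = D[rows, cols]
print(cols, weights.sum(), weights.min())
```

On this toy matrix the maximum-weight permutation sends ports 0, 1, 2 to 1, 2, 0, with total matched weight 1.7.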
Each datapoint is the average of 50 runs with random matrices.

V-B AI Workloads

Figure 5: MoE traffic matrix heatmap from 64 sender GPUs (rows) to 64 receiver GPUs (columns).

Fig. 5 shows the distribution of traffic during an iteration of the Qwen model. Its traffic is relatively dense. It is slightly non-uniform between different experts, but denser and more uniform than the GPT pattern, and slightly more uniform than some other smaller examples in the literature [1, 24, 25].

Figure 6: Performance with AI training workloads given varying reconfiguration delay $\delta$, with GPT and MoE workloads and two different numbers $s$ of OCSes. (a) GPT with $n = 32$. (b) MoE with $n = 64$.

Figure 7: Sensitivity to equalization with AI training workloads, comparing Spectra with and without equalization. (a) GPT with $n = 32$. (b) MoE with $n = 64$.

Fig. 6 presents the results for the sparse, skewed GPT workload (Fig. 6(a)) and for the dense, near-uniform MoE workload (Fig. 6(b)). It also shows our derived lower bound (Section IV). In all cases, Spectra far outperforms Baseline. In fact, Spectra's schedule is $1.4\times$ shorter than Baseline's on average on GPT traffic, and $1.9\times$ shorter on average on MoE traffic. Specifically, not only does Spectra perform much better for a low reconfiguration cost $\delta$, but its makespan also grows more slowly as $\delta$ grows. This is because Baseline optimizes for a minimal number of nonzero elements in each switch's sub-workload demand matrix, but this optimization does not directly minimize the number of reconfigurations and does not necessarily translate into a lower makespan. Using Eclipse for the Decompose step achieves close results with the GPT workload due to its well-structured and doubly stochastic matrix, but performs much worse with the denser and sub-stochastic MoE matrix. That is, the Eclipse-based variant of Spectra performs $1.1\times$ worse on average for GPT, and $1.8\times$ worse on average for MoE.
Finally, Spectra appears close to optimal, since it is close to the lower bound (LB), especially with the MoE workload, where it is indistinguishable from it, perhaps because it can focus on the few columns of highest-weight destination experts (Fig. 5). In summary, the results show that Spectra performs well on both sparse and dense matrices, which highlights its flexibility.

Fig. 7 analyzes the sensitivity of Spectra to its equalization component. We can see that when we need to schedule the dense and near-uniform MoE traffic with many small elements (Fig. 7(b)), it is relatively easy to spread the traffic, and therefore equalization does not help. However, with the GPT traffic and its large elements (Fig. 7(a)), the elements need to be split, and equalization makes a significant difference.

Figure 8: Sensitivity to noise with AI training workloads: comparison of Spectra variants with noise standard deviations 0.3% and 1%. (a) GPT with $n = 32$. (b) MoE with $n = 64$.

Fig. 8 presents the impact of added noise on the performance of Spectra and Spectra (Eclipse) in the AI training workloads. It plots the result of adding a small noise to all nonzero elements (of standard deviation $\sigma = 0.3\%$, plotted in gray) vs. a larger noise ($\sigma = 1\%$, plotted in the usual colors). The 0.3% noise is already added in the initial GPT workload, while we add it specifically to the MoE workload for the comparison. The initial GPT traffic matrix is more regular and doubly stochastic, and therefore noise has a slightly negative impact, especially for $s = 4$. On the contrary, MoE traffic is strongly sub-stochastic, so the noise can in fact make it less skewed and improve the performance.

V-C Benchmark Workload

Fig. 9 compares Spectra against the alternatives on the benchmark workload. Spectra vastly outperforms Baseline, with a $2.4\times$ shorter schedule on average.
Using Eclipse for the Decompose step always yields worse results here, and Spectra performs better than the Eclipse-based variant with a $1.2\times$ shorter schedule on average. Spectra also appears quite close to the lower bound, and therefore to optimality.

Fig. 10 analyzes the sensitivity of the algorithms to a varying number $m$ of outgoing flows per port, and therefore to a varying sparsity of the matrix $D$, given reconfiguration delay $\delta = 0.04$. Spectra is flexible enough to adapt to these differing sparsity patterns, again vastly outperforming Baseline. Using an Eclipse decomposition step never helps Spectra.

Figure 9: Benchmark workload.

Figure 10: Sensitivity to sparsity with the benchmark workload, with a varying number of flows.

VI Conclusion

In this paper, we showed how the increasing network demands in AI datacenters could be serviced by a set of parallel OCSes, on which we want to minimize the makespan. We then explained how Spectra can schedule a demand matrix $D$ over $s$ parallel switches using three steps: Decompose, Schedule, and Equalize. Our evaluation of Spectra shows that it vastly outperforms Baseline by an average factor of $1.4\times$ on GPT AI workloads, $1.9\times$ on MoE AI workloads, and $2.4\times$ on standard benchmarks.

Acknowledgment

The authors would like to thank Ori Cohen for providing the MoE measurements, and Mark Silberstein and Jose Yallouz for their assistance in coordinating the necessary resources. This work was partly supported by the Louis and Miriam Benjamin Chair in Computer-Communication Networks and NSF Award No. 2410053.

References

[1] X. Liao, Y. Sun, H. Tian, X. Wan, Y. Jin, Z. Wang, Z. Ren, X. Huang, W. Li, K. F. Tse et al., “MixNet: A runtime reconfigurable optical-electrical fabric for distributed mixture-of-experts training,” in ACM SIGCOMM, 2025, p. 554–574.

[2] A. Gangidi, R. Miao, S. Zheng, S. J. Bondu, G. Goes, H. Morsy, R. Puri, M. Riftadi, A. J. Shetty, J.
Yang et al., “RDMA over Ethernet for distributed training at Meta scale,” in ACM SIGCOMM, 2024, p. 57–70.

[3] J. Lu, J. Gao, F. Feng, Z. He, M. Zheng, K. Liu, J. He, B. Liao, S. Xu, K. Sun et al., “Alibaba Stellar: A new generation RDMA network for cloud AI,” in ACM SIGCOMM, 2025, p. 453–466.

[4] K. Qian, Y. Xi, J. Cao, J. Gao, Y. Xu, Y. Guan, B. Fu, X. Shi, F. Zhu, R. Miao et al., “Alibaba HPN: A data center network for large language model training,” in Proceedings of the ACM SIGCOMM 2024 Conference, 2024, p. 691–706.

[5] Y. Feng, T. Chen, Y. Wei, S. Shen, S. Wang, W. Li, K. Ma, and T. Hoefler, “RailX: A flexible, scalable, and low-cost network architecture for hyper-scale LLM training systems,” arXiv preprint arXiv:2507.18889, 2025.

[6] S. Bojja Venkatakrishnan, M. Alizadeh, and P. Viswanath, “Costly circuits, submodular schedules and approximate Carathéodory theorems,” Queueing Systems, vol. 88, no. 3, p. 311–347, 2018.

[7] H. Liu et al., “Scheduling techniques for hybrid circuit/packet networks,” in ACM CoNEXT, 2015, p. 1–13.

[8] N. Farrington et al., “Helios: A hybrid electrical/optical switch architecture for modular data centers,” in ACM SIGCOMM, vol. 40, no. 4, 2010, p. 339–350.

[9] L. Liu, J. J. Xu, and M. Singh, “LESS: A matrix split and balance algorithm for parallel circuit (optical) or hybrid data center switching and more,” in IEEE/ACM Utility and Cloud Computing (UCC), 2019.

[10] C. Liang et al., “NegotiaToR: Towards a simple yet effective on-demand reconfigurable datacenter network,” in ACM SIGCOMM, 2024.

[11] Y. Zu, A. Ghaffarkhah, H.-V. Dang, B. Towles, S. Hand, S. Huda, A. Bello, A. Kolbasov, A. Rezaei, D. Du et al., “Resiliency at scale: Managing Google’s TPUv4 machine learning supercomputer,” in USENIX NSDI, 2024, p. 761–774.

[12] F. De Marchi, J. Li, Y. Zhang, W. Bai, and Y. Xia, “Unlocking superior performance in reconfigurable data center networks with credit-based transport,” in ACM SIGCOMM, 2025, p. 842–860.

[13] V.
Addanki et al., “Vermilion: A traffic-aware reconfigurable optical interconnect with formal throughput guarantees,” arXiv preprint arXiv:2504.09892, 2025.

[14] K. Liang, L. Qiao, I. Keslassy, and B. Lin, “Scheduling parallel optical switches in datacenters,” IEEE Networking Letters, 2026.

[15] X. Li and M. Hamdi, “On scheduling optical packet switches with reconfiguration delay,” IEEE JSAC, vol. 21, no. 7, p. 1156–1164, 2003.

[16] A. Schrijver, Combinatorial Optimization: Polyhedra and Efficiency, ser. Algorithms and Combinatorics. Berlin: Springer, 2003, vol. 24, no. 2.

[17] R. K. Ahuja, T. L. Magnanti, and J. B. Orlin, Network Flows: Theory, Algorithms, and Applications. Prentice Hall, 1993.

[18] P. Bergé, M. Chaikovskaia, J.-P. Gayon, and A. Quilliot, “Approximation algorithms for job scheduling with reconfigurable resources,” arXiv preprint arXiv:2401.00419, 2023.

[19] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever et al., “Language models are unsupervised multitask learners,” OpenAI blog, vol. 1, no. 8, p. 9, 2019.

[20] W. Li, X. Liu, Y. Li, Y. Jin, H. Tian, Z. Zhong, G. Liu, Y. Zhang, and K. Chen, “Understanding communication characteristics of distributed training,” in Proceedings of the 8th Asia-Pacific Workshop on Networking, 2024, p. 1–8.

[21] “Megatron-DeepSpeed,” https://github.com/microsoft/Megatron-DeepSpeed.

[22] R. Jonker and A. Volgenant, “A shortest augmenting path algorithm for dense and sparse linear assignment problems,” Computing, vol. 38, no. 4, p. 325–340, 1987.

[23] D. F. Crouse, “On implementing 2D rectangular assignment algorithms,” IEEE Transactions on Aerospace and Electronic Systems, vol. 52, no. 4, p. 1679–1696, 2016.

[24] S. Renganathan and N. McKeown, “Chronos: Prescheduled circuit switching for LLM training,” in Proceedings of the 2nd Workshop on Networks for AI Computing, 2025, p. 89–97.

[25] S. Gandhi and C.
Kozyrakis, “MoEtion: Efficient and reliable checkpointing for mixture-of-experts models at scale,” arXiv preprint arXiv:2412.15411, 2024.

Appendix: Sum of k random weighted permutations

We focus on the case where the $n \times n$ demand matrix $D$ is the sum of $k$ random weighted permutations, as in the benchmark workload in the evaluations (before noise is added). We want to show that for a large enough $n$, it is highly likely that $D$'s degree is $k$, meaning there is at least one row or column with exactly $k$ nonzero elements, and therefore we can apply the lower bounds of Section IV to some row/column $i$ using the parameter $k$. We start by proving a property of a single row or column of $D$.

Proposition 1. If $D$ is defined as the sum of $k$ random weighted permutations, then the probability $p$ that there are exactly $k$ nonzero elements in a given row or column of $D$ is

$$p = \frac{n!}{(n-k)! \cdot n^k}. \quad (9)$$

Proof: Consider $k$ colored balls that each randomly choose one of $n$ bins. Then $p$ is the probability that they end up in distinct bins. It is the ratio of the number $n \cdot (n-1) \cdots (n-k+1) = n!/(n-k)!$ of placements of the balls into distinct bins to the total number $n^k$ of possible placements. ∎

We now build a model that approximates the probability that $D$ has degree $k$. This model treats all rows and columns of $D$ as i.i.d. for this approximation.

Figure 11: Probability that an $n \times n$ matrix $D$ that is the sum of $k$ random weighted permutations also has degree $k$. (a) $k$ varies, given $n = 100$. (b) $n$ varies, given $k = 16$.

Proposition 2. If $D$ is defined as the sum of $k$ random weighted permutations, then the probability that its degree is $k$ (and therefore that the lower bounds hold for some row/column $i$ with $k$ nonzero elements) can be approximated as

$$1 - \left(1 - \frac{n!}{(n-k)! \cdot n^k}\right)^{2n}. \quad (10)$$

Proof: We approximate the $2n$ rows and columns as i.i.d., with a probability $p$ that any given row or column has $k$ nonzero elements.
Then the probability that at least one row or column has $k$ nonzero elements is $1 - (1-p)^{2n}$. ∎

Fig. 11(a) illustrates the simulated probability that the degree of $D$ is $k$ in a $100 \times 100$ matrix $D$ that is the sum of $k$ random weighted permutations. It shows that the approximation is close. Fig. 11(b) shows the same probability for a sum of 16 permutations, as $D$'s size $n$ varies. Again, the approximation holds well. Moreover, for $n$ beyond 50, it is highly likely that the degree will be 16.
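Eq. (9) and the approximation of Eq. (10) are easy to check numerically; the following is our own self-contained sketch (not the authors' code), which also compares Eq. (10) against a direct Monte Carlo simulation of $k$ random permutations:

```python
import random

def p_exact(n, k):
    """Eq. (9): p = n!/((n-k)! * n^k), computed as a product."""
    p = 1.0
    for i in range(k):
        p *= (n - i) / n
    return p

def degree_k_approx(n, k):
    """Eq. (10): i.i.d. approximation over the 2n rows and columns."""
    return 1.0 - (1.0 - p_exact(n, k)) ** (2 * n)

def degree_k_simulated(n, k, trials=100, seed=0):
    """Fraction of trials in which the sum of k random permutations
    has a row or column with exactly k nonzero elements."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        perms = [rng.sample(range(n), n) for _ in range(k)]
        # Row r has k nonzeros iff its k column choices are distinct.
        row_hit = any(len({p[r] for p in perms}) == k for r in range(n))
        # Column c has k nonzeros iff its k source rows are distinct.
        invs = []
        for p in perms:
            q = [0] * n
            for r, c in enumerate(p):
                q[c] = r
            invs.append(q)
        col_hit = any(len({q[c] for q in invs}) == k for c in range(n))
        hits += row_hit or col_hit
    return hits / trials

# Benchmark parameters: n = 100 ports, k = 16 flows. The per-row
# probability p is modest, but over 2n = 200 rows and columns a
# degree-k row/column is almost certain to exist.
print(p_exact(100, 16), degree_k_approx(100, 16), degree_k_simulated(100, 16))
```

This matches the appendix's observation that for $n$ beyond 50 (with $k = 16$), $D$ almost surely has degree $k$.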