
Paper deep dive

Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks

Hang-Cheng Dong, Pengcheng Cheng

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 161

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 92%

Last extracted: 3/26/2026, 2:29:15 AM

Summary

The paper develops a differential-geometric framework for analyzing shallow neural networks by treating them as quotient spaces, effectively removing parameter redundancies caused by hidden-unit permutations and rescalings. By defining a regular set of parameters where symmetry orbits are well-behaved, the authors introduce an effective curvature and a symmetry-reduced Hessian that capture intrinsic predictor-level geometry. They demonstrate that gradient flows can be decomposed into horizontal (predictor-changing) and vertical (gauge-varying) components, providing a rigorous basis for understanding implicit bias in overparameterized models.

Entities (6)

Shallow Neural Networks · model-architecture · 100%
Implicit Bias · learning-phenomenon · 95%
Parameter Symmetries · concept · 95%
Quotient Geometry · mathematical-framework · 95%
Effective Curvature · geometric-property · 90%
Quadratic-activation model · specific-model · 90%

Relation Signals (3)

Shallow Neural Networks exhibit Parameter Symmetries

confidence 95% · Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries.

Quotient Geometry removes Parameter Symmetries

confidence 90% · we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries

Effective Curvature characterizes Shallow Neural Networks

confidence 85% · This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian

Cypher Suggestions (2)

Map the relationship between symmetry and quotient space construction. · confidence 85% · unvalidated

MATCH (s:Symmetry)-[:REDUCED_BY]->(q:QuotientSpace) RETURN s, q

Find all concepts related to the geometric analysis of neural networks. · confidence 80% · unvalidated

MATCH (n:Concept)-[:RELATED_TO]->(m:ModelArchitecture) WHERE m.name = 'Shallow Neural Networks' RETURN n, m

Abstract

Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

161,042 characters extracted from source content.


Quotient Geometry, Effective Curvature, and Implicit Bias in Simple Shallow Neural Networks

Hang-Cheng Dong (_d@hit.edu.cn), of Instrumentation Science and Engineering, Harbin Institute of Technology, Harbin 150001, China; Harbin Institute of Technology Suzhou Research Institute, Suzhou 215000, China

Pengcheng Cheng (1022@mails.jlu.edu.cn), of Mathematics, Jilin University, Changchun 130012, China. Corresponding author.

Abstract. Overparameterized shallow neural networks admit substantial parameter redundancy: distinct parameter vectors may represent the same predictor due to hidden-unit permutations, rescalings, and related symmetries. As a result, geometric quantities computed directly in the ambient Euclidean parameter space can reflect artifacts of representation rather than intrinsic properties of the predictor. In this paper, we develop a differential-geometric framework for analyzing simple shallow networks through the quotient space obtained by modding out parameter symmetries on a regular set. We first characterize the symmetry and quotient structure of regular shallow-network parameters and show that the finite-sample realization map induces a natural metric on the quotient manifold. This leads to an effective notion of curvature that removes degeneracy along symmetry orbits and yields a symmetry-reduced Hessian capturing intrinsic local geometry. We then study gradient flows on the quotient and show that only the horizontal component of parameter motion contributes to first-order predictor evolution, while the vertical component corresponds purely to gauge variation. Finally, we formulate an implicit-bias viewpoint at the quotient level, arguing that meaningful complexity should be assigned to predictor classes rather than to individual parameter representatives. In the quadratic-activation model, the quotient object is represented explicitly by a symmetric matrix, allowing both the theory and the numerical experiments to be made fully concrete.
Our experiments confirm that ambient flatness is representation-dependent, that local dynamics are better organized by quotient-level curvature summaries, and that in underdetermined regimes, implicit bias is most naturally described in quotient coordinates. These results support the broader perspective that the natural state space of symmetric shallow networks is the quotient space of predictor classes rather than the raw parameter space.

Keywords: shallow neural networks, quotient geometry, parameter symmetry, effective curvature, Hessian degeneracy, gradient flow, implicit bias, overparameterization, quadratic networks, symmetry-reduced optimization

1 Introduction

Neural networks are usually trained in a high-dimensional Euclidean parameter space, but the objects of real interest are the predictors they realize. Even in shallow architectures, these two spaces are not the same. Hidden-unit permutations, positive rescalings, and related parameter symmetries imply that many different parameter vectors may encode the same function. As a consequence, geometric quantities computed directly in parameter coordinates—such as Hessians, flatness indicators, or norm-based complexities—may reflect redundancy of representation rather than intrinsic properties of the predictor. This mismatch is already present in the simplest overparameterized shallow networks, and it suggests that the ambient parameter space is not the natural state space for analyzing curvature, optimization, or implicit bias.

Shallow networks remain a particularly useful setting in which to study this issue. On the one hand, the classical approximation literature showed that single-hidden-layer networks already possess strong expressive power, beginning with universal approximation theorems and continuing through refined quantitative bounds and norm-based descriptions of representational complexity (Cybenko, 1989; Hornik et al., 1989; Barron, 1993; Pinkus, 1999; Neyshabur et al., 2015).
On the other hand, modern optimization theory has shown that overparameterized two-layer models provide one of the cleanest regimes in which benign nonconvex geometry, global convergence, mean-field limits, and kernel limits can all be analyzed explicitly (Du and Lee, 2018; Chizat and Bach, 2018; Jacot et al., 2018; Mei et al., 2018; Arora et al., 2019b). These two strands together make shallow networks an ideal testing ground for a geometric theory of symmetry-reduced learning. The universal approximation and overparameterized-two-layer perspectives cited here are standard landmarks in the literature. A separate line of work has emphasized symmetry in machine learning, most often in the form of invariance or equivariance with respect to transformations of the input or output space (Bloem-Reddy and Teh, 2020). Our concern in this paper is different. We study parameter symmetries of ordinary shallow networks: transformations of the parameters that leave the realized predictor unchanged. These symmetries generate non-identifiability, degenerate ambient curvature, and gauge dependence in parameter-space complexity measures. Similar concerns have appeared in work on symmetry-invariant optimization and scale-invariant parameterizations (Badrinarayanan et al., 2015), but they are not naturally resolved by staying in the raw Euclidean coordinates of the parameters. The broader symmetry literature in neural networks, including probabilistic formulations of invariance, makes clear that symmetry should be built into the mathematical formalism rather than treated as an afterthought. This observation points naturally to quotient geometry. If one restricts to a regular set on which the symmetry action is well behaved and then quotients out the parameter symmetries, each orbit becomes a locally identifiable predictor class. 
The relevant gradients, Hessians, and curvature quantities should then be defined on this symmetry-reduced space rather than on the original parameter manifold. This viewpoint is classical in differential geometry (Lee, 2003, 2018) and has been highly successful in optimization over matrix factorizations and fixed-rank models, where quotient-manifold methods separate intrinsic variation from pure gauge motion (Absil et al., 2008; Edelman et al., 1998; Boumal, 2023; Meyer et al., 2011; Vandereycken, 2013; Mishra et al., 2014). In particular, low-rank positive semidefinite regression and related factorized problems show that once one passes to the appropriate quotient space, many apparently degenerate directions in the ambient parameterization disappear, and the remaining geometry becomes both interpretable and algorithmically useful. The fixed-rank PSD regression literature is a direct precedent for the kind of quotient construction we use here.

The main thesis of this paper is that the same principle applies to simple shallow neural networks. The natural geometric object is not the raw parameter vector θ, but its equivalence class under parameter symmetries. On the resulting quotient space, the finite-sample realization map induces a canonical local metric; the associated Hessian defines an effective curvature that removes spurious flatness along symmetry directions; and the optimization dynamics admit a decomposition into vertical motions, which only change the representative within an orbit, and horizontal motions, which change the realized predictor. From this perspective, several familiar difficulties of neural-network theory—non-identifiability, singular Hessians, ambiguous flatness, and representation-dependent complexity—become different manifestations of the same underlying fact: the ambient parameter space is too large.
We develop this viewpoint for structurally simple shallow networks, precisely because they allow the quotient structure to be made explicit. Our analysis begins with the symmetry and quotient structure of the regular parameter set. We then introduce a function-induced metric on the quotient and show that it yields a notion of effective curvature more faithful to the predictor-level geometry than the ambient Euclidean Hessian. Next, we study gradient flow on the quotient and prove that only the horizontal component of the parameter velocity contributes to first-order function evolution. Finally, we formulate an implicit-bias principle at the quotient level, arguing that whenever training selects among multiple feasible solutions, the meaningful notion of complexity should be defined on predictor classes rather than on individual parameter representatives.

A particularly illuminating case is the quadratic-activation model, where the quotient object can be represented explicitly by a symmetric matrix $Q$. In this setting, the network

$f_\theta(x) = \sum_{i=1}^m a_i (w_i^\top x)^2$

depends on $\theta$ only through

$Q(\theta) = \sum_{i=1}^m a_i w_i w_i^\top.$

This connects shallow-network training to the geometry of low-rank matrix factorization and makes the quotient structure directly visible. It also links our work to the broader literature on implicit regularization in factorized models, where optimization is now understood to prefer certain low-complexity factorizations even without explicit regularization (Gunasekar et al., 2017; Arora et al., 2019a; Soudry et al., 2018). Our goal is not to reduce neural-network training to matrix factorization, but rather to use this tractable model to expose the geometry that is hidden in more general shallow networks.

The contribution of the paper is therefore conceptual and geometric. We provide a symmetry-reduced framework for understanding shallow-network curvature, optimization dynamics, and implicit bias.
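The quadratic-activation identity above is easy to check numerically. The following sketch (not from the paper; dimensions and seeds are arbitrary) builds a random quadratic network, forms $Q(\theta)$, and verifies both that the prediction equals $x^\top Q(\theta)\, x$ and that $Q$ is invariant under a permutation-plus-rescaling of the hidden units:

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3                      # width and input dimension (arbitrary)
a = rng.normal(size=m)           # output-layer coefficients a_i
W = rng.normal(size=(m, d))      # hidden weights w_i (rows of W)
x = rng.normal(size=d)

def f(a, W, x):
    # f_theta(x) = sum_i a_i (w_i^T x)^2
    return np.sum(a * (W @ x) ** 2)

def Q(a, W):
    # Q(theta) = sum_i a_i w_i w_i^T
    return (W.T * a) @ W

# The predictor depends on theta only through Q(theta):
assert np.isclose(f(a, W, x), x @ Q(a, W) @ x)

# Symmetry action for p = 2: a_i -> c_i^{-2} a_i, w_i -> c_i w_i, then permute.
perm = rng.permutation(m)
c = rng.uniform(0.5, 2.0, size=m)
a2, W2 = (a / c**2)[perm], (c[:, None] * W)[perm]
assert np.isclose(f(a2, W2, x), f(a, W, x))   # same predictor
assert np.allclose(Q(a2, W2), Q(a, W))        # same quotient representative
```

The symmetric matrix $Q(\theta)$ is exactly the orbit invariant: two parameter points related by the group action produce the same $Q$.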
The resulting picture is coherent: symmetry creates redundancy; quotienting removes that redundancy; curvature should be measured on the quotient; gradient flow should be decomposed into vertical and horizontal components; and simplicity should be defined on predictor classes rather than on coordinates in an overcomplete parameterization.

2 Symmetry and Quotient Structure of Shallow Networks

We consider a class of shallow scalar-output neural networks of the form

$f_\theta(x) = \sum_{i=1}^m a_i \sigma(w_i^\top x), \quad x \in \mathbb{R}^d,$ (2.1)

where $m$ is the width, $a_i \in \mathbb{R}$ are output-layer coefficients, $w_i \in \mathbb{R}^d$ are hidden weights, and $\sigma : \mathbb{R} \to \mathbb{R}$ is a fixed activation function. Throughout this section, the parameter vector is denoted by

$\theta = ((a_1, w_1), \dots, (a_m, w_m)) \in \Theta := (\mathbb{R} \times \mathbb{R}^d)^m.$ (2.2)

Our goal is to isolate the geometric structure created by parameter symmetries. The main point is that $\Theta$ is not the appropriate space of effective model degrees of freedom: many distinct parameter values represent the same predictor, and the resulting redundancy is not incidental but structural. This redundancy is the source of a large class of degenerate directions in local optimization geometry. We therefore begin by identifying the symmetry group, the corresponding quotient structure, and the singular configurations at which this quotient ceases to be locally regular.

2.1 Parameter symmetries

The model admits two basic forms of symmetry. First, the hidden units are unordered. For any permutation $\pi \in S_m$, define

$\pi \cdot \theta := ((a_{\pi(1)}, w_{\pi(1)}), \dots, (a_{\pi(m)}, w_{\pi(m)})).$ (2.3)

Since the network output is a sum over hidden units, we have

$f_{\pi \cdot \theta}(x) = f_\theta(x)$ (2.4)

for all $x \in \mathbb{R}^d$. Thus, the symmetric group $S_m$ acts on $\Theta$ by reindexing hidden units without changing the realized function. Second, when the activation is homogeneous, the model also admits a continuous scaling symmetry.
We assume in this section that $\sigma$ is positively $p$-homogeneous for some $p > 0$, that is,

$\sigma(ct) = c^p \sigma(t), \quad c > 0, \ t \in \mathbb{R}.$ (2.5)

For each $c = (c_1, \dots, c_m) \in (\mathbb{R}_{>0})^m$, define

$c \cdot \theta := ((c_1^{-p} a_1, c_1 w_1), \dots, (c_m^{-p} a_m, c_m w_m)).$ (2.6)

By homogeneity,

$(c_i^{-p} a_i)\,\sigma((c_i w_i)^\top x) = a_i \sigma(w_i^\top x),$ (2.7)

and therefore

$f_{c \cdot \theta}(x) = f_\theta(x).$ (2.8)

Hence the group $(\mathbb{R}_{>0})^m$ acts on $\Theta$ by neuronwise rescaling. Combining permutation and scaling symmetries yields the transformation group

$G := S_m \ltimes (\mathbb{R}_{>0})^m,$ (2.9)

acting on $\Theta$ in the natural way. The semidirect product structure reflects the fact that permutations relabel the coordinates on which the scaling group acts. By construction,

$f_{g \cdot \theta} = f_\theta$ (2.10)

for every $g \in G$ and every $\theta \in \Theta$. The orbit

$G \cdot \theta := \{ g \cdot \theta : g \in G \}$ (2.11)

therefore consists entirely of different parameterizations of the same predictor. This elementary observation has an important geometric consequence. Any local analysis performed directly in the Euclidean parameter space $\Theta$ necessarily contains directions tangent to the orbit $G \cdot \theta$, and these directions do not correspond to any change in the realized function. As a result, degeneracies in the Euclidean geometry of the loss surface are partly induced by the representation and not by the predictor itself. The appropriate local object is therefore not the ambient parameter space $\Theta$, but a quotient of $\Theta$ by the symmetry group $G$.

2.2 Realization map and local identifiability

Let $\mathcal{F}$ denote a function space containing the realizations $f_\theta$; for concreteness one may take $\mathcal{F}$ to be the linear span or closure of $\{ \sigma(w^\top \cdot) : w \in \mathbb{R}^d \}$ in a topology adapted to the learning problem. Define the realization map

$\Phi : \Theta \to \mathcal{F}, \quad \Phi(\theta) = f_\theta.$ (2.12)

The group action described above is contained in the fibers of $\Phi$: if $\theta' \in G \cdot \theta$, then $\Phi(\theta') = \Phi(\theta)$.
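The invariances (2.3)–(2.8) can be sanity-checked numerically. The sketch below (illustrative only; sizes are arbitrary) takes $\sigma = \mathrm{ReLU}$, which is positively 1-homogeneous, and confirms that neuronwise positive rescaling and hidden-unit permutation leave the realized function unchanged on a batch of random inputs:

```python
import numpy as np

rng = np.random.default_rng(1)
m, d, p = 4, 3, 1                  # p = 1: ReLU is positively 1-homogeneous
a = rng.normal(size=m)
W = rng.normal(size=(m, d))
X = rng.normal(size=(10, d))       # a batch of test inputs

relu = lambda t: np.maximum(t, 0.0)

def f(a, W, X):
    # f_theta(x) = sum_i a_i sigma(w_i^T x), evaluated on each row of X
    return relu(X @ W.T) @ a

# Scaling action (2.6): a_i -> c_i^{-p} a_i, w_i -> c_i w_i, with c_i > 0
c = rng.uniform(0.5, 2.0, size=m)
assert np.allclose(f(a / c**p, c[:, None] * W, X), f(a, W, X))

# Permutation action (2.3): reindex the hidden units
perm = rng.permutation(m)
assert np.allclose(f(a[perm], W[perm], X), f(a, W, X))
```

The same check works for any positively $p$-homogeneous activation by replacing `relu` and `p` accordingly.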
In general, however, the fibers of $\Phi$ can be larger than group orbits. This is precisely where singular behavior enters. To formulate a regular quotient structure, we restrict attention to parameter configurations at which the only local non-identifiability is the symmetry encoded by $G$. Intuitively, such configurations should exclude vanishing neurons, collisions between distinct neurons, and other forms of linear dependence that create additional degeneracy in the map $\Phi$. We therefore introduce a regular subset $\Theta_{\mathrm{reg}} \subset \Theta$ characterized by two requirements. First, the isotropy of the group action should be minimal, so that the orbit dimension is locally constant. Second, the differential $D\Phi_\theta$ should have kernel exactly equal to the tangent space of the group orbit. Formally, we set

$\Theta_{\mathrm{reg}} := \{ \theta \in \Theta : \ker(D\Phi_\theta) = T_\theta(G \cdot \theta) \text{ and the } G\text{-action is locally free at } \theta \}.$ (2.13)

The first condition says that every infinitesimal perturbation leaving the function unchanged is generated by a symmetry of the parameterization. The second rules out orbit-dimension collapse. Together they identify the region in which the quotient by symmetry captures the full local geometry of the model class.

The precise analytic characterization of $\Theta_{\mathrm{reg}}$ depends on the activation and on the function space $\mathcal{F}$, but its interpretation is robust. Typical regularity conditions exclude at least the following pathologies:

1. Vanishing neurons: $a_i = 0$ or $w_i = 0$, which make the $i$-th unit functionally inactive and enlarge the stabilizer of the action.
2. Neuron collisions: $w_i = w_j$ for $i \neq j$, or more generally positive collinearity in homogeneous models, which causes different hidden units to realize the same feature.
3. Additional linear dependence among the feature functions $x \mapsto \sigma(w_i^\top x)$ and their first-order parameter derivatives, which creates nontrivial kernel directions for $D\Phi_\theta$ beyond the orbit directions.
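The distinction between regular and singular points can be seen directly in the finite-sample Jacobian. The sketch below (a numerical illustration not taken from the paper) uses $\sigma = \mathrm{ReLU}$ with an analytic Jacobian of $\Phi_X$: at a generic point the nullity equals the orbit dimension $m$, while a neuron collision ($w_1 = w_2$) enlarges the kernel beyond the orbit directions:

```python
import numpy as np

rng = np.random.default_rng(2)
m, d, n = 4, 3, 60                 # width, input dimension, sample count
a = rng.normal(size=m)
W = rng.normal(size=(m, d))
X = rng.normal(size=(n, d))

def jacobian(a, W, X):
    # Jacobian of the finite-sample realization map Phi_X for sigma = ReLU:
    #   df/da_i = relu(w_i^T x),   df/dw_i = a_i * 1[w_i^T x > 0] * x
    Z = X @ W.T                               # (n, m) preactivations
    act = np.maximum(Z, 0.0)                  # columns for the a_i
    ind = (Z > 0).astype(float)               # active-set indicators
    Jw = (ind * a)[:, :, None] * X[:, None, :]  # (n, m, d) blocks for the w_i
    return np.concatenate([act, Jw.reshape(n, m * d)], axis=1)

def nullity(J, tol=1e-8):
    s = np.linalg.svd(J, compute_uv=False)
    return J.shape[1] - np.sum(s > tol * s[0])

# Generic point: ker(D Phi_X) is exactly the scaling-orbit tangent space (dim m).
assert nullity(jacobian(a, W, X)) == m

# Neuron collision (pathology 2): w_1 = w_2 enlarges the kernel beyond the orbit.
W_sing = W.copy(); W_sing[1] = W_sing[0]
assert nullity(jacobian(a, W_sing, X)) > m
```

These conditions are exactly what the regular set (2.13) is designed to rule out: away from such configurations, every function-invisible direction is a symmetry direction.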
These conditions are not introduced merely for technical convenience. They distinguish parameter points at which the model is locally a smooth quotient of a Euclidean space from points at which the representation itself changes rank. The former support a clean differential-geometric description; the latter form the singular part of parameter space.

2.3 Orbit geometry and the regular quotient

We next describe the local geometry of the symmetry orbits. Since the group $S_m$ is discrete, only the scaling component contributes to the tangent space of the orbit. For $\theta \in \Theta$, the continuous part of the orbit is generated by curves of the form

$t \mapsto ((e^{-pt\xi_1} a_1, e^{t\xi_1} w_1), \dots, (e^{-pt\xi_m} a_m, e^{t\xi_m} w_m)), \quad \xi \in \mathbb{R}^m.$ (2.14)

Differentiating at $t = 0$ yields the tangent vectors

$\delta_\xi \theta = ((-p\xi_1 a_1, \xi_1 w_1), \dots, (-p\xi_m a_m, \xi_m w_m)).$ (2.15)

Hence the tangent space to the continuous orbit at $\theta$ is

$T_\theta(G \cdot \theta) = \{ ((-p\xi_1 a_1, \xi_1 w_1), \dots, (-p\xi_m a_m, \xi_m w_m)) : \xi \in \mathbb{R}^m \},$ (2.16)

provided the action is locally free. By construction,

$D\Phi_\theta[\delta_\xi \theta] = 0,$ (2.17)

which expresses infinitesimally the invariance of the network under scaling. The following statement records the geometric meaning of the regular set.

Proposition 2.1 (Orbit directions and exact infinitesimal redundancy). Let $\Phi : \Theta \to \mathcal{F}$, $\Phi(\theta) = f_\theta$, be the realization map of the shallow network $f_\theta(x) = \sum_{i=1}^m a_i \sigma(w_i^\top x)$, and let $G = S_m \ltimes (\mathbb{R}_{>0})^m$ act on $\Theta$ by permutation and neuronwise positive rescaling. If $\theta \in \Theta_{\mathrm{reg}}$, then

$\ker(D\Phi_\theta) = T_\theta(G \cdot \theta).$

Consequently, the quotient tangent space $T_{[\theta]}(\Theta_{\mathrm{reg}}/G)$ is canonically identified with the space of first-order function variations generated by perturbations of $\theta$.

Proof. By definition, $\Theta_{\mathrm{reg}}$ consists of those points $\theta \in \Theta$ such that (1) the action of $G$ is locally free at $\theta$, and (2) the differential of the realization map satisfies $\ker(D\Phi_\theta) = T_\theta(G \cdot \theta)$. Therefore, for every $\theta \in \Theta_{\mathrm{reg}}$, the equality $\ker(D\Phi_\theta) = T_\theta(G \cdot \theta)$ holds immediately. What remains is to justify that $T_\theta(G \cdot \theta)$ is indeed the space of infinitesimal symmetry directions, and that these directions lie in the kernel of $D\Phi_\theta$.

Since the permutation group $S_m$ is discrete, it contributes no infinitesimal directions. Hence the tangent space to the orbit is generated entirely by the continuous scaling subgroup $(\mathbb{R}_{>0})^m$. Let $\xi = (\xi_1, \dots, \xi_m) \in \mathbb{R}^m$, and define a smooth curve $c_\xi : (-\varepsilon, \varepsilon) \to \Theta$ by

$c_\xi(t) = ((e^{-pt\xi_1} a_1, e^{t\xi_1} w_1), \dots, (e^{-pt\xi_m} a_m, e^{t\xi_m} w_m)).$

For each $t$, $c_\xi(t) \in G \cdot \theta$, since it is obtained from $\theta$ by neuronwise positive rescaling. Therefore $\dot{c}_\xi(0) \in T_\theta(G \cdot \theta)$. A direct differentiation gives

$\dot{c}_\xi(0) = ((-p\xi_1 a_1, \xi_1 w_1), \dots, (-p\xi_m a_m, \xi_m w_m)).$

Now, because $\Phi$ is invariant under the action of $G$, $\Phi(c_\xi(t)) = \Phi(\theta)$ for all $t$. Differentiating at $t = 0$ yields $D\Phi_\theta[\dot{c}_\xi(0)] = 0$. Thus $T_\theta(G \cdot \theta) \subseteq \ker(D\Phi_\theta)$. At a regular point, the reverse inclusion holds by definition of $\Theta_{\mathrm{reg}}$. Therefore $\ker(D\Phi_\theta) = T_\theta(G \cdot \theta)$. This proves the first claim.

Let $q : \Theta_{\mathrm{reg}} \to \Theta_{\mathrm{reg}}/G$ be the quotient map, and let $[\theta] = q(\theta)$ denote the orbit of $\theta$. Because the action is locally free and proper on $\Theta_{\mathrm{reg}}$, the quotient $\Theta_{\mathrm{reg}}/G$ is a smooth manifold, and the differential $Dq_\theta : T_\theta \Theta_{\mathrm{reg}} \to T_{[\theta]}(\Theta_{\mathrm{reg}}/G)$ is surjective with kernel $\ker(Dq_\theta) = T_\theta(G \cdot \theta)$. Therefore, by the first isomorphism theorem for vector spaces,

$T_{[\theta]}(\Theta_{\mathrm{reg}}/G) \cong T_\theta \Theta_{\mathrm{reg}} / T_\theta(G \cdot \theta).$
Using the identity established in the first part of the proof, $T_\theta(G \cdot \theta) = \ker(D\Phi_\theta)$, we obtain

$T_{[\theta]}(\Theta_{\mathrm{reg}}/G) \cong T_\theta \Theta_{\mathrm{reg}} / \ker(D\Phi_\theta).$

Now define the linear map

$\widetilde{D\Phi_\theta} : T_\theta \Theta_{\mathrm{reg}} / \ker(D\Phi_\theta) \to \operatorname{Im}(D\Phi_\theta)$

by $\widetilde{D\Phi_\theta}([v]) = D\Phi_\theta[v]$. This map is well defined: if $v - v' \in \ker(D\Phi_\theta)$, then $D\Phi_\theta[v - v'] = 0$, hence $D\Phi_\theta[v] = D\Phi_\theta[v']$. It is clearly linear, surjective by definition of $\operatorname{Im}(D\Phi_\theta)$, and injective because $\widetilde{D\Phi_\theta}([v]) = 0 \implies D\Phi_\theta[v] = 0 \implies v \in \ker(D\Phi_\theta) \implies [v] = 0$. Hence $\widetilde{D\Phi_\theta}$ is a linear isomorphism. Combining the previous identifications gives

$T_{[\theta]}(\Theta_{\mathrm{reg}}/G) \cong \operatorname{Im}(D\Phi_\theta).$

Since $\operatorname{Im}(D\Phi_\theta)$ is precisely the space of first-order variations of the realized function generated by perturbations of $\theta$, this proves the second claim.

The proposition is immediate from the definition of $\Theta_{\mathrm{reg}}$, but it expresses the key structural fact behind the rest of the paper: on the regular set, the only infinitesimally invisible directions are the symmetry directions. All other perturbations correspond to genuine local changes of the predictor. This observation permits a local quotient description. Since the action is locally free on $\Theta_{\mathrm{reg}}$ and the orbit dimension is constant there, standard quotient arguments imply that the regular parameter space modulo symmetry inherits a smooth manifold structure.

Theorem 2.1 (Regular quotient manifold). Assume that $\sigma$ is $C^1$ away from the origin and positively $p$-homogeneous, and let $\Theta_{\mathrm{reg}} \subset \Theta$ be a $G$-invariant subset on which the action of $G = S_m \ltimes (\mathbb{R}_{>0})^m$ is locally free and proper and such that

$\ker(D\Phi_\theta) = T_\theta(G \cdot \theta)$ for all $\theta \in \Theta_{\mathrm{reg}},$

where $\Phi : \Theta \to \mathcal{F}$ is the realization map $\Phi(\theta) = f_\theta$. Then the quotient $\mathcal{M}_{\mathrm{reg}} := \Theta_{\mathrm{reg}}/G$ is a smooth manifold.
Moreover, the realization map $\Phi$ descends to a locally injective smooth map

$\bar{\Phi} : \mathcal{M}_{\mathrm{reg}} \to \mathcal{F}, \quad \bar{\Phi}([\theta]) = \Phi(\theta),$

whose differential is injective at every point.

Proof. We first verify that the assumptions place us within the standard quotient-manifold framework. Since $\Theta = (\mathbb{R} \times \mathbb{R}^d)^m$ is a finite-dimensional smooth manifold, every $G$-invariant subset $\Theta_{\mathrm{reg}} \subset \Theta$ that is itself a smooth embedded submanifold inherits a smooth manifold structure. We regard $\Theta_{\mathrm{reg}}$ with this induced structure. The group $G = S_m \ltimes (\mathbb{R}_{>0})^m$ is a Lie group: $S_m$ is a finite discrete Lie group and $(\mathbb{R}_{>0})^m$ is an $m$-dimensional Lie group; hence their semidirect product is again a Lie group. By assumption, the action

$\alpha : G \times \Theta_{\mathrm{reg}} \to \Theta_{\mathrm{reg}}, \quad (g, \theta) \mapsto g \cdot \theta$

is smooth, proper, and locally free. A standard theorem in differential geometry states that if a Lie group acts smoothly, properly, and locally freely on a smooth manifold $M$, then the orbit space $M/G$ carries a unique smooth manifold structure such that the quotient map $q : M \to M/G$ is a smooth submersion. Applying this with $M = \Theta_{\mathrm{reg}}$, we conclude that $\mathcal{M}_{\mathrm{reg}} = \Theta_{\mathrm{reg}}/G$ is a smooth manifold and that the quotient map $q : \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}}$ is a smooth submersion. In particular, for every $\theta \in \Theta_{\mathrm{reg}}$, the differential $Dq_\theta : T_\theta \Theta_{\mathrm{reg}} \to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is surjective, and its kernel is the tangent space to the orbit: $\ker(Dq_\theta) = T_\theta(G \cdot \theta)$.

We next show that $\Phi$ factors through $q$. By construction of the group action, for every $g \in G$ and every $\theta \in \Theta_{\mathrm{reg}}$, $\Phi(g \cdot \theta) = \Phi(\theta)$. Indeed, permutations merely reorder hidden units, while positive rescalings preserve each summand by $p$-homogeneity of $\sigma$. Therefore $\Phi$ is constant on every $G$-orbit. Hence there exists a unique set-theoretic map $\bar{\Phi} : \mathcal{M}_{\mathrm{reg}} \to \mathcal{F}$ such that

$\Phi = \bar{\Phi} \circ q,$ equivalently $\bar{\Phi}([\theta]) = \Phi(\theta).$

We now prove that $\bar{\Phi}$ is smooth.
Since $q$ is a smooth submersion, every point $[\theta] \in \mathcal{M}_{\mathrm{reg}}$ admits an open neighborhood $U \subset \mathcal{M}_{\mathrm{reg}}$ and a smooth local section $s : U \to \Theta_{\mathrm{reg}}$ such that $q \circ s = \mathrm{id}_U$. On $U$, the descended map is given by $\bar{\Phi}|_U = \Phi \circ s$. Since both $\Phi$ and $s$ are smooth, $\bar{\Phi}|_U$ is smooth. Because $[\theta]$ was arbitrary, $\bar{\Phi}$ is smooth on all of $\mathcal{M}_{\mathrm{reg}}$.

Fix $\theta \in \Theta_{\mathrm{reg}}$. Since $\Phi = \bar{\Phi} \circ q$, differentiation yields $D\Phi_\theta = D\bar{\Phi}_{[\theta]} \circ Dq_\theta$. Because $Dq_\theta$ is surjective, this identity completely determines $D\bar{\Phi}_{[\theta]}$. We claim that $D\bar{\Phi}_{[\theta]}$ is injective. Let $u \in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ satisfy $D\bar{\Phi}_{[\theta]}[u] = 0$. Since $Dq_\theta$ is surjective, there exists $v \in T_\theta \Theta_{\mathrm{reg}}$ such that $Dq_\theta[v] = u$. Applying the chain rule,

$0 = D\bar{\Phi}_{[\theta]}[u] = D\bar{\Phi}_{[\theta]}(Dq_\theta[v]) = D\Phi_\theta[v].$

Thus $v \in \ker(D\Phi_\theta)$. By the regularity assumption, $\ker(D\Phi_\theta) = T_\theta(G \cdot \theta)$. Since $\ker(Dq_\theta) = T_\theta(G \cdot \theta)$, it follows that $Dq_\theta[v] = 0$. But $Dq_\theta[v] = u$, hence $u = 0$. Therefore $D\bar{\Phi}_{[\theta]}$ is injective. Since $[\theta]$ was arbitrary, $D\bar{\Phi}$ is injective everywhere on $\mathcal{M}_{\mathrm{reg}}$.

Finally, we prove that $\bar{\Phi}$ is locally injective. Because $D\bar{\Phi}_{[\theta]}$ is injective at every point, $\bar{\Phi}$ is an immersion. By the constant-rank theorem, for every $[\theta] \in \mathcal{M}_{\mathrm{reg}}$ there exists an open neighborhood $U \ni [\theta]$ such that $\bar{\Phi}(U)$ is an immersed submanifold of $\mathcal{F}$ and $\bar{\Phi}|_U : U \to \bar{\Phi}(U)$ is a diffeomorphism onto its image. In particular, $\bar{\Phi}|_U$ is injective. Therefore $\bar{\Phi}$ is locally injective.

The theorem formalizes the claim that, away from singular configurations, the effective parameter space of the model is not $\Theta$ but the quotient manifold $\mathcal{M}_{\mathrm{reg}}$. Local coordinates on $\mathcal{M}_{\mathrm{reg}}$ describe distinct predictors rather than distinct parameterizations.
In particular, a tangent vector at $[\theta] \in \mathcal{M}_{\mathrm{reg}}$ represents a genuine first-order deformation of the function $f_\theta$, rather than an artifact of neuron rescaling or reordering. The local injectivity of $\bar{\Phi}$ is the quotient-space version of local identifiability. It implies that, on the regular set, the model class can be studied through the differential geometry of $\mathcal{M}_{\mathrm{reg}}$ without ambiguity from redundant parameter coordinates. This is the geometric setting in which effective metrics and effective Hessians will be defined in the sequel.

2.4 Gauge choices and local sections

Although the quotient manifold $\mathcal{M}_{\mathrm{reg}}$ is the intrinsic object, calculations are often most convenient in a local coordinate system obtained by fixing a representative inside each orbit. Such a representative selection is a local gauge choice. A simple example is provided by neuronwise normalization of hidden weights. On any region where $w_i \neq 0$ for all $i$, one may impose

$\|w_i\| = 1, \quad i = 1, \dots, m,$ (2.18)

and absorb the scaling into $a_i$. This removes the continuous rescaling freedom, leaving only the discrete permutation symmetry. More generally, a local section of the quotient map

$q : \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}}$ (2.19)

is a smooth map $s : U \to \Theta_{\mathrm{reg}}$ defined on an open set $U \subset \mathcal{M}_{\mathrm{reg}}$ such that $q \circ s = \mathrm{id}_U$. Any such section identifies a neighborhood in the quotient with a gauge-fixed submanifold of parameter space transverse to the orbits. In practice, this permits one to work with reduced coordinates while retaining an invariant interpretation of the resulting objects.

The role of gauge fixing in our setting is conceptual rather than algorithmic. We do not use a gauge to eliminate redundancy globally; instead, local sections serve to compare Euclidean quantities in the parameter space with intrinsic quantities on the quotient.
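The normalization gauge (2.18) is straightforward to apply in code. The sketch below (illustrative only, with $\sigma = \mathrm{ReLU}$, $p = 1$, and a lexicographic sort added here to also fix the residual permutation freedom) maps a parameter point to a gauge-fixed representative with unit-norm hidden weights, and checks that two points on the same $G$-orbit reduce to the same representative:

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p = 4, 3, 1                      # ReLU: positively 1-homogeneous

def gauge_fix(a, W):
    # Scaling gauge (2.18): impose ||w_i|| = 1 and absorb the norm into a_i,
    # using a_i -> a_i ||w_i||^p, w_i -> w_i / ||w_i|| (valid when w_i != 0).
    norms = np.linalg.norm(W, axis=1)
    a_hat, W_hat = a * norms**p, W / norms[:, None]
    # Fix the residual permutation symmetry by sorting units deterministically.
    order = np.lexsort(np.column_stack([a_hat, W_hat]).T)
    return a_hat[order], W_hat[order]

a = rng.normal(size=m)
W = rng.normal(size=(m, d))

# A second point on the same G-orbit: rescaled and permuted.
c = rng.uniform(0.5, 2.0, size=m)
perm = rng.permutation(m)
a2, W2 = (a / c**p)[perm], (c[:, None] * W)[perm]

# Both parameterizations reduce to the same gauge-fixed representative.
ga, gW = gauge_fix(a, W)
ga2, gW2 = gauge_fix(a2, W2)
assert np.allclose(ga, ga2) and np.allclose(gW, gW2)
```

The gauge-fixed pair plays the role of a concrete local section: it picks one point per orbit, so orbit-invariant quantities can be compared directly in these reduced coordinates.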
In particular, later constructions will show that the geometrically meaningful directions are precisely those transverse to the group orbits, whereas tangent orbit directions encode pure reparameterization effects.

2.5 Singular parameter configurations

The quotient description above breaks down when the regularity assumptions fail. These failures are not exceptional curiosities: they are built into overparameterized neural models and must be understood as part of the global geometry. We define the singular set by

$\Theta_{\mathrm{sing}} := \Theta \setminus \Theta_{\mathrm{reg}}.$ (2.20)

At a point $\theta \in \Theta_{\mathrm{sing}}$, at least one of the following occurs:

• the isotropy subgroup of $G$ is larger than its generic value, so that the orbit dimension drops;
• the kernel of $D\Phi_\theta$ strictly contains $T_\theta(G \cdot \theta)$, so that there exist infinitesimally function-invisible directions not generated by symmetry;
• the local image of $\Phi$ changes dimension.

Typical examples include inactive neurons, coincident neurons, and configurations at which several hidden units jointly collapse into a lower-complexity representation. At such points, the quotient by symmetry is no longer locally a smooth manifold, because the equivalence classes meet regions with different local dimensions or different isotropy types. The correct geometric object is then not a manifold but a stratified space. Informally, $\Theta_{\mathrm{sing}}$ can be decomposed into pieces on which the orbit type and the rank of $D\Phi_\theta$ are constant, and each such piece behaves like a smooth manifold of lower dimension.

Proposition 2.2 (Semialgebraic stratified singular structure for the finite-sample realization map). Fix a dataset $X = (x_1, \dots, x_n) \in (\mathbb{R}^d)^n$, and define the finite-sample realization map

$\Phi_X : \Theta \to \mathbb{R}^n, \quad \Phi_X(\theta) = (f_\theta(x_1), \dots, f_\theta(x_n)),$

where $f_\theta(x) = \sum_{i=1}^m a_i \sigma(w_i^\top x)$ and $\Theta \cong \mathbb{R}^{m(d+1)}$. Let $G = S_m \ltimes (\mathbb{R}_{>0})^m$ act on $\Theta$ by hidden-unit permutation and neuronwise positive rescaling.
Assume that:
1. $\sigma:\mathbb{R}\to\mathbb{R}$ is semialgebraic and piecewise $C^1$;
2. the induced action of $G$ on $\Theta$ is semialgebraic and proper.
Define the regular set
$$\Theta_{\mathrm{reg}}:=\big\{\theta\in\Theta:\ \text{the $G$-action is locally free at $\theta$ and }\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta)\big\},$$
and let $\Theta_{\mathrm{sing}}:=\Theta\setminus\Theta_{\mathrm{reg}}$. Then the following hold.
1. The sets $\Theta_{\mathrm{reg}}$ and $\Theta_{\mathrm{sing}}$ are semialgebraic.
2. There exists a finite semialgebraic Whitney stratification $\Theta=\bigsqcup_{\alpha\in A}S_\alpha$ such that each stratum $S_\alpha$ is a smooth locally closed semialgebraic submanifold, each $S_\alpha$ is $G$-invariant, and both
$$\theta\mapsto\dim T_\theta(G\cdot\theta)\qquad\text{and}\qquad\theta\mapsto\operatorname{rank}(D\Phi_X(\theta))$$
are constant on each stratum.
3. The singular set $\Theta_{\mathrm{sing}}$ is a union of strata in this decomposition.
4. The quotient $\Theta/G$ is a semialgebraic stratified space, and its top smooth stratum is $\mathcal{M}_{\mathrm{reg}}=\Theta_{\mathrm{reg}}/G$.

Proof. Since $\sigma$ is semialgebraic, the map $(a_i,w_i)\mapsto a_i\,\sigma(w_i^\top x_k)$ is semialgebraic for every $i$ and every sample point $x_k$. Summing over $i=1,\dots,m$, we obtain that each coordinate $\theta\mapsto f_\theta(x_k)$ is semialgebraic. Therefore the finite-sample realization map $\Phi_X:\Theta\to\mathbb{R}^n$ is semialgebraic. Because $\sigma$ is piecewise $C^1$, $\Phi_X$ is piecewise $C^1$ as well. The group $G=S_m\ltimes(\mathbb{R}_{>0})^m$ is semialgebraic, and its action on $\Theta$ is semialgebraic by assumption. Since $\Phi_X$ is $G$-invariant, we have
$$T_\theta(G\cdot\theta)\subseteq\ker(D\Phi_X(\theta))$$
at every point where the differential is defined. Now, on each $C^1$ piece, the rank function $\theta\mapsto\operatorname{rank}(D\Phi_X(\theta))$ is semialgebraic, because it is determined by the vanishing and nonvanishing of minors of the Jacobian matrix of $\Phi_X$. Likewise, the orbit-dimension function $\theta\mapsto\dim T_\theta(G\cdot\theta)$ is semialgebraic, since it is the rank of the differential of the action map at the identity.
Because $T_\theta(G\cdot\theta)\subseteq\ker(D\Phi_X(\theta))$, the equality $\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta)$ is equivalent to the dimension identity
$$\dim\ker(D\Phi_X(\theta))=\dim T_\theta(G\cdot\theta).$$
By rank–nullity,
$$\dim\ker(D\Phi_X(\theta))=\dim\Theta-\operatorname{rank}(D\Phi_X(\theta)),$$
so this is a semialgebraic condition. Local freeness of the action is also semialgebraic, since it is equivalent to discreteness of the isotropy group, or equivalently to maximal orbit dimension in a neighborhood. Hence $\Theta_{\mathrm{reg}}$ is semialgebraic, and so is its complement $\Theta_{\mathrm{sing}}$. This proves claim (1).

By semialgebraic stratification theory, every semialgebraic set admits a finite Whitney stratification by smooth locally closed semialgebraic submanifolds. Moreover, given finitely many semialgebraic subsets and semialgebraic maps, one may choose a Whitney stratification compatible with all of them. Apply this to the semialgebraic manifold $\Theta$, the semialgebraic subsets $\Theta_{\mathrm{reg}}$ and $\Theta_{\mathrm{sing}}$, and the semialgebraic map $\Phi_X$. After refinement if necessary, we obtain a finite Whitney stratification $\Theta=\bigsqcup_{\alpha\in A}S_\alpha$ such that:
• each $S_\alpha$ is a smooth locally closed semialgebraic submanifold;
• $\Phi_X|_{S_\alpha}$ is $C^1$ and has constant differential rank on $S_\alpha$;
• $\Theta_{\mathrm{reg}}$ and $\Theta_{\mathrm{sing}}$ are unions of strata.
Because the $G$-action is semialgebraic and proper, the stratification can be further refined so that each stratum is $G$-invariant and lies inside a single orbit-type piece. On such a stratum, the isotropy type is constant, hence the orbit dimension $\dim T_\theta(G\cdot\theta)$ is constant. By construction, the rank $\operatorname{rank}(D\Phi_X(\theta))$ is also constant on each stratum. This proves claim (2).

By claim (1), $\Theta_{\mathrm{sing}}$ is semialgebraic, and the chosen Whitney stratification is compatible with $\Theta_{\mathrm{sing}}$. Therefore there exists a subset $A_{\mathrm{sing}}\subseteq A$ such that $\Theta_{\mathrm{sing}}=\bigsqcup_{\alpha\in A_{\mathrm{sing}}}S_\alpha$. Hence $\Theta_{\mathrm{sing}}$ is a union of strata. This proves claim (3).
Since the $G$-action on $\Theta$ is proper, the quotient $\Theta/G$ is Hausdorff. Because each stratum $S_\alpha$ is $G$-invariant and semialgebraic, its quotient $S_\alpha/G$ defines a stratum in the quotient in the sense of semialgebraic stratified spaces. The collection of such quotient strata yields a semialgebraic stratification of $\Theta/G$. On the regular set $\Theta_{\mathrm{reg}}$, the action is locally free and proper, and
$$\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta)\qquad\text{for all }\theta\in\Theta_{\mathrm{reg}}.$$
Therefore, by Theorem 2.1 applied to the finite-sample realization map $\Phi_X$, the quotient $\mathcal{M}_{\mathrm{reg}}=\Theta_{\mathrm{reg}}/G$ is a smooth manifold. Since $\Theta_{\mathrm{reg}}$ is the locus where both the isotropy and the infinitesimal realization rank are locally maximal in the regular sense, $\mathcal{M}_{\mathrm{reg}}$ forms the top smooth stratum of the quotient $\Theta/G$. This proves claim (4). ∎

Remark 2.2. The finite-sample realization map is the natural object for the geometry of empirical risk, since standard training objectives depend on $\theta$ only through the vector $\Phi_X(\theta)=\big(f_\theta(x_1),\dots,f_\theta(x_n)\big)\in\mathbb{R}^n$. Using $\Phi_X$ in place of an abstract infinite-dimensional realization map makes the stratification statement fully concrete and places all rank and kernel conditions in a finite-dimensional semialgebraic setting.

This proposition should be read as a structural statement rather than a fully explicit classification. For the purposes of the present paper, the crucial point is that singular configurations correspond precisely to failures of local identifiability beyond the prescribed parameter symmetries. They are thus the loci at which the effective number of function degrees of freedom changes. As a result, singular points are expected to play a distinguished role in optimization dynamics, model reduction, and implicit bias. Nevertheless, the regular manifold $\mathcal{M}_{\mathrm{reg}}$ already captures the local geometry relevant for generic parameter values, and it is this smooth quotient that will form the basis of our subsequent constructions.
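The rank-drop characterization of singular configurations can be probed numerically. The sketch below is our illustration, with a ReLU activation (semialgebraic and piecewise $C^1$, as Proposition 2.2 requires): it compares the numerical rank of $D\Phi_X$ at a generic parameter with the rank at a coincident-neuron configuration $w_2=w_1$, one of the typical singular examples, where the kernel strictly exceeds the orbit tangent space. All names and the specific sizes are ours.

```python
import numpy as np

def phi_X(theta, X, m, d):
    # finite-sample realization map for a ReLU network; theta = (a, vec(W))
    a, W = theta[:m], theta[m:].reshape(m, d)
    return (a[None, :] * np.maximum(X @ W.T, 0.0)).sum(axis=1)

def realization_rank(theta, X, m, d, eps=1e-6, rel_tol=1e-6):
    """Numerical rank of D Phi_X(theta), via central differences and an SVD."""
    p = theta.size
    J = np.zeros((X.shape[0], p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        J[:, j] = (phi_X(theta + e, X, m, d) - phi_X(theta - e, X, m, d)) / (2 * eps)
    s = np.linalg.svd(J, compute_uv=False)
    return int(np.sum(s > rel_tol * s[0]))

rng = np.random.default_rng(1)
m, d, n = 3, 2, 40
X = rng.normal(size=(n, d))
theta_gen = rng.normal(size=m * (d + 1))               # generic (regular) parameters
theta_sing = theta_gen.copy()
theta_sing[m + d: m + 2 * d] = theta_sing[m: m + d]    # coincident neurons: w_2 = w_1

rank_gen = realization_rank(theta_gen, X, m, d)
rank_sing = realization_rank(theta_sing, X, m, d)
```

For ReLU, homogeneity gives $\mathrm{relu}(w_i^\top x)=\sum_j w_{ij}\,\mathrm{relu}'(w_i^\top x)x_j$, so the $a$-columns of the Jacobian are exact combinations of the $w$-columns and the generic rank is $md$; at the coincident configuration the two merged units share their column span and the rank drops, which is exactly the failure of local identifiability beyond the prescribed symmetries.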
2.6 Consequences for optimization geometry

We close this section by recording the interpretation of the quotient structure for the loss geometry of shallow networks. Let $\mathcal{L}:\Theta\to\mathbb{R}$ be any objective that depends on $\theta$ only through the realized function $f_\theta$, as is the case for standard empirical risk minimization. Then
$$\mathcal{L}(g\cdot\theta)=\mathcal{L}(\theta)\qquad\text{for all }g\in G. \tag{2.21}$$
Consequently, $\mathcal{L}$ is constant along each orbit, and every tangent direction in $T_\theta(G\cdot\theta)$ is a first-order flat direction of the loss. At regular points, these are precisely the directions of representational redundancy. Therefore, the Euclidean Hessian of $\mathcal{L}$ in parameter space necessarily contains a degenerate block associated with the symmetry orbit, even when the loss is locally nondegenerate as a function of the realized predictor.

This observation motivates the central distinction developed later in the paper. Flatness in parameter space is not an intrinsic notion unless symmetry directions have first been removed. The quotient manifold $\mathcal{M}_{\mathrm{reg}}$ provides the correct geometric stage for this removal. In the next section, we endow $\mathcal{M}_{\mathrm{reg}}$ with a function-induced metric and use it to define an effective curvature notion that separates symmetry-induced flatness from genuine functional flatness.

3 Function-Induced Metric and Effective Curvature

In Section 2, we showed that the Euclidean parameter space $\Theta$ contains directions of purely representational redundancy induced by permutation and scaling symmetries. On the regular set $\Theta_{\mathrm{reg}}$, these directions are precisely the tangent directions to the $G$-orbits, and the quotient $\mathcal{M}_{\mathrm{reg}}=\Theta_{\mathrm{reg}}/G$ provides the appropriate local parameter space of distinct predictors. The present section equips this quotient with a metric induced by the network outputs on the training sample and uses this metric to define an intrinsic notion of second-order geometry.
This construction yields an effective curvature object that separates symmetry-induced flatness from genuine functional flatness. Throughout the section, we fix a dataset $X=(x_1,\dots,x_n)\in(\mathbb{R}^d)^n$ and consider the finite-sample realization map
$$\Phi_X:\Theta\to\mathbb{R}^n,\qquad \Phi_X(\theta)=\big(f_\theta(x_1),\dots,f_\theta(x_n)\big).$$
We restrict all geometric constructions to $\Theta_{\mathrm{reg}}$, where the quotient $\mathcal{M}_{\mathrm{reg}}$ is a smooth manifold and the differential $D\Phi_X(\theta)$ has kernel exactly equal to the orbit tangent space $T_\theta(G\cdot\theta)$.

3.1 A function-induced metric on parameter space

The most natural geometry for empirical learning is not the ambient Euclidean geometry of parameters, but the geometry induced by the map $\Phi_X$ into the output space $\mathbb{R}^n$. We therefore define a symmetric bilinear form on $T_\theta\Theta_{\mathrm{reg}}$ by pullback of the Euclidean inner product on $\mathbb{R}^n$. For $\theta\in\Theta_{\mathrm{reg}}$ and tangent vectors $u,v\in T_\theta\Theta_{\mathrm{reg}}$, define
$$g_\theta(u,v):=\frac{1}{n}\big\langle D\Phi_X(\theta)[u],\,D\Phi_X(\theta)[v]\big\rangle_{\mathbb{R}^n}. \tag{3.1}$$
Equivalently,
$$g_\theta(u,v)=\frac{1}{n}\sum_{k=1}^n Df_\theta(x_k)[u]\,Df_\theta(x_k)[v].$$
Thus $g_\theta$ measures the first-order effect of parameter perturbations on the vector of network outputs over the sample $X$. By construction, $g_\theta$ is symmetric and positive semidefinite. The associated quadratic form is
$$g_\theta(u,u)=\frac{1}{n}\big\|D\Phi_X(\theta)[u]\big\|^2. \tag{3.2}$$
Hence $g_\theta(u,u)=0$ if and only if $u\in\ker(D\Phi_X(\theta))$. Since $\theta\in\Theta_{\mathrm{reg}}$, Proposition 2.1 implies $\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta)$. Accordingly, $g_\theta$ is degenerate precisely along the symmetry directions. This metric has two immediate interpretations. First, it is the finite-sample pullback metric induced by the realization map. Second, in statistical terms it is the empirical Gauss–Newton or Fisher-type geometry associated with the model outputs.
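In coordinates, $g_\theta$ is the Gram matrix $\frac{1}{n}J(\theta)^\top J(\theta)$, where $J(\theta)$ is the Jacobian of $\Phi_X$. The following sketch (our illustration with a ReLU activation, for which neuronwise rescaling is an exact symmetry; all names are ours) builds this matrix by finite differences and checks that the rescaling generator of a unit, the tangent $(-a_i, w_i)$ of the orbit curve $t\mapsto(a_i e^{-t}, e^{t}w_i)$, is a null direction, while a generic direction orthogonal to the vertical space has positive length.

```python
import numpy as np

def phi_X(theta, X, m, d):
    # finite-sample realization map of a ReLU network; theta = (a, vec(W))
    a, W = theta[:m], theta[m:].reshape(m, d)
    return (a[None, :] * np.maximum(X @ W.T, 0.0)).sum(axis=1)

def jacobian(theta, X, m, d, eps=1e-6):
    n, p = X.shape[0], theta.size
    J = np.zeros((n, p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        J[:, j] = (phi_X(theta + e, X, m, d) - phi_X(theta - e, X, m, d)) / (2 * eps)
    return J

def rescaling_generator(theta, i, m, d):
    # d/dt (a_i e^{-t}, e^t w_i)|_{t=0}: tangent to the rescaling orbit of unit i
    u = np.zeros(theta.size)
    u[i] = -theta[i]
    u[m + i * d: m + (i + 1) * d] = theta[m + i * d: m + (i + 1) * d]
    return u

rng = np.random.default_rng(3)
m, d, n = 3, 2, 30
X = rng.normal(size=(n, d))
theta = rng.normal(size=m * (d + 1))
J = jacobian(theta, X, m, d)
G = J.T @ J / n                        # coordinate matrix of g_theta, as in (3.1)

# vertical space spanned by the m rescaling generators; Euclidean projector onto
# its orthogonal complement, the horizontal space of (3.4)
V = np.stack([rescaling_generator(theta, i, m, d) for i in range(m)], axis=1)
Q, _ = np.linalg.qr(V)
P_hor = np.eye(theta.size) - Q @ Q.T
u_hor = P_hor @ rng.normal(size=theta.size)   # a generic horizontal vector
```

The null direction illustrates the degeneracy of $g_\theta$ along symmetry orbits, and the positive horizontal length previews the positive definiteness of $g_\theta|_{\mathcal{H}_\theta}$ established below.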
The important point for our purposes is not the terminology but the invariance structure: $g$ depends only on first-order changes in the realized predictor on the sample, and therefore ignores parameter perturbations that merely move along a symmetry orbit. The $G$-invariance of $\Phi_X$ implies the $G$-invariance of $g$. Indeed, for every $g_0\in G$,
$$\Phi_X\circ\alpha_{g_0}=\Phi_X,$$
where $\alpha_{g_0}(\theta)=g_0\cdot\theta$. Differentiating gives
$$D\Phi_X(g_0\cdot\theta)\circ D\alpha_{g_0,\theta}=D\Phi_X(\theta),$$
and therefore
$$g_{g_0\cdot\theta}(D\alpha_{g_0,\theta}u,\,D\alpha_{g_0,\theta}v)=g_\theta(u,v).$$
Thus $g$ is constant along orbits in the natural tensorial sense.

3.2 Vertical and horizontal directions

The degeneracy of $g$ is not a defect but a geometric signal: it identifies the directions of representational redundancy. This leads to the canonical decomposition of the tangent space into vertical and horizontal parts. For $\theta\in\Theta_{\mathrm{reg}}$, define the vertical space
$$\mathcal{V}_\theta:=T_\theta(G\cdot\theta)=\ker(D\Phi_X(\theta)). \tag{3.3}$$
These are precisely the infinitesimal parameter perturbations that leave the sample outputs unchanged to first order. Since the action is locally free on $\Theta_{\mathrm{reg}}$, $\mathcal{V}_\theta$ has constant dimension. We define the horizontal space by Euclidean orthogonality to $\mathcal{V}_\theta$:
$$\mathcal{H}_\theta:=\mathcal{V}_\theta^{\perp_{\mathrm{Euc}}}\subset T_\theta\Theta_{\mathrm{reg}}. \tag{3.4}$$
Since $\Theta_{\mathrm{reg}}$ is a finite-dimensional smooth manifold and $\mathcal{V}_\theta$ is a smooth constant-rank subbundle, the assignment $\theta\mapsto\mathcal{H}_\theta$ defines a smooth complementary subbundle. Hence we have the direct-sum decomposition
$$T_\theta\Theta_{\mathrm{reg}}=\mathcal{V}_\theta\oplus\mathcal{H}_\theta. \tag{3.5}$$
The choice of Euclidean orthogonality here is a gauge choice used only to select representatives of tangent classes. Intrinsic constructions below will not depend on this particular complement. The key fact is that the restriction of $D\Phi_X(\theta)$ to $\mathcal{H}_\theta$ is injective, since
$$\ker\big(D\Phi_X(\theta)|_{\mathcal{H}_\theta}\big)=\mathcal{H}_\theta\cap\mathcal{V}_\theta=\{0\}.$$
As a consequence, the restriction of $g_\theta$ to $\mathcal{H}_\theta$ is positive definite:
$$u\in\mathcal{H}_\theta,\ u\neq 0\ \Longrightarrow\ g_\theta(u,u)>0.$$
Therefore $g$ is nondegenerate exactly on the directions transverse to the symmetry orbit. This is the first point at which the quotient geometry becomes visible: the degenerate directions are precisely those that should be modded out.

3.3 Descent of the metric to the quotient

We now show that the pullback form $g$ descends to a genuine Riemannian metric on the quotient manifold $\mathcal{M}_{\mathrm{reg}}$. Let $q:\Theta_{\mathrm{reg}}\to\mathcal{M}_{\mathrm{reg}}$ be the quotient map. For $[\theta]\in\mathcal{M}_{\mathrm{reg}}$ and tangent vectors $\xi,\eta\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$, choose any lifts $u,v\in T_\theta\Theta_{\mathrm{reg}}$ such that
$$Dq_\theta[u]=\xi,\qquad Dq_\theta[v]=\eta.$$
Define
$$\bar g_{[\theta]}(\xi,\eta):=g_\theta(u^{\mathrm{hor}},v^{\mathrm{hor}}),$$
where $u^{\mathrm{hor}}$ and $v^{\mathrm{hor}}$ denote the horizontal components of $u$ and $v$ with respect to the decomposition $T_\theta\Theta_{\mathrm{reg}}=\mathcal{V}_\theta\oplus\mathcal{H}_\theta$. The next result formalizes that $\bar g$ is well defined and positive definite.

Theorem 3.1. (Quotient metric) Let $q:\Theta_{\mathrm{reg}}\to\mathcal{M}_{\mathrm{reg}}:=\Theta_{\mathrm{reg}}/G$ be the quotient map, where $G$ acts smoothly, properly, and locally freely on $\Theta_{\mathrm{reg}}$. Let $\Phi_X:\Theta_{\mathrm{reg}}\to\mathbb{R}^n$ be the finite-sample realization map, and define for each $\theta\in\Theta_{\mathrm{reg}}$ the symmetric bilinear form
$$g_\theta(u,v):=\frac{1}{n}\big\langle D\Phi_X(\theta)[u],\,D\Phi_X(\theta)[v]\big\rangle_{\mathbb{R}^n},\qquad u,v\in T_\theta\Theta_{\mathrm{reg}}.$$
Assume that for every $\theta\in\Theta_{\mathrm{reg}}$,
$$\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta).$$
Define the vertical bundle $\mathcal{V}_\theta:=T_\theta(G\cdot\theta)$, and let $\mathcal{H}\subset T\Theta_{\mathrm{reg}}$ be any smooth $G$-equivariant horizontal subbundle complementary to $\mathcal{V}$, i.e.
$$T_\theta\Theta_{\mathrm{reg}}=\mathcal{V}_\theta\oplus\mathcal{H}_\theta\qquad\text{for all }\theta\in\Theta_{\mathrm{reg}}.$$
Then:
1. $g$ is a smooth $G$-invariant symmetric bilinear form on $T\Theta_{\mathrm{reg}}$;
2. the radical of $g_\theta$ is exactly $\mathcal{V}_\theta$, that is,
$$\operatorname{rad}(g_\theta):=\{u:g_\theta(u,v)=0\ \forall v\}=\mathcal{V}_\theta;$$
3. the restriction $g_\theta|_{\mathcal{H}_\theta}$ is positive definite for every $\theta$;
4. there exists a unique Riemannian metric $\bar g$ on $\mathcal{M}_{\mathrm{reg}}$ such that for every $\theta\in\Theta_{\mathrm{reg}}$ and every $u,v\in\mathcal{H}_\theta$,
$$\bar g_{[\theta]}(Dq_\theta[u],Dq_\theta[v])=g_\theta(u,v).$$

Proof. Since $\Phi_X$ is smooth on $\Theta_{\mathrm{reg}}$, its differential $D\Phi_X:T\Theta_{\mathrm{reg}}\to\mathbb{R}^n$ depends smoothly on $\theta$. Therefore
$$g_\theta(u,v)=\frac{1}{n}\big\langle D\Phi_X(\theta)[u],\,D\Phi_X(\theta)[v]\big\rangle$$
defines a smooth bilinear form on $T\Theta_{\mathrm{reg}}$. Symmetry is immediate from symmetry of the Euclidean inner product on $\mathbb{R}^n$. We next prove $G$-invariance. Fix $g_0\in G$, and let $\alpha_{g_0}:\Theta_{\mathrm{reg}}\to\Theta_{\mathrm{reg}}$, $\alpha_{g_0}(\theta)=g_0\cdot\theta$. Because $\Phi_X$ is $G$-invariant, $\Phi_X\circ\alpha_{g_0}=\Phi_X$. Differentiating at $\theta$ gives
$$D\Phi_X(g_0\cdot\theta)\circ D\alpha_{g_0,\theta}=D\Phi_X(\theta).$$
Hence for any $u,v\in T_\theta\Theta_{\mathrm{reg}}$,
$$g_{g_0\cdot\theta}(D\alpha_{g_0,\theta}u,\,D\alpha_{g_0,\theta}v)=\frac{1}{n}\big\langle D\Phi_X(\theta)[u],\,D\Phi_X(\theta)[v]\big\rangle=g_\theta(u,v).$$
Thus $g$ is $G$-invariant. This proves (1).

We show that $\operatorname{rad}(g_\theta)=\ker(D\Phi_X(\theta))$. Fix $\theta\in\Theta_{\mathrm{reg}}$. First, let $u\in\ker(D\Phi_X(\theta))$. Then for every $v\in T_\theta\Theta_{\mathrm{reg}}$,
$$g_\theta(u,v)=\frac{1}{n}\big\langle D\Phi_X(\theta)[u],\,D\Phi_X(\theta)[v]\big\rangle=0.$$
Hence $\ker(D\Phi_X(\theta))\subseteq\operatorname{rad}(g_\theta)$. Conversely, suppose $u\in\operatorname{rad}(g_\theta)$. Then $g_\theta(u,v)=0$ for all $v\in T_\theta\Theta_{\mathrm{reg}}$. In particular, taking $v=u$, we obtain
$$0=g_\theta(u,u)=\frac{1}{n}\big\|D\Phi_X(\theta)[u]\big\|^2.$$
Therefore $D\Phi_X(\theta)[u]=0$, so $u\in\ker(D\Phi_X(\theta))$. Thus $\operatorname{rad}(g_\theta)=\ker(D\Phi_X(\theta))$. By the standing regularity assumption, $\ker(D\Phi_X(\theta))=T_\theta(G\cdot\theta)=\mathcal{V}_\theta$. Hence $\operatorname{rad}(g_\theta)=\mathcal{V}_\theta$. This proves (2).

Fix $\theta\in\Theta_{\mathrm{reg}}$. Since $T_\theta\Theta_{\mathrm{reg}}=\mathcal{V}_\theta\oplus\mathcal{H}_\theta$, we have $\mathcal{H}_\theta\cap\mathcal{V}_\theta=\{0\}$, and by (2), $\mathcal{V}_\theta=\ker(D\Phi_X(\theta))$.
Therefore the restriction $D\Phi_X(\theta)|_{\mathcal{H}_\theta}:\mathcal{H}_\theta\to\mathbb{R}^n$ is injective. Now let $u\in\mathcal{H}_\theta$ with $u\neq 0$. Since the restriction is injective, $D\Phi_X(\theta)[u]\neq 0$. Hence
$$g_\theta(u,u)=\frac{1}{n}\big\|D\Phi_X(\theta)[u]\big\|^2>0.$$
Thus $g_\theta|_{\mathcal{H}_\theta}$ is positive definite. This proves (3).

We now define a bilinear form on $T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$. Because $q:\Theta_{\mathrm{reg}}\to\mathcal{M}_{\mathrm{reg}}$ is a smooth submersion, its differential $Dq_\theta:T_\theta\Theta_{\mathrm{reg}}\to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is surjective, with kernel $\ker(Dq_\theta)=T_\theta(G\cdot\theta)=\mathcal{V}_\theta$. Since $\mathcal{H}_\theta$ is a complement of $\mathcal{V}_\theta$, the restriction $Dq_\theta|_{\mathcal{H}_\theta}:\mathcal{H}_\theta\to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is a linear isomorphism. For $\xi,\eta\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$, define
$$\bar g_{[\theta]}(\xi,\eta):=g_\theta(u,v),$$
where $u,v\in\mathcal{H}_\theta$ are the unique horizontal vectors satisfying $Dq_\theta[u]=\xi$ and $Dq_\theta[v]=\eta$. We must prove that this definition is independent of the representative $\theta\in[\theta]$, smooth, and positive definite.

At fixed $\theta$, uniqueness of $u$ and $v$ follows because $Dq_\theta|_{\mathcal{H}_\theta}$ is an isomorphism, so there is no ambiguity at this stage. Let $\theta'=g_0\cdot\theta$ for some $g_0\in G$. Since $\mathcal{H}$ is assumed $G$-equivariant, $D\alpha_{g_0,\theta}(\mathcal{H}_\theta)=\mathcal{H}_{\theta'}$. Let $u,v\in\mathcal{H}_\theta$ be the horizontal lifts of $\xi,\eta$ at $\theta$. Then
$$u':=D\alpha_{g_0,\theta}u\in\mathcal{H}_{\theta'},\qquad v':=D\alpha_{g_0,\theta}v\in\mathcal{H}_{\theta'}.$$
By equivariance of the quotient map, $q\circ\alpha_{g_0}=q$, hence
$$Dq_{\theta'}[u']=D(q\circ\alpha_{g_0})_\theta[u]=Dq_\theta[u]=\xi,$$
and similarly $Dq_{\theta'}[v']=\eta$. So $u',v'$ are the horizontal lifts of $\xi,\eta$ at $\theta'$. Using $G$-invariance of $g$,
$$g_{\theta'}(u',v')=g_{g_0\cdot\theta}(D\alpha_{g_0,\theta}u,\,D\alpha_{g_0,\theta}v)=g_\theta(u,v).$$
Thus the value $\bar g_{[\theta]}(\xi,\eta)$ is independent of the representative $\theta$. Because $q$ is a smooth submersion and $\mathcal{H}$ is a smooth horizontal bundle, the inverse of $Dq_\theta|_{\mathcal{H}_\theta}:\mathcal{H}_\theta\to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ depends smoothly on $\theta$ in local trivializations.
Since $g$ is smooth on $\Theta_{\mathrm{reg}}$, the bilinear form $\bar g$ is smooth on $\mathcal{M}_{\mathrm{reg}}$. Let $\xi\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ be nonzero, and let $u\in\mathcal{H}_\theta$ be its horizontal lift. Because $Dq_\theta|_{\mathcal{H}_\theta}$ is an isomorphism, $\xi\neq 0$ implies $u\neq 0$. Therefore
$$\bar g_{[\theta]}(\xi,\xi)=g_\theta(u,u)>0.$$
Thus $\bar g$ is positive definite, and hence a Riemannian metric on $\mathcal{M}_{\mathrm{reg}}$. Suppose $\bar g^{(1)}$ and $\bar g^{(2)}$ are two Riemannian metrics on $\mathcal{M}_{\mathrm{reg}}$ satisfying
$$\bar g^{(j)}_{[\theta]}(Dq_\theta[u],Dq_\theta[v])=g_\theta(u,v)\qquad\text{for all }\theta\text{ and all }u,v\in\mathcal{H}_\theta,$$
for $j=1,2$. Fix $[\theta]\in\mathcal{M}_{\mathrm{reg}}$ and $\xi,\eta\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$, and let $u,v\in\mathcal{H}_\theta$ be the unique horizontal lifts. Then
$$\bar g^{(1)}_{[\theta]}(\xi,\eta)=g_\theta(u,v)=\bar g^{(2)}_{[\theta]}(\xi,\eta).$$
Hence $\bar g^{(1)}=\bar g^{(2)}$, so the quotient metric is unique. This proves (4), and therefore the theorem follows. ∎

Theorem 3.1 shows that the finite-sample realization map induces a canonical local geometry on the quotient manifold of distinct predictors. In this geometry, the length of a tangent vector measures the first-order variation of the network outputs on the sample, after removal of symmetry directions. Thus $\bar g$ is the natural metric for empirical function-space geometry.

3.4 Loss functions and quotient gradients

We now consider a loss functional on the parameter space that depends on $\theta$ only through the finite-sample prediction vector $\Phi_X(\theta)$. Let $\tilde L:\mathbb{R}^n\to\mathbb{R}$ be a smooth function, and define
$$\mathcal{L}(\theta):=\tilde L(\Phi_X(\theta)).$$
This covers standard empirical risks such as least squares, logistic loss, and cross-entropy, restricted to the finite sample. Because $\Phi_X$ is $G$-invariant, $\mathcal{L}$ is also $G$-invariant: $\mathcal{L}(g\cdot\theta)=\mathcal{L}(\theta)$ for all $g\in G$. Hence $\mathcal{L}$ descends to a smooth function
$$\bar{\mathcal{L}}:\mathcal{M}_{\mathrm{reg}}\to\mathbb{R},\qquad \bar{\mathcal{L}}([\theta])=\mathcal{L}(\theta).$$
The first-order variation of $\mathcal{L}$ admits a particularly transparent form in the function-induced metric.
By the chain rule,
$$D\mathcal{L}(\theta)[u]=\big\langle\nabla\tilde L(\Phi_X(\theta)),\,D\Phi_X(\theta)[u]\big\rangle_{\mathbb{R}^n}.$$
Thus $D\mathcal{L}(\theta)[u]$ depends only on the image $D\Phi_X(\theta)[u]$, and in particular vanishes on $\mathcal{V}_\theta$. This is consistent with $G$-invariance, but the stronger point is that the differential of $\mathcal{L}$ is completely determined by the function-space differential of $\Phi_X$. The corresponding gradient with respect to the quotient metric $\bar g$ is characterized by
$$\bar g_{[\theta]}\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]),\,\xi\big)=D\bar{\mathcal{L}}_{[\theta]}[\xi]\qquad\text{for all }\xi\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}.$$
Equivalently, if $u_{\mathcal{L}}(\theta)\in\mathcal{H}_\theta$ denotes the unique horizontal vector satisfying
$$g_\theta(u_{\mathcal{L}}(\theta),v)=D\mathcal{L}(\theta)[v]\qquad\text{for all }v\in\mathcal{H}_\theta,$$
then
$$Dq_\theta[u_{\mathcal{L}}(\theta)]=\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]).$$
Thus the quotient gradient is represented in parameter space by the unique horizontal vector solving the above variational equation. This observation is central for the dynamics considered in Section 4.

3.5 Effective Hessian on the quotient

We next turn to second-order geometry. The Euclidean Hessian of $\mathcal{L}$ in parameter space is not intrinsic, because it contains a degenerate component arising from the orbit directions. The appropriate second-order object is the Hessian of the descended loss $\bar{\mathcal{L}}$ on the quotient manifold $(\mathcal{M}_{\mathrm{reg}},\bar g)$. Let $\nabla^{\bar g}$ denote the Levi–Civita connection of $\bar g$. For $[\theta]\in\mathcal{M}_{\mathrm{reg}}$, define the intrinsic Hessian of $\bar{\mathcal{L}}$ by
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi,\eta):=\xi(\eta\bar{\mathcal{L}})-(\nabla^{\bar g}_\xi\eta)\bar{\mathcal{L}},\qquad \xi,\eta\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}.$$
This is a symmetric bilinear form on $T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$. Since $\bar g$ is positive definite, there exists a unique self-adjoint linear operator $H^{\mathrm{eff}}_{[\theta]}:T_{[\theta]}\mathcal{M}_{\mathrm{reg}}\to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ such that
$$\bar g_{[\theta]}(H^{\mathrm{eff}}_{[\theta]}\xi,\eta)=\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi,\eta)\qquad\text{for all }\xi,\eta.$$
We call $H^{\mathrm{eff}}_{[\theta]}$ the effective Hessian of $\mathcal{L}$ at $[\theta]$.
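The variational equation for the horizontal representative can be solved concretely in coordinates: with $J$ the Jacobian of $\Phi_X$, it reads $\frac{1}{n}J^\top J\,u = J^\top\nabla\tilde L$, and the Moore–Penrose pseudoinverse selects the solution orthogonal to $\ker J=\mathcal{V}_\theta$, hence the horizontal one. The sketch below is our illustration for a least-squares loss on a ReLU network (model choice and names ours).

```python
import numpy as np

def phi_X(theta, X, m, d):
    a, W = theta[:m], theta[m:].reshape(m, d)
    return (a[None, :] * np.maximum(X @ W.T, 0.0)).sum(axis=1)

def jacobian(theta, X, m, d, eps=1e-6):
    n, p = X.shape[0], theta.size
    J = np.zeros((n, p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        J[:, j] = (phi_X(theta + e, X, m, d) - phi_X(theta - e, X, m, d)) / (2 * eps)
    return J

rng = np.random.default_rng(4)
m, d, n = 3, 2, 25
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
theta = rng.normal(size=m * (d + 1))

# least-squares loss Ltilde(z) = (1/2n)||z - y||^2, so grad Ltilde = (z - y)/n
J = jacobian(theta, X, m, d)
grad_Lt = (phi_X(theta, X, m, d) - y) / n
euc_grad = J.T @ grad_Lt              # Euclidean gradient D L(theta)

# horizontal representative u_L of the quotient gradient:
# solve g_theta(u, .) = D L(theta)[.], i.e. (1/n) J^T J u = J^T grad_Lt;
# the pseudoinverse picks the solution orthogonal to ker J = V_theta
u_L = n * np.linalg.pinv(J.T @ J) @ euc_grad
```

Since $J^\top\nabla\tilde L$ already lies in the row space of $J$, the resulting $u_L$ satisfies the identity $g_\theta(u_L,v)=D\mathcal{L}(\theta)[v]$ for every tangent vector $v$, not only horizontal ones, and is orthogonal to the rescaling generators.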
By construction, $H^{\mathrm{eff}}_{[\theta]}$ is the second-order operator governing the intrinsic curvature of the loss after quotienting out symmetry directions. It is therefore the correct object for distinguishing genuine functional flatness from parameterization-induced degeneracy. To relate $H^{\mathrm{eff}}_{[\theta]}$ to computations in parameter space, let $P_\theta^{\mathrm{hor}}$ denote the Euclidean orthogonal projection onto $\mathcal{H}_\theta$. Given a local section $s:U\to\Theta_{\mathrm{reg}}$ of the quotient map, the pullback metric $s^*\bar g$ and the pullback loss $\bar{\mathcal{L}}\circ q\circ s$ identify a neighborhood in the quotient with a gauge-fixed submanifold of parameter space. Under this identification, the effective Hessian is represented by the second covariant derivative of the gauge-fixed loss, not by the raw Euclidean Hessian of $\mathcal{L}$. The distinction is important. If one naively restricts the Euclidean Hessian $\nabla^2\mathcal{L}(\theta)$ to $\mathcal{H}_\theta$, one captures only part of the intrinsic second-order geometry. In general there are additional terms arising from the variation of the horizontal distribution and from the Levi–Civita connection of the quotient metric. The next proposition records the local relation.

Proposition 3.1. (Local representation of the effective Hessian) Let $q:\Theta_{\mathrm{reg}}\to\mathcal{M}_{\mathrm{reg}}=\Theta_{\mathrm{reg}}/G$ be the quotient map, let $\bar g$ be the quotient metric from Theorem 3.1, and let $\bar{\mathcal{L}}:\mathcal{M}_{\mathrm{reg}}\to\mathbb{R}$ be the descended loss. Let $s:U\to\Theta_{\mathrm{reg}}$ be a smooth local section of $q$ over an open set $U\subset\mathcal{M}_{\mathrm{reg}}$, so that $q\circ s=\mathrm{id}_U$. Define the gauge-fixed submanifold $S:=s(U)\subset\Theta_{\mathrm{reg}}$, the pullback metric $g^S:=s^*\bar g$ on $U$, and the gauge-fixed loss $\mathcal{L}^S:=\bar{\mathcal{L}}|_U=\mathcal{L}\circ s$. Let $\nabla^S$ denote the Levi–Civita connection of $g^S$. Then for every $x\in U$ and $\xi,\eta\in T_xU$,
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x)(\xi,\eta)=\operatorname{Hess}_{g^S}\mathcal{L}^S(x)(\xi,\eta)=\xi(\eta\mathcal{L}^S)-(\nabla^S_\xi\eta)\mathcal{L}^S.$$
Equivalently, in any local coordinates $(z^1,\dots,z^r)$ on $U$,
$$\big(\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}\big)_{ij}=\partial_i\partial_j\mathcal{L}^S-\Gamma^k_{ij}(g^S)\,\partial_k\mathcal{L}^S.$$
In particular, if $s$ is viewed as a gauge fixing inside parameter space, then the effective Hessian is represented locally by the second derivative of the gauge-fixed loss together with the Christoffel correction of the pulled-back quotient metric. Thus the effective Hessian is not, in general, equal to the raw Euclidean Hessian of $\mathcal{L}$ restricted to a transverse slice.

Proof. Since $q\circ s=\mathrm{id}_U$, the differential satisfies
$$Dq_{s(x)}\circ Ds_x=\mathrm{id}_{T_xU}\qquad\text{for all }x\in U.$$
Hence $Ds_x:T_xU\to T_{s(x)}\Theta_{\mathrm{reg}}$ is injective, and $s$ is an immersion. Therefore $S=s(U)$ is an embedded submanifold of $\Theta_{\mathrm{reg}}$, and $s:U\to S$ is a diffeomorphism. By definition of the pullback metric,
$$g^S_x(\xi,\eta)=\bar g_x(\xi,\eta)\qquad\text{for all }\xi,\eta\in T_xU.$$
Thus $s$ identifies the quotient metric with its pullback on the gauge-fixed slice. In particular, since the two metrics coincide by definition, the identity map $\mathrm{id}_U:(U,\bar g|_U)\to(U,g^S)$ is an isometry. A standard fact from Riemannian geometry is that if $F:(M,g)\to(N,h)$ is an isometry and $f:N\to\mathbb{R}$ is smooth, then
$$\operatorname{Hess}_g(f\circ F)=F^*(\operatorname{Hess}_h f).$$
We apply this with $F=\mathrm{id}_U$, $M=(U,g^S)$, $N=(U,\bar g|_U)$, and $f=\bar{\mathcal{L}}|_U$. Since $\mathcal{L}^S=\bar{\mathcal{L}}\circ\mathrm{id}_U=\bar{\mathcal{L}}|_U$, we obtain
$$\operatorname{Hess}_{g^S}\mathcal{L}^S=\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}|_U.$$
Pointwise, for every $x\in U$ and $\xi,\eta\in T_xU$,
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x)(\xi,\eta)=\operatorname{Hess}_{g^S}\mathcal{L}^S(x)(\xi,\eta).$$
For completeness, we verify this directly from the definition of the Hessian.
Let $\nabla^{\bar g}$ and $\nabla^S$ be the Levi–Civita connections of $\bar g|_U$ and $g^S$, respectively. Since the two metrics coincide on $U$, uniqueness of the Levi–Civita connection implies $\nabla^{\bar g}=\nabla^S$ on $U$. Therefore, for smooth vector fields $\xi,\eta$ on $U$,
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(\xi,\eta)=\xi(\eta\bar{\mathcal{L}})-(\nabla^{\bar g}_\xi\eta)\bar{\mathcal{L}}=\xi(\eta\mathcal{L}^S)-(\nabla^S_\xi\eta)\mathcal{L}^S=\operatorname{Hess}_{g^S}\mathcal{L}^S(\xi,\eta).$$
This proves the first identity. Choose local coordinates $(z^1,\dots,z^r)$ on $U$, and denote the coordinate vector fields by $\partial_i:=\partial/\partial z^i$. By definition of the Riemannian Hessian,
$$\operatorname{Hess}_{g^S}\mathcal{L}^S(\partial_i,\partial_j)=\partial_i\partial_j\mathcal{L}^S-(\nabla^S_{\partial_i}\partial_j)\mathcal{L}^S.$$
Writing the Levi–Civita connection in coordinates as $\nabla^S_{\partial_i}\partial_j=\Gamma^k_{ij}(g^S)\,\partial_k$, we obtain
$$\operatorname{Hess}_{g^S}\mathcal{L}^S(\partial_i,\partial_j)=\partial_i\partial_j\mathcal{L}^S-\Gamma^k_{ij}(g^S)\,\partial_k\mathcal{L}^S.$$
This is exactly the local coordinate representation of $\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}$. ∎

This formula shows that the effective Hessian is represented locally by the ordinary second derivatives of the gauge-fixed loss together with the Christoffel-symbol correction induced by the quotient metric. In particular, it is not simply the Euclidean Hessian of the ambient loss restricted to a transverse slice, unless the chosen local coordinates are geodesic for $g^S$ at the point under consideration, in which case the Christoffel symbols vanish at that point. Proposition 3.1 is not intended as a computational formula in full generality. Its role is conceptual: it shows that the effective Hessian coincides with the horizontal second derivative only up to geometric correction terms. Thus a correct notion of intrinsic curvature must be formulated on the quotient manifold rather than by simple coordinate restriction in parameter space.

3.6 False flatness and intrinsic flatness

The previous constructions allow a precise distinction between two qualitatively different sources of flatness.
First, because $\mathcal{L}$ is constant along each orbit $G\cdot\theta$, every vertical direction belongs to the nullspace of the first derivative, and at regular points these directions also generate degenerate second-order behavior in the Euclidean parameterization. This type of flatness is a direct consequence of redundancy in the representation and carries no functional meaning. We refer to it as false flatness or symmetry-induced flatness. Second, even after quotienting out the symmetry directions, the descended loss $\bar{\mathcal{L}}$ may still have small curvature in certain quotient directions. This corresponds to genuine insensitivity of the realized predictor on the sample and is therefore a meaningful notion of flatness. We refer to it as intrinsic flatness or effective flatness.

The effective Hessian makes this distinction exact. A zero or near-zero eigenvalue of the Euclidean Hessian of $\mathcal{L}$ may reflect either an orbit direction or a genuine low-curvature direction of $\bar{\mathcal{L}}$. By contrast, the spectrum of $H^{\mathrm{eff}}_{[\theta]}$ encodes only quotient-level curvature, because the vertical degeneracy has been removed by construction. This leads to the following immediate corollary.

Corollary 3.1. (Nondegeneracy modulo symmetry) Let $\bar{\mathcal{L}}:\mathcal{M}_{\mathrm{reg}}\to\mathbb{R}$ be the descended loss on the quotient manifold $(\mathcal{M}_{\mathrm{reg}},\bar g)$, and let $[\theta]\in\mathcal{M}_{\mathrm{reg}}$ be a critical point of $\bar{\mathcal{L}}$, i.e.
$$\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta])=0.$$
Assume that the effective Hessian $H^{\mathrm{eff}}_{[\theta]}:T_{[\theta]}\mathcal{M}_{\mathrm{reg}}\to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is positive definite; equivalently,
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi,\xi)>0\qquad\text{for all }\xi\in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}\setminus\{0\}.$$
Then $[\theta]$ is a nondegenerate strict local minimizer of $\bar{\mathcal{L}}$ on $\mathcal{M}_{\mathrm{reg}}$.
Consequently, any representative $\theta\in\Theta_{\mathrm{reg}}$ is a strict local minimizer of $\mathcal{L}$ modulo the orbit $G\cdot\theta$: equivalently, there exists a neighborhood $U\subset\Theta_{\mathrm{reg}}$ of $\theta$ such that
$$\mathcal{L}(\theta')\geq\mathcal{L}(\theta)\qquad\text{for all }\theta'\in U,$$
with equality only if $q(\theta')=[\theta]$, that is, only along the symmetry orbit through $\theta$. In particular, degeneracy of the Euclidean Hessian of $\mathcal{L}$ in parameter space does not by itself imply intrinsic flatness of the predictor.

Proof. Since $(\mathcal{M}_{\mathrm{reg}},\bar g)$ is a smooth Riemannian manifold, standard local Riemannian geometry applies. Let $x_0:=[\theta]\in\mathcal{M}_{\mathrm{reg}}$. Because $x_0$ is a critical point of $\bar{\mathcal{L}}$, we have $D\bar{\mathcal{L}}_{x_0}=0$. Assume moreover that the Hessian is positive definite:
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x_0)(\xi,\xi)>0\qquad\text{for all }\xi\neq 0.$$
Choose a normal coordinate chart around $x_0$. More precisely, let $\exp_{x_0}:V\subset T_{x_0}\mathcal{M}_{\mathrm{reg}}\to U\subset\mathcal{M}_{\mathrm{reg}}$ be the exponential map, defined on a sufficiently small neighborhood $V$ of $0$, with $\exp_{x_0}(0)=x_0$. Define
$$\psi(v):=\bar{\mathcal{L}}(\exp_{x_0}(v)),\qquad v\in V.$$
Then $\psi$ is a smooth function on the Euclidean vector space $T_{x_0}\mathcal{M}_{\mathrm{reg}}$, with $D\psi(0)=0$, because $x_0$ is a critical point of $\bar{\mathcal{L}}$, and
$$D^2\psi(0)(v,v)=\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x_0)(v,v)$$
for all $v\in T_{x_0}\mathcal{M}_{\mathrm{reg}}$. The latter identity is the standard characterization of the Riemannian Hessian in normal coordinates. Since the quadratic form $D^2\psi(0)$ is positive definite, there exists $c>0$ such that
$$D^2\psi(0)(v,v)\geq c\|v\|^2\qquad\text{for all }v\in T_{x_0}\mathcal{M}_{\mathrm{reg}}.$$
By Taylor's theorem,
$$\psi(v)=\psi(0)+\tfrac{1}{2}D^2\psi(0)(v,v)+o(\|v\|^2)\qquad\text{as }v\to 0.$$
Therefore,
$$\psi(v)-\psi(0)\geq\tfrac{c}{4}\|v\|^2$$
for all sufficiently small $v\neq 0$. Hence
$$\bar{\mathcal{L}}(\exp_{x_0}(v))>\bar{\mathcal{L}}(x_0)\qquad\text{for all sufficiently small }v\neq 0.$$
Thus $x_0=[\theta]$ is a strict local minimizer of $\bar{\mathcal{L}}$.
Moreover, positive definiteness of the Hessian implies nondegeneracy in the standard Morse-theoretic sense: the bilinear form $\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x_0)$ is nondegenerate; equivalently, the associated self-adjoint operator $H^{\mathrm{eff}}_{x_0}$ is invertible. Hence $x_0$ is a nondegenerate local minimizer on $\mathcal{M}_{\mathrm{reg}}$.

Recall that $q:\Theta_{\mathrm{reg}}\to\mathcal{M}_{\mathrm{reg}}$ is the quotient map and that $\mathcal{L}=\bar{\mathcal{L}}\circ q$ on $\Theta_{\mathrm{reg}}$. Let $x_0=[\theta]$, and let $W\subset\mathcal{M}_{\mathrm{reg}}$ be an open neighborhood of $x_0$ such that
$$\bar{\mathcal{L}}(x)>\bar{\mathcal{L}}(x_0)\qquad\text{for all }x\in W\setminus\{x_0\}.$$
Set $U:=q^{-1}(W)\subset\Theta_{\mathrm{reg}}$. Since $q$ is continuous, $U$ is an open neighborhood of $\theta$. Now let $\theta'\in U$. Then $q(\theta')\in W$, and $\mathcal{L}(\theta')=\bar{\mathcal{L}}(q(\theta'))$. Therefore,
$$\mathcal{L}(\theta')\geq\bar{\mathcal{L}}(x_0)=\mathcal{L}(\theta),$$
with equality if and only if $q(\theta')=x_0=[\theta]$. But $q(\theta')=[\theta]$ means exactly that $\theta'$ belongs to the same $G$-orbit as $\theta$, i.e. $\theta'\in G\cdot\theta$. Thus $\theta$ is a strict local minimizer of $\mathcal{L}$ modulo symmetry: in a neighborhood of $\theta$, the loss can remain equal to $\mathcal{L}(\theta)$ only along the orbit $G\cdot\theta$. This proves the second claim.

Because $\mathcal{L}$ is $G$-invariant, it is constant along the orbit $G\cdot\theta$. Hence every vertical direction $u\in T_\theta(G\cdot\theta)$ is a first-order flat direction of $\mathcal{L}$, and in particular the Euclidean Hessian of $\mathcal{L}$ must be degenerate on parameter space whenever the orbit has positive dimension. Indeed, if $\gamma(t)\subset G\cdot\theta$ is a smooth curve with $\gamma(0)=\theta$ and $\dot\gamma(0)=u$, then $\mathcal{L}(\gamma(t))=\mathcal{L}(\theta)$ for all $t$, so both the first and second derivatives at $t=0$ vanish along $u$. Thus degeneracy of the Euclidean Hessian is unavoidable in the presence of symmetry, even when $[\theta]$ is a nondegenerate quotient minimizer. Therefore, zero eigenvalues of the Euclidean Hessian do not by themselves imply intrinsic flatness of the predictor. ∎
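The unavoidable degeneracy can be counted numerically at a minimizer. The sketch below (our construction: a ReLU teacher-equals-student setup, so the squared loss has an interpolating zero-residual minimizer by design) uses the fact that at zero residual the Euclidean Hessian of $\frac{1}{2n}\|\Phi_X(\theta)-y\|^2$ reduces to the Gauss–Newton matrix $J^\top J/n$ away from activation kinks, whose kernel is exactly $\ker J$ and therefore contains the orbit tangent space.

```python
import numpy as np

def phi_X(theta, X, m, d):
    a, W = theta[:m], theta[m:].reshape(m, d)
    return (a[None, :] * np.maximum(X @ W.T, 0.0)).sum(axis=1)

def jacobian(theta, X, m, d, eps=1e-6):
    n, p = X.shape[0], theta.size
    J = np.zeros((n, p))
    for j in range(p):
        e = np.zeros(p); e[j] = eps
        J[:, j] = (phi_X(theta + e, X, m, d) - phi_X(theta - e, X, m, d)) / (2 * eps)
    return J

rng = np.random.default_rng(5)
m, d, n = 3, 2, 40
X = rng.normal(size=(n, d))
theta_star = rng.normal(size=m * (d + 1))
y = phi_X(theta_star, X, m, d)        # teacher = student: zero-residual minimizer

# at zero residual the Euclidean Hessian of (1/2n)||Phi_X - y||^2 equals
# the Gauss-Newton matrix J^T J / n (away from activation kinks)
J = jacobian(theta_star, X, m, d)
H_euc = J.T @ J / n
eigs = np.linalg.eigvalsh(H_euc)                      # ascending
n_false_flat = int(np.sum(eigs < 1e-6 * eigs[-1]))    # symmetry-induced zero modes
```

The count of near-zero eigenvalues matches the continuous orbit dimension $m$ (one rescaling generator per hidden unit), while the remaining eigenvalues are the curvatures that survive in the quotient: exactly the false-flatness versus intrinsic-flatness split.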
Only the quotient Hessian, equivalently the effective Hessian $H^{\mathrm{eff}}_{[\theta]}$, measures curvature after removing symmetry-induced redundancy. Corollary 3.1 clarifies the geometric meaning of the many flat directions observed in overparameterized networks: unless those directions survive in the quotient, they should not be interpreted as evidence of functional simplicity, robustness, or favorable generalization. The correct curvature notion is the quotient curvature measured by $H^{\mathrm{eff}}$.

3.7 Interpretation and role in the sequel

The constructions of this section provide the geometric backbone of the remainder of the paper. The function-induced metric $g$ identifies the first-order geometry of sample predictions, while the quotient metric $\bar g$ removes parameter redundancy and yields an intrinsic Riemannian structure on the space of locally identifiable predictors. The effective Hessian is then the second-order object associated with this quotient geometry. Three consequences will be important later. First, the metric $\bar g$ defines the natural notion of steepest descent for empirical risk on the quotient manifold. Second, the spectrum of the effective Hessian provides the relevant local curvature quantities after removal of symmetry-induced degeneracy. Third, the distinction between false flatness and intrinsic flatness allows one to reinterpret local optimization landscapes in terms of effective geometry rather than ambient parameter coordinates. In the next section, we use this framework to compare Euclidean gradient flow in parameter space with the intrinsic gradient flow induced by $\bar g$ on the quotient manifold. This will show that the function-level dynamics are governed by horizontal motion and by effective curvature, rather than by the full Euclidean geometry of the redundant parameterization.
4 Gradient Flows on the Quotient

The metric construction of Section 3 identifies the quotient manifold $\mathcal{M}_{\mathrm{reg}}$ as the intrinsic space of locally identifiable predictors and equips it with the Riemannian metric $\bar g$ induced by the finite-sample realization map. We now use this structure to analyze optimization dynamics. The central point of this section is that, on the regular set, the function-level evolution of training is governed by the quotient geometry rather than by the full Euclidean geometry of the redundant parameterization. Throughout, we consider a smooth empirical loss of the form
$$\mathcal{L}(\theta) = \tilde L(\Phi_X(\theta)), \qquad \theta \in \Theta_{\mathrm{reg}},$$
where $\Phi_X(\theta) = (f_\theta(x_1), \dots, f_\theta(x_n)) \in \mathbb{R}^n$ is the finite-sample realization map and $\tilde L: \mathbb{R}^n \to \mathbb{R}$ is smooth. Since $\Phi_X$ is $G$-invariant, $\mathcal{L}$ descends to a smooth function
$$\bar{\mathcal{L}}: \mathcal{M}_{\mathrm{reg}} \to \mathbb{R}, \qquad \bar{\mathcal{L}}([\theta]) = \mathcal{L}(\theta).$$
We write $q: \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}}$ for the quotient map, and recall the vertical-horizontal decomposition
$$T_\theta\Theta_{\mathrm{reg}} = \mathcal{V}_\theta \oplus \mathcal{H}_\theta, \qquad \mathcal{V}_\theta = T_\theta(G\cdot\theta) = \ker(D\Phi_X(\theta)).$$
The quotient metric $\bar g$ was defined in Section 3 so that $Dq_\theta: \mathcal{H}_\theta \to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is an isometry.

4.1 Euclidean and quotient gradients

We begin by comparing the Euclidean gradient of $\mathcal{L}$ in parameter space with the quotient gradient of $\bar{\mathcal{L}}$ on $\mathcal{M}_{\mathrm{reg}}$. Let $\langle\cdot,\cdot\rangle_{\mathrm{Euc}}$ denote the ambient Euclidean inner product on $\Theta$. The Euclidean gradient $\nabla\mathcal{L}(\theta) \in T_\theta\Theta_{\mathrm{reg}}$ is defined by
$$\langle \nabla\mathcal{L}(\theta), u\rangle_{\mathrm{Euc}} = D\mathcal{L}(\theta)[u] \quad \text{for all } u \in T_\theta\Theta_{\mathrm{reg}}.$$
By contrast, the quotient gradient $\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]) \in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is defined by
$$\bar g_{[\theta]}\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]), \xi\big) = D\bar{\mathcal{L}}_{[\theta]}[\xi] \quad \text{for all } \xi \in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}.$$
The two gradients live in different geometries and should not be identified.
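The contrast can be made concrete in a linearized sketch (all quantities below are assumed toy data: a random matrix `J` standing in for $D\Phi_X(\theta)$, and a vector `r` for the outer-loss gradient $\nabla\tilde L$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: J plays the role of D Phi_X(theta), r the role of grad Ltilde.
J = rng.standard_normal((2, 3))        # 1-dimensional kernel = vertical space
r = rng.standard_normal(2)

# Euclidean gradient of L = Ltilde o Phi_X, by the chain rule.
grad_euc = J.T @ r

# Horizontal lift of the quotient gradient: the function-induced metric on the
# horizontal space is g(u, v) = <J u, J v>, so u solves (J^T J) u = J^T r within
# the row space of J; the minimum-norm solution is the pseudoinverse applied to r.
u_quot = np.linalg.pinv(J) @ r

print(np.allclose(grad_euc, u_quot))   # generally False: different metrics
print(np.allclose(J @ u_quot, r))      # True: the quotient step realizes r exactly
```

Both vectors lie in the horizontal space, but they are gradients with respect to different inner products and generically differ.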
The Euclidean gradient depends on the ambient parameterization, whereas the quotient gradient is intrinsic to the function-level geometry induced by the sample outputs. Nonetheless, the Euclidean gradient admits a natural decomposition relative to the vertical-horizontal splitting. Let
$$P^{\mathrm{ver}}_\theta: T_\theta\Theta_{\mathrm{reg}} \to \mathcal{V}_\theta, \qquad P^{\mathrm{hor}}_\theta: T_\theta\Theta_{\mathrm{reg}} \to \mathcal{H}_\theta$$
denote the Euclidean orthogonal projections. Then
$$\nabla\mathcal{L}(\theta) = P^{\mathrm{ver}}_\theta\nabla\mathcal{L}(\theta) + P^{\mathrm{hor}}_\theta\nabla\mathcal{L}(\theta).$$
The next proposition shows that the vertical component does not influence the realized predictor even infinitesimally.

Proposition 4.1 (Function evolution is determined by the horizontal component). Let $\theta \in \Theta_{\mathrm{reg}}$, and let $T_\theta\Theta_{\mathrm{reg}} = \mathcal{V}_\theta \oplus \mathcal{H}_\theta$ be the vertical-horizontal decomposition, where $\mathcal{V}_\theta = T_\theta(G\cdot\theta) = \ker(D\Phi_X(\theta))$, and let $P^{\mathrm{ver}}_\theta$, $P^{\mathrm{hor}}_\theta$ denote the associated projections. Then for every $u \in T_\theta\Theta_{\mathrm{reg}}$,
$$D\Phi_X(\theta)[u] = D\Phi_X(\theta)[P^{\mathrm{hor}}_\theta u].$$
In particular, $D\Phi_X(\theta)[P^{\mathrm{ver}}_\theta u] = 0$. Consequently, if $\theta_t$ is a $C^1$ curve in $\Theta_{\mathrm{reg}}$, then
$$\frac{d}{dt}\Phi_X(\theta_t) = D\Phi_X(\theta_t)[P^{\mathrm{hor}}_{\theta_t}\dot\theta_t],$$
so the first-order evolution of the prediction vector depends only on the horizontal component of the velocity.

Proof. Fix $\theta \in \Theta_{\mathrm{reg}}$ and $u \in T_\theta\Theta_{\mathrm{reg}}$. Since $T_\theta\Theta_{\mathrm{reg}} = \mathcal{V}_\theta \oplus \mathcal{H}_\theta$, the vector $u$ admits the unique decomposition $u = P^{\mathrm{ver}}_\theta u + P^{\mathrm{hor}}_\theta u$ with $P^{\mathrm{ver}}_\theta u \in \mathcal{V}_\theta$ and $P^{\mathrm{hor}}_\theta u \in \mathcal{H}_\theta$. Applying the linear map $D\Phi_X(\theta)$ to this decomposition and using linearity,
$$D\Phi_X(\theta)[u] = D\Phi_X(\theta)[P^{\mathrm{ver}}_\theta u] + D\Phi_X(\theta)[P^{\mathrm{hor}}_\theta u].$$
By definition of the vertical space on the regular set, $\mathcal{V}_\theta = T_\theta(G\cdot\theta) = \ker(D\Phi_X(\theta))$. Therefore $P^{\mathrm{ver}}_\theta u \in \ker(D\Phi_X(\theta))$, and hence $D\Phi_X(\theta)[P^{\mathrm{ver}}_\theta u] = 0$.
Substituting this into the previous identity gives
$$D\Phi_X(\theta)[u] = D\Phi_X(\theta)[P^{\mathrm{hor}}_\theta u].$$
This proves the first claim, and the second claim has already been shown. For the final statement, let $\theta_t$ be a $C^1$ curve in $\Theta_{\mathrm{reg}}$. By the chain rule, $\frac{d}{dt}\Phi_X(\theta_t) = D\Phi_X(\theta_t)[\dot\theta_t]$. Applying the identity just proved at the point $\theta_t$ with $u = \dot\theta_t$, we obtain
$$\frac{d}{dt}\Phi_X(\theta_t) = D\Phi_X(\theta_t)[P^{\mathrm{hor}}_{\theta_t}\dot\theta_t].$$
Therefore the first-order evolution of the prediction vector depends only on the horizontal component of the parameter velocity.

Proposition 4.1 is the basic reason quotient geometry is relevant for optimization: function-space dynamics are blind to vertical motion. Different parameter trajectories may differ substantially in their orbit components while inducing the same first-order predictor evolution.

4.2 Projected dynamics and the quotient gradient

The quotient gradient has a canonical horizontal lift to parameter space. For $\theta \in \Theta_{\mathrm{reg}}$, define the unique horizontal vector $u_{\mathcal{L}}(\theta) \in \mathcal{H}_\theta$ by
$$Dq_\theta[u_{\mathcal{L}}(\theta)] = \operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]).$$
Equivalently, $u_{\mathcal{L}}(\theta)$ is characterized by the variational identity
$$g_\theta(u_{\mathcal{L}}(\theta), v) = D\mathcal{L}(\theta)[v] \quad \text{for all } v \in \mathcal{H}_\theta.$$
Existence and uniqueness follow from positive definiteness of $g_\theta|_{\mathcal{H}_\theta}$. This vector field defines the intrinsic steepest-descent direction in parameter space after removal of symmetry. Accordingly, the quotient gradient flow is naturally represented by the horizontal ODE
$$\dot\theta_t = -u_{\mathcal{L}}(\theta_t).$$
The next theorem formalizes that this horizontal dynamics is exactly the lifted gradient flow of the quotient loss.

Theorem 4.1 (Horizontal lift of quotient gradient flow).
Let $q: \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}} = \Theta_{\mathrm{reg}}/G$ be the quotient map, let $\bar g$ be the quotient metric from Theorem 3.1, and let $\bar{\mathcal{L}}: \mathcal{M}_{\mathrm{reg}} \to \mathbb{R}$ be the descended loss. For each $\theta \in \Theta_{\mathrm{reg}}$, let $u_{\mathcal{L}}(\theta) \in \mathcal{H}_\theta$ denote the unique horizontal vector satisfying
$$Dq_\theta[u_{\mathcal{L}}(\theta)] = \operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]),$$
equivalently
$$g_\theta(u_{\mathcal{L}}(\theta), v) = D\mathcal{L}(\theta)[v] \quad \text{for all } v \in \mathcal{H}_\theta.$$
Then the following hold.

1. Let $\gamma: [0,T) \to \mathcal{M}_{\mathrm{reg}}$ be a $C^1$ solution of the quotient gradient flow $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$. Let $s: U \to \Theta_{\mathrm{reg}}$ be a smooth local section of $q$ defined on an open set $U \subset \mathcal{M}_{\mathrm{reg}}$ such that $\gamma([0,T_0]) \subset U$ for some $T_0 < T$. Then the lifted curve $\theta_t := s(\gamma_t)$, $t \in [0,T_0]$, satisfies
$$Dq_{\theta_t}[\dot\theta_t] = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t),$$
and its horizontal component is the unique horizontal lift of the quotient velocity:
$$P^{\mathrm{hor}}_{\theta_t}\dot\theta_t = -u_{\mathcal{L}}(\theta_t).$$

2. Conversely, if $\theta: [0,T) \to \Theta_{\mathrm{reg}}$ is a $C^1$ solution of $\dot\theta_t = -u_{\mathcal{L}}(\theta_t)$, then the projected curve $\gamma_t := q(\theta_t)$ is a $C^1$ solution of the quotient gradient flow $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$.

Proof. Fix $\theta \in \Theta_{\mathrm{reg}}$. Since $Dq_\theta: T_\theta\Theta_{\mathrm{reg}} \to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ is surjective and $\ker(Dq_\theta) = \mathcal{V}_\theta$, its restriction to the horizontal space, $Dq_\theta|_{\mathcal{H}_\theta}: \mathcal{H}_\theta \to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$, is a linear isomorphism. Therefore, for the quotient gradient $\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]) \in T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ there exists a unique horizontal vector $u_{\mathcal{L}}(\theta) \in \mathcal{H}_\theta$ satisfying $Dq_\theta[u_{\mathcal{L}}(\theta)] = \operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta])$. Equivalently, for every $v \in \mathcal{H}_\theta$,
$$g_\theta(u_{\mathcal{L}}(\theta), v) = \bar g_{[\theta]}\big(Dq_\theta[u_{\mathcal{L}}(\theta)], Dq_\theta[v]\big) = \bar g_{[\theta]}\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta]), Dq_\theta[v]\big) = D\bar{\mathcal{L}}_{[\theta]}\big[Dq_\theta[v]\big] = D(\bar{\mathcal{L}} \circ q)_\theta[v] = D\mathcal{L}(\theta)[v].$$
Thus the two characterizations of $u_{\mathcal{L}}(\theta)$ are equivalent.
Let $\gamma: [0,T) \to \mathcal{M}_{\mathrm{reg}}$ be a $C^1$ solution of $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$, and let $s: U \to \Theta_{\mathrm{reg}}$ be a smooth local section of $q$ on an open set $U \subset \mathcal{M}_{\mathrm{reg}}$ with $\gamma([0,T_0]) \subset U$. Define $\theta_t := s(\gamma_t)$ for $t \in [0,T_0]$. Since $s$ and $\gamma$ are $C^1$, so is $\theta_t$. Because $q \circ s = \mathrm{id}_U$, differentiation gives
$$Dq_{s(x)} \circ Ds_x = \mathrm{id}_{T_xU} \quad \text{for all } x \in U.$$
Applying this with $x = \gamma_t$ and using the chain rule,
$$Dq_{\theta_t}[\dot\theta_t] = Dq_{s(\gamma_t)}\big[Ds_{\gamma_t}[\dot\gamma_t]\big] = D(q \circ s)_{\gamma_t}[\dot\gamma_t] = \dot\gamma_t.$$
Since $\gamma_t$ satisfies the quotient gradient flow, $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$, we conclude
$$Dq_{\theta_t}[\dot\theta_t] = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t),$$
which proves the first displayed identity. Decompose $\dot\theta_t$ into vertical and horizontal parts: $\dot\theta_t = P^{\mathrm{ver}}_{\theta_t}\dot\theta_t + P^{\mathrm{hor}}_{\theta_t}\dot\theta_t$. Since $P^{\mathrm{ver}}_{\theta_t}\dot\theta_t \in \mathcal{V}_{\theta_t} = \ker(Dq_{\theta_t})$, we have $Dq_{\theta_t}[\dot\theta_t] = Dq_{\theta_t}[P^{\mathrm{hor}}_{\theta_t}\dot\theta_t]$. Combining this with the identity above yields
$$Dq_{\theta_t}\big[P^{\mathrm{hor}}_{\theta_t}\dot\theta_t\big] = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t).$$
Now $P^{\mathrm{hor}}_{\theta_t}\dot\theta_t \in \mathcal{H}_{\theta_t}$ and $Dq_{\theta_t}|_{\mathcal{H}_{\theta_t}}$ is an isomorphism, so by uniqueness of the horizontal lift, $P^{\mathrm{hor}}_{\theta_t}\dot\theta_t = -u_{\mathcal{L}}(\theta_t)$. This proves the second identity.

Now let $\theta: [0,T) \to \Theta_{\mathrm{reg}}$ be a $C^1$ solution of $\dot\theta_t = -u_{\mathcal{L}}(\theta_t)$, and define $\gamma_t := q(\theta_t)$. Since $q$ and $\theta_t$ are $C^1$, so is $\gamma_t$. By the chain rule, the ODE, the definition of $u_{\mathcal{L}}(\theta_t)$, and $[\theta_t] = q(\theta_t) = \gamma_t$,
$$\dot\gamma_t = Dq_{\theta_t}[\dot\theta_t] = -Dq_{\theta_t}[u_{\mathcal{L}}(\theta_t)] = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}([\theta_t]) = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t).$$
Thus $\gamma_t$ is a solution of the quotient gradient flow.
This proves the converse statement and completes the proof.

Theorem 4.1 shows that the quotient dynamics can be studied through a horizontal representative in parameter space, but the dynamics itself is defined on the quotient. In particular, any two parameter trajectories with the same quotient projection induce the same intrinsic optimization path, regardless of how they move along symmetry orbits.

4.3 Euclidean gradient flow and quotient dynamics

The standard training dynamics in parameter space is the Euclidean gradient flow
$$\dot\theta_t = -\nabla\mathcal{L}(\theta_t).$$
Because $\mathcal{L}$ is $G$-invariant, the Euclidean gradient is orthogonal to the orbit directions in the first-order sense:
$$D\mathcal{L}(\theta)[v] = 0 \quad \text{for all } v \in \mathcal{V}_\theta.$$
This implies that $\nabla\mathcal{L}(\theta)$ lies in the Euclidean orthogonal complement of $\mathcal{V}_\theta$, that is, $\nabla\mathcal{L}(\theta) \in \mathcal{H}_\theta$. Hence Euclidean gradient flow is itself horizontal with respect to the chosen Euclidean splitting. This observation should not be confused with equivalence to quotient gradient flow: although both are horizontal, they are generated by different metrics. Indeed, the Euclidean gradient is characterized by the Euclidean pairing $\langle\nabla\mathcal{L}(\theta), v\rangle_{\mathrm{Euc}} = D\mathcal{L}(\theta)[v]$, whereas $u_{\mathcal{L}}(\theta)$ is characterized by the function-induced pairing $g_\theta(u_{\mathcal{L}}(\theta), v) = D\mathcal{L}(\theta)[v]$. Unless $g_\theta|_{\mathcal{H}_\theta}$ coincides with the Euclidean metric on $\mathcal{H}_\theta$, these vectors differ. Therefore Euclidean gradient flow and quotient gradient flow generally induce different time parameterizations and, in general, different trajectories even after projection to the quotient. What remains true is that the quotient geometry provides the correct local coordinates for understanding the function-level effect of Euclidean training. The Euclidean dynamics acts through horizontal motion, and its prediction-space velocity is
$$\frac{d}{dt}\Phi_X(\theta_t) = D\Phi_X(\theta_t)[-\nabla\mathcal{L}(\theta_t)] = D\Phi_X(\theta_t)\big[-P^{\mathrm{hor}}_{\theta_t}\nabla\mathcal{L}(\theta_t)\big].$$
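Both facts — the invisibility of vertical motion (Proposition 4.1) and the horizontality of the Euclidean gradient — admit a quick numerical check in a linearized sketch (a random matrix `J` is assumed here as a stand-in for $D\Phi_X(\theta)$):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed linearized data: J stands in for D Phi_X(theta), r for grad Ltilde.
J = rng.standard_normal((2, 3))                 # 1-dim kernel = vertical space
r = rng.standard_normal(2)

P_hor = np.linalg.pinv(J) @ J                   # projector onto row(J) = horizontal space
P_ver = np.eye(3) - P_hor                       # projector onto ker(J) = vertical space

# Proposition 4.1: vertical motion is invisible to the realization map.
u = rng.standard_normal(3)                      # arbitrary parameter velocity
print(np.allclose(J @ (P_ver @ u), 0))          # True
print(np.allclose(J @ u, J @ (P_hor @ u)))      # True

# Section 4.3: the Euclidean gradient J^T r is itself horizontal.
grad_euc = J.T @ r
print(np.allclose(P_hor @ grad_euc, grad_euc))  # True: no vertical component
```

The Euclidean gradient lands in the horizontal space automatically, yet it is still metrically distinct from the quotient gradient, as discussed above.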
By Proposition 4.1, only the horizontal component matters. Thus, even when one studies standard Euclidean gradient descent, the redundant vertical geometry is irrelevant to first-order predictor evolution.

4.4 Effective local geometry and convergence

We now formalize the local curvature quantities governing quotient dynamics. Let $[\theta] \in \mathcal{M}_{\mathrm{reg}}$ be a critical point of $\bar{\mathcal{L}}$, and let $H^{\mathrm{eff}}_{[\theta]}: T_{[\theta]}\mathcal{M}_{\mathrm{reg}} \to T_{[\theta]}\mathcal{M}_{\mathrm{reg}}$ be the effective Hessian introduced in Section 3, defined by
$$\bar g_{[\theta]}\big(H^{\mathrm{eff}}_{[\theta]}\xi, \eta\big) = \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi, \eta).$$
Since $H^{\mathrm{eff}}_{[\theta]}$ is self-adjoint with respect to $\bar g_{[\theta]}$, its spectrum is real. We define the effective extremal eigenvalues by
$$\lambda^{\mathrm{eff}}_{\min}([\theta]) := \min_{\xi \ne 0} \frac{\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi,\xi)}{\bar g_{[\theta]}(\xi,\xi)}, \qquad \lambda^{\mathrm{eff}}_{\max}([\theta]) := \max_{\xi \ne 0} \frac{\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}([\theta])(\xi,\xi)}{\bar g_{[\theta]}(\xi,\xi)}.$$
When $\lambda^{\mathrm{eff}}_{\min}([\theta]) > 0$, define the effective condition number
$$\kappa_{\mathrm{eff}}([\theta]) := \frac{\lambda^{\mathrm{eff}}_{\max}([\theta])}{\lambda^{\mathrm{eff}}_{\min}([\theta])}.$$
The next theorem gives the standard local linear convergence statement in quotient geometry.

Theorem 4.2 (Local convergence controlled by effective curvature). Let $(\mathcal{M}_{\mathrm{reg}}, \bar g)$ be the quotient manifold with quotient metric, and let $\bar{\mathcal{L}}: \mathcal{M}_{\mathrm{reg}} \to \mathbb{R}$ be a smooth function. Let $x_* = [\theta_*] \in \mathcal{M}_{\mathrm{reg}}$ be a critical point of $\bar{\mathcal{L}}$, and assume there exist a geodesically convex open neighborhood $U \subset \mathcal{M}_{\mathrm{reg}}$ of $x_*$ and constants $0 < \mu \le L < \infty$ such that for every $x \in U$ and every $\xi \in T_x\mathcal{M}_{\mathrm{reg}}$,
$$\mu\,\bar g_x(\xi,\xi) \le \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(x)(\xi,\xi) \le L\,\bar g_x(\xi,\xi).$$
Then:

1. $x_*$ is the unique minimizer of $\bar{\mathcal{L}}$ in $U$;

2. every $C^1$ solution $\gamma: [0,T) \to U$ of the quotient gradient flow $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$ satisfies
$$\bar{\mathcal{L}}(\gamma_t) - \bar{\mathcal{L}}(x_*) \le e^{-2\mu t}\big(\bar{\mathcal{L}}(\gamma_0) - \bar{\mathcal{L}}(x_*)\big) \quad \text{for all } t \in [0,T);$$

3.
in particular, the local convergence rate is controlled by the lower quotient-curvature bound $\mu$, while the local anisotropy is quantified by the effective condition number $L/\mu$.

Proof. Fix $x \in U$. Since $U$ is geodesically convex, for every $y \in U$ there exists a minimizing geodesic $c: [0,1] \to U$ with $c(0) = x$ and $c(1) = y$. Define $\varphi(t) := \bar{\mathcal{L}}(c(t))$. Since $c$ is a geodesic, the second derivative of $\varphi$ satisfies
$$\varphi''(t) = \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(c(t))\big(\dot c(t), \dot c(t)\big).$$
By the lower Hessian bound, $\varphi''(t) \ge \mu\,\bar g_{c(t)}(\dot c(t), \dot c(t))$. Because $c$ is a geodesic parameterized on a compact interval, $\bar g_{c(t)}(\dot c(t), \dot c(t))$ is constant in $t$; denote this constant by $\|\dot c\|_{\bar g}^2$. Hence
$$\varphi''(t) \ge \mu\,\|\dot c\|_{\bar g}^2.$$
This is precisely geodesic $\mu$-strong convexity of $\bar{\mathcal{L}}$ on $U$. Now take $x = x_*$, where $x_*$ is a critical point, so $\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x_*) = 0$. Then along any geodesic $c$ from $x_*$ to $y \in U$,
$$\varphi'(0) = D\bar{\mathcal{L}}_{x_*}[\dot c(0)] = \bar g_{x_*}\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x_*), \dot c(0)\big) = 0.$$
Integrating the lower bound on $\varphi''$ twice gives
$$\varphi(1) \ge \varphi(0) + \frac{\mu}{2}\|\dot c\|_{\bar g}^2,$$
equivalently
$$\bar{\mathcal{L}}(y) \ge \bar{\mathcal{L}}(x_*) + \frac{\mu}{2}\,d_{\bar g}(x_*, y)^2,$$
where $d_{\bar g}$ denotes the Riemannian distance on $U$. In particular, $\bar{\mathcal{L}}(y) > \bar{\mathcal{L}}(x_*)$ for all $y \in U \setminus \{x_*\}$. Thus $x_*$ is a strict local minimizer, and hence the unique minimizer of $\bar{\mathcal{L}}$ in $U$. This proves (1).

We now show that geodesic $\mu$-strong convexity implies the gradient inequality
$$\|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}^2 \ge 2\mu\big(\bar{\mathcal{L}}(x) - \bar{\mathcal{L}}(x_*)\big) \quad \text{for all } x \in U.$$
Fix $x \in U$, and let $c: [0,1] \to U$ be a minimizing geodesic from $x$ to $x_*$: $c(0) = x$, $c(1) = x_*$. Again define $\varphi(t) := \bar{\mathcal{L}}(c(t))$. By geodesic $\mu$-strong convexity,
$$\varphi(t) \le (1-t)\varphi(0) + t\varphi(1) - \frac{\mu}{2}t(1-t)\|\dot c\|_{\bar g}^2.$$
Differentiating this inequality at $t = 0$ yields
$$\varphi'(0) \le \varphi(1) - \varphi(0) - \frac{\mu}{2}\|\dot c\|_{\bar g}^2,$$
equivalently
$$\bar{\mathcal{L}}(x) - \bar{\mathcal{L}}(x_*) \le -\varphi'(0) - \frac{\mu}{2}\|\dot c\|_{\bar g}^2.$$
Now
$$\varphi'(0) = D\bar{\mathcal{L}}_x[\dot c(0)] = \bar g_x\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x), \dot c(0)\big),$$
so by the Cauchy–Schwarz inequality,
$$-\varphi'(0) = -\bar g_x\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x), \dot c(0)\big) \le \|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}\,\|\dot c(0)\|_{\bar g}.$$
Because $c$ is a constant-speed geodesic, $\|\dot c(0)\|_{\bar g} = \|\dot c\|_{\bar g}$. Therefore
$$\bar{\mathcal{L}}(x) - \bar{\mathcal{L}}(x_*) \le \|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}\,\|\dot c\|_{\bar g} - \frac{\mu}{2}\|\dot c\|_{\bar g}^2.$$
The right-hand side is a quadratic function of $a := \|\dot c\|_{\bar g} \ge 0$, namely $\|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}\,a - \frac{\mu}{2}a^2$. Maximizing over $a \ge 0$ gives
$$\bar{\mathcal{L}}(x) - \bar{\mathcal{L}}(x_*) \le \frac{1}{2\mu}\|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}^2,$$
equivalently
$$\|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x)\|_{\bar g}^2 \ge 2\mu\big(\bar{\mathcal{L}}(x) - \bar{\mathcal{L}}(x_*)\big).$$
Let $\gamma: [0,T) \to U$ be a $C^1$ solution of $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$. Along this flow, the chain rule and the defining property of the Riemannian gradient give
$$\frac{d}{dt}\bar{\mathcal{L}}(\gamma_t) = \bar g_{\gamma_t}\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t), \dot\gamma_t\big) = -\|\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)\|_{\bar g}^2.$$
Subtracting the constant $\bar{\mathcal{L}}(x_*)$ and applying the gradient inequality at $x = \gamma_t$,
$$\frac{d}{dt}\big(\bar{\mathcal{L}}(\gamma_t) - \bar{\mathcal{L}}(x_*)\big) \le -2\mu\big(\bar{\mathcal{L}}(\gamma_t) - \bar{\mathcal{L}}(x_*)\big).$$
Set $E(t) := \bar{\mathcal{L}}(\gamma_t) - \bar{\mathcal{L}}(x_*)$. Then $E(t) \ge 0$ and $E'(t) \le -2\mu E(t)$.
By Grönwall's inequality, $E(t) \le e^{-2\mu t}E(0)$, that is,
$$\bar{\mathcal{L}}(\gamma_t) - \bar{\mathcal{L}}(x_*) \le e^{-2\mu t}\big(\bar{\mathcal{L}}(\gamma_0) - \bar{\mathcal{L}}(x_*)\big).$$
This proves (2). Finally, statement (3) is immediate from the definitions: $\mu$ is the local lower curvature bound controlling the contraction rate, while $L/\mu$ is the corresponding local effective condition number measuring anisotropy of the quotient Hessian.

Theorem 4.2 makes precise the role of effective curvature: local optimization on the quotient is controlled by the spectrum of the effective Hessian, not by the full Euclidean Hessian in parameter space. In particular, large numbers of zero eigenvalues in the ambient Hessian may coexist with strong local contraction in the quotient, because the vertical degeneracy has been removed.

4.5 Gauge fixing and reduced coordinates

Although the quotient formulation is intrinsic, it is often convenient to work in gauge-fixed coordinates. Let $s: U \to \Theta_{\mathrm{reg}}$ be a smooth local section of the quotient map, and write $S = s(U) \subset \Theta_{\mathrm{reg}}$ for the corresponding gauge-fixed slice. Through the identification $q \circ s = \mathrm{id}_U$, the quotient gradient flow becomes a gradient flow on $U$ with respect to the pulled-back metric $g^S = s^*\bar g$ and the gauge-fixed loss $\mathcal{L}^S = \mathcal{L} \circ s = \bar{\mathcal{L}}|_U$. Thus
$$\dot z_t = -\operatorname{grad}_{g^S}\mathcal{L}^S(z_t), \qquad z_t \in U,$$
is exactly the local coordinate representation of quotient gradient flow. At a critical point $z_* \in U$, the corresponding second-order operator is the Riemannian Hessian of $\mathcal{L}^S$ with respect to $g^S$. By Proposition 3.1, this is the effective Hessian expressed in local coordinates:
$$\operatorname{Hess}_{g^S}\mathcal{L}^S = \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}.$$
Therefore, once a gauge is fixed, all local convergence and stability statements may be expressed in reduced coordinates without ambiguity from symmetry directions. This observation yields the following immediate corollary.

Corollary 4.1 (Gauge-fixed reduced dynamics).
Let $q: \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}}$ be the quotient map, let $\bar g$ be the quotient metric from Theorem 3.1, and let $\bar{\mathcal{L}}: \mathcal{M}_{\mathrm{reg}} \to \mathbb{R}$ be the descended loss. Let $s: U \to \Theta_{\mathrm{reg}}$ be a smooth local section of $q$ over an open set $U \subset \mathcal{M}_{\mathrm{reg}}$, and define
$$\mathcal{L}^S := \mathcal{L} \circ s = \bar{\mathcal{L}}|_U, \qquad g^S := s^*\bar g.$$
Then:

1. the quotient gradient flow on $U$, $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$, is, in the coordinates induced by $s$, exactly the Riemannian gradient flow of $\mathcal{L}^S$ with respect to the reduced metric $g^S$: $\dot z_t = -\operatorname{grad}_{g^S}\mathcal{L}^S(z_t)$;

2. for every critical point $z_* \in U$,
$$\operatorname{Hess}_{g^S}\mathcal{L}^S(z_*) = \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(z_*)$$
under the canonical identification $T_{z_*}U \simeq T_{z_*}\mathcal{M}_{\mathrm{reg}}$;

3. consequently, all local stability and local convergence statements for quotient gradient flow near $z_*$ are determined by the spectrum of the reduced Hessian $\operatorname{Hess}_{g^S}\mathcal{L}^S(z_*)$, equivalently by the spectrum of the effective Hessian on the quotient.

Proof. Since $s: U \to \Theta_{\mathrm{reg}}$ is a smooth local section of $q$, we have $q \circ s = \mathrm{id}_U$. Thus $s$ identifies $U$ with the embedded gauge-fixed submanifold $S := s(U) \subset \Theta_{\mathrm{reg}}$, and the reduced metric is defined by pullback: $g^S = s^*\bar g$. Likewise, because $\mathcal{L} = \bar{\mathcal{L}} \circ q$ on $\Theta_{\mathrm{reg}}$,
$$\mathcal{L}^S = \mathcal{L} \circ s = \bar{\mathcal{L}} \circ q \circ s = \bar{\mathcal{L}}|_U.$$
Let $z \in U$ and $\xi \in T_zU$. By definition of the Riemannian gradient with respect to $g^S$,
$$g^S_z\big(\operatorname{grad}_{g^S}\mathcal{L}^S(z), \xi\big) = D\mathcal{L}^S_z[\xi].$$
Using $g^S = s^*\bar g$ and $\mathcal{L}^S = \bar{\mathcal{L}}|_U$, we obtain $g^S_z(\operatorname{grad}_{g^S}\mathcal{L}^S(z), \xi) = \bar g_z(\operatorname{grad}_{g^S}\mathcal{L}^S(z), \xi)$ and $D\mathcal{L}^S_z[\xi] = D\bar{\mathcal{L}}_z[\xi]$. Therefore
$$\bar g_z\big(\operatorname{grad}_{g^S}\mathcal{L}^S(z), \xi\big) = D\bar{\mathcal{L}}_z[\xi] \quad \text{for all } \xi \in T_zU.$$
But $\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(z)$ is characterized by the same identity:
$$\bar g_z\big(\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(z), \xi\big) = D\bar{\mathcal{L}}_z[\xi] \quad \text{for all } \xi \in T_zU.$$
By uniqueness of the Riemannian gradient, $\operatorname{grad}_{g^S}\mathcal{L}^S(z) = \operatorname{grad}_{\bar g}\bar{\mathcal{L}}(z)$. Hence the ODE $\dot z_t = -\operatorname{grad}_{g^S}\mathcal{L}^S(z_t)$ coincides pointwise with $\dot z_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(z_t)$ on $U$. This proves the first claim.

Let $z_* \in U$ be a critical point of $\mathcal{L}^S$. Since $\mathcal{L}^S = \bar{\mathcal{L}}|_U$, the point $z_*$ is also a critical point of $\bar{\mathcal{L}}$ restricted to $U$. By Proposition 3.1, for every $z \in U$ and every $\xi, \eta \in T_zU$,
$$\operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(z)(\xi, \eta) = \operatorname{Hess}_{g^S}\mathcal{L}^S(z)(\xi, \eta).$$
Applying this at $z = z_*$ gives $\operatorname{Hess}_{g^S}\mathcal{L}^S(z_*) = \operatorname{Hess}_{\bar g}\bar{\mathcal{L}}(z_*)$ under the canonical identification $T_{z_*}U = T_{z_*}\mathcal{M}_{\mathrm{reg}}$, since $U$ is an open subset of $\mathcal{M}_{\mathrm{reg}}$. This proves the second claim.

Because the flow is exactly the quotient gradient flow written in the reduced coordinates induced by $s$, all local dynamical properties near $z_*$ are properties of the same Riemannian gradient system, merely expressed in different coordinates. In particular, the linearization of the gradient vector field at a critical point is determined by the corresponding Hessian. Since the Hessians agree, the local second-order geometry governing the flow near $z_*$ is the same whether described on the quotient manifold $(U, \bar g)$ or on the reduced coordinate chart $(U, g^S)$. More concretely, if the reduced Hessian $\operatorname{Hess}_{g^S}\mathcal{L}^S(z_*)$ is positive definite, then by Corollary 3.1 the critical point is a nondegenerate strict local minimizer modulo symmetry, and by Theorem 4.2 the local gradient-flow convergence rate is controlled by the extremal eigenvalues of this Hessian relative to the metric $g^S$. Since this Hessian equals the quotient Hessian, its spectrum coincides with the spectrum of the effective Hessian on the quotient.
Therefore all local stability and local convergence statements for quotient gradient flow near $z_*$ are determined by the spectrum of $\operatorname{Hess}_{g^S}\mathcal{L}^S(z_*)$, equivalently by the effective Hessian on the quotient.

Corollary 4.1 clarifies the role of gauge fixing. A good gauge does not change the intrinsic geometry; it merely provides coordinates on a local slice transverse to the symmetry orbits. In those coordinates, the quotient geometry appears as a reduced Riemannian optimization problem.

4.6 Interpretation

The results of this section establish a clear separation between three levels of description. First, parameter-space Euclidean gradient flow is the standard optimization dynamics used in practice, but its ambient geometry is contaminated by redundancy. Second, the realization map removes the functionally invisible directions and induces the quotient metric $\bar g$, which defines an intrinsic notion of steepest descent on the space of distinct predictors. Third, local convergence near critical points is governed by the effective Hessian, whose spectrum measures curvature only after symmetry directions have been removed. This viewpoint has two consequences that will matter later. On the one hand, apparently severe degeneracy in the Euclidean Hessian may be entirely attributable to symmetry and thus irrelevant for function-level training. On the other hand, once one passes to the quotient, the local dynamics is controlled by a nondegenerate geometric object whose curvature can be meaningfully related to optimization stability and, in the next section, to implicit bias. In summary, the quotient manifold $(\mathcal{M}_{\mathrm{reg}}, \bar g)$ is not merely a reformulation of parameter redundancy. It is the natural dynamical state space for training trajectories once the objective is viewed through the finite-sample realization map.
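In gauge-fixed coordinates the contraction estimate of Theorem 4.2 is easy to visualize. A sketch under assumptions (a two-dimensional quadratic reduced loss with Hessian eigenvalues $\mu = 0.5$ and $L = 2$, chosen here for illustration; the gradient flow is approximated by small explicit Euler steps):

```python
import numpy as np

# Assumed reduced loss: quadratic with Hessian eigenvalues mu = 0.5 and L = 2.0,
# so Theorem 4.2 predicts loss decay at rate at least 2*mu along the flow.
H = np.diag([0.5, 2.0])
mu = 0.5

def loss(x):
    return 0.5 * x @ H @ x              # minimizer x* = 0, loss(x*) = 0

x = np.array([1.0, 1.0])
T, n_steps = 2.0, 2000
dt = T / n_steps
E0 = loss(x)
for _ in range(n_steps):
    x = x - dt * (H @ x)                # Euler step of x' = -grad loss(x)

# Contraction at rate at least 2*mu, as in Theorem 4.2 (2):
print(loss(x) <= np.exp(-2 * mu * T) * E0)   # True
```

The slow direction (eigenvalue $\mu$) dominates the asymptotic rate, while the ratio $L/\mu = 4$ measures the anisotropy of the reduced landscape.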
5 Implicit Bias Through Quotient Geometry

Sections 3 and 4 established that the natural local geometry of shallow networks is not the ambient Euclidean geometry of the parameter space, but the quotient geometry induced by the finite-sample realization map after removal of symmetry directions. We now turn to a complementary question: when the empirical objective admits many minimizing parameter configurations, which predictors are selected by gradient-based training? The purpose of this section is to formulate an implicit-bias principle at the quotient level and to show that, on the regular set, the relevant notion of simplicity is a property of equivalence classes in the quotient manifold rather than of individual parameter representatives. Our emphasis is structural rather than fully global. We do not attempt here to classify all asymptotic limits of training dynamics for general shallow networks. Instead, we isolate a quotient-geometric mechanism that becomes visible whenever the loss admits a nontrivial zero-loss set and the gradient flow converges to that set within a regular region. The main message is that any meaningful bias must be defined modulo the symmetry group. Quotient geometry provides the correct framework for making this statement precise. Throughout the section, we retain the notation of the previous sections. In particular, $\Phi_X: \Theta_{\mathrm{reg}} \to \mathbb{R}^n$ denotes the finite-sample realization map, $q: \Theta_{\mathrm{reg}} \to \mathcal{M}_{\mathrm{reg}} = \Theta_{\mathrm{reg}}/G$ is the quotient map, $\bar g$ is the quotient metric induced by $\Phi_X$, and $\bar{\mathcal{L}}: \mathcal{M}_{\mathrm{reg}} \to \mathbb{R}$ is the descended empirical loss.

5.1 Quotient-level complexity

A basic obstacle in discussing implicit bias for overparameterized networks is that a single predictor corresponds to many parameter values. Any complexity measure defined directly on $\Theta$ is therefore representation-dependent unless it is invariant under the group action. This motivates a quotient-level notion of simplicity.
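A toy computation (an assumed rescaling model, not the paper's architecture) shows how minimizing a parameter-space complexity over an orbit produces an orbit-level quantity: for the squared Euclidean norm $R(a,b) = a^2 + b^2$ and the orbit $(a,b) \mapsto (ta, b/t)$, the infimum over the orbit is $2|ab|$, a function of the invariant product alone.

```python
import numpy as np

# Assumed toy rescaling model: R(a, b) = a^2 + b^2 on the orbit (t*a, b/t), t > 0.
a, b = 4.0, 0.25

ts = np.linspace(0.05, 5.0, 100001)
R_on_orbit = (ts * a) ** 2 + (b / ts) ** 2
R_bar = R_on_orbit.min()               # numerical orbit infimum

# The infimum is attained at the balanced representative |t*a| = |b/t|,
# giving the closed form 2*|a*b| -- a function of the predictor class only.
print(R_bar, 2 * abs(a * b))           # numerically close
```

A representative with a large norm, such as $(4, 0.25)$ here, is merely an unbalanced gauge choice; the orbit-level value $2|ab|$ is what a quotient complexity records.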
Let $R: \Theta_{\mathrm{reg}} \to \mathbb{R}_{\ge 0}$ be a continuous function that may be interpreted as a parameter-space complexity, such as a norm, a path-like quantity, or a gauge-fixing energy. We assume that $R$ is bounded from below and proper on each orbit, so that minimizing sequences along a fixed orbit admit convergent subsequences inside $\Theta_{\mathrm{reg}}$. The quotient complexity induced by $R$ is defined by
$$\bar R([\theta]) := \inf_{\theta' \in q^{-1}([\theta])} R(\theta').$$
Equivalently, $\bar R$ is the minimal value of $R$ among all parameter representatives of the same predictor class. By construction, $\bar R$ is attached to the orbit $[\theta]$, not to a particular parameterization. This definition formalizes a distinction that is often blurred in parameter-space discussions. A large Euclidean norm of a specific representative may reflect nothing more than an unfavorable gauge choice along the orbit. By contrast, $\bar R([\theta])$ records the smallest complexity compatible with the predictor class itself. It is therefore the natural notion of simplicity once parameter redundancy has been removed. The quotient point of view also clarifies the relation between complexity and identifiability. On the regular set, each $[\theta] \in \mathcal{M}_{\mathrm{reg}}$ is a locally identifiable predictor class, and the quantity $\bar R([\theta])$ measures simplicity within that class. Accordingly, any asymptotic preference exhibited by gradient flow should, if it is intrinsic, be expressible as a variational statement involving $\bar R$ on a subset of $\mathcal{M}_{\mathrm{reg}}$.

5.2 Zero-loss sets and quotient feasibility

The natural feasible set for implicit bias is the set of predictor classes satisfying the empirical interpolation constraint. For a prescribed loss level $c \in \mathbb{R}$, define
$$S_{\le c} := \{x \in \mathcal{M}_{\mathrm{reg}}: \bar{\mathcal{L}}(x) \le c\}, \qquad S_{=c} := \{x \in \mathcal{M}_{\mathrm{reg}}: \bar{\mathcal{L}}(x) = c\}.$$
Of particular interest is the interpolating set
$$\mathcal{Z} := \{x \in \mathcal{M}_{\mathrm{reg}}: \bar{\mathcal{L}}(x) = 0\},$$
whenever the training problem is realizable on the regular set.
Because $\bar{\mathcal{L}}$ is defined on the quotient, $\mathcal{Z}$ is the set of predictor classes fitting the data, not a subset of parameters. If interpolation is possible, then every element of $\mathcal{Z}$ corresponds to an entire orbit of parameter realizations. The implicit-bias question is therefore not which parameter vector is selected, but rather which orbit in $\mathcal{Z}$ is approached by training. This suggests the following general principle.

Quotient implicit-bias principle: when gradient flow converges to a zero-loss set, the selected limit should be characterized by a complexity functional defined on $\mathcal{Z} \subset \mathcal{M}_{\mathrm{reg}}$, rather than by a representative-dependent parameter norm.

The remainder of the section develops a local version of this principle and then illustrates it in a fully tractable model class.

5.3 A local quotient variational principle

We now formulate a local theorem stating that, under a compatibility condition between the gradient dynamics and a quotient complexity functional, the limiting orbit of the flow is characterized variationally inside the zero-loss component reached by the dynamics. The statement is intentionally local. Its role is to isolate the geometric mechanism of implicit bias without requiring a global classification of all zero-loss solutions. Let $\gamma: [0,\infty) \to \mathcal{M}_{\mathrm{reg}}$ be a $C^1$ solution of the quotient gradient flow $\dot\gamma_t = -\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(\gamma_t)$, and assume that $\bar{\mathcal{L}}(\gamma_t) \to 0$ as $t \to \infty$. Let $\mathcal{A} \subset \mathcal{M}_{\mathrm{reg}}$ be a closed forward-invariant set containing the trajectory and all of its accumulation points. We regard $\mathcal{A}$ as a region of the quotient in which the dynamics remains regular and the geometry is well defined. Let $\bar R: \mathcal{A} \to \mathbb{R}$ be a $C^1$ quotient complexity functional. We assume that there exists a continuous scalar function $\lambda: \mathcal{A} \to \mathbb{R}$ such that
$$\operatorname{grad}_{\bar g}\bar R(x) = \lambda(x)\operatorname{grad}_{\bar g}\bar{\mathcal{L}}(x) \quad \text{for all } x \in \mathcal{A}.$$
In words, the quotient complexity and the empirical loss have collinear gradients on the region explored by the dynamics.
This is exactly the situation in which decrease of the empirical loss induces monotonicity of the complexity variable. The next theorem gives a rigorous local variational statement.

Theorem 5.1 (Local quotient variational principle). Let $\gamma : [0,\infty) \to \mathcal{M}_{\mathrm{reg}}$ be a $C^1$ solution of the quotient gradient flow
$$\dot\gamma_t = -\operatorname{grad}_{\bar g} \bar{\mathcal{L}}(\gamma_t),$$
and assume that $\gamma_t \in A$ for all $t \ge 0$, where $A \subset \mathcal{M}_{\mathrm{reg}}$ is closed and forward invariant. Assume moreover that
$$\bar{\mathcal{L}}(\gamma_t) \to 0 \quad \text{and} \quad \gamma_t \to x_\infty \in Z \cap A \quad \text{as } t \to \infty,$$
where $Z := \{x \in \mathcal{M}_{\mathrm{reg}} : \bar{\mathcal{L}}(x) = 0\}$. Let $\bar R \in C^1(A)$ satisfy
$$\operatorname{grad}_{\bar g} \bar R(x) = \lambda(x)\, \operatorname{grad}_{\bar g} \bar{\mathcal{L}}(x) \quad \text{for all } x \in A,$$
for some continuous function $\lambda : A \to \mathbb{R}$ with $\lambda \ge 0$. Assume further that:

1. $\bar R$ is constant on each connected component of $Z \cap A$;
2. the connected component $C$ of $Z \cap A$ containing $x_\infty$ contains a unique minimizer $x^\star$ of $\bar R$.

Then $x_\infty = x^\star$. In particular, the quotient gradient flow selects the unique minimizer of $\bar R$ in the zero-loss component reached by the dynamics.

Proof. Since $\gamma$ is a $C^1$ solution of $\dot\gamma_t = -\operatorname{grad}_{\bar g} \bar{\mathcal{L}}(\gamma_t)$, the chain rule gives
$$\frac{d}{dt} \bar R(\gamma_t) = D\bar R_{\gamma_t}[\dot\gamma_t].$$
Using the defining property of the Riemannian gradient, we obtain
$$D\bar R_{\gamma_t}[\dot\gamma_t] = \bar g_{\gamma_t}\big(\operatorname{grad}_{\bar g} \bar R(\gamma_t),\, \dot\gamma_t\big).$$
Substituting the gradient-flow equation yields
$$\frac{d}{dt} \bar R(\gamma_t) = -\bar g_{\gamma_t}\big(\operatorname{grad}_{\bar g} \bar R(\gamma_t),\, \operatorname{grad}_{\bar g} \bar{\mathcal{L}}(\gamma_t)\big).$$
Now apply the collinearity assumption $\operatorname{grad}_{\bar g} \bar R(\gamma_t) = \lambda(\gamma_t)\, \operatorname{grad}_{\bar g} \bar{\mathcal{L}}(\gamma_t)$. Then
$$\frac{d}{dt} \bar R(\gamma_t) = -\lambda(\gamma_t)\, \big\|\operatorname{grad}_{\bar g} \bar{\mathcal{L}}(\gamma_t)\big\|_{\bar g}^2.$$
Since $\lambda(\gamma_t) \ge 0$, it follows that $\frac{d}{dt} \bar R(\gamma_t) \le 0$ for all $t \ge 0$. Hence $t \mapsto \bar R(\gamma_t)$ is nonincreasing on $[0,\infty)$. Because $\gamma_t \to x_\infty$ and $\bar R$ is continuous on $A$, we also have
$$\lim_{t\to\infty} \bar R(\gamma_t) = \bar R(x_\infty).$$
Let $C$ denote the connected component of $Z \cap A$ containing $x_\infty$. By assumption, $\bar R$ is constant on each connected component of $Z \cap A$. Therefore, for every $y \in C$, $\bar R(y) = \bar R(x_\infty)$; every point of $C$ has the same $\bar R$-value. Now let $x^\star \in C$ be the unique minimizer of $\bar R$ on $C$. Since $x^\star \in C$, the constancy of $\bar R$ on $C$ implies $\bar R(x^\star) = \bar R(x_\infty)$. Because $x^\star$ is the unique minimizer of $\bar R$ on $C$, any point in $C$ with the same $\bar R$-value as $x^\star$ must coincide with $x^\star$. But the above shows that $x_\infty \in C$ and $\bar R(x_\infty) = \bar R(x^\star)$. Therefore $x_\infty = x^\star$. ∎

Theorem 5.1 should be interpreted as a structural template rather than as a global theorem for arbitrary losses and architectures. Its importance is that the selected predictor is characterized by a quotient-level variational problem. The relevant optimization principle is not formulated on the parameter space $\Theta$, where orbits introduce artificial multiplicity, but on the quotient $\mathcal{M}_{\mathrm{reg}}$, where each point represents a distinct predictor class.

The assumption that $\bar R$ is constant on each connected zero-loss component is natural in the present setting. Indeed, whenever $Z$ is a smooth submanifold and $\operatorname{grad}_{\bar g} \bar{\mathcal{L}} = 0$ on $Z$, the collinearity assumption implies
$$\operatorname{grad}_{\bar g} \bar R = 0 \quad \text{on } Z,$$
and thus $\bar R$ is locally constant on each connected component of $Z$. The theorem isolates exactly this mechanism.

5.4 Interpretation in homogeneous models

The quotient formulation is especially natural for positively homogeneous networks. In such models, the scaling symmetry identifies entire families of parameter configurations representing the same local predictor geometry. Any parameter norm that is not invariant under this symmetry is therefore an unreliable indicator of predictor complexity. By contrast, a quotient complexity $\bar R$ removes this ambiguity automatically.
To see the conceptual advantage, suppose $R(\theta)$ is a smooth gauge-dependent functional, such as the squared Euclidean norm of a chosen representative. Along an orbit $\theta \sim g \cdot \theta$, the values $R(\theta)$ and $R(g \cdot \theta)$ may differ substantially even though the two parameterizations correspond to the same function. Hence a statement of the form "gradient flow prefers small $R(\theta)$" is not invariantly meaningful unless one first fixes a gauge. The quotient complexity $\bar R([\theta]) = \inf_{\theta' \in q^{-1}([\theta])} R(\theta')$ remedies this by encoding the smallest value of $R$ compatible with the predictor class itself.

This perspective also clarifies the role of balancing phenomena often observed in homogeneous models. A balanced representative is not intrinsically special because its Euclidean norm is small, but because it realizes an orbit in a canonical low-complexity gauge. What gradient flow selects, when viewed quotient-geometrically, is therefore not a specific balanced parameter vector but a predictor class whose orbit admits a distinguished low-complexity representative.

5.5 A solvable case: quadratic activation networks

We now specialize to the model
$$f_\theta(x) = \sum_{i=1}^m a_i (w_i^\top x)^2,$$
which provides a fully tractable instance of the quotient-bias perspective. As observed earlier, this model can be rewritten as
$$f_\theta(x) = x^\top Q(\theta)\, x, \qquad Q(\theta) := \sum_{i=1}^m a_i w_i w_i^\top.$$
Thus the predictor depends on $\theta$ only through the symmetric matrix $Q(\theta)$. In particular, the quotient manifold of regular predictor classes can be locally identified with an appropriate manifold of symmetric matrices of fixed rank and signature. Let $\Psi : \mathcal{M}_{\mathrm{reg}} \to \mathcal{Q}$ denote the induced local identification with the matrix space $\mathcal{Q}$, and write the empirical loss as
$$\bar{\mathcal{L}}([\theta]) = \tilde L\big(x_1^\top Q x_1, \dots, x_n^\top Q x_n\big), \qquad Q = \Psi([\theta]).$$
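The reduction $\theta \mapsto Q(\theta)$ is easy to check numerically. The following sketch (our own code, with illustrative names and sizes) verifies the identity $f_\theta(x) = x^\top Q(\theta)\, x$ and confirms that orbit-equivalent representatives, obtained by neuronwise rescaling and hidden-unit permutation, share the same quotient coordinate $Q$:

```python
import numpy as np

def predict(a, W, X):
    """Quadratic network f_theta(x) = sum_i a_i (w_i^T x)^2."""
    return ((X @ W.T) ** 2) @ a

def Q_of(a, W):
    """Quotient coordinate Q(theta) = sum_i a_i w_i w_i^T."""
    return (W.T * a) @ W

rng = np.random.default_rng(1)
m, d, n = 5, 3, 10
a, W = rng.normal(size=m), rng.normal(size=(m, d))
X = rng.normal(size=(n, d))

# The predictor depends on theta only through Q: f_theta(x) = x^T Q x.
Q = Q_of(a, W)
quad = np.einsum('ni,ij,nj->n', X, Q, X)
print(np.allclose(predict(a, W, X), quad))   # True

# Orbit-equivalent representative: neuronwise rescaling + permutation.
c, perm = rng.uniform(0.5, 2.0, size=m), rng.permutation(m)
a2, W2 = (a / c**2)[perm], (W * c[:, None])[perm]
print(np.allclose(Q, Q_of(a2, W2)))          # True: same quotient point
```

Distinct parameter vectors thus collapse to one matrix $Q$, which is exactly the sense in which $Q$ serves as a coordinate on the quotient.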
In this representation, the redundancy of the factorization disappears: distinct parameterizations with the same matrix $Q$ define the same point of the quotient. Consequently, complexity should be measured at the matrix level. A natural quotient complexity in this model is any spectral functional of $Q$, for instance the nuclear norm
$$\bar R([\theta]) = \|Q\|_*,$$
or, on a fixed-rank or fixed-signature stratum, a Frobenius-type energy. Since $Q$ is itself a quotient coordinate, these quantities are intrinsically defined on predictor classes and do not depend on a particular factorization. The next proposition records the resulting variational interpretation.

Proposition 5.1 (Matrix-level quotient bias for quadratic activation networks). Consider the quadratic activation model
$$f_\theta(x) = \sum_{i=1}^m a_i (w_i^\top x)^2, \qquad \theta = \big((a_1, w_1), \dots, (a_m, w_m)\big) \in \Theta_{\mathrm{reg}},$$
and define
$$Q(\theta) := \sum_{i=1}^m a_i w_i w_i^\top \in \mathrm{Sym}(d).$$
Assume that there exists an open set $U \subset \mathcal{M}_{\mathrm{reg}}$, an embedded smooth matrix manifold $\mathcal{Q} \subset \mathrm{Sym}(d)$, and a smooth bijection $\Psi : U \to \mathcal{Q}$ with smooth inverse, such that for every $[\theta] \in U$,
$$f_\theta(x) = x^\top \Psi([\theta])\, x \quad \text{for all } x \in \mathbb{R}^d.$$
Assume moreover that the quotient gradient-flow trajectory $\gamma : [0,\infty) \to U$ of $\bar{\mathcal{L}}$ is contained in a closed forward-invariant set $A \subset U$, converges to some $x_\infty \in Z \cap A$, and that the hypotheses of Theorem 5.1 hold on $A$ for a quotient complexity functional $\bar R : U \to \mathbb{R}$ depending only on $Q = \Psi([\theta])$. Let $C$ be the connected component of $Z \cap A$ containing $x_\infty$. If the minimizer of $\bar R$ over $C$ is unique, then the limiting predictor is characterized by the matrix-level variational problem
$$Q_\infty = \arg\min \big\{ \bar R(Q) : Q \in \Psi(C) \big\}, \qquad Q_\infty := \Psi(x_\infty).$$

Proof. By assumption, $\Psi : U \to \mathcal{Q}$ is a diffeomorphism onto the smooth matrix manifold $\mathcal{Q}$, and for every $[\theta] \in U$, $f_\theta(x) = x^\top \Psi([\theta])\, x$.
Thus the value $Q = \Psi([\theta])$ is an intrinsic coordinate of the quotient point $[\theta]$: it depends only on the predictor class and not on the particular parameter representative $\theta$. In particular, the image under $\Psi$ of any subset of $U$ is a subset of the matrix manifold $\mathcal{Q}$, and the connected component $C \subset Z \cap A$ is mapped diffeomorphically onto the connected subset $\Psi(C) \subset \mathcal{Q}$.

By assumption, $\bar R$ depends only on $Q = \Psi([\theta])$. Equivalently, there exists a function $\tilde R : \mathcal{Q} \to \mathbb{R}$ such that $\bar R = \tilde R \circ \Psi$ on $U$. To see this, define
$$\tilde R(Q) := \bar R\big(\Psi^{-1}(Q)\big), \qquad Q \in \mathcal{Q}.$$
This is well defined because $\Psi$ is bijective on $U$, and smooth because both $\bar R$ and $\Psi^{-1}$ are smooth. Therefore minimizing $\bar R$ over a subset of $U$ is equivalent to minimizing $\tilde R$ over the corresponding subset of $\mathcal{Q}$. In particular,
$$\arg\min_{x \in C} \bar R(x) = \Psi^{-1}\Big( \arg\min_{Q \in \Psi(C)} \tilde R(Q) \Big).$$
Since the proposition writes $\bar R(Q)$ directly at the matrix level, we identify $\tilde R$ with $\bar R$ by abuse of notation. Under this identification, $\bar R(\Psi(x)) = \bar R(x)$ for all $x \in U$.

All hypotheses of Theorem 5.1 are assumed to hold on $A$. Therefore, since $\gamma_t \to x_\infty \in C$, Theorem 5.1 implies that $x_\infty = \arg\min_{x \in C} \bar R(x)$, and this minimizer is unique. Applying the diffeomorphism $\Psi$ to both sides yields
$$\Psi(x_\infty) = \arg\min_{Q \in \Psi(C)} \bar R(Q).$$
Defining $Q_\infty := \Psi(x_\infty)$, we obtain
$$Q_\infty = \arg\min \big\{ \bar R(Q) : Q \in \Psi(C) \big\}. \qquad \blacksquare$$

The set $\Psi(C)$ is precisely the matrix representation of the zero-loss quotient component reached by the dynamics. Since $\Psi$ is a quotient coordinate map, different parameter factorizations corresponding to the same matrix $Q$ have already been identified. Consequently, the variational characterization above is intrinsic: it is a selection principle on matrices, not on parameter representatives. Proposition 5.1 illustrates why quotient geometry is particularly effective in simple homogeneous models.
The apparent parameter redundancy of the neural network becomes an ordinary matrix-factorization redundancy, and the implicit bias becomes a variational selection principle on the corresponding matrix manifold. From this viewpoint, quotient geometry does not merely remove nuisance directions; it exposes the lower-dimensional object on which the bias is actually acting.

5.6 False simplicity versus intrinsic simplicity

The distinction between parameter-space and quotient-space complexity parallels the distinction between false flatness and intrinsic flatness introduced in Section 3. The analogous dichotomy for simplicity is as follows. A parameter configuration may appear simple because it has small Euclidean norm, balanced layers, or sparse coordinates in a particular gauge. Yet such properties are not intrinsic unless they are stable under the symmetry action. They may vary substantially along the same orbit and therefore reflect representational choice rather than predictor structure. We refer to this phenomenon as false simplicity.

By contrast, intrinsic simplicity is a property of the orbit itself. It is measured by a quotient-level functional such as $\bar R$, which assigns the same value to all parameterizations of the same predictor class. The implicit-bias statements of this section concern intrinsic simplicity, not false simplicity.

This distinction is important in overparameterized networks. A training trajectory may drift substantially in parameter space while remaining close to a fixed predictor class in the quotient. If one tracks only a gauge-dependent complexity of the parameters, one may misinterpret such drift as evidence for or against simplicity. Quotient geometry resolves this ambiguity: only motion in $\mathcal{M}_{\mathrm{reg}}$ changes the realized predictor class, and only quotient complexity can capture a genuine preference among functionally distinct solutions.
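The drift phenomenon is easy to visualize in the quadratic model: moving along a purely vertical (gauge) path changes the parameter norm continuously, while the quotient coordinate $Q$, and hence the predictor, stays fixed. A minimal sketch, with our own illustrative names:

```python
import numpy as np

def Q_of(a, W):
    # quotient coordinate Q(theta) = sum_i a_i w_i w_i^T
    return (W.T * a) @ W

rng = np.random.default_rng(5)
m, d = 3, 2
a, W = rng.normal(size=m), rng.normal(size=(m, d))

# A purely vertical (gauge) path: continuous rescaling
# (a_i, w_i) -> (c^-2 a_i, c w_i) with c = exp(s).
Q0 = Q_of(a, W)
norms, drifts = [], []
for s in np.linspace(0.0, 1.0, 11):
    c = np.exp(s)
    a_s, W_s = a / c**2, W * c
    norms.append(np.sum(a_s**2) + np.sum(W_s**2))   # gauge-dependent R
    drifts.append(np.linalg.norm(Q_of(a_s, W_s) - Q0))

print(max(norms) - min(norms) > 0)   # the parameter norm drifts...
print(max(drifts) < 1e-10)           # ...while Q (the predictor) does not
```

Tracking only the first quantity would report substantial "motion" even though the realized predictor class never changes, which is exactly the false-simplicity trap described above.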
5.7 Consequences for the remainder of the theory

The results of this section do not claim a complete characterization of implicit bias for all shallow networks. Rather, they establish the correct geometric form such a characterization must take. Any meaningful asymptotic bias theorem for a symmetric overparameterized model should answer a quotient-level variational question: which point of $Z \subset \mathcal{M}_{\mathrm{reg}}$ is selected by the training dynamics? The quotient metric $\bar g$ governs the local descent geometry, the effective Hessian governs local curvature, and the quotient complexity $\bar R$ supplies the correct notion of simplicity on the feasible set.

This viewpoint suggests two general lessons. First, implicit bias should be formulated on the space of predictor classes rather than on the raw parameter space. Second, in tractable models, quotient coordinates may convert an apparently neural-network-specific bias question into an ordinary variational selection problem on a lower-dimensional manifold. The quadratic activation model is the clearest example, but the same structural principle extends to broader homogeneous architectures whenever the symmetry-reduced representation can be made explicit. In the next section, we complement the theory with numerical illustrations showing that quotient-level curvature and quotient-level simplicity are more stable indicators of optimization behavior than their raw parameter-space analogues.

6 Numerical Illustrations of Quotient Geometry

We now complement the preceding theory with numerical experiments in deliberately simple shallow-network models. The goal of this section is not to demonstrate competitive predictive performance, but to make the geometric claims of Sections 2–5 directly observable in a controlled setting.
All experiments are conducted with the quadratic activation model
$$f_\theta(x) = \sum_{i=1}^m a_i (w_i^\top x)^2 = x^\top Q(\theta)\, x, \qquad Q(\theta) := \sum_{i=1}^m a_i w_i w_i^\top,$$
for which the quotient object is explicit: distinct parameterizations with the same matrix $Q(\theta)$ represent the same predictor. This makes the model especially suitable for testing the distinction between ambient parameter-space geometry and quotient-level geometry.

Our experiments address three questions. First, does ambient flatness depend on the chosen parameter representative even when the realized predictor is unchanged? Second, do local optimization dynamics correlate more naturally with quotient-level curvature than with raw parameter-space curvature? Third, in an underdetermined regime where many quotient-level solutions fit the data, is implicit bias more naturally described in terms of matrix-level complexity than in terms of raw parameter norms?

6.1 Experimental setup

Unless otherwise stated, we use synthetic data generated from a teacher quadratic model of the same form. All computations are carried out in double precision. Optimization uses PyTorch with GPU acceleration, and all Hessians are computed exactly by automatic differentiation for the low-dimensional models considered here. Since our interest is geometric rather than statistical, we work in small dimensions where spectra and matrix-level quantities can be inspected directly. For the quadratic model, two types of quantities are distinguished throughout:

1. Parameter-space quantities, such as the Euclidean Hessian with respect to $\theta$, the Euclidean parameter norm $\|\theta\|$, and path-like gauge-dependent complexities.
2. Quotient-level quantities, computed from the matrix $Q(\theta)$, such as the Hessian of the loss in $Q$-coordinates, the Frobenius norm $\|Q\|_F$, the nuclear norm $\|Q\|_*$, the stable-rank surrogate $\|Q\|_F^2 / \|Q\|_{\mathrm{op}}^2$, and the singular spectrum of $Q$.
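The quotient-level quantities in the second item are simple spectral functions of $Q$. A self-contained sketch (illustrative code, not the paper's implementation) computes them from the singular spectrum:

```python
import numpy as np

def quotient_summaries(Q):
    """Spectral summaries of the quotient coordinate Q:
    Frobenius norm, nuclear norm, operator norm, stable-rank surrogate."""
    s = np.linalg.svd(Q, compute_uv=False)   # singular spectrum of Q
    fro = float(np.sqrt(np.sum(s**2)))       # ||Q||_F
    nuc = float(np.sum(s))                   # ||Q||_*
    op = float(s[0])                         # ||Q||_op
    return {"fro": fro, "nuc": nuc, "op": op,
            "stable_rank": fro**2 / op**2}   # ||Q||_F^2 / ||Q||_op^2

# On a diagonal example the summaries are readable by hand:
# singular values (3, 1, 0), so ||Q||_* = 4 and stable rank = 10/9,
# which lies between 1 and rank(Q) = 2 as it should.
summ = quotient_summaries(np.diag([3.0, 1.0, 0.0]))
print(summ["nuc"], summ["stable_rank"])
```

Because these quantities depend on $\theta$ only through $Q(\theta)$, they are automatically constant along symmetry orbits, unlike $\|\theta\|$ or path-like complexities.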
Because the quadratic model admits the neuronwise scaling symmetry
$$(a_i, w_i) \sim (c_i^{-2} a_i,\, c_i w_i),$$
as well as hidden-unit permutation symmetry, it provides a particularly transparent setting in which the quotient construction can be visualized directly.

6.2 Symmetry-induced false flatness

Our first experiment tests the prediction of Section 3 that ambient flatness is representation-dependent, whereas quotient-coordinate curvature is intrinsic. We first train a quadratic network to essentially zero training error and then construct several orbit-equivalent representatives by combining hidden-unit permutations with neuronwise rescalings. Numerically, these representatives realize the same predictor on the sample set up to machine precision and induce the same matrix $Q(\theta)$. We then compare the Euclidean Hessian spectrum in parameter space with the Hessian spectrum in $Q$-space.

The outcome is unambiguous. The Euclidean Hessian changes under symmetry-preserving rescaling even though the realized predictor does not change. In the plotted spectra, pure permutations leave the ambient Hessian almost unchanged, while rescalings visibly alter the small-eigenvalue region and hence the apparent local flatness. By contrast, after passing to quotient coordinates, the $Q$-space Hessian is numerically identical across all orbit-equivalent representatives. The comparison between the full spectra and the zoomed-in small-eigenvalue region makes clear that much of what appears as Euclidean flatness is not an intrinsic property of the predictor at all, but rather an artifact of redundant parameterization.

This is precisely the numerical counterpart of the distinction developed in Section 3. Near-zero eigenvalues of the ambient Euclidean Hessian need not indicate predictor-level flatness; they may simply encode directions tangent to the symmetry orbit, or more generally directions distorted by the choice of representative.
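The core of this false-flatness experiment can be reproduced in a few lines. The sketch below is our own minimal version: it uses finite-difference Hessians in place of the paper's exact autodiff Hessians, a tiny teacher-student instance, and an arbitrary rescaling; it checks that a rescaled representative still interpolates while its ambient Hessian spectrum differs from the original one.

```python
import numpy as np

def loss(theta, X, y, m, d):
    """Squared loss of the quadratic network at parameter vector theta."""
    a, W = theta[:m], theta[m:].reshape(m, d)
    f = ((X @ W.T) ** 2) @ a
    return np.mean((f - y) ** 2)

def fd_hessian(fun, theta, h=1e-4):
    """Central finite-difference Hessian (illustrative substitute
    for the paper's exact automatic-differentiation Hessians)."""
    p = theta.size
    H = np.zeros((p, p))
    for i in range(p):
        for j in range(p):
            t = np.zeros(p); t[i] += h
            s = np.zeros(p); s[j] += h
            H[i, j] = (fun(theta + t + s) - fun(theta + t - s)
                       - fun(theta - t + s) + fun(theta - t - s)) / (4 * h * h)
    return H

rng = np.random.default_rng(4)
m, d, n = 2, 2, 8
a, W = rng.normal(size=m), rng.normal(size=(m, d))
X = rng.normal(size=(n, d))
y = ((X @ W.T) ** 2) @ a                 # teacher = student: zero loss

theta = np.concatenate([a, W.ravel()])
c = np.array([2.0, 0.5])                 # orbit-equivalent representative
theta2 = np.concatenate([a / c**2, (W * c[:, None]).ravel()])

f = lambda t: loss(t, X, y, m, d)
ev1 = np.linalg.eigvalsh(fd_hessian(f, theta))
ev2 = np.linalg.eigvalsh(fd_hessian(f, theta2))

print(np.isclose(f(theta2), 0.0))               # same predictor, still zero loss
print(not np.allclose(ev1, ev2, rtol=1e-2))     # but different ambient spectra
```

For squared loss the Hessian in $Q$-coordinates is the same for both representatives by construction (it depends on $\theta$ only through $Q$), so the entire spectral discrepancy above is an artifact of the parameterization, which is the point of Section 6.2.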
In the quadratic model, the quotient coordinate $Q$ removes this ambiguity completely. The experiment therefore provides a direct illustration of symmetry-induced false flatness: the same predictor class can appear to have different local curvature in ambient coordinates, while its quotient-coordinate curvature remains fixed.

Figure 1: Euclidean Hessian spectra across orbit-equivalent representatives (left) and projected Hessian spectra after removing scaling-orbit directions (right). The variation in the Euclidean spectrum reflects representation dependence, while the projected spectrum is invariant.

Figure 2: Parameter-space Hessian spectra vary across orbit-equivalent representatives (left), whereas $Q$-space Hessian spectra are numerically identical (right), confirming the intrinsic nature of quotient-level curvature.

Figure 3: Zoomed view of the small-eigenvalue region: the Euclidean Hessian (left) changes under rescaling, while the projected proxy (right) remains stable, demonstrating that apparent flatness stems from symmetry-induced redundancy.

6.3 Quotient-level curvature and local dynamics

Our second experiment addresses the relation between local optimization behavior and curvature. The theoretical claim of Section 4 is not that every aspect of Euclidean training is determined by quotient geometry, but rather that the predictor-relevant local geometry should be more naturally organized by quotient-level curvature than by raw ambient curvature. To make this effect numerically visible, we work with quadratic classification under logistic loss rather than squared loss: with logistic loss, the Hessian in $Q$-coordinates genuinely varies with the point $Q$, so quotient-level curvature can meaningfully be compared across nearby states. We evaluate local behavior at an intermediate checkpoint before the loss enters the saturated near-zero regime, where second-order quantities become numerically degenerate and local decay is too weak to be informative.
Around this checkpoint we construct two classes of nearby points: orbit-equivalent controls, obtained by permutation and mild rescaling and therefore preserving Q, and nearby non-equivalent perturbations, which change Q while remaining in a comparable loss band. For each point we compute the Euclidean Hessian in parameter space, the Hessian in Q-space, several spectral summaries, and a short-run empirical log-loss decay rate under gradient descent. The stable conclusion is that the most informative descriptors of short-run local decay are not ambient condition numbers, but quotient-level curvature magnitudes. Across multiple random seeds, the trace of the Q-space Hessian, its Frobenius norm, and its smallest positive eigenvalue show the strongest and most consistent relation to the empirical decay rate, whereas ambient parameter-space condition numbers are less stable and less informative. This is exactly the pattern visible in the pooled scatter plots: parameter-space condition numbers do not organize the local decay rates cleanly, while quotient-level summaries separate the observed regimes much more coherently. The correct interpretation is also conceptually natural. A condition number measures anisotropy, whereas the short-run descent speed in this local experiment is governed more directly by the overall magnitude of local curvature. The evidence therefore supports a refined version of the theoretical message of Section 4: local dynamics are better organized by quotient-level curvature summaries than by ambient curvature descriptors tied to a redundant coordinate system. The orbit-equivalent controls provide an additional internal consistency check. For such controls, the quotient coordinate Q is unchanged, so quotient-level curvature is essentially unchanged as well, and the observed short-run dynamics remain nearly identical. The ambient parameter-space curvature can nevertheless vary slightly across these representatives. 
This is exactly what one expects if predictor evolution is governed by quotient-level geometry, while the ambient parameterization contributes only secondary numerical variation that should not be interpreted as a change in the intrinsic local landscape.

Figure 4: Pooled scatter plots of short-run empirical log-loss decay rates versus various curvature descriptors. Quotient-level quantities (e.g., $Q$-space trace and Frobenius norm) exhibit stronger correlation with local dynamics than ambient parameter-space condition numbers.

6.4 Quotient-level implicit bias in an underdetermined quadratic model

Our third experiment addresses the implicit-bias perspective of Section 5. To make quotient-level selection visible, it is necessary to work in a regime where the data do not determine a unique quotient solution. We therefore move to an underdetermined quadratic regression setting with
$$n < \dim\big(\mathrm{Sym}(d)\big) = \frac{d(d+1)}{2},$$
so that multiple distinct matrices $Q$ can interpolate the training data.

The experiment has two complementary parts. First, starting from a fixed trained solution, we construct several orbit-equivalent representatives using permutations and rescalings. Second, we repeat training from multiple random initializations in the underdetermined regime and compare the learned quotient objects across seeds.

The first part confirms the invariance claim that underlies the quotient-level formulation of complexity. Along a fixed orbit, gauge-dependent parameter complexities such as the Euclidean norm and path-like parameter complexity vary substantially across representatives, whereas quotient-level matrix quantities derived from $Q$, including $\|Q\|_F$ and $\|Q\|_*$, remain invariant up to numerical precision. The corresponding scatter plot makes this distinction explicit: ambient parameter complexity changes within the same orbit, while quotient complexity is stable at numerical precision.
Thus complexity is not an intrinsic property of a particular representative in parameter space, but of the quotient object itself. The second part reveals the genuinely quotient-level nature of implicit bias. In contrast to the fully determined setting of the earlier experiment, different random initializations now converge to different near-interpolating matrices Q. This is visible most clearly in the learned singular spectra, which vary substantially across seeds, with some runs displaying faster spectral decay, lower effective rank, or greater concentration in leading singular directions than others. At the same time, raw parameter norms do not provide a clean coordinate system for organizing these solutions: similar parameter norms can correspond to substantially different matrices Q, and path-like parameter complexities do not map transparently to the structure of the learned predictor. By contrast, matrix-level quantities such as the Frobenius norm, nuclear norm, stable-rank surrogate, and especially the singular spectrum directly describe how the learned quotient solutions differ. In this sense, the quotient-level selection problem becomes visible only after passing from θ-space to Q-space. This evidence supports the main qualitative claim of Section 5. We do not claim that gradient descent in this setting exactly minimizes one specific convex matrix complexity such as the nuclear norm. The experiment is not designed to identify a universal closed-form variational principle. The point is more basic and more robust: once multiple quotient-feasible solutions exist, the meaningful comparison among learned solutions is naturally expressed in quotient coordinates. Implicit bias, insofar as it is visible here, acts at the level of predictor classes rather than gauge-dependent parameter representatives. 
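The underdetermined regime is easy to set up directly at the matrix level. Each interpolation constraint $x_i^\top Q x_i = y_i$ is linear in $\mathrm{vec}(Q)$, so with $n < d(d+1)/2$ constraints the feasible set of interpolating matrices is an affine subspace of $\mathrm{Sym}(d)$, and picking, say, the minimum-Frobenius-norm member is one distinguished quotient-level selection. This is an illustration of the selection problem only, not the paper's experiment: gradient descent on $\theta$ need not select this particular $Q$.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n = 4, 6                      # n < d*(d+1)/2 = 10: underdetermined
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Each constraint x^T Q x = <x x^T, Q> = y is linear in vec(Q),
# so row i of the design matrix is vec(x_i x_i^T).
A = np.stack([np.outer(x, x).ravel() for x in X])

# lstsq returns the minimum-Euclidean-norm solution of A q = y,
# i.e. the minimum-Frobenius-norm interpolating Q.  It is symmetric
# because it lies in the span of the symmetric matrices x_i x_i^T.
q, *_ = np.linalg.lstsq(A, y, rcond=None)
Q = q.reshape(d, d)

print(np.allclose(A @ q, y))     # interpolates the data
print(np.allclose(Q, Q.T))       # lands in Sym(d)
```

Any element of the nullspace of $A$ (intersected with $\mathrm{Sym}(d)$) can be added to $Q$ to produce other interpolants with different spectra, which is precisely the multiplicity among which implicit bias selects.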
Figure 5: Gauge-dependent parameter complexity varies within a single symmetry orbit (left panel), while quotient-level complexity remains invariant (right panel), illustrating that meaningful complexity resides in the quotient object $Q$.

Figure 6: Across random seeds, raw parameter norms and path-like complexities show no coherent structure (left panels), whereas quotient-level quantities such as $\|Q\|_F$, $\|Q\|_*$, stable rank, and training loss reveal organized patterns, supporting a quotient-level view of implicit bias.

Figure 7: Learned singular spectra of $Q$ across different random initializations. Substantial variation in spectral decay and effective rank demonstrates that implicit bias operates at the level of the quotient object.

6.5 Discussion of the numerical evidence

Taken together, the experiments in this section support the geometric picture developed in the previous sections. First, ambient parameter-space curvature is representation-dependent, whereas quotient-coordinate curvature is intrinsic. Second, local optimization dynamics are better organized by quotient-level curvature magnitudes than by raw ambient condition numbers. Third, in underdetermined regimes the meaningful notion of complexity is attached to the quotient object $Q$, and implicit-bias questions are therefore naturally posed on the quotient manifold rather than in the redundant ambient parameterization. These three observations are exactly the experimentally robust consequences of the theory that the present quadratic model makes fully visible.

7 Conclusion

This paper develops a quotient-geometric framework for understanding symmetry, curvature, dynamics, and implicit bias in simple shallow neural networks. The main starting point is that overparameterized networks are not naturally described by their raw parameter space: hidden-unit permutations, rescalings, and related symmetries produce large equivalence classes of parameter vectors that realize the same predictor.
Once this redundancy is taken seriously, several familiar phenomena in neural-network theory can be reinterpreted as geometric consequences of working in an excessively large ambient space.

Our first contribution is structural. On a regular set, we characterize the symmetry action and the resulting quotient space of predictor classes. This identifies the correct reduced state space on which locally identifiable predictors live.

Our second contribution is geometric. Using the finite-sample realization map, we define a function-induced metric on the quotient and derive an effective curvature notion that removes orbit directions and isolates the intrinsic local geometry of the predictor class. This clarifies why Euclidean Hessians in parameter space can be highly degenerate even when the underlying predictor is not intrinsically flat.

Our third contribution is dynamical. We show that gradient flow admits a vertical-horizontal decomposition, where vertical motion is pure gauge and horizontal motion governs function-level evolution. This provides a clean interpretation of optimization after symmetry reduction.

Our fourth contribution is conceptual. We formulate implicit bias at the quotient level, arguing that meaningful complexity should be defined on predictor classes rather than on individual parameter representatives. The quadratic-activation model provides a particularly transparent realization of this program. In that setting, the quotient object can be represented explicitly by a symmetric matrix $Q$, making the symmetry-reduced geometry directly observable.
The numerical experiments confirm the core theoretical predictions that are robustly visible in this model: ambient flatness is representation-dependent, quotient-coordinate curvature is intrinsic, local dynamics correlate more naturally with quotient-level curvature summaries than with raw ambient quantities, and in underdetermined regimes different feasible solutions are most meaningfully compared through matrix-level complexity rather than through gauge-dependent parameter norms.

Several directions remain open. First, it would be valuable to extend the quotient-metric and effective-curvature construction beyond the simplest shallow models to broader classes of homogeneous and partially homogeneous architectures. Second, the quotient-level implicit-bias principle developed here is qualitative; a sharper asymptotic characterization, analogous to max-margin or minimum-complexity results in other settings, would strengthen the theory considerably. Third, while our experiments make the quotient structure fully visible in quadratic models, developing numerically stable quotient-level curvature estimators for more general nonlinear shallow networks remains an important challenge.

More broadly, our results suggest that many questions about optimization and generalization in overparameterized learning may be better posed after symmetry reduction. From this viewpoint, the quotient space of predictor classes is not merely a technical convenience, but the natural geometric arena for the theory of symmetric neural networks.

Acknowledgments and Disclosure of Funding

This research received no external funding.
Srebro (2017) Implicit regularization in matrix factorization. In Advances in Neural Information Processing Systems, Vol. 30, p. 6151–6159. Cited by: §1. K. Hornik, M. Stinchcombe, and H. White (1989) Multilayer feedforward networks are universal approximators. Neural Networks 2 (5), p. 359–366. Cited by: §1. A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In Advances in Neural Information Processing Systems, Vol. 31, p. 8571–8580. Cited by: §1. J. M. Lee (2003) Introduction to smooth manifolds. Graduate Texts in Mathematics, Vol. 218, Springer New York, New York, NY. Cited by: §1. J. M. Lee (2018) Introduction to riemannian manifolds. Graduate Texts in Mathematics, Vol. 275, Springer, Cham. Cited by: §1. S. Mei, A. Montanari, and P. M. Nguyen (2018) A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences 115 (33), p. E7665–E7671. Cited by: §1. G. Meyer, S. Bonnabel, and R. Sepulchre (2011) Regression on fixed-rank positive semidefinite matrices: a riemannian approach. Journal of Machine Learning Research 12, p. 593–625. Cited by: §1. B. Mishra, G. Meyer, S. Bonnabel, and R. Sepulchre (2014) Fixed-rank matrix factorizations and Riemannian low-rank optimization. Computational Statistics 29 (3–4), p. 591–621. Cited by: §1. B. Neyshabur, R. Tomioka, and N. Srebro (2015) Norm-based capacity control in neural networks. In Proceedings of the 28th Conference on Learning Theory (COLT), Proceedings of Machine Learning Research, Vol. 40, p. 1376–1401. Cited by: §1. A. Pinkus (1999) Approximation theory of the MLP model in neural networks. Acta Numerica 8, p. 143–195. Cited by: §1. D. Soudry, E. Hoffer, M. S. Nacson, S. Gunasekar, and N. Srebro (2018) The implicit bias of gradient descent on separable data. Journal of Machine Learning Research 19 (70), p. 1–57. Cited by: §1. B. Vandereycken (2013) Low-rank matrix completion by Riemannian optimization. 
SIAM Journal on Optimization 23 (2), p. 1214–1236. Cited by: §1.
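The contrast between gauge-dependent parameter norms and gauge-invariant matrix-level complexity can be made concrete in the quadratic-activation model. The following is a minimal NumPy sketch, not taken from the paper's code: it assumes a two-layer network f(x) = Σᵢ aᵢ(wᵢᵀx)² = xᵀMx with M = Σᵢ aᵢwᵢwᵢᵀ, and all variable names, shapes, and the rescaling convention are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 8                      # input dimension, hidden width (illustrative)
a = rng.normal(size=m)           # output weights a_i
W = rng.normal(size=(m, d))      # hidden weights, one row w_i per unit

def predictor_matrix(a, W):
    """M = sum_i a_i w_i w_i^T; the predictor f(x) = x^T M x depends only on M."""
    return (W * a[:, None]).T @ W

# Move along the rescaling symmetry orbit: (a_i, w_i) -> (a_i / c_i^2, c_i w_i)
# leaves every term a_i w_i w_i^T, and hence the predictor, unchanged.
c = rng.uniform(0.2, 5.0, size=m)
a2, W2 = a / c**2, c[:, None] * W

M1, M2 = predictor_matrix(a, W), predictor_matrix(a2, W2)
print(np.allclose(M1, M2))                    # True: same predictor

# Gauge-dependent parameter norm vs. gauge-invariant matrix-level complexity.
param_norm = lambda a, W: np.sum(a**2) + np.sum(W**2)
nuclear = lambda M: np.linalg.svd(M, compute_uv=False).sum()
print(param_norm(a, W), param_norm(a2, W2))   # generally differ along the orbit
print(np.isclose(nuclear(M1), nuclear(M2)))   # True: identical complexity
```

Both parameter vectors represent the same point of the quotient space, yet their ambient Euclidean norms differ; only quotient-level quantities such as the nuclear norm of M compare them meaningfully.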