Paper deep dive
Superposition in Graph Neural Networks
Lukas Pertl, Han Xuanyuan, Pietro Liò
Models: GAT, GATv2, GCN, GIN
Summary
The paper investigates superposition in Graph Neural Networks (GNNs), where multiple features share latent space directions. Using controlled experiments on graph concepts, the authors analyze how width, pooling, and topology influence feature entanglement. They introduce metrics like Effective Rank (EffRank), Superposition Index (SI), and Welch-Normalized Overlap (WNO) to quantify geometry. Findings indicate that increasing width creates a phase pattern in overlap, sharper pooling (e.g., max pooling) reduces channel sharing, and shallow models can exhibit rank collapse. The study provides practical insights for designing more interpretable GNNs.
Entities (7)
Relation Signals (3)
GNN → exhibits → Superposition
confidence 95% · We study superposition, the sharing of directions by multiple features, directly in the latent space of GNNs.
Width → modulates → Superposition
confidence 92% · As the final hidden layer widens, feature overlap first decreases, then briefly increases around the point where capacity matches the number of concepts
Max Pooling → reduces → Superposition
confidence 90% · Making pooling more 'winner‑take‑all' (e.g., max pooling) increases alignment to coordinate axes and reduces node‑level feature sharing.
Cypher Suggestions (2)
Find all GNN architectures analyzed in the paper · confidence 90% · unvalidated
MATCH (e:Entity {entity_type: 'Model Architecture'}) RETURN e.name
Identify relations between architectural choices and superposition · confidence 85% · unvalidated
MATCH (a:Entity)-[r:RELATION]->(b:Entity {name: 'Superposition'}) RETURN a.name, r.relation, b.name
Abstract
Interpreting graph neural networks (GNNs) is difficult because message passing mixes signals and internal channels rarely align with human concepts. We study superposition, the sharing of directions by multiple features, directly in the latent space of GNNs. Using controlled experiments with unambiguous graph concepts, we extract features as (i) class-conditional centroids at the graph level and (ii) linear-probe directions at the node level, and then analyze their geometry with simple basis-invariant diagnostics. Across GCN/GIN/GAT we find: increasing width produces a phase pattern in overlap; topology imprints overlap onto node-level features that pooling partially remixes into task-aligned graph axes; sharper pooling increases axis alignment and reduces channel sharing; and shallow models can settle into metastable low-rank embeddings. These results connect representational geometry with concrete design choices (width, pooling, and final-layer activations) and suggest practical approaches for more interpretable GNNs.
Tags
Links
- Source: https://arxiv.org/abs/2509.00928
- Canonical: https://arxiv.org/abs/2509.00928
Full Text
Superposition in Graph Neural Networks

Lukas Pertl¹* · Han Xuanyuan²* · Pietro Liò¹
¹University of Cambridge ²Independent Researcher
ljfp2@cantab.ac.uk · hxuany@outlook.com · pl219@cam.ac.uk
*Equal contribution.

Abstract

Interpreting graph neural networks (GNNs) is difficult because message passing mixes signals and internal channels rarely align with human concepts. We study superposition, the sharing of directions by multiple features, directly in the latent space of GNNs. Using controlled experiments with unambiguous graph concepts, we extract features as (i) class-conditional centroids at the graph level and (ii) linear-probe directions at the node level, and then analyze their geometry with simple basis-invariant diagnostics. Across GCN/GIN/GAT we find: increasing width produces a phase pattern in overlap; topology imprints overlap onto node-level features that pooling partially remixes into task-aligned graph axes; sharper pooling increases axis alignment and reduces channel sharing; and shallow models can settle into metastable low-rank embeddings. These results connect representational geometry with concrete design choices (width, pooling, and final-layer activations) and suggest practical approaches for more interpretable GNNs.

1 Introduction

Understanding what features a model represents and how they are arranged in latent space is central to trustworthy ML. For graph neural networks (GNNs), this is unusually hard. Unlike pixels or tokens, graphs do not offer a fixed coordinate system across inputs; receptive fields are relational and variable; and many signals are structural (motifs, roles, spectral patterns) rather than obviously human-readable. As a result, most GNN interpretability work answers which nodes or edges mattered, not what internal features the network formed or how those features are arranged (Ying et al., 2019; Luo et al., 2020; Yuan et al., 2022; Kakkad et al., 2023).
A key obstacle to interpretability is superposition: many features are packed in fewer directions, creating polysemantic channels and entangled geometries (Elhage et al., 2022; Scherlis et al., 2023). Superposition has been studied in MLPs and transformers, but its behavior in GNNs, where message passing and pooling constrain geometry, remains underexplored. This motivates the central theme of our work, which asks: How does superposition arise in GNNs, how do architectural and graph-structural choices modulate it, and what are the downstream consequences for interpretability? To answer these questions, we build small, controllable datasets where the relevant graph concepts are unambiguous (e.g., "two adjacent identical types" or "contains a triangle and a hexagon"), train standard GNNs, and look directly at the geometry of their internal representations – both at the node level (before pooling) and at the graph level (after pooling). We then use two simple, basis-invariant diagnostics: one that says "how many distinct axes are effectively used" and another that says "how tightly packed the directions are compared with an ideal arrangement." Our study yields several main findings: • Width produces a phase pattern. As the final hidden layer widens, feature overlap first decreases, then briefly increases around the point where capacity matches the number of concepts, then decreases again. At high width, graph‑level features approach near‑ideal packing, but node‑level concepts often remain entangled – evidence that readout layers can disentangle what the message‑passing stack keeps mixed. • Pooling sharpness encourages axis alignment. Making pooling more "winner‑take‑all" (e.g., max pooling) increases alignment to coordinate axes and reduces node‑level feature sharing. We explain this with two complementary arguments: gradients concentrate on large coordinates, and under noise the axis‑aligned choice loses less information than an oblique one. 
• Rank collapse can appear even in shallow GNNs. In some runs the numerical rank of the pooled representation stays below the number of learned features while the accuracy remains high. We link this to hard gating after aggregation (exactly zeroing channels) and to a loss‑driven preference for "mutually obtuse" class directions that resists activating new dimensions.

Contributions. (i) A representation‑centric framework for analyzing superposition in GNNs that works across architectures and layers; (ii) robust, simple diagnostics and feature extraction procedures (centroids and probes) that avoid assumptions about neurons or coordinates; (iii) empirical maps of how width, topology, and pooling shape superposition; and (iv) practical guidance – when to expect entanglement, when pooling helps, and how to avoid low‑rank traps.

2 Background

2.1 Superposition and the interpretability of internal representations

Classic unit analyses in CNNs found channels that act like concept detectors (Zeiler and Fergus, 2014; Zhou et al., 2014; Bau et al., 2017; Olah et al., 2017); related work spans RNNs and transformers (Karpathy et al., 2015; Strobelt et al., 2017; Gurnee et al., 2023, 2024; Geva et al., 2020), and early GNN studies exist (Xuanyuan et al., 2023). However, concepts need not be axis‑aligned: they can correspond to arbitrary directions or nonlinear detectors. Directional tools (TCAV, linear probes) therefore treat a feature as a vector in latent space (Kim et al., 2018; Alain and Bengio, 2016); recent sparse‑autoencoder work aims to demix polysemantic units (Bricken et al., 2023; Gao et al., 2024). Superposition is the phenomenon where models represent more features than available dimensions by allowing features to share neurons/directions, producing polysemantic units and entangled directions (Elhage et al., 2022). Both neuron-level and mechanistic analyses benefit from (approximate) decomposability: having independently meaningful, linearly separable features.
Superposition violates this assumption, obscuring neuron-level semantics and complicating circuit reconstruction. Methods that demix features (e.g., sparse autoencoders) can mitigate these issues (Bricken et al., 2023; Gao et al., 2024).

2.2 Representations in GNNs

A message‑passing layer aggregates neighbor information and applies a local transform,

$H^{(l+1)} = \phi\left(\mathrm{Agg}(H^{(l)}, A),\, W^{(l)}\right),$ (1)

with architecture‑specific aggregation (e.g., normalized sum for GCN (Kipf and Welling, 2017), unnormalized sum + MLP for GIN (Xu et al., 2019), attention for GAT (Veličković et al., 2018)). After $L$ layers, node embeddings $H^{(L)}$ are pooled into a graph embedding $h_G = g(H^{(L)})$ via sum/mean/max.

Superposition pressure in GNNs. Message passing repeatedly mixes many neighbor features into fixed-width hidden states. This creates capacity pressure that encourages superposition – multiple features sharing the same channels – leading to polysemantic units and entangled directions. While superposition has been analyzed in toy feedforward/transformer settings, to our knowledge there is little analogous work for GNNs. This gap further complicates neuron-level semantics and the reconstruction of GNN circuits.

3 Quantifying Superposition in Representations

3.1 Interpreting and extracting features

Prior work often reasons about superposition at the input to layers (e.g., pixels/tokens). For GNNs, we study it where it matters for interpretability: in the latent spaces of nodes and graphs. We ask: how many features are packed into how few directions, and how well are those directions separated? Following (Alain and Bengio, 2016; Kim et al., 2018; Elhage et al., 2022; Bricken et al., 2023; Gao et al., 2024), we treat a feature as a direction in a model's latent space that supports a prediction.
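A minimal NumPy sketch of the layer update in Eq. (1) and the readout $g$; `gcn_layer` and `mean_pool` are illustrative names, and the GCN-style symmetric normalization with a ReLU is one concrete choice of aggregation and nonlinearity, not the paper's exact implementation:

```python
import numpy as np

def gcn_layer(H, A, W):
    """One message-passing step in the form of Eq. (1):
    H^(l+1) = phi(Agg(H^(l), A), W^(l)), here with GCN-style
    symmetric normalization and a ReLU nonlinearity."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    deg = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))     # D^{-1/2}
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt     # normalized adjacency
    return np.maximum(A_norm @ H @ W, 0.0)       # aggregate, transform, ReLU

def mean_pool(H):
    """Graph readout g: average node embeddings into h_G."""
    return H.mean(axis=0)
```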
For a GNN with forward pass $X \xrightarrow{\phi_1} H^{(1)} \xrightarrow{\phi_2} \cdots \xrightarrow{\phi_L} H^{(L)} \xrightarrow{g} h_G \to \hat{y}$, each $\phi_l$ is a message‑passing block, $H^{(l)} \in \mathbb{R}^{n_l \times d_l}$ are node embeddings, $g$ is a graph‑pool, and $h_G \in \mathbb{R}^{d}$ is the graph embedding. We interpret features in two complementary ways.

(1) Linear‑probe features (model‑decodable concepts). Linear probes are a well-established method for finding feature directions (Akhondzadeh et al., 2023). Given embeddings $h_G$ (graph‑level) or rows of $H^{(l)}$ (node‑level) and binary targets $y_\ell \in \{0,1\}$ for concept $\ell$, we fit a logistic probe on a held‑out split: $s_\ell = w_\ell^\top z + b_\ell$, with $z \in \{h_G\}$ or $\{H^{(l)}_{i:}\}$. The probe normal $w_\ell$ (unit-normalized) is taken as the feature direction for concept $\ell$. Since all our geometry is directional, the intercept $b_\ell$ is irrelevant.

(2) Class-conditional centroids (task-aligned concepts). On a held‑out split, we form one‑hot sets using the ground truth $y$: $S_\ell^{GT} = \{G : y(G) = e_\ell\}$, $c_\ell = \frac{1}{|S_\ell^{GT}|} \sum_{G \in S_\ell^{GT}} h_G$, and stack rows to obtain $C \in \mathbb{R}^{K \times d}$. In the linear/LDA regime, discriminants align with class means (Alain and Bengio, 2016), so centroid directions approximate task‑discriminative axes and aggregate the combined effects of message passing and pooling.

Active features. Not all labels are reliably learned in every seed; including non-existent or highly confused features would distort superposition measurements. We therefore define an active feature $\ell$ using only the held-out split:
• Probes: a concept $\ell$ is active if $\mathrm{AUC}_\ell \geq 0.60$.
• Centroids: class $\ell$ is active if in‑class recall $\geq 0.5$ and all off‑diagonal recalls $< 0.5$.
Let $\mathcal{L}_{\mathrm{act}}$ be the active set ($|\mathcal{L}_{\mathrm{act}}| = k_a$). We form the feature matrix by stacking unit vectors: either $\{\hat{w}_\ell\}_{\ell \in \mathcal{L}_{\mathrm{act}}}$ (probes) or $\{\hat{c}_\ell\}_{\ell \in \mathcal{L}_{\mathrm{act}}}$ (centroids).
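The two extraction routes can be sketched as follows. `centroid_features` and `probe_direction` are illustrative names; the probe here uses a least-squares fit as a stand-in for the paper's logistic probe (only the direction is used, so the intercept is dropped either way):

```python
import numpy as np

def centroid_features(h_G, y):
    """Class-conditional centroids: mean graph embedding per class,
    stacked into C (one row per class)."""
    classes = np.unique(y)
    return np.stack([h_G[y == k].mean(axis=0) for k in classes])

def probe_direction(Z, targets):
    """Unit-normalized probe normal for one binary concept.
    (Least-squares stand-in for a logistic probe; the intercept b
    is fitted but discarded, since only the direction matters.)"""
    Zc = np.c_[Z, np.ones(len(Z))]      # affine fit: [w; b]
    w, *_ = np.linalg.lstsq(Zc, targets.astype(float), rcond=None)
    w = w[:-1]                          # drop intercept b
    return w / np.linalg.norm(w)
```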
In the main results we use class-conditional centroids plus linear probes as complementary views: centroids assess task-aligned geometry, probes assess model-decodable directions.

3.2 Proposed Metrics

We propose metrics to measure superposition for a $k_a \times d$ feature matrix $C$.

Effective rank (EffRank). With singular values $\sigma_1 \geq \sigma_2 \geq \cdots$ and $p_i = \sigma_i / \sum_j \sigma_j$, the entropy‑based effective rank is

$\mathrm{EffRank}(C) = \exp\left(-\sum_i p_i \log p_i\right).$ (2)

It estimates the number of effectively used axes. For centroids we COM‑center $C^\circ = C - \mathbf{1}\bar{c}^\top$ with $\bar{c} = \frac{1}{k_a} \sum_\ell c_\ell$ before computing EffRank, removing a global bias direction that no practitioner would interpret as a mechanism. For probe normals we use the raw $C$.

Superposition Index (SI). To express features per effective axis we use the absolute index

$\mathrm{SI} = \frac{k_a}{\mathrm{EffRank}(C)}.$ (3)

$\mathrm{SI} = 1$ indicates no extra sharing beyond having $k_a$ independent directions; $\mathrm{SI} > 1$ means multiple features share axes (greater superposition). This avoids dependence on the ambient $d$.

Welch-Normalized Overlap (WNO). While SI measures the mean number of features distributed over each axis, it does not consider their angular geometry.¹ To complement SI, we therefore introduce an angular overlap metric called the Welch-normalized overlap: let $\tilde{C}$ be the matrix obtained from $C$ by normalizing each row to unit $\ell_2$ norm and define $\overline{\cos^2} = \frac{2}{k_a(k_a - 1)} \sum_{i<j} \langle \tilde{c}_i, \tilde{c}_j \rangle^2$.

¹ For example, consider two different arrangements packing 6 vectors into 2D: (i) a near-regular hexagon, and (ii) two antipodal clusters. Both have $\mathrm{EffRank} \approx 2$.
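A minimal NumPy sketch of EffRank (Eq. 2) and SI (Eq. 3), assuming the rows of `C` are the stacked (active) feature vectors; `com_center` implements the centroid preprocessing described above:

```python
import numpy as np

def eff_rank(C):
    """Entropy-based effective rank of Eq. (2)."""
    s = np.linalg.svd(C, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                       # drop numerically zero mass
    return float(np.exp(-(p * np.log(p)).sum()))

def superposition_index(C):
    """SI = k_a / EffRank(C), Eq. (3); >1 means features share axes."""
    return C.shape[0] / eff_rank(C)

def com_center(C):
    """COM-centering used for centroids: remove the global mean row."""
    return C - C.mean(axis=0, keepdims=True)
```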
We compare $\overline{\cos^2}$ to (i) the random baseline $1/d_{\mathrm{eff}}$ (independent unit vectors in $\mathbb{R}^{d_{\mathrm{eff}}}$) and (ii) the Welch lower bound $\mu_*^2 = \max\!\left(0, \frac{k_a - d_{\mathrm{eff}}}{d_{\mathrm{eff}}(k_a - 1)}\right)$, and report

$\mathrm{WNO} = 1 - \frac{1/d_{\mathrm{eff}} - \overline{\cos^2}}{1/d_{\mathrm{eff}} - \mu_*^2},$ (4)

so $\mathrm{WNO} = 0$ is Welch‑optimal (least overlap), $\mathrm{WNO} = 1$ is random, and $\mathrm{WNO} > 1$ is worse‑than‑random packing. The ambient $d$ can be much larger than the subspace actually used, so we report an intrinsic WNO: set $r = \lceil \mathrm{EffRank}(C^\dagger) \rceil$ (with $C^\dagger = C^\circ$ for centroids and $C^\dagger = C$ for probes), project $C$ to the top‑$r$ right‑singular subspace to obtain $C_r$, row‑normalize, and take $d_{\mathrm{eff}} = r$ in (4). For centroids and graph‑level probes we additionally remove a single shared offset direction before the SVD (PC1 removal); for node‑level probes we do not remove PC1.² If $r \leq 1$ or $k_a \leq 1$, WNO is undefined and we report NA.

² PC1 removal is appropriate when a single global bias dominates all classes (e.g., pooled magnitude); removing it improves sensitivity to relative packing. For node probes this shared offset is not guaranteed and we keep the raw geometry.

Invariances and scope. EffRank and WNO are invariant to right‑multiplication by orthogonal matrices (basis changes) and to uniform rescaling of rows; they probe geometry intrinsic to the representation rather than coordinate choices. SI quantifies features‑per‑axis pressure; WNO quantifies how tightly those axes are packed relative to principled baselines.

4 Effects of Model Width and Graph Topology

4.1 Datasets

The Pairwise dataset. We introduce a dataset where graphs are equal-length linear chains of 20 nodes. Each node has an $n$-dimensional one-hot/zero feature. Class $k$ is active if two adjacent nodes both activate dimension $k$, yielding sparse multi-class targets.
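The Pairwise labeling rule can be sketched directly; `pairwise_labels` and `sample_chain` are illustrative names, and the sampling details are assumptions (the exact construction is in Appendix A.1):

```python
import numpy as np

def pairwise_labels(X):
    """Multi-label targets for a chain graph whose rows follow the
    chain order: class k fires iff two adjacent nodes both activate
    feature dimension k.  X: (num_nodes, k), rows one-hot or zero."""
    both = X[:-1] * X[1:]                 # adjacent-pair co-activation
    return (both.sum(axis=0) > 0).astype(int)

def sample_chain(num_nodes=20, k=16, p=0.5, rng=None):
    """Sample node features: with probability p a node gets a random
    one-hot type, otherwise the zero vector (sparsity control)."""
    rng = rng or np.random.default_rng(0)
    X = np.zeros((num_nodes, k))
    active = rng.random(num_nodes) < p
    X[active, rng.integers(0, k, active.sum())] = 1.0
    return X
```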
The task is attribute-driven: since each class is associated with identical graph structure, we can probe for superposition whilst factoring out the effects of varying topology. Sparsity is controlled by the activation probability $p$ of a node having a one-hot (non-zero) feature. Exact details for reproduction are included in Appendix A.1.

The Conjunction dataset. To isolate topology-driven superposition from effects of extraneous input attributes, we introduce a graph family in which all node features are initialized to the node's degree, and supervision otherwise depends only on the presence of simple cycles. Let $C_\ell(G)$ denote the indicator that graph $G$ contains at least one simple cycle of length $\ell$. We instantiate four motif detectors $\ell \in \{3, 4, 5, 6\}$ (triangles, squares, pentagons, hexagons) and define two labels by conjunctions of cycles: $y_A(G) = \mathbf{1}[C_3(G) \wedge C_6(G)]$ and $y_B(G) = \mathbf{1}[C_4(G) \wedge C_5(G)]$. All other graphs receive $y_A = y_B = 0$. Thus the same lower-level motif concepts $\{C_3, C_4, C_5, C_6\}$ are re-used across the two tasks, but each label requires a distinct pair. This cleanly separates (i) the formation of topological features from (ii) how a model combines them. A graph is formed by sampling required counts for the four motifs according to the label pattern and then adding $\mathrm{Binomial}(8, 0.5)$ extra copies of each motif. Exact details for reproduction are included in Appendix A.2. Motif instances are placed on disjoint node sets and connected by random bridges to avoid trivial overlaps. Because there are no informative node attributes, any successful solution must construct the motif detectors purely via message passing.

4.2 Width-induced superposition

We study how width $d$ affects superposition on PAIRWISE ($k = 16$) over 16 seeds. The first hidden layer is set to the input width ($k$) and the second layer to $d$, inducing a bottleneck when $d < k$.
We measure (a) class-conditional centroids in the graph space and (b) node-level concepts $\mathrm{Is}_t$ and $\mathrm{NextTo}_t$ via linear probes on the last pre-pooling layer: let $x_i \in \{e_1, \ldots, e_k\}$ be the one-hot node type and $N_i$ the neighbors of $i$. For type $t$,

$\mathrm{Is}_t(i) := \mathbf{1}[x_i = e_t], \qquad \mathrm{NextTo}_t(i) := \mathbf{1}[\exists j \in N_i : x_j = e_t].$

We keep only active features: centroids with in-class recall $> 0.5$ and off-class recall $< 0.5$, and probes with AUROC $> 0.6$. We report intrinsic WNO (computed in the EffRank-selected subspace). Centroids are COM-centered for EffRank; probe normals are not. Optimizer and training details for this experiment and subsequent ones are given in Appendix B.1.

Figure 1: Superposition effects across bottleneck dimensions $d$. A dashed line corresponds to $k = d$. Shaded regions show uncertainty as mean $\pm 1.96\sigma/\sqrt{R}$, where $R = 16$ is the number of seeds. (a) Class-conditional centroids of the graph embeddings. (b) Linear-probe features of the NextTo concept family after the second message-passing layer. (c) Linear-probe features of the Is concept family after the second message-passing layer.

We observe a consistent three-phase transition with increasing $d$ for both graph centroids and node concepts (GCN/GIN/GAT): an initial improvement in packing (SI↓, WNO↓), a reversal near $d \approx k$ (SI↑, WNO↑), and a final decay back to good packing. The effect is stronger for node features: centroids approach Welch-optimal geometry (WNO → 0, SI → 1), whereas node probes often stabilize around random or worse packing (WNO ≥ 1). This suggests that superposition is largely resolved by the readout but not eliminated at the last node layer. Architecture trends follow inductive bias. For GIN, the post-aggregation MLP tends to funnel many labels into shared axes ("generic neighbor/type pressure + small residuals"), which the readout later separates by sign/magnitude at the graph level.
GAT shows peaks that are present but typically smaller; attention can separate channels earlier, while GCN produces smoother, less volatile curves. The WNO/SI peak typically coincides with saturation of the number of active features/probes and the onset of a flat test loss. This aligns with two competing effects of more width: feature formation (it becomes easier to instantiate additional task-useful directions) versus packing efficiency (it becomes easier to spread existing directions). Early on, packing dominates; around $d \approx k$ capacity is spent on forming new directions (SI/WNO rise); once formation saturates, packing dominates again and SI/WNO drop. Centroids are tied directly to the BCE objective, so their feature formation saturates earlier, explaining the shorter first phase.

4.3 Topology-induced superposition

On CONJUNCTION we train 3-layer GCN, GIN, and GATv2 models (16 channels per layer), apply a global mean pool followed by a ReLU, and fit a two-logit readout $(y_A, y_B)$. For each architecture we repeat training over 250 seeds. We use probes to study two feature families: (i) node-level $\mathrm{Inside}_\ell(i)$ on the last pre-pooling layer, which activates if node $i$ is part of an $\ell$-cycle, and (ii) graph-level $\mathrm{Has}_\ell(G)$ on the post-pooling representation, which activates if the graph $G$ contains any $\ell$-cycle.

Figure 2: SI and WNO measured on GNNs trained on CONJUNCTION. Red highlighted data points indicate models with perfect test accuracy.

Despite $d \gg k$, superposition remains large. With $k = 4$ motifs and width $d = 16$, independent axes would yield $\mathrm{SI} \approx 1$ and $\mathrm{WNO} \approx 0$. Instead, Fig. 2 shows clusters at $\mathrm{SI} > 1$ and $\mathrm{WNO} \gg 0$, i.e., above random overlap. In most runs, at least one motif pair has $|\cos| > 0.9$ (Table 1), most often $(C_3, C_4)$ or $(C_5, C_6)$; in rare cases (~1% in GIN) all four are nearly collinear, consistent with an emergent lever of 'cycleness'.
High accuracy can still be achieved, suggesting that the pooling/readout takes advantage of residual multi-coordinate structure or magnitude even when node-level directions coalesce. Cosine geometry reveals a cyclic structure. Figure 3 reports mean cosine-similarity matrices. At the node level, nearby cycle lengths align (for example, $C_3$ with $C_4$, and $C_5$ with $C_6$) while distant pairs are weakly aligned or slightly anti-aligned. From the standard spectral view, a depth-$L$ message-passing GNN implements a low-degree polynomial filter in the normalized adjacency matrix $\tilde{A}$ (Defferrard et al., 2016; Kipf and Welling, 2017): ignoring nonlinearities, one can write $H^{(L)} \approx \sum_{k=0}^{L} \tilde{A}^k X \Theta_k$, where $X \in \mathbb{R}^{n \times d_{in}}$ are the input node features, $H^{(L)} \in \mathbb{R}^{n \times d_{out}}$ are the layer-$L$ node embeddings, and each $\Theta_k \in \mathbb{R}^{d_{in} \times d_{out}}$ is a learnable coefficient matrix specifying how information arriving via $k$-step walks is mixed across feature channels. The power $\tilde{A}^k$ encodes $k$-step walks on the graph, and its diagonal entries $(\tilde{A}^k)_{ii}$ correspond to closed walks of length $k$ at node $i$. Detectors for $C_3, \ldots, C_6$ therefore necessarily reuse many of the same short closed-walk monomials, which naturally makes nearby cycle lengths more aligned in representation space. After pooling, geometry re-mixes toward the task: the post-pool ReLU and linear readout learn graph-level axes that are linear combinations of node-level detectors (e.g., in GIN with mean-pool, $C_3$ aligns with $C_6$ while $C_4$ opposes $C_5$, which helps reject single-motif false positives).

Figure 3: Mean cosine similarities for the node features (top) and graph features (bottom) on Conjunction.
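The closed-walk observation can be checked numerically on the plain (unnormalized) adjacency matrix, where the counts are integers; `closed_walks` and `triangle_count` are illustrative helpers, not from the paper:

```python
import numpy as np

def closed_walks(A, k):
    """Diagonal of A^k: the number of closed k-step walks at each node."""
    return np.diag(np.linalg.matrix_power(A, k))

def triangle_count(A):
    """Each triangle contributes 6 closed 3-walks (3 start nodes x 2
    directions), so trace(A^3) / 6 counts triangles."""
    return int(np.trace(np.linalg.matrix_power(A, 3)) // 6)
```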
Table 1: Percentage of instances with absolute cosine similarity above 0.9 ("one" = at least one motif pair, "all" = all motif pairs), at the node and graph level.

Model | Pool | Node all | Node one | Graph all | Graph one
GCN   | mean | 15.1%    | 87.8%    | 0.0%      | 44.9%
GCN   | max  | 6.7%     | 78.8%    | 0.0%      | 20.2%
GIN   | mean | 1.1%     | 44.4%    | 0.0%      | 1.1%
GIN   | max  | 0.4%     | 41.2%    | 0.0%      | 9.2%
GAT   | mean | 2.4%     | 64.1%    | 0.0%      | 10.0%
GAT   | max  | 0.7%     | 63.0%    | 0.0%      | 12.2%

Max pooling reduces lever sharing. Since a linear probe's sign can flip with an equivalent boundary, we track $|\cos|$ to detect shared levers. Switching from mean to max pooling reduces node-level alignment across all architectures: both the fraction of runs with any $|\cos| > 0.9$ pair and with all pairs $> 0.9$ drop (Table 1). Intuitively, a per-channel max cannot encode a conjunction on a single channel – one large activation would mask the other – so the model is pressured to place motifs on distinct channels, lowering $|\cos|$. The ordering in Table 1 (GCN > GAT > GIN in lever sharing) aligns with each layer's mixing bias. GCN's degree-normalized sum acts as a low-pass smoother, concentrating signals into a few graph-harmonic axes and driving different motifs to share directions. GAT's learned attention reduces indiscriminate mixing and yields moderate de-alignment, but a single head offers limited selectivity. GIN's unnormalized sum plus post-aggregation MLP provides the most expressive channel rotation and gating, producing more specialized directions and the lowest node- and graph-level alignment.

Discussion. Even with generous width, semantically different topology features occupy similar node-space directions. Max pooling alleviates but does not eliminate this, probably reflecting message-passing inductive biases. Note that many high-accuracy models sit at $\mathrm{SI} > 1$ and $\mathrm{WNO} \gg 0$ (Fig. 2), indicating that accurate solutions often rely on node-level lever sharing with disentanglement deferred to the graph-level readout – superposition here is not merely a symptom of under-capacity.

5 Pooling-Induced Axis Alignment of Graph Representations

Metric and preprocessing.
We measure axis preference with the Alignment Index (AI), computed on any set of feature directions $\{c_\ell \in \mathbb{R}^d\}_{\ell=1}^{k}$ (node-level probe normals or graph-level class features):

$\mathrm{AI} = \frac{1}{k} \sum_{\ell=1}^{k} \frac{\max_j |(c_\ell)_j|}{\|c_\ell\|_2}.$

Thus $\mathrm{AI} \approx 1/\sqrt{d}$ for random orientations and $\mathrm{AI} \to 1$ when each feature is concentrated on a coordinate axis. We always use row-unit directions; for centroids we COM-center across classes before normalizing (to remove the shared offset), while probes use the raw fitted normals.

Global and local pooling conditions. For the graph readout we use the signed power-mean family (details in Appendix C) that interpolates mean ($p = 1$) and max ($p \to \infty$):

$y_d = \mathrm{sgn}(s_d)\, N^{-1/p}\, |s_d|^{1/p}, \qquad s_d = \sum_{i=1}^{N} \mathrm{sgn}(x_{id})\, |x_{id}|^p.$

To isolate local effects, we also replace the last message-passing aggregator by the same power-mean operator while holding the global readout at mean pooling.

Alignment vs. pooling and regime. On Pairwise ($k = 16$), across both a bottleneck ($d = 10$) and a wide ($d = 22$) regime, AI increases for $p > 1$ (Fig. 4). With global pooling the effect is clearest at the graph level (the nonlinearity acts after node mixing). With a local power-mean in the last layer, AI increases at both node and graph levels because the winner-take-most bias precedes the final ReLU and readout. Occasionally AI dips for $p > 2$ in the bottleneck regime; this coincides with less stable training and noisier features. The trends persist when conditioning on high-accuracy runs. GATv2 tends to exhibit higher node-level AI even for modest $p_{local}$ (attention already induces axis-selective flows), whereas GCN's degree-normalized linear aggregation produces smoother trends.

From alignment to constrained superposition. Under mean pooling, strong bottlenecks yield large SI due to shared directions, especially at the node level, where separation is not directly optimized by the loss.
Increasing $p_{local}$ reliably reduces $\mathrm{SI}_{node}$ (Fig. 4), effectively capping features-per-axis. Raising $p_{global}$ also lowers $\mathrm{SI}_{node}$ in the wide regime (consistent with §4.3); in the bottleneck regime the dimensional constraint dominates and this effect largely vanishes. $\mathrm{SI}_{graph}$ stays near-flat: in the wide regime it is already close to optimal under mean pooling (§4.2), and in the bottleneck regime additional alignment cannot overcome the shortage of dimensions.

Figure 4: Alignment versus pooling parameter $p$ across architectures on Pairwise ($k = 16$). Panels: (a) wide, global; (b) bottleneck, global; (c) wide, local; (d) bottleneck, local. Shaded regions show uncertainty as mean $\pm 1.96\sigma/\sqrt{R}$, where $R = 16$ is the number of seeds.

Noise as the driver of axis alignment. Two complementary arguments explain why increasing $p > 1$ breaks rotational symmetry in favor of axis alignment. (i) Learning-dynamics view. Backpropagating through generalized mean pooling (App. C) yields, for each pooled feature $h_d$,

$\frac{\partial h_d}{\partial x_{id}} = \frac{1}{N} \left(|\bar{s}_d| + \varepsilon\right)^{\frac{1}{p} - 1} \left(|x_{id}| + \varepsilon\right)^{p - 1}$

(a derivation is given in Appendix D), where $\bar{s}_d$ is the mean of the transformed features and $\varepsilon > 0$ is the stabilizer. For fixed $d$ and $p$, the prefactor $\frac{1}{N}(|\bar{s}_d| + \varepsilon)^{\frac{1}{p} - 1}$ is shared across nodes, so the relative gradient magnitudes are controlled by $(|x_{id}| + \varepsilon)^{p - 1}$: with $p = 1$ we recover $\partial h_d / \partial x_{id} = 1/N$ (a rotation-invariant Jacobian), whereas for $p > 1$ coordinates with larger $|x_{id}|$ receive disproportionately larger updates, steadily aligning features to coordinate axes. In the absence of noise all orientations become equivalent once the model has recognized the features; in realistic graphs, ubiquitous irrelevant features inject noise and trigger the symmetry breaking. (ii) Geometric view.
Under equal-energy corruption, axis-aligned vectors lose less information than randomly oriented ones; in high dimensions the components of a random unit vector scale like $1/\sqrt{d}$ and are easily swamped by noise spikes. Figure 5 visualizes this effect under max pooling. We numerically simulate the equal-energy corruption under max pooling: for each $(n, \sigma)$ on a grid (dimensions $n = 2, \ldots, 20$, noise levels $\sigma \in [0.001, 0.35]$) we sample 100 random vectors $v \in \mathbb{R}^n$ (either uniform on the unit sphere or one-hot), draw 20 i.i.d. Gaussian noise candidates per coordinate with variance $\sigma^2$, and replace each coordinate by the candidate with largest absolute value whenever its magnitude exceeds $|v_i|$.

Practical takeaways. For more axis-aligned graph features, prefer global pooling with $p_{global} > 1$. To make node channels less polysemantic, consider a power-mean local aggregator in the final layer.

Figure 5: Axis-aligned embedding vectors lose less information than arbitrary-angled vectors under max pooling when both are corrupted by equal-energy noise.

6 Rank collapse and metastable superposition

Our geometry metrics (SI/WNO) quantify how features pack inside the span of the learned representations. They do not say how many dimensions that span actually occupies. Let $H \in \mathbb{R}^{N \times d}$ stack pooled graph embeddings (rows are graphs) and let $C \in \mathbb{R}^{k_a \times d}$ stack the $k_a$ active class centroids. Since centroids are linear averages of rows of $H$, $C = A^\top H R$ for suitable averaging/projection matrices, hence $\mathrm{EffRank}(C) \leq \mathrm{rank}(C) \leq \mathrm{rank}(H)$. If the span of $H$ is low-dimensional, superposition is inevitable: by pigeonhole, $k_a > \mathrm{rank}(H)$ forces multiple features to share axes.
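The rank chain above can be checked numerically on synthetic data with a deliberately thin span; `entropy_eff_rank` and `class_centroids` are illustrative names, following Eq. (2) and the centroid construction of §3.1:

```python
import numpy as np

def entropy_eff_rank(M):
    """Entropy-based effective rank (Eq. 2) of an arbitrary matrix."""
    s = np.linalg.svd(M, compute_uv=False)
    p = s / s.sum()
    p = p[p > 1e-12]                      # drop numerically zero mass
    return float(np.exp(-(p * np.log(p)).sum()))

def class_centroids(H, y):
    """Stack per-class means of the pooled embeddings H (rows = graphs).
    Centroids are averages of rows of H, so their span cannot exceed
    span(H): EffRank(C) <= rank(C) <= rank(H)."""
    return np.stack([H[y == k].mean(axis=0) for k in np.unique(y)])
```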
We formalize this with a numerical rank $r_\tau(H) = \#\{i : \sigma_i(H)/\sigma_1(H) \geq \tau\}$, and call a run collapsed when $r_\tau(H) < k_a$.³ This is the regime where high SI/WNO are not just a matter of suboptimal packing but a hard capacity constraint from a thin span.

³ We use $\tau = 10^{-4}$; results are qualitatively unchanged for $\tau \in [10^{-5}, 10^{-3}]$. An energy criterion $r_\eta(H) = \min\{r : \sum_{i \leq r} \sigma_i^2 \geq (1 - \eta) \sum_i \sigma_i^2\}$ gives the same conclusions.

Two training modes. On PAIRWISE we observe two characteristic dynamics (Fig. 6). In some seeds $r_\tau(H)$ rises to the layer width ($d = 16$), with $\mathrm{EffRank}(H)$ tracking from below; SI/WNO on centroids drop accordingly. In other seeds $r_\tau(H)$ remains strictly below $d$, sometimes far below $k_a$: SI/WNO stay high and the smallest singular values show brief activation attempts (spikes) that recede, indicating the optimizer returns to a low-rank basin. Interestingly, across models that exhibit overfitting (such as the right model in Fig. 6) we observe sharp multi-phase transitions in SI and WNO throughout training. A possible (non-rigorous) explanation is provided in Appendix E.

Figure 6: Examples showing the evolution of singular values of a GCN model (left) and GIN model (right).

Why does the low-rank basin persist? Two effects make it metastable: (i) Global channel gating by ReLU. If a channel is negative across the dataset just before the last nonlinearity, ReLU zeros it everywhere, so the corresponding column of $H$ vanishes and the algebraic rank drops exactly. Replacing the final ReLU with LeakyReLU substantially reduces such dead columns (App. F.5). (ii) Obtuseness pressure under BCE. Before a separating hyperplane forms, last-layer directions $\{v_i\}$ tend to align with their class weights $\{w_i\}$ and become mutually obtuse. Moving one feature into a fresh orthogonal dimension rotates it towards $90°$ against the others and increases the BCE loss.
After the hyperplane forms, the cross-class margins satisfy $w_j^\top v_i<0$ ($i\ne j$), and perturbing $v_k$ into a new dimension makes those dot products less negative, again increasing loss (Appendix F). Hence gradients oppose escapes until the accuracy gain from a new dimension outweighs the obtuseness penalty, explaining the singular-value jump events.

When is collapse likely? We see more collapse (i) with narrower last layers, (ii) in GIN (post-aggregation ReLUs) than in GCN (no ReLU after the final aggregation), and (iii) as the number of task features grows. This agrees with the regime analysis in Appendix F: when $k_a\le d+1$, mutually obtuse low-rank configurations exist and can act as local minima; as $k_a$ approaches or exceeds $2d$, such configurations become geometrically impossible and $r_\tau(H)$ tends to reach $d$.

Connection to oversmoothing. Classic oversmoothing is within-graph: node states become indistinguishable with depth. Here we see a cross-graph analogue: pooled graph embeddings concentrate in a thin global subspace. Accurate models can then separate classes inside that span by sign/magnitude, which explains why SI/WNO can be high at the node level yet acceptable in the graph space.

Discussion. The width sweeps in §4.2 showed SI/WNO can be high even when $d\ge k$; here we identify a complementary mechanism: training can settle into thin-span solutions, so overlap is forced (because $k_a>r_\tau(H)$) rather than merely suboptimal packing within a rich span. This also clarifies why node-level superposition can persist while graph-level centroids look well packed: the readout operates inside a low-dimensional subspace, separating classes by sign/magnitude without increasing the span of $H$.

7 Conclusion

In this work we provided, to the best of our knowledge, the first systematic study of superposition in GNNs.
Concretely, we introduced a representation-centric framework that quantifies feature sharing in GNNs via class-conditional centroids and linear-probe directions. Across architectures, width induces a three-phase trajectory in overlap; topology shapes node-level geometry that pooling re-mixes toward the task; and sharper pooling drives axis alignment and reduces channel sharing. We also observed metastable low-rank graph embeddings in shallow models. Practically, modest increases in width, LeakyReLU in the final MLP, and sharper but stable pooling improve interpretability without sacrificing accuracy. A key next step is to link superposition dynamics to over-smoothing and over-squashing, integrating our geometric view with spectral, dynamical, and topological analyses.

References

M. S. Akhondzadeh, V. Lingam, and A. Bojchevski (2023) Probing graph representations. In International Conference on Artificial Intelligence and Statistics, pp. 11630–11649.
G. Alain and Y. Bengio (2016) Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
D. Bau, B. Zhou, A. Khosla, A. Oliva, and A. Torralba (2017) Network dissection: quantifying interpretability of deep visual representations. In Computer Vision and Pattern Recognition.
T. Bricken, A. Templeton, J. Batson, B. Chen, A. Jermyn, T. Conerly, N. Turner, C. Anil, C. Denison, A. Askell, et al. (2023) Towards monosemanticity: decomposing language models with dictionary learning. Transformer Circuits Thread 2.
M. Defferrard, X. Bresson, and P. Vandergheynst (2016) Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems 29.
N. Elhage, T. Hume, and C. Olsson (2022) Toy models of superposition. arXiv preprint arXiv:2209.10652.
L. Gao, T. D. la Tour, H.
Tillman, G. Goh, R. Troll, A. Radford, I. Sutskever, J. Leike, and J. Wu (2024) Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093.
M. Geva, R. Schuster, J. Berant, and O. Levy (2020) Transformer feed-forward layers are key-value memories. arXiv preprint arXiv:2012.14913.
W. Gurnee, T. Horsley, Z. C. Guo, T. R. Kheirkhah, Q. Sun, W. Hathaway, N. Nanda, and D. Bertsimas (2024) Universal neurons in GPT2 language models. arXiv preprint arXiv:2401.12181.
W. Gurnee, N. Nanda, M. Pauly, K. Harvey, D. Troitskii, and D. Bertsimas (2023) Finding neurons in a haystack: case studies with sparse probing. arXiv preprint arXiv:2305.01610.
J. Kakkad, J. Jannu, K. Sharma, C. Aggarwal, and S. Medya (2023) A survey on explainability of graph neural networks. arXiv preprint arXiv:2306.01958.
A. Karpathy, J. Johnson, and L. Fei-Fei (2015) Visualizing and understanding recurrent networks. arXiv preprint arXiv:1506.02078.
B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, et al. (2018) Interpretability beyond feature attribution: quantitative testing with concept activation vectors (TCAV). In International Conference on Machine Learning, pp. 2668–2677.
T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
D. Luo, W. Cheng, D. Xu, W. Yu, B. Zong, H. Chen, and X. Zhang (2020) Parameterized explainer for graph neural network. Advances in Neural Information Processing Systems 33, pp. 19620–19631.
C. Olah, A. Mordvintsev, and L. Schubert (2017) Feature visualization. Distill. https://distill.pub/2017/feature-visualization
A. Scherlis, K. Sachan, A. S. Jermyn, J. Benton, and B.
Shlegeris (2023) Polysemanticity and capacity in neural networks. arXiv preprint arXiv:2210.01892.
H. Strobelt, S. Gehrmann, H. Pfister, and A. M. Rush (2017) LSTMVis: a tool for visual analysis of hidden state dynamics in recurrent neural networks. IEEE Transactions on Visualization and Computer Graphics 24(1), pp. 667–676.
K. Xu, W. Hu, J. Leskovec, and S. Jegelka (2019) How powerful are graph neural networks? arXiv preprint arXiv:1810.00826.
H. Xuanyuan, P. Barbiero, D. Georgiev, L. C. Magister, and P. Lió (2023) Global concept-based interpretability for graph neural networks via neuron analysis. arXiv preprint arXiv:2208.10609.
R. Ying, D. Bourgeois, J. You, M. Zitnik, and J. Leskovec (2019) GNNExplainer: generating explanations for graph neural networks. arXiv preprint arXiv:1903.03894.
H. Yuan, H. Yu, S. Gui, and S. Ji (2022) Explainability in graph neural networks: a taxonomic survey. IEEE Transactions on Pattern Analysis and Machine Intelligence 45(5), pp. 5782–5799.
M. D. Zeiler and R. Fergus (2014) Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pp. 818–833.
B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba (2014) Object detectors emerge in deep scene CNNs. arXiv preprint arXiv:1412.6856.

Appendix A Datasets

A.1 The Pairwise dataset

Graph structure and node features. The Pairwise dataset consists of graphs that are simple chains of fixed length $L=20$. We represent the chain as an undirected path on nodes $\{1,\dots,L\}$ with edges $(i,i+1)$ for $i=1,\dots,L-1$, encoded as the two directed edges $(i,i+1)$ and $(i+1,i)$; in our experiments we do not add self-loops.
Each node $i$ carries a $k$-dimensional feature vector $x_i\in\{0,1\}^k$, where $k$ is the number of categories (we denote the corresponding dataset by Pairwise$k$). Node features are sampled independently as follows: first draw an activation flag $b_i\sim\mathrm{Bernoulli}(p)$ with $p=0.9$; if $b_i=1$, sample a category $c_i\sim\mathrm{Unif}\{1,\dots,k\}$ and set $(x_i)_{c_i}=1$ and all other entries to 0, while if $b_i=0$ we set $x_i=0$. Thus each node is either all-zero or one-hot, with a high expected fraction $p$ of active nodes. All random choices are made with a fixed seed (42) for reproducibility.

Label definition, deduplication, and split. Given a feature matrix $X\in\{0,1\}^{L\times k}$ on the chain, the multi-label target $y\in\{0,1\}^k$ is defined coordinate-wise by
$$y_c=\mathbb{1}\{\exists\,(i,j)\in E \text{ with } (x_i)_c=(x_j)_c=1\},\quad c=1,\dots,k,$$
where $E$ is the set of chain edges (self-loops, if present, are ignored when computing $y$). In words, class $c$ is active if and only if there exists at least one edge whose two endpoints both carry the one-hot category $c$. To build the dataset, we repeatedly sample node features $X$ as above, compute $y$, and retain only unique graphs, where uniqueness is defined by the node-feature pattern: we hash $X$ (after casting to uint8) and reject duplicates until we have $n_{\mathrm{train}}+n_{\mathrm{test}}=3000$ distinct samples. We then randomly shuffle these 3000 graphs (using the same seed 42) and split them into $n_{\mathrm{train}}=2000$ training and $n_{\mathrm{test}}=1000$ test graphs, corresponding to a test ratio of 1/3.

A.2 The Conjunction dataset

Data generation. We instantiate the Conjunction dataset by drawing $n_{\mathrm{train}}=3000$ and $n_{\mathrm{test}}=1000$ graphs with a fixed random seed of 42. For each graph we first sample a latent category $c\in\{\text{none},\text{A-only},\text{B-only},\text{both}\}$ from a categorical prior $(0.25,0.25,0.25,0.25)$.
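The Pairwise sampling and labeling rules above can be sketched in NumPy. This is an illustrative re-implementation; the paper's code may differ in hashing and train/test-splitting details.

```python
import numpy as np

def sample_pairwise_graph(k, L=20, p=0.9, rng=None):
    """Sample node features X in {0,1}^{L x k} for a length-L chain:
    each node is one-hot with probability p (uniform category), else all-zero."""
    rng = rng or np.random.default_rng(42)
    X = np.zeros((L, k), dtype=np.uint8)
    active = rng.random(L) < p
    cats = rng.integers(0, k, size=L)
    X[np.arange(L)[active], cats[active]] = 1
    return X

def pairwise_label(X):
    """y_c = 1 iff some chain edge (i, i+1) has both endpoints one-hot in c."""
    both = X[:-1] & X[1:]            # one row per edge; 1 where both ends share c
    return both.any(axis=0).astype(np.uint8)

def build_dataset(k, n_total=3000, seed=42):
    """Rejection-sample unique feature patterns (dedup via a hash of X)."""
    rng = np.random.default_rng(seed)
    seen, data = set(), []
    while len(data) < n_total:
        X = sample_pairwise_graph(k, rng=rng)
        h = X.tobytes()
        if h not in seen:
            seen.add(h)
            data.append((X, pairwise_label(X)))
    return data

data = build_dataset(k=16, n_total=100)
X0, y0 = data[0]
print(X0.shape, y0.shape)  # (20, 16) (16,)
```

The chain topology itself is implicit here: only consecutive node pairs are compared, which is exactly the edge set of the path.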
Conditioned on $c$, we specify the required counts $r_\ell\in\{0,1\}$ of cycles of lengths $\ell\in\{3,4,5,6\}$ as follows. For A-only we set $(r_3,r_4,r_5,r_6)=(1,0,0,1)$; for B-only, $(0,1,1,0)$; for both, $(1,1,1,1)$. For the "none" category we sample uniformly from the four non-target pairs $(1,0,1,0)$, $(0,1,0,1)$, $(1,1,0,0)$, $(0,0,1,1)$, which guarantees that neither conjunction $C_3\wedge C_6$ nor $C_4\wedge C_5$ holds. Given these required counts, we then draw independent extra copies $e_\ell\sim\mathrm{Binomial}(8,0.5)$ for each length $\ell$ with $r_\ell=1$, and set the total motif counts to $n_\ell=r_\ell+e_\ell$; lengths with $r_\ell=0$ remain absent. This corresponds to a symmetric Bernoulli prior over up to eight additional copies per present motif. The final labels are then computed deterministically from the counts as
$$y_A(G)=\mathbb{1}\{n_3\ge 1 \wedge n_6\ge 1\},\qquad y_B(G)=\mathbb{1}\{n_4\ge 1 \wedge n_5\ge 1\},$$
so that the four categories correspond to $(y_A,y_B)\in\{(0,0),(1,0),(0,1),(1,1)\}$ as intended.

Graph construction. Given the counts $(n_3,n_4,n_5,n_6)$, we materialize a concrete graph by placing all motif instances on disjoint node sets. For each of the $n_\ell$ copies of a cycle of length $\ell$, we create a simple cycle on $\ell$ new nodes and designate one "anchor" node on that cycle. To increase local structural variety while preserving the core motifs, we attach "whisker" paths to each anchor: for every anchor we sample a number of whiskers $W\sim\mathrm{Unif}\{0,1,2\}$, and for each whisker we sample a length $L_{\mathrm{wh}}\sim\mathrm{Unif}\{1,2\}$ and attach a simple path of length $L_{\mathrm{wh}}$ starting from the anchor. Letting $A$ denote the set of anchors of all motif instances, we then randomly permute $A$ and connect consecutive anchors by simple paths of length $L_{\mathrm{br}}\sim\mathrm{Unif}\{1,2\}$, ensuring that the overall graph is connected while the cycles themselves remain node-disjoint.
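The count sampling and label rules above can be sketched as follows. This is a hypothetical re-implementation of just the count/label logic; materializing the graphs with whiskers and bridge paths is omitted.

```python
import numpy as np

# Required cycle counts (lengths 3, 4, 5, 6) per latent category.
REQUIRED = {
    "A-only": (1, 0, 0, 1),
    "B-only": (0, 1, 1, 0),
    "both":   (1, 1, 1, 1),
}
# 'none' draws uniformly from the four non-target pairs, so that neither
# C3 AND C6 nor C4 AND C5 can hold.
NONE_PAIRS = [(1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0), (0, 0, 1, 1)]

def sample_counts(category, rng):
    """Required counts plus Binomial(8, 0.5) extra copies per present length."""
    r = NONE_PAIRS[rng.integers(4)] if category == "none" else REQUIRED[category]
    return tuple(ri + (rng.binomial(8, 0.5) if ri else 0) for ri in r)

def labels(counts):
    """Deterministic labels: y_A = [n3>=1 and n6>=1], y_B = [n4>=1 and n5>=1]."""
    n3, n4, n5, n6 = counts
    return int(n3 >= 1 and n6 >= 1), int(n4 >= 1 and n5 >= 1)

rng = np.random.default_rng(42)
for cat, want in [("none", (0, 0)), ("A-only", (1, 0)),
                  ("B-only", (0, 1)), ("both", (1, 1))]:
    assert labels(sample_counts(cat, rng)) == want
print("label/category correspondence holds")
```

The loop at the end verifies that each latent category deterministically yields its intended label pair, as claimed in the text.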
Node features are one-dimensional and equal to the node degree (i.e., we use purely structural degree features with no learned or random attributes).

Deduplication and split. To avoid distributional artefacts from re-using the same graph topology, we perform bucketed rejection sampling with Weisfeiler–Lehman (WL) deduplication. We maintain four buckets (one per category) and generate candidate graphs as above until each bucket contains the desired number of graphs implied by the class prior (subject to rounding), rejecting any candidate whose 1-WL hash (obtained from three iterations of 1-WL colour refinement on the undirected graph and hashing the resulting colour histogram) has already been seen in that bucket, with a cap of 60,000 proposals per bucket. Once a total of 4000 unique graphs have been collected, we perform a per-bucket random split into 3000 training and 1000 test graphs using the same seed (42), which preserves the class proportions up to integer rounding.

Appendix B Optimizer and training

B.1 Optimizer details

All models are trained with the Adam optimizer as implemented in PyTorch, using the default parameters $(\beta_1,\beta_2)=(0.9,0.999)$ and $\epsilon=10^{-8}$. Unless otherwise stated, we use a mini-batch size of 256, no learning-rate scheduling, and no early stopping; every configuration is run over multiple random seeds, but the optimizer hyperparameters are identical across seeds. No dropout is used.

Table 2: Training hyperparameters (optimizer, learning rate, weight decay, and batch size) by experiment and architecture.
Experiment | Task / dataset | Model | Learning rate | Weight decay | Epochs
1 | Pairwise16 | GCN | 0.10 | 0 | 300
1 | Pairwise16 | GIN | 0.01 | 0 | 150
1 | Pairwise16 | GAT | 0.01 | 0 | 150
2 | Conjunction | GCN | 0.001 | 10^-5 | 1600
2 | Conjunction | GIN | 0.001 | 10^-5 | 1600
2 | Conjunction | GAT | 0.001 | 10^-5 | 1600
3 | Pairwise16 | GCN | 0.10 | 0 | 200
3 | Pairwise16 | GIN | 0.01 | 0 | 200
3 | Pairwise16 | GAT | 0.01 | 0 | 200
4 | Pairwise16 | GCN | 0.10 | 0 | 250
4 | Pairwise16 | GIN | 0.01 | 0 | 250
4 | Pairwise16 | GAT | 0.01 | 0 | 250

Experiment 1 (Dimension-induced bottleneck on Pairwise). We use mean global pooling, batch size 256, and the learning rates and weight decays from Table 2. We train GCN models for 300 epochs and GIN/GAT models for 150 epochs: GCN: lr = 0.10, wd = 0, 300 epochs; GIN/GAT: lr = 0.01, wd = 0, 150 epochs. The optimizer is Adam with the above defaults and no scheduler.

Experiment 2 (Topology-induced superposition). For the second experiment we again use Adam, batch size 256, and mean pooling, but employ smaller learning rates and a small amount of weight decay. All architectures share the same training schedule: GCN/GIN/GAT: lr = 10^-3, wd = 10^-5, 1600 epochs. Training is performed by calling model.fit(train_loader, optimizer, num_epochs=1600) with no additional regularization beyond weight decay.

Experiment 3 (Alignment on Pairwise16). In the alignment experiments on the Pairwise16 dataset we sweep the hidden dimension over [10, 16, 22] with batch size 256, and the same Adam learning rates and weight decays as in Experiment 1 (see Table 2). We sweep the generalized-mean pooling exponent over $p\in\{1.0, 2.0, 4.0, 8.0, p_{\max}\}$, where $p_{\max}$ denotes the largest exponent used in our code (approximating max pooling); all other optimization hyperparameters are kept fixed across $p$.

Experiment 4 (Rank/selectivity on Pairwise16).
We use the ExperimentRank wrapper with hidden dimension 16, batch size 256, and the Adam settings from Table 2. Each model is trained via model.fit(train_loader, optimizer, num_epochs=num_epochs), where num_epochs is a fixed constant within this experiment and does not depend on the random seed; no learning-rate schedule or additional regularization is applied beyond the specified weight decay.

Appendix C Generalized Mean Pooling

Let $N$ be the number of nodes, $x_{id}\in\mathbb{R}$ the $d$-th dimension of node $i$, and $p\in\mathbb{R}$ a pooling parameter. Introduce a small stabiliser $\varepsilon>0$ and define
$$g_{p,\varepsilon}(x)\;=\;\mathrm{sgn}(x)\,(|x|+\varepsilon)^p.$$

1. Element-wise transformation: for every node $i$ and feature dimension $d$, $\tilde{x}_{id}=g_{p,\varepsilon}(x_{id})=\mathrm{sgn}(x_{id})(|x_{id}|+\varepsilon)^p$.
2. Pooling: sum the transformed features and take their mean, $s_d=\sum_{i=1}^{N}\tilde{x}_{id}$, $\bar{s}_d=\frac{1}{N}s_d$.
3. Inverse transformation: apply the signed $p$-th root to obtain the pooled feature $h_d=\mathrm{sgn}(\bar{s}_d)(|\bar{s}_d|+\varepsilon)^{1/p}$.

The resulting graph-level representation is the vector $h=(h_1,h_2,\dots,h_D)$.

Remark. Setting $p=1$ recovers ordinary mean pooling (modulo the stabilisation term $\varepsilon$), while $p\to\infty$ recovers max pooling.

Appendix D Symmetry Breaking

Recall the stabilized generalized mean pooling from App. C. For each feature dimension $d$, let $x_{id}\in\mathbb{R}$ be the feature of node $i$, and define
$$\tilde{x}_{id}=g_{p,\varepsilon}(x_{id})=\mathrm{sgn}(x_{id})(|x_{id}|+\varepsilon)^p,\qquad s_d=\sum_{i=1}^{N}\tilde{x}_{id},\qquad \bar{s}_d=\tfrac{1}{N}s_d.$$
The pooled feature is then $h_d=\mathrm{sgn}(\bar{s}_d)(|\bar{s}_d|+\varepsilon)^{1/p}$. We now compute $\partial h_d/\partial x_{id}$, away from the measure-zero points where $x_{id}=0$ or $\bar{s}_d=0$ (so that the sign functions are locally constant).

Step 1: derivative of the inner transform. For fixed $d$ and node $i$, write $\tilde{x}_{id}=\mathrm{sgn}(x_{id})(|x_{id}|+\varepsilon)^p$.
Using $\frac{\partial|x_{id}|}{\partial x_{id}}=\mathrm{sgn}(x_{id})$ and $\mathrm{sgn}(x_{id})^2=1$, we obtain
$$\frac{\partial\tilde{x}_{id}}{\partial x_{id}}=\mathrm{sgn}(x_{id})\cdot p(|x_{id}|+\varepsilon)^{p-1}\cdot\frac{\partial|x_{id}|}{\partial x_{id}}=p(|x_{id}|+\varepsilon)^{p-1}.$$
Hence
$$\frac{\partial s_d}{\partial x_{id}}=\frac{\partial}{\partial x_{id}}\Big(\sum_{j=1}^{N}\tilde{x}_{jd}\Big)=\frac{\partial\tilde{x}_{id}}{\partial x_{id}}=p(|x_{id}|+\varepsilon)^{p-1},$$
and therefore
$$\frac{\partial\bar{s}_d}{\partial x_{id}}=\frac{1}{N}\frac{\partial s_d}{\partial x_{id}}=\frac{p}{N}(|x_{id}|+\varepsilon)^{p-1}.$$

Step 2: derivative of the outer map. Define $y=\bar{s}_d$ and write $h_d=f(y)=\mathrm{sgn}(y)(|y|+\varepsilon)^{1/p}$. For $y\ne 0$, $\mathrm{sgn}(y)$ is locally constant, and we have
$$\frac{\partial h_d}{\partial y}=\mathrm{sgn}(y)\cdot\frac{1}{p}(|y|+\varepsilon)^{1/p-1}\cdot\frac{\partial|y|}{\partial y}=\frac{1}{p}(|y|+\varepsilon)^{1/p-1},$$
since $\frac{\partial|y|}{\partial y}=\mathrm{sgn}(y)$ and the sign factors cancel.

Step 3: chain rule. Combining the two steps,
$$\frac{\partial h_d}{\partial x_{id}}=\frac{\partial h_d}{\partial\bar{s}_d}\cdot\frac{\partial\bar{s}_d}{\partial x_{id}}=\frac{1}{p}(|\bar{s}_d|+\varepsilon)^{1/p-1}\cdot\frac{p}{N}(|x_{id}|+\varepsilon)^{p-1}.$$
Thus the stabilized gradient of generalized mean pooling is
$$\frac{\partial h_d}{\partial x_{id}}=\frac{1}{N}(|\bar{s}_d|+\varepsilon)^{\frac{1}{p}-1}(|x_{id}|+\varepsilon)^{p-1}$$
for all $x_{id}$ and $\bar{s}_d$ away from the non-differentiable points of the absolute-value function.

Remark. If we further let $\varepsilon\to 0$ and assume non-negative activations, so that $\bar{s}_d=s_d/N$ and $x_{id}\ge 0$, this reduces to $\frac{\partial h_d}{\partial x_{id}}=N^{-1/p}|s_d|^{1/p-1}x_{id}^{p-1}$.

Appendix E Sharp SI transitions during training

A training-time phase transition. On PAIRWISE we observe two training modes: (i) full-rank runs where $r_\tau(H)$ rises to $d$ and $\mathrm{EffRank}(H)$ tracks it; (ii) low-rank runs where $r_\tau(H)$ remains $<d$ for most of training. Across seeds, SI/WNO of the centroids exhibit a reproducible four-stage pattern:

1. Feature birth (noisy). Early on, new class directions appear; $\mathrm{EffRank}(C)$ rises, so SI decreases. Packing inside the span is crude, so WNO typically increases.
2. Compression (sharp transition).
A narrow span forms: the smallest singular values of $H$ drop or remain suppressed, PC1 energy rises, and $\mathrm{EffRank}(C)$ contracts, producing a sharp SI spike. This point almost always coincides with the onset of overfitting (test loss starts to rise).
3. Repacking (smooth). One or two small singular directions "wake up", allowing repacking within a slightly larger span: WNO and SI decrease smoothly.
4. Convergence (flat). Geometry stabilizes; train loss drifts down, test loss drifts up.

The sharpness of Stage 2 reflects a discrete change in $r_\tau(H)$ (entire channels becoming globally inactive or reactivated), whereas WNO is computed after PC1 removal and need not jump at the same epoch.

Appendix F Geometric Regimes for Anti-Aligned Embedding Directions

F.1 Problem Formulation

We first consider the case where the feature number $n$ is small and no hyperplane has formed. Let $\{v_i\in\mathbb{R}^d : \|v_i\|_2=1,\ i=1,\dots,n\}$ be the (column-normalized) embedding directions learned by the network's last hidden layer, and let $\{w_i\}_{i=1}^{n}$ denote the corresponding unit-length rows of the classifier weight matrix. The anti-alignment hypothesis we explored in Section 6 states that after training
$$v_i\;\approx\;w_i\qquad\text{and}\qquad v_i^\top v_j\;\le\;0\quad(i\ne j),$$
i.e. every embedding is (almost) identical to its class weight, and pairwise inner products are non-positive if possible. In other words, the network attempts to realise a set of $n$ mutually obtuse vectors in $\mathbb{R}^d$. The feasibility of this condition depends on the relation between $n$ and $d$, which separates into four distinct regimes.

F.2 Regime Analysis

1. Over-complete case, $n>2d$. By a simple counting argument, at least two vectors must lie in the same closed half-space; hence their inner product is $>0$, so one pair is necessarily acute.
2. Intermediate regime, $d+1<n<2d$. There are at most $n=d+1$ mutually obtuse vectors. In this regime some vectors are mutually orthogonal and some are mutually obtuse.
3.
Simplex threshold, $n=d+1$. The largest set of vectors that can satisfy $v_i^\top v_j<0$ for all $i\ne j$ is the vertex set of a centred regular $d$-simplex. This is the first value of $n$, counting down from the over-complete regime, that permits pairwise obtuse directions.
4. Under-complete case, $n\le d$. One can choose any $n$ orthogonal unit vectors (or the $n$ vertices of a degenerate $n$-simplex in an $(n-1)$-dimensional subspace).

F.3 Implication for Rank Collapse

Assume training has converged to a low-rank subspace $S\subset\mathbb{R}^d$ with $\dim(S)=r<d$ and all $v_i\in S$. Activating a new latent dimension $e_\perp\perp S$ perturbs one vector as
$$v_k\;\mapsto\;\tilde{v}_k=\sqrt{1-\varepsilon^2}\,v_k+\varepsilon\,e_\perp.$$
Then, for $i\ne k$,
$$\tilde{v}_k^\top v_i=\sqrt{1-\varepsilon^2}\,v_k^\top v_i\;\ge\;v_k^\top v_i,$$
because $v_k^\top v_i\le 0$ in the metastable state. Hence the angle decreases towards $90^\circ$ (equality holds only when $v_k^\top v_i=0$, keeping the angle unchanged at $90^\circ$), which reduces pairwise obtuseness. As discussed in Section 6, this is penalised by the BCE loss function. During backpropagation, gradients oppose the escape from the low-rank basin, effectively creating a metastable minimum. If $\varepsilon$ grows large enough, performance gains may begin to outweigh the angle penalty, and the rank grows.

F.4 Rank Collapse with Hyperplane

So far we have assumed $v_i\approx w_i$, yet when a separating hyperplane emerges, the learned embeddings $\{v_i\}$ and the classifier weights $\{w_i\}$ are no longer almost identical. Empirically, the $v_i$ are often acute rather than mutually obtuse, but the cross-class dot products stay strictly negative:
$$w_j^\top v_i\;<\;0\quad(i\ne j).$$

Perturbation into a fresh dimension. Using the same notation as before, we perturb one embedding as $\tilde{v}_k=\sqrt{1-\varepsilon^2}\,v_k+\varepsilon\,e_\perp$. For any $j\ne k$,
$$w_j^\top\tilde{v}_k=\sqrt{1-\varepsilon^2}\,w_j^\top v_k+\varepsilon\,w_j^\top e_\perp\;>\;w_j^\top v_k,$$
since $w_j^\top v_k<0$ by the condition above, while $w_j^\top e_\perp=0$. Thus the cross-class dot product becomes less negative, which increases the BCE loss.
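The perturbation argument can be checked numerically. A toy sketch: we build a centred regular simplex (the maximal mutually obtuse configuration from F.2), perturb one vector into a fresh orthogonal dimension, and confirm that every cross dot product becomes less negative.

```python
import numpy as np

def obtuse_set(d):
    """d+1 unit vectors with pairwise inner products exactly -1/d: the vertices
    of a centred regular d-simplex (embedded in R^{d+1} for convenience)."""
    n = d + 1
    V = np.eye(n) - 1.0 / n                      # rows: e_i minus the centroid
    return V / np.linalg.norm(V, axis=1, keepdims=True)

d, eps = 8, 0.3
V = obtuse_set(d)
V = np.hstack([V, np.zeros((d + 1, 1))])         # append a fresh, unused axis
e_perp = np.zeros(d + 2)
e_perp[-1] = 1.0

v_k = V[0]
v_k_new = np.sqrt(1 - eps**2) * v_k + eps * e_perp   # perturbed embedding

before = V[1:] @ v_k       # cross dots in the metastable state (all -1/d)
after = V[1:] @ v_k_new    # cross dots after the perturbation
print(before[:3], after[:3])
# Every cross dot moves toward 0: the angles shrink toward 90 degrees,
# reducing obtuseness exactly as the derivation predicts.
```

The same check with `V[1:]` playing the role of the class weights $w_j$ illustrates the F.4 variant, since the perturbation leaves all other vectors untouched.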
In short, even after the hyperplane forms, the network remains trapped in a low-rank metastable basin for the same reason as before: cross-class obtuseness ($w_j^\top v_i<0$) resists escapes that would reduce the margin.

F.5 Leaky ReLU Rank

Figure 7 illustrates the evolution of the rank of pooled embeddings across 15 training runs of a GIN model on the Pairwise dataset. The left panel presents results for a standard GIN, while the right panel shows outcomes when the MLP layers in the GIN architecture use leaky ReLU as the activation function.

Figure 7: GIN model training runs with input dimension 12 and hidden dimensions [12, 6].
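The stabilized generalized-mean pooling of Appendix C and its closed-form gradient from Appendix D can be sketched and cross-checked against finite differences. This is a NumPy sketch for illustration; the paper's models use PyTorch with autograd, so these helper names are our own.

```python
import numpy as np

def gen_mean_pool(X, p, eps=1e-6):
    """Stabilized generalized-mean pooling over nodes (rows of X).
    p = 1 recovers mean pooling (up to eps); large p approaches max pooling."""
    Xt = np.sign(X) * (np.abs(X) + eps) ** p        # element-wise transform
    s_bar = Xt.mean(axis=0)                         # pooled per-channel mean
    return np.sign(s_bar) * (np.abs(s_bar) + eps) ** (1.0 / p)

def gen_mean_pool_grad(X, p, eps=1e-6):
    """Closed-form gradient dh_d/dx_id from Appendix D:
    (1/N) * (|s_bar_d| + eps)^(1/p - 1) * (|x_id| + eps)^(p - 1)."""
    N = X.shape[0]
    s_bar = (np.sign(X) * (np.abs(X) + eps) ** p).mean(axis=0)
    return (1.0 / N) * (np.abs(s_bar) + eps) ** (1.0 / p - 1) \
        * (np.abs(X) + eps) ** (p - 1)

rng = np.random.default_rng(0)
X = np.abs(rng.normal(size=(5, 3))) + 0.5   # positive, away from |x| = 0
p = 4.0

# Finite-difference check of the Appendix D formula for one entry (i, d).
i, d, h = 2, 1, 1e-6
Xp, Xm = X.copy(), X.copy()
Xp[i, d] += h
Xm[i, d] -= h
fd = (gen_mean_pool(Xp, p)[d] - gen_mean_pool(Xm, p)[d]) / (2 * h)
print(fd, gen_mean_pool_grad(X, p)[i, d])   # the two values agree closely
```

Because the gradient carries the factor $(|x_{id}|+\varepsilon)^{p-1}$, large-$p$ pooling concentrates gradient signal on the highest-magnitude node per channel, which is the mechanism behind the "winner-take-all" behaviour discussed in the main text.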