Paper deep dive
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Qi Zhang, Yifei Wang, Jingyi Cui, Xiang Pan, Qi Lei, Stefanie Jegelka, Yisen Wang
Models: MonoLoRA, NCL (Non-negative Contrastive Learning), Sparse Autoencoder (SAE)
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 94%
Last extracted: 3/12/2026, 6:12:42 PM
Summary
This paper challenges the traditional accuracy-interpretability tradeoff in deep learning by demonstrating that monosemantic features—where neurons correspond to distinct, consistent semantics—significantly enhance model robustness. Through empirical analysis across input noise, label noise, few-shot learning, and out-of-domain generalization, the authors show that models leveraging monosemantic features (attained via NCL or SAE) outperform polysemantic counterparts. The study provides both empirical and theoretical evidence suggesting that monosemanticity promotes better feature separation, leading to more robust decision boundaries.
Entities (6)
Relation Signals (4)
NCL → attains → Monosemanticity
confidence 95% · NCL (non-negative contrastive learning) attains high sparsity and monosemanticity
SAE → attains → Monosemanticity
confidence 95% · Sparse autoencoders (SAEs) reconstruct the original outputs of pretrained networks from a sparse bottleneck layer.
Monosemanticity → improves → Model Robustness
confidence 95% · monosemantic features not only enhance interpretability but also bring concrete gains in model performance.
Polysemanticity → causes → Lack of Interpretability
confidence 90% · Deep learning models often suffer from a lack of interpretability due to polysemanticity
Cypher Suggestions (2)
Find all methods used to achieve monosemanticity. · confidence 90% · unvalidated
MATCH (m:Method)-[:ATTAINS]->(c:Concept {name: 'Monosemanticity'}) RETURN m.name
Identify the impact of monosemanticity on model metrics. · confidence 90% · unvalidated
MATCH (c:Concept {name: 'Monosemanticity'})-[r:IMPROVES]->(m:Metric) RETURN m.name, r.relation
Abstract
Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings of the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis linking interpretability and robustness. Code is available at \url{this https URL}.
Tags
Links
PDF not stored locally. Use the link above to view on the source site.
Full Text
202,529 characters extracted from source content.
Beyond Interpretability: The Gains of Feature Monosemanticity on Model Robustness
Qi Zhang1∗ Yifei Wang2 Jingyi Cui1 Xiang Pan3 Qi Lei3 Stefanie Jegelka4,5 Yisen Wang1
1 Peking University 2 MIT CSAIL 3 New York University 4 TUM CIT, MCML, MDSI 5 MIT EECS, CSAIL
∗ Equal Contribution. Corresponding Author: Yisen Wang (yisen.wang@pku.edu.cn).
Abstract
Deep learning models often suffer from a lack of interpretability due to polysemanticity, where individual neurons are activated by multiple unrelated semantics, resulting in unclear attributions of model behavior. Recent advances in monosemanticity, where neurons correspond to consistent and distinct semantics, have significantly improved interpretability but are commonly believed to compromise accuracy. In this work, we challenge the prevailing belief of the accuracy-interpretability tradeoff, showing that monosemantic features not only enhance interpretability but also bring concrete gains in model performance. Across multiple robust learning scenarios, including input and label noise, few-shot learning, and out-of-domain generalization, our results show that models leveraging monosemantic features significantly outperform those relying on polysemantic features. Furthermore, we provide empirical and theoretical understandings of the robustness gains of feature monosemanticity. Our preliminary analysis suggests that monosemanticity, by promoting better separation of feature representations, leads to more robust decision boundaries. This diverse evidence highlights the generality of monosemanticity in improving model robustness. As a first step in this new direction, we embark on exploring the learning benefits of monosemanticity beyond interpretability, supporting the long-standing hypothesis linking interpretability and robustness. Code is available at https://github.com/PKU-ML/Beyond_Interpretability.
1 Introduction
A long-standing problem of deep learning is its so-called "black-box" nature.
An important factor behind this lack of interpretability is feature polysemanticity, where a single neuron (a dimension of the feature map) is activated by multiple irrelevant semantics (Arora et al., 2018; Olah et al., 2020), preventing clear attributions of neural behaviors. Following this understanding, recent research has made breakthroughs toward attaining monosemanticity, i.e., neurons corresponding to consistent semantics (monosemantic), which dramatically improves model interpretability; see a comparison in Figure 1(a). These works achieve this through architectural designs (Elhage et al., 2022a; Wang et al., 2024) or post-training explanation modules (Cunningham et al., 2024), and have successfully scaled to visual backbones (e.g., ResNet) and large language models (LLMs, e.g., Claude, GPT, and Gemma), discovering many intriguing phenomena and applications (Templeton, 2024; Gao et al., 2024; Lieberum et al., 2024; Wang et al., 2024). However, these works on monosemanticity suggest an inevitable "accuracy-interpretability" tradeoff: monosemantic features, although more interpretable, sacrifice expressive power and underperform polysemantic features in prediction accuracy. This widely accepted belief (Huben et al., 2023; Elhage et al., 2022b) limits the applications of monosemanticity techniques to interpretability-related domains only. In this paper, we aim to push this boundary one step forward by demonstrating that monosemanticity can also bring significant gains in practical model performance beyond interpretability. In particular, we discover a widespread phenomenon: monosemantic features are much more robust than polysemantic features across multiple robustness-related scenarios. One such scenario is learning with noise. Real-world data are often imperfect, with low-quality inputs and mislabeling, manifested in the form of various data noises and distribution shifts.
We find that under either input or label noise, learning a classifier upon (pretrained) monosemantic features can attain much higher accuracy (e.g., +13.7% top-1 accuracy under 90% label noise) than upon polysemantic features, as shown in Figure 1(b). This feature-centric result also offers a new perspective on noisy learning, where existing studies primarily focus on robust learning objectives (Wang et al., 2019a; Song et al., 2020). The second scenario is few-shot finetuning for downstream classification. Today's large visual backbones often need to be finetuned on a small amount of downstream labeled data, where models easily overfit and deteriorate. We find that monosemantic finetuning, i.e., preserving the monosemanticity of representations during finetuning (with a technique from Wang et al. (2024)), can attain much higher accuracy under few-shot data compared to vanilla polysemantic finetuning (e.g., +3.9% top-1 accuracy with 10% of the samples). The same method also works for finetuning with noisy data or training from scratch. With these benefits in mind, we further explore a third scenario, LLM finetuning, which is widely applied today (Minaee et al., 2024). Pretrained LLMs need to be carefully finetuned on small-scale language data for different purposes, e.g., instruction following and specific abilities (e.g., reasoning), while avoiding conflicts and forgetting. Since LLMs do not have a natural representation space like visual models, we devise a simple sparse variant of LoRA, named MonoLoRA, to encourage the monosemanticity of the updates of all features. We show preliminary evidence that when finetuning an aligned LLM (Llama-2-7b-chat) on SST-2 (a classification task) and Dolly (an instruction-following task), MonoLoRA better preserves model alignment while improving task performance. Finally, we attempt to offer a deeper understanding of the robustness gains of monosemanticity.
Empirically, we compare the salient features of different classifiers, observing that the more robust classifiers tend to depend on more monosemantic features. Theoretically, as a preliminary step, we compare polysemantic and monosemantic features under a toy model proposed in Elhage et al. (2022b). The theory suggests that because monosemantic features are better separated, they are less prone to overfitting noise, leading to more robust decision boundaries than polysemantic features. In summary, this work challenges the common "accuracy-interpretability" tradeoff by demonstrating the potential of feature monosemanticity to bring clear gains in model accuracy. These gains manifest in various aspects of learning robustness: input noise, label noise, out-of-domain data, few-shot image data, and few-shot language data. This diverse evidence strongly indicates that feature monosemanticity provides a general form of robustness compared to polysemantic features, echoing the long-standing hypothesis relating better feature interpretability to better robustness (e.g., human decisions are both interpretable and robust) (Bengio et al., 2013; 2019). As a first step in this direction, we believe this work will spark more intriguing discoveries and understandings of the learning benefits of monosemanticity beyond interpretability.
Figure 1: A comparison between polysemantic (CL) and monosemantic features (NCL, SAE) pretrained on ImageNet-100. (a) Illustration of activated samples on a polysemantic (left) and a monosemantic (right) dimension. (b) Test accuracy (%) of classifiers learned upon polysemantic and monosemantic features in different scenarios. We consider noisy labels (90% noise rate) and Gaussian input noise (0.6 stdev); see more details in Appendix A.4.
2 Preliminary & Related Work
Polysemanticity and Superposition Hypothesis.
Across various domains, many previous studies (Nguyen et al., 2016; Mu & Andreas, 2020; Olah et al., 2020) have consistently observed that a feature dimension in a neural network is usually activated by multiple unrelated semantics. Researchers refer to this phenomenon as feature polysemanticity. In contrast, when each dimension is activated by a single latent natural concept, the features are called monosemantic. A popular explanation of feature polysemanticity is the superposition hypothesis (Arora et al., 2018; Olah et al., 2020), which states that each polysemantic dimension is an approximately linear combination of multiple natural concepts. To verify this, Elhage et al. (2022b) propose a toy model that produces polysemantic features consistent with the superposition hypothesis. Comparing polysemantic and monosemantic features, there is a common belief that monosemantic features exhibit better interpretability at the cost of downstream performance (Cunningham et al., 2024; Elhage et al., 2022b). However, in this paper, we challenge this trade-off, finding that monosemantic features also show superiority when performance is evaluated on robustness tasks.
Methods to Attain Feature Monosemanticity. To enhance feature interpretability, researchers have proposed several methods to obtain monosemantic features. For example, the Variational Autoencoder (VAE) (Kingma, 2013) and its variants (Higgins et al., 2017; Chen et al., 2018) have been used to find disentangled features with monosemanticity. However, the performance of these methods on real-world tasks like image classification and natural language understanding is quite unsatisfactory. Recently, researchers have tried to attain monosemanticity with minimal influence on performance. The approaches can be broadly divided into two categories (Bereska & Gavves, 2024): intrinsic and post-hoc methods.
The intrinsic methods, represented by non-negative contrastive learning (Wang et al., 2024), adjust the pretraining algorithm, while post-hoc methods apply downstream modifications to learned features. For example, the sparse autoencoder, which reconstructs features from a sparse bottleneck layer, has recently shown impressive monosemanticity in various models (Ng et al., 2011; Gao et al., 2024). We note that previous works mainly focus on enhancing feature interpretability with monosemanticity. In this paper, by contrast, we explore the relationship between monosemanticity and another crucial property of features: robustness.
Robust Learning. In the development of deep learning models, robustness is a critical measure for evaluating the quality of features (Wang et al., 2021; Xu et al., 2021; Muhammad & Bae, 2022). The evaluation of robustness involves various task scenarios, commonly including robustness against noisy labels (Song et al., 2022), distribution shifts (Yang et al., 2024), overfitting (Ying, 2019), etc. In this paper, we analyze robustness from a new perspective: we evaluate the influence of monosemanticity across different robustness tasks. For learning with noisy labels, we apply symmetric label noise to the training samples, i.e., with probability η (the noise rate), the label of a sample is uniformly flipped to one of the other classes. For robustness against distribution shifts, we apply various shifts, such as Gaussian noise, uniform noise, and real-world distribution shifts (Wang et al., 2019a; Geirhos et al., 2018), to the validation samples. For robustness against overfitting, we finetune vision and language models with fewer samples and evaluate the validation performance.
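As a concrete illustration of the symmetric label noise described above, here is a minimal numpy sketch; the function name and interface are ours, not the paper's:

```python
import numpy as np

def symmetric_label_noise(labels, eta, num_classes, rng=None):
    """Flip each label, with probability eta, to a uniformly chosen *other* class."""
    rng = np.random.default_rng(rng)
    labels = np.asarray(labels).copy()
    flip = rng.random(labels.shape[0]) < eta
    # Draw an offset in {1, ..., C-1} so a flipped label never equals the original.
    offsets = rng.integers(1, num_classes, size=int(flip.sum()))
    labels[flip] = (labels[flip] + offsets) % num_classes
    return labels

clean = np.zeros(1000, dtype=int)   # toy dataset: all samples in class 0
noisy = symmetric_label_noise(clean, eta=0.5, num_classes=10, rng=0)
print((noisy != clean).mean())      # roughly eta = 0.5 of labels are flipped
```

The modular offset trick guarantees that every flip lands on a different class, matching the "flipped to the other classes" definition.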
3 The Robustness Gains of Monosemanticity
In this section, we compare polysemantic with monosemantic features across three robust learning scenarios commonly encountered in the foundation model regime: first, noisy linear probing on pretrained features (either polysemantic or monosemantic); second, noisy and few-shot finetuning from pretrained weights; third, finetuning LLMs on small-scale supervised data.
3.1 Monosemantic Features are Robust under Linear Probing
Foundation models typically have two training phases: 1) self-supervised learning (SSL) on massive unlabeled data, and 2) supervised finetuning on small human-labeled data (classification, instruction following, or specific tasks). Since SSL-pretrained features contain rich semantics, learning a simple linear classifier on top, known as linear probing (LP), can often attain performance competitive with fully supervised training (Chen et al., 2020). Therefore, we start with this simplest setting for comparing the robustness of polysemantic and monosemantic pretrained features. Specifically, we consider a standard linear probing setting, where we first pretrain features on unlabeled data and then learn a linear classifier on top with noisy labeled data.
3.1.1 Methods for Feature Monosemanticity
Among existing interpretability research, there are two categories of methods to attain monosemanticity: 1) intrinsic methods, where pretrained features are intrinsically monosemantic; and 2) post-hoc methods, where additional techniques decode (polysemantic) pretrained features into monosemantic ones. Here, we consider one representative method for each paradigm.
Intrinsic Monosemanticity with NCL. Many previous works have tried to train interpretable features by adding sparsity regularization (Tibshirani, 1996) or identifiability constraints (Zhang et al., 2024b), but these hardly scale to large-scale data with competitive performance.
A recent work, NCL (non-negative contrastive learning) (Wang et al., 2024), as a modern counterpart to NMF (non-negative matrix factorization) (Lee & Seung, 1999), attains high sparsity and monosemanticity while having minimal influence on final performance. Specifically, NCL adopts the following InfoNCE loss (Oord et al., 2018) with non-negative feature outputs:

\mathcal{L}_{\mathrm{NCL}}(f) = -\mathbb{E}_{x,x^+} \log \frac{\exp(f_+(x)^\top f_+(x^+))}{\exp(f_+(x)^\top f_+(x^+)) + \frac{1}{M}\sum_{i=1}^{M} \exp(f_+(x)^\top f_+(x_i^-))},    (1)

where (x, x^+) and (x, x^-) are the positive and negative pairs in contrastive learning, f_+(x) = \sigma(f(x)), \sigma is a non-negative activation function, and f is the original neural network. With the non-negative constraints, the activations of the learned representations become sparse, and each dimension is almost exclusively activated by samples from the same class (Wang et al., 2024).
Post-hoc Monosemanticity with SAE. Another approach is to apply downstream modifications to pretrained neural networks. Sparse autoencoders (SAEs) (Ng et al., 2011) have found wide success in attaining monosemanticity in language models (Templeton, 2024; Gao et al., 2024; Lieberum et al., 2024). SAEs reconstruct the original outputs of pretrained networks from a sparse bottleneck layer. To be specific, the encoder and decoder are defined as:

z(x) = \mathrm{TopK}(W_{\mathrm{enc}}(f(x) - b_{\mathrm{pre}}) + b_{\mathrm{enc}}),    (2)
\hat{f}(x) = W_{\mathrm{dec}} z(x) + b_{\mathrm{pre}},
where f(x) is the representation of input x; W_enc, W_dec, b_pre, and b_enc are the parameters of the SAE; TopK is a sparse activation function proposed by Gao et al. (2024) that preserves only the top K elements; and the SAE training loss is the reconstruction MSE \mathcal{L}_{\mathrm{SAE}} = \mathbb{E}_x \|\hat{f}(x) - f(x)\|^2. As a result, the sparse latent feature z(x) has much better monosemanticity than the original feature f(x).

Table 1: Linear probing accuracy and gain (%) of polysemantic and monosemantic representations on CIFAR-100 and ImageNet-100 under different rates (%) of label noise, from 0 (clean labels) to 90.

Dataset        Features     0     10    20    30    40    50    60    70    80    90
CIFAR-100      Poly (CL)    54.5  53.4  52.8  52.1  51.2  49.9  49.5  48.0  45.0  35.7
               Mono (SAE)   54.4  53.9  53.4  52.9  51.9  51.3  50.5  49.7  47.1  39.1
               (gain)       -0.1  +0.5  +0.6  +0.8  +0.7  +1.4  +1.0  +1.7  +2.1  +3.4
               Mono (NCL)   52.8  53.9  53.5  52.6  52.6  52.3  51.4  49.5  48.0  44.9
               (gain)       -1.7  +0.5  +0.7  +0.5  +1.4  +2.4  +1.9  +1.5  +3.0  +9.2
ImageNet-100   Poly (CL)    66.8  63.1  61.8  60.1  58.8  56.4  54.9  53.1  48.9  34.4
               Mono (SAE)   66.2  65.3  62.3  60.5  59.8  59.7  58.5  55.9  54.3  45.9
               (gain)       -0.4  +2.2  +0.5  +0.4  +1.0  +3.3  +3.6  +2.8  +5.4  +11.5
               Mono (NCL)   66.7  65.4  64.4  63.9  62.4  62.3  60.6  59.8  57.6  48.1
               (gain)       -0.1  +2.3  +2.6  +3.8  +3.6  +5.9  +5.7  +6.7  +8.7  +13.7

3.1.2 Experiments
Setup. For the baseline, we pretrain a ResNet-18 (He et al., 2016) backbone with the widely used contrastive framework SimCLR (Chen et al., 2020) on CIFAR-100 and ImageNet-100. For comparison, we use Non-negative Contrastive Learning (Wang et al., 2024) and the Sparse Autoencoder (Gao et al., 2024) to represent the two primary strategies for obtaining monosemantic features, i.e., improving the pretraining algorithm and applying downstream modification.
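To make Equation (2) concrete, here is a minimal numpy sketch of the TopK sparse autoencoder forward pass; the dimensions, initialization, and function names are illustrative, not the paper's settings:

```python
import numpy as np

def topk(z, k):
    """Keep the k largest entries per row; zero out the rest (TopK activation)."""
    out = np.zeros_like(z)
    idx = np.argsort(z, axis=1)[:, -k:]               # indices of the k largest values
    np.put_along_axis(out, idx, np.take_along_axis(z, idx, axis=1), axis=1)
    return out

def sae_forward(f_x, W_enc, b_pre, b_enc, W_dec, k):
    """Encode frozen features f(x) into a sparse latent z, then reconstruct f_hat(x)."""
    z = topk((f_x - b_pre) @ W_enc + b_enc, k)        # Equation (2), encoder
    f_hat = z @ W_dec + b_pre                         # decoder
    return z, f_hat

rng = np.random.default_rng(0)
d, h, k = 16, 64, 4                                   # feature dim, latent dim, sparsity level
f_x = rng.normal(size=(8, d))
W_enc = rng.normal(size=(d, h))
W_dec = rng.normal(size=(h, d)) * 0.1
z, f_hat = sae_forward(f_x, W_enc, np.zeros(d), np.zeros(h), W_dec, k)
mse = np.mean((f_hat - f_x) ** 2)                     # the reconstruction loss L_SAE
print(z.shape, (z != 0).sum(axis=1))                  # at most k active latents per sample
```

In training, `mse` would be minimized over the SAE parameters while the backbone f stays frozen; the downstream linear probe is then learned on the sparse latent z.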
For Non-negative Contrastive Learning (NCL), we follow the default SimCLR settings, with the addition of a non-negative constraint using the ReLU function. For the Sparse Autoencoder (SAE), we apply it on top of the pretrained backbone as in Equation (2), and then train the linear classifier on the frozen latent representation of the SAE. More details can be found in Appendix A.1.
Robustness Against Label Noise. To evaluate robustness against label noise, we train a linear classifier on top of the frozen pretrained encoders, where labels are uniformly flipped to the other classes with probability η (the noise rate). As shown in Table 1, when the linear classifiers are trained on samples with clean labels, monosemantic and polysemantic features exhibit comparable performance. However, in the presence of label noise, both NCL and SAE significantly outperform the polysemantic baseline across datasets. When the noise rates are aggressive, the improvements are substantial: under 90% noisy labels, NCL shows a 13.7% improvement on ImageNet-100 and a 9.2% improvement on CIFAR-100. These results are consistent with those in the toy models and further verify that monosemantic features are more robust against label noise.
Robustness Against Distribution Shifts. To evaluate the resilience of features to distribution shifts, we consider three types of shifts on ImageNet-100: random Gaussian input noise, random uniform input noise, and real-world distribution shifts (Wang et al., 2019a; Geirhos et al., 2018). The models and classifiers are trained on the clean ImageNet-100 dataset, while their classification performance is evaluated on noisy samples. As shown in Figures 2(a), 2(b), and 2(c), both the pretraining constraint and the downstream modification that enhance feature monosemanticity improve classification accuracy on noisy samples, and the benefits rise with noise strength.
The results suggest that monosemantic features also enhance robustness against various input noises.
Figure 2: Evaluation of robustness against input distribution shifts on ImageNet-100: (a) Gaussian input noise, (b) uniform input noise, (c) real-world distribution shift. Monosemantic representations (SAE, NCL) exhibit improved robustness against different kinds of distribution shifts.
3.2 Monosemantic Features are Robust under Few-shot and Noisy Finetuning
In practice, fully finetuning a large pretrained model on downstream labeled data can often achieve better performance than linear probing, but it also easily overfits when only a small amount of labeled data is available. Here, we compare standard (polysemantic) finetuning to monosemantic finetuning.
3.2.1 Methods for Monosemantic Finetuning
Standard Finetuning. For the baseline, we consider a common finetuning setting: we pretrain the encoder with contrastive learning on unlabeled ImageNet-100 and then learn a linear classifier on labeled ImageNet-100 with the cross-entropy loss:

\mathcal{L}_{\mathrm{CE}}(f) = -\mathbb{E}_{x,y} \log \frac{\exp(f(x)^\top w_y)}{\sum_{c=1}^{C} \exp(f(x)^\top w_c)},

where f is the encoder network and w_c is the linear classifier weight for class c. Unlike linear probing, we train the classifier without stopping gradients to the encoder, so the encoder is updated as well.
Non-negative Tuning. According to NCL (Wang et al., 2024), replacing the original cross-entropy (CE) loss used in supervised learning with the non-negative cross-entropy (NCE) loss maintains monosemanticity during supervised learning. Thus, we use it as a monosemantic finetuning strategy.
To be specific, NCE applies a non-negative transformation to the representations f(x), i.e.,

\mathcal{L}_{\mathrm{NCE}}(f) = -\mathbb{E}_{x,y} \log \frac{\exp(f_+(x)^\top w_y)}{\sum_{c=1}^{C} \exp(f_+(x)^\top w_c)},    (3)

where f_+(x) = \sigma(f(x)) with a non-negative activation function, e.g., ReLU. By finetuning contrastive pretrained models with the CE and NCE objectives respectively, we compare the robustness of polysemantic and monosemantic features across two tasks: few-shot finetuning and noisy-label finetuning.
3.2.2 Experiments
Few-shot Finetuning. As finetuning usually involves fewer training samples, a crucial challenge for feature robustness is preventing overfitting on small training datasets. To evaluate polysemantic and monosemantic features during few-shot finetuning, we use 10%, 20%, 50%, and the entire training set of ImageNet-100 to finetune the pretrained representations with the CE and NCE objectives. As shown in Figures 3(a) and 3(b), the monosemantic features exhibit lower training accuracy but higher validation accuracy in few-shot finetuning, and the advantage grows as the training set becomes smaller, which implies that monosemanticity makes representations less likely to overfit the training set in the downstream task.
Figure 3: The robustness of models finetuned with polysemantic (CE) and monosemantic (NCE) objectives under different noises on ImageNet-100: (a) training accuracy of few-shot finetuning, (b) validation accuracy of few-shot finetuning, (c) noisy-label finetuning. Attaining monosemanticity during finetuning enhances robustness across tasks.
Noisy-label Finetuning.
We also evaluate robustness against label noise in finetuning tasks on ImageNet-100. During finetuning, the labels of training samples are uniformly flipped to the other classes with probability η (the noise rate). As shown in Figure 3(c), non-negative finetuning leads to significant gains under label noise, and these gains keep growing as the noise rate increases. Notably, monosemantic features exhibit up to an 11.9% improvement under large noise rates. These empirical results indicate that maintaining feature monosemanticity during finetuning brings better robustness against overfitting and label noise.
3.3 Monosemantic LoRA for Large Language Models
In Section 3.2.2, we showed that maintaining feature monosemanticity during supervised finetuning makes models much more resistant to overfitting. This favorable property suggests that monosemanticity can also benefit LLM finetuning, which is widely applied today. Existing LLM training has two stages: 1) pretraining on large-scale unlabeled data, and 2) supervised finetuning (or post-training) on small-scale data. Since LLMs are very large and labeled datasets are small, overfitting becomes a severe issue in LLM finetuning (VM et al., 2024; Zhang et al., 2024a). Given that LLMs, unlike supervised classifiers, do not have a natural representation space and are more prone to overfitting due to their size, we extend LoRA, a standard efficient finetuning method, to produce a more monosemantic update per layer by promoting sparsity in its update.
3.3.1 Methods for Monosemantic LLM Finetuning
Low-rank Adaptation (LoRA). LoRA (low-rank adaptation) is a de facto method for finetuning LLM weights at a lower cost by factorizing the weight update into low-rank matrices.
Specifically, for each LLM weight W_0 ∈ R^{d×k}, we reparameterize the finetuned update as ΔW = AB, where A ∈ R^{d×r} and B ∈ R^{r×k} (with r ≪ min(d, k)) are the two low-rank matrices actually learned during finetuning. After finetuning, the output of the linear layer becomes

y = W_{\mathrm{LoRA}} x = (W_0 + ΔW) x = W_0 x + ΔW x = W_0 x + ABx.    (4)

The LoRA weights can be used separately or merged back into the model weights.
Monosemantic LoRA. Inspired by non-negative finetuning (Section 3.2.1), we add non-negative constraints inside the LoRA modules to better promote feature monosemanticity:

y = W_{\mathrm{MonoLoRA}} x = W_0 x + ΔW(x) = W_0 x + σ(A σ(B σ(x))),    (5)

where σ is a non-negative transformation (ReLU by default). Compared to Equation (4), the MonoLoRA update encourages the low-rank weights to yield sparse updates, which helps prevent overfitting.
3.3.2 Experiments
For evaluation, we consider a common robustness-related scenario in large language model finetuning: during finetuning, large language models often compromise their already-learned alignment, which poses a security risk (Qi et al., 2023). In practice, we use Llama-2-7B-Chat (Touvron et al., 2023) as the aligned model and finetune it on the SST-2 (Socher et al., 2013) and Dolly (Conover et al., 2023) datasets as downstream tasks.
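Equations (4) and (5) can be sketched side by side as follows; this is a minimal numpy illustration where the shapes and the placement of ReLU follow the equations above, while the function names and toy dimensions are ours:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def lora_forward(x, W0, A, B):
    """Standard LoRA: y = W0 x + A B x (Equation 4)."""
    return W0 @ x + A @ (B @ x)

def monolora_forward(x, W0, A, B):
    """MonoLoRA: non-negative activations inside the update (Equation 5),
    y = W0 x + relu(A relu(B relu(x)))."""
    return W0 @ x + relu(A @ relu(B @ relu(x)))

rng = np.random.default_rng(0)
d, k, r = 8, 8, 2                          # output dim, input dim, LoRA rank
W0 = rng.normal(size=(d, k))
A = rng.normal(size=(d, r))
B = rng.normal(size=(r, k))
x = rng.normal(size=k)

delta = monolora_forward(x, W0, A, B) - W0 @ x
print((delta == 0).mean())                 # the ReLUs zero out part of the update
```

Note that, unlike Equation (4), the MonoLoRA update is a nonlinear function of x, so it cannot be merged back into W_0; the sparsity of `delta` is exactly what the paper measures as intrinsic LoRA-module sparsity.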
To evaluate safety and alignment, we use the ShieldGemma-9B (Zeng et al., 2024) and Beavertails-7B (Ji et al., 2024) models to score the alignment of model responses on the Beavertails dataset (Ji et al., 2024). More details can be found in Appendix A.3.

Table 2: Evaluation of LoRA and MonoLoRA with Llama-2-7B-Chat on SST-2 and Dolly. SST-2 is evaluated by accuracy and Dolly by RougeL. For the ShieldGemma alignment scores (Danger., Harass., Hate., Sex., Avg) and the Beavertails score, lower is better (↓); for task performance, higher is better (↑).

Dataset  Model     Danger.  Harass.  Hate.  Sex.  Avg   Align. Sparsity  Task Sparsity  Beavertails (↓)  Task Perf. (↑)
SST-2    Base      7.66     2.88     6.14   2.64  4.83  -                -              20.90            88.65
         LoRA      8.48     6.91     9.43   6.77  7.90  0                0              20.60            92.78
         MonoLoRA  5.37     2.23     4.63   1.88  3.53  45.54            36.71          20.00            94.84
Dolly    Base      7.66     2.88     6.14   2.64  4.83  -                -              20.90            10.21
         LoRA      10.54    3.53     7.53   2.86  6.12  0                0              23.80            14.08
         MonoLoRA  10.49    3.56     7.40   2.70  6.04  38.69            40.00          22.60            14.48

As shown in Table 2, the alignment of the MonoLoRA models is more resilient to overfitting than that of standard LoRA, while achieving comparable finetuning task performance. We also evaluate the sparsity (zero-value ratio) of the intermediate activations of the LoRA and MonoLoRA modules, which reflects the intrinsic sparsity of the LoRA module. The results suggest that neuron-level monosemanticity can also improve the robustness of LLMs against overfitting when finetuned on small-scale data.
4 Understanding the Robustness Gains of Monosemanticity
In Section 3, we provided a comprehensive evaluation of the robustness gains of feature monosemanticity across multiple scenarios. Yet, we do not have a fully clear understanding of why monosemantic features are more robust. As a preliminary step toward demystifying this phenomenon, in this section we investigate the influence of monosemanticity on learned classifiers from both empirical (Section 4.1) and theoretical (Sections 4.2 & 4.3) perspectives.
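The sparsity metric reported in Table 2, the zero-value ratio of intermediate activations, reduces to a simple fraction. A minimal sketch, with a made-up activation matrix standing in for the LoRA module's intermediate outputs:

```python
import numpy as np

def zero_ratio(activations):
    """Sparsity as the fraction of exactly-zero entries in an activation tensor."""
    a = np.asarray(activations)
    return float((a == 0).mean())

# Hypothetical intermediate activations after a ReLU: negatives become exact zeros.
acts = np.maximum(np.array([[-1.0, 2.0], [0.5, -3.0]]), 0.0)
print(zero_ratio(acts))  # → 0.5
```

For a plain LoRA module with no internal nonlinearity, this ratio is 0, matching the zeros in the LoRA rows of Table 2.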
For simplicity, we focus on the label noise scenario.

4.1 Noisy Classifiers Prefer Monosemantic Features in Practice

To further understand the robustness improvements brought by monosemanticity, we investigate the difference in the salient features of robust and non-robust classifiers under noisy conditions. Taking the linear classifier trained on ImageNet-100 with 90% noisy labels (Section 3.1) as an example, we first visualize the dominant features for the classes with the highest and lowest accuracy. For each class, we find the feature dimension with the largest classifier weight for the ground-truth label and visualize the top-activated samples along that dimension. As shown in Figures 4(a) and 4(b), we observe a clear difference: samples activated in the dimension associated with the lowest-accuracy class (jeans) belong to different classes, while samples activated in the dimension associated with the highest-accuracy class (boathouse) share the same label, i.e., the classifier with higher performance under label noise relies on a more monosemantic dimension.

(a) The most salient feature of the lowest-accuracy class is polysemantic. (b) The most salient feature of the highest-accuracy class is monosemantic. (c) Correctly classified samples have more monosemantic features.

Figure 4: Influence of feature monosemanticity on classification performance, where the classifier is applied after a frozen contrastive encoder and trained with 90% noisy labels. (a), (b) respectively show the activated samples on the dimensions with the largest classifier weight of the lowest-accuracy and highest-accuracy classes on ImageNet-100. (c) shows the monosemanticity scores (Wang et al., 2024) of wrongly and correctly classified samples.

We then validate this observation with the semantic consistency (Wang et al., 2024) as a quantitative monosemanticity score.
The semantic consistency measures the proportion of activated samples that belong to their most frequent class along a dimension. With larger semantic consistency, a dimension is more likely to be activated by samples from the same class, i.e., the feature is more monosemantic. To compare the robust and non-robust classifiers, we separately take the samples that are wrongly and correctly classified by the classifiers learned on ImageNet-100 with 90% noisy labels. For the embedding of each sample, we take the dimension with the largest activation value and calculate its semantic consistency. As shown in Figure 4(c), the semantic consistency of the most salient features of correctly classified samples is much higher than that of misclassified samples. These results further indicate that classifiers with superior performance under noise tend to depend on monosemantic features.

4.2 Replicating Monosemanticity Gains with the Superposition Model

To further establish a theoretical understanding of the benefits brought by monosemanticity, we introduce the toy model proposed by Elhage et al. (2022b) for simplicity of analysis. The toy model constructs polysemantic representations via the superposition hypothesis (Arora et al., 2018), a widely used explanation of feature polysemanticity. The hypothesis states that a polysemantic feature is an approximately linear combination of multiple latent semantics, while a monosemantic feature is the reconstruction of a single natural concept. Under this hypothesis, the toy model enables researchers to replicate the polysemanticity phenomenon and theoretically analyze the properties of polysemantic and monosemantic features, e.g., occurrence conditions, learning dynamics, and geometric structures (Lecomte et al., 2024; Marshall & Kirchner, 2024; Chen et al., 2023). In this section, we start by introducing the setup and observing the robustness of different features on the toy model.
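As a concrete illustration, the semantic consistency score described in Section 4.1 can be computed as follows. This is a minimal sketch: the function name, the top-k cutoff, and the synthetic data are illustrative assumptions, not the authors' exact protocol.

```python
import numpy as np

def semantic_consistency(features, labels, dim, top_k=50):
    """Fraction of the top-k samples activated on `dim` that share the most
    frequent class among them (higher = more monosemantic)."""
    top_idx = np.argsort(features[:, dim])[-top_k:]   # top-activated samples
    _, counts = np.unique(labels[top_idx], return_counts=True)
    return counts.max() / top_k

# Synthetic check: dimension 0 fires mostly for class 3 (monosemantic),
# dimension 1 fires for random classes (polysemantic).
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
features = rng.random(size=(1000, 2))
features[labels == 3, 0] += 5.0   # class-3 samples dominate dimension 0

mono = semantic_consistency(features, labels, dim=0)
poly = semantic_consistency(features, labels, dim=1)
print(mono, poly)   # mono near 1.0, poly near the chance level
```

On this synthetic data, the monosemantic dimension scores near 1.0 while the polysemantic dimension scores near the chance rate of the dominant class, mirroring the gap between Figures 4(a) and 4(b).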
(a) Polysemantic dimensions correspond to multiple latent semantics. (b) Polysemantic features have worse performance under noisy data.

Figure 5: The comparison between polysemantic and monosemantic features on the toy model introduced by Elhage et al. (2022b) ($n=40$, $m=20$, $S=0.2$). (a) shows the parameters ($W^\top W$) of monosemantic (left) and polysemantic (right) features on the toy model. (b) evaluates the classification performance of the features against different noises. Label noise denotes applying 90% noisy labels to the training samples; input noise denotes applying Gaussian noise to the validation samples.

Toy Model Setups. In practice, we follow the settings proposed by Elhage et al. (2022b) and evaluate the robustness of polysemantic features on the toy model. Specifically, we assume each sample $x$ has $n$ dimensions and each dimension represents a natural concept. As the features in real-world datasets are usually sparsely activated (Olshausen & Field, 1997), we associate each dimension of a sample $x$ with a sparsity $S$ and let $x_i = 0$ with probability $S$; otherwise, each dimension is uniformly distributed on $[0, 1]$. When evaluating performance, we consider the classification task of natural concepts, i.e., the labels satisfy $y(x) = \arg\max_i x_i$. For the encoding network, we consider a linear model $h = W^\top W x$, where $W \in \mathbb{R}^{m \times n}$ with $m < n$, i.e., the hidden dimension is smaller than the input dimension. In practice, we use the reconstruction of $x$ as the training objective and obtain two kinds of learned features.
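The toy-model data and the bottlenecked linear encoder described above can be sketched as follows. This is an illustrative NumPy setup under the stated hyperparameters; the reconstruction training loop is omitted and the weight initialization is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, S = 40, 20, 0.2          # input dim, hidden dim, sparsity (as in Figure 5)
num_samples = 1024

# Each dimension is 0 with probability S, otherwise Uniform[0, 1].
x = rng.uniform(0.0, 1.0, size=(num_samples, n))
x[rng.random(size=(num_samples, n)) < S] = 0.0

# Labels are the index of the largest concept: y(x) = argmax_i x_i.
y = x.argmax(axis=1)

# Bottlenecked linear encoder: h = W^T W x, with W in R^{m x n} and m < n.
W = rng.normal(scale=0.1, size=(m, n))  # would be trained to reconstruct x
h = x @ W.T @ W                         # shape (num_samples, n)
print(h.shape, y.shape)
```

Training $W$ to reconstruct $x$ on such data yields either a diagonal $W^\top W$ (monosemantic, only $m$ concepts captured) or a superposed one, which is the dichotomy analyzed next.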
As shown in Figure 5(a), when superposition does not occur, we observe that $W^\top W$ is diagonal and has only $m$ non-zero elements, which means the model captures only $m$ concepts and each dimension is monosemantic. In contrast, when superposition happens, the features capture more concepts than the model has dimensions, and different concepts are projected into the same dimension.

Noisy learning settings. To evaluate the robustness of the features, we add noise to the labels and to the samples, respectively. For training with noisy labels, we denote the noise rate as $\eta$, where each label $y$ is uniformly switched to one of the other $n-1$ labels with probability $\eta$. In experiments, we select an aggressive noise rate (90%). With the labeled samples, we train a linear classifier on top of the frozen features and evaluate the classification accuracy on a validation set without noisy labels. For noisy sample validation, we train a linear classifier on the clean dataset and add Gaussian noise to the validation-set samples.

Empirical Results. As shown in Figure 5(b), in the absence of noise, polysemantic features exhibit better performance, which is expected since superposition enables the features to capture more concepts. However, when there is noise in the labels or samples, the situation changes significantly: the features without superposition outperform those with superposition under both label noise and input noise. These empirical results replicate the phenomenon that monosemantic features are more robust than polysemantic features.

4.3 Theoretical Analyses with the Superposition Model

After replicating the robustness gains of monosemanticity on the toy model, we establish a theoretical comparison between polysemantic and monosemantic features. For ease of theoretical analysis, we consider a binary classification case in the toy model ($n=2$, $m=1$, $S=0.2$).
To be specific, a sample $x$ has two latent features $x_1, x_2$, and the model parameter is $W \in \mathbb{R}^{1 \times 2}$. When we obtain monosemantic features, the model output is $\nu_{\mathrm{mono}} := x_1$. In contrast, when obtaining polysemantic features, the model keeps more natural concepts than the representation dimension. According to Elhage et al. (2022b), one common geometric structure of polysemantic features is antipodal pairs formed by two concepts. Therefore, we assume the learned polysemantic feature to be $\nu_{\mathrm{poly}} := x_1 - x_2$. For conciseness, we introduce the following notation for the mean and variance of a given feature representation $\nu$. For the clean distribution without label noise, we denote the conditional means and variances by $\mu_i(\nu) := \mathbb{E}(\nu \mid y = i)$ and $\sigma_i^2(\nu) := \mathbb{E}\big((\nu - \mu_i(\nu))^2 \mid y = i\big)$ for $i = 0, 1$. For distinction, for the noisy distribution we write $\tilde\mu$ and $\tilde\sigma$. Borrowing from linear discriminant analysis (LDA) (Fisher, 1936), we deem that a good linearly discriminative representation should have a large distance between classes while keeping the intra-class variance as small as possible, i.e., maximize $\Delta\mu(\nu) = |\mu_0(\nu) - \mu_1(\nu)|$ while minimizing $\sigma_0^2(\nu)$ and $\sigma_1^2(\nu)$.
Therefore, to quantitatively compare polysemantic and monosemantic representations, we use the criterion $J(\nu) = \Delta\mu(\nu) / \big(\sigma_0(\nu)\,\sigma_1(\nu)\big)$. A larger value of $J(\nu)$ indicates better linear separability.

Theorem 4.1 (Conditional means and variances of monosemantic & polysemantic features). Let $\nu_{\mathrm{mono}} = x_1$ and $\nu_{\mathrm{poly}} = x_1 - x_2$. For the conditional means, we have $\mu_0(\nu_{\mathrm{poly}}) < \mu_0(\nu_{\mathrm{mono}})$ and $\mu_1(\nu_{\mathrm{poly}}) < \mu_1(\nu_{\mathrm{mono}})$, yet $\Delta\mu(\nu_{\mathrm{poly}}) > \Delta\mu(\nu_{\mathrm{mono}})$. For the conditional variances, we have $\sigma_1^2(\nu_{\mathrm{poly}}) = \sigma_1^2(\nu_{\mathrm{mono}})$ and $\sigma_0^2(\nu_{\mathrm{poly}}) > \sigma_0^2(\nu_{\mathrm{mono}})$. Overall, $J(\nu_{\mathrm{poly}}) > J(\nu_{\mathrm{mono}})$.

According to the LDA criterion, the polysemantic feature, with its larger $J(\nu)$, is more linearly separable. Intuitively, because the polysemantic embedding encodes information about both $x_1$ and $x_2$, it can perform better classification w.r.t. labels that depend on both features. However, when label noise exists, we observe a different situation.

Theorem 4.2 (Influence of label noise on linear separability).
We denote the linear separability criterion under noise as $\tilde J(\nu) = \Delta\tilde\mu(\nu) / \big(\tilde\sigma_0(\nu)\,\tilde\sigma_1(\nu)\big)$. For noise rate $\eta \in [0, 0.5)$,

$$\frac{\tilde J(\nu_{\mathrm{poly}})}{J(\nu_{\mathrm{poly}})} \le \frac{\tilde J(\nu_{\mathrm{mono}})}{J(\nu_{\mathrm{mono}})} \le 1. \tag{6}$$

Meanwhile, we obtain $\tilde J(\nu_{\mathrm{poly}}) \le \tilde J(\nu_{\mathrm{mono}})$ when $\eta \in [0.25, 0.5)$.

As shown in Theorem 4.2, as the noise rate increases, the linear separability $\tilde J(\nu)$ of both polysemantic and monosemantic features degrades. However, $\tilde J(\nu_{\mathrm{mono}})$ decreases more slowly. As a result, when the noise rate is aggressive enough ($\eta \ge 0.25$), the monosemantic feature exhibits better linear separability than the polysemantic one. Moreover, in Appendix B.3, we show that input noise has a similar influence on linear separability. These theoretical results reveal that the linear separability of monosemantic features is more robust than that of polysemantic ones, which leads to better performance in tasks under noise.

5 Concluding Remarks

Recent work has made significant strides in enhancing model interpretability by promoting feature monosemanticity through various techniques. However, a prevailing belief in the literature posits an accuracy-interpretability tradeoff, suggesting that achieving monosemantic features for better interpretability necessarily compromises prediction accuracy.
In this study, we have challenged this notion by demonstrating the advantages of monosemanticity beyond interpretability alone. Specifically, we found that monosemantic features are significantly more robust to various types of distribution shift, including input noise, label noise, and real-world out-of-domain inputs. Additionally, we have shown that maintaining feature monosemanticity during fine-tuning serves as an effective regularizer, reducing model overfitting in few-shot settings, in noisy environments, and during large language model (LLM) fine-tuning. We also provide an in-depth analysis of the benefits of monosemantic features from both theoretical and empirical perspectives. These diverse sources of learning robustness collectively indicate that monosemantic features possess a general form of robustness, resonating with their benefits in interpretability. Therefore, rather than viewing monosemanticity as a necessary cost for interpretability, we advocate for embracing and exploring the multiple learning advantages it offers. We believe our work, as a pioneering effort in this direction, will inspire future research to investigate these possibilities further.

Reproducibility Statement

To ensure the reproducibility of our results, we elaborate on the details of our experiments and theoretical analysis in the main paper and the appendix. In Sections 3.1, 3.2, and 3.3 of the main paper, we respectively introduce the methods for capturing polysemantic and monosemantic features in linear probing, finetuning vision models, and finetuning LLMs. Furthermore, in Appendix A, we present the hyperparameters and implementation details of the adopted methods, and the detailed settings of the robustness evaluation, including input and label noise, few-shot learning, and out-of-domain generalization. For the theoretical results, we introduce the toy models used in Section 4.3 of the main paper and provide detailed proofs and explanations for the theoretical comparison in Appendix B.
Acknowledgement This research was funded in part by NSF Award CCF-2112665 (TILOS AI Institute), and an Alexander von Humboldt Professorship. References Arora et al. (2018) Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. Bengio et al. (2013) Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798–1828, 2013. Bengio et al. (2019) Yoshua Bengio, Tristan Deleu, Nasim Rahaman, Rosemary Ke, Sébastien Lachapelle, Olexa Bilaniuk, Anirudh Goyal, and Christopher Pal. A meta-transfer objective for learning to disentangle causal mechanisms. arXiv preprint arXiv:1901.10912, 2019. Bereska & Gavves (2024) Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082, 2024. Chen et al. (2019) Pengfei Chen, Ben Ben Liao, Guangyong Chen, and Shengyu Zhang. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, 2019. Chen et al. (2018) Ricky TQ Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In NeurIPS, 2018. Chen et al. (2020) Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020. Chen et al. (2023) Zhongtian Chen, Edmund Lau, Jake Mendel, Susan Wei, and Daniel Murfet. Dynamical versus bayesian phase transitions in a toy model of superposition. arXiv preprint arXiv:2310.06301, 2023. Conover et al. (2023) Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm. 
Company Blog of Databricks, 2023. Cunningham et al. (2024) Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. ICLR, 2024. Elhage et al. (2022a) Nelson Elhage, Tristan Hume, Catherine Olsson, Neel Nanda, Tom Henighan, Scott Johnston, Sheer ElShowk, Nicholas Joseph, Nova DasSarma, Ben Mann, Danny Hernandez, Amanda Askell, Kamal Ndousse, Andy Jones, Dawn Drain, Anna Chen, Yuntao Bai, Deep Ganguli, Liane Lovitt, Zac Hatfield-Dodds, Jackson Kernion, Tom Conerly, Shauna Kravec, Stanislav Fort, Saurav Kadavath, Josh Jacobson, Eli Tran-Johnson, Jared Kaplan, Jack Clark, Tom Brown, Sam McCandlish, Dario Amodei, and Christopher Olah. Softmax linear units. Transformer Circuits Thread, 2022a. Elhage et al. (2022b) Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652, 2022b. Fisher (1936) Ronald A Fisher. The use of multiple measurements in taxonomic problems. Annals of eugenics, 7(2):179–188, 1936. Gao et al. (2024) Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024. Geirhos et al. (2018) Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A Wichmann, and Wieland Brendel. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018. Ghosh et al. (2017) Aritra Ghosh, Himanshu Kumar, and P Shanti Sastry. Robust loss functions under label noise for deep neural networks. In AAAI, 2017. Han et al. (2018) Bo Han, Quanming Yao, Xingrui Yu, Gang Niu, Miao Xu, Weihua Hu, Ivor Tsang, and Masashi Sugiyama. 
Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, 2018. He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016. Higgins et al. (2017) Irina Higgins, Loic Matthey, Arka Pal, Christopher P Burgess, Xavier Glorot, Matthew M Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017. Huben et al. (2023) Robert Huben, Hoagy Cunningham, Logan Riggs Smith, Aidan Ewart, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. In ICLR, 2023. Ji et al. (2024) Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. In NeurIPS, 2024. Kingma (2013) Diederik P Kingma. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013. Lecomte et al. (2024) Victor Lecomte, Kushal Thaman, Rylan Schaeffer, Naomi Bashkansky, Trevor Chow, and Sanmi Koyejo. What causes polysemanticity? an alternative origin story of mixed selectivity from incidental causes. In ICLR 2024 Workshop on Representational Alignment, 2024. Lee & Seung (1999) Daniel D Lee and H Sebastian Seung. Learning the parts of objects by non-negative matrix factorization. Nature, 401(6755):788–791, 1999. doi: 10.1038/44565. Lieberum et al. (2024) Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. arXiv preprint arXiv:2408.05147, 2024. Ma et al. (2018) Xingjun Ma, Yisen Wang, Michael E Houle, Shuo Zhou, Sarah Erfani, Shutao Xia, Sudanthi Wijewickrema, and James Bailey. Dimensionality-driven learning with noisy labels. In ICML, 2018. Ma et al. 
(2020) Xingjun Ma, Hanxun Huang, Yisen Wang, Simone Romano, Sarah Erfani, and James Bailey. Normalized loss functions for deep learning with noisy labels. In ICML, 2020. Marshall & Kirchner (2024) Simon C Marshall and Jan H Kirchner. Understanding polysemanticity in neural networks through coding theory. arXiv preprint arXiv:2401.17975, 2024. Minaee et al. (2024) Shervin Minaee, Tomas Mikolov, Narjes Nikzad, Meysam Chenaghlu, Richard Socher, Xavier Amatriain, and Jianfeng Gao. Large language models: A survey. arXiv preprint arXiv:2402.06196, 2024. Mu & Andreas (2020) Jesse Mu and Jacob Andreas. Compositional explanations of neurons. In NeurIPS, 2020. Muhammad & Bae (2022) Awais Muhammad and Sung-Ho Bae. A survey on efficient methods for adversarial robustness. IEEE Access, 10:118815–118830, 2022. Ng et al. (2011) Andrew Ng et al. Sparse autoencoder. CS294A Lecture notes, 72(2011):1–19, 2011. Nguyen et al. (2016) Anh Nguyen, Jason Yosinski, and Jeff Clune. Multifaceted feature visualization: Uncovering the different types of features learned by each neuron in deep neural networks. arXiv preprint arXiv:1602.03616, 2016. Olah et al. (2020) Chris Olah, Nick Cammarata, Ludwig Schubert, Gabriel Goh, Michael Petrov, and Shan Carter. Zoom in: An introduction to circuits. Distill, 5(3):e00024–001, 2020. Olshausen & Field (1997) Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. Oord et al. (2018) Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018. Qi et al. (2023) Xiangyu Qi, Yi Zeng, Tinghao Xie, Pin-Yu Chen, Ruoxi Jia, Prateek Mittal, and Peter Henderson. Fine-tuning aligned language models compromises safety, even when users do not intend to! arXiv preprint arXiv:2310.03693, 2023. Reed et al. 
(2014) Scott Reed, Honglak Lee, Dragomir Anguelov, Christian Szegedy, Dumitru Erhan, and Andrew Rabinovich. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014. Socher et al. (2013) Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. Recursive deep models for semantic compositionality over a sentiment treebank. In EMNLP, 2013. Song et al. (2020) Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. arXiv preprint arXiv: 2007.08199, 2020. Song et al. (2022) Hwanjun Song, Minseok Kim, Dongmin Park, Yooju Shin, and Jae-Gil Lee. Learning from noisy labels with deep neural networks: A survey. IEEE transactions on neural networks and learning systems, 34(11):8135–8153, 2022. Templeton (2024) Adly Templeton. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic, 2024. Tibshirani (1996) Robert Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society Series B: Statistical Methodology, 58(1):267–288, 1996. Touvron et al. (2023) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. Van Rooyen et al. (2015) Brendan Van Rooyen, Aditya Menon, and Robert C Williamson. Learning with symmetric label noise: The importance of being unhinged. In NeurIPS, 2015. VM et al. (2024) Kushala VM, Harikrishna Warrier, Yogesh Gupta, et al. Fine tuning llm for enterprise: Practical guidelines and recommendations. arXiv preprint arXiv:2404.10779, 2024. Wang et al. (2019a) Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019a. 
Wang et al. (2021) Xuezhi Wang, Haohan Wang, and Diyi Yang. Measure and improve robustness in nlp models: A survey. arXiv preprint arXiv:2112.08313, 2021. Wang et al. (2024) Yifei Wang, Qi Zhang, Yaoyu Guo, and Yisen Wang. Non-negative contrastive learning. arXiv preprint arXiv:2403.12459, 2024. Wang et al. (2019b) Yisen Wang, Xingjun Ma, Zaiyi Chen, Yuan Luo, Jinfeng Yi, and James Bailey. Symmetric cross entropy for robust learning with noisy labels. In ICCV, 2019b. Xu et al. (2021) Jiarong Xu, Junru Chen, Siqi You, Zhiqing Xiao, Yang Yang, and Jiangang Lu. Robustness of deep learning models on graphs: A survey. AI Open, 2:69–78, 2021. Yang et al. (2024) Jingkang Yang, Kaiyang Zhou, Yixuan Li, and Ziwei Liu. Generalized out-of-distribution detection: A survey. International Journal of Computer Vision, p. 1–28, 2024. Ying (2019) Xue Ying. An overview of overfitting and its solutions. In Journal of physics: Conference series, volume 1168, p. 022022. IOP Publishing, 2019. Zeng et al. (2024) Wenjun Zeng, Yuchi Liu, Ryan Mullins, Ludovic Peran, Joe Fernandez, Hamza Harkous, Karthik Narasimhan, Drew Proud, Piyush Kumar, Bhaktipriya Radharapu, et al. Shieldgemma: Generative ai content moderation based on gemma. arXiv preprint arXiv:2407.21772, 2024. Zhang et al. (2024a) Biao Zhang, Zhongtao Liu, Colin Cherry, and Orhan Firat. When scaling meets llm finetuning: The effect of data, model and finetuning method. arXiv preprint arXiv:2402.17193, 2024a. Zhang et al. (2024b) Qi Zhang, Yifei Wang, and Yisen Wang. Identifiable contrastive learning with automatic feature importance discovery. In NeurIPS, 2024b. Appendix A Experiment Details A.1 Experiment Details for Noisy Linear Probing During the pretraining process, we utilize ResNet-18 (He et al., 2016) as the backbone and train the models on CIFAR-100 and ImageNet-100. We pretrain the model for 200 epochs. The projector is a two-layer MLP with a hidden dimension 16384 and an output dimension 2048. 
We train the models with batch size 256 and weight decay 0.0001. When implementing NCL and SAE, we follow the default settings of SimCLR. For NCL, we adopt ReLU as the activation function σ. For SAE, the encoder and decoder are linear layers with 2048 input and output dimensions, and the number of activated features in the hidden layer is 256. During linear evaluation, we train a classifier on top of the frozen backbone pretrained by different methods for 50 epochs. For noisy label probing, we apply symmetric label noise when training the linear classifiers, i.e., the labels are uniformly flipped to the other classes with the given noise rate. For random input noise, we train the linear classifiers on clean datasets while applying different scales of uniform and Gaussian noise to the validation samples. For real-world out-of-domain distribution shifts, we use the ImageNet-Sketch and ImageNet-Stylized datasets (Geirhos et al., 2018; Wang et al., 2019a). As we pretrain the network on ImageNet-100, we select the samples of the corresponding 100 classes from these out-of-distribution datasets and evaluate the accuracy.

A.2 Experiment Details for Few-shot and Noisy Finetuning from Pretrained Features

During the pretraining process, we utilize ResNet-18 (He et al., 2016) as the backbone and train the models on ImageNet-100 for 200 epochs. The projector is a two-layer MLP with hidden dimension 16384 and output dimension 2048. We pretrain the models with batch size 256 and weight decay 0.0001. During finetuning, we train a classifier on top of the backbone for 100 epochs with standard and non-negative tuning, respectively, following the default finetuning settings. When implementing non-negative tuning, we select the ReLU function as the non-negative operator. For few-shot finetuning, we randomly draw 10%, 20%, 50%, and 100% of the training samples from the original ImageNet-100 training set.
For noisy label fine-tuning, we still apply symmetric label noise with different noise rates to the training samples.

A.3 Experiment Details for Monosemantic LLM Finetuning

Hyper-parameters. We finetune the Llama-2-7B-Chat model on SST2 for 20 epochs with batch size 16 and learning rate 1e-4. We use LoRA with rank $r = 8$, scaling factor $\alpha = 4$, and dropout rate 0.1 by default. For Dolly, we finetune for 1 epoch (by common practice) with batch size 4. The LoRA module is added to every query and value mapping module in the base model. For finetuning we use 5000 samples from the dataset; for inference, we use 1000 samples.

Prompt Template

Listing 1: SST-2 Prompt

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction: Analyze the sentiment of the input, and respond only 'positive' or 'negative'.
### Input: sentence
### Response:

Listing 2: ShieldGemma Guideline

"No Dangerous Content": The prompt shall not contain or seek generation of content that harming oneself and/or others (e.g., accessing or building firearms and explosive devices, promotion of terrorism, instructions for suicide).
"No Harassment": The prompt shall not contain or seek generation of content that is malicious, intimidating, bullying, or abusive content targeting another individual (e.g., physical threats, denial of tragic events, disparaging victims of violence).
"No Hate Speech": The prompt shall not contain or seek generation of content that targeting identity and/or protected attributes (e.g., racial slurs, promotion of discrimination, calls to violence against protected groups) or dehumanizing/belittling/vilifying on the bases of race, ethnicity, religion, disability, age, nationality, veteran status, sexual orientation, gender, gender identity, caste, or any other protected status. We do allow content that includes a reference to a protected individual or group in a positive manner.
"No Sexually Explicit Information": The prompt shall not contain or seek generation of content that contains references to sexual acts or other lewd content (e.g., sexually graphic descriptions, content aimed at causing arousal). Medical or scientific terms related to human anatomy or sex education are permitted.

A.4 Experiment Details for Figure 1

For Figure 1(a), we draw a random dimension each from the models trained by CL and NCL, and then plot the top-activated samples along the two dimensions. We utilize ResNet-18 (He et al., 2016) as the backbone and train the models on ImageNet-100 for 200 epochs. For Figure 1(b), we evaluate the performance in linear probing with noise. During linear evaluation, we train a classifier on top of the frozen backbone pretrained by different methods for 50 epochs. For noisy label probing, we apply 90% symmetric label noise when training the linear classifiers. For random input noise, we train the linear classifiers on clean datasets, while applying Gaussian noise with standard deviation 0.6 to the validation samples.

Appendix B Proofs

B.1 Proofs Related to Theorem 4.1

B.1.1 Monosemantic Representations

In the monosemantic case, we assume the learned representation only keeps the most important dimension, $\nu = x_1$.

Theorem B.1 (Conditional mean and variance of monosemantic representations).
The conditional means and variances of $\nu_{\mathrm{mono}}=x_1$ are
$$\mu_0(\nu_{\mathrm{mono}})=\frac{1}{3}\cdot\frac{(1-S)^2}{1+S^2} \quad\text{and}\quad \mu_1(\nu_{\mathrm{mono}})=\frac{1}{3}\cdot\frac{2+S}{1+S}, \quad (7)$$
$$\sigma_0^2(\nu_{\mathrm{mono}})=\frac{1}{6}\cdot\frac{(1-S)^2}{1+S^2}-\mu_0(\nu_{\mathrm{mono}})^2 \quad\text{and}\quad \sigma_1^2(\nu_{\mathrm{mono}})=\frac{1}{6}\cdot\frac{3+S}{1+S}-\mu_1(\nu_{\mathrm{mono}})^2. \quad (8)$$

Proof of Theorem B.1. We first calculate the conditional cumulative distribution functions:
$$\begin{aligned}
P(x_1\le x\mid y=0) &= \frac{P(x_1\le x,\,x_1\le x_2)}{P(x_1\le x_2)}\\
&= \frac{P(x_1\le x_2,\,x_1\le x,\,x_2\le x)+P(x_1\le x_2,\,x_1\le x,\,x_2>x)}{P(x_1\le x_2)}\\
&= \frac{P(x_1=0,\,x_2\le x)+P(x_1\le x_2,\,0<x_1\le x,\,0<x_2\le x)+P(x_1\le x)\,P(x_2>x)}{P(x_1\le x_2)}\\
&= \frac{S[S+(1-S)x]+\frac{1}{2}(1-S)^2x^2+[S+(1-S)x](1-S)(1-x)}{\frac{1}{2}(1+S^2)}\\
&= 1-\frac{(1-S)^2(1-x)^2}{1+S^2},
\end{aligned} \quad (9)$$
and
$$\begin{aligned}
P(x_1\le x\mid y=1) &= \frac{P(x_1\le x,\,x_1>x_2)}{P(x_1>x_2)}\\
&= \frac{P(x_1\le x)-P(x_1\le x,\,x_1\le x_2)}{P(x_1>x_2)}\\
&= \frac{P(x_1\le x)}{P(x_1>x_2)}-P(x_1\le x\mid y=0)\cdot\frac{P(x_1\le x_2)}{P(x_1>x_2)}\\
&= \frac{S+(1-S)x}{\frac{1}{2}(1-S^2)}-\left[1-\frac{(1-S)^2(1-x)^2}{1+S^2}\right]\cdot\frac{1+S^2}{1-S^2}\\
&= \frac{(1-S)^2x^2+2S(1-S)x}{1-S^2}.
\end{aligned} \quad (10)$$

Then the conditional means of $\nu_{\mathrm{mono}}=x_1$ are
$$\begin{aligned}
\mu_0(\nu_{\mathrm{mono}}) &= \int x\,dP(x_1\le x\mid y=0)
= \int_{x\in(0,1]} x\cdot\frac{2(1-S)^2}{1+S^2}(1-x)\,dx\\
&= \frac{2(1-S)^2}{1+S^2}\left(\frac{1}{2}-\frac{1}{3}\right)
= \frac{1}{3}\cdot\frac{(1-S)^2}{1+S^2}
\end{aligned} \quad (11)$$
and
$$\begin{aligned}
\mu_1(\nu_{\mathrm{mono}}) &= \int x\,dP(x_1\le x\mid y=1)
= \int_{x\in(0,1]} x\cdot 2(1-S)\cdot\frac{(1-S)x+S}{1-S^2}\,dx\\
&= \frac{2(1-S)}{1-S^2}\left[\frac{1}{3}(1-S)+\frac{1}{2}S\right]
= \frac{1}{3}\cdot\frac{2+S}{1+S}.
\end{aligned} \quad (12)$$

Then we have
$$\mu_1(\nu_{\mathrm{mono}})-\mu_0(\nu_{\mathrm{mono}})=\frac{1}{3}\cdot\frac{2+S}{1+S}-\frac{1}{3}\cdot\frac{(1-S)^2}{1+S^2}=\frac{1}{3}\cdot\frac{1+2S+3S^2}{(1+S)(1+S^2)}. \quad (13)$$

Similarly, we have the conditional variances as follows:
$$\begin{aligned}
\sigma_0^2(\nu_{\mathrm{mono}}) &= \int x^2\,dP(x_1\le x\mid y=0)-\mu_0(\nu_{\mathrm{mono}})^2\\
&= \int_{x\in(0,1]} x^2\cdot\frac{2(1-S)^2}{1+S^2}(1-x)\,dx-\mu_0(\nu_{\mathrm{mono}})^2\\
&= \frac{2(1-S)^2}{1+S^2}\left[\frac{1}{3}-\frac{1}{4}\right]-\mu_0(\nu_{\mathrm{mono}})^2
= \frac{1}{6}\cdot\frac{(1-S)^2}{1+S^2}-\mu_0(\nu_{\mathrm{mono}})^2
\end{aligned} \quad (14)$$
and
$$\begin{aligned}
\sigma_1^2(\nu_{\mathrm{mono}}) &= \int x^2\,dP(x_1\le x\mid y=1)-\mu_1(\nu_{\mathrm{mono}})^2\\
&= \int_{x\in(0,1]} x^2\cdot 2(1-S)\cdot\frac{(1-S)x+S}{1-S^2}\,dx-\mu_1(\nu_{\mathrm{mono}})^2\\
&= \frac{2(1-S)}{1-S^2}\left[\frac{1}{4}(1-S)+\frac{1}{3}S\right]-\mu_1(\nu_{\mathrm{mono}})^2
= \frac{1}{6}\cdot\frac{3+S}{1+S}-\mu_1(\nu_{\mathrm{mono}})^2.
\end{aligned} \quad (15)$$
∎

B.1.2 Polysemantic Representations

To study the polysemantic case, we first have to derive the probability distribution of $\nu_{\mathrm{poly}}=x_1-x_2$ and the corresponding conditional probability density functions on $y=0$ and $y=1$, separately. We first calculate the cumulative distribution function as follows.

Lemma B.2 (Distribution of $\nu_{\mathrm{poly}}=x_1-x_2$).
$$P(x_1-x_2\le x)=\begin{cases} -\frac{1}{2}[1-(1-S)x]^2+1+\frac{1}{2}S^2, & x\in[0,1],\\[2pt] \frac{1}{2}[(1-S)x+1]^2-\frac{1}{2}S^2, & x\in[-1,0). \end{cases}$$
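Before turning to the proof, this piecewise CDF can be sanity-checked numerically. The sketch below (illustrative only, not part of the formal argument) samples the toy model's coordinates, where each $x_i$ is zero with probability $S$ and otherwise uniform on $(0,1]$, and compares the empirical CDF of $x_1-x_2$ against the closed form:

```python
import bisect
import random

S = 0.2          # sparsity of the toy model
N = 200_000      # number of Monte Carlo samples
rng = random.Random(0)

def sample_x():
    # x_i = 0 with probability S, else Uniform(0, 1]
    return 0.0 if rng.random() < S else 1.0 - rng.random()

# sorted samples of nu_poly = x1 - x2, for fast empirical-CDF queries
diffs = sorted(sample_x() - sample_x() for _ in range(N))

def cdf_closed(x):
    # piecewise closed form from Lemma B.2
    if x >= 0:
        return -0.5 * (1 - (1 - S) * x) ** 2 + 1 + 0.5 * S ** 2
    return 0.5 * ((1 - S) * x + 1) ** 2 - 0.5 * S ** 2

def cdf_hat(x):
    # empirical fraction of samples with x1 - x2 <= x
    return bisect.bisect_right(diffs, x) / N

max_err = max(abs(cdf_hat(x) - cdf_closed(x))
              for x in (-0.8, -0.5, -0.1, 0.0, 0.1, 0.5, 0.8))
```

Note the jump of size $S^2$ at $x=0$ (the atom from $x_1=x_2=0$), which is why the two branches do not meet continuously.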
Proof of Lemma B.2. For $x\in[0,1]$, we have
$$\begin{aligned}
P(x_1-x_2\le x) &= \lim_{N\to\infty}\sum_{n=-N}^{N} P(x_1\le x+n/N)\,P(x_2=n/N)\\
&= \lim_{N\to\infty}\sum_{n=0}^{\lfloor(1-x)N\rfloor} P(x_1\le x+n/N)\,P(x_2=n/N)+\sum_{n=\lfloor(1-x)N\rfloor+1}^{N} 1\cdot P(x_2=n/N)\\
&= [S+(1-S)x]\cdot S+\lim_{N\to\infty}\sum_{n=1}^{\lfloor(1-x)N\rfloor}[S+(1-S)(x+n/N)]\cdot(1-S)/N+\sum_{n=\lfloor(1-x)N\rfloor+1}^{N}(1-S)/N\\
&= S[S+(1-S)x]+\lim_{N\to\infty}[S(1-S)+(1-S)^2x]\lfloor(1-x)N\rfloor/N\\
&\qquad+(1-S)^2\lfloor(1-x)N\rfloor(\lfloor(1-x)N\rfloor+1)/(2N^2)+(1-S)(N-\lfloor(1-x)N\rfloor-1)/N\\
&= S[S+(1-S)x]+[S(1-S)+(1-S)^2x](1-x)+(1-S)^2(1-x)^2/2+(1-S)x\\
&= -\frac{1}{2}[1-(1-S)x]^2+1+\frac{1}{2}S^2.
\end{aligned} \quad (16)$$

For $x\in[-1,0)$, we have
$$\begin{aligned}
P(x_1-x_2\le x) &= \lim_{N\to\infty}\sum_{n=-N}^{N} P(x_1\le x+n/N)\,P(x_2=n/N)\\
&= \lim_{N\to\infty}\sum_{n=-\lfloor xN\rfloor}^{N} P(x_1\le x+n/N)\,P(x_2=n/N)\\
&= \lim_{N\to\infty}\sum_{n=-\lfloor xN\rfloor}^{N}[S+(1-S)(x+n/N)]\cdot(1-S)/N\\
&= \lim_{N\to\infty}[S(1-S)+(1-S)^2x](N+\lfloor xN\rfloor)/N\\
&\qquad+(1-S)^2(N-\lfloor xN\rfloor)(N+\lfloor xN\rfloor+1)/(2N^2)\\
&= [S(1-S)+(1-S)^2x](1+x)+(1-S)^2(1-x^2)/2\\
&= \frac{1}{2}[(1-S)x+1]^2-\frac{1}{2}S^2.
\end{aligned} \quad (17)$$
∎

Theorem B.3 (Conditional mean and variance of polysemantic representations). The conditional means and variances of $\nu_{\mathrm{poly}}=x_1-x_2$ are
$$\mu_0(\nu_{\mathrm{poly}})=-\frac{1}{3}\cdot\frac{(1-S)(1+2S)}{1+S^2} \quad\text{and}\quad \mu_1(\nu_{\mathrm{poly}})=\frac{1}{3}\cdot\frac{1+2S}{1+S}, \quad (18)$$
$$\sigma_0^2(\nu_{\mathrm{poly}})=\frac{1}{6}\cdot\frac{(1-S)(1+3S)}{1+S^2}-\mu_0(\nu_{\mathrm{poly}})^2 \quad\text{and}\quad \sigma_1^2(\nu_{\mathrm{poly}})=\frac{1}{6}\cdot\frac{1+3S}{1+S}-\mu_1(\nu_{\mathrm{poly}})^2. \quad (19)$$

Proof of Theorem B.3.
By Lemma B.2, we have
$$\begin{aligned}
P_{\mathrm{poly}}(x_1-x_2\le x\mid y=0) &= P(x_1-x_2\le x\mid x_1\le x_2)\\
&= P(x_1-x_2\le\min(0,x))/P(x_1-x_2\le 0)\\
&= \begin{cases} \left[\frac{1}{2}[(1-S)x+1]^2-\frac{1}{2}S^2\right]/\left[\frac{1}{2}(1+S^2)\right], & x\in[-1,0)\\ 1, & x\in[0,1] \end{cases}\\
&= \begin{cases} \left[[(1-S)x+1]^2-S^2\right]/(1+S^2), & x\in[-1,0)\\ 1, & x\in[0,1] \end{cases}
\end{aligned}$$
and
$$\begin{aligned}
P_{\mathrm{poly}}(x_1-x_2\le x\mid y=1) &= P(x_1-x_2\le x\mid x_1>x_2)\\
&= P(0<x_1-x_2\le x)/P(x_1-x_2>0)\\
&= \begin{cases} 0, & x\in[-1,0]\\ [P(x_1-x_2\le x)-P(x_1-x_2\le 0)]/[1-P(x_1-x_2\le 0)], & x\in(0,1] \end{cases}\\
&= \begin{cases} 0, & x\in[-1,0]\\ \left[-\frac{1}{2}[1-(1-S)x]^2+1+\frac{1}{2}S^2-\frac{1}{2}(1+S^2)\right]/\left[1-\frac{1}{2}(1+S^2)\right], & x\in(0,1] \end{cases}\\
&= \begin{cases} 0, & x\in[-1,0]\\ \left[1-[1-(1-S)x]^2\right]/(1-S^2), & x\in(0,1]. \end{cases}
\end{aligned}$$

Then we have
$$\begin{aligned}
\mu_0(\nu_{\mathrm{poly}}) &= \int_{x\in[-1,0)} x\cdot 2(1-S)[(1-S)x+1]/(1+S^2)\,dx\\
&= \frac{2(1-S)}{1+S^2}\left[\frac{1}{3}(1-S)-\frac{1}{2}\right]
= -\frac{1}{3}\cdot\frac{(1-S)(1+2S)}{1+S^2},
\end{aligned} \quad (20)$$
$$\begin{aligned}
\mu_1(\nu_{\mathrm{poly}}) &= \int_{x\in(0,1]} x\cdot 2(1-S)[1-(1-S)x]/(1-S^2)\,dx\\
&= \frac{2}{1+S}\left[\frac{1}{2}-\frac{1}{3}(1-S)\right]
= \frac{1}{3}\cdot\frac{1+2S}{1+S},
\end{aligned} \quad (21)$$
$$\mu_1(\nu_{\mathrm{poly}})-\mu_0(\nu_{\mathrm{poly}})=\frac{2}{3}\cdot\frac{1+2S}{(1+S)(1+S^2)}, \quad (22)$$
$$\begin{aligned}
\sigma_0^2(\nu_{\mathrm{poly}}) &= \int_{x\in[-1,0)} x^2\cdot 2(1-S)[(1-S)x+1]/(1+S^2)\,dx-\mu_0(\nu_{\mathrm{poly}})^2\\
&= \frac{2(1-S)}{1+S^2}\left[-\frac{1}{4}(1-S)+\frac{1}{3}\right]-\mu_0(\nu_{\mathrm{poly}})^2
= \frac{1}{6}\cdot\frac{(1-S)(1+3S)}{1+S^2}-\mu_0(\nu_{\mathrm{poly}})^2,
\end{aligned} \quad (23)$$
and
$$\begin{aligned}
\sigma_1^2(\nu_{\mathrm{poly}}) &= \int_{x\in(0,1]} x^2\cdot 2(1-S)[1-(1-S)x]/(1-S^2)\,dx-\mu_1(\nu_{\mathrm{poly}})^2\\
&= \frac{2}{1+S}\left[\frac{1}{3}-\frac{1}{4}(1-S)\right]-\mu_1(\nu_{\mathrm{poly}})^2
= \frac{1}{6}\cdot\frac{1+3S}{1+S}-\mu_1(\nu_{\mathrm{poly}})^2.
\end{aligned} \quad (24)$$
∎

B.1.3 Proof of Theorem 4.1

Proof of Theorem 4.1. Following the toy model described in Section 4.2, we let $S=0.2$. Then by Theorem B.1, we have $\mu_0(\nu_{\mathrm{mono}})=0.205$, $\mu_1(\nu_{\mathrm{mono}})=0.611$, $\Delta\mu(\nu_{\mathrm{mono}})=0.406$, $\sigma_0(\nu_{\mathrm{mono}})=0.246$, $\sigma_1(\nu_{\mathrm{mono}})=0.266$, and $J(\nu_{\mathrm{mono}})=6.196$.
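These constants can be reproduced directly from the closed forms of Theorems B.1 and B.3. The sketch below is illustrative; in particular it assumes the separation index is computed as $J(\nu)=\Delta\mu(\nu)/(\sigma_0(\nu)\,\sigma_1(\nu))$, an interpretation consistent with the quoted values:

```python
S = 0.2

def mono_stats(S):
    # Theorem B.1 closed forms for nu_mono = x1
    mu0 = (1 - S) ** 2 / (3 * (1 + S ** 2))
    mu1 = (2 + S) / (3 * (1 + S))
    var0 = (1 - S) ** 2 / (6 * (1 + S ** 2)) - mu0 ** 2
    var1 = (3 + S) / (6 * (1 + S)) - mu1 ** 2
    return mu0, mu1, var0, var1

def poly_stats(S):
    # Theorem B.3 closed forms for nu_poly = x1 - x2
    mu0 = -(1 - S) * (1 + 2 * S) / (3 * (1 + S ** 2))
    mu1 = (1 + 2 * S) / (3 * (1 + S))
    var0 = (1 - S) * (1 + 3 * S) / (6 * (1 + S ** 2)) - mu0 ** 2
    var1 = (1 + 3 * S) / (6 * (1 + S)) - mu1 ** 2
    return mu0, mu1, var0, var1

def J(mu0, mu1, var0, var1):
    # assumed separation index: inter-class gap over the product of std devs
    return (mu1 - mu0) / (var0 ** 0.5 * var1 ** 0.5)

J_mono = J(*mono_stats(S))   # ~ 6.196
J_poly = J(*poly_stats(S))   # ~ 10.164
```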
By Theorem B.3, we have $\mu_0(\nu_{\mathrm{poly}})=-0.359$, $\mu_1(\nu_{\mathrm{poly}})=0.389$, $\Delta\mu(\nu_{\mathrm{poly}})=0.748$, $\sigma_0(\nu_{\mathrm{poly}})=0.276$, $\sigma_1(\nu_{\mathrm{poly}})=0.266$, and $J(\nu_{\mathrm{poly}})=10.164$. By comparing the above results, we complete the proof. ∎

B.2 Proofs Related to Label Noise

Following Ghosh et al. (2017); Ma et al. (2020); Wang et al. (2019b), we assume the noisy label $\tilde{y}$ is randomly flipped from the true label to other classes. Under noise rate $\eta\in[0,\frac{K-1}{K})$, the noisy label distribution is
$$P(\tilde{y}=k\mid x)=\sum_{j=0,1} P(\tilde{y}=k\mid y=j)\,P(y=j\mid x), \quad (25)$$
where $P(\tilde{y}=k\mid y=j)=1-\eta$ if $j=k$, and otherwise $P(\tilde{y}=k\mid y=j)=\eta$. In our binary case ($K=2$), this corresponds to $\eta\in[0,\frac{1}{2})$.

B.2.1 Influence of Label Noise on Conditional Mean and Variance

Lemma B.4 (Conditional Distributions).
For noise rate $\eta\in[0,1/2)$ and sparsity $S\in[0,1]$, we have the conditional distributions
$$P(\nu\mid\tilde{y}=0)=\frac{(1-\eta)(1+S^2)\,P(\nu\mid y=0)+\eta(1-S^2)\,P(\nu\mid y=1)}{(1-\eta)(1+S^2)+\eta(1-S^2)} \quad (26)$$
and
$$P(\nu\mid\tilde{y}=1)=\frac{\eta(1+S^2)\,P(\nu\mid y=0)+(1-\eta)(1-S^2)\,P(\nu\mid y=1)}{\eta(1+S^2)+(1-\eta)(1-S^2)}. \quad (27)$$

Proof of Lemma B.4. We first calculate the class conditional distributions.
$$\begin{aligned}
P(\nu\mid\tilde{y}=0) &= P(\tilde{y}=0\mid\nu)\,P(\nu)/P(\tilde{y}=0)\\
&= \frac{\sum_{j=0,1} P(\tilde{y}=0\mid y=j)\,P(y=j\mid\nu)\,P(\nu)}{\sum_{j=0,1} P(\tilde{y}=0\mid y=j)\,P(y=j)}\\
&= \frac{\sum_{j=0,1} P(\tilde{y}=0\mid y=j)\,P(\nu\mid y=j)\,P(y=j)}{\sum_{j=0,1} P(\tilde{y}=0\mid y=j)\,P(y=j)}\\
&= \frac{(1-\eta)\,P(\nu\mid y=0)\,P(y=0)+\eta\,P(\nu\mid y=1)\,P(y=1)}{(1-\eta)\,P(y=0)+\eta\,P(y=1)},
\end{aligned} \quad (28)$$
$$\begin{aligned}
P(\nu\mid\tilde{y}=1) &= P(\tilde{y}=1\mid\nu)\,P(\nu)/P(\tilde{y}=1)\\
&= \frac{\sum_{j=0,1} P(\tilde{y}=1\mid y=j)\,P(y=j\mid\nu)\,P(\nu)}{\sum_{j=0,1} P(\tilde{y}=1\mid y=j)\,P(y=j)}\\
&= \frac{\sum_{j=0,1} P(\tilde{y}=1\mid y=j)\,P(\nu\mid y=j)\,P(y=j)}{\sum_{j=0,1} P(\tilde{y}=1\mid y=j)\,P(y=j)}\\
&= \frac{\eta\,P(\nu\mid y=0)\,P(y=0)+(1-\eta)\,P(\nu\mid y=1)\,P(y=1)}{\eta\,P(y=0)+(1-\eta)\,P(y=1)}.
\end{aligned} \quad (29)$$

Recall that $x_1,x_2=0$ with probability $S$, and $x_1,x_2\sim U(0,1]$ with probability $1-S$. Because $x_1$ and $x_2$ are independently and identically distributed and $P(x_1=x_2\mid x_1,x_2>0)=0$, we have $P(x_1\le x_2\mid x_1,x_2>0)=P(x_2\le x_1\mid x_1,x_2>0)=1/2$, and therefore
$$\begin{aligned}
P(y=0) &= P(x_1\le x_2)\\
&= P(x_1=0)+P(x_1>0)\,P(x_2>0)\,P(x_1\le x_2\mid x_1,x_2>0)\\
&= S+\frac{1}{2}(1-S)^2=\frac{1}{2}(1+S^2).
\end{aligned} \quad (30)$$

Then $P(y=1)=1-P(y=0)=\frac{1}{2}(1-S^2)$, and correspondingly we have
$$P(\nu\mid\tilde{y}=0)=\frac{(1-\eta)(1+S^2)\,P(\nu\mid y=0)+\eta(1-S^2)\,P(\nu\mid y=1)}{(1-\eta)(1+S^2)+\eta(1-S^2)} \quad (31)$$
and
$$P(\nu\mid\tilde{y}=1)=\frac{\eta(1+S^2)\,P(\nu\mid y=0)+(1-\eta)(1-S^2)\,P(\nu\mid y=1)}{\eta(1+S^2)+(1-\eta)(1-S^2)}. \quad (32)$$
∎

Theorem B.5 (Influence of label noise on inter-class distance). For noise rate $\eta\in[0,\frac{1}{2})$,
$$\Delta\tilde{\mu}(\nu)=\frac{(1-2\eta)(1+S^2)(1-S^2)}{[1+(1-2\eta)S^2][1-(1-2\eta)S^2]}\,\Delta\mu(\nu). \quad (33)$$

Proof of Theorem B.5. By Lemma B.4, the conditional means of $\nu$ have the following forms:
$$\begin{aligned}
\tilde{\mu}_0(\nu) &:= \mathbb{E}(\nu\mid\tilde{y}=0)=\int_\nu \nu\,dP(\nu\mid\tilde{y}=0)\\
&= \int_\nu \nu\,\frac{(1-\eta)(1+S^2)}{(1-\eta)(1+S^2)+\eta(1-S^2)}\,dP(\nu\mid y=0)
+\int_\nu \nu\,\frac{\eta(1-S^2)}{(1-\eta)(1+S^2)+\eta(1-S^2)}\,dP(\nu\mid y=1),
\end{aligned} \quad (34)$$
$$\begin{aligned}
\tilde{\mu}_1(\nu) &:= \mathbb{E}(\nu\mid\tilde{y}=1)=\int_\nu \nu\,dP(\nu\mid\tilde{y}=1)\\
&= \int_\nu \nu\,\frac{\eta(1+S^2)}{\eta(1+S^2)+(1-\eta)(1-S^2)}\,dP(\nu\mid y=0)
+\int_\nu \nu\,\frac{(1-\eta)(1-S^2)}{\eta(1+S^2)+(1-\eta)(1-S^2)}\,dP(\nu\mid y=1).
\end{aligned} \quad (35)$$

Then we have
$$\begin{aligned}
\tilde{\mu}_1(\nu)-\tilde{\mu}_0(\nu) &= \int_\nu \nu\left[\frac{\eta(1+S^2)}{\eta(1+S^2)+(1-\eta)(1-S^2)}-\frac{(1-\eta)(1+S^2)}{(1-\eta)(1+S^2)+\eta(1-S^2)}\right]dP(\nu\mid y=0)\\
&\quad+\int_\nu \nu\left[\frac{(1-\eta)(1-S^2)}{\eta(1+S^2)+(1-\eta)(1-S^2)}-\frac{\eta(1-S^2)}{(1-\eta)(1+S^2)+\eta(1-S^2)}\right]dP(\nu\mid y=1)\\
&= \frac{(1-2\eta)(1+S^2)(1-S^2)}{[1+(1-2\eta)S^2][1-(1-2\eta)S^2]}\left[\int_\nu \nu\,dP(\nu\mid y=1)-\int_\nu \nu\,dP(\nu\mid y=0)\right]\\
&= \frac{(1-2\eta)(1+S^2)(1-S^2)}{[1+(1-2\eta)S^2][1-(1-2\eta)S^2]}\,[\mu_1(\nu)-\mu_0(\nu)].
\end{aligned} \quad (36)$$
∎

Theorem B.6 (Influence of label noise on intra-class variance). For $i=0,1$ and noise rate $\eta\in[0,\frac{1}{2})$,
$$\tilde{\sigma}_i^2(\nu)=c_{i,0}\,\sigma_0^2(\nu)+c_{i,1}\,\sigma_1^2(\nu)+c_{i,0}\,\mu_0(\nu)^2+c_{i,1}\,\mu_1(\nu)^2-\left[c_{i,0}\,\mu_0(\nu)+c_{i,1}\,\mu_1(\nu)\right]^2,$$
where $c_{0,0}:=\frac{(1-\eta)(1+S^2)}{1+(1-2\eta)S^2}$, $c_{0,1}:=\frac{\eta(1-S^2)}{1+(1-2\eta)S^2}$, $c_{1,0}:=\frac{\eta(1+S^2)}{1-(1-2\eta)S^2}$, and $c_{1,1}:=\frac{(1-\eta)(1-S^2)}{1-(1-2\eta)S^2}$ are the mixture weights given by Lemma B.4 (so $c_{i,0}+c_{i,1}=1$).

Proof of Theorem B.6. By Lemma B.4, the conditional variances of $\nu$ have the following forms:
  σ̃₀²(ν) := E(ν² | ỹ = 0) − μ̃₀(ν)² = ∫ ν² dP(ν | ỹ = 0) − μ̃₀(ν)²
    = ∫ ν² (1−η)(1+S²) / [ (1−η)(1+S²) + η(1−S²) ] dP(ν|y=0)
      + ∫ ν² η(1−S²) / [ (1−η)(1+S²) + η(1−S²) ] dP(ν|y=1) − μ̃₀(ν)²
    = [ (1−η)(1+S²) / (1+(1−2η)S²) ] [ σ₀²(ν) + μ₀(ν)² ] + [ η(1−S²) / (1+(1−2η)S²) ] [ σ₁²(ν) + μ₁(ν)² ]
      − [ (1−η)(1+S²) / (1+(1−2η)S²) · μ₀(ν) + η(1−S²) / (1+(1−2η)S²) · μ₁(ν) ]²
    := c₀ σ₀²(ν) + c₁ σ₁²(ν) + c₀ μ₀(ν)² + c₁ μ₁(ν)² − [ c₀ μ₀(ν) + c₁ μ₁(ν) ]²,   (37)

where c₀ := (1−η)(1+S²)/[1+(1−2η)S²] and c₁ := η(1−S²)/[1+(1−2η)S²].

  σ̃₁²(ν) := E(ν² | ỹ = 1) − μ̃₁(ν)² = ∫ ν² dP(ν | ỹ = 1) − μ̃₁(ν)²
    = ∫ ν² η(1+S²) / [ η(1+S²) + (1−η)(1−S²) ] dP(ν|y=0)
      + ∫ ν² (1−η)(1−S²) / [ η(1+S²) + (1−η)(1−S²) ] dP(ν|y=1) − μ̃₁(ν)²
    = [ η(1+S²) / (1−(1−2η)S²) ] [ σ₀²(ν) + μ₀(ν)² ] + [ (1−η)(1−S²) / (1−(1−2η)S²) ] [ σ₁²(ν) + μ₁(ν)² ]
      − [ η(1+S²) / (1−(1−2η)S²) · μ₀(ν) + (1−η)(1−S²) / (1−(1−2η)S²) · μ₁(ν) ]²
    := c₀′ σ₀²(ν) + c₁′ σ₁²(ν) + c₀′ μ₀(ν)² + c₁′ μ₁(ν)² − [ c₀′ μ₀(ν) + c₁′ μ₁(ν) ]²,   (38)

where c₀′ := η(1+S²)/[1−(1−2η)S²] and c₁′ := (1−η)(1−S²)/[1−(1−2η)S²].

∎
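The closed forms of Lemma B.4 and Theorems B.5–B.6 can be sanity-checked numerically against a direct computation on a discrete example. The sketch below is illustrative only: the class priors P(y=0) = (1+S²)/2, P(y=1) = (1−S²)/2 follow the setting above, while the particular values of S², η, and the grid conditionals are invented for the check.

```python
import numpy as np

# Illustrative values; S2 is the S^2 of the main text, eta the flip rate.
S2, eta = 0.04, 0.3
py0, py1 = 0.5 * (1 + S2), 0.5 * (1 - S2)   # class priors P(y=0), P(y=1)

# Arbitrary discrete clean conditionals P(nu | y) on a shared support.
nu = np.linspace(-2.0, 2.0, 401)
p0 = np.exp(-0.5 * ((nu - 0.2) / 0.25) ** 2); p0 /= p0.sum()
p1 = np.exp(-0.5 * ((nu - 0.6) / 0.27) ** 2); p1 /= p1.sum()

# Direct route: flip labels with rate eta, then condition on the noisy label.
joint0 = (1 - eta) * py0 * p0 + eta * py1 * p1       # P(nu, y~ = 0)
joint1 = eta * py0 * p0 + (1 - eta) * py1 * p1       # P(nu, y~ = 1)
q0, q1 = joint0 / joint0.sum(), joint1 / joint1.sum()
mu_t_direct = np.array([q0 @ nu, q1 @ nu])
var_t_direct = np.array([q0 @ nu**2, q1 @ nu**2]) - mu_t_direct**2

# Closed form: mixture weights of Lemma B.4 (each row sums to 1).
D0, D1 = 1 + (1 - 2 * eta) * S2, 1 - (1 - 2 * eta) * S2
c = np.array([[(1 - eta) * (1 + S2) / D0, eta * (1 - S2) / D0],
              [eta * (1 + S2) / D1, (1 - eta) * (1 - S2) / D1]])
mu = np.array([p0 @ nu, p1 @ nu])
m2 = np.array([p0 @ nu**2, p1 @ nu**2])
mu_t = c @ mu                       # noisy conditional means
var_t = c @ m2 - mu_t**2            # Theorem B.6's noisy conditional variances

# Theorem B.5: the inter-class distance shrinks by a fixed factor.
shrink = (1 - 2 * eta) * (1 + S2) * (1 - S2) / (D0 * D1)
assert np.allclose(mu_t, mu_t_direct) and np.allclose(var_t, var_t_direct)
assert np.isclose(mu_t[1] - mu_t[0], shrink * (mu[1] - mu[0]))
```

The two routes agree because the noisy conditionals are exactly the c-weighted mixtures of the clean conditionals.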
B.2.2 Linear Separability of Monosemantic & Polysemantic Representations under Label Noise

Proof of Theorem 4.2. By definition, we have

  [ J̃(ν_mono)/J(ν_mono) ] / [ J̃(ν_poly)/J(ν_poly) ]
    = { [ Δμ̃(ν_mono) / (σ̃₀(ν_mono) σ̃₁(ν_mono)) ] / [ Δμ(ν_mono) / (σ₀(ν_mono) σ₁(ν_mono)) ] }
      / { [ Δμ̃(ν_poly) / (σ̃₀(ν_poly) σ̃₁(ν_poly)) ] / [ Δμ(ν_poly) / (σ₀(ν_poly) σ₁(ν_poly)) ] }.   (39)

By Theorem B.5 we have Δμ̃(ν_mono)/Δμ(ν_mono) = Δμ̃(ν_poly)/Δμ(ν_poly) and σ₁(ν_mono) = σ₁(ν_poly), and thus

  [ J̃(ν_mono)/J(ν_mono) ] / [ J̃(ν_poly)/J(ν_poly) ]
    = [ σ̃₀(ν_poly)/σ̃₀(ν_mono) ] · [ σ̃₁(ν_poly)/σ̃₁(ν_mono) ] · [ σ₀(ν_mono)/σ₀(ν_poly) ].   (40)

By Theorems B.1 and B.6, we have

  σ̃₀²(ν_mono) = [ 1.04(1−η) / (1.04−0.08η) ] (0.246² + 0.205²) + [ 0.96η / (1.04−0.08η) ] (0.266² + 0.611²)
    − [ (1.04(1−η) / (1.04−0.08η)) · 0.205 + (0.96η / (1.04−0.08η)) · 0.611 ]²,   (41)

  σ̃₁²(ν_mono) = [ 1.04η / (0.96+0.08η) ] (0.246² + 0.205²) + [ 0.96(1−η) / (0.96+0.08η) ] (0.266² + 0.611²)
    − [ (1.04η / (0.96+0.08η)) · 0.205 + (0.96(1−η) / (0.96+0.08η)) · 0.611 ]²,   (42)

  σ̃₀²(ν_poly) = [ 1.04(1−η) / (1.04−0.08η) ] (0.276² + (−0.359)²) + [ 0.96η / (1.04−0.08η) ] (0.266² + 0.389²)
    − [ (1.04(1−η) / (1.04−0.08η)) · (−0.359) + (0.96η / (1.04−0.08η)) · 0.389 ]²,   (43)

  σ̃₁²(ν_poly) = [ 1.04η / (0.96+0.08η) ] (0.276² + (−0.359)²) + [ 0.96(1−η) / (0.96+0.08η) ] (0.266² + 0.389²)
    − [ (1.04η / (0.96+0.08η)) · (−0.359) + (0.96(1−η) / (0.96+0.08η)) · 0.389 ]².   (44)

Then plugging Eqs. (41), (42), (43), and (44) into Eq. (40), we have [ J̃(ν_mono)/J(ν_mono) ] / [ J̃(ν_poly)/J(ν_poly) ] ≥ 1. Further, by Theorems B.1 and B.5, we have

  Δμ̃(ν_mono) = [ 1.04 × 0.96 (1−2η) / ( (1.04−0.08η)(0.96+0.08η) ) ] × 0.406,   (45)

  Δμ̃(ν_poly) = [ 1.04 × 0.96 (1−2η) / ( (1.04−0.08η)(0.96+0.08η) ) ] × 0.748.   (46)

Plugging them into the definition of J̃(ν_mono) and J̃(ν_poly), we have J̃(ν_mono) < J̃(ν_poly) when η < 0.25 and J̃(ν_mono) > J̃(ν_poly) when η > 0.25. ∎

B.3 Proofs Related to Input Noise

Following the settings in Section 4.2, we investigate the influence of Gaussian noise εᵢ ∼ N(0, 1), i.i.d. for i = 1, 2, on the input data x = (x₁, x₂), where εᵢ ⟂ x.
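Before turning to the input-noise proofs, the η = 0.25 threshold in the proof of Theorem 4.2 can be checked numerically. The sketch below is illustrative, not part of the proof; it assumes the Theorem B.1 moment values quoted in Eqs. (41)–(46).

```python
import numpy as np

S2 = 0.04
# (mu0, mu1, sigma0, sigma1) per representation: the Theorem B.1 values of Eqs. (41)-(46).
stats = {"mono": (0.205, 0.611, 0.246, 0.266),
         "poly": (-0.359, 0.389, 0.276, 0.266)}

def J_noisy(rep, eta):
    """Separability J~ = dmu~ / (sigma0~ sigma1~) under label-noise rate eta."""
    mu0, mu1, s0, s1 = stats[rep]
    D0, D1 = 1 + (1 - 2 * eta) * S2, 1 - (1 - 2 * eta) * S2
    c = np.array([[(1 - eta) * (1 + S2) / D0, eta * (1 - S2) / D0],
                  [eta * (1 + S2) / D1, (1 - eta) * (1 - S2) / D1]])
    mu = np.array([mu0, mu1])
    m2 = np.array([s0**2 + mu0**2, s1**2 + mu1**2])
    mu_t = c @ mu
    sig_t = np.sqrt(c @ m2 - mu_t**2)                                       # Theorem B.6
    dmu_t = (1 - 2 * eta) * (1 + S2) * (1 - S2) / (D0 * D1) * (mu1 - mu0)   # Theorem B.5
    return dmu_t / (sig_t[0] * sig_t[1])

assert J_noisy("poly", 0.1) > J_noisy("mono", 0.1)   # mild noise: polysemantic wins
assert J_noisy("mono", 0.4) > J_noisy("poly", 0.4)   # heavy noise: monosemantic wins
```

Scanning η over [0, 1/2) locates the crossing near η ≈ 0.25, matching the threshold in the proof.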
Given noise strength λ > 0, we denote the noisy input data as x̃ = (x₁ + λε₁, x₂ + λε₂). Then the learned monosemantic and polysemantic representations are ν_mono = x₁ + λε₁ and ν_poly = (x₁ − x₂) + λ(ε₁ − ε₂). Next, we derive the influence of the noise strength on the conditional means and variances, respectively.

Theorem B.7 (Influence of Gaussian noise on inter-class distance). Given noise strength λ > 0, for both mono- and poly-semantic representations, we have

  Δμ̃(ν) = Δμ(ν).   (47)

Proof of Theorem B.7. For ν_mono and i = 0, 1,

  μ̃ᵢ(ν_mono) = E(x₁ + λε₁ | y=i) = E(x₁ | y=i) + λ E(ε₁ | y=i) = E(x₁ | y=i) + 0 = μᵢ(ν_mono).   (48)

For ν_poly and i = 0, 1,

  μ̃ᵢ(ν_poly) = E((x₁ − x₂) + λ(ε₁ − ε₂) | y=i)
    = E(x₁ − x₂ | y=i) + λ [ E(ε₁ | y=i) − E(ε₂ | y=i) ]
    = E(x₁ − x₂ | y=i) + 0 = μᵢ(ν_poly).   (49)

Then for ν ∈ {ν_mono, ν_poly},

  Δμ̃(ν) = μ̃₁(ν) − μ̃₀(ν) = μ₁(ν) − μ₀(ν) = Δμ(ν).   (50)

∎

Theorem B.8 (Influence of Gaussian noise on intra-class variance). For i = 0, 1 and noise strength λ > 0, we have

  σ̃ᵢ²(ν_mono) = σᵢ²(ν_mono) + λ²,   (51)

and

  σ̃ᵢ²(ν_poly) = σᵢ²(ν_poly) + 2λ².   (52)

Proof of Theorem B.8.
For ν_mono and i = 0, 1,

  σ̃ᵢ²(ν_mono) = E((x₁ + λε₁)² | y=i) − μ̃ᵢ(ν_mono)²
    = E(x₁² | y=i) + 2λ E(x₁ε₁ | y=i) + λ² E(ε₁² | y=i) − μ̃ᵢ(ν_mono)²
    = E(x₁² | y=i) − μ̃ᵢ(ν_mono)² + 0 + λ² E(ε₁² | y=i)
    = σᵢ²(ν_mono) + λ².   (53)

For ν_poly and i = 0, 1,

  σ̃ᵢ²(ν_poly) = E(((x₁ − x₂) + λ(ε₁ − ε₂))² | y=i) − μ̃ᵢ(ν_poly)²
    = E((x₁ − x₂)² | y=i) + 2λ E((x₁ − x₂)(ε₁ − ε₂) | y=i) + λ² E((ε₁ − ε₂)² | y=i) − μ̃ᵢ(ν_poly)²
    = E((x₁ − x₂)² | y=i) − μ̃ᵢ(ν_poly)² + 2λ E((x₁ − x₂)ε₁ | y=i) − 2λ E((x₁ − x₂)ε₂ | y=i)
      + λ² E(ε₁² | y=i) − 2λ² E(ε₁ε₂ | y=i) + λ² E(ε₂² | y=i)
    = σᵢ²(ν_poly) + 2λ².   (54)

∎

Theorem B.9 (Influence of Gaussian noise on linear separability). We denote the linear separability criterion under noise as J̃(ν) = Δμ̃(ν) / (σ̃₀(ν) σ̃₁(ν)). For noise strength λ > 0,

  J̃(ν_poly)/J(ν_poly) ≤ J̃(ν_mono)/J(ν_mono) ≤ 1.   (55)

Meanwhile, we obtain J̃(ν_poly) ≤ J̃(ν_mono) when λ ≥ 0.55.

As shown in Theorem B.9, as the noise strength increases, the linear separability J(ν) of both polysemantic and monosemantic features degrades. However, J(ν_mono) decreases more slowly. As a result, once the noise strength is large enough (λ ≥ 0.55), the monosemantic feature exhibits better linear separability than the polysemantic one. These theoretical results reveal that the linear separability of monosemantic features is more robust than that of polysemantic features, which leads to better performance in tasks under input noise.

Proof of Theorem B.9.
By definition, we have

  [ J̃(ν_mono)/J(ν_mono) ] / [ J̃(ν_poly)/J(ν_poly) ]
    = { [ Δμ̃(ν_mono) / (σ̃₀(ν_mono) σ̃₁(ν_mono)) ] / [ Δμ(ν_mono) / (σ₀(ν_mono) σ₁(ν_mono)) ] }
      / { [ Δμ̃(ν_poly) / (σ̃₀(ν_poly) σ̃₁(ν_poly)) ] / [ Δμ(ν_poly) / (σ₀(ν_poly) σ₁(ν_poly)) ] }.   (56)

By Theorems B.7 and B.8, we have Δμ̃(ν_mono) = Δμ(ν_mono), Δμ̃(ν_poly) = Δμ(ν_poly), σ̃ᵢ²(ν_mono) = σᵢ²(ν_mono) + λ², and σ̃ᵢ²(ν_poly) = σᵢ²(ν_poly) + 2λ² for i = 0, 1. By Theorem 4.1, we have σ₁(ν_mono) = σ₁(ν_poly). Then we have

  [ J̃(ν_mono)/J(ν_mono) ] / [ J̃(ν_poly)/J(ν_poly) ]
    = [ σ̃₀(ν_poly) σ̃₁(ν_poly) σ₀(ν_mono) ] / [ σ̃₀(ν_mono) σ̃₁(ν_mono) σ₀(ν_poly) ]
    = [ √( (σ₀²(ν_poly) + 2λ²)(σ₁²(ν_poly) + 2λ²) ) · σ₀(ν_mono) ]
      / [ √( (σ₀²(ν_mono) + λ²)(σ₁²(ν_mono) + λ²) ) · σ₀(ν_poly) ].   (57)

Then plugging in the values from Theorem 4.1, we complete the proof. ∎

Appendix C  Additional Experiments

Figure 6: Linear probing performance with different evaluation losses on ImageNet-100 under 95% noise rates.

C.1 Combination with Robust Loss

The previous results suggest that monosemantic representations exhibit stronger robustness against label noise across various datasets. We note that there have been various approaches to improving robustness under label noise, such as applying robust loss functions (Van Rooyen et al., 2015; Ghosh et al., 2017), correcting training labels (Reed et al., 2014; Ma et al., 2018), and reweighting training samples (Chen et al., 2019; Han et al., 2018). The perspective in this paper, however, is orthogonal to them. Taking the representative robust loss function Symmetric Cross Entropy (Wang et al., 2019b) as an example, we can obtain monosemantic representations as discussed above and then apply the robust loss during linear probing. As shown in Figure 6, both the robust loss and enhanced feature monosemanticity improve robustness against label noise. Furthermore, the two approaches are complementary, and combining them further improves performance.
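The input-noise results of Appendix B.3 (Theorems B.8–B.9) admit a similar numerical sanity check. The sketch below makes two assumptions beyond the text: the joint over (x₁, x₂) is an invented Gaussian used purely for the Monte Carlo check (class conditioning is omitted, since the additive-variance identity is the same either way), and the σ values are those quoted in Eqs. (41)–(44).

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.8                                   # noise strength lambda
n = 200_000

# An invented joint over (x1, x2); any distribution works for this identity.
x = rng.normal([0.3, -0.1], [0.5, 0.4], size=(n, 2))
eps = rng.standard_normal((n, 2))           # i.i.d. N(0,1) noise, independent of x

nu_mono, nu_poly = x[:, 0], x[:, 0] - x[:, 1]
nu_mono_t = nu_mono + lam * eps[:, 0]
nu_poly_t = nu_poly + lam * (eps[:, 0] - eps[:, 1])

# Theorem B.8: the variance inflates by lambda^2 (mono) but 2*lambda^2 (poly).
assert abs(nu_mono_t.var() - (nu_mono.var() + lam**2)) < 0.03
assert abs(nu_poly_t.var() - (nu_poly.var() + 2 * lam**2)) < 0.05

# Theorem B.9, with the sigma values of Eqs. (41)-(44): both separability
# ratios J~/J degrade with lambda, but the monosemantic one degrades more slowly.
sig = {"mono": (0.246, 0.266, 1), "poly": (0.276, 0.266, 2)}  # (sigma0, sigma1, k)
def decay(rep, t):
    s0, s1, k = sig[rep]
    return s0 * s1 / np.sqrt((s0**2 + k * t**2) * (s1**2 + k * t**2))
for t in (0.2, 0.55, 1.0):
    assert decay("poly", t) <= decay("mono", t) <= 1
```

Here `decay` is exactly the ratio J̃(ν)/J(ν), since Δμ̃(ν) = Δμ(ν) by Theorem B.7 and the Δμ factors cancel.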