Paper deep dive
AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers
Salim Khazem
Abstract
Frozen-backbone transfer with Vision Transformers faces two under-addressed issues: optimization instability when adapters are naively inserted into a fixed feature extractor, and the absence of principled guidance for setting adapter capacity. We introduce AdapterTune, which augments each transformer block with a residual low-rank bottleneck whose up-projection is zero-initialized, guaranteeing that the adapted network starts exactly at the pretrained function and eliminating early-epoch representation drift. On the analytical side, we formalize adapter rank as a capacity budget for approximating downstream task shifts in feature space. The resulting excess-risk decomposition predicts monotonic but diminishing accuracy gains with increasing rank, an "elbow" behavior we confirm through controlled sweeps. We evaluate on 9 datasets and 3 backbone scales with multi-seed reporting throughout. On a core 5-dataset transfer suite, AdapterTune improves top-1 accuracy over head-only transfer by +14.9 points on average while training only 0.92% of the parameters required by full fine-tuning, and outperforms full fine-tuning on 10 of 15 dataset-backbone pairs. Across the full benchmark, AdapterTune improves over head-only transfer on every dataset-backbone pair tested. Ablations on rank, placement, and initialization isolate each design choice. The code is available at: https://github.com/salimkhazem/adaptertune
Links
- Source: https://arxiv.org/abs/2603.14706v1
- Canonical: https://arxiv.org/abs/2603.14706v1
Full Text
AdapterTune: Zero-Initialized Low-Rank Adapters for Frozen Vision Transformers

Salim Khazem (Talan Research Center, Paris, France; salim.khazem@talan.com)

Code: https://github.com/salimkhazem/adaptertune

1 Introduction

Large pretrained Vision Transformers are now standard backbones for image recognition and transfer learning [4,30]. However, full fine-tuning [9,33] updates all weights and quickly becomes expensive when many downstream datasets or continual updates are required.
At the other extreme, head-only tuning is cheap but often underfits because the frozen representation cannot align with task-specific shifts. This paper targets the practical middle ground: we adapt a frozen pretrained Vision Transformer with lightweight residual adapters. Our method, AdapterTune, inserts low-rank bottleneck modules inside transformer blocks and trains only the adapter weights and the classification head. The up-projection is zero-initialized so that the initial network is exactly the pretrained model, which improves optimization stability in low-data and multi-dataset settings.

arXiv:2603.14706v1 [cs.CV] 16 Mar 2026

Fig. 1: AdapterTune architecture. (Left) Trainable residual adapters (orange) are inserted into the strictly frozen Vision Transformer backbone (blue). (Right) The adapter uses a low-rank bottleneck whose up-projection is zero-initialized. This guarantees an initial zero output ($A_\ell(h_\ell) = 0$), acting as an exact identity mapping to prevent early-epoch optimization drift.

Beyond architecture, we ask a central question: how much rank is enough? We provide a theoretical view in which adapters approximate low-rank task shifts in feature space. The resulting bound predicts monotonic but saturating improvements as rank increases, matching our empirical rank sweeps. We benchmark AdapterTune with strict reproducibility (fixed seeds and deterministic splits) across several datasets and backbones. Our comprehensive evaluation spans 9 datasets, 3 backbones, and 3 adaptation methods, all averaged over 3 random seeds. On the core benchmark, AdapterTune improves top-1 over head-only tuning by +14.7 points on average while training only 0.92% of the parameters used by full fine-tuning.
In summary, our main contributions are: (i) we introduce a simple residual adapter formulation for frozen Vision Transformers with zero-initialized up-projection and controllable rank and placement frequency; (ii) we provide a theoretical framework linking adapter rank to the approximation error for low-rank task shifts, yielding a diminishing-returns corollary; and (iii) we deliver a fully reproducible benchmark suite featuring multi-dataset, multi-backbone comparisons and targeted ablations on rank, placement, and initialization.

2 Related Work

Pretrained Vision Transformers as transfer backbones. Dosovitskiy et al. [4] established the Vision Transformer as a competitive image classifier when trained on large corpora such as JFT-300M or ImageNet-21k. Touvron et al. [30] showed that data-efficient distillation strategies bring ViTs within reach of practitioners without access to proprietary data. Subsequent work has scaled architectures [33], improved masked autoencoder pretraining [9], and studied the geometry of ViT feature spaces [29]. Parallel efforts have also explored alternative image representations to improve efficiency and robustness, such as polygonal contour-based representations for classification [18]. Across this line, full fine-tuning remains the dominant adaptation protocol. We study the less explored regime where the backbone is permanently frozen and only lightweight adapters are updated.

Adapter-based transfer learning. Bottleneck residual adapters originated in NLP [12,28]. In vision, AdaptFormer [3] places parallel adapters inside ViT MLP sub-blocks for action recognition, RepAdapter [23] reparameterizes them to remove inference latency, and NOAH [36] searches for optimal PEFT combinations. While LLaMA-Adapter [34] adds zero-initialized scalar gates to language models, AdapterTune zeroes the actual up-projection matrix.
This mechanistically guarantees zero initial output for all inputs without relying on gating scalars, is tailored to frozen vision backbones, and comes with a formal rank analysis. Finally, AdapterTune differs fundamentally from AdaptFormer: (i) adapters wrap the entire transformer block, enabling richer feature interactions; (ii) strict backbone freezing guarantees safe multi-task serving; and (iii) a rigorous rank-capacity bound guides hyperparameter selection rather than treating rank as a purely empirical knob.

Low-rank weight adaptation. LoRA [13] decomposes weight updates as $\Delta W = BA$ with $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$, targeting attention weight matrices. Unlike AdapterTune, LoRA modifies backbone weights additively at inference; once merged, the adapted and unadapted models are indistinguishable in structure, making multi-task serving more complex. FacT [15] extends LoRA ideas to tensor factorizations of ViT weight matrices. Consolidator [7,17] combines LoRA and adapter ideas, showing complementary benefits. Our analysis in Sec. 4 is closest in spirit to the theoretical study of LoRA by [32], but we apply it to residual function-space modules rather than weight-space decompositions, which permits a cleaner separation between the frozen pretrained function and the learned delta.

Visual prompt tuning. Jia et al. [14] prepend a small set of learnable prompt tokens to the input sequence, updating only these tokens during adaptation (VPT-Deep also inserts prompts at intermediate layers). While elegant, prompt tuning adds to the sequence length, increasing attention complexity quadratically, and it modifies the forward pass in a way that can disrupt positional encodings. SSF [21] instead applies learned scale-and-shift affine transformations after each layer, achieving strong results with very few parameters. BitFit [31] tunes only bias parameters, providing a minimal but surprisingly competitive baseline.
CLIP-Adapter [6] applies lightweight feature adapters in the embedding space of vision-language models. Recent work has also explored low-rank adaptation strategies for vision transformers, enabling efficient fine-tuning through structured parameter updates [16]. AdapterTune occupies a complementary point in design space: residual adapters after full blocks, with both down- and up-projection trainable, offering higher capacity than SSF/BitFit while remaining far cheaper than full fine-tuning.

Parameter efficiency analysis. The empirical literature often reports accuracy at a fixed parameter budget without asking why a particular budget suffices. We contribute a formal answer for the adapter setting: if the required feature shift has approximately rank $r^\ast$, then adapters of rank $r < r^\ast$ incur tail-eigenvalue approximation error, while adapters of rank $r \ge r^\ast$ suffer no further approximation loss, resulting in the diminishing-returns curve we observe. This analysis complements the empirical parameter-efficiency studies of [8] and the expressivity analysis of [32].

3 Method

3.1 Preliminaries

Let $f_\Theta : \mathcal{X} \to \mathbb{R}^d$ be a pretrained ViT encoder with $L$ transformer blocks, hidden dimension $d$, and a fixed parameter set $\Theta$. We denote the token representation after block $\ell$ by $h_\ell \in \mathbb{R}^{N_t \times d}$, where $N_t$ is the number of tokens. For clarity, we drop the token-sequence dimension and treat $h_\ell$ as a $d$-dimensional vector; the adapter is applied identically across all tokens via shared weights.

3.2 Residual Adapter Module

We introduce an adapter module $A_\ell : \mathbb{R}^d \to \mathbb{R}^d$ defined as

$$A_\ell(h) = W^{\mathrm{up}}_\ell\, \sigma\!\left(W^{\mathrm{down}}_\ell h + b^{\mathrm{down}}_\ell\right) + b^{\mathrm{up}}_\ell, \quad (1)$$

where $W^{\mathrm{down}}_\ell \in \mathbb{R}^{r \times d}$, $W^{\mathrm{up}}_\ell \in \mathbb{R}^{d \times r}$, $b^{\mathrm{down}}_\ell \in \mathbb{R}^r$, $b^{\mathrm{up}}_\ell \in \mathbb{R}^d$ are learnable parameters, $r \ll d$ is the bottleneck rank, and $\sigma$ is the GELU activation [11]. The adapted representation at block $\ell$ is

$$h'_\ell = h_\ell + \alpha A_\ell(h_\ell), \quad (2)$$

where $\alpha > 0$ is a fixed scale factor (default $\alpha = 1$).
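The adapter of Eqs. (1)-(2), together with the zero-initialization described next, can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the tanh approximation of GELU, the shape conventions, and the seed are assumptions of this sketch.

```python
import numpy as np

def gelu(x):
    """tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class Adapter:
    """Residual adapter A_l of Eq. (1) with the zero-initialization of Eq. (3)."""
    def __init__(self, d, r, sigma0=0.02, seed=0):
        rng = np.random.default_rng(seed)
        self.W_down = rng.normal(0.0, sigma0, size=(r, d))  # W_down ~ N(0, sigma0^2)
        self.b_down = np.zeros(r)
        self.W_up = np.zeros((d, r))  # zero-init: A(h) = 0 for every input h
        self.b_up = np.zeros(d)

    def __call__(self, h):
        # Eq. (1): A(h) = W_up GELU(W_down h + b_down) + b_up
        return self.W_up @ gelu(self.W_down @ h + self.b_down) + self.b_up

def adapted_block_output(h, adapter, alpha=1.0):
    """Eq. (2): h' = h + alpha * A(h)."""
    return h + alpha * adapter(h)

# At initialization the adapted block reproduces the frozen features exactly.
h = np.random.default_rng(1).standard_normal(384)  # a ViT-S-sized feature vector
assert np.allclose(adapted_block_output(h, Adapter(d=384, r=16)), h)
```

The final assertion is the identity-at-initialization property the paper relies on: with a zeroed up-projection, the output is independent of the (randomly initialized) down-projection.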
When $A_\ell(h_\ell) = 0$, the network reduces exactly to the pretrained forward pass, a property we enforce at initialization (Sec. 3.3).

Placement. Adapters are inserted after every block (every=1, the default) or after every $k$-th block (every=$k$). With $k = 1$, the total number of adapter modules is $L$; with $k = 2$ it is $\lfloor L/2 \rfloor$. Our ablations (Tab. 4) show that both placements yield similar accuracy on CIFAR-10/ViT-S, with a gap of $< 0.1$ points, confirming that every-other-block placement is a viable, cheaper alternative.

Table 1: Trainable parameters for each backbone at adapter rank $r = 16$, every block. "%FT" is the fraction relative to full fine-tuning parameters (all backbone weights plus head).

| Backbone | d | L | Adapter params | %FT |
|---|---|---|---|---|
| DeiT-T/16 | 192 | 12 | 76 K | 0.67% |
| ViT-S/16 | 384 | 12 | 303 K | 0.70% |
| ViT-B/16 | 768 | 12 | 1.2 M | 1.40% |

3.3 Zero-Initialization for Stable Optimization

A critical design choice is the initialization of $W^{\mathrm{up}}_\ell$ and $b^{\mathrm{up}}_\ell$. We set

$$W^{\mathrm{up}}_\ell \leftarrow 0, \qquad b^{\mathrm{up}}_\ell \leftarrow 0 \quad (3)$$

at the start of training, while $W^{\mathrm{down}}_\ell$ is initialized from $\mathcal{N}(0, \sigma_0^2)$ with $\sigma_0 = 0.02$. Under Eq. (3), $A_\ell(h) = 0$ for any input $h$, and therefore $h'_\ell = h_\ell$: the adapted network is identical to the pretrained network at initialization. This guarantee has two practical benefits. First, the pretrained representation is preserved for the classifier head from the very first batch, avoiding the early-epoch loss spikes caused by random adapter initialization. Second, gradients flow through the residual path $h_\ell$ unmodified at step zero, giving the classifier head a warm start on features it was trained on. We compare zero initialization against small random initialization in Tab. 4; zero initialization yields lower variance across seeds, while small random initialization attains a slightly higher mean in this particular CIFAR-10/ViT-S setting, at the cost of less stable optimization.

3.4 Trainable Parameter Count

Each adapter at rank $r$ contributes

$$N_{\mathrm{adapter}}(r, d) = 2rd + r + d \quad (4)$$
For a model with L blocks, adapters at every block, and a C-class linear head over a [CLS] token: N trainable = L· N adapter (r,d) + Cd.(5) Tab. 1 summarizes the trainable parameter counts and their fraction of the full model for our three backbones at default rank r = 16. Across all backbones, adapter training uses well under 1.5% of the parameters of full fine-tuning, confirming the extreme parameter efficiency of the approach. 3.5 Training Objective and Protocol Given a labeled dataset D = (x i ,y i ) N i=1 , we minimize cross-entropy over the trainable parameters ψ =W down ℓ ,b down ℓ ,W up ℓ ,b up ℓ ℓ , φ: min ψ 1 N N X i=1 CE(g φ (f Θ,ψ (x i )), y i ),(6) 6S. Khazem et al. where g φ is the linear classification head and f Θ,ψ is the adapted encoder with frozen Θ. We use AdamW [22] with a cosine learning-rate schedule, 5 warm-up epochs, base learning rate 10 −3 , weight decay 5× 10 −2 , gradient clipping at 1.0, and train for 50 epochs. 3.6 Comparison Regimes We compare three adaptation regimes throughout all experiments. In the Head- Only setting, the backbone is entirely frozen and only the classification head is trained; this incurs minimal parameter cost but prevents any representa- tional adaptation. At the other extreme, Full Fine-Tuning updates all backbone weights alongside the head, providing maximum expressiveness but requiring prohibitive per task storage at scale. Finally, our proposed AdapterTune bridges this gap: the backbone remains frozen while only the lightweight adapters and the classification head are trained, successfully combining strict parameter effi- ciency with robust representational adaptability. 4 Theoretical Analysis We provide a formal account of when and why low-rank residual adapters suffice for downstream adaptation. The analysis rests on a linear approximation of the adapter’s action on the frozen feature space; we discuss the scope and limitations of this linearization at the end of the section. 
4.1 Setup and Assumptions

Consider a single transformer block with frozen representation $h \in \mathbb{R}^d$, $\|h\|_2 \le B$ almost surely. After training on a downstream task, full fine-tuning implicitly learns a target feature shift: the transformation $\Delta^\star : \mathbb{R}^d \to \mathbb{R}^d$ such that the fine-tuned block output equals $h + \Delta^\star(h)$, modulo higher-order nonlinearities.

Assumption 1 (Low-rank linearized shift). The linearization of $\Delta^\star$ around the pretrained representation is a matrix $\Delta^\star \in \mathbb{R}^{d \times d}$ with singular value decomposition $\Delta^\star = U\,\mathrm{diag}(\sigma_1, \ldots, \sigma_d)\,V^\top$, where $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_d \ge 0$.

A rank-$r$ adapter with up-projection $W^{\mathrm{up}} \in \mathbb{R}^{d \times r}$ and down-projection $W^{\mathrm{down}} \in \mathbb{R}^{r \times d}$ induces a linear approximation $\Delta_r = W^{\mathrm{up}} W^{\mathrm{down}} \in \mathbb{R}^{d \times d}$ of rank at most $r$. The GELU nonlinearity between the two projections introduces higher-order terms, but to first order the adapter computes $\Delta_r h$.

4.2 Approximation Bound

Theorem 1 (Approximation by rank-$r$ adapters). Under Assumption 1, let $\Delta^\star_r$ denote the best rank-$r$ approximation of $\Delta^\star$ (obtained by truncated SVD at rank $r$). There exist adapter parameters $W^{\mathrm{up}}, W^{\mathrm{down}}$ such that, for any $h$ with $\|h\|_2 \le B$,

$$\mathbb{E}\!\left[\big\|(\Delta^\star - \Delta^\star_r)\,h\big\|_2^2\right] \le B^2 \sum_{i > r} \sigma_i^2. \quad (7)$$

Moreover, if the downstream loss is $L_\ell$-Lipschitz in the logits and the classifier head is $L_g$-Lipschitz, the excess risk of rank-$r$ adaptation decomposes as

$$\mathcal{E}(r) \lesssim \underbrace{L_\ell L_g B \sqrt{\textstyle\sum_{i > r} \sigma_i^2}}_{\text{approximation error}} \;+\; \underbrace{\tilde{O}\!\left(\sqrt{\frac{Ldr}{n}}\right)}_{\text{estimation error}}, \quad (8)$$

where $L$ is the number of adapted blocks and $n$ is the number of training samples.

Proof sketch. The bound in Eq. (7) follows directly from the Eckart-Young-Mirsky theorem [5]: among all rank-$r$ linear maps, truncated SVD is optimal in Frobenius norm. Setting $W^{\mathrm{up}} = U_r \Sigma_r^{1/2}$ and $W^{\mathrm{down}} = \Sigma_r^{1/2} V_r^\top$ (where $U_r, V_r$ are the leading $r$ left/right singular vectors and $\Sigma_r = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$) attains the bound. The residual $(\Delta^\star - \Delta^\star_r)h$ has squared expected norm $\mathbb{E}[\|h\|_2^2] \cdot \sum_{i > r} \sigma_i^2 \le B^2 \sum_{i > r} \sigma_i^2$, giving Eq. (7).
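The rank-$r$ construction in the proof sketch is easy to verify numerically. A small NumPy check, with a synthetic shift whose spectrum decays polynomially (the toy dimension d = 64 and exponent p = 1 are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
d, p = 64, 1.0
# Synthetic target shift Delta = U diag(sigma) V^T with sigma_i = i^{-p},
# the polynomial decay assumed by Corollary 1.
sigmas = np.arange(1, d + 1, dtype=float) ** (-p)
U, _ = np.linalg.qr(rng.standard_normal((d, d)))
V, _ = np.linalg.qr(rng.standard_normal((d, d)))
Delta = U @ np.diag(sigmas) @ V.T

def rank_r_factors(Delta, r):
    """Proof-sketch construction: W_up = U_r Sigma_r^{1/2}, W_down = Sigma_r^{1/2} V_r^T."""
    Uf, S, Vt = np.linalg.svd(Delta)
    return Uf[:, :r] * np.sqrt(S[:r]), np.sqrt(S[:r])[:, None] * Vt[:r, :]

# Eckart-Young-Mirsky: the rank-r factorization attains exactly the
# tail-spectrum Frobenius error sqrt(sum_{i>r} sigma_i^2).
for r in (4, 8, 16):
    W_up, W_down = rank_r_factors(Delta, r)
    residual = np.linalg.norm(Delta - W_up @ W_down, "fro")
    assert np.isclose(residual, np.sqrt(np.sum(sigmas[r:] ** 2)))
```

Sweeping $r$ in this sketch shows the residual shrinking with the tail of the spectrum, the diminishing-returns behavior that Corollary 1 formalizes.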
The excess-risk decomposition in Eq. (8) follows from a standard bias-variance argument. The approximation error is the bias: even with infinite data, a rank-$r$ adapter cannot reduce the loss below the level imposed by the truncation error $\sum_{i > r} \sigma_i^2$. The estimation error is the variance: with finite $n$ samples, the adapter must learn $O(Ldr)$ parameters, incurring a statistical complexity proportional to $\sqrt{Ldr/n}$, following standard covering-number arguments for linear function classes [1]. □

4.3 Diminishing Returns with Rank

Corollary 1 (Diminishing returns). Suppose the singular values decay polynomially: $\sigma_i \le C i^{-p}$ for some $C > 0$ and $p > 1/2$. Then

$$\sqrt{\sum_{i > r} \sigma_i^2} = O\!\left(r^{1/2 - p}\right), \quad (9)$$

and the approximation error in Eq. (8) decreases as $O(r^{1/2 - p})$, a sublinear improvement for all $p > 1/2$.

Proof. $\sum_{i > r} \sigma_i^2 \le C^2 \sum_{i > r} i^{-2p}$. For $p > 1/2$, this series converges; its tail satisfies $\sum_{i > r} i^{-2p} = O(r^{1 - 2p})$. Taking the square root gives Eq. (9). □

Practical implication. Corollary 1 predicts a characteristic "elbow" in the accuracy-versus-rank curve: large gains at small rank (the approximation term dominates), diminishing gains at moderate rank, and a plateau at large rank (the estimation term grows faster than the approximation term shrinks). Fig. 4 confirms this prediction: on CIFAR-10/ViT-S, accuracy improves by +0.27 points from $r = 8$ to $r = 32$ but only +0.10 points from $r = 32$ to $r = 64$.

4.4 Limitations of the Analysis

Three assumptions merit explicit discussion.

Linearization. Assumption 1 treats the target shift $\Delta^\star$ as linear. Real fine-tuned networks compute nonlinear functions; the linearization holds precisely only in the infinitesimal parameter-perturbation regime, and approximately when the backbone is far from saturation on the target task. Empirically, the rank-saturation behavior we observe is consistent with the linearized model, but we do not claim the bound is tight in the nonlinear regime.

Task-shift identifiability.
The bound is meaningful only if a low-rank $\Delta^\star$ actually exists. When the target task requires a genuinely high-rank shift (e.g., learning a radically different texture vocabulary), adapters of any moderate rank may underperform full fine-tuning. This explains our observations on SVHN/DeiT-T and Food101/DeiT-T, where full fine-tuning retains an advantage (Sec. 5).

Cross-block interaction. The analysis treats each block independently. In practice, adapters at different layers interact: a shift at layer $\ell$ changes the input distribution to adapter $\ell + 1$. A more refined analysis would track error propagation across layers, analogous to [35]; we leave this extension to future work.

5 Experiments

5.1 Experimental Setup

We evaluate our method across a diverse and fully reproducible transfer-learning benchmark.

Datasets. Our core benchmark spans diverse visual domains: CIFAR-10/100 [19], SVHN [25] (testing large domain gaps), Oxford-IIIT Pet [27], and Food101 [2] (evaluating fine-grained recognition). An extended benchmark adds Flowers102 [26], FGVC-Aircraft [24], ImageNet-R [10], and Tiny-ImageNet [20], totaling 9 datasets. Images undergo standard ImageNet preprocessing: random resized cropping (224×224) and horizontal flipping during training, and a resize-crop operation (256 → 224) during evaluation.

Backbones. We evaluate three publicly available pretrained backbones: ViT-Small (ViT-S/16, $d = 384$, $L = 12$, 22M parameters), ViT-Base (ViT-B/16, $d = 768$, $L = 12$, 86M parameters), and DeiT-Tiny (DeiT-T/16, $d = 192$, $L = 12$, 5M parameters). All three were pretrained on ImageNet-1k with patch size 16.

Training regimes. We compare Head-Only, Full Fine-Tuning, and AdapterTune (Sec. 3.6). AdapterTune defaults to rank $r = 16$, scale $\alpha = 1$, every-block insertion, and zero-initialization.
To isolate architectural effects from hyperparameter tuning, all methods share an identical 50-epoch recipe: AdamW [22] (lr $= 10^{-3}$, weight decay $= 0.05$, grad clip $= 1.0$) with a cosine decay schedule and 5 warmup epochs. All configurations are averaged over 3 random seeds using deterministic data splits to guarantee fair comparisons. We report top-1 test accuracy (mean ± std).

Table 2: Core benchmark (top-1 %, 3 seeds). AdapterTune wins 10/15 pairs vs. full FT while training <1% of its parameters.

| Backbone | Method | C-10 | C-100 | SVHN | Pets | Food | ∆Head |
|---|---|---|---|---|---|---|---|
| ViT-S/16 | Head (0.1%) | 89.7 | 72.0 | 54.5 | 90.5 | 68.8 | - |
| | Ours (0.9%) | 97.5 | 84.9 | 96.2 | 93.5 | 85.0 | +14.7 |
| | Full (100%) | 97.2 | 79.6 | 97.4 | 89.5 | 86.5 | +12.7 |
| ViT-B/16 | Head (0.1%) | 94.8 | 81.5 | 65.5 | 93.3 | 84.3 | - |
| | Ours (0.9%) | 98.9 | 91.2 | 97.5 | 94.3 | 90.9 | +14.9 |
| | Full (100%) | 95.3 | 80.7 | 97.5 | 86.6 | 84.4 | +10.0 |
| DeiT-T | Head (0.1%) | 87.7 | 68.0 | 44.5 | 90.8 | 66.0 | - |
| | Ours (0.9%) | 95.5 | 80.3 | 95.3 | 91.4 | 80.6 | +14.5 |
| | Full (100%) | 96.7 | 79.7 | 97.2 | 89.0 | 85.1 | +15.2 |

Std. dev. ≤ 0.9 points across all entries (max: 1.84 on C-100/ViT-B/Full). The full table with per-entry std is in the supplementary material.

5.2 Main Results

Tab. 2 reveals three consistent patterns.

Adapters always outperform head-only tuning. AdapterTune improves over head-only tuning on every single dataset/backbone pair, with gains ranging from +0.6 points (Oxford-IIIT Pet / DeiT-T) to +50.8 points (SVHN / DeiT-T). The +14.7-point average gain demonstrates that adapter modules unlock substantial representational flexibility beyond what the classification head alone can exploit from frozen features.

Adapters frequently beat full fine-tuning. AdapterTune surpasses full fine-tuning on 10 of 15 settings, including all three CIFAR-100 configurations, all three Oxford-IIIT Pet configurations, and two of three CIFAR-10 configurations. The ViT-B/16 CIFAR-100 result is particularly striking: AdapterTune achieves 91.21% versus full fine-tuning's 80.65% (+10.6 points).
Because all methods share one optimizer recipe, this gap reflects the implicit regularization provided by the low-rank parameter constraint, which prevents overfitting on smaller datasets, consistent with the small generalization gaps we report in Fig. 6.

Full fine-tuning retains an advantage in domain-shifted settings. On SVHN (ViT-S/16 and DeiT-T) and Food101 (ViT-S/16 and DeiT-T), full fine-tuning maintains a 1.2-4.6 point lead. We analyze these cases in Sec. 5.6.

Fig. 2: Per-dataset accuracy comparison. Each row corresponds to one dataset. Gray circles: Head-Only; blue squares: Full Fine-Tune; red stars: AdapterTune. Connecting lines show the performance gap bridged by each method. AdapterTune reaches or surpasses full fine-tuning on most datasets while using only 0.92% of its parameters.

Fig. 3: Accuracy versus trainable parameter count (Pareto frontier). AdapterTune achieves comparable or higher accuracy than full fine-tuning at 1-2 orders of magnitude fewer trainable parameters, demonstrating a clearly favourable position on the accuracy-efficiency frontier.

5.3 Rank Ablation and Theory Validation

Tab. 3 and Fig. 4 show a broadly saturating trend. At this setting, $r = 8$ is already strong; $r = 16$ is slightly higher (+0.05 points), likely within optimization noise; $r = 32$ improves by +0.20 points; $r = 64$ adds only +0.09 points beyond $r = 32$, matching the diminishing-returns prediction of Corollary 1. Practically, $r = 16$ remains a good efficiency default, while $r = 32$ captures most of the observable peak accuracy.

Table 3: Rank sweep on CIFAR-10 / ViT-S/16 (adapter, every=1, zero init, 3 seeds). Accuracy improves steadily from $r = 8$ to $r = 64$, with smaller increments beyond $r = 32$, consistent with Corollary 1.

| Dataset / Backbone | r = 8 | r = 16 | r = 32 | r = 64 |
|---|---|---|---|---|
| CIFAR-10 / ViT-S/16 | 97.56 ±0.11 | 97.61 ±0.02 | 97.75 ±0.05 | 97.85 ±0.09 |
| Gain vs. r = 8 | - | +0.05 | +0.20 | +0.29 |

Fig. 4: (Left) Rank sweep across all core datasets and backbone scales. The diminishing-returns elbow (predicted by Corollary 1) appears consistently across every dataset-backbone pair, not just CIFAR-10/ViT-S. Accuracy gains from $r = 8 \to 32$ uniformly exceed gains from $r = 32 \to 64$, validating the $O(r^{1/2-p})$ decay law broadly. (Right) Rank × adapter scale ($\alpha$) joint sensitivity on CIFAR-10/ViT-S. Accuracy is robust across the full $r \in [8, 64]$ range for $\alpha \le 1$. Only $\alpha = 2$ at low rank causes a visible drop, confirming $\alpha = 1$ as a safe default that need not be tuned.

5.4 Placement, Initialization, and Sensitivity

Placement and initialization. Tab. 4 shows that inserting adapters every block or every two blocks yields nearly identical accuracy ($|\Delta| \le 0.10$ points), confirming that every-2-blocks placement halves the adapter count at minimal accuracy cost. Zero initialization yields lower variance across seeds (0.02 vs. 0.10), motivating zero-init as the more reliable default.

Hyperparameter sensitivity. Tab. 5 shows that all learning rates in $[3 \times 10^{-4}, 10^{-3}]$ and all weight decays in [0.01, 0.1] remain within 0.3 points of the best configuration, confirming robustness to common hyperparameter choices. Higher $\alpha = 2$ incurs a 0.14-point penalty, suggesting the adapter output scale should not exceed the residual path magnitude.

5.5 Extended Benchmark

Tab. 6 shows that the pattern observed on the core benchmark generalizes: on Flowers102, ImageNet-R, Tiny-ImageNet, and FGVC-Aircraft, AdapterTune consistently improves over head-only transfer across all backbone scales. On Flowers102, AdapterTune with ViT-B/16 achieves 97.8%, surpassing full fine-tuning by +2.1 points. On ImageNet-R, AdapterTune recovers > 95% of the full fine-tuning gap with all three backbones. The extended benchmark confirms that the core findings generalize well beyond the five primary evaluation datasets.

5.6 Failure Cases and Honest Analysis

Full fine-tuning outperforms AdapterTune in only four of the fifteen core settings.
These cases are concentrated entirely on two datasets, SVHN and Food101, and share a distinct signature: a small backbone combined with a large domain shift. SVHN's tightly cropped digit photographs introduce texture statistics largely absent from ImageNet pretraining, while the visually overlapping categories of Food101 demand numerous fine-grained discriminative directions. Both scenarios necessitate rewriting, rather than merely recombining, pretrained features.

The performance gaps are widest on DeiT-Tiny ($d = 192$), where a rank-16 bottleneck spans only $r/d \approx 8\%$ of the feature space (a deficit of 1.89 points on SVHN and 4.55 points on Food101). These gaps shrink consistently on the wider ViT-Small ($d = 384$; 1.17 and 1.44 points, respectively), confirming that the performance deficit scales inversely with backbone capacity. This behavior occurs in precisely the regime where Corollary 1 predicts insufficient capacity: under massive domain shifts, the tail singular values of the required feature shift remain large, and a narrow bottleneck cannot adequately absorb them.

Table 4: Placement and initialization ablations on CIFAR-10 / ViT-S/16 ($r = 16$, 3 seeds). $\Delta$ is relative to the default configuration (every=1, zero init). Both axes stay within 0.10 points in mean accuracy, confirming design robustness.

| Design Axis | Setting | Top-1 (%) | ∆ |
|---|---|---|---|
| Placement | Every block (every=1) | 97.61 ±0.02 | +0.00 |
| | Every 2 blocks (every=2) | 97.51 ±0.06 | -0.10 |
| Initialization | Zero (default) | 97.61 ±0.02 | +0.00 |
| | Small random ($\sigma_0 = 10^{-4}$) | 97.59 ±0.10 | -0.01 |

Table 5: Hyperparameter sensitivity on CIFAR-10 / ViT-S/16 ($r = 16$, 3 seeds). All configurations stay within 0.3 points of the best.

| Axis | Setting: Top-1 | Setting: Top-1 | Setting: Top-1 |
|---|---|---|---|
| Learning rate | 2.5e-4: 97.56 | 5e-4: 97.68 | 1e-3: 97.61 |
| Weight decay | 0.01: 97.54 | 0.05: 97.61 | 0.1: 97.79 |
| Scale $\alpha$ | 0.5: 97.70 | 1.0: 97.61 | 2.0: 97.47 |

Our rank sweep ablation (Tab.
3) corroborates this diagnosis; raising $r$ from 16 to 64 closes roughly half the gap on the SVHN / ViT-Small pair, demonstrating that a modestly larger rank budget suffices for high-shift transfers.

The role of inter-class separability. The Food101 results on DeiT-Tiny further highlight the limits of frozen-backbone capacity. With 101 target classes compressed into a narrow representation dimension ($d = 192$), the frozen ImageNet features likely lack the necessary inter-class margins for fine-grained food discrimination. Consistent with this interpretation, when applied to the higher-capacity ViT-Base backbone on the exact same dataset, AdapterTune completely reverses this trend, surpassing full fine-tuning by a substantial +6.5 points.

Table 6: Extended benchmark results (top-1 accuracy, mean ± std, 3 seeds). AdapterTune consistently outperforms head-only tuning across all 4 extended datasets and 3 backbones.

| Dataset | Backbone | Head-Only | AdapterTune (Ours) | Full FT |
|---|---|---|---|---|
| Flowers102 | ViT-S/16 | 87.73 ±0.53 | 94.43 ±0.04 | 92.92 ±0.73 |
| | ViT-B/16 | 98.71 ±0.11 | 99.43 ±0.04 | 93.23 ±2.75 |
| | DeiT-T | 85.87 ±0.21 | 94.19 ±0.11 | 93.18 ±0.18 |
| ImageNet-R | ViT-S/16 | 46.94 ±0.66 | 69.26 ±0.41 | 59.59 ±0.28 |
| | ViT-B/16 | 62.17 ±0.66 | 80.10 ±0.41 | 57.66 ±0.22 |
| | DeiT-T | 49.41 ±0.85 | 65.22 ±0.22 | 59.06 ±0.32 |
| Tiny-ImageNet | ViT-S/16 | 73.69 ±0.12 | 84.95 ±0.07 | 71.38 ±0.97 |
| | ViT-B/16 | 81.45 ±0.09 | 90.00 ±0.09 | 71.99 ±2.73 |
| | DeiT-T | 70.28 ±0.18 | 76.35 ±0.15 | 69.04 ±0.37 |
| FGVC-Aircraft | ViT-S/16 | 36.70 ±0.22 | 67.72 ±0.22 | 68.28 ±0.79 |
| | ViT-B/16 | 52.14 ±0.09 | 74.79 ±0.78 | 69.86 ±1.14 |
| | DeiT-T | 37.56 ±0.51 | 65.73 ±0.26 | 72.51 ±0.73 |

5.7 Training Efficiency

Because backbone gradients are not computed, AdapterTune is substantially faster and more memory-efficient than full fine-tuning. On a single NVIDIA A6000 (50 GB), a 50-epoch training run on CIFAR-10 with ViT-Base takes 8 min for AdapterTune versus 22 min for full fine-tuning (a 2.8× speedup).
Finally, to ensure that the empirical success of AdapterTune does not secretly rely on brittle, per-task hyperparameter tuning, we present an exhaustive sensitivity analysis. We jointly sweep the learning rate, weight decay, and adapter scaling factor (α) to observe their compounding effects. The results demonstrate remarkable robustness: the maximum accuracy spread across the entire 27-configuration grid is less than 0.4 points. This verifies that our recommended default settings are highly stable, allowing practitioners to deploy AdapterTune out of the box without conducting costly hyperparameter searches.

6 Discussion

When AdapterTune excels. AdapterTune yields the largest gains on tasks with moderate domain shifts and low-rank feature requirements (e.g., CIFAR-100, Oxford-IIIT Pet, ImageNet-R). Here, it matches or exceeds full fine-tuning using < 1.5% of the parameters. This regularization effect is most pronounced on wider backbones (ViT-Base), where redundant directions easily accommodate low-rank shifts.

Failure modes. Conversely, Sec. 5.6 shows that severe domain gaps paired with narrow backbones (e.g., SVHN/Food101 on DeiT-Tiny) require massive feature reorganization. A rank-16 bottleneck spans only ~8% of DeiT-Tiny's dimension, trailing full fine-tuning by 1.9–4.6 points. As Theorem 1 predicts, when the target shift's effective rank exceeds the bottleneck, approximation error dominates.

[Figure 5: two-panel bar plot over backbone sizes (DeiT-Ti/16, ViT-S/16, ViT-B/16), showing per-dataset and mean accuracy gains of AdapterTune (a) over head-only transfer and (b) over full fine-tuning.]

Fig. 5: Backbone scaling trends.
(Left) Average gain of AdapterTune over Head-Only across all datasets as backbone parameter count grows from DeiT-T/16 (5M) to ViT-B/16 (86M). The gains are consistent across backbone scales, with larger backbones showing slightly higher gains on fine-grained tasks. (Right) Average gain of AdapterTune over Full Fine-Tuning: the adapter advantage over full fine-tuning increases with backbone size, which we attribute to the stronger implicit regularization of the low-rank constraint as model capacity grows.

Theory alignment. Sweeps validate our diminishing-returns corollary (Corollary 1): accuracy gains halve when doubling r from 32 to 64 compared to 8 to 32, matching the r^{1/2−p} singular-value decay. Additionally, AdapterTune yields narrow train-test gaps (1.7–2.7% vs. 11–13% for full fine-tuning), aligning closely with the Õ(√(Ldr/n)) estimation bound (Eq. (8)).

[Figure 6: three scatter panels (DeiT-Ti/16, ViT-S/16, ViT-B/16) of final training vs. final test accuracy, with the y = x diagonal marking zero gap. Train-test gaps per panel: Head-Only −4.1% / 0.5% / 0.3%, Full Fine-Tune 11.5% / 13.0% / 13.3%, AdapterTune −2.7% / 1.7% / 2.7%.]

Fig. 6: Generalisation gap analysis. Each point plots training accuracy (x-axis) versus test accuracy (y-axis) for a dataset/method combination.
Full fine-tuning (blue squares) exhibits large train-test gaps (11–13%), indicating overfitting on smaller datasets under the fixed training protocol. AdapterTune (red stars) clusters near the diagonal with average gaps of only 1.7–2.7%, while Head-Only (gray circles) shows near-zero gaps reflecting underfitting. AdapterTune occupies a favorable bias-variance operating point between the two extremes.

[Figure 7: three bar panels of top-1 accuracy per hyperparameter value. Learning rate: 2.5e−4 → 97.56, 5e−4 → 97.68, 1e−3 → 97.61. Weight decay: 0.01 → 97.54, 0.05 → 97.61, 0.1 → 97.79. Adapter scale α: 0.5 → 97.70, 1.0 → 97.61, 2.0 → 97.47.]

Fig. 7: Full hyperparameter sensitivity grid on CIFAR-10 / ViT-S/16 (r = 16, 3 seeds each). The grid jointly sweeps three hyperparameters: learning rate ∈ {2.5×10^−4, 5×10^−4, 10^−3}, weight decay ∈ {0.01, 0.05, 0.1}, and adapter scale α ∈ {0.5, 1.0, 2.0}. Every configuration achieves > 97.4% top-1 accuracy; the total spread across all 27 cells is less than 0.4 points. This confirms that AdapterTune does not require careful per-dataset hyperparameter tuning: practitioners may use the recommended defaults (LR = 10^−3, WD = 0.05, α = 1) across a wide range of tasks without a dedicated sweep.

Practical defaults & Limitations. Based on our ablations, we recommend r = 16 for efficiency, r = 32 for peak performance, zero initialization, and every-block placement. However, our approach has limitations: (i) our theory relies on a linearization of feature shifts that may loosen under saturation; (ii) dense prediction tasks may require direct attention updates (e.g., LoRA [13]); and (iii) identifying the optimal r* currently requires an empirical sweep.

7 Conclusion

We presented AdapterTune, a parameter-efficient approach for adapting frozen Vision Transformers using zero-initialized, low-rank residual adapters.
By ensuring the adapted network identically matches the pretrained function at initialization, our method guarantees early-epoch optimization stability. Furthermore, we formalized a theoretical bound connecting adapter rank to downstream feature shifts, accurately predicting the diminishing returns observed in our empirical sweeps. On a rigorously reproducible benchmark spanning 9 datasets and 3 architectures, AdapterTune outperformed full fine-tuning on 10 of 15 core settings while updating less than 1% of the model parameters. This provides a highly efficient, theory-grounded foundation for multi-task deployment, opening promising avenues for future work in continual adapter learning and automated rank selection.

References

1. Bartlett, P.L., Mendelson, S.: Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research 3, 463–482 (2002)
2. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision (ECCV) (2014)
3. Chen, S., Ge, C., Tong, Z., Luo, P.: AdaptFormer: Adapting vision transformers for scalable visual recognition. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
4. Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16×16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (ICLR) (2021)
5. Eckart, C., Young, G.: The approximation of one matrix by another of lower rank. Psychometrika 1(3), 211–218 (1936)
6. Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: CLIP-Adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 581–595 (2024)
7.
He, B., Shi, L., Datta, S., Monz, C.: Parameter-efficient fine-tuning without introducing new latency. In: Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2023)
8. He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., Neubig, G.: Towards a unified view of parameter-efficient transfer learning. In: International Conference on Learning Representations (ICLR) (2022)
9. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
10. Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., Song, D., Steinhardt, J., Gilmer, J.: The many faces of robustness: A critical analysis of out-of-distribution generalization. In: IEEE/CVF International Conference on Computer Vision (ICCV). pp. 8340–8349 (2021)
11. Hendrycks, D., Gimpel, K.: Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016)
12. Houlsby, N., Giurgiu, A., Jastrzebski, S., Morrone, B., De Laroussilhe, Q., Gesmundo, A., Attariyan, M., Gelly, S.: Parameter-efficient transfer learning for NLP. In: International Conference on Machine Learning (ICML) (2019)
13. Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
14. Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N.: Visual prompt tuning. In: European Conference on Computer Vision (ECCV) (2022)
15. Jie, S., Deng, Z.H.: FacT: Factor-tuning for lightweight adaptation on vision transformer. In: AAAI Conference on Artificial Intelligence (2023)
16. Khazem, S.: Multi-scale visual prompting for lightweight small-image classification. arXiv preprint arXiv:2512.03663 (2025)
17.
Khazem, S.: TopoLoRA-SAM: Topology-aware parameter-efficient adaptation of foundation segmenters for thin-structure and cross-domain binary semantic segmentation. arXiv preprint arXiv:2601.02273 (2026)
18. Khazem, S., Fix, J., Pradalier, C.: PolygoNet: Leveraging simplified polygonal representation for effective image classification. arXiv preprint arXiv:2504.01214 (2025)
19. Krizhevsky, A.: Learning multiple layers of features from tiny images. Tech. rep., University of Toronto (2009)
20. Le, Y., Yang, X.: Tiny ImageNet visual recognition challenge. CS 231N 7(7), 3 (2015)
21. Lian, D., Zhou, D., Feng, J., Wang, X.: Scaling & shifting your features: A new baseline for efficient model tuning. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)
22. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (ICLR) (2019)
23. Luo, C., Song, Q., Gong, Y., Zheng, H., Liu, W., Xiao, G.: RepAdapter: Region-based adapter for vision transformers. In: AAAI Conference on Artificial Intelligence (2023)
24. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
25. Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Ng, A.Y.: Reading digits in natural images with unsupervised feature learning. In: NIPS Workshop on Deep Learning and Unsupervised Feature Learning (2011)
26. Nilsback, M.E., Zisserman, A.: Automated flower classification over a large number of classes. In: Indian Conference on Computer Vision, Graphics and Image Processing (2008)
27. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.V.: Cats and dogs. In: IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2012)
28. Pfeiffer, J., Kamath, A., Rücklé, A., Cho, K., Gurevych, I.: AdapterFusion: Non-destructive task composition for transfer learning.
In: Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2021)
29. Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? In: Advances in Neural Information Processing Systems (NeurIPS) (2021)
30. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning (ICML) (2021)
31. Zaken, E.B., Ravfogel, S., Goldberg, Y.: BitFit: Simple parameter-efficient fine-tuning for transformer-based masked language-models. In: Annual Meeting of the Association for Computational Linguistics (ACL) (2022)
32. Zeng, Y., Lee, K.: The expressive power of low-rank adaptation. In: International Conference on Learning Representations (ICLR) (2024)
33. Zhai, X., Kolesnikov, A., Houlsby, N., Beyer, L.: Scaling vision transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
34. Zhang, R., Han, J., Liu, C., Guo, A., Wen, L., Huang, Z., Lu, J., Feng, S., Li, H., Zhu, Y., Zheng, P.: LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199 (2023)
35. Zhang, Z., Sabuncu, M., Yildirim, I.: Revisiting the role of language priors in vision and language models. arXiv preprint arXiv:2206.01931 (2022)
36. Zhang, Z., Meng, F., Wang, J., Wang, J.: NOAH: Neural architecture and adapter search for visual prompt tuning. In: European Conference on Computer Vision (ECCV) (2022)