Paper deep dive
When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?
Bahar Dibaei Nia, Farzan Farnia
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%
Last extracted: 3/26/2026, 2:33:15 AM
Summary
The paper investigates the necessity of Upper Confidence Bound (UCB) exploration bonuses in diversity-aware multi-armed bandit problems for generative model selection. It demonstrates that a 'Mixture-Greedy' strategy, which lacks explicit exploration bonuses, often outperforms UCB-based methods by leveraging implicit exploration induced by the geometry of diversity-aware objectives. The authors provide theoretical guarantees for sublinear regret and linear sampling of arms under entropy-based, kernel-based, and FID-type objectives.
Entities (5)
Relation Signals (2)
Diversity-aware objectives → induces → Implicit exploration
confidence 92% · diversity-aware objectives induce implicit exploration by favoring interior mixtures
Mixture-Greedy → outperforms → Mixture-UCB
confidence 90% · Mixture-Greedy performed better than Mixture-UCB (Rezaei et al., 2025) (with the UCB exploration term) in both the experiments with RKE and KID objectives.
Cypher Suggestions (2)
Find algorithms compared in the paper · confidence 85% · unvalidated
MATCH (a:Algorithm)-[:COMPARED_TO]->(b:Algorithm) RETURN a.name, b.name
Map evaluation metrics to their properties · confidence 80% · unvalidated
MATCH (m:Metric)-[:HAS_PROPERTY]->(p:Property) RETURN m.name, p.name
Abstract
Abstract:Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple \emph{Mixture-Greedy} strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.
Tags
Links
- Source: https://arxiv.org/abs/2603.21716v1
- Canonical: https://arxiv.org/abs/2603.21716v1
Full Text
119,038 characters extracted from source content.
When Exploration Comes for Free with Mixture-Greedy: Do we need UCB in Diversity-Aware Multi-Armed Bandits?

Bahar Dibaei Nia, Farzan Farnia
Department of Computer Science and Engineering, Chinese University of Hong Kong. {bahar, farnia}@cse.cuhk.edu.hk

Abstract: Efficient selection among multiple generative models is increasingly important in modern generative AI, where sampling from suboptimal models is costly. This problem can be formulated as a multi-armed bandit task. Under diversity-aware evaluation metrics, a non-degenerate mixture of generators can outperform any individual model, distinguishing this setting from classical best-arm identification. Prior approaches therefore incorporate an Upper Confidence Bound (UCB) exploration bonus into the mixture objective. However, across multiple datasets and evaluation metrics, we observe that the UCB term consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy without explicit UCB-type optimism converges faster and achieves even better performance, particularly for widely used metrics such as FID and Vendi where tight confidence bounds are difficult to construct. We provide theoretical insight explaining this behavior: under transparent structural conditions, diversity-aware objectives induce implicit exploration by favoring interior mixtures, leading to linear sampling of all arms and sublinear regret guarantees for entropy-based, kernel-based, and FID-type objectives. These results suggest that in diversity-aware multi-armed bandits for generative model selection, exploration can arise intrinsically from the objective geometry, questioning the necessity of explicit confidence bonuses.

1 Introduction

Figure 1: Comparison of mixture selection strategies over five ImageNet pretrained generative models from Stein et al. (2023)'s repository in terms of RKE↑ (left) and KID↓ (right) of generated samples from step 1 to t (plots' x-axis).
Mixture-Greedy performed better than Mixture-UCB (Rezaei et al., 2025) (with the UCB exploration term) in both the experiments with RKE and KID objectives. Mixture-Oracle always generates samples from the population-optimal mixture of models.

The past decade has witnessed remarkable advances in generative modeling, leading to a wide availability of high-quality pretrained generators across diverse architectures and training paradigms. Early approaches such as variational autoencoders (VAEs) (Kingma and Welling, 2013) and generative adversarial networks (GANs) (Goodfellow et al., 2014) established scalable frameworks for learning complex data distributions. Subsequent architectural and training advances (Karras et al., 2019; Brock et al., 2019; Vahdat and Kautz, 2020), together with the emergence of diffusion-based generative models (Ho et al., 2020; Song et al., 2021; Rombach et al., 2022), have substantially improved both sample fidelity and diversity. As a result, practitioners often have access to multiple well-trained generative models that differ in architecture, training data, and inference cost. In such settings, a fundamental practical question arises: how can one select or combine these generators in a sample-efficient manner? Generating samples from suboptimal models can be computationally expensive, and evaluation budgets are typically limited. Suppose we are given $m$ candidate generators, where the $i$th generator $G_i$ induces a distribution $P_{G_i}$. At each round $t=1,2,\dots,T$, we select a generator index $I_t\in[m]$, observe a sample $X_t\sim P_{G_{I_t}}$, and update an estimate of a target evaluation score. The objective is to allocate samples so as to approach the best achievable evaluation performance while minimizing the number of samples drawn from suboptimal generators. A natural formalization of this task is as a stochastic multi-armed bandit problem (Russo et al., 2018; Slivkins, 2019; Lattimore and Szepesvári, 2020).
Standard bandit algorithms aim to balance exploration and exploitation in the online selection process. To do this, a well-established approach is the Upper Confidence Bound (UCB) principle (Auer et al., 2002), which augments empirical estimates with optimism bonuses to encourage exploration of uncertain arms. In the context of generative model selection, however, the evaluation criteria are typically non-linear functionals of the underlying distribution rather than simple expected rewards. For example, the Fréchet Inception Distance (FID) (Heusel et al., 2017) compares mean and covariance statistics of two distributions, and other widely used metrics such as Kernel Inception Distance (KID) (Binkowski et al., 2018) and precision-recall based measures (Sajjadi et al., 2018; Kynkänniemi et al., 2019) also depend on higher-order distributional properties. To address this, Hu et al. (2025) proposed a bandit framework tailored to generative model evaluation and derived UCB-type confidence bounds for the FID metric, enabling online best-arm identification despite the non-linearity of the objective.

A notable finding in generative model selection is the recognition that the optimal solution need not be a single generator, i.e., the best individual arm in the bandit formulation considered by Hu et al. (2025). Rather, as shown by Rezaei et al. (2025), under several diversity-aware evaluation metrics, a non-degenerate mixture of generators can strictly outperform every individual model. Let $\alpha\in\Delta_m:=\{\alpha\in\mathbb{R}^m:\alpha\succeq 0,\ \mathbf{1}^\top\alpha=1\}$ denote mixture weights and define the mixture distribution $P_\alpha=\sum_{i=1}^m \alpha_i P_{G_i}$. The mixture-model selection problem can then be formulated as

$\alpha^\star \in \arg\max_{\alpha\in\Delta_m}\ \mathrm{Score}(P_\alpha)$  (1)

Unlike classical best-arm identification, the optimizer of equation 1 may lie in the interior of the simplex, implying that combining generators can strictly improve performance. To address this, Rezaei et al.
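In code, drawing from the mixture $P_\alpha$ is a two-stage sample: pick an arm index from $\alpha$, then sample from that arm's generator. A minimal sketch, using hypothetical 1-D Gaussian stand-ins for the generators (the means and weights below are made-up examples):

```python
import numpy as np

def sample_from_mixture(alpha, generators, rng):
    """Draw one sample from P_alpha: pick arm i ~ alpha, then x ~ P_{G_i}."""
    i = rng.choice(len(alpha), p=alpha)
    return i, generators[i](rng)

# Toy stand-ins for generators: 1-D Gaussians with different means.
generators = [lambda rng, m=m: rng.normal(loc=m) for m in (0.0, 5.0, 10.0)]
rng = np.random.default_rng(0)
alpha = np.array([0.2, 0.3, 0.5])
draws = [sample_from_mixture(alpha, generators, rng) for _ in range(2000)]
arms = np.array([i for i, _ in draws])
# Empirical arm frequencies approximate the mixture weights alpha.
freqs = np.bincount(arms, minlength=3) / len(arms)
```

The two-stage draw is exactly what the bandit protocol performs online once the learner commits to a weight vector for the current round.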
(2025) proposed Mixture-UCB algorithms that operate over the simplex and incorporate Upper Confidence Bound (UCB) bonuses. For quadratic kernel-based objectives such as KID (Binkowski et al., 2018) and RKE (Jalali et al., 2023), they showed that uniform high-probability guarantees can be obtained by estimating a finite number of pairwise expectations, making UCB-style exploration analytically tractable in that regime. Extending such guarantees to non-quadratic metrics, including FID and entropy-based diversity scores such as the Vendi score (Friedman and Dieng, 2023), remains significantly more challenging, as the resulting UCB bound would be loose due to the non-quadratic structure, leading to over-exploration and substantially slower convergence to the optimal solution.

In this work, we revisit a foundational design choice in bandit-based generative model selection: Is an explicit UCB exploration bonus actually necessary? In classical best-arm bandits with linear rewards, purely greedy strategies are known to be brittle: early estimation noise can cause the algorithm to over-commit to an apparently good arm, stop collecting data from others, and never recover. UCB-style optimism was introduced precisely to prevent this collapse by forcing continued exploration (Auer et al., 2002). Surprisingly, we empirically find that this classical intuition does not necessarily carry over to diversity-aware mixture selection. Across extensive experiments on multiple datasets and evaluation metrics, adding the UCB bonus consistently slows convergence and often reduces sample efficiency. In contrast, a simple Mixture-Greedy strategy that repeatedly optimizes the empirical objective over the simplex $\Delta_m$ (without any explicit exploration bonus) converges faster and achieves comparable or superior performance, especially for widely used non-quadratic metrics such as FID and Vendi, where constructing UCB bounds is challenging.
Our explanation is that diversity-aware mixture objectives can induce implicit exploration. The main distinction from best-arm identification is that the decision space here is not the discrete set of arms but the continuous simplex of mixtures. For many diversity-aware metrics, the optimum is attained by a non-degenerate mixture, and the objective itself tends to disfavor degenerate solutions that place nearly all mass on a single generator. As a result, optimizing the empirical objective can naturally produce interior mixture weights, which in turn ensures that every generator continues to receive samples. In other words, the mixture geometry plus the diversity-aware objective can create an exploration effect that is internal to the optimization problem, rather than an external confidence bonus appended to it.

We formalize this phenomenon by providing theoretical results showing that, under explicit structural conditions, Mixture-Greedy stays uniformly away from the boundary of $\Delta_m$, which yields linear sampling of all arms and sublinear regret without UCB. Our analyses cover representative families of objectives that matter in practice and were difficult to handle within prior UCB-based mixture frameworks: entropy-based objectives related to log-Vendi, FID-type Fréchet objectives under suitable interiority margin conditions, and kernel-based quadratic objectives including inverse-RKE-type formulations. Taken together with the empirical evidence, these results highlight a structural gap between diversity-aware mixture selection and classical linear-reward bandits: in the former, exploration may arise from the objective and the mixture structure themselves, suggesting that bandit methods for generative model selection should be designed around objective-induced geometry rather than relying solely on generic optimism bonuses.
Here is a summary of this work's main contributions:

• Demonstrating empirically that removing the UCB exploration bonus accelerates convergence and improves sample efficiency in mixture-based generative model multi-armed bandits.
• Studying the theoretical nature of exploration in diversity-aware generative model selection and characterizing when explicit confidence bonuses are unnecessary.
• Developing sublinear regret guarantees for Mixture-Greedy under entropy-based, kernel-based, and FID-type objectives via implicit exploration induced by objective structure.

2 Related Works

Evaluation metrics: fidelity, diversity, and memorization/novelty. Generative models are routinely evaluated with embedding-based sample metrics, including Inception Score (IS) (Salimans et al., 2016), Fréchet Inception Distance (FID) (Heusel et al., 2017), and MMD/KID (Gretton et al., 2012; Binkowski et al., 2018), as well as precision–recall style decompositions that separate quality from coverage (Sajjadi et al., 2018; Kynkänniemi et al., 2019; Naeem et al., 2020). Beyond distribution-level scores, recent work emphasizes sample-level diagnostics for memorization and novelty, including authenticity-style auditing (Alaa et al., 2022), feature-likelihood based generalization measures (Jiralerspong et al., 2023), and rarity score (Han et al., 2023). For diversity-aware evaluation, kernel/entropy-based criteria, including the Vendi score (Friedman and Dieng, 2023) and RKE (Jalali et al., 2023), quantify diversity via similarity structure. Also, two extensions of the Vendi-based metrics are discussed in the literature. Quality-weighted variants incorporate per-item utility to trade off quality and diversity within a unified similarity-based objective (Nguyen and Dieng, 2024). Also, Pasarkar and Dieng (2024) generalize Vendi to a broader Rényi-entropy family, offering different sensitivity to prevalence and enabling application-specific diversity control.
Additionally, Ospanov and Farnia (2025) study statistical convergence and propose truncated variants with provable finite-sample convergence guarantees. Another line of work highlights that these metrics inherit biases from the embedding space. Kynkänniemi et al. (2023) show that FID is tightly coupled to ImageNet class geometry; Stein et al. (2023) demonstrate systematic mismatches between embedding-based scores and human evaluation. In our numerical analysis, we use CLIP (Radford et al., 2021b) and DINOv2 (Oquab et al., 2023) backbone embeddings following the results and suggestions of these references.

Online selection of generators, mixtures, and hyperparameters. Online allocation of sampling/evaluation budgets across pretrained generators can be cast as a stochastic bandit problem. Hu et al. (2025) derive UCB-style guarantees for online selection under FID-type objectives, and Rezaei et al. (2025) show that diversity-aware objectives may be optimized by interior mixtures, proposing Mixture-UCB methods with guarantees for quadratic kernel objectives. Chen and Ghosh (2024) view generative model hyperparameter search as adaptive resource allocation, proposing Successive-Halving-style procedures guided by distributional tests. Our work focuses on a different design question: in diversity-aware mixture selection, is explicit optimism necessary, or can the objective geometry enforce sampling of all arms?

Greedy bandits: failures and implicit exploration under structure. In classical stochastic bandits with linear rewards, purely greedy policies can incur linear regret due to early mis-ordering, motivating explicit exploration principles such as UCB (Auer et al., 2002; Lattimore and Szepesvári, 2020).
However, positive results show that greedy (or exploration-light) algorithms can achieve sublinear regret when the problem structure induces implicit exploration, including many-armed regimes (Bayati et al., 2020) and contextual/linear bandits under diversity or smoothing assumptions (Bastani et al., 2021; Kannan et al., 2018). Our analysis establishes an analogous phenomenon for diversity-aware mixture objectives in generative model selection.

3 Preliminaries

3.1 Generative model evaluation

A generative model G induces a probability distribution $P_G$ over a sample space $\mathcal{X}$. We write $x\sim P_G$ for a generated sample. In practice, evaluation is performed in a fixed embedding space. Throughout our analysis, we assume $x\in\mathbb{R}^d$ denotes the embedded representation of a sample. Mathematically, given i.i.d. generated samples $x_1,\dots,x_n\sim P_G$ and real samples $y_1,\dots,y_{n_{\mathrm{real}}}\sim P_{\mathrm{data}}$ in the same embedding space, an evaluation metric computes a scalar score $\mathrm{Score}(x_1,\dots,x_n)$ reflecting fidelity and/or diversity.

Fréchet Distance (FD). Let $\hat\mu_x,\hat\Sigma_x$ denote the empirical mean and covariance matrix of $\{x_i\}_{i=1}^n$, and similarly $\hat\mu_y,\hat\Sigma_y$ for real data. The empirical Fréchet distance is

$\mathrm{FD}(\hat P_G,\hat P_{\mathrm{data}}) = \|\hat\mu_x-\hat\mu_y\|_2^2 + \mathrm{Tr}\big(\hat\Sigma_x+\hat\Sigma_y-2(\hat\Sigma_x^{1/2}\hat\Sigma_y\hat\Sigma_x^{1/2})^{1/2}\big)$  (2)

3.2 Kernel-based scores

A kernel is a symmetric positive semidefinite function $k:\mathcal{X}\times\mathcal{X}\to\mathbb{R}$. Equivalently, there exists a Hilbert space $\mathcal{H}$ and a feature map $\phi:\mathcal{X}\to\mathcal{H}$ such that $k(x,y)=\langle\phi(x),\phi(y)\rangle_{\mathcal{H}}$. Throughout, we assume the kernel is normalized: $k(x,x)=1$ for every $x\in\mathcal{X}$. Note that even if a kernel function $k$ is not normalized, we can replace $k$ with the normalized $\tilde k(x,y)=k(x,y)/\sqrt{k(x,x)\,k(y,y)}$, which remains a valid kernel function.
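As a concrete reference for equation 2, the empirical Fréchet distance can be computed from sample moments, using an eigendecomposition-based matrix square root. A minimal numpy sketch (not the paper's implementation; the sample sets below are made-up Gaussians):

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def frechet_distance(x, y):
    """Empirical Frechet distance (Eq. 2) between two sample sets in R^d."""
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x, cov_y = np.cov(x, rowvar=False), np.cov(y, rowvar=False)
    sx = psd_sqrt(cov_x)
    cross = psd_sqrt(sx @ cov_y @ sx)   # (Sx^{1/2} Sy Sx^{1/2})^{1/2}
    return float(np.sum((mu_x - mu_y) ** 2)
                 + np.trace(cov_x + cov_y - 2.0 * cross))

rng = np.random.default_rng(1)
a = rng.normal(size=(2000, 4))
b = rng.normal(size=(2000, 4)) + 1.0   # mean shifted by 1 per coordinate
```

For identical sample sets the distance is zero, and for the shifted set it is dominated by the squared mean gap (about 4 here, one per dimension).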
Given samples $x_1,\dots,x_n$, let $K\in\mathbb{R}^{n\times n}$ denote the kernel (Gram) matrix with entries defined as $K_{ij}=k(x_i,x_j)$ for every $1\le i,j\le n$.

Kernel Distance (KD). KD is the squared maximum mean discrepancy (MMD) between $P$ and $Q$:

$\mathrm{KD}(P,Q) = \mathbb{E}[k(X,X')] + \mathbb{E}[k(Y,Y')] - 2\,\mathbb{E}[k(X,Y)]$  (3)

where $X,X'\sim P$ and $Y,Y'\sim Q$ are independent. Empirical estimators replace expectations with sample averages.

Von Neumann entropy and Vendi. Define the normalized Gram matrix $\rho=\frac{1}{n}K$, which is PSD and unit-trace given a normalized kernel function. Let $\lambda_1,\dots,\lambda_n$ be its eigenvalues, which form a probability distribution as they are non-negative and sum to 1. The von Neumann entropy (VNE) of this density matrix is defined as

$\mathrm{VNE}(\rho) = \sum_{i=1}^n \lambda_i\log\frac{1}{\lambda_i}$.

The Vendi score (Friedman and Dieng, 2023) is defined as the exponential of the VNE of $\frac{1}{n}K$:

$\mathrm{Vendi}(x_1,\dots,x_n) = \exp\big(\mathrm{VNE}(\tfrac{1}{n}K)\big)$.  (4)

RKE and inverse-RKE. For a distribution $P$, Jalali et al. (2023) define the RKE mode-count as

$\mathrm{RKE}(P) = \frac{1}{\mathbb{E}_{X,X'\overset{\mathrm{iid}}{\sim}P}[k^2(X,X')]}$,

which gives the inverse-RKE definition $\mathrm{InvRKE}(P)=\mathbb{E}_{X,X'\sim P}[k(X,X')^2]$. For the empirical distribution $\hat P_n$ supported on $\{x_i\}_{i=1}^n$, the above definitions reduce to $\mathrm{InvRKE}(\hat P_n)=\frac{1}{n^2}\|K\|_F^2$ and $\mathrm{RKE}(\hat P_n)=n^2/\|K\|_F^2$.

3.3 Multi-armed bandits for online generative model selection

We consider $m$ pretrained generators $G_1,\dots,G_m$, with corresponding distributions $P_{G_1},\dots,P_{G_m}$. At each round $t=1,\dots,T$, the learner selects $I_t\in[m]$ and observes $X_t\sim P_{G_{I_t}}$. The objective is to maximize an evaluation score computed from the accumulated generated samples.
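The Gram-matrix quantities above are straightforward to evaluate directly. A sketch with a Gaussian kernel (which satisfies the normalization $k(x,x)=1$) on hypothetical two-cluster data, where both the Vendi score (equation 4) and the RKE mode-count should be close to 2:

```python
import numpy as np

def gaussian_gram(x, sigma=1.0):
    """Normalized Gaussian-kernel Gram matrix: k(x, x) = 1 by construction."""
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def vendi_score(K):
    """Vendi (Eq. 4): exp of the von Neumann entropy of K/n."""
    n = K.shape[0]
    lam = np.linalg.eigvalsh(K / n)
    lam = lam[lam > 1e-12]            # eigenvalues form a probability vector
    return float(np.exp(-np.sum(lam * np.log(lam))))

def rke(K):
    """Empirical RKE mode count: n^2 / ||K||_F^2; InvRKE is its reciprocal."""
    n = K.shape[0]
    return float(n ** 2 / np.sum(K ** 2))

rng = np.random.default_rng(2)
# Two tight, well-separated clusters -> roughly two "modes".
x = np.vstack([rng.normal(0.0, 0.01, size=(50, 2)),
               rng.normal(5.0, 0.01, size=(50, 2))])
K = gaussian_gram(x, sigma=0.5)
```

The separated clusters make the Gram matrix nearly block-constant, so both diversity scores count approximately two effective modes.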
This task can be naturally viewed as a stochastic multi-armed bandit (Slivkins, 2019; Lattimore and Szepesvári, 2020; Hu et al., 2025), where the online learner aims to identify the best model by successive queries to the models. Beyond selecting a single generator, we consider identifying a mixture of the models. Let $\Delta_m:=\{\alpha\in\mathbb{R}^m:\alpha_i\ge 0,\ \sum_{i=1}^m\alpha_i=1\}$ be the probability simplex of weights assigned to the $m$ models, and define the mixture distribution $P_\alpha=\sum_{i=1}^m\alpha_i P_{G_i}$. The optimization problem is therefore equation 1, where standard choices for Score can be negative-FD, negative-KD, Vendi, RKE, or a composite score combining multiple scores, especially combining the pure-diversity scores with quality Precision or Density scores.

To address the task, Rezaei et al. (2025) propose the Mixture-UCB algorithm that adapts the established upper-confidence-bound (UCB) approach to the mixture setting. Note that the proposed Mixture-UCB in (Rezaei et al., 2025) is designed exclusively for quadratic evaluation scores, including KD, RKE, and their linear combinations with Precision and Density fidelity scores. Furthermore, another natural algorithm that we will analyze in our work is Mixture-Greedy, which assigns no UCB bonus term and at each round optimizes the empirical objective over $\Delta_m$ and samples $I_t$ from the resulting mixture $\alpha_t$. We highlight that the Mixture-Greedy approach is directly applicable to all the evaluation scores discussed in this section.

4 Methodology: Mixture-Greedy & score-based simplex program

We consider $m$ pretrained generators $G_1,\dots,G_m$ with induced distributions $P_{G_i}$ over embedded samples in $\mathbb{R}^d$. At round $t$, Mixture-Greedy forms an empirical loss $\hat{\mathcal{L}}_{t-1}(\alpha)$ over the simplex $\Delta_m:=\{\alpha\in\mathbb{R}^m:\alpha\succeq 0,\ \mathbf{1}^\top\alpha=1\}$ and selects

$\alpha_t \in \arg\min_{\alpha\in\Delta_m}\hat{\mathcal{L}}_{t-1}(\alpha)$.  (5)

It then samples an arm $I_t\sim\alpha_t$ and draws $X_t\sim P_{G_{I_t}}$.
We focus on three objective families used throughout the paper: (i) Fréchet distance (FD) via Gaussian moment matching, (ii) negative log-Vendi (equivalently, negative von Neumann entropy), and (iii) convex quadratic scores (KD/MMD, inverse-RKE), possibly combined with a linear fidelity term (Precision/Density-type).

(i) FD via Gaussian moment matching. Fix the empirical real-data moments $(\hat\mu_{\mathrm{data}},\hat\Sigma_{\mathrm{data}})$ in the embedding space. Let $\hat\mu_i(t)\in\mathbb{R}^d$ and $\hat S_i(t)\in\mathbb{R}^{d\times d}$ denote, respectively, the empirical mean and uncentered second moment of samples from arm $i$ up to time $t$: $\hat S_i(t)=\frac{1}{n_i(t)}\sum_{r=1}^{n_i(t)} x_{i,r}x_{i,r}^\top\succeq 0$. For $\alpha\in\Delta_m$, define the mixture mean and covariance via the exact mixture identities

$\hat\mu_t(\alpha) := \sum_{i=1}^m \alpha_i\,\hat\mu_i(t)$,  (6)

$\hat\Sigma_t(\alpha) := \sum_{i=1}^m \alpha_i\,\hat S_i(t) - \hat\mu_t(\alpha)\hat\mu_t(\alpha)^\top$.  (7)

The FD-type loss is then the Gaussian Fréchet functional

$\hat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha) := \|\hat\mu_t(\alpha)-\hat\mu_{\mathrm{data}}\|_2^2 + \mathrm{Tr}(\hat\Sigma_t(\alpha)) + \mathrm{Tr}(\hat\Sigma_{\mathrm{data}}) - 2\,\mathrm{Tr}\big((\hat\Sigma_{\mathrm{data}}^{1/2}\hat\Sigma_t(\alpha)\hat\Sigma_{\mathrm{data}}^{1/2})^{1/2}\big)$.  (8)

(ii) Negative log-Vendi from pooled samples. Let $\{x_j\}_{j=1}^N$ be the pooled generated samples available at time $t$, where each $x_j$ came from arm $I_j\in[m]$, and let $K\in\mathbb{R}^{N\times N}$ be the normalized kernel Gram matrix $K_{rs}=k(x_r,x_s)$. For $\alpha\in\Delta_m$, define per-sample weights

$q_j(\alpha) := \frac{\alpha_{I_j}}{n_{I_j}(t)},\quad j\in[N]$,  (9)

so $q(\alpha)\succeq 0$ and $\sum_{j=1}^N q_j(\alpha)=1$. Consider the following unit-trace PSD matrix

$\rho_t(\alpha) := \mathrm{diag}(q(\alpha))^{1/2}\,K\,\mathrm{diag}(q(\alpha))^{1/2}$,  (10)

where we note that $\mathrm{Tr}(\rho_t(\alpha))=1$ holds, and we set the negative log-Vendi loss as

$\hat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha) := \mathrm{Tr}\big(\rho_t(\alpha)\log\rho_t(\alpha)\big)$.  (11)

Minimizing equation 11 is equivalent to maximizing $\mathrm{VNE}(\rho_t(\alpha))$ and hence the Vendi score of the generated samples.
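To make equations 6–8 concrete, here is a small numpy sketch of the FD-type loss as a function of the mixture weights, on a hypothetical two-arm example where the real data matches the 50/50 mixture exactly, so an interior mixture beats both pure arms (the arm moments and target below are made up for illustration):

```python
import numpy as np

def psd_sqrt(M):
    """Symmetric PSD matrix square root via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fd_loss(alpha, mus, seconds, mu_data, cov_data):
    """FD-type loss (Eq. 8) from per-arm means / uncentered second moments."""
    mu = np.tensordot(alpha, mus, axes=1)                          # Eq. 6
    cov = np.tensordot(alpha, seconds, axes=1) - np.outer(mu, mu)  # Eq. 7
    s = psd_sqrt(cov_data)
    cross = psd_sqrt(s @ cov @ s)
    return float(np.sum((mu - mu_data) ** 2)
                 + np.trace(cov + cov_data - 2.0 * cross))

# Hypothetical arms: unit-covariance Gaussians centered at +e1 and -e1.
mus = np.array([[1.0, 0.0], [-1.0, 0.0]])
seconds = np.array([np.eye(2) + np.outer(m, m) for m in mus])
mu_data = np.zeros(2)
cov_data = np.diag([2.0, 1.0])   # moments of the 50/50 mixture
loss_mix = fd_loss(np.array([0.5, 0.5]), mus, seconds, mu_data, cov_data)
loss_pure = fd_loss(np.array([1.0, 0.0]), mus, seconds, mu_data, cov_data)
```

The 50/50 mixture reproduces the target moments exactly (zero loss), while either pure arm incurs a strictly positive loss, illustrating why the optimizer of equation 5 can be interior.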
(iii) Convex quadratics and linear fidelity. Let $\hat K(t)\in\mathbb{R}^{m\times m}$ be symmetric PSD and let $\hat b(t)\in\mathbb{R}^m$ estimate a linear functional $b_i=\mathbb{E}[\psi_i(X)]$ with $\psi_i\in[0,1]$. We consider

$\hat{\mathcal{L}}^{\mathrm{quad}}_t(\alpha) := \alpha^\top\hat K(t)\,\alpha + w\,\alpha^\top\hat b(t),\quad w\ge 0$.  (12)

The following proposition shows that the above loss functions yield convex functions of the mixture weights, which can be optimized by convex optimization algorithms.

Proposition 1. For each fixed $t$, the update in equation 5 is a convex optimization problem over $\Delta_m$ for: (i) $\hat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha)$ in equation 8, (ii) $\hat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha)$ in equation 11, and (iii) $\hat{\mathcal{L}}^{\mathrm{quad}}_t(\alpha)$ in equation 12 whenever $\hat K(t)\succeq 0$.

Proof. We provide the proof in the Appendix. ∎

Exponentiated-gradient (EG) solver and practical instantiation. To implement the Mixture-Greedy update in equation 5 efficiently on the simplex, we apply exponentiated gradient (EG) descent, a mirror-descent method with KL geometry that maintains $\alpha\in\Delta_m$ by construction. Starting from a feasible iterate $\alpha^{(0)}$ (we use the uniform distribution $\frac{1}{m}\mathbf{1}$ in our experiments), every EG step updates

$\tilde\alpha^{(s+1)}_i = \alpha^{(s)}_i\exp\big(-\eta\,g^{(s)}_i\big),\qquad \alpha^{(s+1)} = \frac{\tilde\alpha^{(s+1)}}{\|\tilde\alpha^{(s+1)}\|_1}$,  (13)

where $g^{(s)}=\nabla_\alpha\hat{\mathcal{L}}_{t-1}(\alpha^{(s)})$ and $\eta>0$ is a stepsize. In our setting, the objective families in this section lead to smooth convex losses over $\Delta_m$ (Proposition 1); therefore, a fixed number of EG steps per round is sufficient. Algorithm 1 summarizes the main steps.

Score-specific gradients. For quadratic objectives (equation 12), the gradient is explicit: $\nabla_\alpha\hat{\mathcal{L}}^{\mathrm{quad}}_t(\alpha) = 2\hat K(t)\,\alpha + w\,\hat b(t)$. For FD (equation 8) and log-Vendi (equation 11), gradients require matrix-function evaluations (e.g., a matrix square-root/inverse-square-root for FD, and a matrix logarithm for log-Vendi). As we discuss, one can compute these via eigendecomposition.
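A minimal implementation of the EG update (equation 13), applied to the quadratic loss (equation 12) whose gradient $2\hat K(t)\alpha + w\,\hat b(t)$ is explicit. The kernel matrix below is a made-up example, not an estimate from real generators:

```python
import numpy as np

def eg_minimize(grad, m, eta=0.1, steps=500):
    """Exponentiated gradient on the simplex (Eq. 13)."""
    alpha = np.full(m, 1.0 / m)                      # uniform initialization
    for _ in range(steps):
        alpha = alpha * np.exp(-eta * grad(alpha))   # multiplicative step
        alpha /= alpha.sum()                         # renormalize
    return alpha

# Quadratic loss (Eq. 12) with w = 0: L(alpha) = alpha^T K alpha.
K_hat = np.array([[1.0, 0.2],
                  [0.2, 2.0]])
grad = lambda a: 2.0 * K_hat @ a                     # explicit gradient
alpha_star = eg_minimize(grad, m=2)
# First-order optimality on the simplex equalizes (K alpha)_1 = (K alpha)_2,
# giving the interior optimum alpha_star = (9/13, 4/13).
```

Because the update is multiplicative followed by normalization, every iterate stays strictly inside the simplex, matching the feasibility-by-construction property noted in the text.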
To reduce cost for log-Vendi when the pooled sample size is significantly large, we also use a finite-dimensional feature representation (e.g., random Fourier features for shift-invariant kernels), replacing the $N\times N$ kernel-matrix logarithm by a $D\times D$ feature-covariance logarithm when $D\ll N$. The full derivations are provided in Appendix A.

Algorithm 1: Mixture-Greedy via Exponentiated Gradient
1: Input: models $\{G_i\}_{i=1}^m$, horizon $T$, warm-start size $M\ge 1$, EG stepsize $\eta>0$, EG steps $S\ge 1$
2: Initialize counts $n_i(0)=M$ for all $i\in[m]$ and draw $M$ warm-start samples from each model to initialize statistics
3: for $t=1,2,\dots,T$ do
4:  Form the empirical loss $\hat{\mathcal{L}}_{t-1}(\alpha)$ from samples up to round $t-1$
5:  Initialize $\alpha^{(0)}\leftarrow\alpha_{t-1}$ (warm start) or $\alpha^{(0)}\leftarrow\frac{1}{m}\mathbf{1}$
6:  for $s=0,1,\dots,S-1$ do
7:   Compute $g^{(s)}\leftarrow\nabla_\alpha\hat{\mathcal{L}}_{t-1}(\alpha^{(s)})$
8:   $\tilde\alpha^{(s+1)}_i\leftarrow\alpha^{(s)}_i\exp(-\eta\,g^{(s)}_i)$ for all $i\in[m]$
9:   $\alpha^{(s+1)}\leftarrow\tilde\alpha^{(s+1)}/\sum_{j=1}^m\tilde\alpha^{(s+1)}_j$
10:  end for
11:  Set $\alpha_t\leftarrow\alpha^{(S)}$, sample $I_t\sim\alpha_t$, draw $X_t\sim P_{G_{I_t}}$
12:  Update the count $n_{I_t}(t)\leftarrow n_{I_t}(t-1)+1$ and the empirical statistics needed for $\hat{\mathcal{L}}_t$
13: end for

5 Implicit exploration and regret for Diversity Scores

We analyze Mixture-Greedy for two diversity-aware objectives: (i) negative log-Vendi (negative von Neumann entropy of the kernel matrix), and (ii) the Fréchet distance functional under mean and covariance matching. Our goal is to characterize structural conditions under which Mixture-Greedy, without any explicit UCB bonus, achieves sublinear regret. Regarding the bandit protocol and regret definition, we consider a warm start of $M\ge 1$ samples per arm (i.e., $n_i(0)=M$). At each round $t=1,\dots,T$, Mixture-Greedy computes

$\alpha_t\in\arg\min_{\alpha\in\Delta_m}\hat F_{t-1}(\alpha),\qquad I_t\sim\alpha_t$,  (14)

where $\hat F_{t-1}$ is the plug-in empirical objective formed from samples up to time $t-1$.
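Putting the pieces together, the steps of Algorithm 1 can be sketched end-to-end for the quadratic inverse-RKE objective, with three hypothetical well-separated 1-D Gaussian arms standing in for generators. Because separated arms have near-orthogonal kernel supports, the greedy mixture stays interior and every arm keeps being sampled, which is the implicit-exploration effect analyzed below:

```python
import numpy as np

rng = np.random.default_rng(3)
means = [0.0, 3.0, 6.0]                       # hypothetical 1-D "generators"
draw = lambda i, n=1: rng.normal(means[i], 0.1, size=n)
kern = lambda a, b: np.exp(-np.subtract.outer(a, b) ** 2 / 2.0)

def eg_minimize(grad, m, eta=0.1, steps=300):
    """Inner EG loop of Algorithm 1 (lines 6-10)."""
    alpha = np.full(m, 1.0 / m)
    for _ in range(steps):
        alpha = alpha * np.exp(-eta * grad(alpha))
        alpha /= alpha.sum()
    return alpha

m, M_warm, T = 3, 20, 300
samples = [list(draw(i, M_warm)) for i in range(m)]   # warm start (line 2)
counts = np.full(m, M_warm)
for t in range(T):
    # Plug-in quadratic objective: Khat[i, j] estimates E[k(X_i, X_j)], so
    # alpha^T Khat alpha estimates the InvRKE of the mixture (lines 4-10).
    Khat = np.array([[kern(np.asarray(samples[i]),
                           np.asarray(samples[j])).mean()
                      for j in range(m)] for i in range(m)])
    alpha_t = eg_minimize(lambda a: 2.0 * Khat @ a, m)
    i_t = rng.choice(m, p=alpha_t)                    # I_t ~ alpha_t (line 11)
    samples[i_t].append(float(draw(i_t)[0]))
    counts[i_t] += 1                                  # line 12
```

No optimism bonus appears anywhere, yet the arm counts grow roughly linearly for all three arms, because the empirical minimizer of the diversity objective remains interior.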
We measure regret against the best fixed mixture:

$\mathrm{Reg}_T := \sum_{t=1}^T\big[F(\alpha_t) - \min_{\alpha\in\Delta_m}F(\alpha)\big]$.  (15)

5.1 Negative log-Vendi induces implicit exploration

Let $\phi:\mathcal{X}\to\mathbb{R}^d$ be a finite-dimensional feature map. Define $S_i:=\mathbb{E}[\phi(X)\phi(X)^\top]$ for $X\sim P_{G_i}$ and $S_\alpha:=\sum_{i=1}^m\alpha_i S_i$. The negative log-Vendi objective is $F_{\mathrm{NLV}}(\alpha):=\mathrm{Tr}(S_\alpha\log S_\alpha)$.

Assumption 1 (Normalized kernel). $\|\phi(x)\|_2=1$ for all $x$.

Assumption 2 (Population innovation for log-Vendi). There exist constants $\nu_0\in(0,1]$ and $\varepsilon_0\in[0,\nu_0/8)$ such that for every $i\in[m]$ there exists a unit vector $v_i$ satisfying $v_i^\top S_i v_i\ge\nu_0$ and $\sum_{j\ne i} v_i^\top S_j v_i\le\varepsilon_0$.

Under these structural conditions, the entropy geometry enforces a uniform interiority property for empirical minimizers. This yields linear sampling of all arms and enables time-uniform concentration.

Theorem 1 (Mixture-Greedy regret for negative log-Vendi). Fix $T\ge 1$ and $\delta\in(0,1)$. Under Assumptions 1–2, there exists a warm-start size $M=M(d,m,\nu_0,\varepsilon_0,T,\delta)$ such that, with probability at least $1-\delta$, the following holds, where $C_{\mathrm{NLV}}$ depends only on $(d,m,\nu_0,\varepsilon_0)$:

$\mathrm{Reg}_T \le C_{\mathrm{NLV}}\Big(1+\log\frac{m(T+1)}{\delta}\Big)\sqrt{T}\,(1+\log T)$.

Proof. We provide the proof in the Appendix. ∎

5.2 Fréchet Distance: regret under a population interiority margin

The Fréchet Distance functional is smooth at the simplex boundary and may admit boundary minimizers. We therefore state a population interiority condition ensuring that optimal mixtures lie in the interior. Let arm $i$ produce i.i.d. embeddings $Z\in\mathbb{R}^d$ with mean $\mu_i$ and covariance $\Sigma_i$.
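The interiority mechanism behind Theorem 1 can be seen in a two-arm toy example with orthogonal feature supports (a hypothetical setup consistent with Assumption 1, not taken from the paper): the entropy objective strictly prefers the interior mixture over any pure arm.

```python
import numpy as np

def neg_log_vendi(alpha, S_list):
    """F_NLV(alpha) = Tr(S_alpha log S_alpha), S_alpha = sum_i alpha_i S_i."""
    S = np.tensordot(alpha, S_list, axes=1)
    lam = np.linalg.eigvalsh(S)
    lam = lam[lam > 1e-12]             # 0 log 0 = 0 convention
    return float(np.sum(lam * np.log(lam)))

# Unit-norm features (Assumption 1) concentrated on orthogonal axes:
# arm 1 emits e1, arm 2 emits e2, so S_1 = diag(1,0) and S_2 = diag(0,1).
S_list = np.array([np.diag([1.0, 0.0]), np.diag([0.0, 1.0])])
f_uniform = neg_log_vendi(np.array([0.5, 0.5]), S_list)
f_pure = neg_log_vendi(np.array([1.0, 0.0]), S_list)
# Here S_alpha = diag(alpha), so F_NLV reduces to the negative Shannon
# entropy of alpha: the minimizer is the uniform (interior) mixture.
```

Because the minimizer sits strictly inside the simplex, a greedy minimizer of the empirical objective keeps assigning positive weight, and hence samples, to every arm.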
For $\alpha\in\Delta_m$ define

$\mu_\alpha = \sum_{i=1}^m\alpha_i\mu_i,\qquad \Sigma_\alpha = \sum_{i=1}^m\alpha_i\big(\Sigma_i + (\mu_i-\mu_\alpha)(\mu_i-\mu_\alpha)^\top\big)$.

Fix $(\mu_0,\Sigma_0)$ with $\Sigma_0\succeq\lambda_0 I$ and define

$F_{\mathrm{FD}}(\alpha) := \|\mu_\alpha-\mu_0\|_2^2 + \mathrm{Tr}\big(\Sigma_\alpha+\Sigma_0-2(\Sigma_0^{1/2}\Sigma_\alpha\Sigma_0^{1/2})^{1/2}\big)$.

Assumption 3 (Bounded embeddings). There exists $B<\infty$ such that $\|Z\|_2\le B$ almost surely for every arm.

Assumption 4 (Uniform positive definiteness). There exists $\nu>0$ such that $\Sigma_i\succeq\nu I$ for all $i\in[m]$.

Assumption 5 (Population interiority margin). There exist $\gamma_0\in(0,1/m]$ and $\Delta_0>0$ such that

$\inf_{\alpha\in\Delta_m:\,\min_i\alpha_i\le\gamma_0} F_{\mathrm{FD}}(\alpha) \ \ge\ \min_{\alpha\in\Delta_m}F_{\mathrm{FD}}(\alpha) + \Delta_0$.

Theorem 2 (Mixture-Greedy regret for Fréchet Distance). Fix $T\ge 1$ and $\delta\in(0,1)$. Under Assumptions 3–5, there exists $M=M(B,\lambda_0,\nu,m,d,\gamma_0,\Delta_0,T,\delta)$ such that, with probability at least $1-\delta$, the following holds, where $C_{\mathrm{FD}}$ depends only on $(B,\lambda_0,\nu,m,d,\gamma_0)$:

$\mathrm{Reg}_T \le C_{\mathrm{FD}}\Big(1+\log\frac{m(T+1)}{\delta}\Big)\sqrt{T}$.

Proof. We provide the proof in the Appendix. ∎

Figure 2: Convergence of Mixture Greedy to the Mixture Oracle in terms of Fréchet Distance (FD) across image selection steps on FFHQ (left) and LSUN-Bedroom (right).

Remark 1. Our analysis extends to the RKE diversity objective, $\mathrm{RKE}(\alpha)=1/\mathrm{Tr}(S_\alpha^2)$, optimized via the equivalent convex functional $F_{\mathrm{RKE}}(\alpha)=\mathrm{Tr}(S_\alpha^2)$. Under a standard population margin condition analogous to Assumption 5, Mixture-Greedy achieves an $\tilde{O}(\sqrt{T})$ regret bound. The same rate holds when the objective is augmented with linear fidelity terms. The statements and proofs are in Appendix D.

5.3 Including a linear fidelity term

The fidelity scores for generative models, including Precision (Kynkänniemi et al., 2019) and Density (Naeem et al., 2020), are linear functionals of the generator distribution.
We model this using bounded functions $\psi_i:\mathcal{X}\to[0,1]$:

$G(\alpha) := \sum_{i=1}^m\alpha_i\theta_i,\qquad \theta_i := \mathbb{E}_{X\sim P_{G_i}}[\psi_i(X)]\in[0,1]$,

and for a base diversity-aware objective $F(\alpha)$, we consider $H(\alpha):=F(\alpha)+w\,G(\alpha)$ for a given coefficient $w>0$. Given this objective, Mixture-Greedy minimizes the plug-in empirical objective $\hat H_{t-1}(\alpha)=\hat F_{t-1}(\alpha)+w\,\hat G_{t-1}(\alpha)$.

Negative log-Vendi. The linear increment along any simplex direction is uniformly bounded; thus the entropy-induced interiority mechanism remains intact, with the quantitative floor adjusted by a controlled $w$-dependent term.

Corollary 1 (Mixture-Greedy regret for negative log-Vendi + linear term). Assume the conditions of Theorem 1 and $0\le\psi_i\le 1$. For $M$ sufficiently large (see Appendix C), with probability at least $1-\delta$,

$\sum_{t=1}^T\big(H(\alpha_t)-\min_{\alpha\in\Delta_m}H(\alpha)\big) \le C_{\mathrm{NLV},w}\sqrt{T\log\frac{m(T+1)}{\delta}}$,

where $C_{\mathrm{NLV},w}$ depends only on $(d,m,\nu_0,\varepsilon_0,w)$.

Fréchet Distance. Since $G(\alpha)\in[0,1]$, adding $wG$ perturbs any interiority margin by at most $w$; hence the Fréchet Distance guarantee extends under the corresponding margin condition for $H$.

Corollary 2 (Mixture-Greedy regret for Fréchet Distance + linear term). Assume the conditions of Theorem 2 and $0\le\psi_i\le 1$. If $H(\alpha)=F_{\mathrm{FD}}(\alpha)+wG(\alpha)$ satisfies the population interiority margin (Assumption 5 with $F_{\mathrm{FD}}$ replaced by $H$), then with probability at least $1-\delta$,

$\sum_{t=1}^T\big(H(\alpha_t)-\min_{\alpha\in\Delta_m}H(\alpha)\big) \le C_{\mathrm{FD},w}\sqrt{T\log\frac{m(T+1)}{\delta}}$,

where $C_{\mathrm{FD},w}$ depends only on $(B,\lambda_0,\nu,m,d,\gamma_0,w)$.

6 Numerical Results

Figure 3: Mixture Greedy, minimizing Fréchet Distance using exponentiated gradient descent on FFHQ generated samples.

We assess the performance of the proposed approach in both real-data and synthetic environments. For the real-data experiments, we adopt benchmark datasets standard in (Stein et al., 2023).
Complementarily, we constructed synthetic scenarios with controlled data-generating mechanisms; please refer to Appendix E.1 for details and results. For image feature extraction, we use DINOv2-ViT-L/14 (Oquab et al., 2023), following the study by Stein et al. (2023). For text encoding, we use SBERT (Reimers and Gurevych, 2019).
6.1 Real-World Settings
In the real-data setting, we evaluate our method on samples from pretrained generative models provided by the evaluation benchmark of Stein et al. (2023). We consider three datasets: FFHQ, LSUN-Bedroom, and ImageNet. For each dataset, we treat the pretrained generative models as arms in our framework; all models are pre-trained on the corresponding dataset as provided by Stein et al. (2023).
FFHQ. For the FFHQ dataset (Karras et al., 2019), we considered five distinct generative models: LDM (Rombach et al., 2022), StyleGAN-XL (Sauer et al., 2022), Efficient-VDVAE (Hazami et al., 2022), InsGen (Yang et al., 2021), and StyleNAT (Walton et al., 2022).
LSUN-Bedroom. For the LSUN-Bedroom dataset, we used generated samples from four models: StyleGAN (Karras et al., 2019), Projected GAN (Sauer and Geiger, 2021), iDDPM (Nichol and Dhariwal, 2021), and Unleashing Transformers (Bond-Taylor et al., 2022).
ImageNet. For ImageNet, we evaluate four pretrained generative models: DiT-XL-2-guided (Peebles and Xie, 2023), LDM (Rombach et al., 2022), RQ-Transformer (Lee et al., 2022), and StyleGAN-XL (Sauer et al., 2022).
Mixture Improves Fréchet Distance. Figure 2 shows the evolution of FD across selection rounds on two datasets. Alongside the online Mixture-Greedy algorithm, we report two oracle baselines: the One-Arm Oracle, which samples exclusively from the single generator with the lowest standalone FD, and the Mixture Oracle, which assumes knowledge of the optimal mixture weights $\alpha^\star$ and provides the corresponding lower bound on FD.
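For reference, the FD values reported in these experiments follow the standard closed form for two Gaussians fitted to feature embeddings. A minimal sketch using a symmetric-eigendecomposition matrix square root (the helper names `psd_sqrt` and `frechet_distance` are illustrative, not the paper's code):

```python
import numpy as np

def psd_sqrt(a):
    """Symmetric PSD square root via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    w = np.clip(w, 0.0, None)  # guard against tiny negative eigenvalues
    return (v * np.sqrt(w)) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """FD(N(mu1,sigma1), N(mu2,sigma2)) =
    ||mu1-mu2||^2 + Tr(sigma1 + sigma2 - 2 (sigma2^{1/2} sigma1 sigma2^{1/2})^{1/2})."""
    s2h = psd_sqrt(sigma2)
    cross = psd_sqrt(s2h @ sigma1 @ s2h)
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2.0 * cross))
```

Evaluating this at the mixture moments $(\mu_\alpha, \Sigma_\alpha)$ against the data moments $(\mu_0, \Sigma_0)$ gives $F_{\mathrm{FD}}(\alpha)$ as defined in Section 5.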
Across both datasets, Mixture-Greedy rapidly decreases FD and converges to the Mixture Oracle within a few rounds, consistently outperforming the One-Arm Oracle. This highlights the advantage of optimizing over generator mixtures rather than selecting a single best generator. See Figure 4 in the Appendix for ImageNet results. Figure 3 illustrates the FD values obtained from the individual generative models (arms) trained on FFHQ, as well as the FD achieved by their optimized mixture. While the best single model, StyleGAN-XL, attains an FD of 164.13, the optimized mixture achieves a substantially lower FD of 155.13, outperforming every individual arm. The mixture optimization is performed using exponentiated gradient descent.
7 Conclusion
We studied mixture-based online selection of generative models, and our analysis indicates that, for a broad class of diversity-aware objectives, explicit optimism is not always necessary. Conceptually, our results highlight that exploration can arise intrinsically from the structure of the objective itself, with no need for externally imposed confidence bonuses. In regimes where diversity metrics reward interior mixtures or penalize degeneracy, the optimization landscape naturally prevents collapse onto a single arm and promotes sufficient sampling of all generators. Our numerical experiments support this message: across multiple datasets and metrics, Mixture-Greedy converges faster than the Mixture-UCB baseline. Moreover, the Mixture-Greedy strategy achieves sublinear regret under transparent structural conditions. We note that our analysis is conducted in a specific stochastic setting with a fixed pool of generators and objective-specific regularity assumptions, and it focuses on regret with respect to the chosen evaluation metric rather than broader notions of downstream utility. Extending these guarantees to more dynamic and nonstationary settings remains a relevant direction for future work.
References
Aaron Grattafiori et al. (2024) The llama 3 herd of models. External Links: 2407.21783 Cited by: §E.1. A. Alaa, B. van Breugel, E. S. Saveliev, and M. van der Schaar (2022) How faithful is your synthetic data? sample-level metrics for evaluating and auditing generative models. In Proceedings of the 39th International Conference on Machine Learning (ICML), Proceedings of Machine Learning Research, Vol. 162, p. 290–306. External Links: Link Cited by: §2. K. M. R. Audenaert (2007) A sharp fannes-type inequality for the von neumann entropy. Journal of Physics A: Mathematical and Theoretical 40 (28), p. 8127–8136. External Links: Document Cited by: §B.2. P. Auer, N. Cesa-Bianchi, and P. Fischer (2002) Finite-time analysis of the multiarmed bandit problem. Machine Learning 47 (2–3), p. 235–256. Cited by: §1, §1, §2. H. Bastani, M. Bayati, and K. Khosravi (2021) Mostly exploration-free algorithms for contextual bandits. Management Science. External Links: Document Cited by: §2. M. Bayati, N. Hamidi, R. Johari, and K. Khosravi (2020) Unreasonable effectiveness of greedy algorithms in multi-armed bandit with many arms. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2. M. Binkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying MMD GANs. In International Conference on Learning Representations (ICLR), Cited by: §1, §1, §2. S. Bond-Taylor, A. Leach, Y. Long, and C. G. Willcocks (2022) Unleashing transformers for image generation. In International Conference on Learning Representations (ICLR), Cited by: §6.1. A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In International Conference on Learning Representations (ICLR), Cited by: §1. J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu, and Z. Li (2023) PixArt-α: fast training of diffusion transformer for photorealistic text-to-image synthesis. arXiv preprint arXiv:2310.00426. 
External Links: Link Cited by: §E.1. L. Chen and S. K. Ghosh (2024) Fast model selection and hyperparameter tuning for generative models. Entropy 26 (2), p. 150. External Links: Document, Link Cited by: §2. D. Friedman and A. B. Dieng (2023) The vendi score: a diversity evaluation metric for machine learning. Transactions on Machine Learning Research. Cited by: §E.1, §1, §2, §3.2. Gemma-Team (2025) Gemma 2 technical report. External Links: 2503.19786 Cited by: §E.1. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In Advances in Neural Information Processing Systems, Cited by: §1. A. Gretton, K. M. Borgwardt, M. J. Rasch, B. Schölkopf, and A. Smola (2012) A kernel two-sample test. The journal of machine learning research 13 (1), p. 723–773. Cited by: §2. J. Han, H. Choi, Y. Choi, J. Kim, J. Ha, and J. Choi (2023) Rarity score: a new metric to evaluate the uncommonness of synthesized images. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2. L. Hazami, P. Comon, et al. (2022) Efficient-vdvae: less is more. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.1. M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) GANs trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2. J. Ho, A. Jain, and P. Abbeel (2020) Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, Cited by: §1. X. Hu, H.F. Leung, and F. Farnia (2025) A multi-armed bandit approach to online selection and evaluation of generative models. In Proceedings of the 28th International Conference on Artificial Intelligence and Statistics (AISTATS), Proceedings of Machine Learning Research. External Links: Link Cited by: §1, §1, §2, §3.3. M. Jalali, C. T. Li, and F. 
Farnia (2023) An information-theoretic evaluation of generative models in learning multi-modal distributions. Advances in Neural Information Processing Systems 36, p. 9931–9943. Cited by: §1, §2, §3.2. M. Jiralerspong, A. J. Bose, I. Gemp, C. Qin, Y. Bachrach, and G. Gidel (2023) Feature likelihood divergence: evaluating the generalization of generative models using samples. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2. S. Kannan, J. Morgenstern, A. Roth, B. Waggoner, and Z. S. Wu (2018) A smoothed analysis of the greedy algorithm for the linear contextual bandit problem. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2. T. Karras, S. Laine, and T. Aila (2019) A style-based generator architecture for generative adversarial networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, p. 4401–4410. Cited by: §1, §6.1, §6.1. D. P. Kingma and M. Welling (2013) Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114. Cited by: §1. T. Kynkänniemi, T. Karras, M. Aittala, T. Aila, and J. Lehtinen (2023) The role of imagenet classes in fréchet inception distance. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: §2. T. Kynkänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila (2019) Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §1, §2, §5.3. T. Lattimore and C. Szepesvári (2020) Bandit algorithms. Cambridge University Press. Cited by: §1, §2, §3.3. D. Lee, J. Kim, and J. Kim (2022) Autoregressive image generation using residual quantization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Cited by: §6.1. M. Naeem, S. J. Oh, Y. Uh, Y. Choi, and J. Yoo (2020) Reliable fidelity and diversity metrics for generative models. 
In International Conference on Machine Learning (ICML), Cited by: §2, §5.3. Q. Nguyen and A. B. Dieng (2024) Quality-weighted vendi scores and their application to diverse experimental design. In Proceedings of the 41st International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 235, p. 37667–37682. Cited by: §2. A. Q. Nichol and P. Dhariwal (2021) Improved denoising diffusion probabilistic models. In Proceedings of the International Conference on Machine Learning (ICML), Cited by: §6.1. M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V. Sharma, G. Synnaeve, H. Xu, H. Jegou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski (2024) DINOv2: learning robust visual features without supervision. External Links: 2304.07193 Cited by: §E.2.3. M. Oquab, T. Darcet, T. Moutakanni, H. Vo, M. Szafraniec, V. Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, et al. (2023) Dinov2: learning robust visual features without supervision. arXiv preprint arXiv:2304.07193. Cited by: §2, §6. A. Ospanov and F. Farnia (2025) Do vendi scores converge with finite samples? truncated vendi score for finite-sample convergence guarantees. In Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence, Proceedings of Machine Learning Research, Vol. 286, p. 3272–3299. Cited by: §2. A. P. Pasarkar and A. B. Dieng (2024) Cousins of the Vendi score: a family of similarity-based diversity metrics for science and machine learning. In Proceedings of The 27th International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, Vol. 238, p. 3808–3816. Cited by: §2. W. Peebles and S. Xie (2023) Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), Cited by: §6.1. D. Podell, Z. 
English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach (2023a) SDXL: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952. External Links: Link Cited by: §E.1. D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. Müller, J. Penna, and R. Rombach (2023b) SDXL: improving latent diffusion models for high-resolution image synthesis. External Links: 2307.01952 Cited by: §E.1. Qwen-Team (2024) Qwen2 technical report. External Links: 2407.10671, Link Cited by: §E.1. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever (2021a) Learning transferable visual models from natural language supervision. External Links: 2103.00020 Cited by: §E.2.3. A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021b) Learning transferable visual models from natural language supervision. In International conference on machine learning, p. 8748–8763. Cited by: §2. A. Razzhigaev, V. Arkhipkin, et al. (2023) Kandinsky: an improved text-to-image synthesis with image prior and latent diffusion. arXiv preprint arXiv:2310.03502. Cited by: §E.1. N. Reimers and I. Gurevych (2019) Sentence-bert: sentence embeddings using siamese bert-networks. External Links: 1908.10084, Link Cited by: §6. P. Rezaei, F. Farnia, and C. T. Li (2025) Be more diverse than the most diverse: optimal mixtures of generative models via mixture-ucb bandit algorithms. In International Conference on Learning Representations (ICLR), External Links: Link Cited by: Appendix E, Figure 1, Figure 1, §1, §1, §2, §3.3. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer (2022) High-resolution image synthesis with latent diffusion models. In IEEE Conference on Computer Vision and Pattern Recognition, Cited by: §1, §6.1, §6.1. D. J. Russo, B. Van Roy, A. Kazerouni, I. Osband, and Z. 
Wen (2018) A tutorial on thompson sampling. Foundations and Trends® in Machine Learning 11 (1), p. 1–96. External Links: Document Cited by: §1. M. S. Sajjadi, O. Bachem, M. Lucic, O. Bousquet, and S. Gelly (2018) Assessing generative models via precision and recall. Advances in neural information processing systems 31. Cited by: §1, §2. T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §2. A. Sauer and A. Geiger (2021) Projected gans converge faster. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.1. A. Sauer, K. Schwarz, and A. Geiger (2022) StyleGAN-xl: scaling stylegan to large diverse datasets. In Proceedings of the ACM SIGGRAPH Conference, Cited by: §6.1, §6.1. A. Slivkins (2019) Introduction to multi-armed bandits. Foundations and Trends® in Machine Learning 12 (1-2), p. 1–286. Cited by: §1, §3.3. J. Song, C. Meng, and S. Ermon (2021) Denoising diffusion implicit models. In International Conference on Learning Representations, Cited by: §1. G. Stein, J. C. Cresswell, R. Hosseinzadeh, Y. Sui, B. L. Ross, V. Villecroze, Z. Liu, A. L. Caterini, J. E. T. Taylor, and G. Loaiza-Ganem (2023) Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: Figure 1, Figure 1, §2, §6.1, §6. D. J. Sutherland, H. Strathmann, M. Arbel, and A. Gretton (2018) Efficient and principled score estimation with nyström kernel exponential families. In International Conference on Artificial Intelligence and Statistics, p. 652–660. Cited by: Lemma 12, Lemma 3. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna (2016) Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, p. 2818–2826. Cited by: §E.2.3. A. 
Vahdat and J. Kautz (2020) NVAE: a deep hierarchical variational autoencoder. Advances in Neural Information Processing Systems 33, p. 19667–19679. Cited by: §1. S. Walton et al. (2022) StyleNAT: giving each head a new perspective. In Proceedings of the Asian Conference on Computer Vision (ACCV), Cited by: §6.1. T. Yang et al. (2021) Instance-conditioned gan. In Advances in Neural Information Processing Systems (NeurIPS), Cited by: §6.1.
Appendix A Proofs and gradient computations for Section 4
A.1 Proof of Proposition 1
In the following, we prove convexity for each of the three objective families.
Part (i): FD convexity in $\alpha$. Fix $(\widehat{\mu}_{\mathrm{data}}, \widehat{\Sigma}_{\mathrm{data}})$ and the per-arm empirical moments $\{\widehat{\mu}_i(t), \widehat{S}_i(t)\}_{i=1}^m$. Define $\widehat{\mu}(\alpha) = \widehat{\mu}_t(\alpha)$ and $\widehat{\Sigma}(\alpha) = \widehat{\Sigma}_t(\alpha)$ as in equations 6–7. Let $\Sigma_0 := \widehat{\Sigma}_{\mathrm{data}} \succeq 0$ and $\mu_0 := \widehat{\mu}_{\mathrm{data}}$. We rewrite the FD loss (equation 8) as
$$\widehat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha) = \|\widehat{\mu}(\alpha) - \mu_0\|_2^2 + \mathrm{Tr}(\widehat{\Sigma}(\alpha)) + \mathrm{Tr}(\Sigma_0) - 2\,\mathrm{Tr}\bigl((\Sigma_0^{1/2}\widehat{\Sigma}(\alpha)\Sigma_0^{1/2})^{1/2}\bigr). \tag{16}$$
First, we note that the mean-plus-trace part is affine in $\alpha$. Using $\widehat{\Sigma}(\alpha) = \sum_{i=1}^m \alpha_i \widehat{S}_i - \widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top$ and $\mathrm{Tr}(\widehat{\mu}\widehat{\mu}^\top) = \|\widehat{\mu}\|_2^2$,
$$\|\widehat{\mu}(\alpha) - \mu_0\|_2^2 + \mathrm{Tr}(\widehat{\Sigma}(\alpha)) = \|\mu_0\|_2^2 - 2\mu_0^\top\widehat{\mu}(\alpha) + \|\widehat{\mu}(\alpha)\|_2^2 + \mathrm{Tr}\Bigl(\sum_{i=1}^m \alpha_i \widehat{S}_i\Bigr) - \mathrm{Tr}(\widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top) = \|\mu_0\|_2^2 - 2\mu_0^\top\widehat{\mu}(\alpha) + \mathrm{Tr}\Bigl(\sum_{i=1}^m \alpha_i \widehat{S}_i\Bigr),$$
where the $\|\widehat{\mu}(\alpha)\|_2^2$ terms cancel exactly. Since $\widehat{\mu}(\alpha)$ and $\sum_i \alpha_i \widehat{S}_i$ are affine in $\alpha$, this entire expression is affine in $\alpha$.
Next, we show that $\widehat{\Sigma}(\alpha)$ is concave in the Loewner order. Let $\alpha, \beta \in \Delta_m$ and $\theta \in [0,1]$. Since $\sum_i \alpha_i \widehat{S}_i$ is affine, it suffices to show that $\widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top$ is convex in the Loewner order.
Because $\widehat{\mu}(\cdot)$ is affine, we have
$$\theta\,\widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top + (1-\theta)\,\widehat{\mu}(\beta)\widehat{\mu}(\beta)^\top - \widehat{\mu}(\theta\alpha + (1-\theta)\beta)\,\widehat{\mu}(\theta\alpha + (1-\theta)\beta)^\top = \theta(1-\theta)\bigl(\widehat{\mu}(\alpha) - \widehat{\mu}(\beta)\bigr)\bigl(\widehat{\mu}(\alpha) - \widehat{\mu}(\beta)\bigr)^\top \succeq 0.$$
Thus $\widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top$ is Loewner-convex, and therefore $\widehat{\Sigma}(\alpha) = \sum_i \alpha_i \widehat{S}_i - \widehat{\mu}(\alpha)\widehat{\mu}(\alpha)^\top$ is Loewner-concave:
$$\widehat{\Sigma}(\theta\alpha + (1-\theta)\beta) \;\succeq\; \theta\,\widehat{\Sigma}(\alpha) + (1-\theta)\,\widehat{\Sigma}(\beta).$$
Then, we show the concavity of the square-root trace term. Define $Z(\alpha) := \Sigma_0^{1/2}\widehat{\Sigma}(\alpha)\Sigma_0^{1/2}$. Pre- and post-multiplication by a fixed PSD matrix preserves Loewner inequalities, hence $Z(\alpha)$ is Loewner-concave. The matrix square-root map $M \mapsto M^{1/2}$ is operator concave and operator monotone on the PSD cone. Therefore, for $\theta \in [0,1]$,
$$Z(\theta\alpha + (1-\theta)\beta)^{1/2} \;\succeq\; \bigl(\theta Z(\alpha) + (1-\theta)Z(\beta)\bigr)^{1/2} \;\succeq\; \theta Z(\alpha)^{1/2} + (1-\theta)Z(\beta)^{1/2}.$$
Taking traces (the trace is a positive linear functional on PSD matrices) yields that $H(\alpha) := \mathrm{Tr}(Z(\alpha)^{1/2})$ is concave in $\alpha$. Finally, by equation 16, $\widehat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha)$ is, up to an affine term in $\alpha$, equal to $-2H(\alpha)$, and is therefore convex on $\Delta_m$.
Part (ii): log-Vendi convexity in $\alpha$ (kernel-matrix form). We fix the pooled samples $\{x_j\}_{j=1}^N$ and the Gram matrix $K_{rs} = k(x_r, x_s)$, where $k$ is PSD and normalized, so $K \succeq 0$ and $K_{jj} = 1$ for every $j$. For a weight vector $q \in \mathbb{R}^N$ with $q \succeq 0$ and $\sum_j q_j = 1$, we define
$$\rho(q) := \mathrm{diag}(q)^{1/2}\,K\,\mathrm{diag}(q)^{1/2} \succeq 0, \qquad \mathrm{Tr}(\rho(q)) = \sum_{j=1}^N q_j K_{jj} = 1.$$
Note that the log-Vendi loss is defined as $f(q) := \mathrm{Tr}(\rho(q)\log\rho(q))$. Let $\mathcal{H}$ be a Hilbert space and $\phi : \mathcal{X} \to \mathcal{H}$ such that $k(x,y) = \langle \phi(x), \phi(y)\rangle_{\mathcal{H}}$ (Section 2). Define the weighted covariance operator
$$C(q) := \sum_{j=1}^N q_j\,\phi(x_j) \otimes \phi(x_j) \;\succeq\; 0. \tag{17}$$
This operator is affine in $q$. Moreover, $C(q)$ and $\rho(q)$ share the same multiset of non-zero eigenvalues.
One can see this by considering the linear map $A_q : \mathbb{R}^N \to \mathcal{H}$ defined by $A_q e_j = \sqrt{q_j}\,\phi(x_j)$. Then $A_q^* A_q = \rho(q)$ (in the $\mathbb{R}^N$ basis) and $A_q A_q^* = C(q)$ as an operator on $\mathcal{H}$. It follows that $A_q^* A_q$ and $A_q A_q^*$ have identical non-zero spectra, hence
$$\mathrm{Tr}(\rho(q)\log\rho(q)) = \mathrm{Tr}(C(q)\log C(q)), \tag{18}$$
where the trace on the right is over the (finite-rank) operator $C(q)$. Now, we note that the negative von Neumann entropy $X \mapsto \mathrm{Tr}(X\log X)$ is convex on the cone of trace-one PSD (finite-rank) operators/matrices. Since $C(q)$ is affine in $q$ and $\mathrm{Tr}(C(q)) = \sum_j q_j\|\phi(x_j)\|_{\mathcal{H}}^2 = \sum_j q_j k(x_j, x_j) = 1$ (using kernel normalization), we conclude that $q \mapsto \mathrm{Tr}(C(q)\log C(q))$ is convex on the simplex. By equation 18, $q \mapsto \mathrm{Tr}(\rho(q)\log\rho(q))$ is convex as well. Finally, in our setting $q = q(\alpha)$ defined by equation 9 is affine in $\alpha$ (counts are fixed during the update), so $\alpha \mapsto \widehat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha)$ is convex on $\Delta_m$.
Part (iii): convex quadratics. If $\widehat{K}(t) \succeq 0$, then $\alpha \mapsto \alpha^\top \widehat{K}(t)\alpha$ is convex on $\mathbb{R}^m$ and $\alpha \mapsto w\,\alpha^\top \widehat{b}(t)$ is linear. Hence $\widehat{\mathcal{L}}^{\mathrm{quad}}_t$ is convex on $\Delta_m$. The above completes the proof of Proposition 1.
A.2 Gradients for EG optimization
Algorithm 1 requires $g = \nabla_\alpha \widehat{\mathcal{L}}_{t-1}(\alpha)$. Below we provide the gradients used in our implementation. In all cases, the counts $n_i(t-1)$ and pooled indices $I_j$ are treated as fixed while solving equation 5 at round $t$.
Quadratic + linear fidelity. For $\widehat{\mathcal{L}}^{\mathrm{quad}}_t(\alpha) = \alpha^\top \widehat{K}(t)\alpha + w\,\alpha^\top \widehat{b}(t)$ with $\widehat{K}(t)$ symmetric,
$$\nabla_\alpha \widehat{\mathcal{L}}^{\mathrm{quad}}_t(\alpha) = 2\widehat{K}(t)\alpha + w\,\widehat{b}(t).$$
FD gradient (moment-matching form). Let $\mu(\alpha) = \widehat{\mu}_t(\alpha)$ and $\Sigma(\alpha) = \widehat{\Sigma}_t(\alpha)$ as in equations 6–7. Then
$$\frac{\partial \mu(\alpha)}{\partial \alpha_i} = \widehat{\mu}_i(t), \qquad \frac{\partial \Sigma(\alpha)}{\partial \alpha_i} = \widehat{S}_i(t) - \widehat{\mu}_i(t)\mu(\alpha)^\top - \mu(\alpha)\widehat{\mu}_i(t)^\top.$$
We write $\widehat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha) = \mathcal{F}(\mu(\alpha), \Sigma(\alpha))$, where
$$\mathcal{F}(\mu, \Sigma) = \|\mu - \mu_0\|_2^2 + \mathrm{Tr}(\Sigma + \Sigma_0) - 2\,\mathrm{Tr}\bigl((\Sigma_0^{1/2}\Sigma\Sigma_0^{1/2})^{1/2}\bigr),$$
with $(\mu_0, \Sigma_0) = (\widehat{\mu}_{\mathrm{data}}, \widehat{\Sigma}_{\mathrm{data}})$. Then, by the chain rule,
$$\frac{\partial \widehat{\mathcal{L}}^{\mathrm{FD}}_t(\alpha)}{\partial \alpha_i} = 2(\mu(\alpha) - \mu_0)^\top \widehat{\mu}_i(t) + \bigl\langle G(\Sigma(\alpha)),\, \widehat{S}_i(t) - \widehat{\mu}_i(t)\mu(\alpha)^\top - \mu(\alpha)\widehat{\mu}_i(t)^\top \bigr\rangle, \tag{19}$$
where $\langle A, B\rangle := \mathrm{Tr}(A^\top B)$ and
$$G(\Sigma) := \nabla_\Sigma \mathcal{F}(\mu, \Sigma) = I - \Sigma_0^{1/2}\bigl(\Sigma_0^{1/2}\Sigma\Sigma_0^{1/2}\bigr)^{-1/2}\Sigma_0^{1/2}. \tag{20}$$
Note that we can compute the inverse square root $(\cdot)^{-1/2}$ via eigendecomposition.
Negative log-Vendi gradient (kernel-matrix form). At time $t$, let the pooled samples be $\{x_j\}_{j=1}^N$ with arm indices $I_j$, Gram matrix $K$, and $q_j(\alpha) = \alpha_{I_j}/n_{I_j}(t)$ as in equation 9. Define $\rho(\alpha) = \mathrm{diag}(q(\alpha))^{1/2}\,K\,\mathrm{diag}(q(\alpha))^{1/2}$. For $\rho \succ 0$, the differential identity $d\,\mathrm{Tr}(\rho\log\rho) = \langle \log\rho + I,\, d\rho\rangle$ implies $\nabla_\rho \widehat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha) = \log\rho(\alpha) + I$. Moreover,
$$\frac{\partial q_j(\alpha)}{\partial \alpha_i} = \frac{1}{n_i(t)}\,\mathbb{1}\{I_j = i\}.$$
Let $D := \mathrm{diag}(q)^{1/2}$. For $q_j > 0$,
$$\frac{\partial D}{\partial q_j} = \frac{1}{2\sqrt{q_j}}\,e_j e_j^\top, \qquad \frac{\partial \rho}{\partial q_j} = \frac{\partial D}{\partial q_j}KD + DK\frac{\partial D}{\partial q_j}.$$
Therefore,
$$\frac{\partial \widehat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha)}{\partial \alpha_i} = \sum_{j : I_j = i} \frac{1}{n_i(t)} \Bigl\langle \log\rho(\alpha) + I,\; \frac{\partial \rho}{\partial q_j} \Bigr\rangle. \tag{21}$$
We compute $\log\rho$ via the eigendecomposition $\rho = U\Lambda U^\top$ and $\log\rho = U(\log\Lambda)U^\top$.
Negative log-Vendi gradient (finite-dimensional feature / random-feature form). If we use a finite-dimensional feature map $\varphi(x) \in \mathbb{R}^D$ (e.g. random Fourier features) and define
$$C(\alpha) := \sum_{j=1}^N q_j(\alpha)\,\varphi(x_j)\varphi(x_j)^\top \in \mathbb{R}^{D \times D},$$
then the objective can be equivalently computed as $\mathrm{Tr}(C(\alpha)\log C(\alpha))$ (finite-rank case), and $\nabla_C \mathrm{Tr}(C\log C) = \log C + I$.
Since $\partial C/\partial q_j = \varphi(x_j)\varphi(x_j)^\top$,
$$\frac{\partial \widehat{\mathcal{L}}^{\mathrm{Vendi}}_t(\alpha)}{\partial \alpha_i} = \sum_{j : I_j = i} \frac{1}{n_i(t)} \bigl\langle \log C(\alpha) + I,\; \varphi(x_j)\varphi(x_j)^\top \bigr\rangle = \sum_{j : I_j = i} \frac{1}{n_i(t)}\,\varphi(x_j)^\top\bigl(\log C(\alpha) + I\bigr)\varphi(x_j). \tag{22}$$
This replaces the $O(N^3)$ cost of eigendecomposing $\rho$ with an $O(D^3)$ eigendecomposition of $C$ when $D < N$.
Appendix B Proofs for Section 5: log-Vendi and Fréchet Distance regret
B.1 Bandit protocol and ERM-to-regret reduction
Let $\Delta_m = \{\alpha \in \mathbb{R}^m : \alpha_i \ge 0,\ \sum_{i=1}^m \alpha_i = 1\}$. For the warm start, we suppose $n_i(0) = M$ for all $i$. At round $t = 1, \ldots, T$, Mixture-Greedy selects
$$\alpha_t \in \arg\min_{\alpha \in \Delta_m} \widehat{F}_{t-1}(\alpha), \qquad I_t \sim \alpha_t.$$
The regret at iteration $T$ is defined as $\mathrm{Reg}_T = \sum_{t=1}^T \bigl(F(\alpha_t) - \min_\alpha F(\alpha)\bigr)$.
Lemma 1. For every $t \ge 1$, let $\alpha_t \in \arg\min_{\alpha \in \Delta_m} \widehat{F}_{t-1}(\alpha)$. Then the following holds:
$$F(\alpha_t) - \min_{\alpha \in \Delta_m} F(\alpha) \;\le\; 2\sup_{\alpha \in \Delta_m}\bigl|\widehat{F}_{t-1}(\alpha) - F(\alpha)\bigr|.$$
Proof. Let $\alpha^\star \in \arg\min_\alpha F(\alpha)$. Since $\alpha_t$ minimizes $\widehat{F}_{t-1}$, the middle term in the following decomposition is non-positive:
$$F(\alpha_t) - F(\alpha^\star) = \bigl(F(\alpha_t) - \widehat{F}_{t-1}(\alpha_t)\bigr) + \bigl(\widehat{F}_{t-1}(\alpha_t) - \widehat{F}_{t-1}(\alpha^\star)\bigr) + \bigl(\widehat{F}_{t-1}(\alpha^\star) - F(\alpha^\star)\bigr) \le \bigl|F(\alpha_t) - \widehat{F}_{t-1}(\alpha_t)\bigr| + \bigl|\widehat{F}_{t-1}(\alpha^\star) - F(\alpha^\star)\bigr| \le 2\sup_\alpha \bigl|\widehat{F}_{t-1}(\alpha) - F(\alpha)\bigr|.$$
The lemma's proof is hence complete. ∎
Lemma 2. Assume there exists $\gamma \in (0,1]$ such that $\alpha_{t,i} \ge \gamma$ for all $i$ and all $t \le T$. Then, for every $\delta > 0$, with probability at least $1-\delta$, the following holds simultaneously for all $i \in [m]$ and all $t \le T$:
$$n_i(t) \;\ge\; M + \gamma t - \sqrt{2t\log\tfrac{mT}{\delta}}.$$
Proof. Fix an arm $i$ and let $N_i(t) = \sum_{s=1}^t \mathbb{1}\{I_s = i\}$, so $n_i(t) = M + N_i(t)$. Define $D_s := \mathbb{1}\{I_s = i\} - \mathbb{E}[\mathbb{1}\{I_s = i\} \mid \mathcal{F}_{s-1}]$; therefore $|D_s| \le 1$ and $\sum_{s=1}^t D_s = N_i(t) - \sum_{s=1}^t \alpha_{s,i}$.
Azuma–Hoeffding and a union bound over $(i,t)$ yield $N_i(t) \ge \sum_{s=1}^t \alpha_{s,i} - \sqrt{2t\log\tfrac{mT}{\delta}} \ge \gamma t - \sqrt{2t\log\tfrac{mT}{\delta}}$. ∎
B.2 Negative log-Vendi: concentration and entropy continuity
Assumption 6 (Normalized features). $\|\phi(x)\|_2 = 1$ for all $x$.
For arm $i$, define $S_i = \mathbb{E}[\phi(X)\phi(X)^\top]$ and $S_\alpha = \sum_i \alpha_i S_i$. Define $F_{\mathrm{NLV}}(\alpha) = \mathrm{Tr}(S_\alpha \log S_\alpha)$ and its plug-in estimator $\widehat{F}^{\mathrm{NLV}}_t(\alpha) = \mathrm{Tr}(\widehat{S}_\alpha(t)\log\widehat{S}_\alpha(t))$.
Assumption 7 (Population innovation). There exist $\nu_0 \in (0,1]$ and $\varepsilon_0 \in [0, \nu_0/8)$ such that for each $i \in [m]$ there exists a unit vector $v_i$ with $v_i^\top S_i v_i \ge \nu_0$ and $\sum_{j \ne i} v_i^\top S_j v_i \le \varepsilon_0$.
Lemma 3 (from Sutherland et al. (2018)). Let $(Y_r)_{r=1}^n$ be i.i.d. random elements in a real Hilbert space with $\mathbb{E}[Y_r] = 0$ and $\|Y_r\| \le L$ a.s. Then, for every $\delta > 0$, with probability at least $1-\delta$,
$$\Bigl\|\frac{1}{n}\sum_{r=1}^n Y_r\Bigr\| \;\le\; \frac{L}{\sqrt{n}}\Bigl(1 + \sqrt{2\log\tfrac{1}{\delta}}\Bigr).$$
Lemma 4. Fix $T \ge 1$ and $\delta \in (0,1)$. Under Assumption 6, with probability at least $1-\delta$, simultaneously for all $i \in [m]$ and all $n \in \{M, \ldots, M+T\}$,
$$\|\widehat{S}_i(n) - S_i\|_F \;\le\; \frac{2}{\sqrt{n}}\Bigl(1 + \sqrt{2\log\tfrac{m(T+1)}{\delta}}\Bigr).$$
Proof. Fix $(i,n)$ and set $Y_{i,r} = \phi(X_{i,r})\phi(X_{i,r})^\top - S_i$. Then $\mathbb{E}Y_{i,r} = 0$ and $\|Y_{i,r}\|_F \le \|\phi\phi^\top\|_F + \|S_i\|_F \le 1 + 1 = 2$. Applying Lemma 3 in Frobenius norm with failure probability $\delta/(m(T+1))$ and a union bound over $(i,n)$ proves the result. ∎
Fannes–Audenaert inequality. Let $\|\cdot\|_1$ be the trace norm. We use the Fannes–Audenaert inequality (Audenaert, 2007), which shows that for every two density matrices $S, S' \succeq 0$ with $\mathrm{Tr}(S) = \mathrm{Tr}(S') = 1$ and $T = \tfrac{1}{2}\|S - S'\|_1 \in [0, 1 - 1/d]$,
$$|\mathrm{Tr}(S\log S) - \mathrm{Tr}(S'\log S')| \;\le\; T\log(d-1) + h(T), \qquad h(T) = -T\log T - (1-T)\log(1-T).$$
Lemma 5.
Let $S, S' \succeq 0$ with $\mathrm{Tr}(S) = \mathrm{Tr}(S') = 1$ in dimension $d \ge 2$, and assume $\|S - S'\|_F \le \varepsilon$. Let $T := \tfrac{1}{2}\|S - S'\|_1$. Then $T \le \tfrac{1}{2}\sqrt{d}\,\varepsilon$, and whenever $T \le 1 - 1/d$,
$$|\mathrm{Tr}(S\log S) - \mathrm{Tr}(S'\log S')| \;\le\; T\log(d-1) + h(T).$$
Moreover, if $\varepsilon \le \tfrac{2}{e\sqrt{d}}$ (thus $T \le 1/e$), then
$$|\mathrm{Tr}(S\log S) - \mathrm{Tr}(S'\log S')| \;\le\; \tfrac{1}{2}\sqrt{d}\,\varepsilon\Bigl(\log(d-1) + \log\tfrac{2}{\sqrt{d}} + 1 + \log\tfrac{1}{\varepsilon}\Bigr). \tag{23}$$
In particular, in this regime,
$$|\mathrm{Tr}(S\log S) - \mathrm{Tr}(S'\log S')| \;\le\; C_d\,\varepsilon\log\tfrac{1}{\varepsilon} \qquad \text{with} \quad C_d := \tfrac{1}{2}\sqrt{d}\Bigl(2 + \log(d-1) + \log\tfrac{2}{\sqrt{d}}\Bigr).$$
Proof. For the Frobenius ($\|\cdot\|_F$) and trace ($\|\cdot\|_1$) norms of a matrix $A \in \mathbb{R}^{d \times d}$, a well-known inequality is $\|A\|_1 \le \sqrt{d}\,\|A\|_F$. This implies $T \le \tfrac{1}{2}\sqrt{d}\,\varepsilon$. We then apply Fannes–Audenaert to directly obtain the first bound. For the simplified second bound, note that if $T \le 1/e$, then $-x\log x$ is increasing on $[0,T]$, and one can see $h(T) \le T\log(1/T) + T$. Since $x \mapsto x\log(1/x) + x$ is increasing on $(0,1]$, we may substitute the bound $T \le \tfrac{1}{2}\sqrt{d}\,\varepsilon$, so that $h(T) \le \tfrac{1}{2}\sqrt{d}\,\varepsilon\bigl(\log\tfrac{2}{\sqrt{d}\,\varepsilon} + 1\bigr)$, and combining with the first bound yields equation 23. ∎
B.3 Quantitative interiority for log-Vendi Mixture-Greedy (patched)
Lemma 6. Fix $T \ge 1$ and $\delta \in (0,1)$. Assume Assumptions 6 and 7, and assume the event of Lemma 4 holds. Define
$$\eta := \frac{2}{\sqrt{M}}\Bigl(1 + \sqrt{2\log\tfrac{m(T+1)}{\delta}}\Bigr), \qquad \nu_{\mathrm{eff}} := \nu_0 - \eta, \qquad \varepsilon_{\mathrm{eff}} := \varepsilon_0 + (m-1)\eta,$$
and assume $\eta \le \nu_0/4$ (thus $\nu_{\mathrm{eff}} \ge 3\nu_0/4$). Define
$$\gamma_{\min} := \max\Bigl\{0,\ \exp\Bigl(-1 - \frac{1}{\nu_{\mathrm{eff}}}\Bigl(\frac{d}{e} + \log m\Bigr)\Bigr) - \varepsilon_{\mathrm{eff}}\Bigr\}.$$
Then for every $t \in \{1, \ldots, T\}$, every minimizer $\alpha_t \in \arg\min_{\alpha \in \Delta_m} \widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha)$ satisfies $\alpha_{t,i} \ge \gamma_{\min}$ for all $i \in [m]$.
Proof. Fix $t$ and let $\alpha$ minimize $\widehat{F}^{\mathrm{NLV}}_{t-1}$. Write $\widehat{S}_i := \widehat{S}_i(t-1)$ and $\widehat{S}_\alpha := \sum_{i=1}^m \alpha_i \widehat{S}_i$. First, we analyze the empirical innovation along each $v_i$. Fix an index $i$ and consider the population innovation direction $v_i$ from Assumption 7.
By Lemma 4 and $\|A\|_{\mathrm{op}} \le \|A\|_F$,
$$|v_i^\top(\widehat{S}_j - S_j)v_i| \le \eta \quad \forall j.$$
As a result, we can write
$$v_i^\top \widehat{S}_i v_i \ge \nu_0 - \eta = \nu_{\mathrm{eff}}, \qquad \sum_{j \ne i} v_i^\top \widehat{S}_j v_i \le \varepsilon_0 + (m-1)\eta = \varepsilon_{\mathrm{eff}}.$$
Also, $\|\widehat{S}_j\|_{\mathrm{op}} \le \mathrm{Tr}(\widehat{S}_j) = 1$. Therefore,
$$v_i^\top \widehat{S}_\alpha v_i = \alpha_i\,v_i^\top \widehat{S}_i v_i + \sum_{j \ne i} \alpha_j\,v_i^\top \widehat{S}_j v_i \le \alpha_i \cdot 1 + \varepsilon_{\mathrm{eff}}. \tag{24}$$
Next, we derive the directional optimality condition. Note that the map $S \mapsto \mathrm{Tr}(S\log S)$ is convex on unit-trace PSD matrices (i.e., density matrices). Therefore, $\alpha \mapsto \widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha)$ is convex on $\Delta_m$. Let $j$ be an index with $\alpha_j \ge 1/m$. For every $i \in [m]$ and $\tau \in (0, \alpha_j]$, we define $\alpha^{(\tau)} = \alpha + \tau(e_i - e_j) \in \Delta_m$. Optimality of $\alpha$ implies
$$\frac{\widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha^{(\tau)}) - \widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha)}{\tau} \;\ge\; 0. \tag{25}$$
Let $G(S) = \mathrm{Tr}(S\log S)$. By convexity,
$$\frac{\widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha^{(\tau)}) - \widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha)}{\tau} \;\le\; \bigl\langle \log\widehat{S}_\alpha + I,\ \widehat{S}_i - \widehat{S}_j \bigr\rangle. \tag{26}$$
Since $\mathrm{Tr}(\widehat{S}_i) = \mathrm{Tr}(\widehat{S}_j) = 1$, the identity terms cancel:
$$\bigl\langle \log\widehat{S}_\alpha + I,\ \widehat{S}_i - \widehat{S}_j \bigr\rangle = \mathrm{Tr}(\widehat{S}_i\log\widehat{S}_\alpha) - \mathrm{Tr}(\widehat{S}_j\log\widehat{S}_\alpha). \tag{27}$$
We bound the $\widehat{S}_i$ term using $v_i$. Since $\widehat{S}_i \succeq 0$ and $\mathrm{Tr}(\widehat{S}_i) = 1$,
$$\mathrm{Tr}(\widehat{S}_i\log\widehat{S}_\alpha) \le (v_i^\top\widehat{S}_i v_i)\cdot v_i^\top(\log\widehat{S}_\alpha)v_i.$$
By Jensen's inequality for the concave $\log$ function (applied to the spectral measure of $\widehat{S}_\alpha$), we obtain $v_i^\top(\log\widehat{S}_\alpha)v_i \le \log(v_i^\top\widehat{S}_\alpha v_i)$. Thus, by equation 24 we attain
$$\mathrm{Tr}(\widehat{S}_i\log\widehat{S}_\alpha) \le (v_i^\top\widehat{S}_i v_i)\log(\alpha_i + \varepsilon_{\mathrm{eff}}) \le \nu_{\mathrm{eff}}\log(\alpha_i + \varepsilon_{\mathrm{eff}}).$$
Combining this with equation 27 yields
$$\frac{\widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha^{(\tau)}) - \widehat{F}^{\mathrm{NLV}}_{t-1}(\alpha)}{\tau} \;\le\; \nu_{\mathrm{eff}}\log(\alpha_i + \varepsilon_{\mathrm{eff}}) - \mathrm{Tr}(\widehat{S}_j\log\widehat{S}_\alpha). \tag{28}$$
Subsequently, we derive a lower bound for the $\widehat{S}_j$ term. Since $\alpha_j \ge 1/m$, $\widehat{S}_\alpha \succeq \alpha_j\widehat{S}_j \succeq \tfrac{1}{m}\widehat{S}_j$.
Using an $\epsilon$-regularization argument to apply operator monotonicity of $\log$ on PD matrices, and then letting $\epsilon \downarrow 0$, we obtain
$$\mathrm{Tr}(\widehat{S}_j\log\widehat{S}_\alpha) \;\ge\; \mathrm{Tr}\Bigl(\widehat{S}_j\log\Bigl(\tfrac{1}{m}\widehat{S}_j\Bigr)\Bigr) = \mathrm{Tr}(\widehat{S}_j\log\widehat{S}_j) - (\log m)\,\mathrm{Tr}(\widehat{S}_j).$$
Since $\mathrm{Tr}(\widehat{S}_j) = 1$ and $x\log x \ge -1/e$ for $x \ge 0$, we have $\mathrm{Tr}(\widehat{S}_j\log\widehat{S}_j) \ge -d/e$, and then
$$\mathrm{Tr}(\widehat{S}_j\log\widehat{S}_\alpha) \;\ge\; -\frac{d}{e} - \log m. \tag{29}$$
Combining the above, using equations 25 and 28, we find that
$$0 \;\le\; \frac{\widehat{F}(\alpha^{(\tau)}) - \widehat{F}(\alpha)}{\tau} \;\le\; \nu_{\mathrm{eff}}\log(\alpha_i + \varepsilon_{\mathrm{eff}}) - \mathrm{Tr}(\widehat{S}_j\log\widehat{S}_\alpha).$$
Using equation 29 gives
$$0 \;\le\; \nu_{\mathrm{eff}}\log(\alpha_i + \varepsilon_{\mathrm{eff}}) + \frac{d}{e} + \log m,$$
which can be written as
$$\alpha_i + \varepsilon_{\mathrm{eff}} \;\ge\; \exp\Bigl(-\frac{1}{\nu_{\mathrm{eff}}}\Bigl(\frac{d}{e} + \log m\Bigr)\Bigr),$$
which yields the stated $\gamma_{\min}$ after subtracting $\varepsilon_{\mathrm{eff}}$ and truncating below at $0$. ∎
B.4 Negative log-Vendi regret (dimension-explicit warm-start condition)
Theorem 3 (Negative log-Vendi regret). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume Assumptions 6 and 7. Assume $M$ is large enough that: (i) $\eta \le \nu_0/4$ and $\gamma_{\min} > 0$ in Lemma 6, and (ii) on the analysis event, the Frobenius deviation $\varepsilon_t := \sup_\alpha \|\widehat{S}_\alpha(t) - S_\alpha\|_F$ satisfies $\tfrac{1}{2}\sqrt{d}\,\varepsilon_t \le 1/e$ for all $t \in \{0, \ldots, T-1\}$ (equivalently, it suffices that $\varepsilon_0 \le 2/(e\sqrt{d})$, since $\varepsilon_t$ decreases with $t$ under linear sampling). Then with probability at least $1-\delta$,
$$\mathrm{Reg}_T \;\le\; C_{\mathrm{NLV}}\Bigl(1 + \sqrt{\log\tfrac{m(T+1)}{\delta}}\Bigr)\sqrt{T}\,(1 + \log T),$$
where $C_{\mathrm{NLV}}$ depends only on $(d, m, \nu_0, \varepsilon_0)$.
Proof. We consider the intersection of the following events: (A) Lemma 4 (failure probability at most $\delta/3$), (B) Lemma 2 with $\gamma = \gamma_{\min}$ (failure probability at most $\delta/3$), and (C) the deterministic interiority conclusion of Lemma 6 (which holds on event A). On this intersection,
$$n_{\min}(t) \;\ge\; M + \gamma_{\min}t - \sqrt{2t\log\tfrac{3mT}{\delta}} \;\ge\; ct$$
for an explicit $c = c(\gamma_{\min}) > 0$. Therefore, by Lemma 4,
_i\| S_i(t)-S_i\|_F≤ 2 n_ (t) (1+ 2 3m(T+1)δ )≤ C_1 t (1+ m(T+1)δ ). For every α∈Δmα∈ _m, the convexity of norm functions shows that ∥S^α(t)−Sα∥F≤maxi∥S^i(t)−Si∥F=:εt\| S_α(t)-S_α\|_F≤ _i\| S_i(t)-S_i\|_F=: _t. By assumption (i), we have εt≤2/(ed) _t≤ 2/(e d), and therefore Lemma 5 implies supα∈Δm|F^tNLV(α)−FNLV(α)|≤Cdεtlog1εt≤C2t(1+logm(T+1)δ)(1+logt), _α∈ _m | F^NLV_t(α)-F_NLV(α) |≤ C_d\, _t 1 _t≤ C_2 t (1+ m(T+1)δ )(1+ t), where the last step uses εt≍t−1/2 _t t^-1/2. Hence, we obtain log(1/εt)≲1+logt (1/ _t) 1+ t. Lemma 1 then yields instantaneous regret bounded by twice the uniform deviation, and summing ∑t=2Tt−1/2(1+logt)≤2T(1+logT) _t=2^Tt^-1/2(1+ t)≤ 2 T(1+ T) gives the stated bound. The application of union bound over events A and B thereof yields a success probability of at least 1−δ1-δ. ∎ Appendix C Including a bounded linear term: Regret for NLV/FD + linear score C.1 Linear term: definition, estimator, and uniform concentration Fix measurable functions ψi:→[0,1] _i:X→[0,1] for each arm i∈[m]i∈[m] and define θi:=[ψi(X)]∈[0,1],X∼Pi,G(α):=∑i=1mαiθi. _i:=E[ _i(X)]∈[0,1], X P_G_i, G(α):= _i=1^m _i _i. For a base objective F(α)F(α) (negative log-Vendi or Fréchet Distance), define H(α):=F(α)+wG(α),w≥0.H(α):=F(α)+w\,G(α), w≥ 0. At time t, define empirical estimates θ^i(t):=1ni(t)∑r=1ni(t)ψi(Xi,r),G^t(α):=∑i=1mαiθ^i(t),H^t(α):=F^t(α)+wG^t(α). θ_i(t):= 1n_i(t) _r=1^n_i(t) _i(X_i,r), G_t(α):= _i=1^m _i θ_i(t), H_t(α):= F_t(α)+w\, G_t(α). Mixture-Greedy with the combined objective selects αt∈argminα∈ΔmH^t−1(α),It∼αt. _t∈ _α∈ _m H_t-1(α), I_t _t. (30) We analyze combined-objective regret RegT(H):=∑t=1T(H(αt)−minα∈ΔmH(α)).Reg_T(H):= _t=1^T (H( _t)- _α∈ _mH(α) ). Lemma 7. Fix T≥1T≥ 1 and δ∈(0,1)δ∈(0,1). With probability at least 1−δ1-δ, simultaneously for all i∈[m]i∈[m] and all n∈M,M+1,…,M+Tn∈\M,M+1,…,M+T\, |θ^i(n)−θi|≤12nlog2m(T+1)δ.| θ_i(n)- _i|≤ 1 2n 2m(T+1)δ. Proof. Consider a fixed (i,n)(i,n). 
Since $\psi_i(X_{i,r}) \in [0,1]$ and the samples are i.i.d., Hoeffding's inequality gives
$$\mathbb{P}\big(|\hat\theta_i(n) - \theta_i| \ge \varepsilon\big) \le 2e^{-2n\varepsilon^2}.$$
Setting $\varepsilon = \sqrt{\frac{1}{2n}\log\frac{2m(T+1)}{\delta}}$ bounds the right-hand side by $\delta/(m(T+1))$, and applying a union bound over the pairs $(i, n)$ proves the result. ∎

Lemma 8 (Uniform deviation of the linear objective). On the event of Lemma 7, for every $t \le T$ we have the following, where $n_{\min}(t) := \min_i n_i(t)$:
$$\sup_{\alpha \in \Delta_m} |\hat G_t(\alpha) - G(\alpha)| \le \max_{i \in [m]} |\hat\theta_i(t) - \theta_i| \le \sqrt{\frac{1}{2n_{\min}(t)} \log \frac{2m(T+1)}{\delta}}.$$

Proof. For every $\alpha \in \Delta_m$, we can write
$$|\hat G_t(\alpha) - G(\alpha)| = \Big|\sum_{i=1}^m \alpha_i (\hat\theta_i(t) - \theta_i)\Big| \le \sum_{i=1}^m \alpha_i \max_j |\hat\theta_j(t) - \theta_j| = \max_j |\hat\theta_j(t) - \theta_j|.$$
Then, we apply Lemma 7 with $n = n_{\min}(t)$ to prove the result. ∎

Lemma 9. For every $t \ge 1$, if $\alpha_t$ minimizes $\hat H_{t-1}$ over $\Delta_m$, then the following holds:
$$H(\alpha_t) - \min_{\alpha \in \Delta_m} H(\alpha) \le 2 \sup_{\alpha \in \Delta_m} |\hat H_{t-1}(\alpha) - H(\alpha)|.$$
Moreover, we will have
$$\sup_{\alpha \in \Delta_m} |\hat H_{t-1}(\alpha) - H(\alpha)| \le \sup_{\alpha \in \Delta_m} |\hat F_{t-1}(\alpha) - F(\alpha)| + w \sup_{\alpha \in \Delta_m} |\hat G_{t-1}(\alpha) - G(\alpha)|.$$

Proof. Let $\alpha^\star \in \arg\min_{\alpha \in \Delta_m} H(\alpha)$. Since $\alpha_t$ minimizes $\hat H_{t-1}$, we can write
$$H(\alpha_t) - H(\alpha^\star) = \big(H(\alpha_t) - \hat H_{t-1}(\alpha_t)\big) + \big(\hat H_{t-1}(\alpha_t) - \hat H_{t-1}(\alpha^\star)\big) + \big(\hat H_{t-1}(\alpha^\star) - H(\alpha^\star)\big) \le 2 \sup_{\alpha \in \Delta_m} |\hat H_{t-1}(\alpha) - H(\alpha)|.$$
The second inequality follows from the triangle inequality and the linearity of the combination, since $H - \hat H = (F - \hat F) + w(G - \hat G)$. ∎

C.2 NLV + linear term: corrected interiority floor and regret

We use the NLV setup and notation from the base NLV appendix: normalized features (Assumption 6), population innovation (Assumption 7), uniform concentration of $\hat S_i$ (Lemma 4), and the resulting quantities $\eta, \nu_{\mathrm{eff}}, \varepsilon_{\mathrm{eff}}$.
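The concentration chain behind Lemmas 7 and 8 can be checked numerically. The sketch below is a minimal illustration, not the paper's code: the helper names `hoeffding_radius` and `linear_deviation` and the Bernoulli-valued scores are our own assumptions. It draws bounded per-arm scores and verifies that the mixture-level deviation of $\hat G$ never exceeds the worst per-arm deviation, which Lemma 7 bounds by the Hoeffding radius with high probability.

```python
import math
import random

def hoeffding_radius(n, m, T, delta):
    # Lemma 7 radius: sqrt(log(2m(T+1)/delta) / (2n))
    return math.sqrt(math.log(2 * m * (T + 1) / delta) / (2 * n))

def linear_deviation(alpha, theta_hat, theta):
    # |G_hat(alpha) - G(alpha)| for a fixed mixture weight alpha
    return abs(sum(a * (th - t) for a, th, t in zip(alpha, theta_hat, theta)))

random.seed(0)
m, n, T, delta = 3, 2000, 100, 0.05
theta = [0.2, 0.5, 0.8]  # true per-arm scores theta_i in [0, 1] (illustrative)
samples = [[float(random.random() < p) for _ in range(n)] for p in theta]
theta_hat = [sum(s) / n for s in samples]

max_dev = max(abs(th - t) for th, t in zip(theta_hat, theta))
alpha = [0.1, 0.3, 0.6]  # an arbitrary point on the simplex
# Lemma 8: the mixture deviation is at most the worst per-arm deviation
assert linear_deviation(alpha, theta_hat, theta) <= max_dev + 1e-12
print(max_dev, hoeffding_radius(n, m, T, delta))
```

With these sample sizes the empirical deviation typically falls well inside the Hoeffding radius; the point of Lemma 7 is that the radius holds for all arms and all sample counts simultaneously at confidence $1 - \delta$.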
Lemma 10 (Updated interiority floor for NLV + linear term; tight, without the $-1$). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume Assumptions 6 and 7, and assume the event of Lemma 4 holds. Define
$$\eta := \sqrt{\frac{2}{M}}\Big(1 + \sqrt{2\log\frac{m(T+1)}{\delta}}\Big), \qquad \nu_{\mathrm{eff}} := \nu_0 - \eta, \qquad \varepsilon_{\mathrm{eff}} := \varepsilon_0 + (m-1)\eta,$$
and assume $\eta \le \nu_0/4$ to obtain $\nu_{\mathrm{eff}} > 0$. Then, for every $t \in \{1, \dots, T\}$, every minimizer $\alpha_t \in \arg\min_{\alpha \in \Delta_m} \hat H_{t-1}(\alpha)$ satisfies $\alpha_{t,i} \ge \gamma_{\min}^{(w)}$ for all $i \in [m]$, where
$$\gamma_{\min}^{(w)} := \max\Big\{0,\ \exp\Big(-\frac{1}{\nu_{\mathrm{eff}}}\big(\tfrac{d}{e} + \log m + w\big)\Big) - \varepsilon_{\mathrm{eff}}\Big\}. \qquad (31)$$

Proof. Fix $t$ and let $\alpha$ minimize $\hat H_{t-1}(\cdot) = \hat F_{t-1}(\cdot) + w\,\hat G_{t-1}(\cdot)$. Let $j$ be an index with $\alpha_j \ge 1/m$ (which exists since $\sum_i \alpha_i = 1$). For any $i$ and $\tau \in (0, \alpha_j]$, define the feasible perturbation $\alpha^{(\tau)} = \alpha + \tau(e_i - e_j)$.

First, we analyze directional optimality. Because $\hat H_{t-1}$ is convex and $\alpha$ is a minimizer over the convex set $\Delta_m$, we have
$$\frac{\hat H_{t-1}(\alpha^{(\tau)}) - \hat H_{t-1}(\alpha)}{\tau} \ge 0. \qquad (32)$$
Then, we derive a convex upper bound for the increment term. By convexity of $\hat F_{t-1}$ and linearity of $\hat G_{t-1}$, we have
$$\frac{\hat H_{t-1}(\alpha^{(\tau)}) - \hat H_{t-1}(\alpha)}{\tau} = \frac{\hat F_{t-1}(\alpha^{(\tau)}) - \hat F_{t-1}(\alpha)}{\tau} + w(\hat\theta_i - \hat\theta_j) \le \langle \log \hat S_\alpha + I,\ \hat S_i - \hat S_j \rangle + w(\hat\theta_i - \hat\theta_j), \qquad (33)$$
where $\hat S_\alpha = \sum_k \alpha_k \hat S_k$. Since $\mathrm{Tr}(\hat S_i) = \mathrm{Tr}(\hat S_j) = 1$, the identity components cancel exactly:
$$\langle \log \hat S_\alpha + I,\ \hat S_i - \hat S_j \rangle = \mathrm{Tr}(\hat S_i \log \hat S_\alpha) - \mathrm{Tr}(\hat S_j \log \hat S_\alpha).$$
Now, we bound the $\hat S_i$ term using the innovation direction $v_i$. By Lemma 4 and the fact that $\|A\|_{op} \le \|A\|_F$, we attain
$$v_i^\top \hat S_i v_i \ge \nu_0 - \eta = \nu_{\mathrm{eff}}, \qquad \sum_{k \ne i} v_i^\top \hat S_k v_i \le \varepsilon_0 + (m-1)\eta = \varepsilon_{\mathrm{eff}}.$$
Also, $\|\hat S_k\|_{op} \le \mathrm{Tr}(\hat S_k) = 1$. Hence
$$v_i^\top \hat S_\alpha v_i = \alpha_i\, v_i^\top \hat S_i v_i + \sum_{k \ne i} \alpha_k\, v_i^\top \hat S_k v_i \le \alpha_i \cdot 1 + \varepsilon_{\mathrm{eff}}.$$
Since $\hat S_i \succeq 0$ and $\mathrm{Tr}(\hat S_i) = 1$, we have $\mathrm{Tr}(\hat S_i \log \hat S_\alpha) \le (v_i^\top \hat S_i v_i)\, v_i^\top (\log \hat S_\alpha)\, v_i$. By concavity of $\log$ and Jensen's inequality applied to the spectral measure of $\hat S_\alpha$ induced by $v_i$, we get $v_i^\top (\log \hat S_\alpha) v_i \le \log(v_i^\top \hat S_\alpha v_i)$, hence
$$\mathrm{Tr}(\hat S_i \log \hat S_\alpha) \le (v_i^\top \hat S_i v_i)\, \log(v_i^\top \hat S_\alpha v_i) \le \nu_{\mathrm{eff}} \log(\alpha_i + \varepsilon_{\mathrm{eff}}).$$
To obtain a lower bound for the $\hat S_j$ term, note that since $\alpha_j \ge 1/m$, we have $\hat S_\alpha \succeq \alpha_j \hat S_j \succeq \frac{1}{m}\hat S_j$. By operator monotonicity of $\log$ on the support,
$$\mathrm{Tr}(\hat S_j \log \hat S_\alpha) \ge \mathrm{Tr}\big(\hat S_j \log(\tfrac{1}{m}\hat S_j)\big) = \mathrm{Tr}(\hat S_j \log \hat S_j) - (\log m)\,\mathrm{Tr}(\hat S_j).$$
Let $\lambda_1, \dots, \lambda_d$ be the eigenvalues of $\hat S_j$; then $\sum_k \lambda_k = 1$ and $\lambda_k \ge 0$. Since $x \log x \ge -1/e$ for all $x \ge 0$, we get $\mathrm{Tr}(\hat S_j \log \hat S_j) = \sum_k \lambda_k \log \lambda_k \ge -d/e$. Also, $\mathrm{Tr}(\hat S_j) = 1$. Therefore,
$$\mathrm{Tr}(\hat S_j \log \hat S_\alpha) \ge -\frac{d}{e} - \log m.$$
Because $0 \le \hat\theta_i, \hat\theta_j \le 1$, we have $\hat\theta_i - \hat\theta_j \le 1$. Substituting the two bounds above into equation 33 yields
$$\frac{\hat H_{t-1}(\alpha^{(\tau)}) - \hat H_{t-1}(\alpha)}{\tau} \le \nu_{\mathrm{eff}} \log(\alpha_i + \varepsilon_{\mathrm{eff}}) - \Big(-\frac{d}{e} - \log m\Big) + w.$$
Combining with equation 32 gives $0 \le \nu_{\mathrm{eff}} \log(\alpha_i + \varepsilon_{\mathrm{eff}}) + d/e + \log m + w$. Rearranging yields
$$\alpha_i + \varepsilon_{\mathrm{eff}} \ge \exp\Big(-\frac{1}{\nu_{\mathrm{eff}}}\big(\tfrac{d}{e} + \log m + w\big)\Big),$$
and the claimed floor follows after subtracting $\varepsilon_{\mathrm{eff}}$ and truncating at $0$, giving equation 31. ∎

Theorem 4 (NLV + linear term: Mixture-Greedy regret). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume the NLV conditions (Assumptions 6 and 7) and $0 \le \psi_i \le 1$. Assume $M$ is sufficiently large to satisfy the following: (i) $\eta \le \nu_0/4$ and $\gamma_{\min}^{(w)} > 0$ in Lemma 10, and (ii) the dimension-explicit continuity regime holds along the analysis event, namely $\tfrac{1}{2} d\, \varepsilon_t \le 1/e$ for all $t \in \{0, \dots, T-1\}$, where $\varepsilon_t := \sup_{\alpha \in \Delta_m} \|\hat S_\alpha(t) - S_\alpha\|_F$.
Then with probability at least $1 - \delta$,
$$\mathrm{Reg}_T(H) \le C_{NLV,w} \Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big) \sqrt{T}\,(1 + \log T),$$
where $C_{NLV,w}$ depends only on $(d, m, \nu_0, \varepsilon_0, w)$.

Proof. Work on the intersection of the following events: (A) Lemma 4 (uniform concentration of $\hat S_i$), (B) Lemma 7 (uniform concentration of $\hat\theta_i$), (C) Lemma 2 with $\gamma = \gamma_{\min}^{(w)}$ (linear sampling under the floor), with failure probabilities allocated so that the intersection holds with probability at least $1 - \delta$. On this intersection, Lemma 10 ensures $\alpha_{t,i} \ge \gamma_{\min}^{(w)}$ for all $i$ and all $t \le T$. Thus $n_{\min}(t) = \Omega(t)$ by Lemma 2. As in the base NLV proof,
$$\varepsilon_t := \sup_\alpha \|\hat S_\alpha(t) - S_\alpha\|_F \le \max_i \|\hat S_i(t) - S_i\|_F \le \frac{C_1}{\sqrt{t}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big).$$
By assumption (ii), we may apply the dimension-explicit continuity modulus (Lemma 5) to obtain
$$\sup_\alpha \big|\hat F_t(\alpha) - F(\alpha)\big| \le C_2\, \varepsilon_t \log\frac{1}{\varepsilon_t} \le \frac{C_3}{\sqrt{t}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big)(1 + \log t).$$
For the linear part, Lemma 8 gives
$$\sup_\alpha |\hat G_t(\alpha) - G(\alpha)| \le \sqrt{\frac{1}{2n_{\min}(t)} \log\frac{2m(T+1)}{\delta}} \le \frac{C_4}{\sqrt{t}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big).$$
Therefore,
$$\sup_\alpha |\hat H_t(\alpha) - H(\alpha)| \le \sup_\alpha |\hat F_t(\alpha) - F(\alpha)| + w \sup_\alpha |\hat G_t(\alpha) - G(\alpha)| \le \frac{C_5}{\sqrt{t}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big)(1 + \log t).$$
Lemma 9 gives
$$H(\alpha_{t+1}) - \min_\alpha H(\alpha) \le 2 \sup_\alpha |\hat H_t(\alpha) - H(\alpha)|,$$
and summing $\sum_{t=1}^{T-1} t^{-1/2}(1 + \log t) \le 2\sqrt{T}(1 + \log T)$ completes the proof. ∎

C.3 Fréchet Distance + linear term: margin transfer and regret

Let $F_{FD}$ be the Fréchet Distance objective and $\hat F$ its plug-in estimator from the base FD appendix. Define $H(\alpha) = F_{FD}(\alpha) + wG(\alpha)$, where $G(\alpha) \in [0,1]$.

Assumption 8 (Population interiority margin for $H$). There exist $\gamma_0 \in (0, 1/m]$ and $\Delta_0^{(H)} > 0$ such that
$$\inf_{\alpha \in \Delta_m:\ \min_i \alpha_i \le \gamma_0} H(\alpha) \ \ge\ \min_{\alpha \in \Delta_m} H(\alpha) + \Delta_0^{(H)}.$$

Lemma 11.
Assume $F_{FD}$ satisfies Assumption 5 with gap $\Delta_0$ and the same $\gamma_0$. Then $H = F_{FD} + wG$ satisfies Assumption 8 with gap $\Delta_0^{(H)} = \Delta_0 - w$, provided $\Delta_0 > w$.

Proof. Since $G(\alpha) \in [0,1]$ for all $\alpha \in \Delta_m$,
$$\inf_{\min_i \alpha_i \le \gamma_0} H(\alpha) \ge \inf_{\min_i \alpha_i \le \gamma_0} F_{FD}(\alpha) \ge \min_\alpha F_{FD}(\alpha) + \Delta_0 \ge \min_\alpha H(\alpha) + (\Delta_0 - w),$$
because $\min_\alpha H(\alpha) \le \min_\alpha F_{FD}(\alpha) + w$. ∎

Theorem 5 (Fréchet Distance + linear term: Mixture-Greedy regret). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume the Fréchet Distance conditions from Theorem 2 (boundedness and positive definiteness) and $0 \le \psi_i \le 1$. Assume the population margin Assumption 8. Assume $M$ is large enough that, on an event of probability at least $1 - \delta/3$,
$$\sup_{\alpha \in \Delta_m} |\hat H_t(\alpha) - H(\alpha)| \le \frac{\Delta_0^{(H)}}{4} \quad \text{for all } t \in \{0, 1, \dots, T-1\}.$$
Then, with probability at least $1 - \delta$, the following regret bound holds:
$$\mathrm{Reg}_T(H) \le C_{FD,w} \Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big) \sqrt{T},$$
where $C_{FD,w}$ depends only on $(B, \lambda_0, \nu, m, d, \gamma_0, w)$.

Proof. The proof follows the Fréchet Distance regret proof without the linear term, with $H$ in place of $F_{FD}$. Under the assumed uniform deviation, we know $\sup_\alpha |\hat H_t - H| \le \Delta_0^{(H)}/4$ for all $t \le T-1$. Therefore, the same margin-transfer argument implies that every empirical minimizer satisfies $\min_i \alpha_{t,i} \ge \gamma_0$ for all $t \le T$. Then, Lemma 2 shows that $n_{\min}(t) = \Omega(t)$. On this interior region, the Lipschitz analysis for the Fréchet term yields $\sup_\alpha |\hat F_{t-1}(\alpha) - F_{FD}(\alpha)| \lesssim t^{-1/2}(1 + \sqrt{\log(\cdot)})$, and Lemma 8 yields $w \sup_\alpha |\hat G_{t-1} - G| \lesssim w\, t^{-1/2}(1 + \sqrt{\log(\cdot)})$. Therefore,
$$\sup_\alpha |\hat H_{t-1}(\alpha) - H(\alpha)| \le \frac{C}{\sqrt{t}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big).$$
Lemma 9 converts this into instantaneous regret, and summing $\sum_{t=1}^T t^{-1/2} \le 2\sqrt{T}$ completes the bound. A union bound yields an overall probability of at least $1 - \delta$.
∎

Appendix D. Mixture-Greedy regret for the RKE score and RKE + linear fidelity term

D.1 Setup: feature covariances and the RKE objective

Let $\varphi : \mathcal{X} \to \mathbb{R}^d$ be a feature map with normalized features.

Assumption 9 (Normalized features). For all $x \in \mathcal{X}$, $\|\varphi(x)\|_2 = 1$.

Each arm $i \in [m]$ produces i.i.d. samples $X_{i,r} \sim P_{G_i}$ when pulled. Define the (population) feature covariance
$$S_i := \mathbb{E}\big[\varphi(X_i)\varphi(X_i)^\top\big] \in \mathbb{R}^{d \times d}, \qquad X_i \sim P_{G_i}.$$
Then $S_i \succeq 0$ and $\mathrm{Tr}(S_i) = \mathbb{E}\|\varphi(X_i)\|_2^2 = 1$. For $\alpha \in \Delta_m$, define the mixture covariance $S_\alpha := \sum_{i=1}^m \alpha_i S_i$, which is itself a density matrix: $\mathrm{Tr}(S_\alpha) = 1$ and $S_\alpha \succeq 0$. Consider the following function, whose minimization is equivalent to optimizing RKE:
$$F_{RKE}(\alpha) := \mathrm{Tr}(S_\alpha^2). \qquad (34)$$

D.2 Empirical estimators

Given $n_i(t)$ samples observed from arm $i$ by time $t$, define the empirical covariance
$$\hat S_i(t) := \frac{1}{n_i(t)} \sum_{r=1}^{n_i(t)} \varphi(X_{i,r})\varphi(X_{i,r})^\top, \qquad \hat S_\alpha(t) := \sum_{i=1}^m \alpha_i \hat S_i(t).$$
Define the plug-in estimator of the RKE-minimization objective:
$$\hat F^{RKE}_t(\alpha) := \mathrm{Tr}\big(\hat S_\alpha(t)^2\big). \qquad (35)$$
Mixture-Greedy for RKE uses the same protocol as in the main paper:
$$\alpha_t \in \arg\min_{\alpha \in \Delta_m} \hat F^{RKE}_{t-1}(\alpha), \qquad I_t \sim \alpha_t.$$
The cumulative regret is
$$\mathrm{Reg}_T(F_{RKE}) := \sum_{t=1}^T \Big(F_{RKE}(\alpha_t) - \min_{\alpha \in \Delta_m} F_{RKE}(\alpha)\Big).$$

D.3 Uniform concentration for $\hat S_i$ and a uniform deviation bound for $\hat F^{RKE}$

We use the following concentration statement, proved elsewhere in the appendix (it matches the uniform matrix Hoeffding bound used in the NLV section); we restate it here for readability.

Lemma 12 (from [Sutherland et al., 2018]). Fix $T \ge 1$ and $\delta \in (0,1)$.
Under Assumption 9, with probability at least $1 - \delta$, simultaneously for all $i \in [m]$ and all $n \in \{M, M+1, \dots, M+T\}$,
$$\|\hat S_i(n) - S_i\|_F \le \sqrt{\frac{2}{n}}\Big(1 + \sqrt{2\log\frac{m(T+1)}{\delta}}\Big).$$

Lemma 13. Fix $T \ge 1$ and $\delta \in (0,1)$. On the event of Lemma 12, for every $t \in \{0, 1, \dots, T\}$,
$$\sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le 2\varepsilon_t + \varepsilon_t^2, \qquad \text{where } \varepsilon_t := \max_{i \in [m]} \|\hat S_i(t) - S_i\|_F.$$
In particular, for $t \ge 1$,
$$\sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le \frac{C_{rke}}{\sqrt{n_{\min}(t)}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big),$$
where $n_{\min}(t) := \min_i n_i(t)$ and $C_{rke}$ is a universal numerical constant.

Proof. Fix $t$ and $\alpha \in \Delta_m$, and let $\Delta_\alpha(t) := \hat S_\alpha(t) - S_\alpha$. By convexity of the Frobenius norm,
$$\|\Delta_\alpha(t)\|_F = \Big\|\sum_{i=1}^m \alpha_i \big(\hat S_i(t) - S_i\big)\Big\|_F \le \sum_{i=1}^m \alpha_i \|\hat S_i(t) - S_i\|_F \le \varepsilon_t.$$
Now we can expand
$$\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha) = \mathrm{Tr}\big((S_\alpha + \Delta_\alpha)^2 - S_\alpha^2\big) = 2\,\mathrm{Tr}(S_\alpha \Delta_\alpha) + \mathrm{Tr}(\Delta_\alpha^2).$$
Using $|\mathrm{Tr}(AB)| \le \|A\|_F \|B\|_F$ and $\mathrm{Tr}(\Delta_\alpha^2) = \|\Delta_\alpha\|_F^2$ gives
$$\big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le 2\|S_\alpha\|_F \|\Delta_\alpha\|_F + \|\Delta_\alpha\|_F^2.$$
Since $S_\alpha \succeq 0$ and $\mathrm{Tr}(S_\alpha) = 1$, we have $\|S_\alpha\|_F \le \mathrm{Tr}(S_\alpha) = 1$. Therefore,
$$\big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le 2 \cdot 1 \cdot \varepsilon_t + \varepsilon_t^2 = 2\varepsilon_t + \varepsilon_t^2.$$
Taking the supremum over $\alpha \in \Delta_m$ yields the first claim. For the second claim, apply Lemma 12 with $n = n_{\min}(t)$ to obtain $\varepsilon_t \le \sqrt{2/n_{\min}(t)}\,\big(1 + \sqrt{2\log\frac{m(T+1)}{\delta}}\big)$, and substitute into $2\varepsilon_t + \varepsilon_t^2$, absorbing the $\varepsilon_t^2$ term into the same rate. ∎

The need for an explicit population margin. Because $F_{RKE}(\alpha) = \mathrm{Tr}(S_\alpha^2)$ is smooth on $\Delta_m$ and can admit boundary minimizers, pure Mixture-Greedy may legitimately assign vanishing weights to some arms unless additional structure is imposed.
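As a concrete illustration of the plug-in objective in equation 35 and the expansion used in Lemma 13, the following minimal pure-Python sketch (the helper names and the toy unit-trace matrices are our own illustrative choices) evaluates $F_{RKE}$ for a two-arm mixture and checks the perturbation bound $|\mathrm{Tr}((S+\Delta)^2) - \mathrm{Tr}(S^2)| \le 2\|S\|_F\|\Delta\|_F + \|\Delta\|_F^2$:

```python
def frob(A):
    # Frobenius norm of a matrix given as a list of rows
    return sum(x * x for row in A for x in row) ** 0.5

def trace_sq(A):
    # Tr(A^2); for symmetric A this equals the squared Frobenius norm
    n = len(A)
    return sum(A[i][j] * A[j][i] for i in range(n) for j in range(n))

def f_rke(alpha, covs):
    # F_RKE(alpha) = Tr(S_alpha^2) with S_alpha = sum_i alpha_i S_i  (equation 34)
    d = len(covs[0])
    S = [[sum(a * C[i][j] for a, C in zip(alpha, covs)) for j in range(d)]
         for i in range(d)]
    return trace_sq(S)

# two toy unit-trace PSD matrices (d = 2) and a mixture weight
S1 = [[0.9, 0.1], [0.1, 0.1]]
S2 = [[0.3, 0.2], [0.2, 0.7]]
alpha = [0.4, 0.6]
S = [[alpha[0] * S1[i][j] + alpha[1] * S2[i][j] for j in range(2)] for i in range(2)]
print(round(f_rke(alpha, [S1, S2]), 6))  # equals Tr(S^2)

# perturbation check behind Lemma 13:
# |Tr((S+D)^2) - Tr(S^2)| <= 2 ||S||_F ||D||_F + ||D||_F^2
D = [[0.02, -0.01], [-0.01, -0.02]]
SpD = [[S[i][j] + D[i][j] for j in range(2)] for i in range(2)]
lhs = abs(trace_sq(SpD) - trace_sq(S))
rhs = 2 * frob(S) * frob(D) + frob(D) ** 2
assert lhs <= rhs + 1e-12
```

The inequality holds for any symmetric perturbation by Cauchy-Schwarz on the trace inner product, which is exactly the step that lets Lemma 13 absorb the empirical deviation into a $2\varepsilon_t + \varepsilon_t^2$ rate.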
We therefore assume a population interiority margin, exactly as in the Fréchet Distance analysis.

Assumption 10 (Population interiority margin for RKE-minimization). There exist $\gamma_0 \in (0, 1/m]$ and $\Delta_0 > 0$ such that
$$\inf_{\alpha \in \Delta_m:\ \min_i \alpha_i \le \gamma_0} F_{RKE}(\alpha) \ \ge\ \min_{\alpha \in \Delta_m} F_{RKE}(\alpha) + \Delta_0.$$

Lemma 14. Fix $T \ge 1$. Suppose that for all $t \in \{0, 1, \dots, T-1\}$,
$$\sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le \frac{\Delta_0}{4}.$$
Then for every $t \in \{0, 1, \dots, T-1\}$, every minimizer $\alpha_{t+1} \in \arg\min_{\alpha \in \Delta_m} \hat F^{RKE}_t(\alpha)$ satisfies $\min_i \alpha_{t+1, i} \ge \gamma_0$.

Proof. Fix $t$, let $\alpha_{t+1}$ minimize $\hat F^{RKE}_t$, and let $\alpha^\star$ minimize $F_{RKE}$. Then
$$F_{RKE}(\alpha_{t+1}) \le \hat F^{RKE}_t(\alpha_{t+1}) + \frac{\Delta_0}{4} \le \hat F^{RKE}_t(\alpha^\star) + \frac{\Delta_0}{4} \le F_{RKE}(\alpha^\star) + \frac{\Delta_0}{2}.$$
Thus $F_{RKE}(\alpha_{t+1}) \le \min_\alpha F_{RKE}(\alpha) + \Delta_0/2$. By Assumption 10, any $\alpha$ with $\min_i \alpha_i \le \gamma_0$ has $F_{RKE}(\alpha) \ge \min_\alpha F_{RKE}(\alpha) + \Delta_0$, hence $\alpha_{t+1}$ cannot lie in that boundary layer. Therefore, $\min_i \alpha_{t+1, i} \ge \gamma_0$. ∎

D.4 RKE regret bound (pure RKE-minimization)

We use the same ERM-to-regret comparison lemma as elsewhere in the appendix, stated here for completeness.

Lemma 15. If $\alpha_t \in \arg\min_{\alpha \in \Delta_m} \hat F^{RKE}_{t-1}(\alpha)$, then
$$F_{RKE}(\alpha_t) - \min_{\alpha \in \Delta_m} F_{RKE}(\alpha) \le 2 \sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_{t-1}(\alpha) - F_{RKE}(\alpha)\big|.$$

Proof. This is the standard ERM comparison: let $\alpha^\star \in \arg\min_\alpha F_{RKE}(\alpha)$, use $\hat F^{RKE}_{t-1}(\alpha_t) \le \hat F^{RKE}_{t-1}(\alpha^\star)$, then add and subtract. ∎

Theorem 6 (Mixture-Greedy regret for RKE-minimization). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume Assumption 9 and the population margin Assumption 10. Assume the warm start $M$ is adequately large so that the uniform deviation event
$$\sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_t(\alpha) - F_{RKE}(\alpha)\big| \le \frac{\Delta_0}{4} \quad \text{for all } t \in \{0, 1, \dots, T-1\}$$
holds on an event of probability at least $1 - \delta/3$.
Then with probability at least $1 - \delta$,
$$\mathrm{Reg}_T(F_{RKE}) \le C_{RKE} \Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big) \sqrt{T},$$
where $C_{RKE}$ depends only on $(m, \gamma_0)$ (and universal numerical constants).

Proof. On the event that the uniform deviation bound is at most $\Delta_0/4$ for all $t \le T-1$, Lemma 14 implies $\min_i \alpha_{t,i} \ge \gamma_0$ for all $t \le T$. Applying the standard sampling-count lemma (Lemma 2) with $\gamma = \gamma_0$ yields $n_{\min}(t) = \Omega(t)$ with probability at least $1 - \delta/3$, after allocating failure probability and union bounding over arms and times. On the intersection of this count event and the concentration event of Lemma 12, Lemma 13 yields, for all $t \ge 2$,
$$\sup_{\alpha \in \Delta_m} \big|\hat F^{RKE}_{t-1}(\alpha) - F_{RKE}(\alpha)\big| \le \frac{C}{\sqrt{t-1}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big),$$
where we used $n_{\min}(t-1) = \Omega(t-1)$. Lemma 15 converts this into instantaneous regret for rounds $t \ge 2$, and summing $\sum_{t=2}^T (t-1)^{-1/2} \le 2\sqrt{T}$ gives the stated $O(\sqrt{T})$ bound (with the initial $t = 1$ warm-start round absorbed into the constant). A union bound over the deviation and count events yields probability at least $1 - \delta$. ∎

D.5 RKE + bounded linear term

We now combine $F_{RKE}$ with the bounded linear fidelity term exactly as in Appendix C. We reuse the definitions
$$G(\alpha) = \sum_{i=1}^m \alpha_i \theta_i, \qquad \theta_i = \mathbb{E}[\psi_i(X)] \in [0,1], \qquad \psi_i : \mathcal{X} \to [0,1], \qquad H(\alpha) = F_{RKE}(\alpha) + w\,G(\alpha),$$
and the plug-in estimator $\hat H_t(\alpha) = \hat F^{RKE}_t(\alpha) + w\,\hat G_t(\alpha)$, with $\hat G_t$ and $\hat\theta_i$ as in Appendix C.

Assumption 11 (Population interiority margin for $H = F_{RKE} + wG$). There exist $\gamma_0 \in (0, 1/m]$ and $\Delta_0^{(H)} > 0$ such that
$$\inf_{\alpha \in \Delta_m:\ \min_i \alpha_i \le \gamma_0} H(\alpha) \ \ge\ \min_{\alpha \in \Delta_m} H(\alpha) + \Delta_0^{(H)}.$$

Lemma 16. If Assumption 10 holds with gap $\Delta_0$ and $\Delta_0 > w$, then Assumption 11 holds with $\Delta_0^{(H)} = \Delta_0 - w$.

Proof.
Since $G(\alpha) \in [0,1]$ for all $\alpha \in \Delta_m$,
$$\inf_{\min_i \alpha_i \le \gamma_0} H(\alpha) \ge \inf_{\min_i \alpha_i \le \gamma_0} F_{RKE}(\alpha) \ge \min_\alpha F_{RKE}(\alpha) + \Delta_0 \ge \min_\alpha H(\alpha) + (\Delta_0 - w),$$
because $\min_\alpha H(\alpha) \le \min_\alpha F_{RKE}(\alpha) + w$. ∎

Theorem 7 (Mixture-Greedy regret for RKE + linear term). Fix $T \ge 1$ and $\delta \in (0,1)$. Assume Assumption 9, $0 \le \psi_i \le 1$, and the population margin Assumption 11. Assume the warm start $M$ is sufficiently large so that, on an event of probability at least $1 - \delta/3$,
$$\sup_{\alpha \in \Delta_m} \big|\hat H_t(\alpha) - H(\alpha)\big| \le \frac{\Delta_0^{(H)}}{4} \quad \text{for all } t \in \{0, 1, \dots, T-1\}.$$
Then with probability at least $1 - \delta$,
$$\mathrm{Reg}_T(H) \le C_{RKE,w} \Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big) \sqrt{T},$$
where $C_{RKE,w}$ depends only on $(m, \gamma_0, w)$ (and universal numerical constants).

Proof. On the event $\sup_\alpha |\hat H_t - H| \le \Delta_0^{(H)}/4$ for all $t \le T-1$, the same argument as Lemma 14 (with $H$ in place of $F_{RKE}$) implies $\min_i \alpha_{t,i} \ge \gamma_0$ for all $t \le T$. Then Lemma 2 yields $n_{\min}(t) = \Omega(t)$ with high probability. By Lemma 9 (from Appendix C),
$$H(\alpha_t) - \min_\alpha H(\alpha) \le 2 \sup_\alpha |\hat H_{t-1}(\alpha) - H(\alpha)|.$$
Moreover, by the triangle inequality and linearity,
$$\sup_\alpha |\hat H_{t-1}(\alpha) - H(\alpha)| \le \sup_\alpha \big|\hat F^{RKE}_{t-1}(\alpha) - F_{RKE}(\alpha)\big| + w \sup_\alpha \big|\hat G_{t-1}(\alpha) - G(\alpha)\big|.$$
The first term is bounded by Lemma 13 together with $n_{\min}(t-1) = \Omega(t-1)$; the second term is bounded by Lemma 8. Therefore, on the intersection event, for all $t \ge 2$,
$$\sup_\alpha |\hat H_{t-1}(\alpha) - H(\alpha)| \le \frac{C}{\sqrt{t-1}}\Big(1 + \sqrt{\log\frac{m(T+1)}{\delta}}\Big).$$
Summing the instantaneous regret over $t = 2, \dots, T$ uses $\sum_{t=2}^T (t-1)^{-1/2} \le 2\sqrt{T}$, which completes the $O(\sqrt{T})$ bound (with the initial $t = 1$ warm-start round absorbed into the constant). A union bound yields an overall probability of at least $1 - \delta$. ∎

Appendix E. Additional Numerical Results

Effect of the UCB Coefficient.
In Figure 6, for each dataset we conducted two independent experiments corresponding to two distinct optimization objectives, following [Rezaei et al., 2025]. In the first experiment, the objective is to minimize the KD metric; in the second, the objective is to maximize the RKE metric. As a performance upper bound, the Mixture Oracle is introduced as a baseline: for a given objective, the oracle is provided the corresponding optimal mixture weights $\alpha^\star$ in advance, and arms are sampled according to this fixed distribution. The optimal mixture $\alpha^\star$ is obtained by solving the quadratic program using a large number of samples from each arm. We analyze the impact of the exploration coefficient $\delta_L$ in the UCB term for both objectives. Empirically, decreasing $\delta_L$ leads to faster convergence toward the corresponding Mixture Oracle; when $\delta_L = 0$, corresponding to the Mixture Greedy strategy, convergence is fastest.

E.1 Synthetic Settings

Text-to-image setting. In Figure 7, we constructed a controlled text-to-image generation scenario using Stable Diffusion XL (SDXL) [Podell et al., 2023a]. We generated samples corresponding to five dog breeds: poodle, bulldog, german shepherd, golden retriever, and havanese. Each breed-specific prompt defines one arm in our framework. The objective in this experiment is to optimize a diversity metric, namely the Vendi Score [Friedman and Dieng, 2023]. Rather than optimizing the Vendi Score directly, we optimize its logarithm, which is a convex function of the mixture weights, using exponentiated gradient descent (EGD). We additionally report results for the One-Arm Greedy algorithm, which at each round selects the arm with the highest current Vendi Score, as well as an $\epsilon$-greedy baseline with $\epsilon = 0.1$. In Figure 5, we evaluate three text-to-image generators, Kandinsky [Razzhigaev et al., 2023], PixArt-$\alpha$ [Chen et al., 2023], and SDXL Podell et al.
[2023b], using the prompt: "Generate a red cartoony bird." We then apply the Mixture Greedy algorithm to adaptively select among these models in order to maximize the Vendi Score of the generated samples. As in our previous experiments, the algorithm rapidly converges toward the optimal mixture, leading to a substantial increase in Vendi Score compared to any individual model. This further supports the benefit of mixture optimization for enhancing generative diversity.

Text-to-text setting. In Figure 8, we consider three large language models, Qwen 2 [Qwen-Team, 2024], Gemma 3 [Gemma-Team, 2025], and Llama 3.2 [Aaron Grattafiori and others, 2024], prompted with: "Write a short sentence about a vibrant city in the U.S." We then apply the Mixture Greedy algorithm to adaptively select among the models so as to maximize the Vendi Score of the aggregated outputs. The resulting mixture achieves a higher Vendi Score than any individual model evaluated in isolation. This demonstrates that adaptively combining multiple LLMs can yield greater diversity, as measured by the Vendi Score, than relying on a single model alone.

Figure 4: Convergence of FD for generative models on the ImageNet dataset. Figure 5: Application of Mixture Greedy on the generated red-birds dataset to optimize the Vendi Score. Figure 6: Comparison of KD and RKE convergence across different datasets. Figure 7: Application of Mixture Greedy on the generated dog-breeds dataset to optimize the Vendi Score. Figure 8: Application of Mixture Greedy on LLM-generated texts to optimize the Vendi Score.

E.2 Ablation Studies

E.2.1 Number of Random Fourier Features and Cosine Kernel

To evaluate the robustness of Vendi Score optimization with respect to kernel approximation, we use the Gaussian kernel with Random Fourier Features (RFF) at $R \in \{256, 512, 1024\}$, and additionally compare against the cosine kernel. As shown in Figure 9, the results are consistent across all settings.
On both datasets, Mixture Greedy consistently obtains the highest Vendi Score, outperforming One Arm Greedy and One Arm Epsilon Greedy for all values of $R$. Increasing $R$ changes the curves only slightly and does not affect the relative ranking of the methods. The same trend also holds for the cosine kernel. These results indicate that our method is robust to both the number of random Fourier features and the choice of kernel. Its advantage is preserved for low- and high-dimensional RFF approximations, as well as when using the cosine kernel instead of the Gaussian kernel.

E.2.2 Kernel Bandwidth

We evaluate three kernel bandwidths, $\sigma \in \{20, 30, 40\}$, to examine the sensitivity of RKE and KID to this hyperparameter. Figure 10 shows that the overall conclusions are consistent across all bandwidths. In all cases, RKE increases with training while KID decreases, indicating stable convergence behavior. Across the three bandwidths, the relative ordering of the methods is unchanged: Mixture Oracle performs best, Mixture Greedy is consistently close to the oracle, and Mixture UCB performs worst. For RKE, all methods improve rapidly in the early stage and then plateau, while for KID they decrease sharply at first and then gradually stabilize. Although the absolute values of RKE and KID vary with $\sigma$, the same qualitative trend is preserved. This ablation suggests that our method is robust to the choice of kernel bandwidth. In particular, the strong performance of Mixture Greedy relative to Mixture UCB, and its small gap to the Mixture Oracle, holds uniformly for $\sigma = 20$, $30$, and $40$.

E.2.3 Feature Extractors

To test the robustness of our results to the underlying representation, we compute RKE and KID using three different feature extractors: Inception [Szegedy et al., 2016], CLIP [Radford et al., 2021a], and DINO-v2 [Oquab et al., 2024]. Figure 11 shows that the conclusions are consistent across all three feature spaces.
In every case, RKE increases and KID decreases over training, indicating stable convergence. The relative ranking of the methods is also unchanged: Mixture Oracle performs best, Mixture Greedy remains very close to it, and Mixture UCB performs worst. Although the absolute metric values vary across feature extractors, the same qualitative behavior is preserved. This demonstrates that our method is robust to the choice of feature representation, and that the strong performance of Mixture Greedy is not tied to a particular feature extractor.

Figure 9: Vendi Score over rounds using RFF ($R = 256/512/1024$) and the cosine kernel on two generated datasets. Figure 10: Convergence of KD and RKE for generative models trained on the FFHQ dataset, using three kernel bandwidths. Figure 11: Convergence of RKE and KD for FFHQ generative models, using different feature extractors.
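For reference, the exponentiated gradient descent (EGD) update used in Section E.1 to minimize a convex objective over the mixture simplex can be sketched as follows. This is a minimal illustration under our own assumptions: the toy quadratic objective (standing in for a diversity objective such as log-Vendi or RKE), the step size, and the name `egd_minimize` are illustrative, not the paper's exact configuration.

```python
import math

def egd_minimize(grad, alpha, lr=0.1, steps=300):
    # exponentiated gradient descent on the probability simplex:
    # a multiplicative update followed by renormalization keeps alpha feasible
    for _ in range(steps):
        g = grad(alpha)
        w = [a * math.exp(-lr * gi) for a, gi in zip(alpha, g)]
        z = sum(w)
        alpha = [x / z for x in w]
    return alpha

# toy convex objective F(alpha) = sum_i c_i alpha_i^2; its gradient is
# 2 c_i alpha_i, and the minimizer over the simplex is proportional to 1/c_i
c = [1.0, 2.0, 4.0]
grad = lambda alpha: [2.0 * ci * ai for ci, ai in zip(c, alpha)]
alpha_star = egd_minimize(grad, [1 / 3] * 3)
print([round(a, 3) for a in alpha_star])
```

For this toy cost the simplex-constrained minimizer is $\alpha_i \propto 1/c_i$, i.e., $(4/7, 2/7, 1/7)$, and the multiplicative update converges to it because the weighted gradients $c_i \alpha_i$ equalize at the fixed point.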