Paper deep dive
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Hayeon Kim, Ji Ha Jang, Junghun James Kim, Se Young Chun
Intelligence
Summary
The paper introduces UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA), a framework for Vision-Language Models (VLMs) that improves hierarchical and compositional understanding. By modeling part-to-whole semantic representativeness as hyperbolic uncertainty, UNCHA assigns adaptive weights to contrastive and entailment losses, allowing the model to distinguish between representative and less representative parts of a scene. This approach achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label benchmarks.
Entities (6)
Relation Signals (3)
UNCHA → enhances → Hyperbolic Vision-Language Models
confidence 95% · We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs.
UNCHA → outperforms → MERU
confidence 90% · We demonstrated that UNCHA outperforms prior hyperbolic VLMs [49, 54, 10] in diverse downstream tasks
UNCHA → uses → Lorentz Model
confidence 90% · Among several equivalent models, we adopt the Lorentz (or hyperboloid) model for embedding.
Cypher Suggestions (2)
Find all methods that enhance Hyperbolic Vision-Language Models · confidence 90% · unvalidated
MATCH (m:Method)-[:ENHANCES]->(v:ModelArchitecture {name: 'Hyperbolic Vision-Language Models'}) RETURN m.name
Map the relationship between models and their underlying geometric frameworks · confidence 85% · unvalidated
MATCH (m:Model)-[:USES_GEOMETRY]->(g:GeometryFramework) RETURN m.name, g.name
Abstract
While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
Tags
Links
- Source: https://arxiv.org/abs/2603.22042v2
- Canonical: https://arxiv.org/abs/2603.22042v2
Full Text
Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models
Hayeon Kim1,∗ Ji Ha Jang1,∗ Junghun James Kim2 Se Young Chun1,2,†
1 Dept. of Electrical and Computer Engineering, 2 INMC & IPAI, Seoul National University, Republic of Korea
{khy5630, jeeit17, jonghean12, sychun}@snu.ac.kr
∗Authors contributed equally. †Corresponding author.
Abstract
While Vision-Language Models (VLMs) have achieved remarkable performance, their Euclidean embeddings remain limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and often face challenges in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structures and modeling part-whole relations (i.e., whole scene and its part images) through entailment. However, existing approaches do not model that each part has a different level of semantic representativeness to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty, by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This representativeness is then incorporated into the contrastive objective with uncertainty-guided weights. Finally, the uncertainty is further calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
1 Introduction Understanding hierarchical structures is essential for capturing complex compositional information efficiently. As well established in cognitive science, human perception relies on part-whole hierarchies [25, 26], enabling generalization by interpreting new inputs through known relational structures [26, 30, 67]. Such hierarchical representations also improve information compression, classification, and inference efficiency [69, 8, 48, 16]. Vision-Language Models (VLMs) such as CLIP [53], ALIGN [31], and ALBEF [39] have demonstrated remarkable performance in image-text matching and shown strong versatility across various downstream tasks. However, owing to their reliance on Euclidean geometry, these models often face distortion of hierarchical structure and dimensionality trade-offs in capturing hierarchical or complex relational structures [21, 65, 48]. Moreover, CLIP has been reported to exhibit bias and difficulty with compositional relations in complex multi-object scenes [1], which is partly due to the lack of modeling part-whole relations. Hyperbolic space, characterized by constant negative curvature and exponential volume growth, provides an efficient geometric foundation for embedding hierarchical and fine-grained relational structures. Motivated by these properties, recent studies [35, 5, 58, 11, 49, 54, 10] have explored hyperbolic geometry in vision-language learning. MERU [10] extended contrastive vision-language learning into hyperbolic space by explicitly modeling entailment relations between text and image pairs. ATMG [54] later demonstrated that proximity-based contrastive losses can hinder hierarchical structure learning and proposed an angle-based alternative. HyCoCLIP [49] extended entailment modeling beyond inter-modal image-text relations by including intra-modal part-whole relationships. 
Although hyperbolic approaches have demonstrated improved performance in hierarchy-aware representation learning, they do not model that each part has a different level of semantic representativeness to the whole. In other words, they do not account for the varying degree to which each part is semantically representative of the whole. As illustrated in Fig. 1, part images differ substantially in how well they represent the whole scene. When all parts are treated equally, the model may not appropriately distinguish more representative parts from less representative ones for the whole scene, often leading to degraded multi-object alignment and inefficient utilization of the embedding space [54, 49]. Figure 1: Varying representativeness of part images to whole scene. The relationship between each part image and the whole scene varies with its representativeness. We model this varying representativeness as uncertainty, enabling uncertainty-guided part–whole alignment in hyperbolic space. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness by assigning lower uncertainty to more representative parts and higher uncertainty to less representative ones for the whole scene. This design is grounded in prior findings [2, 15, 72, 46] showing that hyperbolic radius correlates with factors such as abstractness or uncertainty. Then, we incorporate uncertainty as part-to-whole semantic representativeness into both contrastive and entailment loss. Specifically, we incorporate uncertainty into the contrastive objective by assigning part-dependent temperature or uncertainty-guided weights, thereby modulating the strength of each part’s alignment with the whole. 
For the entailment loss, uncertainty is further calibrated based on the degree of part-to-whole entailment, and the entropy-based regularizer is also adapted to stabilize uncertainty estimates and promote richer use of the embedding space. By continually training with the proposed losses, UNCHA progressively strengthens the semantic relationship across parts and wholes, leading to more accurate part-whole ordering in hyperbolic embeddings, capturing the underlying compositional structure in an image and improving its understanding of complex multi-object scenes. We demonstrated that UNCHA outperforms prior hyperbolic VLMs [49, 54, 10] in diverse downstream tasks such as zero-shot image classification, retrieval, and a range of compositional and multi-object benchmarks, validating UNCHA's modeling of part-to-whole semantic representativeness and its capability for more faithful compositional understanding. Our embedding space analysis further confirms UNCHA's more discriminative and efficient use of part-to-whole modeling. The contributions of this work are summarized as:
• We propose UNCHA, an uncertainty-guided compositional alignment with part-to-whole semantic representativeness, enabling hierarchy-aware and compositional representation learning for hyperbolic VLMs.
• We model part-to-whole semantic representativeness with hyperbolic uncertainty, designing uncertainty-guided contrastive and entailment losses for uncertainty calibration, regularized by entropy to adaptively reflect part–whole relations.
• We evaluated UNCHA on diverse benchmarks, demonstrating that it achieves superior performance over prior arts in downstream tasks such as retrieval, zero-shot and multi-object classification, validating the effectiveness of our uncertainty-guided compositional alignment.
2 Related Works

2.1 Vision-language models

Vision-Language Models (VLMs) have demonstrated strong capability in aligning image and text representations within a shared semantic space, achieving remarkable performance across tasks such as image-text retrieval and zero-shot image classification. The foundations of these models trace back to early studies on vision-language representation learning such as image retrieval, image captioning, and visual grounding, where joint embedding spaces are learned under task-specific supervision to associate visual content with linguistic semantics [44, 27, 24, 36, 55, 68]. More recently, CLIP [53] introduced a contrastive objective for aligning the two modalities using paired image-text data, achieving strong zero-shot and cross-modal performance [17, 52, 57, 32, 59]. ALIGN [31] and ALBEF [39] further extend CLIP by scaling up weak supervision and incorporating enhanced alignment-fusion strategies to better exploit large-scale, noisy datasets. However, the inherent limitations of Euclidean space make it difficult to represent hierarchical relationships effectively [48, 28, 50]. Moreover, CLIP has been shown to exhibit biases in complex multi-object scenes [1]. Its text encoder tends to emphasize the object mentioned first in the caption, while its image encoder focuses on larger objects, which hinders performance in multi-object settings. In contrast, hyperbolic space naturally provides continuous tree-like structures that support hierarchical embedding. However, when hierarchical relationships are handled without distinguishing their varying part-to-whole representativeness, the embeddings tend to lose meaningful structural separation and collapse toward a narrow region [54, 49]. To address this, we introduce a part-to-whole uncertainty-guided alignment framework and explicitly model diverse part-whole entailment relationships within and across modalities, thereby enhancing compositional understanding.
2.2 Hyperbolic representation learning

Hyperbolic space has emerged as an intriguing alternative for embedding hierarchies in representation learning. Its exponential volume growth and tree-like geometry enable near distortion-free hierarchical embeddings [16, 58], providing an efficient representation for hierarchical structures. Consequently, numerous studies have leveraged hyperbolic geometry for representing text [61, 11, 38], images [35, 72, 2], and graphs [41, 7, 60]. Recently, hyperbolic space has been integrated into foundation models to better capture hierarchical, compositional, and multi-modal structures at scale, enabling more expressive representations [22, 10, 54, 49, 23, 46]. MERU [10] first introduced hyperbolic vision–language models by employing an additional entailment loss [16, 38] inspired by order embeddings [65] to reflect the informativeness of different modalities. ATMG [54] addressed the hierarchical distortion and modality gap caused by spatial proximity–based contrastive learning by introducing an angle-based metric for image-text alignment in hyperbolic space. HyCoCLIP [49] further incorporated intra-modal relationships by considering box images and their corresponding texts. However, it does not differentiate the varying strengths of these relationships, resulting in limited distinction among parts. Several studies have explored the use of the hyperbolic radius, the distance between an embedding and the origin, as a proxy for concept abstractness or uncertainty [2, 15, 72, 46]. The hyperbolic radius naturally provides uncertainty estimation and boundary awareness in pixel-level classification [2, 15], image retrieval [72], and multi-modal language understanding [46], where it serves as an implicit indicator of confidence. Building on this property, we leverage the hyperbolic radius to better encode hierarchical structures in VLMs and utilize entailment relationships for effective uncertainty calibration.
An entropy-based regularizer further stabilizes the calibrated uncertainty, enabling more efficient use of the embedding space.

3 Method

Figure 2: Comparison of UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA, Ours) with prior works. MERU [10] models inter-modal entailment between whole scene image and text representations. HyCoCLIP [49] extends this to include intra-modal entailment between part and whole scene representations. UNCHA (Ours) further incorporates uncertainty to quantify the semantic representativeness of each part, enabling uncertainty-guided part–whole alignment via adaptive weighting in the contrastive objectives and uncertainty calibration through the entailment loss. In addition, entropy regularization is applied in uncertainty calibration to ensure consistent and balanced utilization of the hyperbolic embedding space across varying uncertainty levels and modalities.

3.1 Preliminaries

Hyperbolic space is a non-Euclidean geometry with constant negative curvature $-\kappa$, where $\kappa \in \mathbb{R}^+$. Among several equivalent models, we adopt the Lorentz (or hyperboloid) model for embedding. A vector $p \in \mathbb{R}^{n+1}$ can be expressed in the form $[p_{\mathrm{time}}, p_{\mathrm{space}}]$, where $p_{\mathrm{space}} \in \mathbb{R}^n$ and $p_{\mathrm{time}} \in \mathbb{R}$. The Lorentzian inner product between two vectors $p, q \in \mathbb{R}^{n+1}$ is defined as:

$$\langle p, q \rangle_L = -p_{\mathrm{time}}\, q_{\mathrm{time}} + \langle p_{\mathrm{space}}, q_{\mathrm{space}} \rangle, \quad (1)$$

where $\langle \cdot, \cdot \rangle$ denotes the Euclidean inner product. The $n$-dimensional Lorentz manifold $L^n$ is defined as the upper sheet of a two-sheeted hyperboloid in $(n+1)$-dimensional Minkowski space:

$$L^n = \left\{ p \in \mathbb{R}^{n+1} \;\middle|\; \langle p, p \rangle_L = -\tfrac{1}{\kappa},\; \kappa > 0 \right\}. \quad (2)$$

The geodesic distance between two points $p, q$ on the $n$-dimensional Lorentz manifold $L^n$ is:

$$d_L(p, q) = \tfrac{1}{\sqrt{\kappa}} \cosh^{-1}\!\left(-\kappa \langle p, q \rangle_L\right). \quad (3)$$

The hyperbolic radius of the embedding $p$ is defined as the geodesic distance from the origin of the hyperboloid $o$, i.e., $d_L(p, o)$.
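As a concrete reference, the Lorentz-model operations of Eqs. 1–3 can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's code; `lift_to_hyperboloid` is a helper name of our own that recovers the time component of a space vector so the point satisfies the manifold constraint of Eq. 2.

```python
import numpy as np

def lorentz_inner(p, q):
    """Lorentzian inner product (Eq. 1): -p_time*q_time + <p_space, q_space>."""
    return -p[0] * q[0] + np.dot(p[1:], q[1:])

def lift_to_hyperboloid(x_space, kappa=1.0):
    """Recover the time component so that <p,p>_L = -1/kappa (Eq. 2),
    placing the point on the upper sheet of the hyperboloid L^n."""
    t = np.sqrt(1.0 / kappa + np.dot(x_space, x_space))
    return np.concatenate(([t], x_space))

def geodesic_dist(p, q, kappa=1.0):
    """Geodesic distance on L^n (Eq. 3)."""
    inner = np.clip(-kappa * lorentz_inner(p, q), 1.0, None)  # guard arccosh domain
    return np.arccosh(inner) / np.sqrt(kappa)

# The hyperbolic radius of p is its geodesic distance to the origin.
origin = lift_to_hyperboloid(np.zeros(3))
p = lift_to_hyperboloid(np.array([0.3, -0.4, 0.5]))
radius = geodesic_dist(origin, p)
```

Note the `clip` guard: floating-point error can push the arccosh argument slightly below 1 for nearly coincident points, so clamping keeps the distance well defined.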
The tangent space at the point $z \in L^n$ is defined as:

$$T_z L^n = \left\{ v \in \mathbb{R}^{n+1} : \langle z, v \rangle_L = 0 \right\}, \quad (4)$$

which consists of Euclidean vectors $v$ orthogonal to $z$ under the Lorentzian inner product. The exponential map projects a tangent vector $v \in T_z L^n$ onto the manifold as below:

$$\exp_z^\kappa(v) = \cosh\!\left(\sqrt{\kappa}\,\|v\|_L\right) z + \frac{\sinh\!\left(\sqrt{\kappa}\,\|v\|_L\right)}{\sqrt{\kappa}\,\|v\|_L}\, v. \quad (5)$$

Conversely, the logarithmic map sends a point $p \in L^n$ back to the tangent space at $z$ as below:

$$\log_z^\kappa(p) = \frac{\cosh^{-1}\!\left(-\kappa \langle z, p \rangle_L\right)}{\sqrt{\left(\kappa \langle z, p \rangle_L\right)^2 - 1}}\, \mathrm{proj}_z(p) \quad (6)$$

where $\mathrm{proj}_z(p) = p + \kappa \langle z, p \rangle_L\, z$. Here, we consider the case where $z$ corresponds to the origin of the hyperboloid, $o = [1/\sqrt{\kappa}, \mathbf{0}]$. In this setting, the time component of vectors in the tangent Euclidean space can be treated as zero, allowing us to parameterize the space component only, which is consistent with the design of prior works [10, 49, 54].

3.2 Uncertainty-guided hyperbolic alignment

Revisiting prior arts in hyperbolic alignment. Prior hyperbolic VLMs [54, 10, 49] extend contrastive vision-language learning by defining entailment relationships. In this hyperbolic geometry, abstract concepts tend to lie closer to the origin and specific ones farther out, with each specific concept constrained to its parent's entailment cone (see Sec. 3.2.3 for details). As illustrated in Fig. 2, MERU [10] incorporates an image-text entailment objective following partial-order embeddings [65], where text is considered more abstract than the image. HyCoCLIP [49] extends this idea by modeling intra-modal alignment, assuming that a part image is more abstract than its corresponding whole scene.

Method overview.

3.2.1 Uncertainty model of semantic representativeness

We leverage the geodesic distance from the origin (radius) in hyperbolic space [2, 15, 72, 46] to quantify the part-to-whole semantic representativeness using hyperbolic uncertainty.
Since more abstract concepts are typically located near the origin and more specific ones farther away, this measure naturally reflects representativeness. Thus, we design the hyperbolic uncertainty to assign lower uncertainty to parts that are more representative of the whole scene, and higher uncertainty otherwise (e.g., part images). As shown in Fig. 4, our estimated uncertainty aligns well with semantic representativeness, indicating that the model effectively captures the varying part-to-whole relationships. Specifically, for a point $x \in L^n$, the Euclidean norm of $x$ is monotonically related to its hyperbolic radius (see the supplementary material Sec. S.2.3.1). Accordingly, we define the uncertainty $u$ as follows:

$$u(x) = \log\!\left(1 + \exp\!\left(-\|x\|_2\right)\right). \quad (7)$$

Since points near the origin correspond to higher semantic uncertainty, the hyperbolic radius is inversely monotonically related to uncertainty. Eq. 7 is a smooth monotonic transformation of the hyperbolic radius, yielding a differentiable, well-behaved uncertainty measure with good numerical stability.

3.2.2 Uncertainty-guided contrastive loss

In image–text pretraining, contrastive objectives are commonly employed to align multi-modal representations. Following prior works [10, 49], we adopt the negative Lorentzian distance as the similarity measure as below:

$$L_c^*(i, t; \tau) = -\sum_i \log \frac{\exp\!\left(-d_L(i_i, t_i)/\tau\right)}{\sum_{k \neq i} \exp\!\left(-d_L(i_i, t_k)/\tau\right)} \quad (8)$$

where the $i$-th image embedding $i_i$ and its corresponding text embedding $t_i$ form a positive pair, all other text embeddings $t_k$ with $k \neq i$ are treated as negatives in the batch of size $B$, and the temperature parameter $\tau$ controls the scaling of similarities.
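A minimal NumPy sketch of the uncertainty measure (Eq. 7), the distance-based contrastive loss (Eq. 8), and the uncertainty-scaled temperature used below (Eq. 10). Assumptions: the batch matrix `dists` of pairwise Lorentzian distances is precomputed, the denominator follows Eq. 8 as written (negatives $k \neq i$ only), and `tau_gl=0.1` is a placeholder value, not the paper's setting.

```python
import numpy as np

def uncertainty(x):
    """Eq. 7: u(x) = log(1 + exp(-||x||_2)), a softplus of the negative norm.
    Embeddings far from the origin (large hyperbolic radius) get low uncertainty."""
    return np.log1p(np.exp(-np.linalg.norm(x, axis=-1)))

def contrastive_loss(dists, tau):
    """Eq. 8 over a batch: dists[i, k] = d_L(image_i, text_k). The diagonal holds
    positives; the denominator sums over negatives k != i, as written."""
    B = dists.shape[0]
    tau = np.broadcast_to(np.asarray(tau, dtype=float), (B,))
    logits = -dists / tau[:, None]   # per-row temperature allows Eq. 10 scaling
    pos = np.diag(logits)
    neg_exp = np.where(~np.eye(B, dtype=bool), np.exp(logits), 0.0)
    return -np.sum(pos - np.log(neg_exp.sum(axis=1)))

def adaptive_tau(part_embeds, tau_gl=0.1):
    """Eq. 10: tau_un = exp(u/2) * tau_gl, so high-uncertainty (less
    representative) parts get a larger temperature and a softer gradient."""
    return np.exp(uncertainty(part_embeds) / 2.0) * tau_gl
```

Because `tau` may be a per-sample vector, the same routine covers both the fixed-temperature global/local terms and the uncertainty-guided global-local term.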
Prior work [49] introduces a global–local contrastive loss $\mathcal{L}_{con}^{orig}$ that aligns part-level text features $t^{part}$ with whole image embeddings, and part-level image features $i^{part}$ with whole text embeddings as below:

$$\mathcal{L}_{con}^{orig} = \underbrace{L_c^*(i^{part}, t; \tau) + L_c^*(t^{part}, i; \tau)}_{\text{global-local contrastive loss}} + \underbrace{L_c^*(i, t; \tau) + L_c^*(t, i; \tau)}_{\text{global contrastive loss}}. \quad (9)$$

Our contrastive loss additionally includes a local contrastive loss that explicitly aligns each part image with its corresponding text on top of Eq. 9. Since whole and part images differ in information levels and occupy distinct regions in hyperbolic space, we assign separate temperature parameters $\tau_g$, $\tau_l$, and $\tau_{gl}$ to the global, local, and global-local contrastive losses, respectively, to better model these relationships. Unlike the aforementioned prior contrastive losses with fixed temperature, we propose an uncertainty-guided contrastive loss. Our approach incorporates uncertainty into the global-local contrastive loss by considering the varying semantic representativeness of multiple parts. We modulate the temperature in an element-wise manner through an uncertainty-guided global-local contrastive loss, where the temperature is adaptively scaled according to the estimated uncertainty of each part image and text. The adaptive temperatures $\tau_{un,i}^I$ and $\tau_{un,i}^T$ are designed as below:

$$\tau_{un,i}^I = \exp\!\left(u(i_i^{part})/2\right)\tau_{gl}, \qquad \tau_{un,i}^T = \exp\!\left(u(t_i^{part})/2\right)\tau_{gl} \quad (10)$$

where higher uncertainty leads to a larger temperature and a smaller contribution to the contrastive loss. The formulation of our proposed contrastive loss is shown as below:

$$\mathcal{L}_{con}^{un} = \underbrace{L_c^*(i^{part}, t; \tau_{un}^I) + L_c^*(t^{part}, i; \tau_{un}^T)}_{\text{uncertainty-guided global-local contrastive loss}} + \underbrace{L_c^*(i, t; \tau_g) + L_c^*(t, i; \tau_g)}_{\text{global contrastive loss}} + \underbrace{L_c^*(i^{part}, t^{part}; \tau_l) + L_c^*(t^{part}, i^{part}; \tau_l)}_{\text{local contrastive loss}}. \quad (11)$$

Unlike the one-to-one correspondence between matched image-text pairs, the relationship between a part image and its whole scene or text may not be a perfect correspondence. For instance, a single scene text may correspond to multiple part images. If all embeddings within a whole scene are pushed apart with the same temperature, both highly representative and less representative regions are equally repelled, breaking semantic structure. Our proposed contrastive loss in Eq. 11 is designed to mitigate these undesirable cases.

Figure 3: Entailment geometry in hyperbolic space. The term $\omega(i^{part})$ denotes the aperture of the entailment cone centered at $i^{part}$. The angle $\phi(i^{part}, i)$ measures the geodesic angle between the embeddings $i^{part}$ and $i$, which is used to determine whether $i$ lies within the entailment region of $i^{part}$.

3.2.3 Entailment loss for uncertainty calibration

Piecewise-continuous entailment loss. Building upon the hyperbolic entailment formulation in [10, 38], prior work [49] defines the entailment loss as:

$$\mathcal{L}_{orig} = \max\!\left(0,\, \phi(p, q) - \eta\, \omega(p)\right) \quad (12)$$

where $\phi(p, q)$ denotes the angular distance between the embeddings $p$ and $q$, $\eta$ and $K$ are hyperparameters, and $\omega(p)$ defines the aperture of the entailment cone centered at $p$ as below:

$$\omega(p) = \sin^{-1}\!\left(\frac{2K}{\sqrt{\kappa}\, \|p_{\mathrm{space}}\|}\right), \quad (13)$$

which is also illustrated in Fig. 3. The $\mathcal{L}_{orig}$ in Eq. 12 enforces entailment by constraining $q$ to lie within the cone of $p$. However, once $q$ is fully contained in the cone, the loss becomes zero, preventing further fine-grained alignment.
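The cone geometry of Eqs. 12–13 can be checked numerically. A minimal NumPy sketch, assuming the aperture's denominator uses the norm of $p$'s space component; the values of `K` and `ETA` are illustrative placeholders, not the paper's settings.

```python
import numpy as np

K, ETA = 0.1, 1.0  # illustrative hyperparameter values (assumed, not from the paper)

def aperture(p_space, kappa=1.0):
    """Eq. 13: half-aperture of the entailment cone at p. Embeddings far from
    the origin (large norm) get narrow cones, i.e. tighter entailment regions."""
    arg = 2.0 * K / (np.sqrt(kappa) * np.linalg.norm(p_space))
    return np.arcsin(np.clip(arg, -1.0, 1.0))

def entailment_hinge(phi, omega):
    """Eq. 12: penalize q only when its exterior angle phi exceeds the scaled
    cone aperture; the loss (and gradient) is zero once q is inside the cone."""
    return max(0.0, phi - ETA * omega)

wide = aperture(np.array([0.3, 0.0]))    # near-origin part: wide cone
narrow = aperture(np.array([3.0, 0.0]))  # far-out embedding: narrow cone
```

The zero-gradient plateau of `entailment_hinge` inside the cone is exactly the limitation the paper's relaxed loss (Eq. 14) addresses.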
Here, we propose adding an angular term $\phi(p, q)$ to Eq. 12 to encourage fine-grained alignment while maintaining smooth optimization continuity as below:

$$L_{ent}^*(p, q) = \max\!\left(0,\, \phi(p, q) - \eta\, \omega(p)\right) + \alpha\, \phi(p, q) \quad (14)$$

where $\alpha$ is a hyperparameter. This formulation can be viewed as a Leaky-ReLU-like [45] relaxation of the original hinge-based entailment loss, with the additional term preserving a small gradient even when $q$ is inside the cone.

Uncertainty calibration loss. Prior studies have reported that hyperbolic embeddings often accumulate around narrow regions, leading to collapse [54]. Moreover, local and global image representations exhibit similar radii, making their separation less distinct [49]. To clearly distinguish global and local representations, we propose the uncertainty calibration loss as follows:

$$L_{ent}^{cal}(p, q) = \lfloor L_{ent}^*(p, q) \rfloor\, e^{-u(p)} + u(p) + \mathcal{H}(\tilde{u}(p)) \quad (15)$$

where $\lfloor \cdot \rfloor$ denotes the stop-gradient operator and $\mathcal{H}$ represents the entropy term as follows:

$$\mathcal{H}(\tilde{u}(p)) = -\sum_i \tilde{u}(p_i) \log\!\left(\tilde{u}(p_i)\right) \quad (16)$$

where $\tilde{u}(p_i) = \exp(u(p_i)) / \sum_j \exp(u(p_j))$. When the entailment relation between $p$ and $q$ is weak, the term $e^{-u(p)}$ encourages the model to increase uncertainty. The term $u(p)$ prevents the model from assigning excessively high uncertainty just to reduce the loss. The entropy term $\mathcal{H}(\tilde{u}(p))$ regularizes the uncertainty distribution to remain diverse and informative, avoiding a collapse toward uniform or constant uncertainty, analogous to [18].

Figure 4: Analysis of uncertainty modeling. (a) Randomly cropped parts are sorted by uncertainty (low→high). Semantically representative parts show low uncertainty, while blurred or less representative crops show high uncertainty. (b) On an ImageNet [56] subset, part-to-whole similarity vs. uncertainty shows a strong negative correlation ($r = -0.739$), indicating that less representative parts have higher uncertainty.
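A sketch of the calibration loss (Eqs. 15–16) over a batch of part embeddings' uncertainties. In plain NumPy the stop-gradient is a no-op, so it is only noted in a comment (in an autodiff framework it would be `detach()`/`stop_gradient`); the uncertainty values below are illustrative inputs.

```python
import numpy as np

def softmax_entropy(u):
    """Eq. 16: entropy of the softmax-normalized batch uncertainties u~."""
    e = np.exp(u - u.max())       # shifted for numerical stability
    u_tilde = e / e.sum()
    return -np.sum(u_tilde * np.log(u_tilde))

def calibration_loss(ent_losses, u):
    """Eq. 15 averaged over a batch. `ent_losses` holds L*_ent(p, q) values and
    is treated as stop-gradient (a no-op here): a weak entailment relation
    (large loss) is discounted by e^{-u}, which pushes u up, while the bare
    +u term keeps uncertainty from growing without bound."""
    per_sample = ent_losses * np.exp(-u) + u
    return per_sample.mean() + softmax_entropy(u)
```

The entropy term is maximized by a uniform distribution over the batch, so it penalizes both constant and overly peaked uncertainty profiles, matching the collapse-avoidance role described above.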
With the entropy regularizer, the proposed formulation of our entailment loss is as follows:

$$\mathcal{L}_{ent}^{un} = \underbrace{L_{ent}^*(t^{part}, i^{part}) + L_{ent}^*(t, i)}_{\text{inter-modal entailment}} + \lambda_1 \underbrace{\left(L_{ent}^*(t^{part}, t) + L_{ent}^*(i^{part}, i)\right)}_{\text{intra-modal entailment}} + \lambda_2 \underbrace{\left(L_{ent}^{cal}(t^{part}, t) + L_{ent}^{cal}(i^{part}, i)\right)}_{\text{uncertainty calibration}} \quad (17)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters. This uncertainty calibration enables semantic alignment that reflects the representativeness of each part relative to the whole, a process that naturally fits the geometric properties of hyperbolic space and is particularly beneficial for jointly aligning multiple objects simultaneously. Moreover, such calibration enhances multi-object alignment, as shown in Fig. 4. Parts with higher semantic similarity to the whole exhibit lower uncertainty, while less representative parts show higher uncertainty, resulting in a strong negative correlation between similarity and uncertainty. Further details on Fig. 4 are provided in the supplementary material Sec. S.2.3.2. Finally, our overall loss combines the proposed uncertainty-guided contrastive loss and the entailment loss with uncertainty calibration:

$$\mathcal{L} = \mathcal{L}_{con}^{un} + \lambda_{ent}\, \mathcal{L}_{ent}^{un} \quad (18)$$

where $\lambda_{ent}$ is a hyperparameter. We detail all hyperparameters in the supplementary material Sec. S.1.2.

4 Experiments

Table 1: Zero-shot image classification evaluation (Top-1 accuracy) across general, fine-grained, and miscellaneous datasets. UNCHA (Ours) consistently demonstrates strong zero-shot classification performance across both architectures. Bold numbers denote the best performance within each architecture. † denotes ATMG trained on the GRIT [51].
ViT-S/16:

| Model | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 | Food-101 | CUB | Cars | Aircraft | Pets | Flowers | DTD | EuroSAT | RESISC45 | Country211 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [53] | 36.7 | 70.2 | 42.6 | 35.8 | 57.6 | 89.7 | 44.7 | 9.8 | 6.9 | 2.0 | 44.6 | 14.8 | 22.3 | 40.7 | 40.1 | 5.1 |
| MERU [10] | 35.4 | 71.2 | 40.4 | 33.8 | 57.3 | 89.7 | 41.2 | 11.3 | 5.2 | 4.2 | 42.7 | 17.3 | 18.6 | 39.1 | 38.9 | 5.3 |
| ATMG† [54] | 34.1 | 66.9 | 42.1 | 47.9 | 68.5 | 90.7 | 43.6 | 14.1 | 5.8 | 2.5 | 41.8 | 14.9 | 19.7 | 35.8 | 40.3 | 4.6 |
| HyCoCLIP [49] | 41.7 | 85.0 | 53.4 | 52.5 | 75.7 | 92.5 | 50.2 | 14.7 | 8.1 | 4.2 | 52.0 | 20.5 | 23.3 | 38.3 | 45.7 | 5.2 |
| UNCHA (Ours) | 43.9 | 85.9 | 56.6 | 52.6 | 80.5 | 94.4 | 52.1 | 12.5 | 9.2 | 2.7 | 52.1 | 24.6 | 25.4 | 36.2 | 43.4 | 5.2 |

ViT-B/16:

| Model | ImageNet | CIFAR-10 | CIFAR-100 | SUN397 | Caltech-101 | STL-10 | Food-101 | CUB | Cars | Aircraft | Pets | Flowers | DTD | EuroSAT | RESISC45 | Country211 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [53] | 40.6 | 78.9 | 48.3 | 43.0 | 70.7 | 92.4 | 48.3 | 10.4 | 9.3 | 3.4 | 45.9 | 21.3 | 23.4 | 37.1 | 42.7 | 5.7 |
| MERU [10] | 40.1 | 78.6 | 49.3 | 43.0 | 73.0 | 92.8 | 48.5 | 11.0 | 5.3 | 3.7 | 48.5 | 21.6 | 22.1 | 31.7 | 42.6 | 5.4 |
| ATMG† [54] | 34.3 | 68.8 | 42.1 | 48.2 | 68.5 | 91.2 | 43.2 | 14.3 | 6.0 | 2.4 | 42.2 | 15.0 | 19.4 | 35.0 | 40.4 | 4.6 |
| HyCoCLIP [49] | 45.8 | 88.8 | 60.1 | 57.2 | 81.3 | 95.0 | 59.2 | 16.4 | 11.6 | 3.7 | 56.8 | 23.9 | 29.4 | 35.8 | 45.6 | 6.5 |
| UNCHA (Ours) | 48.8 | 90.4 | 63.2 | 57.7 | 83.9 | 95.7 | 60.3 | 14.8 | 14.0 | 3.8 | 57.1 | 27.0 | 30.3 | 41.3 | 52.7 | 6.1 |

Table 2: Zero-shot retrieval and hierarchical classification metrics on ImageNet [9]. UNCHA (Ours) consistently achieves superior performance across both retrieval and hierarchical metrics, showing the effectiveness of our uncertainty-based hyperbolic alignment.
ViT-S/16:

| Model | Text COCO R@1 | Text COCO R@5 | Text Flickr R@1 | Text Flickr R@5 | Image COCO R@1 | Image COCO R@5 | Image Flickr R@1 | Image Flickr R@5 | TIE (↓) | LCA (↓) | J (↑) | P_H (↑) | R_H (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [53] | 69.3 | 79.1 | 90.2 | 95.2 | 53.7 | 65.2 | 81.1 | 87.9 | 4.02 | 2.39 | 0.76 | 0.83 | 0.84 |
| MERU [10] | 68.8 | 78.8 | 89.4 | 94.8 | 53.6 | 65.3 | 80.4 | 87.5 | 4.08 | 2.39 | 0.76 | 0.83 | 0.83 |
| ATMG† [54] | 62.6 | 74.2 | 85.5 | 91.6 | 50.3 | 62.1 | 76.9 | 84.6 | 4.26 | 2.50 | 0.75 | 0.82 | 0.83 |
| HyCoCLIP [49] | 69.5 | 79.5 | 89.1 | 93.9 | 55.2 | 66.6 | 81.5 | 88.1 | 3.55 | 2.17 | 0.79 | 0.86 | 0.85 |
| UNCHA (Ours) | 69.9 | 79.7 | 90.8 | 94.8 | 56.2 | 67.6 | 82.5 | 89.3 | 3.39 | 2.14 | 0.80 | 0.86 | 0.86 |

ViT-B/16:

| Model | Text COCO R@1 | Text COCO R@5 | Text Flickr R@1 | Text Flickr R@5 | Image COCO R@1 | Image COCO R@5 | Image Flickr R@1 | Image Flickr R@5 | TIE (↓) | LCA (↓) | J (↑) | P_H (↑) | R_H (↑) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [53] | 71.4 | 81.5 | 93.6 | 96.9 | 57.4 | 68.5 | 83.5 | 89.9 | 3.60 | 2.21 | 0.79 | 0.85 | 0.85 |
| MERU [10] | 72.3 | 82.0 | 93.5 | 96.2 | 57.4 | 68.6 | 84.0 | 90.0 | 3.63 | 2.22 | 0.78 | 0.85 | 0.85 |
| ATMG† [54] | 62.9 | 74.0 | 85.1 | 92.2 | 51.2 | 62.6 | 78.0 | 85.3 | 4.19 | 2.48 | 0.75 | 0.83 | 0.83 |
| HyCoCLIP [49] | 72.0 | 82.0 | 92.6 | 95.4 | 58.4 | 69.3 | 84.9 | 90.3 | 3.17 | 2.05 | 0.81 | 0.87 | 0.87 |
| UNCHA (Ours) | 72.7 | 82.7 | 91.4 | 95.9 | 60.0 | 71.0 | 84.9 | 91.2 | 2.94 | 1.96 | 0.83 | 0.88 | 0.88 |

Table 3: Comparison on part-level alignment evaluation with hard negatives. Ours achieves substantial performance gains under the most challenging scenario of [63], demonstrating its strong ability for fine-grained compositional understanding.

| Model | All SCM Neg | All Pick5 | All Hard Negs |
|---|---|---|---|
| CLIP [53] | 13.10 | 22.94 | 52.89 |
| ATMG† [54] | 12.23 | 23.08 | 53.91 |
| MERU [10] | 12.59 | 20.69 | 54.56 |
| HyCoCLIP [49] | 11.65 | 23.52 | 53.33 |
| UNCHA (Ours) | 13.53 | 23.81 | 56.51 |

Table 4: Ablation study on classification and retrieval benchmarks. Removing any component leads to consistent performance drops, showing that all modules contribute meaningfully. Bold numbers indicate the best performance within each task group.

| Model | General Cls. | Fine-grained Cls. | Misc. Cls. | Text Ret. | Image Ret. |
|---|---|---|---|---|---|
| Ours (full) | 68.98 | 25.53 | 27.55 | 83.80 | 73.90 |
| w/o uncertainty | 64.57 | 22.98 | 26.67 | 79.60 | 69.68 |
| w/o contrastive | 65.14 | 23.92 | 25.58 | 80.78 | 70.55 |
| w/o entropy | 65.61 | 23.09 | 24.78 | 80.60 | 69.95 |

Table 5: Comparison across Multi-object Representation and Classification tasks.
Left: zero-shot mAP comparison across multi-object configurations on ComCo and SimCo datasets. Right: zero-shot multi-label classification (Cls.) on VOC and COCO datasets (mAP only). Our method consistently achieves higher mAP across both tasks. All models use ViT-B/16.

| Model | ComCo 2 obj. | ComCo 3 obj. | ComCo 4 obj. | ComCo 5 obj. | SimCo 2 obj. | SimCo 3 obj. | SimCo 4 obj. | SimCo 5 obj. | VOC | COCO |
|---|---|---|---|---|---|---|---|---|---|---|
| CLIP [53] | 77.55 | 80.31 | 81.41 | 80.22 | 77.15 | 84.58 | 87.40 | 88.48 | 78.56 | 53.94 |
| MERU [10] | 72.90 | 77.25 | 78.15 | 77.34 | 77.82 | 83.91 | 85.79 | 86.90 | 79.50 | 54.39 |
| ATMG† [54] | 45.91 | 45.97 | 45.80 | 45.82 | 65.52 | 65.32 | 65.28 | 65.12 | 72.22 | 46.81 |
| HyCoCLIP [49] | 72.90 | 73.22 | 73.51 | 72.90 | 75.71 | 81.13 | 82.41 | 82.85 | 80.43 | 58.12 |
| UNCHA (Ours) | 77.92 | 80.96 | 81.83 | 81.18 | 79.72 | 86.93 | 89.75 | 90.65 | 82.14 | 59.43 |

4.1 Training details

To ensure a fair comparison, baseline models [49, 53, 10, 54] are reproduced under identical dataset and training configurations, while preserving the optimization settings specified in their original implementations. The batch size and total number of training iterations are fixed at 768 and 500,000, respectively. All models are trained on the Grounded Image-Text Pairs (GRIT) [51] dataset, which contains 20.5 million grounded vision-language pairs and 35.9 million part-level annotations. Detailed descriptions of the settings and hyperparameters are provided in Sec. S.1 of the supplementary material.

4.2 Downstream tasks

4.2.1 Zero-shot image classification

We conduct zero-shot classification experiments on 16 benchmark datasets as listed in Tab. 1. We report Top-1 accuracy as the evaluation metric for all results, following prior works [10, 53]. To evaluate scalability, we experiment with different sizes of vision encoders, ViT-S and ViT-B. For ATMG [54], we follow the original setup, computing similarity via averaged exterior angles instead of Lorentz or Euclidean inner products. This configuration is used for all downstream tasks. As shown in Tab.
1, our method consistently outperforms prior approaches across all benchmark datasets, demonstrating strong generalization and robust performance on downstream tasks.

4.2.2 Zero-shot retrieval

For the retrieval task, we evaluate the model’s ability to retrieve the most relevant samples across modalities. Specifically, given an input image (or text), the model retrieves the Top-K text (or image) candidates from the collection, and the retrieval accuracy is computed accordingly. All experiments are conducted under the zero-shot setting using the COCO [40] validation set and the Flickr30K [73, 34] test set. As shown in Tab. 2, our method delivers consistently strong performance, indicating reliable cross-modal alignment across both benchmarks.

4.2.3 Hierarchical classification

To evaluate how well the model embeds hierarchical relationships in hyperbolic space, we adopt the hierarchy-aware metrics introduced in HyCoCLIP [49]. As shown in Tab. 2, our model achieves consistently strong performance on the hierarchical metrics, demonstrating its improved ability to preserve the structural hierarchy of the class labels within the embedding space, partly due to the uncertainty-guided alignment. More detailed explanations are provided in Sec. S.2.2.3 of the supplementary material.

4.2.4 Zero-shot multi-label classification

We conduct multi-label classification experiments on the MS-COCO [40] and VOC [14] datasets, as shown in Tab. 5. The evaluation metric is mean Average Precision (mAP). To further assess performance in more complex multi-object settings, we employ the ComCo and SimCo datasets [1]. These datasets evaluate compositional understanding with images containing N objects: ComCo features realistic object compositions, whereas SimCo provides synthetic scenes with diverse geometric shapes. For evaluation, we train a lightweight classifier on the embeddings and report test-set classification mAP. As shown in Tab.
5, UNCHA outperforms all baselines across both multi-label classification and multi-object representation benchmarks, which indicates that our uncertainty-aware modeling provides substantially stronger compositional understanding. These results highlight UNCHA’s ability to better disentangle object-level semantics and maintain robust alignment in complex multi-object scenes.

Figure 5: Analysis of hyperbolic embedding. Compared to HyCoCLIP [49], whose hyperbolic embeddings exhibit a narrower range, UNCHA yields a more dispersed and structured distribution, reflecting richer use of the hyperbolic space.

4.2.5 Part-level alignment with hard negatives

We evaluate part-level text–image matching using the benchmark derived from the densely annotated Densely Captioned Images dataset [63]. The benchmark pairs cropped parts with their corresponding texts and introduces region-specific hard negatives to test fine-grained alignment. We report results on the ‘All Pick5’ and ‘All Hard Negs’ settings in Tab. 3, which require the model not only to identify the correct pair among hard negative captions but also to produce a correct ordering between matching and non-matching pairs. UNCHA (Ours) achieves the highest performance among baselines, exhibiting substantial improvements in the ‘All Pick5’ setting. This shows that our model effectively captures fine-grained part-whole distinctions, yielding better region-level visual-semantic alignment.

4.3 Analysis of hyperbolic space

We visualize the radii of the hyperbolic embeddings for 10,000 ImageNet [56] images and their randomly cropped parts, shown in Fig. 5. As noted in HyCoCLIP [49], the embeddings of images and their parts often collapse into a narrowly concentrated region, yielding minimal separation between part and whole. In contrast, UNCHA produces a more distinctive and semantically structured geometry: part embeddings consistently lie closer to the origin than whole-scene embeddings, and the two distributions become clearly separated.
This behavior results from the application of our uncertainty calibration and entropy regularizer. A more detailed analysis of hyperbolic space is provided in Sec. S.2.5 of the supplementary material.

4.4 Ablation study

To assess the contribution of each component in our framework, we performed ablation experiments, each removing a distinct component. In Tab. 4, ‘w/o contrastive’ removes the uncertainty-aware scaling from the global-local contrastive loss, while ‘w/o uncertainty’ disables the uncertainty calibration in the uncertainty-guided entailment loss. Finally, ‘w/o entropy’ removes the entropy regularization from the uncertainty calibration module. The results demonstrate that all components of our method are essential. All experiments were conducted with the ViT-S/16 architecture.

5 Conclusion

We propose UNCHA, a hyperbolic VLM that integrates part-to-whole representativeness, quantified as hyperbolic uncertainty, into both contrastive and entailment learning for hierarchy-aware compositional modeling. By further calibrating uncertainty using part-to-whole entailment relationships and an entropy-based regularization term, our method enables efficient use of hyperbolic space and yields well-calibrated part-whole orderings. Extensive experiments on zero-shot classification, retrieval, and multi-label benchmarks, including complex multi-object scenes, demonstrate state-of-the-art performance, highlighting the importance of uncertainty-guided alignment for compositional understanding in vision-language learning.

Acknowledgements

This work was supported in part by Institute of Information & communications Technology Planning & Evaluation (IITP) grants funded by the Korea government (MSIT) [No. RS-2021-I211343, Artificial Intelligence Graduate School Program (Seoul National University) / No. RS-2025-02314125, Effective Human-Machine Teaming With Multimodal Hazy Oracle Models], the National Research Foundation of Korea (NRF) grants funded by the Korea government (MSIT) (Nos.
RS-2022-NR067592, RS-2025-02263628), the AI Computing Infrastructure Enhancement (GPU Rental Support) User Support Program funded by the Ministry of Science and ICT (MSIT), Republic of Korea (No. RQT-25-120066), the BK21 FOUR program of the Education and Research Program for Future ICT Pioneers, Seoul National University, and the AI-Bio Research Grant through Seoul National University.

Supplementary Material for Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Appendix S.1 Implementation details

S.1.1 Model architecture

Our text encoder follows the CLIP [53] design and uses a 12-layer, 512-dimensional Transformer [64]. The maximum input length is set to 77 tokens with a vocabulary size of 49,408. For images, we adopt a Vision Transformer [12] and experiment with two capacity configurations, ViT-S and ViT-B [62, 8], both using a patch size of 16. These architectural choices are consistent with prior works [10, 49]. During training, we apply the same image augmentations as OpenCLIP [29], including random cropping, random grayscale conversion, and random color jittering, and resize all images to 224×224.

S.1.2 Model initialization

The curvature of the Lorentz space is initialized to κ = 1.0 and treated as a learnable parameter, clamped to [0.1, 10.0] for numerical stability. The final learned value converges to κ = 0.1, consistent with those used in prior hyperbolic methods [10, 49, 54]. Before projecting representations onto the Lorentz model, we apply learnable scaling factors to the image and text vectors. These scalars are initialized as c_img = c_txt = 1/√512, following prior work [10, 49]. The temperature parameters are also learnable. The global-local logit scale τ_gl is initialized to 0.06, while the local and global logit scales, τ_l and τ_g, are initialized to 0.07. All temperature values are clipped at a minimum of 0.01.
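The lift onto the Lorentz model implied by this initialization can be sketched as follows. This is an illustrative pure-Python sketch, not the released implementation; the helper name `lift_to_lorentz` and the toy feature vector are our own, and the scale init assumes the 1/√512 value stated above.

```python
import math

# Illustrative sketch (not the authors' code): a scaled feature vector supplies
# the space component, and the time component follows from the hyperboloid
# constraint <x, x>_L = -1/kappa.

def lift_to_lorentz(x_space, kappa):
    """Given a space component x_space in R^n, recover
    x_time = sqrt(||x_space||^2 + 1/kappa) so the point [x_time, x_space]
    lies on the hyperboloid of curvature -kappa."""
    kappa = min(max(kappa, 0.1), 10.0)  # clamp curvature, as described above
    sq_norm = sum(v * v for v in x_space)
    return [math.sqrt(sq_norm + 1.0 / kappa)] + list(x_space)

c = 512 ** -0.5                           # learnable scale, init 1/sqrt(512)
feat = [c * v for v in (0.3, -1.2, 0.8)]  # toy 3-d feature vector
x = lift_to_lorentz(feat, kappa=1.0)

# The Lorentzian self-product equals -1/kappa up to floating-point error:
lorentz_sq = -x[0] ** 2 + sum(v * v for v in x[1:])
assert abs(lorentz_sq + 1.0) < 1e-9
```

Because only the space component is parameterized, clamping the curvature before the lift keeps the time component well-defined even as κ is learned.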
Values of the η parameter are set to η_intra = 1.2 for intra-modal entailments and η_inter = 0.7 for inter-modal entailments (Eq. 14), and K = 0.1 (Eq. 13), following [49]. In Eq. 14, we set α = 0.1. For Eq. 17, the weighting coefficients are λ_1 = 0.5 and λ_2 = 10.0. In Eq. 18, we use λ_ent = 0.2.

S.1.3 Optimizer and hardware

Our model is trained for 500K steps on four A100 GPUs with a batch size of 768. We employ the AdamW optimizer [43] with β_1 = 0.9, β_2 = 0.98, and a weight decay of 0.2. The decay is excluded for the learnable scalar parameters, including the temperatures, the curvature, and the scaling factors c_img and c_txt. We adopt a cosine learning-rate scheduler [42] with a maximum learning rate of 5×10^-4 and a 4k-step linear warm-up period.

Appendix S.2 Additional details on experiments

S.2.1 Training details on other models

We employ CLIP [53], MERU [10], and HyCoCLIP [49] models trained on the Grounded Image-Text Pairs (GRIT) dataset [51], using the reproduced versions released by [49]. For CLIP [53] and MERU [10], we adopt the variants trained without part images, as their original training pipelines do not incorporate part-level data and prior work [49] reports that including part images does not lead to performance improvements. The GRIT dataset contains 20.5 million grounded vision-language pairs and 35.9 million box-level annotations describing objects within each scene, derived from the larger COYO-700M corpus [4]. In addition, we train ATMG [54] on the same GRIT dataset using a batch size of 768 for 500K iterations, preserving the optimization settings specified in its original implementation.

S.2.2 Downstream tasks

S.2.2.1 Zero-shot image classification

For MERU [10], HyCoCLIP [49], and UNCHA (Ours), the similarity between text and image embeddings is computed with the Lorentzian inner product.
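The Lorentzian inner product used as the similarity score can be sketched as below. This is a hedged toy illustration with our own helper names and made-up 2-d embeddings, not the evaluation code; for two points on the hyperboloid, ⟨x, y⟩_L ≤ -1/κ with equality iff x = y, so larger (less negative) values indicate more similar embeddings.

```python
import math

# Sketch of Lorentzian-inner-product similarity; names and vectors are
# illustrative only.

def lift(x_space, kappa=1.0):
    """Lift a space component onto the hyperboloid of curvature -kappa."""
    t = math.sqrt(sum(v * v for v in x_space) + 1.0 / kappa)
    return [t] + list(x_space)

def lorentz_inner(x, y):
    """<x, y>_L = -x_time * y_time + <x_space, y_space>."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

img = lift([0.5, 0.1])
texts = {"a photo of a cat": lift([0.5, 0.1]),   # identical toy embedding
         "a photo of a car": lift([-0.9, 0.4])}  # dissimilar toy embedding

# Zero-shot prediction: pick the text with the largest Lorentzian inner product.
best = max(texts, key=lambda k: lorentz_inner(img, texts[k]))
assert best == "a photo of a cat"
```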
For CLIP [53], similarity is measured using the Euclidean inner product, while for ATMG [54] we adopt its exterior-angle-based similarity. The same similarity formulation for each model is consistently applied across all remaining downstream tasks. In zero-shot image classification, we treat the label set as a collection of text queries [13] and apply prompt ensembling for each label by encoding multiple prompt variants and averaging their embeddings before generating the final textual representations, following previous works [10, 49, 54]. Using these embedded text queries, we compute image-text similarities and report Top-1 accuracy averaged over classes.

S.2.2.2 Zero-shot retrieval

In zero-shot text-to-image retrieval, we compare every text caption embedding against all image embeddings and sort the images in descending order of similarity. The same procedure is applied symmetrically for image-to-text retrieval. We compute recall@K for both directions using the ground-truth associations provided by COCO [40] and Flickr30K [73, 34], where a retrieval is counted as correct if at least one paired item appears within the top-K results. All recall metrics are averaged over the full set of queries to produce the final results.

S.2.2.3 Hierarchical classification

For the hierarchical classification task, we follow prior work [49] and use the WordNet hierarchy [47] of the ImageNet class labels [9, 56]. The Tree-Induced Error (TIE) quantifies how far the predicted label is from the ground-truth label within the given tree. The Lowest Common Ancestor (LCA) error captures how far each label is from their deepest shared ancestor, defined as the sum of the edge-weighted distances from the predicted and true labels to the LCA. Set-based metrics compare the ancestor sets of the predicted and true labels: using all ancestor nodes for each label, we compute the Jaccard similarity, hierarchical precision, and hierarchical recall based on their set intersection.
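The set-based metrics above can be illustrated on a toy hierarchy. This is a sketch under our own parent-map representation (not the benchmark code); the labels and chain are invented for illustration.

```python
# Illustrative sketch of the set-based hierarchical metrics: Jaccard
# similarity, hierarchical precision, and hierarchical recall computed from
# ancestor sets of the predicted and ground-truth labels.

def ancestors(label, parent):
    """All ancestor nodes of `label` (including itself) under a parent map."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent.get(label)
    return out

def hierarchical_metrics(pred, true, parent):
    a_p, a_t = ancestors(pred, parent), ancestors(true, parent)
    inter = len(a_p & a_t)
    jaccard = inter / len(a_p | a_t)
    h_precision = inter / len(a_p)   # overlap relative to predicted ancestors
    h_recall = inter / len(a_t)      # overlap relative to true ancestors
    return jaccard, h_precision, h_recall

# Toy WordNet-style chain: entity -> animal -> dog -> beagle; animal -> cat.
parent = {"animal": "entity", "dog": "animal", "cat": "animal", "beagle": "dog"}
j, p, r = hierarchical_metrics("cat", "beagle", parent)
# Shared ancestors {animal, entity}: j = 2/5, p = 2/3, r = 2/4.
```

A prediction deep in the wrong subtree shares only shallow ancestors with the true label, so all three scores drop together, which is exactly the hierarchy-awareness these metrics add over flat accuracy.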
S.2.2.4 Zero-shot multi-label classification

Multi-label classification. We perform multi-label classification experiments on the MS-COCO [40] and VOC [14] datasets and report performance using mean Average Precision (mAP). This task evaluates whether the VLM can correctly predict the set of classes present in each image by comparing its predictions against the binary ground-truth labels. Because the baseline models include both hyperbolic and Euclidean variants, their similarity score ranges differ substantially: Euclidean VLMs typically output similarities within [0, 1], whereas hyperbolic similarity scores generally fall at or below -10. To ensure a fair comparison across models, we apply an additional normalization step to the similarity scores before computing the evaluation metrics.

Multi-object representation. This benchmark is designed to evaluate more complex multi-object scenarios using the ComCo and SimCo datasets [1]. As described in [1], this setting allows us to assess how well a VLM’s image encoder represents individual objects within multi-object scenes and to analyze whether its representations exhibit biases with respect to object size. ComCo consists of images containing realistic 3D asset objects, such as cars or airplanes, arranged in sets of N, while SimCo contains synthetic 3D assets such as blue spheres, cones, and other primitive shapes. In both datasets, each image contains between two and five objects, so the labels ‘2 obj.’ and ‘3 obj.’ in Tab. 5 of the main text refer to sets of images that contain exactly two or three objects, respectively. These images include various combinations of object sizes and spatial arrangements; for instance, a ‘3 obj.’ image may contain one large object and two smaller objects at different locations. A separate classifier is trained for each set on top of the features produced by the VLM’s image encoder, grouped by the number of objects.
The model is evaluated on its ability to distinguish all components across different sizes and positions, and at test time we assess whether the trained classifier can correctly identify each component in response to new text queries. Extended results evaluated with the ViT-S backbone are provided in Tab. S.8.

S.2.2.5 Part-level alignment with hard negatives

This benchmark, introduced in [63], evaluates whether a VLM can correctly associate captions with the appropriate image subregions when multiple submasks and captions exist for the same image, using the 7,805 images from the summarized Densely Captioned Images (sDCI) dataset. The original DCI dataset provides dense textual annotations, including multiple captions, subcaptions, and visual descriptions per image. To align these annotations with CLIP-style input constraints, all LLM-generated captions are truncated to 77 tokens to form sDCI. Each image contains several subcrops, each paired with one or more summarized captions as well as LLM-generated negatives. Retrieval-style evaluations are constructed by placing multiple subcrops and captions from the same image within a single batch, requiring the model to identify which caption corresponds to which region. We report the results of ‘All Pick5-SCM’, ‘All Pick5-Neg’, and ‘All-Hard Negs’ in the main paper, and include all metrics, tested with both ViT-B and ViT-S, in Tab. S.7. In ‘All-SCM’, one summarized caption is paired with each subcrop, and the model must identify the caption that describes that specific region, distinguishing it from captions corresponding to other subcrops of the same image as well as from other in-batch captions. In ‘All-Neg’, each subcrop’s caption is evaluated against an LLM-generated negative to test positive-negative discrimination.
The ‘All Pick5-SCM’ setting follows the structure of ‘All-SCM’ but uses five captions per subcrop, with success counted only if the correct caption scores higher than all positives from other images. In ‘All Pick5-Neg’, five summarized captions are paired with each subcrop, and the model succeeds only if all positives score above the negative. In ‘Base-Neg’, only full images (no subcrops) are used, and each image is paired with its LLM-generated negative caption to test the model’s ability to distinguish an LLM-generated caption from its corresponding LLM-generated negative. Finally, ‘All-Hard Negs’ follows the same setup as ‘All-Neg’ but replaces the negative caption with the hardest (highest-scoring) LLM-generated negative across the entire negative pool.

S.2.3 Additional ablation study

S.2.3.1 Ablation study on hyperbolic radius

As discussed in the main paper, for a point \(x \in \mathbb{R}^n\), we define the uncertainty \(u\) using the Euclidean \(\ell_2\) norm of \(x\), since this norm is monotonically proportional to its hyperbolic radius. We represent a point \(x \in \mathbb{R}^{n+1}\) in the Lorentz model using its time–space decomposition:

\[ x = [x_{\mathrm{time}}, x_{\mathrm{space}}], \quad x_{\mathrm{time}} \in \mathbb{R}, \; x_{\mathrm{space}} \in \mathbb{R}^n. \tag{S.19} \]

The origin of the hyperboloid corresponds to the point \(o = [\sqrt{1/\kappa}, \mathbf{0}]\). Therefore, the hyperbolic radius, defined as the geodesic distance between \(x\) and the origin, can be calculated as:

\[ d_L(x, o) = \frac{1}{\sqrt{\kappa}} \cosh^{-1}\!\left(-\kappa \langle x, o \rangle_L\right) = \frac{1}{\sqrt{\kappa}} \cosh^{-1}\!\left(\sqrt{\kappa}\, x_{\mathrm{time}}\right), \tag{S.20} \]

where we used the Lorentzian inner product

\[ \langle x, o \rangle_L = -x_{\mathrm{time}} \sqrt{1/\kappa}. \tag{S.21} \]

To obtain an explicit expression, we use the hyperboloid constraint:

\[ \langle x, x \rangle_L = -x_{\mathrm{time}}^2 + \|x_{\mathrm{space}}\|_2^2 = -\frac{1}{\kappa}, \tag{S.22} \]

which implies

\[ x_{\mathrm{time}} = \sqrt{\|x_{\mathrm{space}}\|_2^2 + \frac{1}{\kappa}}. \tag{S.23} \]

As mentioned in the preliminaries of the main text, we only parameterize the space component of \(x\). Hence, the Euclidean norm satisfies \(\|x_{\mathrm{space}}\|_2 \equiv \|x\|_2\) for our parameterization.
Therefore, the geodesic distance (hyperbolic radius) from the origin to a point \(x \in \mathbb{R}^D\) is given by:

\[ d_L(x, o) = \frac{1}{\sqrt{\kappa}} \cosh^{-1}\!\left(\sqrt{1 + \kappa \|x\|_2^2}\right). \tag{S.24} \]

This expression reveals that the hyperbolic radius is closely related to the Euclidean norm of \(x\), \(\|x\|_2\). For small \(\|x\|_2\), we have the approximation

\[ \sqrt{1 + \kappa \|x\|_2^2} \approx 1 + \frac{\kappa}{2} \|x\|_2^2, \tag{S.25} \]

and using \(\cosh^{-1}(1 + u) \approx \sqrt{2u}\), it follows that

\[ d_L(x, o) \approx \|x\|_2, \tag{S.26} \]

showing that the hyperbolic radius grows approximately proportionally to the Euclidean norm for small \(\|x\|_2\). For large norms, using \(\cosh^{-1}(t) \approx \log(2t)\), the radius behaves as:

\[ d_L(x, o) \approx \frac{1}{\sqrt{\kappa}} \log\!\left(2\sqrt{\kappa}\, \|x\|_2\right), \tag{S.27} \]

indicating a transition to logarithmic growth. Overall, the hyperbolic radius is approximately proportional to the Euclidean norm for small \(\|x\|_2\), but grows logarithmically for large \(\|x\|_2\). This monotonic relationship validates the use of the Euclidean norm of \(x\) as a proxy for its hyperbolic radius, and lets us avoid unnecessary hyperbolic computations while preserving the same ordering.

The ablation result obtained when training directly with the hyperbolic radius in Eq. S.24 is reported in Tab. S.6, showing slightly reduced performance compared to our full model. This confirms that our Euclidean-norm proxy provides an effective surrogate for the hyperbolic radius, enabling more reliable uncertainty estimation during training.

Table S.6: Ablation study on hyperbolic radius. Replacing our Euclidean-norm surrogate with the explicit hyperbolic radius slightly degrades both classification and retrieval performance. Bold numbers indicate the best within each task group.

                  Classification          Retrieval
Model             Gen.   Fine.  MISC.    Text   Image
Ours (full)       68.98  25.53  27.55    83.80  73.90
with d_L(x, o)    67.41  24.81  25.55    79.43  72.00

S.2.3.2 Analysis experiments

Analysis of uncertainty modeling. In Fig. 4(a), we investigate how uncertainty reflects the semantic representativeness of local regions within an image.
To this end, we randomly crop multiple patches from a single image and compute the uncertainty for each patch. The patches are then arranged according to their uncertainty values, from low to high, progressing from the top-left to the bottom-right. We observe that patches with low uncertainty tend to correspond to semantically meaningful and well-aligned regions, such as prominent objects or structurally informative parts of the scene. In contrast, patches with high uncertainty are often blurred, textureless, or less informative, making them less representative of the overall scene. This qualitative observation suggests that our uncertainty measure effectively captures how well a local region aligns with the global semantics of the image. Additional results on uncertainty-based ordering are provided in Fig. S.13.

In Fig. 4(b), we further provide a quantitative analysis of this behavior using a subset of ImageNet [56]. For each image, we compute the semantic similarity between each cropped part and the corresponding whole image, and examine its relationship with the estimated uncertainty. The resulting scatter plot reveals a strong negative correlation (Corr = -0.739), indicating that parts with higher semantic similarity to the whole tend to have lower uncertainty, while less representative parts exhibit higher uncertainty. This consistent trend supports the interpretation that our uncertainty measure serves as a reliable proxy for semantic representativeness, which is crucial for accurate and robust part-level alignment.

S.2.4 Additional experimental results

S.2.4.1 Part-level alignment with hard negatives

Experimental setting. The experimental setting is described in Sec. 4.2.5 of the main text and further detailed in Sec. S.2.2.5.

Experimental results. Tab. S.7 presents the results for the part-level alignment benchmark with hard negatives across the evaluation settings described in Sec. S.2.2.5.
Across both ViT-S/16 and ViT-B/16 backbones, UNCHA (Ours) consistently achieves the best or second-best performance in nearly every setting. The gains are especially noticeable in the more challenging ‘All Pick5-SCM’ and ‘All Pick5-Neg’ settings, where multiple positives per subcrop make the matching task substantially harder. Even in the ‘All-Hard Negs’ setting, where each subcrop must be distinguished from the hardest negative caption selected from the entire LLM-generated negative pool, UNCHA achieves the best performance, demonstrating its robustness against challenging negative distractors. This result indicates that UNCHA (Ours) effectively identifies and differentiates distinct subregions within an image, demonstrating its ability to understand images in a fine-grained manner.

Table S.7: Full results of part-level alignment with hard negatives. Comparison across all settings of part-level alignment with hard negatives for ViT-S and ViT-B. UNCHA (Ours) consistently outperforms prior models, including in the challenging ‘All Pick5’ and ‘All-Hard Negs’ settings, demonstrating its strong capability in accurately identifying and distinguishing fine-grained visual regions within images.

                 All            All Pick5      Base   All
Model            SCM    Neg     SCM    Neg     Neg    Hard Negs
ViT-S/16
CLIP [53]        39.87  63.60   12.52  23.88   82.41  57.31
ATMG [54]        40.45  61.51   12.30  22.29   73.15  55.79
MERU [10]        40.81  64.18   12.23  23.81   79.63  56.30
HyCoCLIP [49]    36.61  60.13   10.85  22.29   80.56  52.03
UNCHA (Ours)     41.10  63.89   12.88  25.04   83.33  57.45
ViT-B/16
CLIP [53]        39.22  59.33   13.10  22.94   74.07  52.89
ATMG [54]        40.38  62.08   12.23  23.08   82.41  53.91
MERU [10]        40.09  62.37   12.59  20.69   81.48  54.56
HyCoCLIP [49]    35.96  60.78   11.65  23.52   75.93  53.33
UNCHA (Ours)     39.58  62.23   13.53  23.81   80.56  56.51

S.2.4.2 Multi-object representation

Experimental setting. The experimental setting is described in Sec. 4.2.4 of the main text and further detailed in Sec. S.2.2.4.

Experimental results.
We extend the multi-object representation experiments from the main paper by additionally evaluating ViT-S models. As presented in Tab. S.8, UNCHA (Ours) consistently achieves superior performance across diverse object counts and datasets. This reflects its ability to reliably represent and distinguish individual objects within complex multi-object scenes, demonstrating strong fine-grained and compositional understanding.

Table S.8: Multi-object representation performance on ComCo and SimCo (mAP). UNCHA (Ours) generally outperforms all baselines across object counts and datasets in the extended ViT-S and ViT-B evaluation, demonstrating strong fine-grained and compositional understanding in complex multi-object scenes.

                 ComCo                           SimCo
Model            2 obj.  3 obj.  4 obj.  5 obj.  2 obj.  3 obj.  4 obj.  5 obj.
ViT-S/16
CLIP [53]        69.59   71.97   72.44   72.06   72.49   80.05   82.45   82.65
MERU [10]        67.42   69.31   70.04   69.60   71.69   78.56   80.65   81.20
ATMG [54]        44.01   43.94   44.12   43.97   62.17   63.02   61.83   62.00
HyCoCLIP [49]    64.47   65.67   66.37   65.74   72.91   78.25   79.55   79.43
UNCHA (Ours)     68.91   71.54   72.90   72.58   74.41   81.79   83.55   83.13
ViT-B/16
CLIP [53]        77.55   80.31   81.41   80.22   77.15   84.58   87.40   88.48
MERU [10]        72.90   77.25   78.15   77.34   77.82   83.91   85.79   86.90
ATMG [54]        45.91   45.97   45.80   45.82   65.52   65.32   65.28   65.12
HyCoCLIP [49]    72.90   73.22   73.51   72.90   75.71   81.13   82.41   82.85
UNCHA (Ours)     77.92   80.96   81.83   81.18   79.72   86.93   89.75   90.65

S.2.4.3 Zero-shot semantic segmentation

Experimental setting. Zero-shot semantic segmentation refers to benchmark settings in which additional attention-modulation methods (such as SCLIP [66] and NACLIP [20]) are integrated into the model to extract not only class-level features but also the dense features produced by the backbone. Using these dense features, the model performs classification by comparing them against the class texts from existing datasets.
In our experiments, we first use NACLIP to extract dense features and then compute their similarity to class texts, evaluating how accurately the model localizes fine-grained regions based on the mIoU metric. However, semantic segmentation is substantially more challenging than standard image classification, so instead of relying solely on text-image matching as in typical classification, we further reduce the modality mismatch by extrapolating the text embeddings from the root of the hyperbolic space for all hyperbolic-based models.

Experimental results. As shown in Tab. S.9 and Figs. S.11–S.12, our method consistently achieves strong performance across both the ViT-S and ViT-B backbones, indicating that it captures fine-grained details in images more effectively than existing approaches. Furthermore, the results demonstrate that our method produces more coherent region assignments and reliably handles scenes containing multiple objects, correctly separating and allocating each instance. Taken together, these observations highlight the robustness and strong fine-grained awareness of our model in zero-shot segmentation settings.

Table S.9: Zero-shot segmentation performance on VOC21. UNCHA (Ours) achieves the highest mIoU on both the ViT-S/16 and ViT-B/16 backbones, showing clear improvements over prior methods. This result demonstrates that our hyperbolic alignment enables the model to effectively capture fine-grained region-level features.

                 VOC21 dataset
Model            ViT-S/16  ViT-B/16
CLIP             36.02     28.47
MERU             36.18     26.05
ATMG             7.63      6.51
HyCoCLIP         36.79     26.03
UNCHA (Ours)     39.03     32.28

S.2.4.4 Bounding box classification

Experimental setting. Bounding box classification evaluates a model’s ability to recognize objects within localized regions using only textual descriptions. Following prior work [70, 33], we crop bounding boxes from COCO-val2017 [40], LVIS [19], and Open Images [37] and classify them in a zero-shot manner.

Experimental results.
We report Top-1 and Top-5 accuracy in Tab. S.10. These results demonstrate that UNCHA (Ours) achieves consistently superior performance across all datasets, COCO, LVIS, and OpenImages, showing large gains over existing approaches. The improvements are particularly prominent in Top-1 accuracy, with margins as high as 32.89%, which highlights the model’s ability to precisely associate localized visual regions with their corresponding textual concepts under zero-shot settings. This suggests that UNCHA (Ours) produces representations that remain stable and discriminative even when object regions are tightly cropped and contextual cues are minimized.

Table S.10: Box-level zero-shot classification accuracy on COCO [40], LVIS [19], and OpenImages [37]. We report Top-1 and Top-5 accuracy. UNCHA (Ours) achieves consistently superior performance across all datasets, showing substantial improvements over CLIP [53], MERU [10], ATMG [54], and HyCoCLIP [49], with Top-1 gains reaching up to 32.89%. These results indicate that our hyperbolic alignment mechanism enables more reliable region-level grounding and captures part-whole semantic structure more faithfully than prior baselines.

                 COCO           LVIS           OpenImages
Model            Top-1  Top-5   Top-1  Top-5   Top-1  Top-5
ViT-S/16
CLIP             34.98  60.74   5.81   13.97   13.81  35.76
MERU             43.51  66.77   6.43   15.06   16.51  41.26
ATMG             19.24  34.85   5.45   13.49   9.72   26.28
HyCoCLIP         45.36  73.17   11.12  25.28   20.79  47.57
Ours             51.57  77.11   13.65  29.03   24.36  53.26
ViT-B/16
CLIP             35.22  62.84   6.84   16.16   14.90  38.18
MERU             44.55  68.10   7.41   16.37   18.14  42.23
ATMG             21.19  37.61   6.19   14.84   10.52  29.09
HyCoCLIP         47.88  74.79   12.92  27.31   22.16  48.78
Ours             54.14  79.03   17.17  33.21   23.81  52.53

S.2.5 Analysis

S.2.5.1 Hyperbolic embedding analysis

We conduct several visualization studies on the COCO val2017 dataset [40]. First, Fig.
S.6 shows the relative distribution of the embeddings produced by HyCoCLIP and our method, visualized using HoroPCA [6] according to their distance from the origin. HyCoCLIP embeddings lie closer to the origin, whereas ours are positioned farther from the origin in the hyperbolic space. In addition, our embeddings are more widely dispersed, with reduced overlap between part and whole image/text representations. This indicates that our hyperbolic alignment utilizes the available hyperbolic volume more effectively.

In addition, Fig. S.7 presents qualitative examples in which we visualize a subset of COCO part texts and part images using HoroPCA. As shown, the global image concept “bedroom” and its corresponding text representation reside farther from the origin in the hyperbolic space, while multiple part-level objects distribute across different regions according to their uncertainty. Note that the part text “chair” appears multiple times in the part-text dataset, so we depict its labels as stacked in the visualization. A similar pattern also emerges in the PCA visualization shown in the green box region of Fig. S.7, where several part-text embeddings overlap for the same reason.

Figure S.6: Hyperbolic embedding visualization using HoroPCA. On the COCO dataset [40], we compare the hyperbolic embeddings of our model with those of HyCoCLIP [49]. While HyCoCLIP embeddings are largely concentrated near the origin, ours are distributed farther away, enabling broader and more effective utilization of the hyperbolic space.

Figure S.7: Hyperbolic embedding of whole vs. part representations. Whole-scene images and texts lie deeper in the hyperbolic space, while part-level representations cluster closer to the origin. The zoom-in view and examples illustrate how parts such as chair, bed, and dining table are organized relative to the whole-scene embedding.

S.2.5.2 Hyperparameter sensitivity analysis

We conduct an analysis of λ_1 and λ_2.
Following prior studies on Leaky-ReLU activations [71], we use a small α to preserve sufficient non-linearity while preventing unstable optimization. Results for λ_1 and λ_2 are summarized in Tab. S.11, where all models are trained for 100k iterations. For consistency, we follow the same training protocol and architectural setup as in our main experiments, using the ViT-S configuration. In Tab. S.11, we report both classification (Cls.) and retrieval (Ret.) performance, where each value corresponds to the average over all classification and retrieval tasks, respectively. As shown in the table, our method maintains stable performance across different choices of λ_1 and λ_2, with only minor variations. Notably, our approach tends to achieve either stronger classification or retrieval performance depending on the hyperparameter setting, while avoiding significant degradation in either metric. This demonstrates that our method is robust to the choice of hyperparameters and does not require sensitive tuning to achieve competitive performance.

Table S.11: Hyperparameter sensitivity analysis of λ_1 and λ_2 at 100k iterations. Each entry reports classification (Cls.) / retrieval (Ret.) performance averaged across all tasks. Our method demonstrates stable performance across a wide range of values, with λ_1 = 0.5 and λ_2 = 10.0 selected as the default setting.

λ_1          0.3          0.4          0.5          0.6          0.7
Cls. / Ret.  31.9 / 63.6  31.5 / 64.2  31.6 / 63.8  31.5 / 64.2  31.1 / 63.4
λ_2          9.0          9.5          10.0         10.5         11.0
Cls. / Ret.  31.3 / 64.2  31.5 / 64.9  31.6 / 63.8  31.5 / 62.9  31.4 / 63.2

S.2.5.3 Role and influence of individual loss terms

We analyze the role of each loss component at 100k iterations. Fig. S.8(a) shows the cosine similarity between the gradients of the different loss terms.
The uncertainty calibration loss exhibits an opposing gradient direction to the entailment loss, acting as a regularizer that prevents representation collapse and stabilizes training. In contrast, the uncertainty-guided contrastive loss remains well aligned with the standard contrastive objective, reinforcing the primary learning signal. Fig. S.8(b) visualizes the embedding distributions on COCO [40] using HoroPCA [6]. In the full model ((b)-1), embeddings are well-structured with clear relationships between scene text (yellow ★) and part images (green ★). Removing the uncertainty-guided contrastive loss ((b)-2) weakens this relational alignment, while removing the uncertainty calibration loss ((b)-3) causes the embeddings to concentrate in a narrower region (approximately 0.57R), reducing representational capacity. Overall, the uncertainty-guided contrastive loss improves relational alignment, whereas the uncertainty calibration loss maintains a well-distributed embedding space and prevents such contraction. Figure S.8: Analysis of our newly introduced loss terms. (a) Cosine similarity between gradients of different loss components, showing that the uncertainty calibration loss acts as a regularizer by opposing the entailment loss, while the uncertainty-guided contrastive loss remains aligned with the main contrastive objective. (b) Visualization of embedding distributions using HoroPCA on COCO, where the full model exhibits well-structured representations, while removing each loss term leads to degraded alignment or concentrated embeddings. S.2.5.4 Embedding analysis on hyperbolic radius. In Fig. 4, following prior work [49], we first visualize embedding distances using the Euclidean norm. However, this does not fully reflect the geometry of hyperbolic space. To address this, we re-plot the results using the hyperbolic distance from the origin, d_L(o, p), in Fig. S.9.
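For readers unfamiliar with the notation, here is a minimal sketch of d_L(o, p) under one common convention (Lorentzian inner product ⟨x, y⟩_L = -x0·y0 + Σ xi·yi and curvature -c with c = 1); the convention and the example embeddings are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def lift_to_hyperboloid(v, c=1.0):
    """Lift a Euclidean 'space' vector v onto the Lorentz hyperboloid of
    curvature -c by solving <x, x>_L = -1/c for the time coordinate x0."""
    x0 = np.sqrt(1.0 / c + v @ v)
    return np.concatenate(([x0], v))

def dist_from_origin(p, c=1.0):
    """d_L(o, p) with origin o = (1/sqrt(c), 0, ..., 0), using
    d_L(x, y) = arccosh(-c * <x, y>_L) / sqrt(c)."""
    inner = -(1.0 / np.sqrt(c)) * p[0]  # o has no space components
    return float(np.arccosh(-c * inner) / np.sqrt(c))

part = lift_to_hyperboloid(np.array([0.3, -0.4, 0.2]))   # hypothetical part embedding
whole = lift_to_hyperboloid(np.array([1.2, -1.6, 0.8]))  # hypothetical whole embedding
print(dist_from_origin(part) < dist_from_origin(whole))  # prints True
```

Unlike the Euclidean norm, this distance grows without bound as points approach the boundary, which is why it better reflects the exponentially expanding volume of the space.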
Due to the exponential expansion of hyperbolic space with radius [3], points farther from the origin lie in regions with significantly larger effective volume. Therefore, analyzing embeddings with d_L(o, p) provides a more faithful view of their distribution and better captures hierarchical and semantic structure. Figure S.9: Hyperbolic embedding analysis using hyperbolic radius. Distances are measured by d_L(o, p) instead of the Euclidean norm to better reflect the intrinsic geometry of hyperbolic space. The results show that embeddings are distributed across different radial regions, corresponding to varying levels of semantic granularity and representational capacity. S.2.5.5 Hyperbolic distribution during training To investigate how our hyperbolic alignment organizes part-whole relationships within the hyperbolic space, we visualize the distribution of embedding distances from the origin for whole images and their corresponding part-level crops, using both cropped and full images from the ImageNet [9, 56] dataset. As shown in Fig. S.10, as training progresses, the part-image distance from the origin decreases (i.e., the uncertainty associated with part images steadily increases), and the separation between the two distributions becomes more pronounced. This pattern indicates that the model gradually enhances its ability to distinguish part-level content from full-scene contexts. The bottom row of Fig. S.10 reports three statistical distances, Maximum Mean Discrepancy (MMD), Wasserstein-1 distance (W1), and Wasserstein-2 distance (W2), computed at every iteration, quantitatively confirming the growing divergence between the part and whole image distributions. Consistent with the visual trends, all three metrics rise sharply during the early stages of training and gradually stabilize as the model converges. W1 measures the minimum amount of mass that must be transported to align one distribution with the other, reflecting differences in their overall locations.
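A minimal sketch of how these three distances can be estimated from 1D samples of embedding radii; the equal-size sorted-sample estimator for the Wasserstein distances and the RBF-kernel MMD with a fixed bandwidth are standard choices assumed here, not details taken from the paper:

```python
import numpy as np

def wasserstein_1d(x, y, p=1):
    """W_p between equal-size 1D empirical distributions: mean p-th power
    gap between sorted samples, then the 1/p-th root."""
    gaps = np.abs(np.sort(x) - np.sort(y))
    return float(np.mean(gaps ** p) ** (1.0 / p))

def mmd_rbf(x, y, sigma=1.0):
    """Squared MMD with an RBF kernel (bandwidth sigma is an assumption)."""
    k = lambda a, b: np.exp(-((a[:, None] - b[None, :]) ** 2) / (2.0 * sigma**2))
    return float(k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean())

rng = np.random.default_rng(0)
part_radii = rng.normal(loc=0.4, scale=0.1, size=1000)   # stand-in: part-image radii
whole_radii = rng.normal(loc=0.8, scale=0.1, size=1000)  # stand-in: whole-image radii

print(wasserstein_1d(part_radii, whole_radii, p=1))  # close to the 0.4 mean shift
print(wasserstein_1d(part_radii, whole_radii, p=2))  # >= W1 by the power-mean inequality
print(mmd_rbf(part_radii, whole_radii))              # > 0 for distinct distributions
```

As the part and whole radius distributions drift apart during training, all three estimates grow, matching the trend reported in Fig. S.10.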
W2 extends this by incorporating squared deviations, making it more sensitive to changes in distributional spread. MMD evaluates the discrepancy between two distributions by comparing their kernel-based mean embeddings, capturing differences in both central tendency and higher-order statistical structure. Figure S.10: Hyperbolic embedding distributions of whole images vs. part images across training iterations. As training progresses, the uncertainty distributions of whole images and small crops gradually diverge, indicating increasing part–whole separation in the learned hyperbolic space. The bottom row reports iteration-wise distributional distances (MMD, W1, W2), which quantitatively confirm the growing discrepancy between the two distributions. S.2.5.6 Dense feature localization visualization We follow a setting analogous to Sec. S.2.4.3 and perform dense localization on the VOC dataset [14] by computing the similarity between text queries and dense features. The resulting visualizations are presented in Fig. S.11 and Fig. S.12. As shown, our method consistently provides the most fine-grained and accurate localization across a diverse set of object classes and input images. Notably, our model is able to correctly highlight objects that competing methods either fail to capture (e.g., person, sofa) or detect with substantially less precision (e.g., dining table, potted plant). These findings demonstrate that our approach achieves a more detailed and robust understanding of complex, multi-object scenes compared to existing baselines. Quantitative results supporting these observations are reported in Sec. S.2.4.3. Figure S.11: Dense feature localization visualizations for zero-shot semantic segmentation. Following the procedure described in Sec. S.2.4.3, similarity maps on the VOC dataset are generated by extracting dense features and computing their correspondence to text queries.
Our method produces sharper and more localized activations that align more accurately with the queried object categories. Figure S.12: Dense feature visualizations for zero-shot semantic segmentation. Similarity maps on the VOC dataset are generated by extracting dense features and computing their correspondence to text queries, following the procedure described in Sec. S.2.4.3. Our method produces sharper and more localized activations that align more accurately with the queried object categories. S.2.5.7 Uncertainty-based ordering of part images We investigate how well part images are organized within the hyperbolic space by sorting them based on uncertainty and comparing them with HyCoCLIP. Because the Euclidean norm, hyperbolic radius, and uncertainty are monotonically related (differing only in direction), we sort HyCoCLIP embeddings by their Euclidean norms for a fair comparison with our uncertainty-based ordering. The results are presented in Fig. S.13. As shown, HyCoCLIP produces several misordered cases where abstract or highly representative samples appear in inconsistent positions, whereas our method yields a more coherent ordering in which part images align naturally according to their scene-level representativeness. Figure S.13: Comparison of uncertainty-based ordering of part images between HyCoCLIP [49] and UNCHA (Ours). UNCHA produces a coherent ordering in which part images are arranged according to their scene-level representativeness. S.2.5.8 Hyperbolic embedding visualization with various datasets We analyze how part images, part texts, whole scene images, and whole scene texts are distributed within the hyperbolic embedding space by conducting the visualization shown in Fig. S.14. All experiments are performed using our ViT-B model on both the COCO [40] and OpenImages [37] datasets.
As illustrated, part-level data consistently occupy regions closer to the origin compared to whole-scene representations, and this trend remains stable across different datasets. Figure S.14: Distribution of hyperbolic embeddings across datasets. Using UNCHA (ViT-B), we visualize part and whole representations from OpenImages [37] and COCO [40]. Across both datasets, part-level embeddings appear closer to the origin, while whole-scene embeddings lie farther away, consistently reflecting their hierarchical structure. References [1] R. Abbasi, A. Nazari, A. Sefid, M. Banayeeanzade, M. H. Rohban, and M. S. Baghshah (2025) CLIP under the microscope: a fine-grained analysis of multi-object representation. In CVPR, Cited by: §S.2.2.4, §1, §2.1, §4.2.4. [2] M. G. Atigh, J. Schoep, E. Acar, N. van Noord, and P. Mettes (2022) Hyperbolic image segmentation. In CVPR, Cited by: §1, §2.2, §2.2, §3.2.1. [3] M. R. Bridson and A. Haefliger (2013) Metric spaces of non-positive curvature. Vol. 319, Springer Science & Business Media. Cited by: §S.2.5.4. [4] M. Byeon, B. Park, H. Kim, S. Lee, W. Baek, and S. Kim (2022) COYO-700m: image-text pair dataset. Note: https://github.com/kakaobrain/coyo-dataset. Cited by: §S.2.1. [5] I. Chami, A. Gu, V. Chatziafratis, and C. Ré (2020) From trees to continuous embeddings and back: hyperbolic hierarchical clustering. NeurIPS. Cited by: §1. [6] I. Chami, A. Gu, D. P. Nguyen, and C. Ré (2021) Horopca: hyperbolic dimensionality reduction via horospherical projections. In International Conference on Machine Learning, p. 1419–1429. Cited by: §S.2.5.1, §S.2.5.3. [7] I. Chami, Z. Ying, C. Ré, and J. Leskovec (2019) Hyperbolic graph convolutional neural networks. NeurIPS 32. Cited by: §2.2. [8] M. Chen, Y. Bai, J. D. Lee, T. Zhao, H. Wang, C. Xiong, and R. Socher (2020) Towards understanding hierarchical learning: benefits of neural representations. NeurIPS. Cited by: §S.1.1, §1. [9] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L.
Fei-Fei (2009) Imagenet: a large-scale hierarchical image database. In CVPR, Cited by: §S.2.2.3, §S.2.5.5, Table 2, Table 2. [10] K. Desai, M. Nickel, T. Rajpurohit, J. Johnson, and S. R. Vedantam (2023) Hyperbolic image-text representations. In ICML, Cited by: §S.1.1, §S.1.2, §S.2.1, §S.2.2.1, Table S.10, Table S.7, Table S.7, Table S.8, Table S.8, §1, §1, §2.2, Figure 2, Figure 2, §3.1, §3.2, §3.2.2, §3.2.3, §4.1, §4.2.1, Table 1, Table 1, Table 2, Table 2, Table 3, Table 5. [11] B. Dhingra, C. Shallue, M. Norouzi, A. Dai, and G. Dahl (2018) Embedding text in hyperbolic spaces. In Proceedings of the Twelfth Workshop on Graph-Based Methods for Natural Language Processing (TextGraphs-12), Cited by: §1, §2.2. [12] A. Dosovitskiy (2021) An image is worth 16x16 words: transformers for image recognition at scale. ICLR. Cited by: §S.1.1. [13] M. Elhoseiny, B. Saleh, and A. Elgammal (2013) Write a classifier: zero-shot learning using purely textual descriptions. In CVPR, Cited by: §S.2.2.1. [14] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman (2010) The pascal visual object classes (voc) challenge. IJCV. Cited by: §S.2.2.4, §S.2.5.6, §4.2.4. [15] L. Franco, P. Mandica, K. Kallidromitis, D. Guillory, Y. Li, T. Darrell, and F. Galasso (2023) Hyperbolic active learning for semantic segmentation under domain shift. ICML. Cited by: §1, §2.2, §3.2.1. [16] O. Ganea, G. Bécigneul, and T. Hofmann (2018) Hyperbolic entailment cones for learning hierarchical embeddings. In ICML, Cited by: §1, §2.2. [17] Y. Ge, J. Ren, A. Gallagher, Y. Wang, M. Yang, H. Adam, L. Itti, B. Lakshminarayanan, and J. Zhao (2023) Improving zero-shot generalization and robustness of multi-modal models. In CVPR, Cited by: §2.1. [18] Y. Grandvalet and Y. Bengio (2004) Semi-supervised learning by entropy minimization. NeurIPS. Cited by: §3.2.3. [19] A. Gupta, P. Dollar, and R. Girshick (2019) Lvis: a dataset for large vocabulary instance segmentation. 
In CVPR, Cited by: §S.2.4.4, Table S.10, Table S.10. [20] S. Hajimiri, I. B. Ayed, and J. Dolz (2025) Pay attention to your neighbours: training-free open-vocabulary semantic segmentation. In WACV, Cited by: §S.2.4.3. [21] N. He, J. Liu, B. Zhang, N. Bui, A. Maatouk, M. Yang, I. King, M. Weber, and R. Ying (2025) Position: beyond euclidean–foundation models should embrace non-euclidean geometries. arXiv preprint arXiv:2504.08896. Cited by: §1. [22] N. He, H. Madhu, N. Bui, M. Yang, and R. Ying (2025) Hyperbolic deep learning for foundation models: a survey. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V. 2, Cited by: §2.2. [23] N. He, M. Yang, and R. Ying (2025) Hypercore: the core framework for building hyperbolic foundation models with comprehensive modules. arXiv preprint arXiv:2504.08912. Cited by: §2.2. [24] X. He and Y. Peng (2017) Fine-grained image classification via combining vision and language. In CVPR, Cited by: §2.1. [25] G. Hinton (1979) Some demonstrations of the effects of structural descriptions in mental imagery. Cognitive Science. Cited by: §1. [26] G. Hinton (2023) How to represent part-whole hierarchies in a neural network. Neural Computation. Cited by: §1. [27] Z. Huang, Z. Zeng, B. Liu, D. Fu, and J. Fu (2020) Pixel-bert: aligning image pixels with text by deep multi-modal transformers. arXiv preprint arXiv:2004.00849. Cited by: §2.1. [28] S. Ibrahimi, M. G. Atigh, N. Van Noord, P. Mettes, and M. Worring (2024) Intriguing properties of hyperbolic embeddings in vision-language models. TMLR. Cited by: §2.1. [29] OpenCLIP. Cited by: §S.1.1. [30] T. Ito, T. Klinger, D. Schultz, J. Murray, M. Cole, and M. Rigotti (2022) Compositional generalization through abstract representations in human and artificial neural networks. NeurIPS. Cited by: §1. [31] C. Jia, Y. Yang, Y. Xia, Y. Chen, Z. Parekh, H. Pham, Q. Le, Y. Sung, Z. Li, and T.
Duerig (2021) Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, p. 4904–4916. Cited by: §1, §2.1. [32] L. Jin, G. Luo, Y. Zhou, X. Sun, G. Jiang, A. Shu, and R. Ji (2023) Refclip: a universal teacher for weakly supervised referring expression comprehension. In CVPR, Cited by: §2.1. [33] D. Jing, X. He, Y. Luo, N. Fei, W. Wei, H. Zhao, Z. Lu, et al. (2024) Fineclip: self-distilled region-based clip for better fine-grained understanding. NeurIPS. Cited by: §S.2.4.4. [34] A. Karpathy and L. Fei-Fei (2015) Deep visual-semantic alignments for generating image descriptions. In CVPR, Cited by: §S.2.2.2, §4.2.2. [35] V. Khrulkov, L. Mirvakhabova, E. Ustinova, I. Oseledets, and V. Lempitsky (2020) Hyperbolic image embeddings. In CVPR, Cited by: §1, §2.2. [36] R. Kiros, R. Salakhutdinov, and R. S. Zemel (2014) Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539. Cited by: §2.1. [37] A. Kuznetsova, H. Rom, N. Alldrin, J. Uijlings, I. Krasin, J. Pont-Tuset, S. Kamali, S. Popov, M. Malloci, A. Kolesnikov, et al. (2020) The open images dataset v4: unified image classification, object detection, and visual relationship detection at scale. IJCV. Cited by: Figure S.14, Figure S.14, §S.2.4.4, §S.2.5.8, Table S.10, Table S.10. [38] M. Le, S. Roller, L. Papaxanthos, D. Kiela, and M. Nickel (2019) Inferring concept hierarchies from text corpora via hyperbolic embeddings. In Proceedings of the 57th annual meeting of the association for computational linguistics, Cited by: §2.2, §3.2.3. [39] J. Li, R. Selvaraju, A. Gotmare, S. Joty, C. Xiong, and S. C. H. Hoi (2021) Align before fuse: vision and language representation learning with momentum distillation. Advances in neural information processing systems 34, p. 9694–9705. Cited by: §1, §2.1. [40] T. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. 
Zitnick (2014) Microsoft coco: common objects in context. In ECCV, Cited by: Figure S.14, Figure S.14, Figure S.6, Figure S.6, §S.2.2.2, §S.2.2.4, §S.2.4.4, §S.2.5.1, §S.2.5.3, §S.2.5.8, Table S.10, Table S.10, §4.2.2, §4.2.4. [41] Q. Liu, M. Nickel, and D. Kiela (2019) Hyperbolic graph neural networks. NeurIPS 32. Cited by: §2.2. [42] I. Loshchilov and F. Hutter (2017) Sgdr: stochastic gradient descent with warm restarts. ICLR. Cited by: §S.1.3. [43] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. ICLR. Cited by: §S.1.3. [44] J. Lu, D. Batra, D. Parikh, and S. Lee (2019) Vilbert: pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. NeurIPS 32. Cited by: §2.1. [45] A. L. Maas, A. Y. Hannun, A. Y. Ng, et al. (2013) Rectifier nonlinearities improve neural network acoustic models. In ICML, Cited by: §3.2.3. [46] P. Mandica, L. Franco, K. Kallidromitis, S. Petryk, and F. Galasso (2024) Hyperbolic learning with multimodal large language models. In ECCV, Cited by: §1, §2.2, §2.2, §3.2.1. [47] G. A. Miller (1995) WordNet: a lexical database for english. Communications of the ACM. Cited by: §S.2.2.3. [48] M. Nickel and D. Kiela (2017) Poincaré embeddings for learning hierarchical representations. NeurIPS. Cited by: §1, §2.1. [49] A. Pal, M. van Spengler, G. M. D. di Melendugno, A. Flaborea, F. Galasso, and P. Mettes (2024) Compositional entailment learning for hyperbolic vision-language models. ICLR. Cited by: §S.1.1, §S.1.2, Figure S.13, Figure S.13, Figure S.6, Figure S.6, §S.2.1, §S.2.2.1, §S.2.2.3, §S.2.5.4, Table S.10, Table S.7, Table S.7, Table S.8, Table S.8, §1, §1, §1, §2.1, §2.2, Figure 2, Figure 2, §3.1, §3.2, §3.2.2, §3.2.2, §3.2.3, §3.2.3, Figure 5, Figure 5, §4.1, §4.2.3, §4.3, Table 1, Table 1, Table 2, Table 2, Table 3, Table 5. [50] W. Peng, T. Varanka, A. Mostafa, H. Shi, and G. Zhao (2021) Hyperbolic deep neural networks: a survey. 
IEEE Transactions on pattern analysis and machine intelligence. Cited by: §2.1. [51] Z. Peng, W. Wang, L. Dong, Y. Hao, S. Huang, S. Ma, and F. Wei (2023) Kosmos-2: grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824. Cited by: §S.2.1, §4.1, Table 1. [52] S. Pratt, I. Covert, R. Liu, and A. Farhadi (2023) What does a platypus look like? generating customized prompts for zero-shot image classification. In CVPR, Cited by: §2.1. [53] A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, et al. (2021) Learning transferable visual models from natural language supervision. In ICML, Cited by: §S.1.1, §S.2.1, §S.2.2.1, Table S.10, Table S.7, Table S.7, Table S.8, Table S.8, §1, §2.1, §4.1, §4.2.1, Table 1, Table 1, Table 2, Table 2, Table 3, Table 5. [54] S. Ramasinghe, V. Shevchenko, G. Avraham, and A. Thalaiyasingam (2024) Accept the modality gap: an exploration in the hyperbolic space. In CVPR, Cited by: §S.1.2, §S.2.1, §S.2.2.1, Table S.10, Table S.7, Table S.7, Table S.8, Table S.8, §1, §1, §1, §2.1, §2.2, §3.1, §3.2, §3.2.3, §4.1, §4.2.1, Table 1, Table 1, Table 2, Table 2, Table 3, Table 5. [55] Z. Ren, H. Jin, Z. Lin, C. Fang, and A. Yuille (2016) Joint image-text representation by gaussian visual-semantic embedding. In Proceedings of the 24th ACM international conference on Multimedia, Cited by: §2.1. [56] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. (2015) Imagenet large scale visual recognition challenge. IJCV. Cited by: §S.2.2.3, §S.2.3.2, §S.2.5.5, Figure 4, Figure 4, §4.3. [57] A. Sain, A. K. Bhunia, P. N. Chowdhury, S. Koley, T. Xiang, and Y. Song (2023) Clip for all things zero-shot sketch-based image retrieval, fine-grained or not. In CVPR, Cited by: §2.1. [58] F. Sala, C. De Sa, A. Gu, and C. Ré (2018) Representation tradeoffs for hyperbolic embeddings. In ICML, Cited by: §1, §2.2. [59] S. D. 
Sarkar, O. Miksik, M. Pollefeys, D. Barath, and I. Armeni (2025) CrossOver: 3d scene cross-modal alignment. In CVPR, Cited by: §2.1. [60] A. Sinha, S. Zeng, M. Yamada, and H. Zhao (2024) Learning structured representations with hyperbolic embeddings. NeurIPS. Cited by: §2.2. [61] A. Tifrea, G. Bécigneul, and O. Ganea (2018) Poincaré glove: hyperbolic word embeddings. arXiv preprint arXiv:1810.06546. Cited by: §2.2. [62] H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablayrolles, and H. Jégou (2021) Training data-efficient image transformers & distillation through attention. In ICML, Cited by: §S.1.1. [63] J. Urbanek, F. Bordes, P. Astolfi, M. Williamson, V. Sharma, and A. Romero-Soriano (2024) A picture is worth more than 77 text tokens: evaluating clip-style models on dense captions. In CVPR, Cited by: §S.2.2.5, §4.2.5, Table 3. [64] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. NeurIPS. Cited by: §S.1.1. [65] I. Vendrov, R. Kiros, S. Fidler, and R. Urtasun (2015) Order-embeddings of images and language. arXiv preprint arXiv:1511.06361. Cited by: §1, §2.2, §3.2. [66] F. Wang, J. Mei, and A. Yuille (2024) Sclip: rethinking self-attention for dense vision-language inference. In ECCV, Cited by: §S.2.4.3. [67] J. Whittington, T. Muller, S. Mark, C. Barry, and T. Behrens (2018) Generalisation of structural knowledge in the hippocampal-entorhinal system. NeurIPS. Cited by: §1. [68] H. Wu, J. Mao, Y. Zhang, Y. Jiang, L. Li, W. Sun, and W. Ma (2019) Unified visual-semantic embeddings: bridging vision and language with structured meaning representations. In CVPR, Cited by: §2.1. [69] S. Wu, N. Élteto, I. Dasgupta, and E. Schulz (2022) Learning structure from the ground up—hierarchical representation learning by chunking. NeurIPS. Cited by: §1. [70] C. Xie, B. Wang, F. Kong, J. Li, D. Liang, G. Zhang, D. Leng, and Y. Yin (2025) FG-clip: fine-grained visual and textual alignment. ICML.
Cited by: §S.2.4.4. [71] B. Xu, N. Wang, T. Chen, and M. Li (2015) Empirical evaluation of rectified activations in convolutional network. arXiv preprint arXiv:1505.00853. Cited by: §S.2.5.2. [72] S. Yan, Z. Liu, and L. Xu (2023) Hyp-uml: hyperbolic image retrieval with uncertainty-aware metric learning. arXiv preprint arXiv:2310.08390. Cited by: §1, §2.2, §2.2, §3.2.1. [73] P. Young, A. Lai, M. Hodosh, and J. Hockenmaier (2014) From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions. Transactions of the association for computational linguistics. Cited by: §S.2.2.2, §4.2.2.