
Paper deep dive

Upper Entropy for 2-Monotone Lower Probabilities

Tuan-Anh Vu, Sébastien Destercke, Frédéric Pichon

Year: 2026 · Venue: arXiv preprint · Area: cs.LG · Type: Preprint · Embeddings: 59

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 93%

Last extracted: 3/26/2026, 1:37:03 AM

Summary

This paper presents a strongly polynomial algorithmic approach for computing the upper entropy of 2-monotone lower probabilities (credal sets). The authors demonstrate that the problem can be solved using supermodular function maximization (SFM) and provide a decomposition algorithm that improves upon previous exponential or quadratic methods. They also detail specialized, efficient implementations for belief functions, possibility distributions, and probability intervals.

Entities (6)

Upper Entropy · concept · 98%
2-monotone lower probabilities · mathematical-object · 95%
Supermodular Function Maximization · algorithm · 95%
Belief Function · mathematical-object · 92%
Possibility Distribution · mathematical-object · 92%
Frank–Wolfe algorithm · algorithm · 90%

Relation Signals (3)

Supermodular Function Maximization computes Upper Entropy

confidence 95% · We demonstrate that by using supermodular optimization... the problem has a strongly polynomial solution

2-monotone lower probabilities is a case of Credal sets

confidence 95% · credal sets induced by 2-monotone (a.k.a. supermodular) capacities

Upper Entropy is a measure of Uncertainty

confidence 90% · upper entropy plays a central role as an uncertainty measure

Cypher Suggestions (2)

Find all algorithms used to compute upper entropy · confidence 90% · unvalidated

MATCH (a:Algorithm)-[:COMPUTES]->(e:Entity {name: 'Upper Entropy'}) RETURN a.name

Map relationships between mathematical objects and their properties · confidence 85% · unvalidated

MATCH (m:MathematicalObject)-[:HAS_PROPERTY]->(p:Property) RETURN m.name, p.name

Abstract

Uncertainty quantification is a key aspect in many tasks such as model selection/regularization, or quantifying prediction uncertainties to perform active learning or OOD detection. Within credal approaches that consider modeling uncertainty as probability sets, upper entropy plays a central role as an uncertainty measure. This paper is devoted to the computational aspect of upper entropies, providing an exhaustive algorithmic and complexity analysis of the problem. In particular, we show that the problem has a strongly polynomial solution, and propose many significant improvements over past algorithms proposed for 2-monotone lower probabilities and their specific cases.

Tags

ai-safety (imported, 100%) · cslg (suggested, 92%) · preprint (suggested, 88%)

Links


Full Text

58,204 characters extracted from source content.


Upper Entropy for 2-Monotone Lower Probabilities

Tuan-Anh Vu, UMR CNRS 7253 Heudiasyc, Université de Technologie de Compiègne, France (tuan-anh.vu@hds.utc.fr) · Sébastien Destercke, UMR CNRS 7253 Heudiasyc, Université de Technologie de Compiègne, France (sebastien.destercke@hds.utc.fr) · Frédéric Pichon, Laboratoire de Génie Informatique et d'Automatique de l'Artois (LGI2A), Université d'Artois, UR 3926, France (frederic.pichon@univ-artois.fr)

Abstract. Uncertainty quantification is a key aspect in many tasks such as model selection/regularization, or quantifying prediction uncertainties to perform active learning or OOD detection. Within credal approaches that consider modeling uncertainty as probability sets, upper entropy plays a central role as an uncertainty measure. This paper is devoted to the computational aspect of upper entropies, providing an exhaustive algorithmic and complexity analysis of the problem. In particular, we show that the problem has a strongly polynomial solution, and propose many significant improvements over past algorithms proposed for 2-monotone lower probabilities and their specific cases.

1 Introduction

Imprecise probabilistic or credal [Levi, 1980] approaches offer rich representations of uncertainty, usually coming in the form of convex sets of probabilities or of equivalent representations such as lower envelopes of expectations [Troffaes and De Cooman, 2014], which we will call credal sets here. They have been used in many facets of AI, such as symbolic reasoning [Hansen et al., 2000], graphical models [Cozman, 2000] and machine learning [Caprio et al., 2024], with this latter trend gaining traction, as many recent papers on the topic show [Singh et al., 2024, Wang et al., 2024, Nguyen et al., 2025, Löhr et al., 2025].
Uncertainty quantification, by which we mean quantifying some aspect of an uncertainty representation by a value, is another important aspect of reasoning under uncertainty, and is a cornerstone of, e.g., Klir's theory [Klir, 2006]. Such uncertainty quantifications can serve, e.g., to select a particular model among many possible ones [Dubey et al., 2018], as a regularization tool when used as a penalty term [Grandvalet and Bengio, 2006], or as a way to quantify predictive uncertainty. Within credal approaches, upper entropy plays a particular role, as it can be shown to satisfy quite a number of axioms [Klir, 2006, Sec. 6.7] when considered as a measure of total uncertainty. It can also be associated with a robust, minimax use of the logarithmic scoring rule: if your uncertainty is represented by a credal set P, and if Ē_P[−ln Q] is the maximum loss of choosing the distribution Q under knowledge P with a logarithmic scoring rule, then the Q minimizing Ē_P[−ln Q] is the one of maximal entropy [Walley, 1991, Sec. 5.12]. It is also commonly used in various learning settings, such as the learning of decision trees [Abellán and Moral, 2005], or as a means to quantify the total uncertainty of predictions [Hüllermeier and Waegeman, 2021, Javanmardi et al., 2024].

Given this importance of upper entropy, it is surprising that the computational aspects of calculating it for large families of credal sets have not been much explored. Abellán and Moral [2006] provide an algorithm for credal sets induced by 2-monotone capacities, yet only conclude that it is a "difficult problem". This algorithm is derived from the MRW algorithm [Meyerowitz et al., 1994, Harmanec et al., 1996], which is dedicated to belief functions and is claimed by Huynh and Nakamori [2009] to be of exponential complexity in the size of the considered space. Recent improvements of the MRW algorithm do not discuss complexity either [Abellán et al., 2023].
A more efficient algorithm, quadratic in the size of the space, exists for the specific case of probability intervals [Abellan and Moral, 2003].

The goal of this paper is to revisit the problem of computing upper entropies for credal sets P induced by 2-monotone (a.k.a. supermodular) capacities, which include many special cases such as belief functions, probability intervals, ε-contamination models and possibility distributions. Our main contributions to this problem are the following:

• We demonstrate that, by using supermodular optimization, the exact algorithm proposed by Abellán and Moral [2006] can be made polynomial by making a number of calls to supermodular function maximization (SFM) that is quadratic in the space size. This insight also allows us to propose a new algorithm whose number of calls to SFM is linear in the space size. These results both show that the problem is not as hard as previously claimed, and improve significantly upon existing methods. This is elaborated in Section 3.

• We provide improved algorithms for the specific cases of belief functions, possibility distributions and probability intervals, which are commonly considered in applications (see, e.g., [Caprio et al., 2025, Chau et al., 2025]). These are detailed in Section 4.

• We propose in Section 5 an approximation algorithm using the Frank–Wolfe algorithm [Bach, 2013] that brackets the true result between a lower and an upper bound, thereby allowing us to control the approximation error.

Finally, we propose in Section 6 some first experiments that compare our algorithms and show that our improvements make the associated computations scalable. Before that, let us start with some necessary preliminaries.

2 Preliminaries

2.1 2-monotone lower probability

Throughout this paper, we consider a finite space Ω = {1, 2, …, n}. A set function ν : 2^Ω → ℝ with ν(∅) = 0 is called submodular if

ν(A) + ν(B) ≥ ν(A∪B) + ν(A∩B)  ∀A, B ⊆ Ω. (1)
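On a small space, the defining inequality can be checked exhaustively. A minimal sketch, assuming a dict-based representation of ν (real instances would instead go through an evaluation oracle, as discussed next):

```python
from itertools import combinations

def is_submodular(nu, omega, tol=1e-12):
    """Check nu(A) + nu(B) >= nu(A|B) + nu(A&B) for all A, B.

    Exponential in |omega|; for illustration on tiny spaces only.
    `nu` maps frozensets of elements of `omega` to real values.
    """
    subsets = [frozenset(c) for k in range(len(omega) + 1)
               for c in combinations(sorted(omega), k)]
    return all(nu[A] + nu[B] + tol >= nu[A | B] + nu[A & B]
               for A in subsets for B in subsets)

def is_supermodular(nu, omega):
    """nu is supermodular iff -nu is submodular (reversed inequality)."""
    return is_submodular({A: -v for A, v in nu.items()}, omega)
```

A function can be both submodular and supermodular (then it is additive up to a constant), so the two checks are not mutually exclusive.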
Similarly, ν is called supermodular if −ν is submodular, i.e., the inequality in (1) is reversed. As listing 2^n values is intractable, we assume that ν is accessed through an oracle that returns ν(A) for any query A ⊆ Ω [Grötschel et al., 1993, Chapter 10], and we let EO be the time required for one query. The base polyhedron B(ν) ⊆ ℝ^n of ν is defined as

B(ν) = {x : Σ_{i∈A} x_i ≤ ν(A) ∀A ⊆ Ω, Σ_{i=1}^n x_i = ν(Ω)}. (2)

Submodular (supermodular) functions appear in many domains, including the theory of imprecise probabilities [Walley, 1991]. This uncertainty theory generalizes probability theory, which uses additive measures, notably by considering probability sets that can be induced by lower probabilities, which are non-additive measures. A set function μ : 2^Ω → [0,1] is called a lower probability if μ(Ω) = 1, μ(∅) = 0, and μ(A) ≤ μ(B) whenever A ⊆ B. The term "lower probability" comes from the interpretation that the true probability on Ω is unknown and μ encodes the available (limited) information in the form of lower bounds on event probabilities. Hence, the so-called credal set consisting of the probabilities on Ω that are compatible with μ is

P(μ) := {p ∈ Δ_n : p(A) ≥ μ(A) ∀A ⊆ Ω}.

Many important lower probabilities are supermodular, and have a specific name in imprecise probabilities.

Definition 1. A lower probability μ is called 2-monotone if it is supermodular.

Remark 1. For a lower probability μ, its dual, called an upper probability μ̄, is given by μ̄(A) = 1 − μ(Ω∖A) ∀A. It can be seen that P(μ) = {p : p(A) ≤ μ̄(A) ∀A ⊆ Ω}. Moreover, 2-monotonicity of μ implies that μ̄ is submodular, so P(μ) is exactly the base polyhedron B(μ̄) of μ̄ (see (2)).

Below, we highlight some notable special cases of 2-monotone lower probabilities.
Belief Function. A mass function on Ω is a mapping m : 2^Ω → [0,1] satisfying Σ_{A⊆Ω} m(A) = 1 and m(∅) = 0 [Shafer, 1976]. A subset A is called a focal set of m if m(A) > 0. m induces a special lower probability called a belief function, defined as Bel(B) := Σ_{A⊆B} m(A). The dual of Bel, the plausibility function, is defined as Pl(B) := Σ_{A∩B≠∅} m(A). In this case, EO is polynomial in |{A : m(A) > 0}|, and is therefore tractable if the mass function is positive on a limited number of subsets: this is for instance the case with k-additive belief functions [Grabisch, 1997] or imprecise cumulative distributions [Montes and Destercke, 2017].

Possibility Distribution. A possibility distribution π on Ω is a vector in [0,1]^n where each component π_i represents the degree of possibility of element i. π induces a lower probability on Ω, called the necessity measure, defined as N(A) := 1 − max_{i∈Ω∖A} π_i ∀A. The dual of N, the possibility measure, is Π(A) := max_{i∈A} π_i ∀A. Again, this means EO is polynomial.

Probability Intervals. In this case, the credal set is given directly as P = {p ∈ Δ_n : l_i ≤ p_i ≤ u_i ∀i}, where each interval [l_i, u_i] ⊆ [0,1] ∀i ∈ {1,…,n}. Moreover, we require that Σ_{i=1}^n l_i ≤ 1 ≤ Σ_{i=1}^n u_i so that P ≠ ∅. De Campos et al. [1994] showed that the set function μ where μ(A) := min_{p∈P} p(A) ∀A is a 2-monotone lower probability, and provide formulas such that EO is polynomial in |Ω|.

2.2 Upper Entropy

The upper entropy UE(μ) of a lower probability μ is defined as

UE(μ) := max_{p∈P(μ)} −Σ_{i=1}^n p_i log p_i. (3)

Algorithm 1 presents the Abellán and Moral [2006] procedure for computing UE(μ) when μ is 2-monotone.
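On tiny spaces, the procedure can be emulated directly, with an exhaustive subset search standing in for the optimization oracle (the dict-based μ and the helper name are illustrative assumptions, not the paper's implementation):

```python
import math
from itertools import combinations

def upper_entropy_bruteforce(mu, omega):
    """Sketch of the Abellan-Moral procedure: repeatedly extract the
    largest set maximizing mu(A)/|A|, assign its elements the uniform
    value mu(A)/|A|, and condition mu on the extracted elements.
    Exponential subset search, for illustration only.
    `mu` maps frozensets of elements of `omega` to values."""
    p = {}
    base, rest = frozenset(), frozenset(omega)
    while rest:
        if mu[base | rest] - mu[base] <= 1e-12:    # remaining mass is zero
            for i in rest:
                p[i] = 0.0
            break
        best_B, best_ratio = None, -1.0
        for k in range(1, len(rest) + 1):
            for B in combinations(sorted(rest), k):
                B = frozenset(B)
                # conditioned lower probability of B, given removed sets
                ratio = (mu[base | B] - mu[base]) / len(B)
                if ratio > best_ratio + 1e-12:
                    best_B, best_ratio = B, ratio
                elif ratio > best_ratio - 1e-12 and len(B) > len(best_B):
                    best_B = B                      # tie: prefer larger set
        for i in best_B:
            p[i] = best_ratio
        base, rest = base | best_B, rest - best_B
    entropy = -sum(q * math.log(q) for q in p.values() if q > 0)
    return entropy, p
```

A single accumulated `base` set suffices for the update step, because iterating it gives μ_k(B) = μ(B ∪ A_1 ∪ ⋯ ∪ A_k) − μ(A_1 ∪ ⋯ ∪ A_k).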
Setting μ = Bel yields the MRW algorithm [Meyerowitz et al., 1994, Harmanec et al., 1996] for belief functions. However, as said in the introduction, none of these papers carries out a detailed complexity analysis of these algorithms, merely concluding that the problem is "difficult", mostly due to Step 1 of Algorithm 1.

Algorithm 1 (Abellán–Moral's algorithm). Input: 2-monotone μ on Ω = {1,…,n}.
1. Find A such that μ(A)/|A| is maximum. If A is not unique, select the largest such set.
2. For each i ∈ A, set p_i = μ(A)/|A|.
3. For each B ⊆ Ω∖A, set μ(B) = μ(B∪A) − μ(A).
4. Set Ω = Ω∖A.
5. If Ω ≠ ∅ and μ(Ω) > 0, go to Step 1.
6. If μ(Ω) = 0 and Ω ≠ ∅, set p_i = 0 for all i ∈ Ω.
7. Return −Σ_{i=1}^n p_i log p_i.

3 Strong Polynomiality of computing Upper Entropy

We call an algorithm whose input is a supermodular (submodular) function strongly polynomial if it performs a number of oracle queries and arithmetic operations (addition, subtraction, multiplication, division, comparisons, etc.) bounded by poly(n). A notable example is supermodular function maximization (SFM) (equivalently, submodular function minimization), which seeks a subset maximizing (resp. minimizing) a supermodular (resp. submodular) set function; several algorithms are known, e.g., Orlin's algorithm [Orlin, 2009] runs in time O(n^5 EO + n^6).

Upon closer inspection, Algorithm 1 can be implemented so that it is strongly polynomial. The key observation is that although the optimization problem in Step 1 is not an SFM, it can be transformed into a series of SFMs.

Lemma 1. Finding A ∈ argmax_{∅≠B⊆Ω} μ(B)/|B| with |A| maximum amounts to solving O(n) SFMs.
The key to the proof, given in Appendix A, is that the function μ(A) − λ|A| is supermodular whenever μ is, and that finding the argmax in Lemma 1 is equivalent to successively finding the set A maximizing μ(A) − λ|A| (an SFM, known to be polynomial) for O(n) suitable values of λ.

Proposition 1. Algorithm 1 boils down to solving O(n²) SFMs, and is thus strongly polynomial.

Proof. Step 1 of Algorithm 1 requires solving O(n) SFMs (Lemma 1). Moreover, the update in Step 3 still results in a supermodular function on Ω∖A (see, e.g., [Abellán and Moral, 2006, Property 1]), whose evaluation oracle is given by the update formula. As Ω shrinks by at least one element per iteration (Step 4), Algorithm 1 terminates after at most n passes through the loop. Hence, it requires solving O(n²) SFMs. ∎

As it turns out, we can compute UE(μ) with improved complexity by using a decomposition algorithm [Nagano and Kawahara, 2013, Fujishige, 2005]. Let ν be a submodular function and b a positive vector in ℝ^n. Nagano and Kawahara [2013, Sec. 5] show that solving

min_{x∈B(ν)} Σ_{i=1}^n (x_i log(x_i/b_i) + b_i − x_i) (4)

reduces to finding a chain ∅ = S_0 ⊂ ⋯ ⊂ S_l = Ω and breakpoints 0 = α_0 < α_1 < ⋯ < α_l < α_{l+1} = +∞ such that, for any α ∈ [α_j, α_{j+1}), S_j is the unique minimizer of maximum cardinality of min_{A⊆Ω} ν(A) − α Σ_{i∈A} b_i. The minimizer of (4) is then given, for each i ∈ S_j∖S_{j−1} (j ∈ {1,…,l}), by

x_i = b_i (ν(S_j) − ν(S_{j−1})) / Σ_{i'∈S_j∖S_{j−1}} b_{i'}. (5)

Algorithm 2 computes the chain {S_j}. Given two sets S_j ⊂ S_k in the chain (initially S_j = ∅, S_k = Ω), it recursively checks whether S_j and S_k are consecutive; if not, it finds a new chain set strictly between them, and repeats the process until the whole chain is revealed.
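The recursion can be sketched in a few lines, again with an exhaustive search standing in for the SFM oracle and with b = 1 (the dict-based ν and helper names are illustrative assumptions):

```python
from itertools import combinations

def decompose(nu, omega):
    """Sketch of the decomposition recursion for b = 1: recover the chain
    of maximal minimizers of nu(A) - alpha*|A|, then the optimizer of (4)
    via the marginal values (5). Exhaustive argmin, illustration only."""
    subsets = [frozenset(c) for k in range(len(omega) + 1)
               for c in combinations(sorted(omega), k)]

    def largest_argmin(alpha):
        best_A, best_v = None, None
        for A in subsets:   # subsets listed by increasing cardinality
            v = nu[A] - alpha * len(A)
            if best_v is None or v < best_v - 1e-12 or \
               (v < best_v + 1e-12 and len(A) > len(best_A)):
                best_A, best_v = A, v
        return best_A

    def DA(S, T):
        alpha = (nu[T] - nu[S]) / (len(T) - len(S))
        M = largest_argmin(alpha)
        if M == T or M == S:            # S and T are consecutive
            return {S, T}
        return DA(S, M) | DA(M, T)

    chain = sorted(DA(frozenset(), frozenset(omega)), key=len)
    x = {}
    for prev, cur in zip(chain, chain[1:]):   # apply (5) with b = 1
        for i in cur - prev:
            x[i] = (nu[cur] - nu[prev]) / len(cur - prev)
    return chain, x
```

Running this on the dual μ̄ of a 2-monotone μ returns the maximum-entropy element of P(μ), per Proposition 2 below.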
In total, Nagano and Kawahara [2013] show that Algorithm 2 requires solving O(n) SFMs.

Algorithm 2 (Decomposition algorithm). Input: submodular ν on Ω = {1,…,n}; b ∈ ℝ^n.
Function DA(S, S″):
1. α = (ν(S″) − ν(S)) / Σ_{i∈S″∖S} b_i.
2. Take S′ ∈ argmin_{A⊆Ω} ν(A) − α Σ_{i∈A} b_i such that |S′| is largest.
3. If S′ = S″, return {S, S″}; else return DA(S, S′) ∪ DA(S′, S″).
Return S* = DA(∅, Ω).

Proposition 2. Computing UE(μ) amounts to running Algorithm 2 with μ̄ (the dual of μ) and b = 1, which requires O(n) SFMs.

Proof. With the provided input, Problem (4) becomes min_{x∈B(ν)} (Σ_{i=1}^n x_i log x_i) + n − 1. By Remark 1, it has the same optimal solution as (3). Algorithm 2 outputs the chain {S_j}, and the optimal p of (3) is obtained via (5). ∎

Remark 2. There is a tight connection between Algorithm 1 and Algorithm 2: the sets A_1, A_2, … successively computed in Step 1 of the former are exactly the sets S_l∖S_{l−1}, S_{l−1}∖S_{l−2}, … from the chain ∅ = S_0 ⊂ ⋯ ⊂ S_l = Ω returned by the latter with input μ̄ and b = 1.

Example 1. Consider a 2-monotone μ and its dual μ̄ on Ω = {1, 2, 3}, with the values given below (omitting the obvious values at ∅ and Ω). Algorithm 1, with input μ, outputs the sets {2,3} and {1}, in this order, at Step 1. Hence, the optimal probability is p = (0.2, 0.4, 0.4).

A     | {1} | {2} | {3} | {1,2} | {1,3} | {2,3}
μ(A)  | 0.1 | 0.4 | 0.4 | 0.5   | 0.5   | 0.8
μ̄(A) | 0.2 | 0.5 | 0.5 | 0.6   | 0.6   | 0.9

Let us illustrate Algorithm 2 with μ̄ and b = 1. First, we run DA(∅, Ω), for which the associated α = (μ̄(Ω) − μ̄(∅))/|Ω| = 1/3. Second, we solve min_{A⊆Ω} μ̄(A) − (1/3)|A| and obtain A = {1} as the maximal minimizer.
Third, we would proceed recursively by executing DA(∅, {1}) and DA({1}, Ω); but since there is no set strictly between ∅ and {1}, we only need to run DA({1}, Ω), for which α = (μ̄(Ω) − μ̄({1}))/|Ω∖{1}| = 0.4. Finally, solving min_{A⊆Ω} μ̄(A) − 0.4|A|, we find that Ω is the maximal minimizer, implying there is no chain set strictly between {1} and Ω, so we stop. Hence, Algorithm 2 outputs the chain ∅ ⊂ {1} ⊂ Ω. Applying (5), we again get p = (0.2, 0.4, 0.4).

4 Computing upper entropy in special cases

4.1 Belief function

Let m be a mass function on Ω with plausibility function Pl and set of focal sets F. We consider a directed bipartite graph with partitions L and R in which each left vertex in L (resp. right vertex in R) corresponds to an element of Ω (resp. a focal set of m), with an edge from j to F_i iff j ∈ F_i. From this graph, we construct a flow network G by adding a source s and a sink t, together with edges (s, j) for all elements j ∈ Ω and (F_i, t) for all focal sets F_i of m. Define a capacity function c on edges as c(s, j) = α, c(j, F_i) = +∞, and c(F_i, t) = m(F_i).

A cut (S, T) of G is simply a partition of its vertices into a source set S (where s ∈ S) and a sink set T = V∖S (where t ∈ T) [Cormen et al., 2009, Chapter 26]. The capacity of (S, T) is the sum of the capacities of the edges leaving S, i.e., c(S, T) = Σ_{u∈S} Σ_{v∈T} c(u, v). As the cut (S, T) with T = {t} has capacity 1, the minimum cut capacity is at most 1. In this case, Step 2 of Algorithm 2 boils down to minimum-cut/maximum-flow computations, and is thus much faster than general SFM.

Proposition 3. Solving min_{A⊆Ω} Pl(A) − α|A| with α > 0 is equivalent to finding a minimum cut (S, T) in G.

Proof.
Consider a minimum cut X = (S, T) where S = {s} ∪ L′ ∪ R′, with L′ a set of left vertices (elements of Ω) and R′ a set of right vertices (focal sets of m). If an element i ∈ L′, then R′ contains all focal sets F such that i ∈ F, as otherwise an edge with capacity +∞ would leave S. Furthermore, if a focal set F ∈ R′, then L′ ∩ F ≠ ∅, as otherwise the cut formed by removing F from R′ would have capacity c(X) − m(F) < c(X). Therefore, X is of the form S = {s} ∪ A ∪ {F ∈ F : F ∩ A ≠ ∅} for a subset A ⊆ Ω. By the construction of G,

c(X) = α|Ω∖A| + Σ_{F∩A≠∅} m(F) = Pl(A) − α|A| + nα,

so min_{A⊆Ω} Pl(A) − α|A| = c* − nα, where c* is the minimum cut capacity. The proposition holds as nα is constant. ∎

Example 2. Consider a mass function m on Ω := {1, 2, 3, 4} whose focal sets are F_1 = {1,2}, F_2 = {2,3} and F_3 = {3,4}. The flow network from Proposition 3 is shown in Figure 1, which we use to illustrate its proof. Namely, the cut with S = {s} ∪ {1} ∪ {F_1} is not minimum because edge (1, F_2) has capacity +∞. The cut with S = {s} ∪ {1,2} ∪ {F_1, F_2, F_3} is not minimum because removing F_3 from S results in a cut with strictly smaller capacity.

We still need to select A* ∈ argmin_{A⊆Ω} Pl(A) − α|A| with maximum cardinality. It is well known that for a minimum cut (S*, T*) such that |S*| is maximum, S* is the union of all minimum-cut source sets [Picard and Queyranne, 1980]. It follows that A* consists of exactly the left vertices in S*, found as follows. By the max-flow min-cut theorem [Cormen et al., 2009, Theorem 26.6], we first compute a maximum flow of G and construct the associated residual graph. Then A* is the set of left vertices that cannot reach t in this residual graph [Picard and Queyranne, 1980].
Hence, we obtain:

Proposition 4. For belief functions, computing upper entropy via Algorithm 2 requires O(n) maximum-flow computations and O(n) depth-first searches. As the number of edges of G and of its residual graphs is bounded by O(n|F|), the running time of Algorithm 2 is poly(n, |F|), with the exact complexity depending on the chosen maximum-flow algorithm.

Figure 1: Flow network in Example 2.

4.2 Possibility distribution

Klir [2006, Algorithm 6.2] particularized Algorithm 1 to the case of a possibility distribution π under the standard convention that 1 = π_1 ≥ π_2 ≥ … ≥ π_n > 0. Unsurprisingly, this particularization inherits the quadratic complexity (see Proposition 1), yielding an O(n²) running time. We will, however, design an algorithm achieving O(n) complexity.

Recall from Section 3 that our goal is to compute the chain ∅ = S_0 ⊂ ⋯ ⊂ S_l = Ω where S_j is the unique maximal minimizer of min_{A⊆Ω} Π(A) − α|A| for α ∈ [α_j, α_{j+1}). To this end, we employ a procedure different from Algorithm 2 that specifically exploits the structure of π. Due to the ordering convention, Π(A) = π_j where j is the minimum element of A, and thus {j,…,n} is the largest set having the same possibility as A. Therefore, for all α ≥ 0,

min_{A⊆Ω} Π(A) − α|A| = min_{j∈{1,…,n,n+1}} π_j − α(n−j+1), (6)

with π_{n+1} := 0 to handle the case where ∅ minimizes the left-hand side at α = 0. Let g_j(α) := π_j − α(n−j+1) and g := min_{j∈{1,…,n+1}} g_j. To find the desired chain, we compute all breakpoints of g in [0, +∞) and select, at each breakpoint, the smallest j such that g_j = g. This ensures that the associated minimizer of the left-hand side of (6) has maximum size, namely {j,…,n}.
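The breakpoints of g and the associated maximal minimizers can be computed by taking the upper convex hull of the points (j−n−1, −π_j), one standard way to build a lower envelope of lines; a minimal sketch (helper names are illustrative):

```python
def possibility_chain(pi):
    """Compute the breakpoints of g(a) = min_j pi_j - a*(n-j+1) and the
    maximal minimizers {j,...,n} via an upper convex hull of the points
    (j-n-1, -pi_j). Monotone-chain scan; the points arrive sorted by
    x-coordinate, so the scan is O(n). Assumes 1 = pi_1 >= ... >= pi_n > 0."""
    n = len(pi)
    pts = [(j - n - 1, -(pi[j - 1] if j <= n else 0.0), j)
           for j in range(1, n + 2)]            # (x, y, index j)

    def cross(o, a, b):                         # z-component of (a-o) x (b-o)
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    hull = []                                   # upper hull, left to right
    for p in pts:
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()                          # drop collinear/non-vertex points
        hull.append(p)

    breakpoints, chain = [], []
    v = hull[::-1]                              # traverse right to left
    for a, b in zip(v, v[1:]):
        breakpoints.append((b[1] - a[1]) / (b[0] - a[0]))   # edge slope
        chain.append(set(range(b[2], n + 1)))               # S_j = {j,...,n}
    return breakpoints, chain
```

For π = (1, 0.6, 0.3) this yields breakpoints 0.3 and 0.4 with maximal minimizers {2,3} and {1,2,3}; applying (5) with ν = Π and b = 1 then gives p = (0.4, 0.3, 0.3).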
Our task is a classic problem in computational geometry: g is the lower envelope of the lines g_j. Under the duality that maps a line ℓ : y = mα + b to its dual point ℓ* := (m, −b), g corresponds one-to-one with the upper convex hull (the chain of edges on the convex hull's upper boundary) of the set of dual points {g_j* : 1 ≤ j ≤ n+1}, where g_j* = (j−n−1, −π_j) [De Berg et al., 2008, Chapter 11.4]. We employ this correspondence to solve min_{A⊆Ω} Π(A) − α|A| for all α, as described in Algorithm 3. We simply traverse the upper hull's vertices from right to left, recording the breakpoints as the slopes of the visited edges. The smallest minimizing indices associated with the breakpoints are recorded along the way, from which the maximal minimizers are obtained immediately. Finally, the complexity is dominated by the upper convex hull computation, which is O(n log n) in general but only O(n) here, since the dual points are already sorted by x-coordinate (the slopes of the g_j are strictly monotone) [De Berg et al., 2008, Theorem 1.1]. Hence, we obtain the following.

Proposition 5. Algorithm 3 runs in O(n) time.

Algorithm 3 (Solve min_{A⊆Ω} Π(A) − α|A| for all α). Input: possibility distribution π such that 1 = π_1 ≥ π_2 ≥ … ≥ π_n > π_{n+1} = 0.
1. Form points p^j = (j−n−1, −π_j), each with attribute p^j.idx = j.
2. Compute the upper hull vertices (from right to left) of {p^j : 1 ≤ j ≤ n+1} and store them in an array V.
3. N = length(V); B = [ ]; I = [ ].
4. For i = 1 to N−1: add the slope of the edge joining V[i] and V[i+1] to B; add V[i+1].idx to I.
5. For each j ∈ I: S_j = {j,…,n}.
6. Return B and {S_j}.

Example 3. Consider the distribution π = (1, 0.6, 0.3).
We have g_1(α) = 1 − 3α, g_2(α) = 0.6 − 2α, g_3(α) = 0.3 − α, and g_4(α) = 0. The graph of g = min_i g_i is illustrated in Figure 2(a), in which the intersections of active lines are marked in black. At the first breakpoint α = 0.3, the smallest j satisfying g_j = g is j = 2. At the second (and last) breakpoint α = 0.4, the smallest j satisfying g_j = g is j = 1. From these breakpoints, we read the chain {S_j} associated with min_{A⊆Ω} Π(A) − α|A|: for α ∈ [0.3, 0.4), the maximal minimizer is {2,3}; for α ∈ [0.4, +∞), it is {1,2,3}; and of course, for α ∈ [0, 0.3), it is ∅.

Let g_i* be the dual points of the lines g_i. Figure 2(b) illustrates the convex hull of these points, with the upper convex hull marked in red. As g_3* is not a vertex, the edges of this upper hull, from right to left, are (g_4*, g_2*) and (g_2*, g_1*), which correspond one-to-one with the intersections of active lines of g (see Figure 2(a)). We also see that the slopes of the edges (g_4*, g_2*) and (g_2*, g_1*) are 0.3 and 0.4, matching exactly the breakpoints of g calculated before.

Figure 2: Illustration of Example 3. (a) Lower envelope of the lines g_j. (b) Upper convex hull of the dual points g_j*.

4.3 Probability intervals

In this case, computing the upper entropy amounts to solving

max_{p∈P} −Σ_{i=1}^n p_i log p_i, (7)

where P = {p ∈ Δ_n : l_i ≤ p_i ≤ u_i ∀i}. Abellan and Moral [2003] designed an algorithm for solving (7) with O(n²) complexity. We now present an algorithm with better complexity. Using the KKT conditions [Boyd and Vandenberghe, 2004], the optimal solution is characterized as follows.

Proposition 6. p is optimal for Problem (7) if and only if there exists x such that Σ_{i=1}^n min{max{x, l_i}, u_i} = 1 and p_i = min{max{x, l_i}, u_i} ∀i.
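Proposition 6 reduces the problem to a one-dimensional root find on the monotone function f(x) = Σ_i min{max{x, l_i}, u_i}. A bisection-only sketch, not the paper's faster hybrid method:

```python
def max_entropy_interval_probs(l, u, tol=1e-12):
    """Solve f(x) = sum_i clip(x, l_i, u_i) = 1 by bisection and return
    p_i = clip(x, l_i, u_i), per Proposition 6. Assumes the feasibility
    condition sum(l) <= 1 <= sum(u)."""
    f = lambda x: sum(min(max(x, li), ui) for li, ui in zip(l, u))
    a, b = min(l), max(u)          # f(a) = sum(l) <= 1 <= sum(u) = f(b)
    while b - a > tol:
        x = (a + b) / 2
        if f(x) < 1:
            a = x                  # root lies to the right of x
        else:
            b = x                  # root lies to the left of x
    x = (a + b) / 2
    return [min(max(x, li), ui) for li, ui in zip(l, u)]
```

On the intervals [0.1, 0.4], [0.4, 0.5], [0.2, 0.6] this returns p = (0.3, 0.4, 0.3): the lower bound forces p_2 = 0.4, and the remaining mass 0.6 is split evenly.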
Let f_i(x) := min{max{x, l_i}, u_i}, or explicitly: f_i(x) = l_i if x ≤ l_i; f_i(x) = x if l_i < x < u_i; and f_i(x) = u_i if x ≥ u_i. From Proposition 6, whose technical proof is in Appendix B, it suffices to solve f(x) = 1, where f(x) := Σ_{i=1}^n f_i(x). Because each f_i is non-decreasing and piecewise affine, so is f. Let a := min_i l_i and b := max_i u_i; then f(a) = Σ_{i=1}^n l_i and f(b) = Σ_{i=1}^n u_i. As f is continuous and Σ_{i=1}^n l_i ≤ 1 ≤ Σ_{i=1}^n u_i, the intermediate value theorem guarantees a solution to f(x) = 1 in the interval [a, b].

Figure 3: Plot of f(x) with [l_1, u_1] = [0.1, 0.4], [l_2, u_2] = [0.4, 0.5] and [l_3, u_3] = [0.2, 0.6].

Because f is non-decreasing, we can solve f(x) = 1 by binary search: at each iteration, set x = (a+b)/2; if f(x) < 1, continue on [x, b], otherwise continue on [a, x], and repeat. This procedure is already quite efficient. Indeed, suppose we aim to solve f(x) = 1 with accuracy ε, i.e., |f(x) − f(x*)| ≤ ε where x* is a solution. Because |f(x) − f(x*)| ≤ n|x − x*|, we stop the binary search once the interval width is less than ε/n, which requires at most ⌈log₂(n/ε)⌉ iterations (as [a, b] ⊆ [0, 1]). For example, if n = 10⁷ and ε = 10⁻¹⁰, this amounts to about 56 iterations, each needing only one evaluation of f.

We can speed up the computation even further by combining binary search with the celebrated Newton method. Starting from an initial guess x_0, and until a solution is found, Newton's method iteratively updates x_k for k = 1, 2, … as

x_k = x_{k−1} − (f(x_{k−1}) − 1) / f′(x_{k−1}). (8)

Note that f is not differentiable at its breakpoints. However, this is not problematic since, at a given x, if f(x) < 1 (resp.
f(x) > 1), the solution lies to the right (resp. left) of x, and we replace the derivative in (8) with the right derivative f′₊(x) (resp. left derivative f′₋(x)). It is easy to see that, for all i, (f_i)′₊(x) = 1 if l_i ≤ x < u_i and (f_i)′₊(x) = 0 otherwise. Summing these right derivatives, we obtain f′₊(x) = |{i : l_i ≤ x < u_i}|. Similarly, f′₋(x) = |{i : l_i < x ≤ u_i}|. Finally, a Newton step in (8) may step outside the current interval containing the solution, or be undefined due to a zero derivative; in these cases, we fall back to a binary-search step [Press et al., 1992, Chapter 9]. The resulting approach is described in Algorithm 4. In the worst case, it performs binary search only, and thus we obtain:

Proposition 7. Algorithm 4 runs in O(n log(n/ε)) time.

We note that, in practice, Algorithm 4 is often considerably faster due to the Newton steps.

Algorithm 4 (Hybrid Newton–Binary Search). Input: n intervals [l_i, u_i] ⊆ [0, 1]; tolerance ε.
1. a = min_i l_i, b = max_i u_i, x = (a+b)/2.
2. While |f(x) − 1| > ε:
   (a) If f(x) − 1 < 0: a = x and d = |{i : l_i ≤ x < u_i}|; else: b = x and d = |{i : l_i < x ≤ u_i}|.
   (b) If d = 0: x = (a+b)/2; else: x′ = x − (f(x) − 1)/d, and x = x′ if x′ ∈ (a, b), otherwise x = (a+b)/2.
3. Return p_i = min{max{x, l_i}, u_i} ∀i.

5 Computing upper entropy via convex optimization

The bottleneck of Algorithm 2 lies in solving O(n) SFMs, since algorithms for general SFM are slow due to their high-order polynomial complexity. Hence, for general 2-monotonicity, it is worthwhile to explore efficient approximate ways to compute upper entropy. We now adopt the well-known Frank–Wolfe (FW) algorithm [Bach, 2013] to solve Problem (3). Denote the entropy function by H(p) := −Σ_{i=1}^n p_i log p_i.
In this method, we start with some p⁰ ∈ P(μ). At each iteration k = 0, 1, 2, …, we solve the so-called linear oracle

max_{p∈P(μ)} ∇H(p^k)ᵀ(p − p^k), (9)

whose maximum is attained at some v^k. Since P(μ) is convex, the segment between p^k and v^k lies in P(μ), and the next iterate is taken on this segment as the point with the best objective, i.e., p^{k+1} = (1 − γ_k)p^k + γ_k v^k, where

γ_k ∈ argmax_{γ∈[0,1]} H((1−γ)p^k + γv^k). (10)

If p* is the optimal probability, then the concavity of H ensures that H(p^k) ≤ H(p*) ≤ H(p^k) + ∇H(p^k)ᵀ(v^k − p^k). Hence, the maximum value of (9) (also called the duality gap) serves as a stopping certificate: we terminate once it is smaller than a threshold. As H is not differentiable at p when some p_i = 0, we apply the algorithm by starting from an initial point p⁰ > 0 and restricting the line search (10) to γ ∈ [0, 1). Thus, the gradient is always defined.

Remark 3. A necessary and sufficient condition for the existence of such a p⁰ > 0 is μ̄({i}) > 0 ∀i. Since μ̄({i}) = 0 means i is impossible (as μ̄ provides upper bounds on event probabilities), this condition makes all outcomes i possible. Moreover, p⁰ can be found in O(n² EO) (see Appendix C).

The key step in the algorithm is solving (9). Fortunately, P(μ) coincides with the base polyhedron of μ̄ (see Remark 1) and Problem (9) is linear in p. Hence, it is efficiently solvable thanks to the following result.

Proposition 8 ([Fujishige, 2005]). Let w ∈ ℝ^n and φ be a permutation such that w_{φ(1)} ≥ … ≥ w_{φ(n)}. Define S_i := {φ(1),…,φ(i)} ∀i ∈ {1,…,n} and S_0 = ∅. If ν is a submodular function on Ω, then a maximizer of max_{x∈B(ν)} wᵀx is given by x_{φ(i)} = ν(S_i) − ν(S_{i−1}) ∀i.
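Proposition 8's greedy rule takes only a few lines; a sketch with a dict-based submodular ν (an illustrative assumption, where real instances would query the evaluation oracle):

```python
def greedy_vertex(nu, w):
    """Maximize w^T x over the base polyhedron B(nu), per Proposition 8:
    visit coordinates in order of decreasing weight and assign each the
    marginal gain nu(S + i) - nu(S). `nu` maps frozensets to values and
    `w` maps elements to weights."""
    x, S = {}, frozenset()
    for i in sorted(w, key=w.get, reverse=True):
        x[i] = nu[S | {i}] - nu[S]
        S |= {i}
    return x
```

Each Frank–Wolfe iteration would call this with ν = μ̄ and w = ∇H(p^k); the returned vertex v^k is then combined with p^k via the line search (10).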
6 Experiments

Our experiments are conducted on a laptop with 16GB RAM and an Intel Core i9, coded in Python with the NumPy package for efficient array operations [Harris et al., 2020]. Our code is available in the supplementary materials.

6.1 2-monotonicity

To construct a 2-monotone $\mu$, we proceed as follows [Bednarski, 1981]: let $f : [0,1] \to [0,1]$ be any convex function with $f(0) = 0$ and let $p^*$ be any probability on $\Omega$; define $\mu(\Omega) = 1$ and $\mu(A) = f(p^*(A))$ for all $A \ne \Omega$. We have chosen $p^*$ as a vector sampled from a Dirichlet distribution with all concentration parameters equal to 1. Our choices of $f$ and the sizes of $\Omega$ are listed in Table 1.

We implement the FW algorithm for this type of 2-monotonicity. We first generate the initial point $p^0 > 0$ in the credal set, which takes $O(n^2 \mathrm{EO})$ time (see Remark 3). The resolution is then carried out starting from $p^0$. At any iteration, we obtain an approximate entropy $H_a$ and a gap such that the true entropy satisfies $H^* \in [H_a, H_a + \mathrm{gap}]$. Hence, the relative error $\frac{H^* - H_a}{H^*}$ is bounded by $\frac{\mathrm{gap}}{H_a + \mathrm{gap}}$, and we stop once this bound is less than 0.1%. Table 1 reports the average running times (in seconds) for initial-point generation and the subsequent resolution, obtained by repeating each $(f, |\Omega|)$ experiment with five random seeds used to construct $\mu$.

From Table 1, we observe that the approach is quite efficient, yet most of the computation time is spent generating the initial point $p^0 > 0$ rather than on the resolution itself. This is because the former has quadratic complexity, making it the bottleneck as $|\Omega|$ grows. Therefore, when the space is large, we may adopt the workaround of optimizing the smooth surrogate $\tilde{H}(p) := -\sum_{i=1}^n p_i \log(p_i + \epsilon)$ for a fixed, sufficiently small $\epsilon > 0$. The gap between the optimal values of $H$ and $\tilde{H}$ is bounded by $n\epsilon$ [Hazan et al., 2019, Lemma A.1].
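Bednarski's construction above can be sketched as follows. The names (`mu`, `is_2_monotone`) and the brute-force 2-monotonicity check are ours, for illustration on a small $\Omega$ only:

```python
from itertools import combinations
import numpy as np

rng = np.random.default_rng(0)
n = 4
p_star = rng.dirichlet(np.ones(n))   # random base probability on Omega
f = lambda t: t ** 2                 # a convex distortion with f(0) = 0

def mu(A):
    """Sketch of Bednarski's construction: mu(Omega)=1, mu(A)=f(p*(A))."""
    if len(A) == n:
        return 1.0
    return f(p_star[list(A)].sum())

def is_2_monotone(mu, n, tol=1e-9):
    """Brute-force check of mu(A)+mu(B) <= mu(A|B)+mu(A&B) over all pairs."""
    subsets = [frozenset(s) for r in range(n + 1)
               for s in combinations(range(n), r)]
    return all(mu(A) + mu(B) <= mu(A | B) + mu(A & B) + tol
               for A in subsets for B in subsets)
```

The check passes because, with $f$ convex and $p^*(A) + p^*(B) = p^*(A \cup B) + p^*(A \cap B)$, the distorted values can only spread apart, which is exactly supermodularity.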
With this surrogate, we can start from any feasible point in the credal set, which is generated in $O(n)$ time using a single permutation on $\Omega$ (see Appendix C). The results in Table 2 show that this approach is suitable for large spaces, e.g., $|\Omega| = 5 \times 10^5$. In general, this approximation approach is likely to outperform the exact Algorithm 2 in efficiency, mostly because solving the SFM problems requires polynomial methods whose exponents are too high. We would therefore recommend it for very large values of $|\Omega|$, for which Algorithm 2 will struggle.

Table 1: Initial point construction and resolution times (s).

f                          |Ω|      Create p^0       Resolution
$x^2$                      5000     0.284 ± 0.006    0.013 ± 0.001
                           10000    1.035 ± 0.018    0.024 ± 0.001
                           20000    3.961 ± 0.072    0.206 ± 0.027
$1-\sqrt{1-x}$             5000     0.307 ± 0.003    0.009 ± 0.001
                           10000    1.149 ± 0.016    0.017 ± 0.001
                           20000    4.333 ± 0.014    0.147 ± 0.014
$\frac{e^{2x}-1}{e^2-1}$   5000     0.397 ± 0.003    0.008 ± 0.001
                           10000    1.481 ± 0.010    0.015 ± 0.001
                           20000    5.713 ± 0.019    0.139 ± 0.010

Table 2: Initial point construction and resolution times (s) for $|\Omega| = 5 \times 10^5$ with the surrogate.

f                          Create p^0       Resolution
$x^2$                      0.010 ± 0.001    6.327 ± 0.291
$1-\sqrt{1-x}$             0.012 ± 0.001    4.326 ± 0.088
$\frac{e^{2x}-1}{e^2-1}$   0.019 ± 0.001    3.617 ± 0.074

6.2 Probability intervals

To ensure a nonempty credal set, we construct $n$ random intervals $[l_i, u_i]$ as follows. First, we draw all $u_i$ independently and uniformly from $(0,1]$ and compute $\sum_{i=1}^n u_i$. If this sum is less than 1, we discard the draw and resample. Second, each $l_i$ is drawn uniformly from $[0, u_i)$. Finally, we rescale the $l_i$ so that $\sum_{i=1}^n l_i = 1 - \mathrm{margin}$ for a fixed $0 < \mathrm{margin} < 1$.
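A sketch of this generation procedure (the function name is ours; note that the final rescaling could in principle push some $l_i$ above $u_i$, a corner case the description leaves implicit, and NumPy's half-open uniform interval differs immaterially from $(0,1]$):

```python
import numpy as np

def random_intervals(n, margin, rng):
    """Sketch of the Section 6.2 generator: random [l_i, u_i] intervals with
    sum(l) = 1 - margin < 1 <= sum(u), so the credal set is nonempty."""
    while True:
        u = rng.uniform(0.0, 1.0, size=n)   # upper bounds; resample if sum < 1
        if u.sum() >= 1.0:
            break
    l = rng.uniform(0.0, u)                 # lower bounds drawn below each u_i
    l *= (1.0 - margin) / l.sum()           # rescale so sum_i l_i = 1 - margin
    return l, u
```

By construction $\sum_i l_i = 1 - \mathrm{margin} < 1 \le \sum_i u_i$, the condition under which the interval model is consistent.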
We compare Algorithm 4 against the one in [Abellán and Moral, 2003] and report in Table 3 the average running times (in seconds) over five random seeds for each $(\mathrm{margin}, |\Omega|)$ pair. As mentioned earlier, Algorithm 4 has better complexity than Abellán and Moral's algorithm ($O(n \log \frac{n}{\epsilon})$ vs. $O(n^2)$). From Table 3, we also observe a significant improvement in the experiments. Moreover, the results suggest that their algorithm is influenced not only by $|\Omega|$ but also by the size of the feasible set: for a fixed $|\Omega|$, increasing the margin enlarges the credal set and their algorithm struggles more. Our algorithm, in contrast, appears to depend only on $|\Omega|$.

Table 3: Runtimes (s) of Algorithm 4 vs. Abellán & Moral.

margin   |Ω|      Algorithm 4      Abellán & Moral
0.1      10^7     0.305 ± 0.002    12.906 ± 0.036
         10^8     3.133 ± 0.042    15.699 ± 0.144
0.2      10^7     0.265 ± 0.003    19.375 ± 0.047
         10^8     2.729 ± 0.037    19.589 ± 0.245
0.3      10^7     0.281 ± 0.015    25.054 ± 0.060
         10^8     2.843 ± 0.147    27.069 ± 0.179

7 Conclusion

In this paper, we revisited the problem of computing the upper entropy of 2-monotone lower probabilities, from both theoretical and practical viewpoints. Upper entropy indeed plays an important role in credal uncertainty quantification, hence the interest in fully exploring its computational aspects. We analyzed Abellán and Moral's algorithm, well known in imprecise probability but whose exact complexity was previously unknown. In contrast with previous claims that the problem was difficult, we showed that it is strongly polynomial, and proposed an improved algorithm based on submodular optimization. For important special cases, namely belief functions, probability intervals and possibility distributions, we also developed algorithms with notable improvements over existing approaches in the literature.
Along with those improved methods, our results give a rather comprehensive picture of how to compute upper entropies for practical credal sets routinely used in applications. Also, as the algorithm for computing upper entropy in the general case has high-degree polynomial complexity, we proposed an alternative approach using the classic Frank–Wolfe method. Initial experiments indicate that this alternative is practical and quickly achieves good precision, thus offering a highly efficient alternative to exact methods.

In the future, we aim to perform much more comprehensive experiments. In particular, for belief functions, it would be interesting to compare the exact resolution approach using maximum-flow algorithms with the approximate one via Frank–Wolfe. A more general question to which this paper contributes is to know to what extent classical results from submodular optimization can help solve commonly encountered computational issues for credal sets, and more specifically those induced by 2-monotone lower probabilities. One could think of other robust versions of classical uncertainty quantification tools, such as KL divergences or Wasserstein metrics. Another interesting topic to explore from this algorithmic and computational perspective is optimal transport [Lorenzini et al., 2025, Caprio, 2025].

References

J. Abellán and S. Moral (2003) Maximum of entropy for credal sets. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 11 (05), p. 587–597.
J. Abellán and S. Moral (2005) Upper entropy of credal sets. Applications to credal classification. International Journal of Approximate Reasoning 39 (2-3), p. 235–255.
J. Abellán and S. Moral (2006) An algorithm to compute the upper entropy for order-2 capacities. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 14 (02), p. 141–154.
J. Abellán, A. Pérez-Lara, and S. Moral-García (2023) A variation of the algorithm to achieve the maximum entropy for belief functions. Entropy 25 (6).
F. R. Bach (2013) Learning with submodular functions: a convex optimization perspective. Foundations and Trends in Machine Learning 6 (2-3), p. 145–373.
T. Bednarski (1981) On solutions of minimax test problems for special capacities. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 58 (3), p. 397–405.
S. Boyd and L. Vandenberghe (2004) Convex optimization. Cambridge University Press.
M. Caprio, S. K. Manchingal, and F. Cuzzolin (2025) Credal and interval deep evidential classifications. arXiv preprint arXiv:2512.05526.
M. Caprio, M. Sultana, E. G. Elia, and F. Cuzzolin (2024) Credal learning theory. Advances in Neural Information Processing Systems 37, p. 38665–38694.
M. Caprio (2025) Optimal transport for ε-contaminated credal sets. In International Symposium on Imprecise Probabilities: Theories and Applications, p. 33–46.
S. L. Chau, M. Caprio, and K. Muandet (2025) Integral imprecise probability metrics. arXiv preprint arXiv:2505.16156.
T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein (2009) Introduction to algorithms. 3rd edition, MIT Press.
F. G. Cozman (2000) Credal networks. Artificial Intelligence 120 (2), p. 199–233.
M. De Berg, O. Cheong, M. Van Kreveld, and M. Overmars (2008) Computational geometry: algorithms and applications. Springer.
L. M. De Campos, J. F. Huete, and S. Moral (1994) Probability intervals: a tool for uncertain reasoning. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2 (02), p. 167–196.
W. Dinkelbach (1967) On nonlinear fractional programming. Management Science 13 (7), p. 492–498.
A. Dubey, O. Gupta, R. Raskar, and N. Naik (2018) Maximum-entropy fine grained classification. Advances in Neural Information Processing Systems 31.
L. Fleischer and S. Iwata (2003) A push-relabel framework for submodular function minimization and applications to parametric optimization. Discrete Applied Mathematics 131 (2), p. 311–322.
S. Fujishige (2005) Submodular functions and optimization. Vol. 58, Elsevier.
M. Grabisch (1997) K-order additive discrete fuzzy measures and their representation. Fuzzy Sets and Systems 92 (2), p. 167–189.
Y. Grandvalet and Y. Bengio (2006) Entropy regularization. Semi-Supervised Learning, p. 151–168.
M. Grötschel, L. Lovász, and A. Schrijver (1993) Geometric algorithms and combinatorial optimization. Springer.
P. Hansen, B. Jaumard, M. P. De Aragão, F. Chauny, and S. Perron (2000) Probabilistic satisfiability with imprecise probabilities. International Journal of Approximate Reasoning 24 (2-3), p. 171–189.
D. Harmanec, G. Resconi, G. J. Klir, and Y. Pan (1996) On the computation of uncertainty measure in Dempster–Shafer theory. International Journal of General Systems 25 (2), p. 153–163.
C. R. Harris, K. J. Millman, S. J. van der Walt, R. Gommers, P. Virtanen, D. Cournapeau, E. Wieser, J. Taylor, S. Berg, N. J. Smith, R. Kern, M. Picus, S. Hoyer, M. H. van Kerkwijk, M. Brett, A. Haldane, J. F. del Río, M. Wiebe, P. Peterson, P. Gérard-Marchant, K. Sheppard, T. Reddy, W. Weckesser, H. Abbasi, C. Gohlke, and T. E. Oliphant (2020) Array programming with NumPy. Nature 585 (7825), p. 357–362.
E. Hazan, S. Kakade, K. Singh, and A. Van Soest (2019) Provably efficient maximum entropy exploration. In International Conference on Machine Learning, p. 2681–2691.
E. Hüllermeier and W. Waegeman (2021) Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning 110 (3), p. 457–506.
V. Huynh and Y. Nakamori (2009) Notes on "Reducing algorithm complexity for computing an aggregate uncertainty measure". IEEE Transactions on Systems, Man, and Cybernetics—Part A: Systems and Humans 40 (1), p. 205–209.
A. Javanmardi, D. Stutz, and E. Hüllermeier (2024) Conformalized credal set predictors. Advances in Neural Information Processing Systems 37, p. 116987–117014.
G. J. Klir (2006) Uncertainty and information: foundations of generalized information theory.
I. Levi (1980) The enterprise of knowledge: an essay on knowledge, credal probability, and chance. MIT Press.
T. Löhr, P. Hofman, F. Mohr, and E. Hüllermeier (2025) Credal prediction based on relative likelihood. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
S. Lorenzini, D. Petturiti, and B. Vantaggi (2025) Choquet–Wasserstein pseudo-distances via optimal transport under partially specified marginal probabilities. Fuzzy Sets and Systems 515, p. 109429.
A. Meyerowitz, F. Richman, and E. Walker (1994) Calculating maximum-entropy probability densities for belief functions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems 2 (04), p. 377–389.
I. Montes and S. Destercke (2017) On extreme points of p-boxes and belief functions. Annals of Mathematics and Artificial Intelligence 81 (3), p. 405–428.
K. Nagano and Y. Kawahara (2013) Structured convex optimization under submodular constraints. In Proceedings of the Twenty-Ninth Conference on Uncertainty in Artificial Intelligence (UAI).
V. Nguyen, H. Zhang, and S. Destercke (2025) Credal ensembling in multi-class classification. Machine Learning 114 (1), p. 19.
J. B. Orlin (2009) A faster strongly polynomial time algorithm for submodular function minimization. Mathematical Programming 118 (2), p. 237–251.
J. C. Picard and M. Queyranne (1980) On the structure of all minimum cuts in a network and applications. Mathematical Programming Study 13, p. 8–16.
W. H. Press, S. A. Teukolsky, W. T. Vetterling, and B. P. Flannery (1992) Numerical recipes in C: the art of scientific computing. 2nd edition, Cambridge University Press.
G. Shafer (1976) A mathematical theory of evidence. Princeton University Press.
A. Singh, S. L. Chau, S. Bouabid, and K. Muandet (2024) Domain generalisation via imprecise learning. In International Conference on Machine Learning, p. 45544–45570.
M. C. Troffaes and G. De Cooman (2014) Lower previsions. John Wiley & Sons.
P. Walley (1991) Statistical reasoning with imprecise probabilities. Vol. 42, Springer.
K. Wang, F. Cuzzolin, K. Shariatmadar, D. Moens, H. Hallez, et al. (2024) Credal deep ensembles for uncertainty quantification. Advances in Neural Information Processing Systems 37, p. 79540–79572.

Appendix A Proof of Lemma 1

Before proving this, we remark that if $\mu$ is supermodular, the set function $\mu_\lambda(A) := \mu(A) - \lambda|A|$ is also supermodular, as $|A| + |B| = |A \cap B| + |A \cup B|$ for all $A, B$.

Part 1: Finding $A \in \arg\max_{\emptyset \ne B \subseteq \Omega} \frac{\mu(B)}{|B|}$ by solving $O(n)$ SFM. For this part, we adapt the idea described in [Fleischer and Iwata, 2003, Sec. 4.1].

Proof. Define a function $g(\lambda) := \max_{\emptyset \ne B \subseteq \Omega} \mu(B) - \lambda|B|$. We claim that

$$\lambda^* := \max_{\emptyset \ne B \subseteq \Omega} \frac{\mu(B)}{|B|} \iff g(\lambda^*) = 0. \qquad (11)$$

Indeed, if $\max_{B \ne \emptyset} \frac{\mu(B)}{|B|} > \lambda$ then $\max_{B \ne \emptyset} \frac{\mu(B) - \lambda|B|}{|B|} > 0$, and thus it follows that $g(\lambda) > 0$ because $|B| > 0$. Hence, we need to solve $g(\lambda) = 0$, for which Dinkelbach's method [Dinkelbach, 1967] will be employed.
Note that $\lambda^* \in (0, 1]$. Dinkelbach's method then constructs a sequence $\{\lambda_t\}$ starting from $\lambda_0 = 0$, evaluates $g(\lambda_t)$, and updates $\lambda_t$ until it hits a root of $g$, as follows:

1. Find $B_t \in \arg\max_{B \ne \emptyset} \mu(B) - \lambda_t |B|$. Evaluate $g(\lambda_t) = \mu(B_t) - \lambda_t |B_t|$.
2. If $g(\lambda_t) = 0$, then $\lambda_t$ is a solution and $B_t$ is optimal.
3. Otherwise, update $\lambda_{t+1} := \lambda_t + \frac{g(\lambda_t)}{|B_t|} = \frac{\mu(B_t)}{|B_t|}$.

Define $M_k := \max_{|B| = k} \mu(B)$ for $k = 1, \dots, n$. At each iteration, because $B_t$ is a maximizer, if $|B_t| = k$, it follows that $\mu(B_t) = M_k$, and thus $\lambda_{t+1} = \frac{M_k}{k}$. Consequently, the sequence $\{\lambda_t\}$ can only take values in the finite set $\{\frac{M_1}{1}, \dots, \frac{M_n}{n}\}$. Since $\{\lambda_t\}$ strictly increases until optimality, where $g(\lambda_t) = 0$, the loop terminates in at most $n$ iterations. Finally, at each iteration we solve $\max_B \mu(B) - \lambda_t |B|$, which is an SFM since $\mu$ is supermodular. In total, we need to solve $O(n)$ SFM. ∎

Part 2: Finding the maximizer $A$ of maximum size by solving $O(n)$ SFM.

Proof. Because the set of maximizers of a supermodular function is closed under union, after finding $\lambda^*$ (see (11)) via Dinkelbach's method, we simply take the union of all maximizers of $\mu_{\lambda^*}$, where $\mu_{\lambda^*}(B) := \mu(B) - \lambda^* |B|$ for all $B$. Observe that $i \in \Omega$ belongs to a maximizer of $\mu_{\lambda^*}$ if and only if

$$\max_{B \subseteq \Omega \setminus \{i\}} \mu(B \cup \{i\}) - \lambda^* |B \cup \{i\}| = 0. \qquad (12)$$

After solving (12) once for each $i \in \Omega$ ($n$ SFM in total), the desired set $A$ consists of all $i$ such that (12) holds. ∎

Appendix B Proof of Proposition 6

WLOG, assume that $u_i > 0$ for all $i$ (otherwise $p_i = 0$ and we discard $p_i$), $\sum_{i=1}^n l_i < 1$ (otherwise $p_i = l_i$ for all $i$), and $\sum_{i=1}^n u_i > 1$ (otherwise $p_i = u_i$ for all $i$). Under these assumptions, at the optimal $p$, we have $p_i > 0$ for all $i$.

Proof.
By the KKT conditions, $p$ is optimal for Problem (7) if and only if the following system (with variables $p_i$, $\alpha_i$, $\beta_i$ and $\lambda$) is feasible:

$$l_i \le p_i \le u_i \ \text{ and } \ \sum_{i=1}^n p_i = 1 \qquad (13)$$
$$\alpha_i, \beta_i \ge 0 \quad \forall i \qquad (14)$$
$$\alpha_i(-p_i + l_i) = 0 \ \text{ and } \ \beta_i(p_i - u_i) = 0 \quad \forall i \qquad (15)$$
$$\log p_i + 1 - \alpha_i + \beta_i + \lambda = 0 \quad \forall i \qquad (16)$$

Let $f(x) := \sum_{i=1}^n \min\{\max\{x, l_i\}, u_i\}$ and $b = \max_i u_i$. Hence, $f(0) = \sum_{i=1}^n l_i < 1$ and $f(b) = \sum_{i=1}^n u_i > 1$ (by our assumptions). As $f$ is continuous, by the intermediate value theorem there exists an $x \in (0, b)$ such that $f(x) = 1$. We define $p_i = \min\{\max\{x, l_i\}, u_i\}$ for all $i$. We claim that such a $p$ is feasible for (13)–(16) and thus optimal. By the definition of $p_i$, (13) automatically holds. We set $\lambda = -1 - \log x$. For each $i$ with $l_i < u_i$, we consider three cases:

1. If $l_i < x < u_i$ then $p_i = x$, and we set $\alpha_i = \beta_i = 0$.
2. If $x \le l_i$ then $p_i = l_i$, and we set $\alpha_i = \log \frac{l_i}{x}$ and $\beta_i = 0$.
3. If $x \ge u_i$ then $p_i = u_i$, and we set $\alpha_i = 0$ and $\beta_i = \log \frac{x}{u_i}$.

Finally, if $l_i = u_i$, then $p_i = l_i$, and we can set any $\alpha_i, \beta_i \ge 0$ such that $-\alpha_i + \beta_i = \log \frac{x}{l_i}$. ∎

Appendix C Proof of Remark 3

We make use of the following well-known result about the structure of the vertices of $\mathcal{P}(\mu)$ [Fujishige, 2005]. Let $\phi$ be a permutation on $\Omega$; the probability $p$ with $p_{\phi(i)} := \bar{\mu}(S^\phi_i) - \bar{\mu}(S^\phi_{i-1})$, where $S^\phi_0 := \emptyset$ and $S^\phi_i := \{\phi(1), \dots, \phi(i)\}$ for all $i$, is a vertex of $\mathcal{P}(\mu)$.

Proof. If there exists $p \in \mathcal{P}(\mu)$ such that $p_i > 0$ for all $i$, then $\bar{\mu}(\{i\}) \ge p_i > 0$. Conversely, assume that $\bar{\mu}(\{i\}) > 0$ for all $i$. Consider $n$ permutations $\phi^1, \dots, \phi^n$ on $\Omega$ with $\phi^i(1) = i$ for all $i$. By the above-mentioned result, each $\phi^i$ induces a vertex of $\mathcal{P}(\mu)$ whose $i$th component is positive.
Taking the average of these $n$ vertices, we obtain a probability $p \in \mathcal{P}(\mu)$ such that $p_i > 0$ for all $i$. Moreover, this $p$ is found in $O(n^2 \mathrm{EO})$ time. ∎
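The construction in this proof can be sketched in a few lines: build the $n$ greedy vertices induced by permutations $\phi^i$ with $\phi^i(1) = i$ and average them. Here `mu_bar` is an illustrative oracle for the upper probability $\bar{\mu}$ (a submodular set function with $\bar{\mu}(\{i\}) > 0$), not the paper's code:

```python
import numpy as np

def interior_point(mu_bar, n):
    """Sketch of the Remark 3 construction: a strictly positive p in the
    credal set, as the average of n greedy vertices (one per permutation
    phi^i with phi^i(1) = i). Costs n^2 oracle evaluations."""
    p = np.zeros(n)
    for i in range(n):
        order = [i] + [j for j in range(n) if j != i]   # phi^i(1) = i
        v = np.zeros(n)
        S, prev = [], 0.0
        for j in order:
            S.append(j)
            val = mu_bar(frozenset(S))
            v[j] = val - prev          # greedy vertex: marginal gain of j
            prev = val
        p += v
    return p / n
```

For instance, with the submodular oracle $\bar{\mu}(S) = \min(1, 0.5\,|S|)$ on three outcomes, the averaged point is $(1/2, 1/3, 1/6)$: strictly positive and summing to one.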