Paper deep dive
A Semi-Decentralized Approach to Multiagent Control
Mahdi Al-Husseini, Mykel J. Kochenderfer, Kyle H. Wray
Abstract
We introduce an expressive framework and algorithms for the semi-decentralized control of cooperative agents in environments with communication uncertainty. Whereas semi-Markov control admits a distribution over time for agent actions, semi-Markov communication, or what we refer to as semi-decentralization, gives a distribution over time for what actions and observations agents can store in their histories. We extend semi-decentralization to the partially observable Markov decision process (POMDP). The resulting SDec-POMDP unifies decentralized and multiagent POMDPs and several existing explicit communication mechanisms. We present recursive small-step semi-decentralized A* (RS-SDA*), an exact algorithm for generating optimal SDec-POMDP policies. RS-SDA* is evaluated on semi-decentralized versions of several standard benchmarks and a maritime medical evacuation scenario. This paper provides a well-defined theoretical foundation for exploring many classes of multiagent communication problems through the lens of semi-decentralization.
Links
- Source: https://arxiv.org/abs/2603.11802v1
- Canonical: https://arxiv.org/abs/2603.11802v1
Full Text
A Semi-Decentralized Approach to Multiagent Control
Mahdi Al-Husseini (mah9@stanford.edu), Mykel J. Kochenderfer (mykel@stanford.edu), and Kyle H. Wray (k.wray@northeastern.edu)

Abstract. We introduce an expressive framework and algorithms for the semi-decentralized control of cooperative agents in environments with communication uncertainty. Whereas semi-Markov control admits a distribution over time for agent actions, semi-Markov communication, or what we refer to as semi-decentralization, gives a distribution over time for what actions and observations agents can store in their histories. We extend semi-decentralization to the partially observable Markov decision process (POMDP). The resulting SDec-POMDP unifies decentralized and multiagent POMDPs and several existing explicit communication mechanisms. We present recursive small-step semi-decentralized A* (RS-SDA*), an exact algorithm for generating optimal SDec-POMDP policies. RS-SDA* is evaluated on semi-decentralized versions of several standard benchmarks and a maritime medical evacuation scenario. This paper provides a well-defined theoretical foundation for exploring many classes of multiagent communication problems through the lens of semi-decentralization.

Key words and phrases: Semi-decentralization, Semi-Markov process, Multiagent communication, Admissible heuristic search, Planning

1. Introduction

Many complex real-world problems require the coordination of multiple cooperative agents to solve, but feature limited opportunities for information exchange. The decentralized partially observable Markov decision process (Dec-POMDP) formalizes multiagent planning and control in settings where explicit communication is impossible [bernstein2002complexity; becker2004solving; amato2013decentralized].
Several model variants take advantage of existing, if limited, information structures and extend the Dec-POMDP to explicitly incorporate costly [goldman2008communication], delayed [oliehoek2012tree; oliehoek2013sufficient], lossy [tung2021effective], or intermittent [zhang2024multiagent] communication. In this paper we pursue a general framework that unifies several multiagent communication mechanisms. We are interested in problems whose communication dynamics are conditioned with some probability on the state, joint actions, or joint observations. An example domain is area-wide Global Positioning System (GPS) denial via jamming [grant2009gps]. As seen in the evacuation scenario depicted in Figure 1, agents need to coordinate joint tasks in environments with degraded, denied, or disrupted communication. Agents must reason about which actions to take in light of available communication, the influence taken actions have on future communication, and future communication's influence on future actions. In decentralized systems, information may be action-gated in costly communication, constrained by channel capacity in lossy communication, temporally offset in intermittent or delayed communication, and either deterministic or stochastic in nature; semi-decentralized systems generalize all of these through probabilistic information flow.

Figure 1. A semi-decentralized multiagent evacuation scenario with probabilistic restrictions on communication. Aircraft and watercraft must coordinate under communication constraints to move patients from aid stations to hospitals.

We begin by formally defining the semi-decentralization property. The key insight is that semi-Markov systems for resolving agent control may be co-opted in semi-decentralized systems for agent communication. Semi-decentralized systems may simultaneously contain a subset of agents that are necessarily decentralized and a subset of agents that are permissibly centralized.
We introduce the semi-decentralized POMDP (SDec-POMDP), which unifies several existing multiagent models and explicit communication mechanisms. The SDec-POMDP reveals underlying mechanisms for memory propagation, the selector functions, which we toggle here but which are otherwise inherently present and constant in existing models. We then detail an exact optimal planning algorithm for solving SDec-POMDPs and apply it to a set of new semi-decentralized benchmarks and a maritime medical evacuation application.

Contributions

• We formulate the semi-decentralization property by extending semi-Markov control concepts to communication. Semi-decentralization is then applied to multiagent POMDPs, forming the SDec-POMDP.
• We prove that the SDec-POMDP unifies the Dec-POMDP, the MPOMDP, k-steps delayed communication, and the Dec-POMDP-Com.
• We introduce recursive small-step semi-decentralized A* (RS-SDA*), an exact algorithm for solving semi-decentralized problems. We evaluate performance on semi-decentralized variants of four Dec-POMDP benchmarks, then apply the algorithm to a complex medical evacuation scenario with joint tasks.

Ultimately, we describe and validate the mechanisms of a novel yet foundational property and model for multiagent decision making in probabilistic communication environments.

2. Related Work

Multiagent Models. Centralized planning can occur when agents freely communicate in reliable networks with no latency, or when individual agents possess full system observability. This allows a single planner to select joint actions over the global state [oliehoek2016concise]. A partially observable centralized multiagent system is represented by a multiagent partially observable Markov decision process (MPOMDP). MPOMDP planners benefit from conditioning joint actions on joint observations but scale poorly and are susceptible to communication failures. MPOMDPs therefore have limited application to many challenging real-world problems.
By contrast, cooperative agents in decentralized systems can neither communicate explicitly nor observe the entire system state, and therefore must select their own actions in accordance with local observations [kochenderfer2015decision]. The decentralized POMDP (Dec-POMDP) and MPOMDP share an underlying model but possess distinct policies and histories. Decentralized models have been used to support unmanned aerial vehicle formation flight [azam2021uav], maritime traffic management [singh2019multiagent], and wildfire surveillance [julian2019distributed]. Decentralized systems permit implicit communication, in which agents transmit information by means of taken actions and received observations [grover2010implicit]. Explicit communication, by contrast, endows agents with formalized communication actions. The state-of-the-art exact optimal algorithm for solving Dec-POMDPs, recursive small-step multi-agent A* (RS-MAA*) [koops2023recursive], relies on a combination of incremental expansion, clustering, variable-depth recursive heuristics, and heuristic memoization. We extend RS-MAA* to semi-decentralized systems.

Communication Schemes. The literature features numerous explicitly modeled communication mechanisms and frameworks, owing to the sheer diversity of information structures in real-world problems. [goldman2003optimizing] formalize the decentralized POMDP with communication (Dec-POMDP-Com), which incorporates an alphabet of possible messages and a communication cost function. The Dec-POMDP-Com models costly communication and generally assumes noise-free instantaneous broadcasting. Delayed communication occurs when agents learn the local observation received by others after one or more time-steps [oliehoek2016concise]. In one-step delayed communication, agents determine the latest joint action from the joint policy, but are unaware of the latest joint observation [oliehoek2012tree].
In k-steps delayed communication, a sufficient statistic may be used in place of the past joint policy [oliehoek2013sufficient]. Other multiagent models consider noisy [tung2021effective] or intermittent [zhang2024multiagent] communication channels. In the costly, delayed, noisy, and intermittent communication cases, the domain environment is orthogonal to the communication channel, and the actions taken by agents and the resulting state do not affect their future ability to coordinate. Therefore, while communication directly influences control, control does not in turn directly influence communication. This distinguishes semi-decentralized infrastructure from existing communication schemes.

Communication in Reinforcement Learning. Our primary contribution is a foundational model that supports both planning and reinforcement learning in restricted communication environments. We further present an exact algorithm for semi-decentralized planning, in which the communication and model dynamics are known prior to execution [moerland2022unifying]. Still, the growing body of multiagent deep reinforcement learning with communication (Comm-MADRL) research is replete with promising techniques for codifying communication and defining communication policy. [zhu2024survey] categorize Comm-MADRL approaches by communication policy, including full communication, partial communication, individual control, and global control subcategories. We are chiefly concerned with the individual control literature [jiang2018learning; singh2018learning; sheng2022learning], and seek to learn optimal control policies in light of potential communication links between agents. In the sub-field of learning tasks with communication, policies generated using learning algorithms simultaneously maximize environmental rewards and determine effective communication protocols for agents [zhu2024survey].
This parallels recent efforts in planning to design joint communication and control strategies [sudhakara2024optimal], which we also pursue.

3. Preliminaries

3.1. Dec-POMDPs

The decentralized partially observable Markov decision process (Dec-POMDP) is a stochastic, decentralized multiagent model for sequential decision-making under partial observability characterized by the tuple ⟨I, S, Ā, Ō, T, O, R⟩, where:

• I is a finite set of k agents,
• S is a finite set of states,
• Ā = ×_i A_i is a finite set of joint actions,
• Ō = ×_i O_i is a finite set of joint observations,
• T : S × Ā × S → [0, 1] is a state transition function, where T(s′ | s, ā) is the probability of being in state s′ given joint action ā performed in state s,
• O : Ō × S × Ā → [0, 1] is a joint observation function, where O(ō′ | s′, ā) specifies the probability of attaining joint observation ō′ when joint action ā results in state s′, and
• R : S × Ā → ℝ is a reward function, such that R(s, ā) is the immediate reward for performing joint action ā in state s.

Agents in the Dec-POMDP cannot explicitly share information. Each agent therefore selects actions in accordance with a local policy π_i informed by its action-observation history h_i, where h̄ ∈ (∏_{i∈I} A_i O_i)*. We are primarily concerned with deterministic policies, from which agent actions can be inferred using only observation histories ō_h. The collection of individual policies is the decentralized policy set π̄ = ⟨π_1, π_2, …, π_n⟩. Assume an infinite horizon h = ∞ and time discount rate γ. The objective is to find a policy set π̄ maximizing expected reward over states and observation histories:

V^π̄(s, ō_h) = E[ ∑_{t=0}^∞ γ^t R(s^t, π̄(ō_h^t)) | b^0, π̄ ]

π̄* ← argmax_{ā ∈ Ā} ( R(s, ā) + γ ∑_{s′∈S} ∑_{ō′∈Ō} Pr(s′, ō′ | s, ā) V*(s′, ō_h′) ).

3.2. MPOMDPs

The multiagent partially observable Markov decision process (MPOMDP) is a stochastic, centralized multiagent model for sequential decision-making under partial observability, also characterized by the tuple ⟨I, S, Ā, Ō, T, O, R⟩. MPOMDP agents also lack a complete picture of the underlying system state, but can share the individual actions a_i taken and observations o_i received. This permits a sufficient statistic in the form of a probability distribution over states called a belief b ∈ Δ^n, where n is the number of states and Δ^n is the (n−1)-simplex. We notate the multiagent belief as b_α. The belief over successor states s′ is updated using the set of agent histories h̄ of actions taken and observations received:

b_α′(s′) = η O(ō′ | s′, ā) ∑_{s∈S} T(s′ | s, ā) b_α(s),

where η is a normalizing constant equal to Pr(ō′ | b_α, ā)^{−1}, and Pr(ō′ | b_α, ā) = ∑_{s′∈S} O(ō′ | s′, ā) ∑_{s∈S} T(s′ | s, ā) b_α(s). The belief b_α^{ā,ō′} is then generated by applying the belief update equation to all s′ ∈ S. An initial belief b^0 is assumed, to avoid considering the Bellman equation for infinitely many starting beliefs. From the agent perspective, reward must be calculated as a function of belief, such that R(b_α, ā) = ∑_{s∈S} R(s, ā) b_α(s). In an MPOMDP, we seek a single joint policy π_α : Δ^n → Ā that maps beliefs to actions. The expected value of b_α over an infinite horizon may be written as:

V^{π_α}(b_α) = E[ ∑_{t=0}^∞ γ^t R(b_α^t, π_α(b_α^t)) | b_α^0 = b_α, π_α ].

We seek a π_α* that maximizes expected reward over time, determined by:

π_α* ← argmax_{ā ∈ Ā} ( R(b_α, ā) + γ ∑_{ō′∈Ō} Pr(ō′ | b_α, ā) V*(b_α^{ā,ō′}) ).

3.3. Semi-Markov Processes

Definition 1.
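The MPOMDP belief update above can be sketched directly in code. The following is a minimal illustration, not the paper's implementation: the tabular layout (nested dicts keyed by states, a joint action, and a joint observation) and the toy model in the usage below are assumptions for demonstration.

```python
# Minimal sketch of the MPOMDP belief update
#   b'(s') = eta * O(o | s', a) * sum_s T(s' | s, a) * b(s),
# with eta = Pr(o | b, a)^(-1). Data layout is illustrative only.

def belief_update(b, a, o, T, O, states):
    """Return the normalized posterior belief over successor states."""
    unnorm = {}
    for s_next in states:
        unnorm[s_next] = O[(o, s_next, a)] * sum(
            T[(s_next, s, a)] * b[s] for s in states
        )
    pr_o = sum(unnorm.values())  # Pr(o | b, a); assumed nonzero here
    return {s: p / pr_o for s, p in unnorm.items()}
```

On a two-state tiger-style example with static dynamics and an informative joint observation, the posterior concentrates sharply on the consistent state.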
The semi-Markov property for control admits a distribution over time for state transitions aligned with agent actions.

Definition 2. Sojourn control time τ is the assignment of a general continuous random variable T:

Q(T ≤ τ, s′ | s, a).  (1)

The semi-Markov decision process (SMDP) [puterman1994markov] is a stochastic single-agent model for sequential decision-making with sojourn system control. The SMDP is characterized by state transition distribution Q, the probability that the next state transition occurs at or before τ and results in successor state s′. It is mathematically convenient to define Q as the product F(τ | s, a) T(s′ | s, a, τ), or equivalently F(τ | s′, a, s) T(s′ | s, a), where F is the cumulative distribution function of τ. The distribution of τ is conditioned on the current state s and action taken a, and is therefore Markov. The "semi" in semi-Markov reflects the arbitrary probability distribution followed by model transitions. When τ = 0, a new action is taken to coincide with a decision epoch. The natural process time of the system is denoted η ∈ ℝ⁺, such that each decision epoch t ∈ ℕ with corresponding sojourn time τ^t ∈ ℝ⁺ and state s^t ∈ S occurs at η^t, where η^t is the decision epoch start time within the natural process.

4. Semi-Decentralization

Definition 3. The semi-Markov property for communication, or semi-decentralization, admits a distribution over time for what information agents can store in memory.

Definition 4. Sojourn communication time τ is a general continuous random variable representing the time for an agent to return to an information-sharing state. We here overload τ for sojourn communication time, distinct from sojourn control time.

Q(T̄ ≤ τ̄′, s′ | τ̄, s, ā, ā′).  (2)

As with SMDPs, we can define Q as the product of F(τ̄′ | s′, ā′, τ̄) and T(s′ | s, ā, τ̄), where τ̄′ may be conditioned on the subsequent joint action set ā′.
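The factorization Q = F · T suggests a simple two-stage sampler: draw the successor state from T, then draw the sojourn time from F conditioned on the transition. The sketch below assumes an invented two-state model with exponential sojourn distributions; the paper allows F to be an arbitrary distribution.

```python
import random

# Toy sketch of sampling (tau, s') from a semi-Markov transition
#   Q(T <= tau, s' | s, a) = F(tau | s', a, s) * T(s' | s, a).
# The model below (states, action, exponential rates) is invented
# purely for illustration.

def sample_transition(s, a, T, rate, rng):
    """Draw s' from T(. | s, a), then a sojourn time tau from an
    exponential F whose rate depends on (s, a, s')."""
    states, probs = zip(*T[(s, a)].items())
    s_next = rng.choices(states, weights=probs)[0]
    tau = rng.expovariate(rate[(s, a, s_next)])
    return tau, s_next
```

Repeated draws recover T's transition frequencies while the sojourn times follow the state-dependent F, which is exactly the coupling that distinguishes a semi-Markov process from a plain MDP with a fixed step length.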
SMDPs have one agent with an implicitly conditioned τ̄ = 0. SDec-POMDPs, however, have multiple agents with varied τ̄; the formulation is therefore more general, with τ̄′ conditioned on τ̄. Semi-decentralized models assume an initial τ̄^0, which can be interpreted as the communicating state of each agent when η = 0. When τ = 0, information sharing can occur, coinciding with a communication epoch. We assume noise-free instantaneous broadcast communication resulting in a single communicating agent set, as in a blackboard [erman1980hearsay; craig1988blackboard]. Semi-decentralization may, however, incorporate multiple distinct communicating sets. Whereas semi-Markov control systems toggle model transition dynamics using τ, semi-decentralized systems toggle history updates using τ̄.

5. The SDec-POMDP

The semi-decentralized partially observable Markov decision process (SDec-POMDP) is a stochastic, semi-decentralized multiagent model for sequential decision-making under partial observability characterized by the tuple ⟨I, S, Ā, Ō, F, T, O, R⟩.

Model. The SDec-POMDP model introduces selector functions f, g, and h to propagate agent memories, actions, and observations, respectively, to M_i and M_c through selector sub-nodes conditioned on τ̄. Figure 2 depicts an SDec-POMDP dynamic decision network where selector sub-nodes are contained within Z_i^sel = ⟨M_ci^sel, A_i^sel, O_i^sel⟩ and Z_c^sel = ⟨M̄^sel, Ā^sel, Ō^sel⟩. The selector infrastructure defines information-sharing configurations and enables the model to simultaneously maintain sets of centralized and decentralized agents. The set compositions change with changes to τ̄ following each state transition. Define M_c = (∏_{i∈I}(∅ ∪ A_i O_i))* and M_i = (A_i O_i)* × M_c.
The selector functions follow:

∀i, f(m_ci^sel) = m_c if τ_i = 0, and ∅ if τ_i > 0

f(m_c^sel) = {∀i, m_i if τ_i = 0; ∅ if τ_i > 0}

∀i, g(a_i^sel) = {a_j ∈ ā, ∀j | τ_j = 0} if τ_i = 0, and a_i if τ_i > 0

g(a_c^sel) = {a_i ∈ ā, ∀i | τ_i = 0}

∀i, h(o_i^sel) = {o_j ∈ ō, ∀j | τ_j = 0} if τ_i = 0, and o_i if τ_i > 0

h(o_c^sel) = {o_i ∈ ō, ∀i | τ_i = 0}.

Figure 2. The SDec-POMDP dynamic decision network, with the policy infrastructure on the left and the model on the right. The green backdrop contains the blackboard with memory M_c generated from the histories of communicating agents. The gray backdrop with plate notation includes the individual agent memories M_i. Z selector nodes are selectively toggled by τ̄ to facilitate memory propagation η, represented by dashed lines. Policy ψ edges are represented by dotted lines.

The SDec-POMDP framework is flexible and can be easily modified to capture the structural and informational characteristics of different problem domains.

Policy. For generality, assume an SDec-POMDP policy with deterministically stored a_i o_i, ∀i, and a deterministically selected ā, parameterized by both π_i : M_i → A_i and π_c : M_c → Ā. The memory propagation η and policy ψ probability functions follow:

ψ(ā) = 1 if ∀i, a_i = (π_c(m_c))_i when τ_i = 0 and a_i = π_i(m_i) when τ_i > 0; 0 otherwise

η_c(m_c′) = 1 if m_c′ = m_c m̄^sel ā^sel ō^sel, and 0 otherwise

∀i, η(m_i′) = 1 if m_i′ = m_i m_ci^sel a_i^sel o_i^sel, and 0 otherwise.

Objective Function. The SDec-POMDP objective is to identify the combination of memory propagation and policy functions that maximizes expected reward.
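The action selectors above can be transcribed almost literally: agents with τ_i = 0 exchange information through the blackboard, while agents with τ_i > 0 keep only their own data. The sketch below is an illustrative transcription (the list-indexed layout is an assumption, and the observation selector h is symmetric to g so it is omitted), not the paper's code.

```python
# Illustrative transcription of the selector functions for one joint
# step. tau is a per-agent list of sojourn communication times; a is
# the joint action, indexed by agent.

def g_individual(i, tau, a):
    # A communicating agent (tau_i == 0) receives the actions of all
    # communicating agents; otherwise it keeps only its own action.
    if tau[i] == 0:
        return {j: a[j] for j in range(len(a)) if tau[j] == 0}
    return {i: a[i]}

def g_blackboard(tau, a):
    # The blackboard stores the actions of all communicating agents.
    return {i: a[i] for i in range(len(a)) if tau[i] == 0}

def f_individual(i, tau, m_c):
    # A communicating agent pulls the blackboard memory m_c; a
    # non-communicating agent receives nothing (the empty set).
    return m_c if tau[i] == 0 else None
```

With τ = (0, 3, 0), agents 0 and 2 form the communicating set and see each other's actions plus the blackboard memory, while agent 1 remains decentralized for this epoch.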
Consider the infinite horizon case:

J(ψ, η_c, η̄) = E[ ∑_{t=0}^∞ γ^t R(s^t, ā^t) | b^0 ]

J* = argmax_{ψ, η_c, η̄} J(ψ, η_c, η̄).

Rewritten in terms of the policies π̄ and π_c and the blackboard and agent memories m_c and m̄:

V^{π_c,π̄}(m_c, m̄) = E[ ∑_{t=0}^∞ γ^t R(s^t, ā^t) | b^0, m_c, m̄, π_c, π̄ ] = ∑_{s∈S} V^{π_c,π̄}(m_c, m̄, s) b^0(s)

V^{π_c,π̄}(m_c, m̄, s) = ∑_{ā∈Ā} ψ(ā | m_c, m̄) [ R(s, ā) + γ ∑_{τ̄∈T̄} F(dτ̄ | s, ā) ∑_{s′∈S} T(s′ | s, ā) ∑_{ō′∈Ō} O(ō′ | s′, ā) ∑_{m_c′∈M_c} η_c(m_c′ | m_c, f(m̄), g(ā), h(ō′)) ∑_{m̄′∈M̄} ∏_{i∈I} η(m_i′ | m_i, f(m_c), g(ā), h(ō′)) V^{π_c,π̄}(m_c′, m̄′, s′) ].

6. Theoretical Analysis

Definition 5. Models X_θ and X_φ are equivalent if they reduce to one another via mapping function f, such that X_θ ≤ X_φ and X_φ ≤ X_θ.

Definition 6. Model-policy structures XY_θ and XY_φ are equivalent if they reduce to one another via mapping function g, such that XY_θ ≤ XY_φ and XY_φ ≤ XY_θ.

Definition 7. Model-policy-objective structures Z_θ and Z_φ are equivalent if ∀h̄, V_{XY_θ}(h̄, b^0) = V_{XY_φ}(h̄, b^0).

6.1. MPOMDP

Lemma 1. SDec-POMDP and MPOMDP models are equivalent.

Proof. Demonstrate that (1) X_MPOMDP ≤ X_SDec-POMDP and (2) X_SDec-POMDP ≤ X_MPOMDP.

(1) X_MPOMDP ≤ X_SDec-POMDP. Let I′ = I, S′ = S, Ā′ = Ā, R′ = R, and Ō′ = Ō, where prime notation indicates the SDec-POMDP for purposes of relating models. The state transition and observation functions are defined to reproduce the MPOMDP dynamics independently of τ̄, such that T′(s′ | s, ā, τ̄) = T(s′ | s, ā) and O′(ō′ | s′, ā, τ̄) = O(ō′ | s′, ā).
Assume a deterministic communication sojourn time function F, where τ′ for each agent is fixed to one, resulting in complete centralization at subsequent decision epochs: ∀i, F′(τ_i′ < 1 | s, a_i, τ_i) = 0 and F′(τ_i′ = 1 | s, a_i, τ_i) = 1. Selector functions therefore return joint actions and observations at each time-step: ∀i, g′(a_i^sel) = g′(a_c^sel) = ā and ∀i, h′(o_i^sel) = h′(o_c^sel) = ō.

(2) X_SDec-POMDP ≤ X_MPOMDP. Let R′ = R and Ā′ = Ā. The agent set is extended to include an additional blackboard agent with an independent memory, so that I′ = I ∪ I_c. We assume without loss of generality that |I_c| = 1, and that the communication sojourn time for this agent satisfies τ_i = 0, ∀i ∈ I_c. The state space is augmented to include τ̄, such that S′ = S × (ℝ⁺)ⁿ. Similarly, the joint observation space is expanded to include the sequence of all action-observation pairs, Ō′ = Ō × ∏_{i∈I}(A_i O_i)*. The transition and observation functions adopt the factored state space: T′(⟨s′, τ̄′⟩ | ⟨s, τ̄⟩, ā, ā′) = F(τ̄′ | s′, ā′, τ̄) T(s′ | s, ā, τ̄) and, at its most general, O′(ō′ | ⟨s′, τ̄′⟩, ⟨s, τ̄⟩, ā) = O(ō′ | s′, s, ā, τ̄′, τ̄). ∎

Lemma 2. SDec-POMDP and MPOMDP model-policy structures are equivalent.

Proof. Demonstrate that (1) XY_MPOMDP ≤ XY_SDec-POMDP and (2) XY_SDec-POMDP ≤ XY_MPOMDP.

(1) XY_MPOMDP ≤ XY_SDec-POMDP. Let each memory selector function return the joint memory, such that ∀i, f′(m_ci^sel) = f′(m_c^sel) = m̄.
The update rule for the joint memory is deterministic and concatenates the prior shared memory with the complete set of agent actions and observations from the current time-step, as represented by:

η′(m̄′) = η_c′(m_c′) = 1 if m̄′ = m̄ ā^sel ō^sel, and 0 otherwise.

Agent actions are selected using a policy over the joint memory:

ψ′(ā) = 1 if ā = π(m̄), and 0 otherwise.

(2) XY_SDec-POMDP ≤ XY_MPOMDP. Construct policy π^M(m_i′) = ⟨a_1, …, a_n⟩, where the action for agent i is determined to be:

a_i = (π_c(m̄_c))_i if τ_i = 0, and π_i(m_i) if τ_i > 0,

as if each agent draws from the blackboard's policy when its communication sojourn time is zero and otherwise follows a local policy. The memory update rule is defined to extend the current memory with the observed joint outcome:

η′(m̄′) = 1 if m̄′ = m̄ ō, and 0 otherwise.

Joint action selection is consistent with the constructed joint policy:

ψ′(ā) = 1 if ∀i, a_i = π^M(m_i′), and 0 otherwise. ∎

Lemma 3. SDec-POMDP and MPOMDP model-policy-objective structures are equivalent.

Proof. Show that ∀h̄, V_{XY_MPOMDP}(h̄, b^0) = V_{XY_SDec-POMDP}(h̄, b^0); that is, show that the semi-decentralized value function reduces to the standard value function under the original joint policy. Begin with:

V^{π_c,π̄}(m_c, m̄, s) = ∑_{ā∈Ā} ψ(ā | m_c, m̄) [ R(s, ā) + γ ∑_{τ̄∈T̄} F(dτ̄ | s, ā) ∑_{s′∈S} T(s′ | s, ā) ∑_{ō′∈Ō} O(ō′ | s′, ā) ∑_{m_c′∈M_c} η_c(m_c′ | m_c, f(m̄), g(ā), h(ō′)) ∑_{m̄′∈M̄} ∏_{i∈I} η(m_i′ | m_i, f(m_c), g(ā), h(ō′)) V^{π_c,π̄}(m_c′, m̄′, s′) ].

By construction, each term evaluates deterministically: ψ(ā | m_c, m̄) = 1 with m_c = h̄; the distribution over sojourn times collapses to one, with all random variables independent of dτ; the blackboard's memory update enforces m_c′ = m_c ā^sel ō^sel; and each agent's local memory update yields m_i′ = m_i a_i^sel o_i^sel. Moreover, since ψ(ā | m_c, m̄, s) = 1 whenever ā = π(h̄), we obtain V^{π_c,π̄}(m_c, m̄, s) = V^π(s, h̄), which results in the standard value recursion:

R(s, π(h̄)) + γ ∑_{s′∈S} ∑_{ō′∈Ō} Pr(s′, ō′ | s, π(h̄)) V^π(s′, h̄′).

Further observe that m̄_c = m̄ and

∑_{s′∈S} T(s′ | s, ā) ∑_{ō′∈Ō} O(ō′ | s′, ā) = ∑_{s′∈S} ∑_{ō′∈Ō} Pr(s′, ō′ | s, π(h̄)). ∎

Proposition 1. The SDec-POMDP and MPOMDP are equivalent.

6.2. Dec-POMDP

Lemma 4. SDec-POMDP and Dec-POMDP models are equivalent.

Proof. Demonstrate that (1) X_Dec-POMDP ≤ X_SDec-POMDP and (2) X_SDec-POMDP ≤ X_Dec-POMDP.

(1) X_Dec-POMDP ≤ X_SDec-POMDP. Again set I′ = I, S′ = S, Ā′ = Ā, R′ = R, and Ō′ = Ō. Let T′(s′ | s, ā, τ̄) = T(s′ | s, ā) and O′(ō′ | s′, ā, τ̄′) = O(ō′ | s′, ā). Action and observation selection are specified so that for every agent i, g′(a_i^sel) = a_i and h′(o_i^sel) = o_i. Assign the blackboard the null set, such that g′(a_c^sel) = h′(o_c^sel) = ∅. By the construction of g′ and h′, τ has no impact and F can be disregarded.
(2) X_SDec-POMDP ≤ X_Dec-POMDP. Reference Lemma 1, proof (2), as X_Dec-POMDP = X_MPOMDP. ∎

Lemma 5. SDec-POMDP and Dec-POMDP model-policy structures are equivalent.

Proof. Demonstrate that (1) XY_Dec-POMDP ≤ XY_SDec-POMDP and (2) XY_SDec-POMDP ≤ XY_Dec-POMDP.

(1) XY_Dec-POMDP ≤ XY_SDec-POMDP. Let each memory selector function return the null set, such that ∀i, f′(m_ci^sel) = f′(m_c^sel) = ∅. The update rule for each agent's memory is deterministic and concatenates the prior memory with the individual agent's action and observation from the current time-step, as represented by:

η′(m_i′) = 1 if m_i′ = m_i a_i^sel o_i^sel, and 0 otherwise.

Agent actions are selected using a policy over the agent's memory:

ψ′(ā) = 1 if ∀i, a_i = π_i(m_i), and 0 otherwise.

By the construction of ψ′, η_c′ can be disregarded.

(2) XY_SDec-POMDP ≤ XY_Dec-POMDP. Construct the simulated policy (π^D)_i(m_i′) = a_i, where:

a_i = (π_c(m_c))_i if τ_i = 0, and π_i(m_i) if τ_i > 0,

as if each agent draws from the blackboard's policy when its communication sojourn time is zero and otherwise follows a local policy. Each agent's memory update rule appends the taken action and received observation to the current memory:

η′(m_i′) = 1 if m_i′ = m_i a_i o_i, and 0 otherwise.

Agent action selection is consistent with the constructed policy:

ψ′(ā) = 1 if ∀i, a_i = (π^D)_i(m_i′), and 0 otherwise. ∎

Lemma 6. SDec-POMDP and Dec-POMDP model-policy-objective structures are equivalent.

Proof.
Show that ∀h̄, V_{XY_Dec-POMDP}(h̄, b^0) = V_{XY_SDec-POMDP}(h̄, b^0). The semi-decentralized value function reduces to the decentralized value function under the original policy set, beginning with:

V^{π_c,π̄}(m_c, m̄, s) = ∑_{ā∈Ā} ψ(ā | m_c, m̄) [ R(s, ā) + γ ∑_{τ̄∈T̄} F(dτ̄ | s, ā) ∑_{s′∈S} T(s′ | s, ā) ∑_{ō′∈Ō} O(ō′ | s′, ā) ∑_{m̄′∈M̄} ∏_{i∈I} η(m_i′ | m_i, f(m_c), g(ā), h(ō′)) ∑_{m_c′∈M_c} η_c(m_c′ | m_c, f(m̄), g(ā), h(ō′)) V^{π_c,π̄}(m_c′, m̄′, s′) ].

By construction, each term evaluates deterministically: with M_c = ∅, ψ(ā | m_c, m̄) = 1; the distribution over sojourn times collapses to one, with all random variables independent of dτ; the blackboard's memory remains m_c′ = ∅; and each agent's local memory update yields m_i′ = m_i a_i^sel o_i^sel. As before, since ψ(ā | m_c, m̄, s) = 1 whenever ā = π̄(h̄), we obtain V^{π_c,π̄}(m_c, m̄, s) = V^π̄(s, h̄), which results in the standard value recursion:

R(s, π̄(h̄)) + γ ∑_{s′∈S} ∑_{ō′∈Ō} Pr(s′, ō′ | s, π̄(h̄)) V^π̄(s′, h̄′).

Again observe that:

∑_{s′∈S} T(s′ | s, ā) ∑_{ō′∈Ō} O(ō′ | s′, ā) = ∑_{s′∈S} ∑_{ō′∈Ō} Pr(s′, ō′ | s, π̄(h̄)). ∎

Proposition 2. The SDec-POMDP and Dec-POMDP are equivalent.

Corollary 1.
The SDec-POMDP is in the same complexity class as the Dec-POMDP (NEXP-complete).

Proposition 3. The SDec-POMDP and k-steps-delayed communication are equivalent.

Proposition 4. The SDec-POMDP and Dec-POMDP-Com are equivalent.

We provide proofs for Propositions 3 and 4 in the technical appendix.

7. Recursive Small-Step Semi-Decentralized A*

Recursive small-step semi-decentralized A* (RS-SDA*) is an exact planning algorithm for optimizing expected reward in SDec-POMDP problems. RS-SDA* modifies RS-MAA* (koops2023recursive) by maintaining a stage-specific partition of decentralized and centralized joint observation histories per the probabilistic communication dynamics. Like RS-MAA*, RS-SDA* relies on incremental expansion of a small-step search tree, clustering, recursive heuristics, memoization, and last-stage modifications. RS-SDA* generates a fully-specified policy set $\bar{\pi} \in \bar{\Pi}$ using offline planning. A fully-specified policy set contains both fully-specified local policies and, if appropriate for the problem, a blackboard policy: $\bar{\pi} = \langle \pi_1, \ldots, \pi_n, \pi_c \rangle$, where $\pi_i : O_i^h \to A_i$ and $\pi_c : \bar{O}^h \to \bar{A}$. Similarly, for partial policies, $\bar{\varphi} = \langle \varphi_1, \ldots, \varphi_n, \varphi_c \rangle$, where $\varphi_i : O_i^{\le h-1} \to A_i$ and $\varphi_c : \bar{O}^{\le h-1} \to \bar{A}$. RS-SDA* is outlined in Algorithm 1.

Small-Step Search Tree

We adopt the small-step approach first introduced for A* by cazenave2010partial and used in RS-MAA* by koops2023recursive to limit search tree outdegree. Small-step search can be used with the centralized and decentralized components of each policy node, depicted in Figure 3. As shown in Table 1, small-step search provides RS-SDA* with mixed-component policies, pre-clustering, a lower bound (complete decentralization) and an upper bound (complete centralization) on the number of considered nodes per stage t.
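As a concrete check, the per-stage level and node counts reported in Table 1 can be computed directly from the closed-form growth rates; the sketch below is our own (function names are not from the paper):

```python
# Per-stage level and node counts for the three search regimes compared in
# Table 1. n: number of agents, n_obs = |O*|, n_act = |A*|, t: stage index.

def rs_maa_counts(t, n=2, n_obs=2, n_act=3):
    """RS-MAA* (RS-SDA* lower bound): one level per (agent, local history)."""
    levels = n * n_obs ** t
    return levels, levels * n_act  # each level branches over |A*| actions

def rs_sda_upper_counts(t, n=2, n_obs=2, n_act=3):
    """RS-SDA* upper bound (complete centralization): one level per joint history."""
    levels = n_obs ** (t * n)
    return levels, levels * n_act ** n  # each level branches over |A*|^n joint actions

def maa_counts(t, n=2, n_obs=2, n_act=3):
    """Classical MAA*: a single level enumerating all stage policies at once."""
    return 1, n_act ** (n * n_obs ** t)
```

With the SDec-Tiger parameters ($|A_\star| = 3$, $|O_\star| = 2$, $n = 2$), these reproduce the tabulated counts, e.g. stage 2 gives 8 levels and 24 nodes for RS-MAA* versus 16 levels and 144 nodes for the RS-SDA* upper bound.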
When $F(\tau \mid s, \bar{a}) = F(\tau \mid \bar{a})$ and the policy is deterministic, we can consolidate the centralized and decentralized components of policies and significantly reduce the search tree size. This results in an RS-SDA* lower bound that is equivalent to RS-MAA* and an RS-SDA* upper bound that has $|O_\star|^{tn}$ levels per stage and considers $|A_\star|^n$ joint actions per level, where $|O_\star|$ is the size of the largest observation set and $|A_\star|$ is the size of the largest action set.

        RS-MAA* (RS-SDA* lower bound)    RS-SDA* upper bound                 Classical MAA*
  t     levels         nodes/stage       levels         nodes/stage          levels   nodes/stage
        n|O*|^t        n|O*|^t |A*|      |O*|^{tn}      |O*|^{tn} |A*|^n     1        |A*|^{n|O*|^t}
  0     2              6                 1              9                    1        9
  1     4              12                4              36                   1        81
  2     8              24                16             144                  1        6,561
  3     16             48                64             576                  1        >1E6
  4     32             96                256            2,304                1        >1E6
  5     64             192               1,024          9,216                1        >1E6
  6     128            384               4,096          36,864               1        >1E6
  7     256            768               16,384         147,456              1        >1E6
  8     512            1,536             65,536         589,824              1        >1E6

Table 1. SDec-Tiger levels and nodes by stage for RS-MAA*, RS-SDA*, and MAA*, with $|A_\star| = 3$, $|O_\star| = 2$, and $n = 2$.

Figure 3. Illustrating RS-SDA* applied to SDec-Tiger using mixed-component policies through stage $\sigma = 2$.

Dynamic Programming

We apply backward induction over beliefs to rapidly determine the value of centralized policy components. This bypasses expensive recursive heuristic calculations for large portions of the search. For each remaining horizon $r$ and belief $b$, we compute $V_r(b)$ and $Q_r(b, \bar{a})$ and memoize both $V$ and $Q$ under keys $(r, b)$ and $(r, b, \bar{a})$ to enable extensive reuse during A* expansion. Similarly, observation likelihoods $P(\bar{o} \mid b, \bar{a})$ and posteriors $b'_{\bar{o}, \bar{a}}$ are obtained via a belief update over the model's transition and observation tensors, then cached for subsequent calls.

Ordering Observation Histories

Any $\bar{\varphi}$ may contain both decentralized and centralized mappings conditioned on the underlying state and actions taken.
We therefore explore joint observation histories (JOHs) and local observation histories (LOHs) in a predetermined sequence: by stage, JOHs then LOHs, by agent (for LOHs), then lexicographically. Observe that, for length $t$, all $\bar{o}^{0:t} \preceq o^{0:t}$, and all $o^{0:t} \preceq \bar{o}^{0:t'}$ for subsequent stages $t'$. Additionally, $\bar{o}^{0:t} \preceq_{lex} \bar{o}'^{0:t}$, and $o_i^{0:t} \preceq o_j^{0:t}$ if $i < j$ or ($i = j$ and $o_i^{0:t} \preceq_{lex} o_j^{0:t}$).

Clustering

We implement lossless incremental clustering of decentralized policy components based on a probabilistic equivalence criterion (oliehoek2009lossless). We similarly cluster centralized policy components based on the resulting joint belief: $\forall s$, $P(s \mid \bar{o}_1^{0:t}) = P(s \mid \bar{o}_2^{0:t})$.

Admissible Heuristic

As with multiagent A* (MAA*) (szer2005maa), an admissible heuristic $Q$ guides a path through a search tree whose nodes are partial policies $\bar{\varphi}$. A heuristic is admissible if it equals or over-approximates the true value of the policy node. An open list is maintained with nodes under consideration. The node in the open list with the highest heuristic value is expanded and replaced in the open list by its children. The tree search terminates once a fully-specified policy with the highest heuristic value is identified. For each parent node and candidate action, we split each posterior belief using $S_{sync}$ and $A_{sync}$ (more generally expressed using $F(s, a)$) and take a probability-weighted sum of the exact centralized value on the communication-dependent posterior and the exact decentralized value on its complement. Because every constituent (centralized, decentralized, and their mixture conditioned on communication) is an exact optimum of a relaxation of the remaining subproblem, the resulting heuristic never underestimates the achievable return and is therefore admissible.
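The exact centralized values referenced above (the memoized $V_r(b)$ and $Q_r(b, \bar{a})$ of the Dynamic Programming subsection) amount to a memoized belief-MDP backup. The following minimal sketch illustrates the idea on a toy two-state, two-action, two-observation model; the tensors, shapes, and names here are our own illustrative assumptions, not a benchmark from the paper:

```python
from functools import lru_cache

# Toy model (our own): T[a][s][s'] transition, O[a][s'][o] observation,
# R[s][a] reward. Beliefs are tuples so they can serve as memoization keys.
T = {0: ((0.9, 0.1), (0.2, 0.8)), 1: ((0.5, 0.5), (0.5, 0.5))}
O = {0: ((0.8, 0.2), (0.3, 0.7)), 1: ((0.5, 0.5), (0.5, 0.5))}
R = ((1.0, 0.0), (-1.0, 0.5))
GAMMA = 0.95
ACTIONS, STATES, OBS = (0, 1), (0, 1), (0, 1)

def update(b, a, o):
    """Posterior belief and likelihood of observing o after taking a from b."""
    joint = [sum(b[s] * T[a][s][sp] for s in STATES) * O[a][sp][o] for sp in STATES]
    p = sum(joint)
    return (tuple(x / p for x in joint) if p else b), p

@lru_cache(maxsize=None)  # memoize V under key (r, b), as in the text
def V(r, b):
    return max(Q(r, b, a) for a in ACTIONS) if r else 0.0

@lru_cache(maxsize=None)  # memoize Q under key (r, b, a)
def Q(r, b, a):
    value = sum(b[s] * R[s][a] for s in STATES)  # expected immediate reward
    for o in OBS:
        bp, p = update(b, a, o)
        if p:
            value += GAMMA * p * V(r - 1, bp)  # backward induction over beliefs
    return value
```

During A* expansion, repeated queries for the same remaining horizon and belief hit the cache rather than recomputing the backup, which is the reuse the text describes.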
Input: $\Phi \triangleq (S_{sync}, A_{sync})$; horizon $h$; initial belief $b$; initial policy $\varphi$; heuristic depth $d$; iterations $M$; upper bound $u$
Output: optimal policy $\varphi^*$, optimal value $v^*$

function RS-SDA*($h, b, \varphi, d, M, u, \Phi$):
    $Q \leftarrow$ PriorityQueue(↑)
    $Q$.push(min($\varphi$.heuristics), $\varphi$)
    $i \leftarrow 0$
    while $Q$ is not empty do
        $(v, \varphi) \leftarrow Q$.pop()
        if $v < \infty$ then $i \leftarrow i + 1$
        if $i \ge M$ or $v \le u$ then return (min($v, u$), None)
        $\sigma \leftarrow$ current stage of $\varphi$;  $j \leftarrow$ current agent index of $\varphi$
        // Phase 1: Transition to Subsequent Stage
        if stage $\sigma$ is complete ($j = |A|$) then
            if $\sigma = h$ then return $(v, \varphi)$
            $(D_{dec}, D_{cen}, P_{dec}, P_{cen}, \rho) \leftarrow$ TerminalProbs($\varphi, \Phi$)
            if $\rho > 0$ then $\varphi \leftarrow$ ClusterPolicyDec($\varphi, D_{dec}, P_{dec}$)
            if $\rho < 1$ then $\varphi \leftarrow$ ClusterPolicyCen($\varphi, D_{cen}, P_{cen}$)
            $w \leftarrow$ EvaluatePolicy($\varphi, \rho$)
            $\varphi$.heuristics.append($w$)
            $\varphi$.depth $\leftarrow$ min($\sigma, d$)
            $\varphi$.extend_empty_stage();  $\sigma \leftarrow \sigma + 1$;  $j \leftarrow 0$
        end if
        // Phase 2: Policy Expansion
        if $\sigma = h$ (final stage optimization) then
            if $\rho < 1$ and centralized component unfilled then
                determine $\bar{a}^*$ for the centralized component
                $\varphi' \leftarrow \varphi$ with $\bar{a}^*$ set;  $w \leftarrow$ EvaluatePolicy($\varphi', \rho$)
                $Q$.push($w, \varphi'$)
            else
                determine $a_j^*$ for the decentralized component
                $\varphi' \leftarrow \varphi$ with $a_j^*$ set;  $w \leftarrow$ EvaluatePolicy($\varphi', \rho$)
                if $w = v$ then return $(w, \varphi')$
                $Q$.push($w, \varphi'$)
        else if $\rho < 1$ and centralized component unfilled then
            foreach joint action $\bar{a} \in \bar{A}$ do
                $\varphi' \leftarrow \varphi$ with $\bar{a}$ set
                $w \leftarrow$ ExactCentralQ($\varphi', \bar{a}$)
                $Q$.push($w, \varphi'$)
            end foreach
        else if agent $j$ unfilled then
            foreach $a_j \in A_j$ do
                $\varphi' \leftarrow \varphi$ with $a_j$ set
                $w \leftarrow$ EvaluatePolicy($\varphi'$)
                $Q$.push($w, \varphi'$)
            end foreach
    end while

Algorithm 1. Recursive Small-Step Semi-Decentralized A*

8.
Experiments

Semi-Decentralized Benchmarks

We evaluate RS-SDA* on semi-decentralized versions of four common Dec-POMDP benchmarks: Dec-Tiger (nair2003taming), FireFighting (oliehoek2008optimal), BoxPushing (seuken2007improved), and Mars (amato2009achieving), and on a new MaritimeMEDEVAC benchmark. Problem descriptions with illustrations are provided in the technical appendix. All experiments were conducted using an AMD Ryzen 9 9900X3D 12-core processor (4400 MHz), with timeout occurring at 20 minutes and memout at 16 GB. We adopt hyper-parameters M = 200, d = 3, and α = 0.2 for all experiments. A link to our code repository is provided in the technical appendix to support reproducibility.

Figure 4. MaritimeMEDEVAC environment representation and centralized/decentralized/semi-decentralized optimal policy values for horizons one through eight.

Results

As shown in Table 2 and Figure 4, RS-SDA* is competitive with the centralized upper bound across most semi-decentralized benchmarks and MaritimeMEDEVAC. The modified benchmarks demonstrate how model dynamics influence the value of information in multiagent systems. SDec-FireFighting exemplifies problems where centralization benefits are negligible: the optimal RS-SDA* solution equals the optimal RS-MAA* solution for all considered h. By contrast, SDec-BoxPushing exemplifies problems where partial centralization results in complete information sharing: the optimal RS-SDA* solution equals the fully centralized optimum for all considered h. In MaritimeMEDEVAC, the three regimes are nearly indistinguishable at moderate horizons (h = 4, 5, 6), but at h = 7 centralized reaches 6.62 while semi-decentralized attains 6.36, clearly outperforming full decentralization (3.27). At h = 7, the semi-decentralized policy recovers about 96% of the centralized value.
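The 96% figure follows directly from the Table 2 entries at h = 7:

```python
# MaritimeMEDEVAC optimal values at horizon 7, taken from Table 2:
# semi-decentralized (RS-SDA*), centralized upper bound, decentralized (RS-MAA*).
V_SEMI, V_CEN, V_DEC = 6.36301, 6.61819, 3.26710

# Fraction of the centralized optimum recovered by the semi-decentralized policy.
recovery = V_SEMI / V_CEN
```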
These results indicate that semi-decentralization and RS-SDA* preserve much of the benefit of centralized coordination while staying tractable, with occasional slowdowns or timeouts on problem instances where lossless clustering is largely ineffective.

               lower bound           our approach          upper bound
               decentralized         semi-decentralized    centralized
               RS-MAA*               RS-SDA*               RS-SDA*
  h            value      time       value      time       value      time
SDec-Tiger
  8            12.21726   1          27.21518   <1         47.71696   <1
  9            15.57244   6          30.90457   <1         53.47353   <1
  10           15.18438   271        34.72370   <1         60.50988   <1
SDec-FireFighting (n_h=3, n_f=3)
  3            -5.73697   <1         -5.73697   <1         -5.72285   <1
  4            -6.57883   2          -6.57883   3          -6.51859   3
  5            -7.06987   29         -7.06987   49         -6.98069   37
SDec-BoxPushing
  3            66.08100   <1         66.81000   <1         66.81000   <1
  4            98.59361   18         99.55630   <1         99.55630   <1
  5            107.72985  MO         109.09251  1          109.09251  1
SDec-Mars (Right Band Rendezvous)
  4            10.18080   1          10.18080   <1         10.87020   <1
  5            13.26654   2          14.29038   <1         14.38556   1
  6            18.62317   4          20.06430   2          20.06706   3
SDec-Mars (Survey Site Beacons)
  4            10.18080   1          10.54620   <1         10.87020   <1
  5            13.26654   2          13.26654   <1         14.38556   1
  6            18.62317   4          18.62317   143        20.06706   3
SDec-Mars (Drill Site Beacons)
  4            10.18080   1          10.87020   <1         10.87020   <1
  5            13.26654   2          14.38556   <1         14.38556   1
  6            18.62317   4          20.06168   2          20.06706   3
MaritimeMEDEVAC
  6            3.18348    <1         3.19807    28         3.19945    <1
  7            3.26710    2          6.36301    33         6.61819    1
  8            8.03228    156        10.61275   MO         10.88244   1

Table 2. RS-SDA* offline planning performance on semi-decentralized benchmarks and MaritimeMEDEVAC. Times are in seconds; TO denotes timeout (>1200 s) and MO denotes memout (>16 GB).

9. Conclusion

We present a foundational framework for multiagent decision making under probabilistic communication. We introduce the SDec-POMDP, which unifies the Dec-POMDP, MPOMDP, and several communication mechanisms with delay, loss, or cost.
A secondary contribution is RS-SDA*, an admissible heuristic search algorithm for semi-decentralized systems with excellent performance, along with semi-decentralized versions of four standard benchmarks and a new medical evacuation scenario. Taken together, the SDec-POMDP and RS-SDA* provide a principled basis for studying and exploiting probabilistic communication in cooperative teams. Future work includes interleaving offline planning and online search to improve approximate RS-SDA* performance and investigating systems with non-stationary sojourn time distributions.

References

Appendix A

A.1. k-Steps Delayed Communication

k-steps delayed communication (nayyar2010optimal, oliehoek2013sufficient, oliehoek2008optimal) is a model for delayed broadcast communication in which each agent receives the complete joint history from time $t - k$ at time $t$. This enables k-steps delayed communication to generate a joint belief $b^{t-k}$ on which to conduct subsequent decentralized planning. We formally define the agent histories under k-steps delayed communication below:
$$H_i^t = (A_i O_i)^t$$
$$\bar{h}_i \in \bar{H}_{i,k} = \bigcup_{t=1}^{\infty} \prod_{j \in I} \begin{cases} H_i^t & \text{if } j = i \\ H_j^{\max\{0,\, t-k\}} & \text{otherwise} \end{cases}$$
$$\bar{h} = \prod_{i \in I} \bar{h}_i$$

Lemma 7. SDec-POMDP and k-steps delayed communication models are equivalent.

Proof. Demonstrate that (1) $X_{k\text{-steps Delayed}} \le X_{SDec\text{-}POMDP}$ and (2) $X_{SDec\text{-}POMDP} \le X_{k\text{-steps Delayed}}$.

1. $X_{k\text{-steps Delayed}} \le X_{SDec\text{-}POMDP}$. Let $I' = I$, $S' = S$, $\bar{A}' = \bar{A}$, $R' = R$, and $\bar{O}' = \bar{O}$. The state transition and observation functions are independent of $\bar{\tau}$, such that $T'(s' \mid s, \bar{a}, \bar{\tau}) = T(s' \mid s, \bar{a})$ and $O'(\bar{o}' \mid s', \bar{a}, \bar{\tau}') = O(\bar{o}' \mid s', \bar{a})$.
Again assume a deterministic communication sojourn time function $F$, where $\tau'$ for each agent is fixed to one, resulting in complete centralization at subsequent decision epochs: $\forall i$, $F'(\tau_i' < 1 \mid s, a_i, \tau_i) = 0$ and $F'(\tau_i' = 1 \mid s, a_i, \tau_i) = 1$. Blackboard selector functions return joint actions and observations at each time-step: $g'(a_c^{sel}) = \bar{a}$ and $h'(o_c^{sel}) = \bar{o}$. Agent selector functions return agent actions and observations at each time-step: $\forall i$, $g'(a_i^{sel}) = a_i$ and $h'(o_i^{sel}) = o_i$.

2. $X_{SDec\text{-}POMDP} \le X_{k\text{-steps Delayed}}$. Let $k = 0$. See the proof of Lemma 4 for construction of an SDec-POMDP model within a Dec-POMDP. ∎

Lemma 8. SDec-POMDP and k-steps delayed model-policy structures are equivalent.

Proof. Demonstrate that (1) $XY_{k\text{-steps Delayed}} \le XY_{SDec\text{-}POMDP}$ and (2) $XY_{SDec\text{-}POMDP} \le XY_{k\text{-steps Delayed}}$.

1. $XY_{k\text{-steps Delayed}} \le XY_{SDec\text{-}POMDP}$. For any object $X$ carrying a time index, $X|_{0:r}$ denotes the same object with indices $> r$ disabled/ignored. Let each agent memory selector function return the blackboard memory at $t - k$, such that $\forall i$, $f'(m_{ci}^{sel}) = m_c^t|_{0:t-k}$. The blackboard memory selector function returns the latest joint agent memory set. The update rule for each agent's memory is deterministic and concatenates the prior memory with the set of individual agent actions and observations from the current time-step:
$$\eta'(m_i') = \begin{cases} 1, & \text{if } m_i' = m_i m_{ci}^{sel} a_i^{sel} o_i^{sel} \\ 0, & \text{otherwise} \end{cases}$$
Agent actions are selected using a policy over the agent's memory:
$$\psi'(\bar{a}) = \begin{cases} 1, & \forall i,\ a_i = \pi_i(m_i) \\ 0, & \text{otherwise} \end{cases}$$
Finally, the update rule for the blackboard memory is deterministic and concatenates the prior shared memory with the complete set of agent actions and observations from the current time-step:
$$\eta_c(m_c') = \begin{cases} 1, & \text{if } m_c' = m_c \bar{a}^{sel} \bar{o}^{sel} \\ 0, & \text{otherwise} \end{cases}$$

2. $XY_{SDec\text{-}POMDP} \le XY_{k\text{-steps Delayed}}$. Let $k = 0$. See the proof of Lemma 5 for construction of an SDec-POMDP model-policy structure within a Dec-POMDP. ∎

Lemma 9. SDec-POMDP and k-steps delayed model-policy-objective structures are equivalent.

Proof. Show $\forall \bar{h}$: $V_{XY_{k\text{-steps Delayed}}}(\bar{h}, b_0) = V_{XY_{SDec\text{-}POMDP}}(\bar{h}, b_0)$. The semi-decentralized value function reduces to the k-steps delayed value function under the original policy set, beginning with:
$$\begin{aligned}
V_{\pi_c,\bar{\pi}}(m_c, \bar{m}, s) ={}& \sum_{\bar{a}\in\bar{A}} \underbrace{\psi(\bar{a}\mid m_c, \bar{m})}_{1,\ m_c=\bar{h}} \Big[ R(s,\bar{a}) + \gamma \sum_{\bar{\tau}\in\bar{T}} \underbrace{F(d\bar{\tau}\mid s,\bar{a})}_{1} \sum_{s'\in S} T(s'\mid s,\bar{a}) \sum_{\bar{o}'\in\bar{O}} O(\bar{o}'\mid s',\bar{a}) \\
& \sum_{m_c'\in M_c} \underbrace{\eta_c(m_c'\mid m_c, f(\bar{m}), g(\bar{a}), h(\bar{o}'))}_{1,\ m_c'=m_c\bar{a}^{sel}\bar{o}^{sel}} \sum_{\bar{m}'\in\bar{M}} \prod_{i\in I} \underbrace{\eta(m_i'\mid m_i, f(m_c), g(\bar{a}), h(\bar{o}'))}_{1,\ \forall i,\ m_i'=m_i a_i^{sel} o_i^{sel}} V_{\pi_c,\bar{\pi}}(m_c', \bar{m}', s') \Big]
\end{aligned}$$
By construction, each under-braced term evaluates deterministically: the blackboard's memory update enforces $m_c' = m_c \bar{a}^{sel} \bar{o}^{sel}$, each agent's local memory update yields $m_i' = m_i a_i^{sel} o_i^{sel}$, and the distribution over communication sojourn times collapses to one.
Moreover, since $\psi(\bar{a}\mid m_c, \bar{m}, s) = 1$ whenever $\bar{a} = \bar{\pi}(\bar{h})$, we obtain $V_{\pi_c,\bar{\pi}}(m_c, \bar{m}, s) = V_{\bar{\pi},k}(s, \bar{h})$, which results in the standard value recursion:
$$R(s, \bar{\pi}(\bar{h})) + \gamma \sum_{s'\in S} \sum_{\bar{o}'} \Pr(s', \bar{o}'\mid s, \bar{\pi}(\bar{h}))\, V_{\bar{\pi}}(s', \bar{h}')$$
Observe that:
$$\sum_{s'\in S} T(s'\mid s,\bar{a}) \sum_{\bar{o}'\in\bar{O}} O(\bar{o}'\mid s',\bar{a}) = \sum_{s'\in S} \sum_{\bar{o}'} \Pr(s', \bar{o}'\mid s, \bar{\pi}(\bar{h}))$$
∎

Proposition 3. The SDec-POMDP and k-steps delayed communication are equivalent.

A.2. Dec-POMDP-Com

The Dec-POMDP-Com (goldman2003optimizing) extends the Dec-POMDP with explicit communication by including an alphabet of possible messages $\Sigma$ and a communication cost function $C_\Sigma$. For a specified cost, each agent takes a communication action after its control action, which under the instantaneous broadcast communication assumption results in all other agents receiving an additional observation. Unlike the SDec-POMDP, agents in a Dec-POMDP-Com are never entirely restricted from communication: the algorithm designer may centralize the agents in a Dec-POMDP-Com (for a cost) at will.

Lemma 10. SDec-POMDP and Dec-POMDP-Com models are equivalent.

Proof. Demonstrate that (1) $X_{Dec\text{-}POMDP\text{-}Com} \le X_{SDec\text{-}POMDP}$ and (2) $X_{SDec\text{-}POMDP} \le X_{Dec\text{-}POMDP\text{-}Com}$.

1. $X_{Dec\text{-}POMDP\text{-}Com} \le X_{SDec\text{-}POMDP}$. Reference $X_{Dec\text{-}POMDP\text{-}Com} \le_p X_{Dec\text{-}POMDP}$ (seuken2008formal) and $X_{Dec\text{-}POMDP} \le X_{SDec\text{-}POMDP}$ in the proof of Lemma 4.

2. $X_{SDec\text{-}POMDP} \le X_{Dec\text{-}POMDP\text{-}Com}$. Reference $X_{SDec\text{-}POMDP} \le X_{Dec\text{-}POMDP}$ in the proof of Lemma 4 and $X_{Dec\text{-}POMDP} \le_p X_{Dec\text{-}POMDP\text{-}Com}$ (seuken2008formal). ∎

Lemma 11. SDec-POMDP and Dec-POMDP-Com model-policy structures are equivalent.

Proof. Demonstrate that (1) $XY_{Dec\text{-}POMDP\text{-}Com} \le XY_{SDec\text{-}POMDP}$ and (2) $XY_{SDec\text{-}POMDP} \le XY_{Dec\text{-}POMDP\text{-}Com}$.

1.
$XY_{Dec\text{-}POMDP\text{-}Com} \le XY_{SDec\text{-}POMDP}$. Reference $XY_{Dec\text{-}POMDP\text{-}Com} \le_p XY_{Dec\text{-}POMDP}$ (seuken2008formal) and $XY_{Dec\text{-}POMDP} \le XY_{SDec\text{-}POMDP}$ in the proof of Lemma 5.

2. $XY_{SDec\text{-}POMDP} \le XY_{Dec\text{-}POMDP\text{-}Com}$. Reference $XY_{SDec\text{-}POMDP} \le XY_{Dec\text{-}POMDP}$ in the proof of Lemma 5 and $XY_{Dec\text{-}POMDP} \le_p XY_{Dec\text{-}POMDP\text{-}Com}$ (seuken2008formal). ∎

Lemma 12. SDec-POMDP and Dec-POMDP-Com model-policy-objective structures are equivalent.

Proof. Demonstrate that $\forall \bar{h}$: $V_{XY_{Dec\text{-}POMDP\text{-}Com}}(\bar{h}, b_0) = V_{XY_{SDec\text{-}POMDP}}(\bar{h}, b_0)$. Reference $\forall \bar{h}$: $V_{XY_{Dec\text{-}POMDP}}(\bar{h}, b_0) = V_{XY_{SDec\text{-}POMDP}}(\bar{h}, b_0)$ in the proof of Lemma 6. ∎

Proposition 4. The SDec-POMDP and Dec-POMDP-Com are equivalent.

A.3. Semi-Decentralized Benchmarks

A.3.1. SDec-Tiger

Consider the following semi-decentralized variation on the Dec-Tiger benchmark (nair2003taming), depicted in Figure 5. SDec-Tiger has 2 states, 3 actions, and 2 observations. Two cooperative agents stand behind two doors. One door leads to a room containing a tiger while the other leads to a room containing treasure. Each agent has three actions: opening the left door ($OL$), opening the right door ($OR$), and listening ($L$). The reward function is fully described in Table 3. The problem resets when any door is opened; the probabilities that the tiger is behind the left door ($TL$) and behind the right door ($TR$) both become 0.5. Joint actions that do not open doors do not affect the underlying state. Agents have a 75% chance of accurately communicating their action-observation histories if they both listen. After taking an action, each agent receives one of two observations: hear tiger on left ($HL$) or hear tiger on right ($HR$). Listening to either door gives a 0.75 probability of returning the correct observation.
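Under this 0.75-accurate listening model, a single agent's posterior over the tiger's location follows a standard Bayes update; the sketch below is our own illustration (names are not from the paper):

```python
# Bayes update of one agent's belief that the tiger is behind the left door,
# using the 0.75-accurate listening observation model of SDec-Tiger.
P_CORRECT = 0.75

def listen_update(p_tl, heard_left):
    """Posterior P(tiger left) after hearing HL (heard_left=True) or HR."""
    like_tl = P_CORRECT if heard_left else 1 - P_CORRECT  # P(obs | TL)
    like_tr = 1 - P_CORRECT if heard_left else P_CORRECT  # P(obs | TR)
    joint_tl = p_tl * like_tl
    return joint_tl / (joint_tl + (1 - p_tl) * like_tr)
```

Starting from the uniform reset belief, one $HL$ observation moves the belief to 0.75, and a second consistent $HL$ moves it to 0.9, which is why agents typically listen more than once before opening a door.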
Opening a door and resetting the problem results in both agents receiving either observation with 0.5 probability. If one agent opens a door while the other listens, the listening agent will not know the problem has been reset.

Table 3. SDec-Tiger rewards. Each entry gives the team reward under ($TL$, $TR$) for the joint action indexed by the two agents' actions:

        OL              OR              L
  OL    (-50, 20)       (-100, -100)    (-101, 9)
  OR    (-100, -100)    (20, -50)       (9, -101)
  L     (-101, 9)       (9, -101)       (-2, -2)

Figure 5. Illustration of four of nine possible joint actions for SDec-Tiger. Agents communicate their observation histories with some probability when they listen to the same door (in green).

A.3.2. SDec-FireFighting

Consider the following semi-decentralized variation on the FireFighting benchmark (oliehoek2008optimal), depicted in Figure 6. SDec-FireFighting ($n_f = 3$, $n_h = 3$) has 432 states, 3 actions, and 2 observations. Two agents are tasked with addressing a line of $n_h$ houses, each with fire severity $f$ in the range $[0, n_f]$, initially sampled from a uniform distribution. Each agent selects a house to suppress at each time-step. Single-agent suppression decrements $f$ by 1 with probability 1.0 if all adjacent houses have $f = 0$, or with probability 0.6 otherwise. Dual-agent suppression resets $f$ to 0. Agents only communicate their observation histories if they suppress the same house. A burning house without a firefighter present increments its $f$ by 1 with probability 0.8 if an adjacent house has $f > 0$, or with probability 0.4 if all adjacent houses have $f = 0$. A house with $f = 0$ will catch fire (increment $f$ by 1) with probability 0.8 if an adjacent house has $f > 0$. Each agent observes their selected house to be on fire or not with probability 0.2 if $f = 0$, probability 0.5 if $f = 1$, and probability 0.8 if $f \ge 2$. The cooperative agent team is rewarded the sum of $-f$ across all $n_h$ houses following action selection.

Figure 6. Illustration of two of nine joint actions in SDec-FireFighting ($n_h = 3$, $n_f = 4$).
Agents communicate when they suppress the same house, shown in green.

A.3.3. SDec-BoxPushing

Consider the following semi-decentralized variation on the BoxPushing benchmark (seuken2007improved), depicted in Figure 7. SDec-BoxPushing has 100 states, 4 actions, and 5 observations. $n$ agents cooperate to push small and large boxes into an established goal area. Each agent can choose to rotate left, rotate right, move forward, or remain in place. Rotation and movement actions succeed with probability 0.9; otherwise the agent remains in place. Forward movement while facing a box causes the box to translate one unit in the direction of movement, if permissible. A single agent can push a small box, but two agents must act in tandem to push a large box. Movement into a wall, or into a large box by one agent, results in remaining in place. Each agent correctly observes what is in front of it: a wall, a small box, a large box, an empty space, or another agent. Agents share their observation histories when simultaneously occupying one or more established communication grid squares. Agents receive a $-0.1n$ reward after each time-step, a $-5$ reward for each agent that moves into a wall, a $+10$ reward for each small box pushed into the goal area, and a $+100$ reward for each large box pushed into the goal area. The problem resets as soon as any box reaches the goal state. We adopt the environment configuration depicted in Figure 7.

Figure 7. Illustration of the SDec-BoxPushing environment. Agents communicate when they are both in the green square.

A.3.4. SDec-Mars

Consider the following semi-decentralized variation on the Mars benchmark (amato2009achieving), depicted in Figure 8. SDec-Mars has 256 states, 6 actions, and 8 observations. Each of two agents can choose to move north, south, east, or west in a 2x2 grid, or conduct an experiment of choice (drilling or sampling) in their current location.
Two grid squares are intended to be sampled by one agent, and the other two grid squares require that both agents drill simultaneously. Each agent accurately observes its location in the 2x2 grid and whether an experiment has already been performed there. Agents share their observation histories while simultaneously occupying a designated communication grid square. The problem resets once an experiment is performed in all four grid squares. The cooperative agent team receives a large positive reward for drilling a drill site, a small positive reward for sampling a sample site, a large negative reward for drilling a sample site, and a small positive reward for sampling a drill site. Attempting a second experiment on the same site incurs a small negative reward.

Figure 8. Illustration of the SDec-Mars environment. Agents in (a), the survey site beacon scenario, can communicate when co-located or adjacent to the same not-yet-surveyed survey site. Agents in (b), the right band rendezvous scenario, communicate along the right side of the grid when at least one site remains incomplete. Agents in (c), the drill site beacon scenario, can communicate when co-located or adjacent to the same not-yet-drilled drill site.

Figure 9. Illustration of the MaritimeMEDEVAC environment. Agents communicate when they are positioned adjacent to patient pickup and drop-off sites.

A.3.5. MaritimeMEDEVAC

We introduce a new semi-decentralized MEDEVAC benchmark involving a 4×4 gridworld archipelago, depicted in Figure 9. MaritimeMEDEVAC has 512 states, 3 actions, and 2 observations. Two agents, a medical aircraft and a transport ship, must retrieve a patient at (1,1) and deliver them to a hospital at (3,3). At each time-step, each agent selects one of {Wait, Advance, Exchange}.
Advance moves one cell toward the current target (the patient if carry = 0, else the hospital), succeeding independently with probability 0.95 for the aircraft and 0.85 for the boat. Wait leaves the agent's position unchanged. Exchange attempts a joint pickup/drop-off that succeeds with probability 0.95 when both agents are at the corresponding site (toggling carry). Each agent receives a binary observation indicating whether it is at-target (the patient if carry = 0, the hospital if carry = 1) or not. The team incurs -0.3 per step; issuing Exchange away from {(1,1), (3,3)} incurs -1.0; a solo pickup or solo drop-off incurs -6.0; and a joint pickup or drop-off grants +5.0 and +12.0, respectively. Agents share observation histories in a subset of "one-arrived, one-not" states: at the patient when the aircraft is at (1,1) and the boat remains at (1,0) with carry = 0, and at the hospital when one agent is at (3,3) and the other at (3,2) with carry = 1 (both permutations). Agents cannot communicate in any other states.

A.4. Code

Results for semi-decentralized benchmarks may be reproduced at: https://github.com/mahdial-husseini/RSSDA

A.5.
Notation

  τ          sojourn time (assignment of random variable T)
  τ^t        sojourn time corresponding to decision epoch t
  η          natural process time
  η^t        decision epoch t start time within the natural process
  Q          complete state transition distribution
  F          sojourn time distribution
  f          selector for propagating agent memory information
  g          selector for propagating agent action information
  h          selector for propagating agent observation information
  m_ci^sel   memory propagated to the blackboard
  m_i^sel    memory propagated to agent i
  a_i^sel    action information propagated to agent i
  a_c^sel    action information propagated to the blackboard
  o_i^sel    observation information propagated to agent i
  o_c^sel    observation information propagated to the blackboard
  m_c        blackboard memory
  m_i        agent i's memory
  ψ          distribution over action selection
  η          distribution over memory updates (overloaded)
  Z_i^sel    ⟨M_ci^sel, Ā_i^sel, Ō_i^sel⟩
  Z_c^sel    ⟨M̄^sel, Ā^sel, Ō^sel⟩