Paper deep dive
A Recipe for Stable Offline Multi-agent Reinforcement Learning
Dongsu Lee, Daehee Lee, Amy Zhang
Abstract
Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap comes from the instability of non-linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that they induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential.
Tags
Links
- Source: https://arxiv.org/abs/2603.08399v1
- Canonical: https://arxiv.org/abs/2603.08399v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 12:49:50 AM
Summary
The paper identifies and addresses the instability of non-linear value decomposition in offline multi-agent reinforcement learning (MARL). It demonstrates that non-linear mixing networks cause value-scale amplification and unstable optimization due to coupled per-agent value learning and policy extraction. The authors propose Scale-Invariant Value Normalization (SVN) to stabilize actor-critic training without altering the Bellman fixed point, enabling reliable use of non-linear value decomposition in offline settings.
Entities (5)
Relation Signals (3)
Offline Multi-agent Reinforcement Learning → formulatedas → Dec-POMDP
confidence 98% · We formulate the MARL problem as a decentralized partially observable Markov decision process (Dec-POMDP)
Scale-Invariant Value Normalization → stabilizes → Offline Multi-agent Reinforcement Learning
confidence 95% · we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training
Non-linear Value Decomposition → causes → Value-scale amplification
confidence 94% · Our observations confirm that they induce value-scale amplification and unstable optimization.
Cypher Suggestions (2)
Find all techniques used to stabilize offline MARL · confidence 90% · unvalidated
MATCH (t:Technique)-[:STABILIZES]->(f:ResearchField {name: 'Offline Multi-agent Reinforcement Learning'}) RETURN t.name
Identify phenomena caused by non-linear value decomposition · confidence 90% · unvalidated
MATCH (m:Methodology {name: 'Non-linear Value Decomposition'})-[:CAUSES]->(p:Phenomenon) RETURN p.name
Full Text
74,160 characters extracted from source content.
A Recipe for Stable Offline Multi-agent Reinforcement Learning Dongsu Lee 1 , Daehee Lee 2 , Amy Zhang 1† 1 University of Texas at Austin, 2 Sungkyunkwan University † Supervision Despite remarkable achievements in single-agent offline reinforcement learning (RL), multi-agent RL (MARL) has struggled to adopt this paradigm, largely persisting with on-policy training and self-play from scratch. One reason for this gap comes from the instability of non-linear value decomposition, leading prior works to avoid complex mixing networks in favor of linear value decomposition (e.g., VDN) with value regularization used in single-agent setups. In this work, we analyze the source of instability in non-linear value decomposition within the offline MARL setting. Our observations confirm that they induce value-scale amplification and unstable optimization. To alleviate this, we propose a simple technique, scale-invariant value normalization (SVN), that stabilizes actor-critic training without altering the Bellman fixed point. Empirically, we examine the interaction among key components of offline MARL (e.g., value decomposition, value learning, and policy extraction) and derive a practical recipe that unlocks its full potential. Date: March 10, 2026 Correspondence: Dongsu Lee at dongsu.lee@utexas.edu Blog: https://dongsuleetech.github.io/blog/lazy-borrowed-recipe/ 1 Introduction While offline RL has achieved notable success, its extension to multi-agent settings remains relatively underexplored. More importantly, insights that hold in single-agent settings often fail to transfer to MARL. For example, while DDPG+BC (BRAC) (Fujimoto & Gu, 2021) can be even preferable to advantage-weighted regression (AWR) (Peng et al., 2019) as a single-agent policy extraction (Park et al., 2024), we observe that in MARL, even minor deviations can precipitate severe performance degradation (Figure 1). 
This instability highlights a key challenge in multi-agent systems: even a minor deviation in individual agent actions can cascade into a complete breakdown of coordination. Despite these structural challenges, most existing studies have merely extended single-agent value regularization techniques to multi-agent settings (Wang et al., 2023; Shao et al., 2023) with linear value decomposition (Li et al., 2025a; Lee et al., 2025b) or centralization (Wang et al., 2023). Although effective in some cases, centralized critics raise scalability concerns, and linear value decomposition can struggle to capture complex coordination structures. That is, these approaches achieve limited short-term success but offer little insight into the deeper challenges of offline MARL. This naturally raises a central question motivating this research: Where does the bottleneck in offline MARL come from, and how should we design algorithms to explicitly address it?
Figure 1 Revisiting offline RL insights in MARL. (Left) The convex hull denotes the dataset action support, and dots represent actions sampled from the learned policy. BRAC exhibits mode-seeking behavior that extends beyond the dataset support, while AWR remains mode-covering and strictly in-distribution. (Right) Although such mode-seeking would be helpful in single-agent RL, even small out-of-distribution actions induced by BRAC lead to severe performance degradation in MARL, highlighting the sensitivity of joint behavior to individual policy deviations. These results are based on TD learning and hold regardless of the value decomposition method (centralized, VDN, and decentralized).
In this study, we move beyond simple value regularization with linear value decomposition. Specifically, we first focus on non-linear value decomposition (Rashid et al., 2020b) in offline MARL. Then, we empirically investigate the interplay between value decomposition, policy extraction, and value learning to unlock the
potential of offline MARL. Our key discoveries and contributions are as follows:
Analyses. In our pathological observations, the bottleneck of the non-linear method manifests itself as a coupled instability between value learning and policy extraction. Unlike linear decomposition, mixing networks structurally couple per-agent approximation errors through their Jacobian. This coupling breaks the contractivity of the global TD operator and turns value updates expansive rather than contractive. As a result, joint Q-values can grow exponentially even on expert datasets. This value-scale amplification further propagates to policy extraction, where actor gradients become dominated by the absolute magnitude of the value function rather than relative advantages. This dominance leads to a poorly calibrated loss and unstable updates. Together, these effects form a feedback loop that destabilizes learning under non-linear value decomposition.
Solutions. To alleviate this issue, we propose a simple yet effective normalization technique that stabilizes non-linear value decomposition without altering the Bellman fixed point. Our method renders both critic and actor updates scale-invariant, thereby directly addressing the amplification mechanism induced by the non-linear value decomposition while preserving theoretical correctness. This normalization restores stable and well-scaled optimization dynamics for actor-critic training with non-linear value decomposition, enabling it to be used reliably in the offline setting for the first time.
Experiments. Our empirical results further offer clear guidance for designing offline MARL algorithms. We find that performance is far more sensitive to value decomposition and policy extraction than to value learning methods. In particular, non-linear value decomposition and mode-covering policy extraction yield stable and strong performance.
Moreover, we demonstrate that non-linear value decomposition equipped with our solution is practical across both continuous and discrete control, and remains stable when transitioning from offline to online. These findings highlight value decomposition as both a fundamental bottleneck and a promising lever to advance offline MARL.
2 Related work
Over the past few years, data-driven approaches have fundamentally reshaped control systems, enabling learning-based methods to tackle real-world problems across several domains, including autonomous driving (Liu et al., 2023; Lee et al., 2024; Lee & Kwon, 2024, 2025a) and robotics (Black et al., 2024; Collaboration et al., 2024). A major driver of this progress has been the development of off-policy RL algorithms, which train value functions via temporal difference (TD) learning while leveraging static operational logs (Levine et al., 2020). Compared to single-agent RL, offline MARL faces a more severe out-of-distribution challenge. A small deviation in an individual policy can induce joint behaviors that are absent from the dataset. This can lead to unseen coordination patterns even when each agent's action is individually plausible. To prevent such a failure mode, prior works have largely followed the success of single-agent RL in terms of value regularization and policy extraction, such as conservatism (Pan et al., 2022; Shao et al., 2023; Eldeeb et al., 2024), in-sample maximization (Wang et al., 2023), action support constraints (Yang et al., 2021; Jiang & Lu, 2021), distribution matching (Zhu et al., 2024; Li et al., 2025a,c), convex duality (Matsunaga et al., 2023), and density weighting (Lee et al., 2025b). In practice, these approaches rely on behavior-regularized policy gradients (e.g., BRAC (Fujimoto & Gu, 2021; Tarasov et al., 2023)) or AWR (Peng et al., 2019) for policy extraction. However, the role of value decomposition remains underexplored.
Value decomposition in prior work is limited to simple linear forms (e.g., VDN (Sunehag et al., 2017)) or a fully centralized critic. Although these choices improve stability, they also limit expressivity or introduce scalability challenges as the number of agents grows. More expressive non-linear mixing architectures are often avoided due to long-observed instability (Shen et al., 2022; Liu et al., 2025). Rather than proposing another algorithmic variant, this work aims to diagnose why non-linear value decomposition becomes unstable. We trace this instability to the structural coupling between per-agent value learning and policy extraction that amplifies joint out-of-distribution errors. Based on this analysis, we introduce a simple normalization method that preserves the Bellman fixed point and enables practical non-linear value decomposition. Finally, we study how standard policy extraction methods interact with value learning objectives and decomposition strategies, and distill effective design principles for offline MARL.
3 Background
Problem formulation. We formulate the MARL problem as a decentralized partially observable Markov decision process (Dec-POMDP) (Oliehoek et al., 2016) $\mathcal{M} = \langle \mathcal{A}, \mathcal{S}, \mathcal{U}^a, \mathcal{O}^a, P, \Omega, R, \gamma \rangle$. Each agent, identified by $a \in \mathcal{A} \equiv \{1, \dots, A\}$, takes an action $u^a \in \mathcal{U}^a$ at a given local observation $o^a \in \mathcal{O}^a$. The joint action is denoted by $u = (u^1, \dots, u^A) \in \mathcal{U} = \times_a \mathcal{U}^a$, and the environment evolves according to the transition probability $P(s' \mid s, u)$, where $s, s' \in \mathcal{S}$ represent the global states. After the transition, each agent receives an individual observation $o^a$ sampled from the joint observation function $\Omega(o \mid s', u)$, where $o = (o^1, \dots, o^A) \in \mathcal{O} = \times_a \mathcal{O}^a$. Each agent receives an individual reward $r^a(s, u^a, u^{-a}) \in R$, where $u^{-a}$ denotes the actions of all agents other than agent $a$; that is, the team reward $R(s, u) = \frac{1}{A} \sum_{a=1}^{A} r^a(s, u^a, u^{-a})$ is defined as the aggregation of individual rewards.
Each agent acts according to a local policy $\pi^a(u^a \mid \tau^a)$, conditioned on its individual trajectory $\tau^a = (o^a_0, u^a_0, \dots, o^a_t)$. Our objective is to identify and address the challenges of offline MARL. We aim to learn a set of policies $\pi = \{\pi^1, \dots, \pi^A\}$ that maximizes the expected discounted return $\mathbb{E}_{\tau \sim p_\pi(\tau)}\left[\sum_{t=0}^{H} \gamma^t R(s_t, u_t)\right]$ for all $\tau$ from the dataset $\mathcal{D}$ collected from a behavioral policy $\mu$, where $\gamma \in [0, 1)$ is the discount factor. The objective of offline MARL is therefore to infer a performant set of decentralized policies $\pi$ while maintaining coordinated behavior across agents under partial observability. This formulation naturally supports centralized training with decentralized execution (CTDE) (Zhang et al., 2021; Gronauer & Diepold, 2022).
Value decomposition via the mixing network. To enable CTDE, we adopt value decomposition through a mixing network (Rashid et al., 2020b; Son et al., 2019). Each agent maintains an individual utility function $Q^a(\tau^a, u^a)$. The global action-value function is represented as follows:

$$Q_{tot}(s, u) = f_{mix}(Q^1, \dots, Q^A; s), \quad (1)$$

where $f_{mix}$ is a differentiable function parameterized by a hypernetwork conditioned on the global state $s$. To ensure consistency between local and global optima, we impose the monotonicity constraint $\partial Q_{tot} / \partial Q^a \geq 0$, which guarantees that the joint greedy action can be obtained by independent maximization over each $Q^a$ (Rashid et al., 2020b). The network parameters are optimized by minimizing the TD loss,

$$\mathbb{E}_{(s, u, r, s') \sim \mathcal{D},\, u' \sim \pi}\left[\left(Q_{tot}(s, u) - r - \gamma \bar{Q}_{tot}(s', u')\right)^2\right], \quad (2)$$

allowing the mixing network to learn non-linear joint interactions among agents while preserving decentralized policies.
Multi-agent actor-critic framework. Building on the decomposed Q-function, we formulate an off-policy actor-critic scheme that factorizes the joint policy through the same mixing structure used for value decomposition (Wu et al., 2019).
The global critic $Q_{tot}(s, u)$ aggregates per-agent utilities by Equation (1) while maintaining differentiability with respect to each agent's action. Each actor $\pi^a$ is trained via a behavioral-regularized actor-critic objective as follows:

$$\mathbb{E}_{s \sim \mathcal{D}}\Big[-Q^a(o^a, u^a) + \alpha \underbrace{h\big(\pi^a(u^a \mid o^a), \mu^a(u^a \mid o^a)\big)}_{\text{behavioral regularization}}\Big], \quad (3)$$

where $\alpha$ is a weight coefficient and $h(\cdot, \cdot)$ represents the function that captures the divergence between $\pi^a$ and $\mu^a$. Its gradients are back-propagated from the global critic to each local actor via $f_{mix}$. This structure implicitly decomposes the policy optimization objective, aligning decentralized policies with the joint value landscape estimated by the centralized critic (Sunehag et al., 2017; Wang et al., 2021; Peng et al., 2021). The resulting formulation enables consistent off-policy updates and coordinated learning without requiring access to global information during execution.
Notational warning: All policies and critics are parameterized: $\pi_\theta = \{\pi_{\theta_1}, \dots, \pi_{\theta_A}\}$ and $Q^{tot}_\phi = \{Q_{\phi_1}, \dots, Q_{\phi_A}, f^{mix}_\phi\}$. For simplicity, we omit explicit dependence on $\theta$ and $\phi$ in the main text, i.e., writing $\pi$, $Q_{tot}$, and $f_{mix}$, whenever no ambiguity arises.
4 Divergent dynamics of mixer optimization
Off-policy value learning in the actor-critic algorithm is tricky because each Bellman update regresses toward a target that may itself be biased by approximation error or extrapolation. This difficulty is further compounded by non-linear coupling among per-agent critics through the mixer $f_{mix}$, which aggregates local utilities into a single joint value. In this section, we first illustrate the necessity of non-linear value decomposition $f_{mix}$ through a didactic example (Sec. 4.1). Then, we investigate the origin of instability within the mixer (Sec. 4.2 and 4.3) and motivate a scale-invariant regularization to restore stability (Sec. 4.4). All analyses in the following subsections are conducted on the 2ant task with the Expert dataset, except Sec. 4.1.
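To make the monotonicity constraint on $f_{mix}$ concrete, here is a minimal single-layer (and therefore effectively linear) sketch in the spirit of Equation (1); the paper's mixer is deeper, with weights produced by a state-conditioned hypernetwork and a non-linearity between layers, so `monotonic_mix` and its inputs are illustrative assumptions, not the authors' implementation:

```python
def monotonic_mix(per_agent_q, raw_weights, bias):
    """Single-layer sketch of a monotonic mixer (Equation (1)).

    Taking the absolute value of the raw hypernetwork outputs enforces
    non-negative mixing weights, one simple way to satisfy the
    monotonicity constraint dQ_tot/dQ_a >= 0, so independent per-agent
    maximization still recovers the joint greedy action.
    """
    assert len(per_agent_q) == len(raw_weights)
    return sum(abs(w) * q for w, q in zip(raw_weights, per_agent_q)) + bias

# Monotonicity check: raising any single agent's utility never lowers Q_tot.
q_utils = [1.0, -0.5, 2.0]
raw_w = [0.3, -1.2, 0.7]   # raw (possibly negative) hypernetwork outputs
q_tot = monotonic_mix(q_utils, raw_w, bias=0.1)
q_tot_bumped = monotonic_mix([q_utils[0] + 1.0, q_utils[1], q_utils[2]], raw_w, bias=0.1)
assert q_tot_bumped >= q_tot
```

In the full architecture the weights and bias would be functions of the global state $s$, which is exactly the state-conditioned coupling whose Jacobian the following subsections analyze.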
4.1 Didactic example: The necessity of non-linear value decomposition for MARL
To demonstrate why non-linear value decomposition is necessary, we show a simple didactic example where the linear method (VDN), i.e., $Q_{tot}(s, u) = \sum_{a=1}^{A} Q^a(\tau^a, u^a)$, fails. We adopt the two-step game where agents must choose between a safe sub-optimal state and a risky optimal state (Xu et al., 2023; Rashid et al., 2020a). For training an offline policy, we augment the dataset to ensure an equal distribution of all possible patterns.
Figure 2 Two-step matrix game with offline dataset. (Left) The schematic of the didactic example. In $s_1$, Agent A selects between a safe state $s_{2-1}$ with a fixed suboptimal reward of 7 and a risky state $s_{2-2}$ with an optimal reward of 8. (Right) The learned joint Q-value matrices for each state. The top and bottom rows display the linear method (VDN) and the Mixer. Each cell reports the mean Q value and two standard deviations across five random seeds.
Figure 2 shows that VDN fails to represent the non-monotonic payoff structure of the risky state, resulting in an underestimated value and a convergence to the suboptimal safe policy. On the other hand, the Mixer can identify the global optimum, and this observation underscores that non-linear expressivity is a structural necessity for solving complex coordination tasks.
4.2 Problem I: Coupled value updates break the contractivity of TD learning
Figure 3 Divergent dynamics of mixer-based critics. Comparison among the monotonic Mixer, VDN, and individual critics under expert offline data. The mixer induces co-amplification of the Q value (Left) and critic loss (Right), indicating a structural instability of the TD operator.
Pathological observation. Primarily, we begin with a controlled comparison between three variants of value decomposition: (i) individual critics, (ii) the linear value decomposition (VDN), and (iii) the non-linear value decomposition (Mixer). Using an offline dataset of expert demonstrations, we eliminate other components (e.g., exploration and replay stochasticity) so that learning dynamics reflect only the intrinsic behavior of the value updates. As shown in Figure 3, both the individual critics and VDN remain numerically stable throughout training: their total value estimates and TD losses converge smoothly to bounded levels. In contrast, the mixer-based critic exhibits structural divergence: the magnitude of the Q value grows exponentially, and the critic loss co-amplifies by several orders of magnitude. This divergence persists across random seeds, indicating that it arises from the structural coupling of value updates.
Analytical formulations. Define $Q_{tot}(s, u)$ as in Equation (1) with Jacobian $J_s = \partial f_s / \partial Q$, where $f_s \triangleq f_{mix}(s, \cdot)$ denotes the state-conditioned mixing function. Linearizing $f_{mix}$ around the current estimate yields the following:

$$Q_{tot} \leftarrow Q_{tot} - 2\alpha_Q (I - \gamma J_s)\big(Q_{tot} - \bar{Q}_{tot}\big).$$
If $\gamma \|J_s\|_{op} > 1$, then $\rho(I - 2\alpha_Q(I - \gamma J_s)) > 1$, and the map becomes expansive, where $\|\cdot\|_{op}$ is the operator norm $\max_{\|x\|=1} \|J_s x\|$ and $\rho(M)$ is the spectral radius $\max_i |\lambda_i(M)|$ (here $x$ and $M$ are a placeholder vector and matrix, and $\lambda_i(\cdot)$ denotes the $i$-th eigenvalue of a matrix). That is, both $|Q_{tot}|$ and $\mathcal{L}_{TD}$ (Equation (2)) grow geometrically. With an actor, the closed-loop gain $g_{loop} = \gamma \|J_s\|_{op} \left\|\frac{\partial \pi}{\partial Q_{tot}}\right\|$ introduces positive feedback; when $g_{loop} > 1$, joint divergence occurs.
Remark 1: The mixer's Jacobian couples per-agent errors into a non-contractive TD operator. Actor updates amplify this coupling, converting TD updates from damping to amplifying.
4.3 Problem II: Loss miscalibration under value-scale amplification
In addition to the instability discussed in Sec. 4.2, value-scale amplification in $Q_{tot}$ leads to further learning issues. A drift in the critic's scale can miscalibrate the policy gradient, causing its magnitude to depend on the absolute value scale instead of action quality. This section analyzes how this value-scale amplification propagates to the actor and leads to ill-conditioned updates.
The first issue arises from poorly scaled critic updates. In $\nabla_\phi \mathcal{L}_{TD} = 2\mathbb{E}[(Q_{tot} - y) \nabla_\phi Q_{tot}]$, if $Q_{tot}$ is globally rescaled by $c > 1$, both the residual and the Jacobian scale with $c$, and the effective Hessian scales as $c^2$. Moreover, this leads to misaligned actor gradients. The policy gradient magnitude depends on the value amplitude, i.e., $|\nabla_\theta \mathcal{L}_{actor}| \approx \mathbb{E}_s[|Q_{tot}(s, u)| \, |\nabla_\theta \log \pi_\theta(u \mid s)|]$. Scale drift in $Q_{tot}$ increases action and target variances, leading to a positive feedback loop through the critic loss.
Figure 4 Actor loss miscalibration under value-scale amplification. (Left) Actor loss increases sharply as value-scale drift begins, indicating that the policy objective is dominated by value amplitude rather than advantage structure. (Right) The total gradient norm.
This reveals ill-conditioned updates and confirms that the coupled actor and mixer-critic system loses numerical stability.
Empirical results. Figure 4 visualizes the empirical evidence for Problem II. The Q loss of actor updates decreases smoothly, but its scale rises by several orders of magnitude. Simultaneously, the total gradient norm shows exponential growth, revealing that the joint optimization becomes ill-conditioned even before explicit divergence. (The total gradient norm is computed over all parameters of the actor, critic, and mixing networks.) The synchronous rise of actor loss and gradient norm demonstrates that the TD target has become a mis-calibrated supervision signal whose magnitude dominates its semantics.
Remark 2: Value-scale drift miscalibrates the learning signal of the actor network, destabilizing the actor-critic update cycle through amplified gradients. This remark motivates the scale-invariant normalization method to stabilize the actor update when we consider value learning with non-linear value decomposition.
Simple remedy. We aim to prevent the actor loss from co-amplifying with the Q value, without modifying the TD objective. The idea is simple: normalize the actor-side Q-maximization term by its own batch magnitude, making the policy gradient invariant to global rescaling of $Q_{tot}$. This effectively reconditions the actor's gradient magnitude while preserving its preference ordering. We simply modify the Q maximization of Equation (3) as follows,

$$-\mathbb{E}\left[\frac{Q_{tot}(s, u_\pi) - \mathbb{E}[Q_{tot}(s, u_\pi)]}{\mathbb{E}\,|Q_{tot}(s, u_\pi)|}\right]. \quad (4)$$

Figure 5 Effect of the simple actor-side remedy. These two signals together demonstrate suppression of value-scale amplification without modifying the TD objective. Here, $Q_{tot}(s, u_\pi)$ denotes the critic's estimate under the current policy.
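The Q-maximization term of Equation (4) can be sketched for a flat batch of scalar Q estimates as follows; `normalized_actor_objective` is a hypothetical helper (including the small epsilon guard), not the authors' code:

```python
def normalized_actor_objective(q_values):
    """Actor-side Q term of Equation (4): center by the batch mean and
    divide by the batch mean absolute value, so the resulting scores
    (whose negated mean the actor minimizes) are invariant to a global
    rescaling of Q_tot."""
    n = len(q_values)
    mean_q = sum(q_values) / n
    mean_abs_q = sum(abs(q) for q in q_values) / n
    eps = 1e-8  # assumed guard against a degenerate all-zero batch
    return [(q - mean_q) / (mean_abs_q + eps) for q in q_values]

batch = [2.0, -1.0, 4.0, 0.5]
scaled = [100.0 * q for q in batch]  # simulate value-scale amplification

a = normalized_actor_objective(batch)
b = normalized_actor_objective(scaled)
# Scale invariance: amplifying Q_tot by 100x leaves the actor signal unchanged.
assert all(abs(x - y) < 1e-6 for x, y in zip(a, b))
# Preference ordering is preserved: higher Q still gets a higher score.
assert sorted(range(4), key=lambda i: a[i]) == sorted(range(4), key=lambda i: batch[i])
```

The two assertions mirror the two claims in the text: the gradient signal no longer tracks the absolute value scale, while the ranking of actions (and hence the policy's preferences) is untouched.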
By removing the mean and dividing by the absolute mean, this operation eliminates scale drift while preserving the action preference ordering, yielding advantage-normalized and scale-invariant actor updates. Empirically, Figure 5 validates our diagnosis. Specifically, using Equation (4) suppresses the amplification of the Q value in the actor loss. The gradient norm is bounded, and the mean of $Q_{tot}$ also decreases as expected from scale normalization. These results confirm that value-scale drift is the main source of the coupled instability. Although such a remedy stabilizes the actor side, the TD objective remains scale sensitive, motivating a fully scale-invariant critic formulation.
4.4 Scale-invariant value normalization (SVN)
We now extend the invariance principle to the critic itself, ensuring that actor-critic updates become scale-invariant while preserving the theoretical foundations of TD learning.
Desiderata. The Bellman equation defines a fixed point under the TD operator $\mathcal{T}Q(s, u) = r + \gamma \bar{Q}(s', \pi(s'))$; that is, it minimizes $\mathbb{E}[(Q_{tot} - \mathcal{T}Q_{tot})^2]$. Any normalization that rescales the entire loss or prediction uniformly must therefore not change the $\arg\min$; otherwise, the Bellman fixed point would shift, violating TD. Our goal is thus to recondition the critic updates without modifying this fixed point.
SVN. We compute detached statistics of the total value for each training batch:

$$\mu_Q = \mathrm{sg}[\mathbb{E}(Q_{tot})], \qquad \sigma_Q = \mathrm{sg}[\mathrm{MAD}(Q_{tot})] + \varepsilon, \quad (5)$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient operator and $\mathrm{MAD}(x) = \mathbb{E}[|x - \mathbb{E}(x)|]$ is the mean absolute deviation. We then define the normalized projections

$$\hat{Q} = \frac{Q_{tot} - \mu_Q}{\sigma_Q}, \qquad \hat{y} = \frac{y - \mu_Q}{\sigma_Q},$$

and minimize the normalized TD loss as follows,

$$\tilde{\mathcal{L}}_{TD} = \mathbb{E}\big[(\hat{Q} - \hat{y})^2\big] = \frac{1}{\sigma_Q^2} \mathbb{E}\big[(Q_{tot} - y)^2\big]. \quad (6)$$

Since $(\mu_Q, \sigma_Q)$ are treated as constants with respect to gradients, the optimization objective satisfies $\arg\min_\phi \tilde{\mathcal{L}}_{TD} = \arg\min_\phi \mathcal{L}_{TD}$, and thus preserves the same Bellman fixed point.
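Equations (5)–(6) can be sketched for a flat batch of scalars as below. This is a pure-Python illustration under assumed names: in a real implementation $\mu_Q$ and $\sigma_Q$ must be detached from the autodiff graph, which is exactly what the $\mathrm{sg}[\cdot]$ operator denotes; here there are no gradients, so detaching is implicit.

```python
def svn_td_loss(q_pred, td_target):
    """Scale-invariant value normalization (Equations (5)-(6)) for a flat
    batch. The batch mean and mean-absolute deviation act as constants, so
    the normalized loss equals (1 / sigma_Q^2) times the plain TD loss and
    shares its minimizer: the Bellman fixed point is unchanged."""
    n = len(q_pred)
    mu = sum(q_pred) / n                                   # sg[E(Q_tot)]
    sigma = sum(abs(q - mu) for q in q_pred) / n + 1e-8    # sg[MAD] + eps
    q_hat = [(q - mu) / sigma for q in q_pred]
    y_hat = [(y - mu) / sigma for y in td_target]
    return sum((a - b) ** 2 for a, b in zip(q_hat, y_hat)) / n

def plain_td_loss(q_pred, td_target):
    return sum((q - y) ** 2 for q, y in zip(q_pred, td_target)) / len(q_pred)

q = [10.0, -4.0, 7.0, 1.0]
y = [9.0, -2.0, 8.0, 0.0]
mu = sum(q) / len(q)
sigma = sum(abs(v - mu) for v in q) / len(q) + 1e-8
# Identity from Equation (6): normalized loss == plain loss / sigma_Q^2.
assert abs(svn_td_loss(q, y) - plain_td_loss(q, y) / sigma ** 2) < 1e-9
```

The assertion checks the key identity: because both prediction and target share the same shift and scale, normalization only rescales the loss (and its gradients) by $1/\sigma_Q^2$ without moving the minimizer.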
In effect, our remedy $\tilde{\mathcal{L}}_{TD}$ only rescales the gradient magnitude by the batch-dependent constant $1/\sigma_Q^2$, improving the numerical conditioning of updates without altering the underlying TD solution. To analyze its impact, we linearize the mixing function $f_{mix}(Q^1, \dots, Q^A)$ with Jacobian $J_s$. The effective critic Jacobian becomes $J^{eff}_s = \frac{1}{\sigma_Q} J_s$, which directly reduces the closed-loop gain between actor and critic:

$$\tilde{g}_{loop} = \frac{\gamma}{\sigma_Q} \|J_s\|_{op} \left\|\frac{\partial \pi}{\partial Q_{tot}}\right\|. \quad (7)$$

Hence, normalization can attenuate the amplification identified in Section 4.2. By dividing out the global value scale, our remedy restores the contractive behavior of the TD operator while keeping the Bellman fixed point intact.
Remark 3: SVN retains theoretical correctness and stabilizes the actor-critic-mixer coupling without modifying the Bellman fixed point.
Figure 6 SVN's effectiveness. (Left) Un-normalized Q value, to fairly compare between each solution. (Right) Average performance curve across 10 evaluations × 8 random seeds. The horizontal shaded area represents the reward distribution of dataset D.
Empirical results. Figure 6 evaluates the monotonic mixer with SVN and baselines to demonstrate its effectiveness. The unnormalized Q values show that the actor-only normalization partially mitigates scale drift but still allows slow amplification, whereas SVN completely stabilizes the value scale throughout training. Correspondingly, Figure 6 (Right) confirms that SVN maintains stable learning dynamics and achieves performance comparable to the reward distribution of the expert dataset. These results confirm that addressing the critic side of the value-scale coupling is essential for achieving fully stable actor-critic optimization.
5 A practical recipe for offline MARL
We now broaden our scope from the theoretical analysis to an empirical analysis of the structural components of offline MARL as a whole.
While Section 4 established a theoretical foundation for stability, the practical imperative is to ensure these gains translate into robust optimization dynamics. We ask: how do different design choices interact to shape the final performance across value decomposition, value learning, and policy extraction? To this end, we empirically study offline MARL through three modules to identify which design choices most strongly affect final performance.
5.1 Analysis setup
This subsection introduces the objectives for each module and the environments and datasets we study in our analysis.
Value decomposition. We consider four value decomposition strategies: full centralization (Cen) (Lyu et al., 2023), non-linear value decomposition (Mix) (Rashid et al., 2020b), linear value decomposition (VDN) (Sunehag et al., 2017), and full decentralization (Dec) (Wang et al., 2022). (1) Cen: A fully centralized critic represents the joint Q-function as a single critic network conditioned on the full global state and all agents' actions, $Q_{tot}(s, u) = Q(s, u)$. This captures all inter-agent dependencies directly. (2) Mix: It simply follows Equation (1) with Equation (6). (3) VDN: It assumes an additive structure, i.e., $Q_{tot}(s, u) = \sum_{a=1}^{A} Q^a(\tau^a, u^a)$. This is a simple linear decomposition, removing inter-agent coupling in gradients.
3 We report unnormalized Q to ensure a fair comparison between with and without SVN.
Figure 7 Best performance according to design choices in offline MARL over four continuous control tasks (MA-MuJoCo 2ant, 3hopper, and 6halfcheetah, and MPE simple spread). Each bar plot reports the best normalized return over 8 seeds for different combinations of value decomposition, value learning, and policy extraction methods, aggregated across datasets and the α hyperparameter for policy extraction. The bars colored dark orange and apricot indicate the best and runner-up performance.
The top and right panels marginalize over policy extraction and value learning.

(4) Dec: The fully decentralized variant does not use a global value function $Q_{tot}(s,u)$, instead building a set of individual actors and critics.

Value learning. We consider three objectives that are widely used in offline RL settings: TD, SARSA (Sutton et al., 1998), and implicit Q-learning (IQL) (Kostrikov et al., 2022).4 (1) TD: This follows from Equation (2). (2) SARSA: It is similar to the TD method, but we remove policy sampling for the target calculation:

$$\min_{Q_{tot}} \; \mathbb{E}_{(s,u,r,s',u') \sim \mathcal{D}} \Big[ \big( Q_{tot}(s,u) - r - \gamma \bar{Q}_{tot}(s',u') \big)^2 \Big] \tag{8}$$

(3) IQL: Unlike TD estimation, this implicitly emphasizes high-return actions within the behavior dataset via asymmetric regression. Therefore, it enables a behavior-constrained approximation to the Bellman optimal argmax:

$$\min_{Q_{tot}} \; \mathbb{E}_{(s,u,r,s') \sim \mathcal{D}} \Big[ \big( Q_{tot}(s,u) - r - \gamma V_{tot}(s') \big)^2 \Big] \tag{9}$$

$$\min_{V_{tot}} \; \mathbb{E}_{(s,u,r) \sim \mathcal{D}} \big[ \ell^2_\tau \big( Q_{tot}(s,u) - V_{tot}(s) \big) \big] \tag{10}$$

$\ell^2_\tau$ denotes the expectile regression loss, defined as an asymmetric squared loss with expectile hyperparameter τ.

Policy extraction. We consider two mainstream policy extraction methods in offline RL: BRAC (Fujimoto & Gu, 2021) and AWR (Peng et al., 2019).5

4 We do not consider explicit value regularization for pessimism, as such mechanisms do not naturally extend to online learning.
5 When extracting a policy independently of the value decomposition and learning method, we adopt a simple remedy described in Equation (4). Other methods appear relatively stable in terms of value learning, but they sometimes exhibit significant instability during the policy extraction process.

(1) BRAC: This couples Q-value maximization with behavioral regularization, thereby preventing the learned policy from deviating far from the action distribution supported by the behavior policy or offline dataset. It follows Equation (3), and we set $h(\cdot,\cdot)$ as the simplest regularization, the BC loss minimization $\alpha \log \pi(u^a \mid o^a)$. (2) AWR: This optimizes a weighted maximum-likelihood objective, increasing the likelihood of actions proportional to their estimated advantages $Q(o,u) - V(o)$, as follows.
$$\max_{\pi^a} \; \mathbb{E}_{(o^a,u^a) \sim \mathcal{D}} \Big[ e^{\alpha \left( Q^a(o^a,u^a) - V^a(o^a) \right)} \log \pi^a(u^a \mid o^a) \Big] \tag{11}$$

Environments and datasets. This work focuses on actor-critic algorithms for multi-agent systems. Therefore, we evaluate on continuous action domains, e.g., MA-MuJoCo (Peng et al., 2021) and MPE (Lowe et al., 2017), as the main set of experiments. We use datasets collected from offline MARL benchmarks (Formanek et al., 2023, 2024).

5.2 Best practices for offline MARL

Figure 7 provides a comparison across value decomposition, value learning, and policy extraction methods. It reports the best normalized return achieved for each configuration over four datasets, aggregated from 16,384 independent runs (8 seeds; Appendix C.3). Overall, the results reveal that performance differences are dominated by value decomposition and policy extraction, while the impact of value learning is comparatively minor.

For value decomposition, Mix consistently dominates the design space, achieving the best or runner-up performance in 17 out of 24 configurations. While fully centralized critics (Cen) can be competitive in certain configurations, their performance is less consistent across design choices. Next, VDN exhibits a clear performance ceiling due to its restrictive additive structure. In contrast, Mix enables expressive modeling of inter-agent interactions. It preserves decentralized action selection and yields more consistent performance across design choices.

Figure 8 Learned Q under different value learning methods. Each point represents a dataset action sample, colored by its Q value: dark blue and apricot indicate lower and higher Q values. The x and y axes are the coordinates of action dimensions 0 and 1. The top and bottom rows correspond to agents 1 and 2.

For value learning objectives, we observe that the objectives that avoid policy-sampled target estimation (i.e., SARSA and IQL) tend to be slightly more favorable than TD in the offline setting.
This is because SARSA and IQL can provide more conservative target estimates. However, the performance differences among these objectives are modest and do not constitute a dominant factor compared to value decomposition and policy extraction. This observation is supported by Figure 8. TD, SARSA, and IQL exhibit highly similar value estimation behavior on in-distribution samples. The learned Q-value distributions largely overlap in action space, with comparable separation between low- and high-value regions, indicating that all three methods capture similar relative preferences over dataset actions. Notably, TD exhibits slightly reduced contrast in Q-values, with fewer sharply distinguished high-value points, suggesting weaker relative value separation. However, this difference does not translate into substantial performance gaps once value decomposition and policy extraction are fixed, reinforcing the conclusion that value learning itself is not the primary bottleneck in offline MARL.

For policy extraction, AWR yields more stable and reliable results than BRAC across value decomposition and value learning choices. In our observation, while BRAC occasionally attains strong performance on expert datasets, it frequently suffers from sharp degradation, likely due to its mode-seeking behavior inducing out-of-distribution joint actions. In contrast, AWR's mode-covering nature better preserves coordinated behavior, particularly when paired with the non-linear value decomposition (Mix).

6 Discussion and further analysis

Figure 9 Performance comparison on discrete control. We evaluate four tasks from SMACv1 (Good and Medium datasets for 3m and 2s3z) and two tasks from SMACv2 (Replay datasets for terran_5_vs_5 and zerg_5_vs_5).
The error bars show the minimum and maximum performance range.

Do These Ideas Work on Discrete Control? Beyond continuous control, we examine whether the effectiveness of Mix extends to discrete control, i.e., SMACv1 (Samvelyan et al., 2019) and SMACv2 (Ellis et al., 2023). As shown in Figure 9, Mix consistently outperforms the other value decomposition methods in discrete settings. Specifically, while all methods achieve comparable performance on SMACv1, Mix demonstrates superior performance on SMACv2. This suggests that Mix is particularly effective in environments characterized by high stochasticity.

Figure 10 Compatibility of Mix. We integrate Mix into two offline MARL algorithms, MAC-Flow and OMIGA, and compare their performance against the original baselines on SMACv1 tasks.

Can the Mixer Be Incorporated into Prior Algorithms? To verify the practicality of non-linear value decomposition (Mix), we integrate it with recent algorithms, MAC-Flow (Lee et al., 2025b) and OMIGA (Wang et al., 2023), and check whether this integration can further boost their performance. On a good-quality dataset, replacing the value decomposition with Mix maintains the existing superior performance without degradation. More importantly, on a suboptimal (medium) dataset, integrating Mix can enhance the baselines' performance.

Figure 11 Performance in offline-to-online MARL on 2ant, 3hopper, and 6halfcheetah. Online fine-tuning starts at 0.5 gradient steps, normalized to the [0,1] scale.

How does Mix improve with additional interaction data? Figure 11 shows that BRAC frequently benefits from online interaction but also occasionally suffers from performance degradation. On the other hand, AWR largely preserves its offline performance after online fine-tuning. This observation suggests that the effect of online fine-tuning in MARL is sensitive to the underlying policy mode rather than uniformly beneficial.
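The contrast between AWR's mode-covering and BRAC's mode-seeking behavior can be made concrete: AWR never optimizes actions outside the dataset; it only reweights dataset actions by their exponentiated advantage, as in Equation (11). A minimal NumPy sketch of these weights follows; the clipping constant is our own numerical-safety assumption, not part of the paper's formulation.

```python
import numpy as np

def awr_weights(q, v, alpha=1.0, clip=100.0):
    """Advantage-weighted regression weights from Equation (11), sketched.

    Each dataset action receives weight exp(alpha * (Q - V)), so the
    policy's log-likelihood on high-advantage actions is boosted while
    the policy stays supported on the behavior data (mode-covering).
    Clipping guards against overflow for large advantages (assumption).
    """
    advantage = q - v
    return np.minimum(np.exp(alpha * advantage), clip)

# Actions with zero advantage keep weight 1; better actions get more.
weights = awr_weights(np.array([1.0, 2.0]), np.array([1.0, 1.0]))
```

In a full implementation these weights multiply the behavior-cloning log-likelihood term for each agent's policy, whereas BRAC instead backpropagates through the critic and regularizes toward the data.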
Overall, we highlight the need for online fine-tuning strategies specifically designed for MARL, beyond direct adaptations of offline MARL methods.

Final Remark: Our empirical study demonstrates that stabilized non-linear value decomposition is effective across continuous and discrete control. Additionally, we find that mode-covering policy extraction is important for preserving coordination patterns in offline MARL.

7 Call to action: Towards practical and scalable offline MARL

In this work, we empirically demonstrated that the keys to offline MARL are how the policy is extracted to preserve coordination patterns and how the global value is constructed from individual value functions. Additionally, this work analyzed the instability of non-linear value decomposition and provided simple but powerful remedies. This departs from the existing trend in offline MARL, which heavily focuses on extending the value regularization of single-agent offline RL. Overall, by providing both a diagnostic understanding and a practical recipe, this work repositions non-linear methods from a fragile component to a foundational building block for scalable and practically deployable offline MARL.

Although this is a promising start, further research is needed to build upon these findings. First, our approach relies on simple normalization methods to stabilize the value scale arising from the adoption of non-linear value decomposition. This implicitly assumes that such scale control is sufficient, leaving open the need to develop non-linear value decomposition itself more directly, including alternative architectural designs and stabilization principles beyond normalization. Second, while our study adopts the same action discretization method across all combinations, the optimal one may depend on the specific combination of value decomposition, value learning, and policy extraction.
Since actor-critic structures are primarily designed for continuous control, the impact of different discretization choices in offline MARL warrants more systematic investigation. Finally, our empirical setup largely depends on the classical MARL testbeds, i.e., MA-MuJoCo, SMAC, and MPE, which treat team rewards as individual ones in dense-reward setups. This results in relatively weak non-linear coupling among agents and makes it hard to study scaling behavior with substantially larger datasets. Such shortcomings serve as a springboard for fruitful research trajectories in offline MARL:

• How should we completely stabilize the hypernetwork for non-linear value decomposition (e.g., potentially with a dueling mechanism (Sutton et al., 1998; Wang et al., 2016, 2021), attention structure (Vaswani et al., 2017; Yang et al., 2020), and factored graph (Guestrin et al., 2001, 2002; Böhmer et al., 2020; Kang et al., 2022))?

• Can offline MARL benchmarks and datasets be expanded to better capture diverse coordination structures, such as goal-conditioned tasks (Feng et al., 2025; Skrynnik et al., 2024; Lee et al., 2025a), skill-based coordination (Liu et al., 2022; Omari et al., 2025; Chen et al., 2022), and social or mixed cooperative-competitive settings (Guo et al., 2025; Ruhdorfer et al., 2024; Gessler et al., 2025), beyond current dense team-reward testbeds?

• Are the scalability limits of offline MARL fundamental, or can principled designs enable reliable offline-to-online MARL (or scaling up offline datasets) (Park et al., 2024; Lee et al., 2022; Sun et al., 2020)?

Impact statement

This work advances offline multi-agent reinforcement learning by improving the stability of value decomposition and multi-agent policy extraction, and contributes to the broader field of machine learning.

References

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
22 Philip J Ball, Laura Smith, Ilya Kostrikov, and Sergey Levine. Efficient online reinforcement learning with offline data. In ICML, p. 1577–1594, 2023. 23 Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π 0 : A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024. 3 Wendelin Böhmer, Vitaly Kurin, and Shimon Whiteson. Deep coordination graphs. In International Conference on Machine Learning, p. 980–991. PMLR, 2020. 13, 21 Jiayu Chen, Jingdi Chen, Tian Lan, and Vaneet Aggarwal. Scalable multi-agent covering option discovery based on kronecker graphs. Advances in Neural Information Processing Systems, 35:30406–30418, 2022. 13 Embodiment Collaboration, Abby O’Neill, Abdul Rehman, Abhinav Gupta, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, Albert Tung, Alex Bewley, Alex Herzog, Alex Irpan, Alexander Khazatsky, Anant Rai, Anchit Gupta, Andrew Wang, Andrey Kolobov, Anikait Singh, Animesh Garg, Aniruddha Kembhavi, Annie Xie, Anthony Brohan, Antonin Raffin, Archit Sharma, Arefeh Yavary, Arhan Jain, Ashwin Balakrishna, Ayzaan Wahid, Ben Burgess-Limerick, Beomjoon Kim, Bernhard Schölkopf, Blake Wulfe, Brian Ichter, Cewu Lu, Charles Xu, Charlotte Le, Chelsea Finn, Chen Wang, Chenfeng Xu, Cheng Chi, Chenguang Huang, Christine Chan, Christopher Agia, Chuer Pan, Chuyuan Fu, Coline Devin, Danfei Xu, Daniel Morton, Danny Driess, Daphne Chen, Deepak Pathak, Dhruv Shah, Dieter Büchler, Dinesh Jayaraman, Dmitry Kalashnikov, Dorsa Sadigh, Edward Johns, Ethan Foster, Fangchen Liu, Federico Ceola, Fei Xia, Feiyu Zhao, Felipe Vieira Frujeri, Freek Stulp, Gaoyue Zhou, Gaurav S. 
Sukhatme, Gautam Salhotra, Ge Yan, Gilbert Feng, Giulio Schiavi, Glen Berseth, Gregory Kahn, Guangwen Yang, Guanzhi Wang, Hao Su, Hao-Shu Fang, Haochen Shi, Henghui Bao, Heni Ben Amor, Henrik I Christensen, Hiroki Furuta, Homanga Bharadhwaj, Homer Walke, Hongjie Fang, Huy Ha, Igor Mordatch, Ilija Radosavovic, Isabel Leal, Jacky Liang, Jad Abou-Chakra, Jaehyung Kim, Jaimyn Drake, Jan Peters, Jan Schneider, Jasmine Hsu, Jay Vakil, Jeannette Bohg, Jeffrey Bingham, Jeffrey Wu, Jensen Gao, Jiaheng Hu, Jiajun Wu, Jialin Wu, Jiankai Sun, Jianlan Luo, Jiayuan Gu, Jie Tan, Jihoon Oh, Jimmy Wu, Jingpei Lu, Jingyun Yang, Jitendra Malik, João Silvério, Joey Hejna, Jonathan Booher, Jonathan Tompson, Jonathan 13 Yang, Jordi Salvador, Joseph J. Lim, Junhyek Han, Kaiyuan Wang, Kanishka Rao, Karl Pertsch, Karol Hausman, Keegan Go, Keerthana Gopalakrishnan, Ken Goldberg, Kendra Byrne, Kenneth Oslund, Kento Kawaharazuka, Kevin Black, Kevin Lin, Kevin Zhang, Kiana Ehsani, Kiran Lekkala, Kirsty Ellis, Krishan Rana, Krishnan Srinivasan, Kuan Fang, Kunal Pratap Singh, Kuo-Hao Zeng, Kyle Hatch, Kyle Hsu, Laurent Itti, Lawrence Yunliang Chen, Lerrel Pinto, Li Fei-Fei, Liam Tan, Linxi "Jim" Fan, Lionel Ott, Lisa Lee, Luca Weihs, Magnum Chen, Marion Lepert, Marius Memmel, Masayoshi Tomizuka, Masha Itkina, Mateo Guaman Castro, Max Spero, Maximilian Du, Michael Ahn, Michael C. 
Yip, Mingtong Zhang, Mingyu Ding, Minho Heo, Mohan Kumar Srirama, Mohit Sharma, Moo Jin Kim, Muhammad Zubair Irshad, Naoaki Kanazawa, Nicklas Hansen, Nicolas Heess, Nikhil J Joshi, Niko Suenderhauf, Ning Liu, Norman Di Palo, Nur Muhammad Mahi Shafiullah, Oier Mees, Oliver Kroemer, Osbert Bastani, Pannag R Sanketi, Patrick "Tree" Miller, Patrick Yin, Paul Wohlhart, Peng Xu, Peter David Fagan, Peter Mitrano, Pierre Sermanet, Pieter Abbeel, Priya Sundaresan, Qiuyu Chen, Quan Vuong, Rafael Rafailov, Ran Tian, Ria Doshi, Roberto Martín-Martín, Rohan Baijal, Rosario Scalise, Rose Hendrix, Roy Lin, Runjia Qian, Ruohan Zhang, Russell Mendonca, Rutav Shah, Ryan Hoque, Ryan Julian, Samuel Bustamante, Sean Kirmani, Sergey Levine, Shan Lin, Sherry Moore, Shikhar Bahl, Shivin Dass, Shubham Sonawani, Shubham Tulsiani, Shuran Song, Sichun Xu, Siddhant Haldar, Siddharth Karamcheti, Simeon Adebola, Simon Guist, Soroush Nasiriany, Stefan Schaal, Stefan Welker, Stephen Tian, Subramanian Ramamoorthy, Sudeep Dasari, Suneel Belkhale, Sungjae Park, Suraj Nair, Suvir Mirchandani, Takayuki Osa, Tanmay Gupta, Tatsuya Harada, Tatsuya Matsushima, Ted Xiao, Thomas Kollar, Tianhe Yu, Tianli Ding, Todor Davchev, Tony Z. Zhao, Travis Armstrong, Trevor Darrell, Trinity Chung, Vidhi Jain, Vikash Kumar, Vincent Vanhoucke, Vitor Guizilini, Wei Zhan, Wenxuan Zhou, Wolfram Burgard, Xi Chen, Xiangyu Chen, Xiaolong Wang, Xinghao Zhu, Xinyang Geng, Xiyuan Liu, Xu Liangwei, Xuanlin Li, Yansong Pang, Yao Lu, Yecheng Jason Ma, Yejin Kim, Yevgen Chebotar, Yifan Zhou, Yifeng Zhu, Yilin Wu, Ying Xu, Yixuan Wang, Yonatan Bisk, Yongqiang Dou, Yoonyoung Cho, Youngwoon Lee, Yuchen Cui, Yue Cao, Yueh-Hua Wu, Yujin Tang, Yuke Zhu, Yunchu Zhang, Yunfan Jiang, Yunshuang Li, Yunzhu Li, Yusuke Iwasawa, Yutaka Matsuo, Zehan Ma, Zhuo Xu, Zichen Jeff Cui, Zichen Zhang, Zipeng Fu, and Zipeng Lin. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration. 
In IEEE International Conference on Robotics and Automation, p. 6892–6903, 2024. 3 Shifei Ding, Xiaomin Dong, Jian Zhang, Lili Guo, Wei Du, and Chenglong Zhang. Multi-agent policy gradients with dynamic weighted value decomposition. Pattern Recognition, 164:111576, 2025. 20 Eslam Eldeeb, Houssem Sifaou, Osvaldo Simeone, Mohammad Shehab, and Hirley Alves. Conservative and risk-aware offline multi-agent reinforcement learning. IEEE Transactions on Cognitive Communications and Networking, 2024. 3 Benjamin Ellis, Jonathan Cook, Skander Moalla, Mikayel Samvelyan, Mingfei Sun, Anuj Mahajan, Jakob Foerster, and Shimon Whiteson. Smacv2: An improved benchmark for cooperative multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36:37567–37593, 2023. 11 Chanin Eom, Dongsu Lee, and Minhae Kwon. Selective imitation for efficient online reinforcement learning with pre-collected data. ICT Express, 10(6):1308–1314, 2024. 23 Meng Feng, Viraj Parimi, and Brian Williams. Safe multi-agent navigation guided by goal-conditioned safe reinforcement learning. arXiv preprint arXiv:2502.17813, 2025. 13 Jakob Foerster, Nantas Nardelli, Gregory Farquhar, Triantafyllos Afouras, Philip HS Torr, Pushmeet Kohli, and Shimon Whiteson. Stabilising experience replay for deep multi-agent reinforcement learning. In International conference on machine learning, p. 1146–1155. PMLR, 2017. 20 Claude Formanek, Asad Jeewa, Jonathan Shock, and Arnu Pretorius. Off-the-grid marl: Datasets with baselines for offline multi-agent reinforcement learning. arXiv preprint arXiv:2302.00521, 2023. 10, 22 Juan Formanek, Callum R Tilbury, Louise Beyers, Jonathan Shock, and Arnu Pretorius. Dispelling the mirage of progress in offline marl through standardised baselines and evaluation. Advances in Neural Information Processing Systems, 37:139650–139672, 2024. 10, 22 Scott Fujimoto and Shixiang Shane Gu. A minimalist approach to offline reinforcement learning. 
Advances in neural information processing systems, 34:20132–20145, 2021. 1, 3, 9 Tobias Gessler, Tin Dizdarevic, Ani Calinescu, Benjamin Ellis, Andrei Lupu, and Jakob Nicolaus Foerster. Overcookedv2: Rethinking overcooked for zero-shot coordination. arXiv preprint arXiv:2503.17821, 2025. 13 14 Sven Gronauer and Klaus Diepold. Multi-agent deep reinforcement learning: a survey. Artificial Intelligence Review, 55(2):895–943, 2022. 3 Carlos Guestrin, Daphne Koller, and Ronald Parr. Multiagent planning with factored mdps. Advances in neural information processing systems, 14, 2001. 13 Carlos Guestrin, Michail Lagoudakis, and Ronald Parr. Coordinated reinforcement learning. In ICML, volume 2, p. 227–234, 2002. 13 Zihao Guo, Shuqing Shi, Richard Willis, Tristan Tomilin, Joel Z Leibo, and Yali Du. Socialjax: An evaluation suite for multi-agent reinforcement learning in sequential social dilemmas. arXiv preprint arXiv:2503.14576, 2025. 13 Jian Hu, Siyang Jiang, Seth Austin Harding, Haibin Wu, and Shih-wei Liao. Rethinking the implementation tricks and monotonicity constraint in cooperative multi-agent reinforcement learning. arXiv preprint arXiv:2102.03479, 2021. 20 Mengda Ji, Genjiu Xu, and Liying Wang. Cora: Coalitional rational advantage decomposition for multi-agent policy gradients. arXiv preprint arXiv:2506.04265, 2025. 20 Jiechuan Jiang and Zongqing Lu. Offline decentralized multi-agent reinforcement learning. arXiv preprint arXiv:2108.01832, 2021. 3 Yipeng Kang, Tonghan Wang, Qianlan Yang, Xiaoran Wu, and Chongjie Zhang. Non-linear coordination graphs. Advances in neural information processing systems, 35:25655–25666, 2022. 13 Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. Interna- tional Conference on Learning Representations, 2022. 9 Dongsu Lee and Minhae Kwon. Episodic future thinking mechanism for multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 37:11570–11601, 2024. 
3 Dongsu Lee and Minhae Kwon. Episodic future thinking with offline reinforcement learning for autonomous driving. IEEE Internet of Things Journal, 2025a. 3 Dongsu Lee and Minhae Kwon. Scenario-free autonomous driving with multi-task offline-to-online reinforcement learning. IEEE Transactions on Intelligent Transportation Systems, 2025b. 23 Dongsu Lee, Chanin Eom, and Minhae Kwon. Ad4rl: Autonomous driving benchmarks for offline reinforcement learning with value-based dataset. In IEEE International Conference on Robotics and Automation, p. 8239–8245, 2024. 3 Dongsu Lee, Daehee Lee, Yaru Niu, Honguk Woo, Amy Zhang, and Ding Zhao. Learning to interact in world latent for team coordination. arXiv preprint arXiv:2509.25550, 2025a. 13 Dongsu Lee, Daehee Lee, and Amy Zhang. Multi-agent coordination via flow matching. arXiv preprint arXiv:2511.05005, 2025b. 1, 3, 12, 22 Seunghyun Lee, Younggyo Seo, Kimin Lee, Pieter Abbeel, and Jinwoo Shin. Offline-to-online reinforcement learning via balanced replay and pessimistic q-ensemble. In Conference on Robot Learning, p. 1702–1712. PMLR, 2022. 13 Sergey Levine, Aviral Kumar, George Tucker, and Justin Fu. Offline reinforcement learning: Tutorial, review, and perspectives on open problems. arXiv preprint arXiv:2005.01643, 2020. 3 Chao Li, Ziwei Deng, Chenxing Lin, Wenqi Chen, Yongquan Fu, Weiquan Liu, Chenglu Wen, Cheng Wang, and Siqi Shen. Dof: A diffusion factorization framework for offline multi-agent reinforcement learning. In The Thirteenth International Conference on Learning Representations, 2025a. 1, 3 Yueheng Li, Guangming Xie, and Zongqing Lu. Revisiting cooperative off-policy multi-agent reinforcement learning. In International Conference on Machine Learning, 2025b. 20 Zhuoran Li, Xun Wang, Hai Zhong, and Longbo Huang. Om2p: Offline multi-agent mean-flow policy. arXiv preprint arXiv:2508.06269, 2025c. 3 Yuntao Liu, Yuan Li, Xinhai Xu, Yong Dou, and Donghong Liu. Heterogeneous skill learning for multi-agent tasks. 
Advances in neural information processing systems, 35:37011–37023, 2022. 13 15 Zongkai Liu, Qian Lin, Chao Yu, Xiawei Wu, Yile Liang, Donghui Li, and Xuetao Ding. Offline multi-agent reinforcement learning via in-sample sequential policy optimization. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, p. 19068–19076, 2025. 3 Zuxin Liu, Zijian Guo, Haohong Lin, Yihang Yao, Jiacheng Zhu, Zhepeng Cen, Hanjiang Hu, Wenhao Yu, Tingnan Zhang, Jie Tan, et al. Datasets and benchmarks for offline safe reinforcement learning. arXiv preprint arXiv:2306.09303, 2023. 3 Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments. Advances in neural information processing systems, 2017. 10 Xueguang Lyu, Yuchen Xiao, Brett Daley, and Christopher Amato. Contrasting centralized and decentralized critics in multi-agent reinforcement learning. arXiv preprint arXiv:2102.04402, 2021. 20 Xueguang Lyu, Andrea Baisero, Yuchen Xiao, Brett Daley, and Christopher Amato. On centralized critics in multi-agent reinforcement learning. Journal of Artificial Intelligence Research, 77:295–354, 2023. 8, 20 Daiki E Matsunaga, Jongmin Lee, Jaeseok Yoon, Stefanos Leonardos, Pieter Abbeel, and Kee-Eung Kim. Alberdice: addressing out-of-distribution joint actions in offline multi-agent rl via alternating stationary distribution correction estimation. Advances in Neural Information Processing Systems, 36:72648–72678, 2023. 3 Frans A Oliehoek, Christopher Amato, et al. A concise introduction to decentralized POMDPs, volume 1. Springer, 2016. 3 Bassel Al Omari, Michael Matthews, Alexander Rutherford, and Jakob Nicolaus Foerster. Multi-agent craftax: Benchmarking open-ended multi-agent reinforcement learning at the hyperscale. arXiv preprint arXiv:2511.04904, 2025. 13 Ling Pan, Longbo Huang, Tengyu Ma, and Huazhe Xu. 
Plan better amid conservatism: Offline multi-agent reinforcement learning with actor rectification. In International conference on machine learning, 2022. 3 Seohong Park, Kevin Frans, Sergey Levine, and Aviral Kumar. Is value learning really the main bottleneck in offline rl? Advances in Neural Information Processing Systems, 37:79029–79056, 2024. 1, 13, 20 Bei Peng, Tabish Rashid, Christian Schroeder de Witt, Pierre-Alexandre Kamienny, Philip Torr, Wendelin Böhmer, and Shimon Whiteson. Facmac: Factored multi-agent centralised policy gradients. Advances in Neural Information Processing Systems, 34:12208–12221, 2021. 4, 10, 20, 21 Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. arXiv preprint arXiv:1910.00177, 2019. 1, 3, 9, 20 Boris Polyak and Anatoli Juditsky. Acceleration of stochastic approximation by averaging. SIAM Journal on Control and Optimization, 30(4):838–855, 1992. 22 Haoyuan Qin, Chennan Ma, Deng Deng, Zhengzhu Liu, Songzhu Mei, Xinwang Liu, Cheng Wang, and Siqi Shen. The dormant neuron phenomenon in multi-agent reinforcement learning value factorization. Advances in Neural Information Processing Systems, 37:35727–35759, 2024. 22 Tabish Rashid, Gregory Farquhar, Bei Peng, and Shimon Whiteson. Weighted qmix: Expanding monotonic value function factorisation for deep multi-agent reinforcement learning. Advances in neural information processing systems, 33:10199–10210, 2020a. 4, 20, 21 Tabish Rashid, Mikayel Samvelyan, Christian Schroeder De Witt, Gregory Farquhar, Jakob Foerster, and Shimon Whiteson. Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research, 21(178):1–51, 2020b. 1, 4, 8, 20, 21, 22 Stephane Ross and J Andrew Bagnell. Agnostic system identification for model-based reinforcement learning. arXiv preprint arXiv:1203.1007, 2012. 
23 Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, and Andreas Bulling. The overcooked generalisation challenge. arXiv preprint arXiv:2406.17949, 2024. 13 Mikayel Samvelyan, Tabish Rashid, Christian Schroeder De Witt, Gregory Farquhar, Nantas Nardelli, Tim GJ Rudner, Chia-Man Hung, Philip HS Torr, Jakob Foerster, and Shimon Whiteson. The starcraft multi-agent challenge. arXiv preprint arXiv:1902.04043, 2019. 11 16 Jianzhun Shao, Yun Qu, Chen Chen, Hongchang Zhang, and Xiangyang Ji. Counterfactual conservative q learning for offline multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 36: 77290–77312, 2023. 1, 3 Siqi Shen, Mengwei Qiu, Jun Liu, Weiquan Liu, Yongquan Fu, Xinwang Liu, and Cheng Wang. Resq: A residual q function-based approach for multi-agent reinforcement learning value factorization. Advances in Neural Information Processing Systems, 35:5471–5483, 2022. 3 Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, and Aleksandr Panov. Pogema: A benchmark platform for cooperative multi-agent pathfinding. arXiv preprint arXiv:2407.14931, 2024. 13 Kyunghwan Son, Daewoo Kim, Wan Ju Kang, David Earl Hostallero, and Yung Yi. Qtran: Learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International conference on machine learning, p. 5887–5896. PMLR, 2019. 4, 20, 21 Jianyu Su, Stephen Adams, and Peter Beling. Value-decomposition multi-agent actor-critics. In Proceedings of the AAAI conference on artificial intelligence, volume 35, p. 11352–11360, 2021. 20 Chuangchuang Sun, Macheng Shen, and Jonathan P How. Scaling up multiagent reinforcement learning for robotic systems: Learn an adaptive sparse communication graph. In 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS), p. 11755–11762. IEEE, 2020. 
Peter Sunehag, Guy Lever, Audrunas Gruslys, Wojciech Marian Czarnecki, Vinicius Zambaldi, Max Jaderberg, Marc Lanctot, Nicolas Sonnerat, Joel Z Leibo, Karl Tuyls, et al. Value-decomposition networks for cooperative multi-agent learning. arXiv preprint arXiv:1706.05296, 2017.

Richard S Sutton, Andrew G Barto, et al. Reinforcement learning: An introduction, volume 1. MIT Press, Cambridge, 1998.

Denis Tarasov, Vladislav Kurenkov, Alexander Nikulin, and Sergey Kolesnikov. Revisiting the minimalist approach to offline reinforcement learning. Advances in Neural Information Processing Systems, 36:11592–11620, 2023.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 30, 2016.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Jianhao Wang, Zhizhou Ren, Terry Liu, Yang Yu, and Chongjie Zhang. QPLEX: Duplex dueling multi-agent Q-learning. In International Conference on Learning Representations, 2021.

Li Wang, Yupeng Zhang, Yujing Hu, Weixun Wang, Chongjie Zhang, Yang Gao, Jianye Hao, Tangjie Lv, and Changjie Fan. Individual reward assisted multi-agent reinforcement learning. In International Conference on Machine Learning, pp. 23417–23432. PMLR, 2022.

Tonghan Wang, Heng Dong, Victor Lesser, and Chongjie Zhang. ROMA: Multi-agent reinforcement learning with emergent roles. arXiv preprint arXiv:2003.08039, 2020.

Xiangsen Wang, Haoran Xu, Yinan Zheng, and Xianyuan Zhan. Offline multi-agent reinforcement learning with implicit global-to-local value regularization. Advances in Neural Information Processing Systems, 36:52413–52429, 2023.

Ziyu Wang, Tom Schaul, Matteo Hessel, Hado Hasselt, Marc Lanctot, and Nando Freitas. Dueling network architectures for deep reinforcement learning. In International Conference on Machine Learning, pp. 1995–2003. PMLR, 2016.

Yifan Wu, George Tucker, and Ofir Nachum. Behavior regularized offline reinforcement learning. arXiv preprint arXiv:1911.11361, 2019.

Zhiwei Xu, Bin Zhang, Guangchong Zhou, Zeren Zhang, Guoliang Fan, et al. Dual self-awareness value decomposition framework without individual global max for cooperative MARL. Advances in Neural Information Processing Systems, 36:73898–73918, 2023.

Yaodong Yang, Jianye Hao, Ben Liao, Kun Shao, Guangyong Chen, Wulong Liu, and Hongyao Tang. Qatten: A general framework for cooperative multiagent reinforcement learning. arXiv preprint arXiv:2002.03939, 2020.

Yiqin Yang, Xiaoteng Ma, Chenghao Li, Zewu Zheng, Qiyuan Zhang, Gao Huang, Jun Yang, and Qianchuan Zhao. Believe what you see: Implicit constraint approach for offline multi-agent reinforcement learning. Advances in Neural Information Processing Systems, 2021.

Kaiqing Zhang, Zhuoran Yang, and Tamer Başar. Multi-agent reinforcement learning: A selective overview of theories and algorithms. Handbook of Reinforcement Learning and Control, pp. 321–384, 2021.

Zhengbang Zhu, Minghuan Liu, Liyuan Mao, Bingyi Kang, Minkai Xu, Yong Yu, Stefano Ermon, and Weinan Zhang. MADiff: Offline multi-agent learning with diffusion models. Advances in Neural Information Processing Systems, 37:4177–4206, 2024.

Appendix

Contents

A Miscellaneous
  A.1 Summary of notations
B Extended related works
  B.1 Actor critic with mixing network
  B.2 Non-linear value decomposition
C Experimental Details
  C.1 Two-step coordination game
  C.2 Normalized score for continuous control benchmarks
  C.3 The number of experiments for Figure 7
D Implementation details

A Miscellaneous

A.1 Summary of notations

Dec-POMDP elements

| Notation | Description |
| --- | --- |
| 𝒜 | set of agents |
| a | agent index |
| A | number of agents |
| γ ∈ [0, 1) | discount factor |
| S | global state space |
| s | global state |
| O^a | observation space of agent a |
| o^a | local observation of agent a |
| U^a | action space of agent a |
| u^a | action of agent a |
| P | state transition function |
| Ω | observation function |
| r^a | reward function of agent a |
| R | team reward |
| τ^a | trajectory of agent a |
| D | offline dataset (replay buffer) |

Algorithm elements

| Notation | Description |
| --- | --- |
| Q_tot | global state-action value function |
| Q^a | state-action value function of agent a |
| f_mix | mixing function |
| h(·,·) | regularization function |
| π | offline policy |
| μ | behavioral policy |

RL training

| Notation | Description |
| --- | --- |
| θ | policy parameters |
| θ̄ | target critic parameters |
| φ | critic parameters |
| φ_mix | mixing network parameters |

B Extended related works

B.1 Actor critic with mixing network

Mixing networks were initially formulated for value-based discrete control, where Q-functions implicitly define policies. Recent research has adapted these architectures to actor-critic frameworks to focus on structured credit assignment (Su et al., 2021; Peng et al., 2021; Hu et al., 2021; Ding et al., 2025; Ji et al., 2025). This integration proves essential for centralized training when decoupling policy optimization from value estimation. By aggregating agent-wise utilities into a global objective, mixing-based critics preserve the coordination benefits of factorization. This paradigm effectively combines scalable value decomposition with the flexibility of independent policy learning.
Compared to fully centralized critics that directly condition on the joint observation and joint action space, actor-critic methods with mixing networks offer improved scalability as the number of agents increases. Fully centralized value functions suffer from exponential growth in input dimensionality and often become impractical in environments with many agents or high-dimensional observations (Li et al., 2025b; Lyu et al., 2023, 2021; Foerster et al., 2017). In contrast, mixing-based critics decompose value estimation into per-agent components that are aggregated through a structured mixing function, enabling parameter sharing and modularity while retaining access to global state information during training. However, existing methods with non-linear value decomposition have largely been studied in the online setting. Much less is understood about how non-linear value decomposition behaves in offline MARL. In particular, it remains unclear how the current RL recipe with value decomposition methods (Park et al., 2024; Wu et al., 2019; Peng et al., 2019; Wang et al., 2021; Son et al., 2019; Rashid et al., 2020b,a) interacts with OOD issues and coordination errors under offline training. Therefore, in this work, we investigate how non-linear value decomposition can be effectively incorporated into offline MARL.

B.2 Non-linear value decomposition

The core challenge in MARL is credit assignment, which has motivated value decomposition methods that structure global value estimation into agent-wise components. Early work, e.g., VDN (Sunehag et al., 2017), assumes a linear additive structure over agent-wise value functions. This enables scalable learning but limits the expressivity of the global value function. QMIX (Rashid et al., 2020b,a) extends this formulation by introducing a state-conditioned mixing network, allowing non-linear yet monotonic aggregation of individual Q-values and improving its expressivity.
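In symbols, the two factorizations can be contrasted as follows (standard formulations from the cited works, written in the notation of Appendix A.1):

```latex
% VDN: additive value decomposition (Sunehag et al., 2017)
Q_{tot}(s, \mathbf{u}) = \sum_{a \in \mathcal{A}} Q^{a}(o^{a}, u^{a})

% QMIX: non-linear but monotonic mixing (Rashid et al., 2020b)
Q_{tot}(s, \mathbf{u}) = f_{mix}\big(Q^{1}(o^{1}, u^{1}), \ldots, Q^{A}(o^{A}, u^{A});\, s\big),
\qquad \frac{\partial Q_{tot}}{\partial Q^{a}} \ge 0 \;\; \forall a
```

The monotonicity constraint is what lets QMIX recover the joint greedy action from per-agent argmaxes while still modeling non-additive interactions.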
QTRAN (Son et al., 2019) and other variants (Wang et al., 2021; Yang et al., 2020; Böhmer et al., 2020; Peng et al., 2021; Wang et al., 2020), e.g., graph-based critics and attention mechanisms, further relax structural constraints to model more general non-linear interactions among agents. Despite this progress, existing non-linear value decomposition methods have been exclusively studied in online settings. In offline MARL, the behavior of non-linear mixing-based critics remains largely unexplored; to the best of our knowledge, there has been no systematic investigation of non-linear value decomposition in an offline setting. This gap motivates our study, which examines how mixing network-based critics can be incorporated and analyzed in offline multi-agent settings, and what design choices are necessary to retain their scalability and coordination benefits under offline constraints.

C Experimental Details

C.1 Two-step coordination game

We consider a simple two-step cooperative matrix game designed to highlight the limitations of linear value decomposition and the necessity of a non-linear mixing network.

Game description. The MDP of this game consists of two agents, three states, and binary actions, and unfolds over two timesteps. At the first timestep, the environment is in an initial state s_1, where Agent A selects the next state (Agent B chooses an action, but it does not affect the state transition). In particular, Agent A chooses between two options: (i) a safe state s_{2-1}, which leads to a deterministic but suboptimal outcome, and (ii) a risky state s_{2-2}, which enables a higher optimal reward but requires coordinated behavior. The reward matrices of the two second-step states over joint actions are:

R(s_{2-1}) = [[7, 7], [7, 7]],   R(s_{2-2}) = [[0, 1], [1, 8]].

Why does linear value decomposition fail? VDN assumes that the joint action-value function is a linear sum of individual agent values.
This assumption is violated in the risky state s_{2-2}, where the optimal reward arises only from a specific coordinated joint action. Because the benefit of coordination cannot be attributed additively to individual agents, VDN underestimates the value of the risky state. As a result, the agent selecting the state transition prefers the safe but suboptimal state s_{2-1}, leading to a suboptimal joint policy.

C.2 Normalized score for continuous control benchmarks

To enable consistent comparison across different continuous control benchmarks, we report normalized scores using min-max normalization. For each task, the normalized score is computed as

Normalized Score = (J(Π) − scale_min) / (scale_max − scale_min),

where J(Π) denotes the average episode return of the evaluated set of policies, and scale_min and scale_max define the task-specific minimum and maximum reference returns. For the MA-MuJoCo benchmarks, scale_max is defined as the maximum return achieved by the Expert trajectories (the best-quality dataset), while scale_min corresponds to the minimum return observed in the Medium-Replay dataset (the lowest-quality dataset). Specifically, we use the following values:

- 2ant: scale_min = 895.37, scale_max = 2124.15
- 3hopper: scale_min = 70.75, scale_max = 3762.68
- 6halfcheetah: scale_min = −198.76, scale_max = 3866.08

For the MPE benchmark, we follow previously reported evaluation protocols and adopt the reference values reported in prior work (Lee et al., 2025b; Formanek et al., 2023, 2024). In particular, for the Spread task, we use scale_min = 159.8 and scale_max = 516.8.

C.3 The number of experiments for Figure 7

For Figure 7, we run 16,384 independent runs with hyperparameter sweeps.
Specifically, this corresponds to sweeping over the Cartesian product of hyperparameters for each learning algorithm: for TD and SARSA, we consider 4 value decomposition methods, 2 policy extraction methods, 4 policy extraction temperatures α, 4 tasks, 4 datasets per task, and 8 random seeds, resulting in 4,096 runs each; for IQL, we additionally sweep over 2 expectile loss coefficients τ, yielding 8,192 runs. In total, this amounts to 16,384 runs.

D Implementation details

Git repository. Our codebase is available at https://github.com/DongsuLeeTech/offline-marl-recipe

Actor-critic network architecture. Both the actor and critic are parameterized by multi-layer perceptrons (MLPs) with four hidden layers of size [512, 512, 512, 512]. The actor outputs continuous values, while the critic is implemented as a double Q-network with two ensemble heads (Van Hasselt et al., 2016). Layer normalization (Ba et al., 2016) is applied to all critic layers. Target networks are maintained for both the critic and the mixing network, and are updated using Polyak averaging (Polyak & Juditsky, 1992) with coefficient τ.

Action discretization. For discrete control domains, actions are represented internally as one-hot vectors derived from the continuous outputs of the actor network (Wu et al., 2019). The actor predicts logits over the discrete action space, and during evaluation, actions are selected via argmax over the actor outputs.

Mixing network. This is a hypernetwork-based architecture that generates state-dependent mixing weights (Rashid et al., 2020b). In our implementation, the agent-wise Q-values are first embedded into a 32-dimensional latent space, and the hypernetwork uses a hidden dimension of 128 to produce the mixing coefficients from the global state. The mixing network operates on per-agent Q estimates from all individual critics and is trained jointly with the critic.
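As a rough illustration of how such a hypernetwork-based mixer is shaped, the following is a minimal numpy sketch (not the paper's implementation; the weights are random placeholders, only the 32-dimensional embedding and 128-unit hypernetwork hidden size follow the description above, and absolute values on the generated coefficients enforce monotonic mixing as in QMIX):

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, state_dim, embed_dim, hyper_hidden = 3, 16, 32, 128

# Placeholder hypernetwork weights: global state -> mixing coefficients.
W1 = rng.standard_normal((state_dim, hyper_hidden)) * 0.05
W2 = rng.standard_normal((hyper_hidden, n_agents * embed_dim)) * 0.05
v = rng.standard_normal(embed_dim) * 0.05  # final projection to a scalar Q_tot

def mix(q_agents, state):
    """Aggregate per-agent Q-values into Q_tot with state-dependent weights."""
    h = np.maximum(state @ W1, 0.0)                  # hypernet hidden layer (ReLU)
    w = np.abs(h @ W2).reshape(n_agents, embed_dim)  # non-negative => monotonic
    embedded = q_agents @ w                          # 32-dim latent mixture
    return float(embedded @ np.abs(v))               # scalar total Q-value

state = rng.standard_normal(state_dim)
q_agents = np.array([0.2, -0.1, 0.4])
q_tot = mix(q_agents, state)
```

Because every coefficient applied to the per-agent Q-values is non-negative, increasing any individual Q^a can never decrease Q_tot, which is the monotonicity property the mixing network relies on.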
A separate target mixing network is maintained and used for computing TD targets to improve training stability (Qin et al., 2024).

Scale-invariant value normalization. We apply a scale-invariant value normalization scheme when training the critic with a mixing network. Specifically, the TD loss is computed using normalized Q-values, where both current and target Q-values are centered by the mean and scaled by the mean absolute deviation of the current Q estimates. The normalization statistics are detached from the gradient graph to preserve the Bellman fixed point. This normalization is applied only to the critic loss and does not affect policy evaluation or execution. We provide Python-style pseudocode of SVN below.

Listing 1: Python-style pseudocode of scale-invariant value normalization (PyTorch notation).

```python
import torch

def svn_critic_loss(qs, targets, eps=1e-8):
    # qs[0], qs[1]: ensemble total-Q estimates; targets: Bellman targets
    q_tot_curr = torch.minimum(qs[0], qs[1])
    target_q = targets.detach()

    # Detached normalization statistics computed only from the current total Q
    mu_q = q_tot_curr.mean().detach()
    mad_q = (q_tot_curr - mu_q).abs().mean().detach() + eps

    # Normalize both the current Q and the target by the same statistics
    q_hat = (q_tot_curr - mu_q) / mad_q
    t_hat = (target_q - mu_q) / mad_q
    return ((q_hat - t_hat) ** 2).mean()
```

Online fine-tuning. For the offline-to-online experiments, we deviate from the common practice of balanced sampling, which incorporates offline data during online training (Ross & Bagnell, 2012; Ball et al., 2023; Eom et al., 2024; Lee & Kwon, 2025b). Instead, our approach focuses exclusively on newly collected online rollouts. Starting from the offline pretraining checkpoint, the agent undergoes an additional 500K gradient steps of pure online training.

Training and evaluation. We train all methods with 1M gradient steps for SMACv1 and SMACv2, and 500K steps for MPE and MA-MuJoCo.
For offline-to-online training, we first perform 500K steps of offline training, followed by 500K steps of online training. We evaluate the learned policy every 50K steps using 10 evaluation episodes.
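As a closing sanity check on the SVN scheme of Listing 1 (a minimal numpy sketch, not from the paper; variable names are illustrative): because the mean and mean-absolute-deviation statistics are detached, the gradient of the normalized TD loss with respect to the Q estimates is the plain TD gradient rescaled by 1/mad², so it vanishes exactly when Q equals the Bellman target and the fixed point is unchanged.

```python
import numpy as np

q = np.array([1.0, 3.0, -2.0])   # current total-Q estimates
t = np.array([0.5, 2.0, -1.0])   # Bellman targets

# Detached statistics, computed only from the current Q estimates
mu = q.mean()
mad = np.abs(q - mu).mean() + 1e-8

# Gradient of the plain TD loss 0.5 * (q - t)^2 w.r.t. q
g_plain = q - t
# Gradient of the SVN loss 0.5 * ((q - mu)/mad - (t - mu)/mad)^2 w.r.t. q,
# treating mu and mad as constants (they are detached from the graph)
g_svn = ((q - mu) / mad - (t - mu) / mad) / mad

# g_svn equals g_plain / mad**2: SVN rescales the gradient magnitude
# but preserves its zeros, so the minimizer q = t is unchanged.
```

This is the sense in which the normalization is "scale-invariant": it controls the magnitude of the critic gradients without moving the Bellman fixed point.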