Paper deep dive
Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
Turki Bin Mohaya, Peter Seiler
Status: succeeded | Model: anthropic/claude-sonnet-4.6 | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/24/2026, 1:37:07 AM
Summary
This paper proposes a partial attention mechanism integrated into the QMIX multi-agent reinforcement learning framework for safe autonomous vehicle control in highway merging scenarios. The environment is modeled as a Dec-POMDP, and each agent uses spatial attention (focusing on the front and opposite vehicle) and temporal attention (learned focus on past time steps) to make decentralized decisions. A comprehensive reward signal balances individual and global objectives. Simulations in SUMO demonstrate improved safety, driving speed, and reward compared to baseline approaches.
Entities (24)
Relation Signals (19)
Turki Bin Mohaya → affiliatedwith → University of Michigan
confidence 99% · T. Bin Mohaya and P. Seiler are with the Department of Electrical Engineering and Computer Science at the University of Michigan
Peter Seiler → affiliatedwith → University of Michigan
confidence 99% · T. Bin Mohaya and P. Seiler are with the Department of Electrical Engineering and Computer Science at the University of Michigan
Turki Bin Mohaya → fundedby → Ford Motor Company
confidence 99% · The authors acknowledge funding from the Ford Motor Company.
Peter Seiler → fundedby → Ford Motor Company
confidence 99% · The authors acknowledge funding from the Ford Motor Company.
Highway Merging → modeledas → Decentralized Partially Observable Markov Decision Process
confidence 99% · The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP)
Partial Attention Mechanism → appliedto → Highway Merging
confidence 98% · we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles
Partial Attention Mechanism → evaluatedin → Simulation of Urban Mobility (SUMO)
confidence 98% · Simulations are conducted in the Simulation of Urban Mobility (SUMO)
Partial Attention Mechanism → integratedinto → QMIX
confidence 98% · Within a QMIX framework, we include partial attention for each autonomous vehicle
Partial Attention Mechanism → comprises → Spatial Attention
confidence 97% · Our notion of partial attention constitutes two elements: spatial attention and temporal attention
Partial Attention Mechanism → comprises → Temporal Attention
confidence 97% · Our notion of partial attention constitutes two elements: spatial attention and temporal attention
Partial Attention Mechanism → uses → Multi-Head Attention
confidence 97% · the projected sequences serve as inputs to separate multi-head attention modules tailored for the front and opposite vehicle histories
QMIX → uses → Deep Q-Network
confidence 95% · QMIX minimizes a Deep Q-Network regression loss over a replay mini-batch of size B
Partial Attention Mechanism → uses → Layer Normalization
confidence 95% · The design begins with layer normalization
QMIX → enforces → Individual-Global-Max Principle
confidence 93% · This is due to the Individual-Global-Max (IGM) principle
Partial Attention Mechanism → implementedwith → PyTorch
confidence 93% · We utilize PyTorch to conduct this operation
Reward Shaping → targets → Highway Merging
confidence 93% · we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent
Partial Attention Mechanism → uses → Feed-Forward Neural Network
confidence 93% · they are passed through two feed-forward neural networks (FFN) with residual connections and layer normalization
QMIX → extends → Independent Q-Learning
confidence 90% · Independent Q-learning (IQL) decomposes the multi-agent problem into parallel single-agent problems... QMIX approximates the total action-value function through a nonlinear mapping
Autonomous Vehicles → uses → Vehicle-to-Vehicle Communication
confidence 88% · This corresponds to an idealized vehicle-to-vehicle (V2V) communication or sensing model
Cypher Suggestions (6)
Find all authors and their institutional affiliations · confidence 92% · unvalidated
MATCH (p:Entity {entity_type: 'Person'})-[:AFFILIATED_WITH]->(o:Entity {entity_type: 'Organization'}) RETURN p.name AS author, o.name AS institution
Find the modeling framework used for the multi-agent environment · confidence 90% · unvalidated
MATCH (t:Entity {name: 'Highway Merging'})-[:MODELED_AS]->(m:Entity) RETURN t.name AS task, m.name AS model, m.entity_type
Find all entities affiliated with or funded by Ford Motor Company · confidence 88% · unvalidated
MATCH (e:Entity)-[:AFFILIATED_WITH|FUNDED_BY]->(org:Entity {name: 'Ford Motor Company'}) RETURN e.name, e.entity_type
Find all components that QMIX uses or enforces · confidence 87% · unvalidated
MATCH (q:Entity {name: 'QMIX'})-[r:USES|ENFORCES|EXTENDS]->(c:Entity) RETURN q.name, type(r) AS relation, c.name AS component
Find all methods applied to the highway merging task · confidence 85% · unvalidated
MATCH (m:Entity)-[:APPLIED_TO|EVALUATED_IN]->(t:Entity {name: 'Highway Merging'}) RETURN m.name AS method, m.entity_type
Find all methods used in the proposed partial attention QMIX framework · confidence 82% · unvalidated
MATCH (m:Entity {name: 'Partial Attention Mechanism'})-[r:USES|COMPRISES|INTEGRATED_INTO]->(related:Entity) RETURN m.name AS framework, related.name AS component, type(r) AS relation
Abstract: Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in advanced generative artificial intelligence models. This paper applies this concept of an attention mechanism for multi-agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the global objectives of the environment (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
Tags
Links
- Source: https://arxiv.org/abs/2603.21810v1
- Canonical: https://arxiv.org/abs/2603.21810v1
Full Text
40,039 characters extracted from source content.
Partial Attention in Deep Reinforcement Learning for Safe Multi-Agent Control
Turki Bin Mohaya, Peter Seiler
T. Bin Mohaya and P. Seiler are with the Department of Electrical Engineering and Computer Science at the University of Michigan, Ann Arbor, MI 48109, USA. Email: turki,pseiler@umich.edu. The authors acknowledge funding from the Ford Motor Company. (April 2025)
Abstract
Attention mechanisms excel at learning sequential patterns by discriminating data based on relevance and importance. This provides state-of-the-art performance in today's advanced generative artificial intelligence models. This paper applies the concept of an attention mechanism to multi-agent safe control. We specifically consider the design of a neural network to control autonomous vehicles in a highway merging scenario. The environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP). Within a QMIX framework, we include partial attention for each autonomous vehicle, thus allowing each ego vehicle to focus on the most relevant neighboring vehicles. Moreover, we propose a comprehensive reward signal that considers the environment's global objectives (e.g., safety and vehicle flow) and the individual interests of each agent. Simulations are conducted in the Simulation of Urban Mobility (SUMO). The results show better performance compared to other driving algorithms in terms of safety, driving speed, and reward.
I Introduction
Highway merging is a fundamental yet challenging problem in autonomous driving. It requires agents to reason about dynamic interactions under uncertainty, where safe and efficient decisions depend not only on the agent's own state but also on the behaviors of surrounding vehicles. Traditional rule-based methods, although simple, often lack the flexibility to adapt to complex and dynamic merging situations.
On the other hand, fully centralized deep reinforcement learning approaches struggle with scalability and are difficult to deploy in practical settings. Thus, decentralized policies that can selectively attend to relevant information present a promising direction for improving highway merging. Several related works in the literature deploy Multi-Agent Reinforcement Learning (MARL) [22, 5, 23, 8, 6]. The authors in [7] proposed a simple recurrent unit to capture temporal patterns in the highway merging problem. These patterns are then fed to a Deep Deterministic Policy Gradient (DDPG) [11] network. They then enhance training by introducing a prioritized replay buffer that samples experiences in proportion to the performance error observed during each experience. This allows frequent replay of challenging scenarios for fast learning. The work in [10] uses cross-attention mechanisms to fuse pose data and semantic data from different instruments on the vehicle, mainly for navigation. Their work shows good performance but requires costly computational resources due to the complexity of the overall proposed framework. We consider the specific multi-agent scenario where vehicles are required to merge onto a highway. A model for this highway merging task is described in Section II. Our proposed solution for autonomous merging builds on two ingredients that are reviewed in Section III: attention mechanisms and QMIX [15] for decentralized MARL. Section IV then details our partial attention design and reward shaping. We illustrate our proposed method using the Simulation of Urban Mobility (SUMO) [12] and compare against other approaches (Section V). Finally, Section VI concludes the paper and discusses potential future directions. Our contributions are twofold. First, we enhance the QMIX architecture by introducing a partial attention mechanism that focuses on the most critical interactions for each agent.
Our notion of partial attention comprises two elements: spatial attention and temporal attention. In spatial attention, we impose, by design, that each agent only observes the vehicle in front and the vehicle on the opposite merging road. In temporal attention, the neural network learns to automatically focus on past time steps of these vehicles. This improves decision quality without incurring significant computational overhead. Second, we design a comprehensive reward structure that balances individual objectives, such as velocity maintenance and comfort, with global objectives such as collision avoidance and traffic flow improvement. Our method demonstrates improved safety and efficiency, validated through sophisticated simulations.
II Problem Formulation
II-A Description
Figure 1: Left: The highway merging problem. Middle and Right: Our contribution is deploying partial attention to the most critical interactions for safe highway merging.
The highway merging problem is challenging because agents with lower velocities approach other agents driving with fast velocity profiles. Furthermore, it is safety-critical and typically requires a decentralized solution. Fig. 1 (left) illustrates the problem: vehicle B is merging while other highway vehicles are driving at higher velocities. The individual objectives of each agent mainly include safety (i.e., avoiding collisions) and maintaining a desired velocity without degrading rider comfort or compromising fuel efficiency. However, there are global objectives for the traffic that are sometimes in conflict with the objectives of individual vehicles. For example, it is desirable to maintain a high average velocity to enhance vehicle flow, but highway agents may need to decelerate to allow merging. Also, the merging road should increase its throughput without negatively affecting the highway's average velocity. Each agent makes decisions, in principle, based on all other nearby vehicles.
However, not all other vehicles are equally relevant to an agent's decision. For example, in Fig. 1 (middle), vehicles A, F, and D are less relevant to the decision of vehicle B. On the other hand, vehicles C and E are of great importance to the merger decision. In fact, considering only information from these two agents can significantly reduce the computational cost of the merging decision-making process, and thus simplify the multi-agent highway merging problem. The converse also holds: in Fig. 1 (right), vehicle E should focus on vehicles B and C while reducing attention to other vehicles.
II-B POMDPs
The multi-vehicle environment is modeled as a Decentralized Partially Observable Markov Decision Process (Dec-POMDP) [1] and represented by the following tuple:
$\mathcal{M} = (\mathcal{I}, \mathcal{S}, \{\mathcal{A}_i\}, \mathcal{T}, r, \{\Omega_i\}, \mathcal{O}, \gamma)$.  (1)
The index set $\mathcal{I} := \{1, \dots, N\}$ labels the $N$ vehicles. A state $s \in \mathcal{S}$ summarizes the status of the highway. Each vehicle $i \in \mathcal{I}$ selects a discrete control $a_i \in \mathcal{A}_i$. Collecting the local controls yields the joint control $a = (a_1, \dots, a_N) \in \mathcal{A}$, with $\mathcal{A} = \times_{i \in \mathcal{I}} \mathcal{A}_i$, where $\times$ denotes the Cartesian product. State evolution is governed by the probability kernel $\mathcal{T} : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to [0,1]$, so that $\mathcal{T}(s, a, s') = \Pr(s' \mid s, a)$, where $s'$ denotes the next state obtained after applying the action $a$ at state $s$. After the transition, vehicle $i$ receives a private observation $o_i \in \Omega_i$. The stacked observation $o = (o_1, \dots, o_N)$ resides in $\Omega = \times_{i \in \mathcal{I}} \Omega_i$. A measurement function $\mathcal{O} : \mathcal{S} \times \mathcal{A} \times \Omega \to [0,1]$ defines the likelihood of an observation $o$ for a given state-action pair $(s, a)$ as $\mathcal{O}(s, a, o) = \Pr(o \mid s, a)$. Lastly, a scalar reward signal $r : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}$ provides the reward $r(s, a, s')$ accrued on a transition from $(s, a)$ to $s'$. Future rewards are weighted by the discount factor $\gamma \in [0,1]$. Specifically, let $G_t$ denote the total discounted reward starting at time $t$ and going through the finite horizon $T$ of an episode.
This total discounted reward is given by:
$G_t = \sum_{k=t}^{T-1} \gamma^{k-t} r(k+1)$.  (2)
Here, $r(k+1)$ is shorthand for the reward accrued at time $k+1$ for the transition from $(s(k), a(k))$ to the next state $s(k+1)$. The finite-horizon return can equivalently be expressed in recursive form as $G_t = r(t+1) + \gamma G_{t+1}$, with $G_T = 0$, where $r(t+1)$ is the immediate reward and $\gamma G_{t+1}$ is the discounted future reward. The terminal condition $G_T = 0$ enforces the finite horizon.
III Background
Deep neural networks [9] have shown astonishing performance in learning complex patterns across vast dynamics and datasets. This is due to their ability to automatically extract features and optimize their weights by backpropagating the error loss. However, automatic feature extraction can, in some problems, suffer from an inability to focus on the more relevant parts of the data. This can lead to sub-optimal solutions and inefficient training. Therefore, attention mechanisms were designed to equip neural networks with the ability to learn selective focus. This enables state-of-the-art performance in language translation and large language models [20, 4]. We propose to apply similar attention mechanisms to autonomous driving.
III-A Attention Neural Networks
Consider a vector-valued sequence $\{s_0, s_1, \dots, s_T\} \subset \mathbb{R}^n$, and stack it row-wise to form the matrix $S := [s_0^\top; s_1^\top; \dots; s_T^\top] \in \mathbb{R}^{(T+1) \times n}$. The scaled attention neural network [20] takes $S$ as input and produces an output matrix $Z \in \mathbb{R}^{(T+1) \times d_o}$, where the dimension $d_o$ is described below. The output is constructed from $H$ attention heads, each computing a query, key, and value.
For head $j = 1, \dots, H$, the query $Q_j \in \mathbb{R}^{(T+1) \times d_j}$, key $K_j \in \mathbb{R}^{(T+1) \times d_j}$, and value $V_j \in \mathbb{R}^{(T+1) \times d_j}$ are computed as
$Q_j = S W_Q^j, \quad K_j = S W_K^j, \quad V_j = S W_V^j$,  (3)
where $W_Q^j, W_K^j, W_V^j \in \mathbb{R}^{n \times d_j}$ are the learned projection weights for the $j$-th head. Setting $H = 1$ yields the single-head attention neural network, while setting $H > 1$ employs multiple attention heads in parallel. Each head independently processes the input sequence, allowing the model to capture different aspects of the input data simultaneously. The attention weights for each head are $A_j = \mathrm{Softmax}(Q_j K_j^\top / \sqrt{d_j}) \in \mathbb{R}^{(T+1) \times (T+1)}$, where $d_j$ is the dimension of the keys and queries. The output of head $j$ is then $Z_j = A_j V_j \in \mathbb{R}^{(T+1) \times d_j}$. The outputs of all heads are concatenated and linearly transformed to produce the final output matrix of the attention neural network:
$Z = [Z_1 \; Z_2 \; \dots \; Z_H] \, W_O \in \mathbb{R}^{(T+1) \times d_o}$,  (4)
where $d = \sum_{j=1}^{H} d_j$ is the column dimension of the concatenated matrix. Moreover, $W_O \in \mathbb{R}^{d \times d_o}$ is a learned projection matrix that maps the combined output to the desired output dimension $d_o$. This multi-head approach enables the model to capture diverse patterns and relationships within the input sequence. This enhances its ability to learn complex dependencies and improves overall performance.
III-B QMIX
Multi-agent deep reinforcement learning (MADRL) provides a robust method to learn practical policies or controllers for stochastic multi-agent environments. This is due to its ability to capture highly nonlinear stochastic dynamics by iteratively interacting with the environment and observing the consequences. Independent Q-learning (IQL) [18] decomposes the multi-agent problem into parallel single-agent problems that are solved simultaneously. In IQL, each agent treats the other agents as stationary parts of the environment.
However, learning for the agents must be coordinated to address the nonstationary environment. The principle of optimality [17] can be used to express a recursive (centralized) solution for the total action-value function $Q_{tot}$ of all agents. QMIX [15] approximates the total action-value function through a nonlinear mapping $\mathrm{Mix}(\cdot)$ of the individual action-value functions $\{Q_i\}_{i=1}^{N}$ of the $N$ agents. This nonlinear mapping is implemented as a neural network and is given by
$Q_{tot}(o, a; \theta) = \mathrm{Mix}\big(o, a, Q_1(o_1, a_1; \theta_1), \dots, Q_N(o_N, a_N; \theta_N), \theta_{N+1}\big)$,  (5)
where $\theta$ includes the weights of the individual action-value functions $\{\theta_i\}_{i=1}^{N}$ and the weights of the mixing network $\theta_{N+1}$. QMIX minimizes a Deep Q-Network [14] regression loss over a replay mini-batch of size $B$. To define this loss, let $o(m)$ and $a(m)$ denote the joint observation and the joint action at sample $m$, respectively. The one-step target is
$y(m) = r(m) + \gamma \max_{a'} Q_{tot}(o'(m), a'; \theta^-)$,  (6)
where $r(m)$ is the immediate reward, $o'(m)$ is the next joint observation at sample $m$, $a'$ is the next joint action, and $\theta^-$ are the parameters of a slowly updated target network. The loss over a replay mini-batch of size $B$ is then given by
$\mathcal{L}(\theta) = \frac{1}{B} \sum_{m=1}^{B} \big[\, y(m) - Q_{tot}(o(m), a(m); \theta) \,\big]^2$.  (7)
The mixing network weights are constrained to be non-negative. This architectural choice enforces the monotonicity condition $\partial Q_{tot}(o,a) / \partial Q_i(o_i, a_i) \ge 0$ for all $i \in \mathcal{I}$, ensuring that an improvement in any agent's value cannot reduce the global estimate. As a result of $Q_{tot}$ being monotone, the joint maximizer factorizes into independent maximizers.
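As a concreteness check, the one-step target (6) and mini-batch loss (7) can be sketched in plain Python. This is a toy sketch, not the paper's implementation: here $Q_{tot}$ is a lookup table rather than a monotonic mixing network over per-agent Q-networks, and the observations, actions, and rewards are invented for illustration.

```python
# Toy sketch of the QMIX one-step target (6) and mini-batch loss (7).
# Q_tot is represented as a plain dict keyed by (observation, joint action).

GAMMA = 0.99  # discount factor, matching the paper's reported value

def one_step_target(reward, next_obs, q_tot_target, actions):
    """y(m) = r(m) + gamma * max_a' Q_tot(o'(m), a'; theta^-)."""
    return reward + GAMMA * max(q_tot_target[(next_obs, a)] for a in actions)

def qmix_loss(batch, q_tot, q_tot_target, actions):
    """Mean squared error between targets y(m) and Q_tot(o(m), a(m)), eq. (7)."""
    total = 0.0
    for obs, act, reward, next_obs in batch:
        y = one_step_target(reward, next_obs, q_tot_target, actions)
        total += (y - q_tot[(obs, act)]) ** 2
    return total / len(batch)

# Hypothetical two-observation, two-action example.
actions = ["keep", "brake"]
q_tot = {("o0", "keep"): 1.0, ("o0", "brake"): 0.5,
         ("o1", "keep"): 2.0, ("o1", "brake"): 1.5}
q_tot_target = dict(q_tot)  # target net starts as a copy of the online net

batch = [("o0", "keep", 0.1, "o1"), ("o0", "brake", -1.0, "o1")]
loss = qmix_loss(batch, q_tot, q_tot_target, actions)
```

In a real QMIX implementation the `max` over joint actions is tractable precisely because of the IGM factorization discussed next: each agent maximizes its own $Q_i$ locally.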
This factorization follows from the Individual-Global-Max (IGM) principle [16]:
$\arg\max_a Q_{tot}(o, a) = \begin{bmatrix} \arg\max_{a_1} Q_1(o_1, a_1) \\ \vdots \\ \arg\max_{a_N} Q_N(o_N, a_N) \end{bmatrix}$,  (8)
so team-optimal actions can be obtained by greedy choices made locally by each agent. During learning, the full joint trajectory $o$ is available to the mixing network, providing a centralized viewpoint that removes the non-stationarity that is a consequence of purely decentralized training. In this phase, each agent selects a random action $a_i^{\mathrm{random}}$ with probability $\epsilon$ to encourage exploration, while with probability $1 - \epsilon$ it selects the greedy action $a_i^\star = \arg\max_{a_i} Q_i(o_i, a_i)$. Formally, the individual action selection at each step is
$a_i = \begin{cases} a_i^{\mathrm{random}} & \text{with probability } \epsilon, \\ a_i^\star & \text{with probability } 1 - \epsilon. \end{cases}$  (9)
The exploration probability $\epsilon$ decays over time according to a fixed decay rate $\epsilon_{\mathrm{decay}}$ until reaching a specified minimum value $\epsilon_{\mathrm{min}}$. During evaluation, the model fully exploits its learned policy by setting $\epsilon = 0$, thereby always choosing the greedy action. After convergence, the mixing network is discarded, and each agent selects its action $a_i^\star$ using local information $o_i$. It is worth noting that even with each agent selecting actions based on local information, (8) guarantees that the resulting joint action remains optimal.
IV Approach
IV-A Agent State Design
The state of each autonomous vehicle (AV) is designed to encapsulate both its own dynamics and the dynamics of its immediate surroundings through the partial attention mechanism. This approach allows the $i$-th AV to focus on the most relevant interactions, specifically the vehicle directly ahead of it, $f(i)$, and the vehicle approaching from the opposite road, $o(i)$. To illustrate, vehicle E in Fig. 1 (right) has a front vehicle C and an opposite vehicle B.
On the other hand, in Fig. 1 (middle), vehicle B has the front vehicle C and the opposite vehicle E. Similarly, vehicle A has the front vehicle B and the opposite vehicle E. Finally, vehicles D and C have neither a front nor an opposite vehicle. Every agent's state representation is composed of its current state and the historical states of these two neighboring vehicles over a specified time window. The kinematic state of vehicle $i$ at time $t$ is defined as $\bar{S}_i(t) = [x_i(t) \; y_i(t) \; v_i(t) \; a_i(t)]^\top \in \mathbb{R}^4$, where $x_i(t)$ and $y_i(t)$ are the position coordinates, $v_i(t)$ is the speed, and $a_i(t)$ is the acceleration of the vehicle. The information state of vehicle $i$ at time $t$, denoted $S_i(t) \in \mathbb{R}^{4 + 8(w+1)}$, is defined as
$S_i(t) = [\bar{S}_i(t)^\top \;\; \mathrm{vec}(F_i(t))^\top \;\; \mathrm{vec}(O_i(t))^\top]^\top$,  (10)
where $\mathrm{vec}(\cdot)$ denotes row-major vectorization of a matrix into a flat vector, and $F_i(t), O_i(t) \in \mathbb{R}^{(w+1) \times 4}$ represent the historical kinematic states of the front and opposite vehicles over the most recent $w+1$ time steps:
$F_i(t) = [\bar{S}_{f(i)}(t-w) \;\; \cdots \;\; \bar{S}_{f(i)}(t)]^\top$,  (11)
$O_i(t) = [\bar{S}_{o(i)}(t-w) \;\; \cdots \;\; \bar{S}_{o(i)}(t)]^\top$.  (12)
Here, $\bar{S}_{f(i)}(t)$ and $\bar{S}_{o(i)}(t)$ denote the kinematic states of the front and opposite vehicles associated with agent $i$, respectively. We assume that each agent can perfectly observe the kinematic state of its front and opposite vehicles. This corresponds to an idealized vehicle-to-vehicle (V2V) communication or sensing model.
IV-B Partial Attention
The front and opposite vehicle histories, $F_i(t)$ and $O_i(t)$, are leveraged by processing them through a multi-head attention mechanism. This processing captures temporal dependencies and contextual interactions, constructing a comprehensive state representation for each agent.
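The information state in (10)–(12) amounts to concatenating the ego kinematic state with two row-major-vectorized histories. A minimal sketch in plain Python, where the window `w` and all kinematic values are illustrative rather than taken from the paper:

```python
# Sketch of the information state (10): the agent's own kinematic state
# concatenated with the vectorized histories of its front and opposite
# vehicles over the last w+1 steps. All numeric values are illustrative.

w = 2  # history window for this sketch; the paper uses w = 9

def vec(matrix):
    """Row-major vectorization of a (w+1) x 4 history matrix."""
    return [value for row in matrix for value in row]

def information_state(own_state, front_history, opposite_history):
    """S_i(t) = [own state, vec(F_i(t)), vec(O_i(t))], as in eq. (10)."""
    return own_state + vec(front_history) + vec(opposite_history)

own = [100.0, 3.5, 25.0, 0.2]                  # [x, y, v, a] of the ego vehicle
front = [[120.0, 3.5, 24.0, 0.0]] * (w + 1)    # front-vehicle history
opposite = [[95.0, 0.0, 20.0, 0.5]] * (w + 1)  # opposite-vehicle history

s = information_state(own, front, opposite)
assert len(s) == 4 + 8 * (w + 1)  # matches the stated dimension of S_i(t)
```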
The design begins with layer normalization [3]. Given a vector $x$, the output $y$ of the layer normalization operation can be expressed as
$y = \dfrac{x - \mathrm{E}[x]}{\sqrt{\mathrm{Var}[x] + \rho}} \odot \kappa + \beta$,  (13)
where $\mathrm{E}[x]$ is the mean of $x$, $\mathrm{Var}[x]$ is the variance of $x$, and $\odot$ represents element-wise multiplication. We set $\rho = 10^{-5}$, while $\kappa$ and $\beta$ are learnable affine parameters. We utilize PyTorch [2] to conduct this operation. The deployment of this function is discussed next. Each historical sequence is first normalized to ensure consistent scaling across different features and time steps:
$\tilde{F}_i(t) = \mathrm{LN}(F_i(t)), \quad \tilde{O}_i(t) = \mathrm{LN}(O_i(t))$,  (14)
where $\mathrm{LN}(\cdot)$ denotes layer normalization applied to each feature vector within the sequence. Following normalization, each token in the sequence is projected into a higher-dimensional embedding space to enhance the capacity of the attention mechanism:
$X_{F_i}(t) = \tilde{F}_i(t) \, W_F^\top + \mathbf{1}_{w+1} b_F^\top \in \mathbb{R}^{(w+1) \times d_{\mathrm{model}}}$,  (15)
$X_{O_i}(t) = \tilde{O}_i(t) \, W_O^\top + \mathbf{1}_{w+1} b_O^\top \in \mathbb{R}^{(w+1) \times d_{\mathrm{model}}}$,  (16)
where $W_F, W_O \in \mathbb{R}^{d_{\mathrm{model}} \times 4}$ and $b_F, b_O \in \mathbb{R}^{d_{\mathrm{model}}}$ denote the weight matrices and bias vectors of the corresponding linear transformations, and $\mathbf{1}_{w+1} \in \mathbb{R}^{w+1}$ is a vector whose entries are all ones. Here, $d_{\mathrm{model}}$ is the dimensionality of the embedding space. Next, the projected sequences $X_{F_i}(t)$ and $X_{O_i}(t)$ serve as inputs to separate multi-head attention modules tailored for the front and opposite vehicle histories:
$Z_{F_i}(t) = \mathrm{MHA}(X_{F_i}(t)) \in \mathbb{R}^{(w+1) \times d_{\mathrm{model}}}$,  (17)
$Z_{O_i}(t) = \mathrm{MHA}(X_{O_i}(t)) \in \mathbb{R}^{(w+1) \times d_{\mathrm{model}}}$.  (18)
The function $\mathrm{MHA}(\cdot)$ denotes the multi-head attention operation on the input sequence. This mechanism allows the model to focus on different parts of the input sequences simultaneously, capturing diverse aspects of temporal dependencies.
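A minimal single-head instance of the scaled dot-product attention underlying $\mathrm{MHA}(\cdot)$ in (17)–(18) can be sketched in plain Python. The paper uses PyTorch multi-head modules; the tiny dimensions and identity projection weights below are purely illustrative:

```python
import math

def softmax(row):
    """Numerically stable softmax of a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    s = sum(exps)
    return [e / s for e in exps]

def matmul(A, B):
    """Plain list-of-lists matrix product."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention: Softmax(Q K^T / sqrt(d)) V."""
    Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)
    d = len(Wq[0])
    scores = [[sum(q * k for q, k in zip(qr, kr)) / math.sqrt(d) for kr in K]
              for qr in Q]
    A = [softmax(row) for row in scores]  # one attention row per time step
    return matmul(A, V)

# Illustrative sequence of (w+1) = 3 tokens of dimension 2, identity projections.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
I2 = [[1.0, 0.0], [0.0, 1.0]]
Z = attention(X, I2, I2, I2)
assert len(Z) == 3 and len(Z[0]) == 2  # output keeps the (w+1) x d shape
```

Each output row is a convex combination of the value rows, which is what lets the network weight some time steps of the neighbor histories more heavily than others.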
To synthesize the information captured by the attention mechanism, a weighted aggregation of the attention outputs is performed along the temporal axis. This aggregation places more emphasis on the most recent historical data, reflecting its higher relevance in the current decision-making context. We define $\alpha = \mathrm{Softmax}(\Psi_{0.5,1}) \in \mathbb{R}^{w+1}$, where $\mathrm{Softmax}(\cdot)$ normalizes the vector $\Psi_{0.5,1}$, an increasing, evenly spaced vector between 0.5 and 1.0 over the $w+1$ time steps. Then, the weighted embeddings are
$E_{F_i}(t) = Z_{F_i}(t)^\top \alpha, \quad E_{O_i}(t) = Z_{O_i}(t)^\top \alpha$.  (19)
This contracts the temporal axis of $Z_{F_i}(t)$ and $Z_{O_i}(t)$, producing a single embedding vector that weights the relevance of each time step. To further enhance the expressive power of the embeddings $E_{F_i}(t)$ and $E_{O_i}(t)$, they are passed through two feed-forward neural networks (FFN) with residual connections and layer normalization:
$E'_{F_i}(t) = \mathrm{LN}\big(E_{F_i}(t) + \mathrm{FFN}_F(E_{F_i}(t))\big) \in \mathbb{R}^{d_{\mathrm{model}}}$,  (20)
$E'_{O_i}(t) = \mathrm{LN}\big(E_{O_i}(t) + \mathrm{FFN}_O(E_{O_i}(t))\big) \in \mathbb{R}^{d_{\mathrm{model}}}$,  (21)
where $\mathrm{FFN}_F(\cdot)$ and $\mathrm{FFN}_O(\cdot)$ are feed-forward networks comprising linear transformations and nonlinear activations. The residual connections facilitate gradient flow and stabilize training by allowing the model to retain information from earlier layers. The final state representation $S'_i(t)$ for agent $i$ at time step $t$ is constructed by concatenating the agent's own state vector $\bar{S}_i(t)$ with the aggregated embeddings from the front and opposite vehicle histories:
$S'_i(t) = [\bar{S}_i(t)^\top \;\; E'_{F_i}(t)^\top \;\; E'_{O_i}(t)^\top]^\top \in \mathbb{R}^{4 + 2 d_{\mathrm{model}}}$.  (22)
This enriched state vector integrates both the intrinsic dynamics of the agent and the contextual information derived from its immediate vehicular environment through partial attention.
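The fixed temporal weights $\alpha$ used in (19) can be sketched directly, assuming the softmax of an evenly spaced ramp from 0.5 to 1.0 over the $w+1$ steps, as described:

```python
import math

def temporal_weights(w):
    """alpha = Softmax(Psi_{0.5,1}): softmax of an evenly spaced ramp from
    0.5 to 1.0 over w+1 steps, so more recent steps receive more weight."""
    ramp = [0.5 + 0.5 * k / w for k in range(w + 1)]  # requires w >= 1
    exps = [math.exp(v) for v in ramp]
    s = sum(exps)
    return [e / s for e in exps]

alpha = temporal_weights(9)  # w = 9, matching the paper's window
```

The weights are a valid convex combination (they sum to one) and increase monotonically toward the present, which is exactly the recency bias the aggregation is meant to encode.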
The comprehensive state $S'_i(t)$ is subsequently fed into the QMIX utility network, which leverages this information to estimate the utility values for each possible action, thereby facilitating coordinated and efficient decision-making during merging maneuvers.
IV-C Reward Design
The reward structure is designed to incentivize desirable behaviors and penalize undesirable actions. It integrates both global and local reward components, ensuring that individual agent performance aligns with overall traffic efficiency and safety objectives. The sum of all terms gives a comprehensive reward signal that facilitates effective learning of the QMIX model and enables coordinated decision-making in the merging scenario.
Global Reward Signal: The global reward is designed to promote collective traffic efficiency and safety through metrics that reflect the overall performance of all agents. First, to promote safety, a penalty is imposed when a collision occurs via the term
$r_{\mathrm{collision}}(t) = -c_1 N_{\mathrm{collision}}(t)$,  (23)
where $N_{\mathrm{collision}}(t)$ represents the number of collisions at time step $t$, and $c_1$ is a positive weighting coefficient. Upon receiving this penalty, the episode is terminated. Next, to maintain efficient traffic flow and minimize congestion, a traffic flow term encourages agents to sustain desired speeds:
$r_{\mathrm{flow}}(t) = c_2 \bar{v}_{\mathrm{highway}}(t) + c_3 \bar{v}_{\mathrm{merging}}(t)$,  (24)
where $\bar{v}_{\mathrm{highway}}(t)$ and $\bar{v}_{\mathrm{merging}}(t)$ denote the average velocities on the highway and merging lanes at time step $t$, and $c_2, c_3$ are positive balancing coefficients. To mitigate idling and reduce travel time, we define the waiting penalty as
$r_{\mathrm{waiting}}(t) = -c_4 T_{\mathrm{waiting}}(t)$,  (25)
where $T_{\mathrm{waiting}}(t)$ is the cumulative time vehicles spend below the minimum allowed velocity, and $c_4$ is the weighting coefficient.
Finally, to encourage route completion, we define
$r_{\mathrm{goal}}(t) = c_5 N_{\mathrm{goal}}(t)$,  (26)
where $N_{\mathrm{goal}}(t)$ denotes the number of vehicles that successfully reached their destinations at time step $t$, and $c_5$ is the corresponding weighting coefficient.
Individual Reward Signal: In addition to the global reward, individual rewards are directly attributed to each agent based on its specific actions and states, ensuring independent performance maximization. Each agent is rewarded for tracking its desired velocity via
$r_{\mathrm{velocity},i}(t) = -c_6 \dfrac{|v_i(t) - v_{i,\mathrm{desired}}|}{v_{i,\mathrm{desired}}}$,  (27)
where $v_i(t)$ is the velocity of agent $i$ at time $t$, $v_{i,\mathrm{desired}}$ is its target velocity, and $c_6$ is a positive weighting coefficient. Fuel-efficient driving is promoted through
$r_{\mathrm{efficiency},i}(t) = -c_7 \mathrm{Fuel}_i(t)$,  (28)
where $\mathrm{Fuel}_i(t)$ is the fuel consumption of agent $i$ at time step $t$, and $c_7$ is the weighting coefficient. Finally, passenger comfort is enforced by penalizing sharp accelerations via
$r_{\mathrm{comfort},i}(t) = -c_8 |a_i(t)|$,  (29)
where $a_i(t)$ is the acceleration of agent $i$ at time step $t$, and $c_8$ is the corresponding weighting coefficient.
Figure 2: Performance during the training phase.
Figure 3: Comparison of the proposed method against SUMO IDM in the evaluation phase.
Final Reward Signal: The comprehensive reward $r(t)$ at step $t$ is the sum of all global and individual agent reward terms:
$r(t) = r_{\mathrm{collision}}(t) + r_{\mathrm{flow}}(t) + r_{\mathrm{waiting}}(t) + r_{\mathrm{goal}}(t) + \sum_{i=1}^{N} \big[ r_{\mathrm{velocity},i}(t) + r_{\mathrm{efficiency},i}(t) + r_{\mathrm{comfort},i}(t) \big]$.  (30)
The total reward $r(t)$ enters the discounted return defined in (2).
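A single evaluation of the comprehensive reward (30) can be sketched in plain Python using the paper's reported coefficient values; the per-step traffic quantities and agent states below are hypothetical:

```python
# Sketch of the comprehensive reward (30): global terms (23)-(26) plus the
# per-agent terms (27)-(29). Coefficient values follow the paper's reported
# reward coefficients; all per-step traffic quantities are hypothetical.

c = {"collision": 40, "hw_flow": 0.5, "mg_flow": 0.9, "waiting": 1.0,
     "goal": 1.0, "velocity": 3.0, "fuel": 0.00001, "comfort": 0.01}

def global_reward(n_collisions, v_highway, v_merging, t_waiting, n_goal):
    """r_collision + r_flow + r_waiting + r_goal at one time step."""
    return (-c["collision"] * n_collisions
            + c["hw_flow"] * v_highway + c["mg_flow"] * v_merging
            - c["waiting"] * t_waiting + c["goal"] * n_goal)

def individual_reward(v, v_des, fuel, accel):
    """r_velocity + r_efficiency + r_comfort for one agent."""
    return (-c["velocity"] * abs(v - v_des) / v_des
            - c["fuel"] * fuel - c["comfort"] * abs(accel))

# Two hypothetical agents: (velocity, desired velocity, fuel, acceleration).
agents = [(9.0, 10.0, 120.0, 1.0), (8.0, 10.0, 150.0, -2.0)]

# One collision-free step with one vehicle reaching its goal.
r_t = (global_reward(0, 9.5, 6.0, 0.0, 1)
       + sum(individual_reward(*a) for a in agents))
```

Note how the large collision coefficient makes a single collision outweigh the flow and goal bonuses of a typical step, which is the intended prioritization of safety.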
This reward structure ensures that agents are simultaneously motivated to maintain safe and efficient driving, both individually and collectively, enhancing overall traffic dynamics during highway merging.
V Results
We utilize the Simulation of Urban Mobility (SUMO) [12] to build the highway environment. Neural networks are designed and trained via PyTorch [2] and then deployed in SUMO through the integration of TraCI [21]. Our model is trained for 1000 episodes, each with a maximum of 1000 time steps. The hyperparameters used to train the networks are summarized in Table I. The highway has two lanes and a length of 400 meters, but vehicles are not allowed to change lanes. The merging road has a length of 100 meters and one lane. Table II reports the random parameters used to generate vehicles for each episode. This includes a fixed number of agents, the initial velocities of highway and merging road vehicles, the assigned departure time (the time at which the vehicle begins to exist), and the assigned road, i.e., label HW for a highway spawn and M for a merging road spawn.
Table I: Training Hyperparameters
- Episodes: 1000
- Maximum time steps per episode: 1000
- Optimizer: AdamW [13]
- $B$ (batch size): 256
- $\gamma$ (discount factor): 0.99
- Learning rate: 0.0001
- $\epsilon$: 1.0
- $\epsilon_{\mathrm{min}}$: 0.05
- $\epsilon_{\mathrm{decay}}$: 0.99
- Target network update interval: 4 episodes
- TraCI sampling interval (s): 0.1
- Replay buffer size: 1000000
- Action space, acceleration (m/s^2): [-6, -3, -2, -1, 0, 1, 2, 3, 6]
- $w$: 9
Table II: Vehicle Generation Parameters
- Number of agents: 16
- Initial velocity for highway agents (m/s): ~Uniform([7, 10])
- Initial velocity for merging agents (m/s): ~Uniform([4, 8])
- Departure time (s): ~Uniform([0, 100])
- Route: ~Uniform({HW, M})
- Vehicle length (m): 5.0
Table III reports the reward coefficients. $c_1$ is large to critically prioritize vehicle safety.
c_2 and c_3 are balanced to enhance both the merging flow and the highway flow. c_4 penalizes waiting time without overshadowing the other terms. c_5 rewards safely reaching the goal, but is kept small enough that route completion alone does not yield a large reward. c_6 pushes each vehicle toward its desired velocity, while c_7 penalizes fuel consumption and thus discourages excessively high speeds. Lastly, the comfort coefficient c_8 encourages smoother acceleration signals.

Table III: Reward Coefficients
c_1: 40
c_2: 0.5
c_3: 0.9
c_4: 1.0
c_5: 1.0
c_6: 3.0
c_7: 0.00001
c_8: 0.01

Training was conducted on a MacBook Pro equipped with an Apple M4 chip and took approximately 56 minutes. Fig. 2 shows the performance of the proposed method during the training phase. Within 500 episodes, the model converged to high average velocities while significantly reducing the average number of collisions. Fuel consumption increases during training as a consequence of the rising average velocities. Furthermore, we conduct an ablation study by training the model without the temporal attention layers (i.e., omitting (17) and (18)). This ablated model is referred to as Vanilla QMIX (VQMIX) in Fig. 2. Its performance diverges within 300 episodes of training, which illustrates the importance of the temporal attention layers for focusing on the critical temporal dynamics. Hence, the ablation study demonstrates the effectiveness of the proposed method. Next, we discuss the results during the evaluation phase. Fig. 3 compares our Partial Attention QMIX model with SUMO's Intelligent Driver Model (IDM) [19] on four performance metrics. In terms of average reward, our method shows a clear improvement across episodes. In terms of average velocity, our model reaches and maintains a higher level, meaning vehicles move more smoothly and efficiently. Fuel consumption is slightly higher with our approach, likely due to the higher velocities.
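The ablation above removes the temporal attention layers (17) and (18), which are not reproduced in this excerpt. As a generic sketch of the underlying idea, scaled dot-product attention over an agent's past observation embeddings can be written in NumPy as follows; the shapes and names are assumptions, not the paper's exact formulation.

```python
import numpy as np

def temporal_attention(history, query):
    """Attend over the past time steps of a single agent.

    history: (T, d) array of past observation embeddings.
    query:   (d,)  embedding of the current step.
    Returns a (d,) context vector: a softmax-weighted sum of past
    embeddings, weighted by their similarity to the current step.
    """
    d = history.shape[1]
    scores = history @ query / np.sqrt(d)      # (T,) similarity scores
    weights = np.exp(scores - scores.max())    # numerically stable softmax
    weights /= weights.sum()
    return weights @ history                   # (d,) context vector
```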
Finally, the number of collisions drops significantly with our model, whereas IDM continues to produce crashes more frequently. These results illustrate how carefully tailoring the information state of the ego vehicle, so that it captures only the most relevant nearby interaction dynamics, can improve both training speed and evaluation performance.

VI Conclusion

This paper presented a highway merging framework that integrates partial attention into QMIX, where attention operates on two complementary levels: spatial attention identifies the relevant vehicles, while temporal attention extracts informative past states. Combined with a hybrid reward signal balancing global and individual objectives, this design yields safer and more efficient merging behavior. Simulation results in a SUMO-based environment confirm significant improvements in collision rate, average velocity, and overall reward over a standard driving baseline, at the cost of higher fuel consumption driven by increased speeds. Future work will extend the framework to multi-lane and mixed-autonomy settings where autonomous and human-driven vehicles coexist.

References

[1] C. Amato, G. Chowdhary, A. Geramifard, N. K. Üre, and M. J. Kochenderfer (2013) Decentralized control of partially observable Markov decision processes. In 52nd IEEE Conference on Decision and Control, p. 2398–2405.
[2] J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, M. Saroufim, M. Y. Siraichi, H. Suk, M. Suo, P. Tillet, E. Wang, X. Wang, W. Wen, S. Zhang, X. Zhao, K. Zhou, R. Zou, A. Mathews, G. Chanan, P. Wu, and S.
Chintala (2024) PyTorch 2: faster machine learning through dynamic Python bytecode transformation and graph compilation. In Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2.
[3] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv:1607.06450.
[4] Y. Chang, X. Wang, J. Wang, Y. Wu, L. Yang, K. Zhu, H. Chen, X. Yi, C. Wang, Y. Wang, et al. (2024) A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology 15 (3), p. 1–45.
[5] D. Chen, M. R. Hajidavalloo, Z. Li, K. Chen, Y. Wang, L. Jiang, and Y. Wang (2023) Deep multi-agent reinforcement learning for highway on-ramp merging in mixed traffic. IEEE Transactions on Intelligent Transportation Systems 24 (11), p. 11623–11638.
[6] J. Chen, B. Zhu, M. Zhang, X. Ling, X. Ruan, Y. Deng, and N. Guo (2025) Multi-agent deep reinforcement learning cooperative control model for autonomous vehicle merging into platoon in highway. World Electric Vehicle Journal 16 (4), p. 225.
[7] Z. Chen, Y. Du, A. Jiang, and S. Miao (2025) Deep reinforcement learning algorithm based ramp merging decision model. Proceedings of the Institution of Mechanical Engineers, Part D: Journal of Automobile Engineering 239 (1), p. 70–84.
[8] J. Du, A. Yu, H. Zhou, Q. Jiang, and X. Bai (2025) Research on integrated control strategy for highway merging bottlenecks based on collaborative multi-agent reinforcement learning. Applied Sciences 15 (2), p. 836.
[9] I. Goodfellow, Y. Bengio, and A. Courville (2016) Deep learning. MIT Press.
[10] Z. Li, T. Shang, and P. Xu (2025) Multi-modal attention perception for intelligent vehicle navigation using deep reinforcement learning. IEEE Transactions on Intelligent Transportation Systems.
[11] T. P. Lillicrap, J. J. Hunt, A.
Pritzel, N. Heess, T. Erez, Y. Tassa, D. Silver, and D. Wierstra (2016) Continuous control with deep reinforcement learning. In International Conference on Learning Representations.
[12] P. A. Lopez, M. Behrisch, L. Bieker-Walz, J. Erdmann, Y. Flötteröd, R. Hilbrich, L. Lücken, J. Rummel, P. Wagner, and E. Wießner (2018) Microscopic traffic simulation using SUMO. In The 21st IEEE International Conference on Intelligent Transportation Systems.
[13] I. Loshchilov and F. Hutter (2019) Decoupled weight decay regularization. In International Conference on Learning Representations.
[14] V. Mnih, K. Kavukcuoglu, D. Silver, A. A. Rusu, J. Veness, M. G. Bellemare, A. Graves, M. Riedmiller, A. K. Fidjeland, G. Ostrovski, et al. (2015) Human-level control through deep reinforcement learning. Nature 518 (7540), p. 529–533.
[15] T. Rashid, M. Samvelyan, C. S. De Witt, G. Farquhar, J. Foerster, and S. Whiteson (2020) Monotonic value function factorisation for deep multi-agent reinforcement learning. Journal of Machine Learning Research 21 (178), p. 1–51.
[16] K. Son, D. Kim, W. J. Kang, D. E. Hostallero, and Y. Yi (2019) QTRAN: learning to factorize with transformation for cooperative multi-agent reinforcement learning. In International Conference on Machine Learning, p. 5887–5896.
[17] R. S. Sutton and A. Barto (2018) Reinforcement learning: an introduction. MIT Press.
[18] M. Tan (1993) Multi-agent reinforcement learning: independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, p. 330–337.
[19] M. Treiber, A. Hennecke, and D. Helbing (2000) Congested traffic states in empirical observations and microscopic simulations. Physical Review E 62 (2), p. 1805.
[20] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need.
Advances in Neural Information Processing Systems 30.
[21] A. Wegener, M. Piórkowski, M. Raya, H. Hellbrück, S. Fischer, and J. Hubaux (2008) TraCI: an interface for coupling road traffic and network simulators. In Proceedings of the 11th Communications and Networking Simulation Symposium, p. 155–163.
[22] K. Yang, X. Tang, S. Qiu, S. Jin, Z. Wei, and H. Wang (2023) Towards robust decision-making for autonomous driving on highway. IEEE Transactions on Vehicular Technology 72 (9), p. 11251–11263.
[23] W. Zhou, D. Chen, J. Yan, Z. Li, H. Yin, and W. Ge (2022) Multi-agent reinforcement learning for cooperative lane changing of connected and autonomous vehicles in mixed traffic. Autonomous Intelligent Systems 2 (1), p. 5.