Paper deep dive
CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem
Fengxiaoxiao Li, Xiao Mao, Mingfeng Fan, Yifeng Zhang, Yi Li, Tanishq Duhan, Guillaume Sartoretti
Abstract
Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.
Links
- Source: https://arxiv.org/abs/2603.19074v1
Full Text
CAMO: A Conditional Neural Solver for the Multi-objective Multiple Traveling Salesman Problem

Fengxiaoxiao Li1,∗, Xiao Mao2,∗, Mingfeng Fan1,†, Yifeng Zhang1, Yi Li3, Tanishq Duhan1, and Guillaume Sartoretti1

∗Equal contribution. †Corresponding author: ming.fan@nus.edu.sg. 1National University of Singapore, Singapore. 2China Unicom Shenzhen Branch, Shenzhen, China. 3Central South University, Changsha, China.

Abstract

Robotic systems often require a team of robots to collectively visit multiple targets while optimizing competing objectives, such as total travel cost and makespan. This setting can be formulated as the Multi-Objective Multiple Traveling Salesman Problem (MOMTSP). Although learning-based methods have shown strong performance on the single-agent TSP and multi-objective TSP variants, they rarely address the combined challenges of multi-agent coordination and multi-objective trade-offs, which introduce dual sources of complexity. To bridge this gap, we propose CAMO, a conditional neural solver for MOMTSP that generalizes across varying numbers of targets, agents, and preference vectors, and yields high-quality approximations to the Pareto front (PF). Specifically, CAMO consists of a conditional encoder to fuse preferences into instance representations, enabling explicit control over multi-objective trade-offs, and a collaborative decoder that coordinates all agents by alternating agent selection and node selection to construct multi-agent tours autoregressively. To further improve generalization, we train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes. Extensive experiments show that CAMO outperforms both neural and conventional heuristics, achieving a closer approximation of PFs. In addition, ablation results validate the contributions of CAMO's key components, and real-world tests on a mobile robot platform demonstrate its practical applicability.
I Introduction

The Multiple Traveling Salesman Problem (MTSP) aims to plan routes for multiple agents such that each target location is visited exactly once, while optimizing a predefined objective over the resulting tours. MTSP arises naturally in multi-robot systems that require coordinated task allocation and route planning under practical constraints. Representative applications include cooperative dual-arm rearrangement [18], constrained multiple-depot routing [32], multi-platform agricultural operations [3], and cooperative multi-agent navigation in dynamic environments [19].

Based on the objective definition, MTSP is commonly categorized into Min-Sum MTSP, which minimizes the total tour length, and Min-Max MTSP, which minimizes the longest tour length (i.e., the makespan). Both variants are NP-hard to solve optimally [12] and belong to the class of single-objective combinatorial optimization problems (SOCOPs). In practice, however, decision-makers often need to balance competing goals; for example, minimizing total travel distance may worsen workload imbalance among robots. Such requirements motivate the Multi-Objective MTSP (MOMTSP) [1], a representative multi-agent multi-objective combinatorial optimization problem (MOCOP), which seeks a diverse set of Pareto-optimal solutions under different preferences that capture trade-offs among conflicting objectives. MOMTSP therefore poses dual challenges: coordinating multiple agents while simultaneously modeling diverse trade-offs among objectives.

Traditional Multi-Objective Evolutionary Algorithms (MOEAs) [34] have been widely adopted to approximate the Pareto front (PF) when handling MOMTSP. However, their population-based iterative search often incurs high computational cost and limited scalability when handling large-scale problem instances.
Moreover, many approaches rely on handcrafted operators or problem-specific heuristics, which require substantial expert knowledge and reduce adaptability across different robotic scenarios [6]. As robotic systems increasingly require real-time responsiveness to dynamic events (e.g., newly appearing targets or robot failures), existing approaches [25] struggle to meet these demands simultaneously.

In recent years, learning-based methods, particularly deep reinforcement learning (DRL), have achieved strong performance on combinatorial optimization problems such as single-agent TSP [12] and multi-objective TSP. However, extending these neural solvers to MOMTSP is challenging because it requires handling both multi-objective trade-offs and multi-agent coordination within a single policy. On the multi-objective side, the policy must adapt to different preference vectors and produce solutions that cover diverse regions of the Pareto front, rather than optimizing a single scalar reward. On the multi-agent side, the model must coordinate a variable number of agents under a predefined team size and handle agent idling consistently (e.g., some agents may remain at the depot under preferences that prioritize minimizing total tour length), while still producing an executable multi-agent plan with well-defined routes for all agents. This requirement is not captured by directly extending single-agent formulations or by models that assume flexible vehicle usage, as in multi-objective capacitated vehicle routing problems (MOCVRP) [11].

To address these limitations, we propose CAMO (Conditional Attention for Multi-objective Optimization), a conditional neural solver for MOMTSP. CAMO adopts a specialized encoder–decoder architecture: a conditional encoder fuses preference and instance features to model multi-objective trade-offs, while a collaborative decoder alternates between an agent-selection module and a node-selection module to construct tours.
The decoder respects a predefined team size and scales naturally to varying numbers of agents. In particular, the agent-selection module coordinates the team by attending to a global context and all agents' current states when selecting which agent acts next, thereby enabling effective multi-agent collaboration. The main contributions are summarized as follows:

• We propose CAMO, a conditional neural solver for MOMTSP that jointly addresses multi-objective trade-offs and multi-agent coordination, and generalizes across varying numbers of targets, agents, and preference vectors. We further train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes to improve cross-scale generalization.

• We design a conditional encoder with conditional attention (CA) to integrate preference and instance information, together with a gated aggregation (GA) module to enhance representations, and develop a collaborative decoder that coordinates agents via global–agent state attention with alternating agent and node selection.

• We conduct extensive experiments to demonstrate CAMO's effectiveness and validate real-world applicability through physical tests on a mobile robot platform.

II Related Works

Traditional methods for MOCOPs. MOEAs are the cornerstone metaheuristic for MOCOPs [9]. They can be broadly categorized into dominance-based methods, such as NSGA-II [8], which combine Pareto ranking with crowding-distance preservation, and decomposition-based approaches, such as MOEA/D [33], which scalarize objectives into collaborative single-objective subproblems. Subsequent advances include indicator-based selection like IBEA [35], which guides search via hypervolume or ε-dominance, and reference-point-based methods like NSGA-III [7], designed to maintain uniform front coverage in many-objective settings where dominance pressure deteriorates.
Problem-specific evolutionary approaches have also been explored for MOMTSP, such as hybrid EDAs [25] that exploit probabilistic model building over permutation spaces.

Neural Multi-Objective Combinatorial Optimization. Existing neural solvers for MOCOPs primarily adopt a decomposition-based strategy [33], scalarizing the multi-objective problem into weighted single-objective subproblems parameterized by a preference vector $\lambda$. While early works trained a distinct model for each preference, suffering from inflexibility and high training costs, recent SOTA methods favor a single, generalized model conditioned on $\lambda$, enabling continuous preference interpolation at inference time. The central challenge is effectively integrating this preference signal; approaches range from hypernetworks like PMOCO [20], which generate task-specific weights from $\lambda$, to encoder-fusion methods like CNH [10] and WE-CA [5], which inject preference information directly into the attention mechanism. However, these representative works, including prominent collaborative DRL methods [30, 11], are exclusively designed for single-agent MOCOPs and fundamentally cannot address the cooperation required in multi-agent settings.

III Background

III-A Problem Formulation

The MOMTSP is formally defined on a complete graph $G(\Xi, E)$. The node set $\Xi = \{0\} \cup \Xi'$ consists of a single depot indexed by 0 (with location $(x_0, y_0)$) and a set of $n_\xi$ nodes $\Xi' = \{(x_1, y_1), \ldots, (x_{n_\xi}, y_{n_\xi})\}$. A fleet of $n_a$ agents $A = \{1, \ldots, n_a\}$ is available at the depot. The goal is to determine a set of routes, one for each agent, such that every agent starts and ends at the depot, and every node in $\Xi'$ is visited exactly once by exactly one agent. In our study, we consider two objectives:

$$f_1 = \min \sum_{a \in A} \sum_{i \in \Xi} \sum_{j \in \Xi} w_{ij}\,\psi_{ij}^{a},$$

which minimizes the total tour length of all agents; and

$$f_2 = \min \max_{a \in A} r_a,$$

which minimizes the makespan.
Here, $w_{ij}$ denotes the distance between nodes $i$ and $j$, $\psi_{ij}^{a} \in \{0, 1\}$ indicates whether agent $a$ travels from $i$ to $j$, and $r_a$ represents the cumulative travel distance of agent $a$.

MOMTSP is challenging due to its NP-hardness, involving two coupled decisions: assigning nodes to agents and determining each agent's visiting order. For instance, an instance with 100 cities and 4 agents already yields $100! \times \binom{99}{3} = 100! \times \frac{99!}{3!\,96!} \approx 10^{163}$ candidate solutions. Beyond this, introducing two objectives fundamentally transforms the search: rather than finding a single optimum, one must identify the PF of non-dominated solutions. Verifying Pareto optimality requires pairwise comparisons across all candidates, up to $O(10^{326})$ in the worst case, rendering exhaustive search utterly infeasible and dwarfing even the estimated number of atoms in the observable universe ($10^{78}$–$10^{82}$).

Definition 1 (Pareto Dominance). For a problem with $n$ objectives, let $\Pi$ comprise the set of all feasible solutions. A solution $\pi \in \Pi$ is said to dominate another solution $\pi' \in \Pi$ (denoted $\pi \prec \pi'$) if and only if $\forall i \in \{1, \ldots, n\},\ f_i(\pi) \le f_i(\pi')$ and $\exists j \in \{1, \ldots, n\},\ f_j(\pi) < f_j(\pi')$.

Definition 2 (Pareto Optimality). A solution $\pi^* \in \Pi$ is Pareto optimal if it is not dominated by any other solution $\pi \in \Pi$. The set of all such solutions is the Pareto set, $P = \{\pi^* \in \Pi \mid \nexists\, \pi \in \Pi,\ \pi \prec \pi^*\}$. The image of this set in the objective space is the PF, $\mathcal{F} = \{F(\pi^*) \mid \pi^* \in P\}$, where $F(\pi) = (f_1(\pi), \ldots, f_n(\pi))^\top$ maps each solution to a point.

III-B Objective Decomposition Strategy

Decomposition strategies are widely used to solve MOCOPs [34]; they reformulate the MOCOP, $F(\pi) = (f_1(\pi), \ldots, f_n(\pi))^\top$, as a set of single-objective combinatorial optimization problems (SOCOPs). Each SOCOP is parameterized by a preference (weight) vector $\lambda = (\lambda_1, \ldots, \lambda_n)$, where $\lambda_j \ge 0$.
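Definitions 1 and 2 translate directly into code; a minimal sketch (function names and toy objective vectors are illustrative, not from the paper) of dominance checking and non-dominated filtering over objective pairs (total tour length, makespan):

```python
def dominates(f_a, f_b):
    """Pareto dominance (Definition 1): f_a is no worse than f_b in every
    objective and strictly better in at least one."""
    return (all(x <= y for x, y in zip(f_a, f_b))
            and any(x < y for x, y in zip(f_a, f_b)))

def pareto_front(points):
    """Keep only the non-dominated objective vectors (Definition 2)."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# Toy candidates as (total tour length, makespan).
candidates = [(10.0, 4.0), (12.0, 3.0), (11.0, 5.0), (10.0, 3.5)]
front = pareto_front(candidates)
# front keeps (12.0, 3.0) and (10.0, 3.5); the other two are dominated
```

Note the quadratic number of pairwise checks, which is exactly what makes exhaustive Pareto verification intractable at the scales discussed above.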
By solving these SOCOPs with diverse $\lambda$ vectors, an approximation of the PF is obtained. Given its ability to find solutions on non-convex PFs, we adopt the Tchebycheff (TC) decomposition to scalarize the MOMTSP [20]. This function minimizes the maximum weighted deviation from an ideal objective vector $z^* = (z_1^*, \ldots, z_n^*)$. For a given preference vector $\lambda_i$, a Single-Objective MTSP (SOMTSP) is defined as:

$$\min_{\pi} g^{tc}(\pi \mid \lambda_i, z^*) = \max_{j=1,\ldots,n} \{\lambda_i^{j}\, |f_j(\pi) - z_j^*|\}. \tag{1}$$

IV CAMO

IV-A Overview

To achieve high-quality approximate Pareto-optimal solutions for MOMTSP, we propose a conditional neural solver, CAMO, conditioned on the instance context $\mathcal{I}$, preference $\lambda$, agent count $n_a$, and node scale $n_\xi$. Our CAMO is designed to learn a stochastic policy $p_\theta$ for obtaining the approximate output solution. The final output is a sequence of action tuples $(a_t, \xi_t)$, consisting of selected agents and nodes. The output solution is denoted as $\pi = \{(a_0, \xi_0), \ldots, (a_T, \xi_T)\}$, $\xi_t \in \Xi$, $a_t \in A$, expressed as:

$$p_\theta(\pi \mid \mathcal{I}, \lambda, n_a, n_\xi) = \prod_{t=0}^{T} p_\theta\big((a_t, \xi_t) \mid \mathcal{I}, \lambda, n_a, n_\xi, (a_{0:t-1}, \xi_{0:t-1})\big), \tag{2}$$

where $\theta$ represents the set of learnable parameters, $(a_t, \xi_t)$ and $(a_{0:t-1}, \xi_{0:t-1})$ represent the selected agent–node action tuple and the partial solution at time step $t$, and $T$ denotes the number of steps to construct the solution $\pi$.

To this end, we utilize an encoder–decoder architecture. The conditional encoder embeds and fuses the instance context $\mathcal{I}$ and preference $\lambda$, allowing the model to understand their combined impact on the SOMTSPs and the eventual PF. The collaborative decoder, comprising iterative agent-selection and node-selection modules, provides dual scalability and generalizes to unseen agent counts $n_a$ and node scales $n_\xi$, offering an advantage over inflexible classifier-based structures [31].
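The TC scalarization of Eq. (1), which later serves as the training reward, is a one-liner; a minimal sketch with toy numbers (my own, not the paper's implementation):

```python
def tchebycheff(objs, pref, z_star):
    """Tchebycheff scalarization (Eq. 1): the maximum weighted absolute
    deviation of the objective vector from the ideal point z*."""
    return max(w * abs(f - z) for w, f, z in zip(pref, objs, z_star))

# Preference (0.7, 0.3): emphasize total tour length over makespan.
g = tchebycheff(objs=(10.0, 4.0), pref=(0.7, 0.3), z_star=(8.0, 3.0))
# g = max(0.7 * 2.0, 0.3 * 1.0) = 1.4
```

Sweeping the preference vector and re-solving the resulting SOMTSPs is what traces out the PF approximation.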
Moreover, the agent-selection module coordinates multi-agent decision making via global–agent state attention, enabling effective collaboration. The neural architecture of our CAMO is illustrated in Fig. 1. We elaborate on the details of these key components in the following sections.

Figure 1: The network architecture of CAMO.

IV-B Conditional Encoder

The conditional encoder takes an instance $\mathcal{I}$ and a preference $\lambda$ as inputs and outputs enhanced node embeddings, generated through a stack of $L$ CA layers. Each layer consists of a CA sub-layer and a feed-forward (F) sub-layer.

Raw Features. The conditional encoder first projects raw node features $S_\Xi = \{S_0, S_1, \ldots, S_{n_\xi}\}$ and the preference $\lambda$ into initial node embeddings $H_\xi^0 = \{h_0^0, \ldots, h_{n_\xi}^0\}$ and a preference embedding $h_\lambda^0$. Passing these directly to an Attention Model (AM) [14] encoder would indiscriminately blend heterogeneous inputs and obscure their specific interdependencies. To address this, we introduce a CA mechanism that conditions node embeddings on the preference vector and jointly updates preference and node embeddings to better capture preference–node interactions, thereby improving the solution quality. At layer $l \in \{1, \ldots, L\}$, the CA layer receives the node embeddings $H_\xi^{(l-1)} = \{h_0^{(l-1)}, \ldots, h_{n_\xi}^{(l-1)}\}$ and a preference embedding $h_\lambda^{(l-1)}$ as inputs.

CA Sublayer. As illustrated in Fig. 1, the node embeddings $H_\xi^{(l-1)} = \{h_0^{(l-1)}, \ldots, h_{n_\xi}^{(l-1)}\}$ and the preference embedding $h_\lambda^{(l-1)}$ are first passed through Instance Normalization (IN) [29]: $h_i^{*(l-1)} = \mathrm{IN}(h_i^{(l-1)})$, $h_\lambda^{*(l-1)} = \mathrm{IN}(h_\lambda^{(l-1)})$.
Then, we condition the normalized node embeddings $h_i^{*(l-1)}$ on the normalized preference embedding $h_\lambda^{*(l-1)}$ via a feature-wise affine (FWA) transformation [22], as follows:

$$\gamma^{(l-1)} = W^{\gamma} h_\lambda^{*(l-1)}, \quad \beta^{(l-1)} = W^{\beta} h_\lambda^{*(l-1)}, \quad h_i'^{(l-1)} = \gamma^{(l-1)} \circ h_i^{*(l-1)} + \beta^{(l-1)}, \quad \forall i \in \{0, \ldots, n_\xi\}, \tag{3}$$

where $W^{\gamma}$ and $W^{\beta}$ are trainable matrices, and $\circ$ denotes element-wise multiplication. The resulting node embeddings $h_i'^{(l-1)}$ and the normalized preference embedding $h_\lambda^{*(l-1)}$ are then jointly updated with a standard multi-head attention (MHA) block to encourage information exchange and mitigate representational conflicts. Let $C^{(l-1)} = \{h_\lambda^{*(l-1)}\} \cup \{h_j'^{(l-1)}\}_{j=0}^{n_\xi}$ denote the context used as Key/Value vectors:

$$\hat{h}_\lambda^{(l-1)} = \mathrm{MHA}(q = h_\lambda^{*(l-1)},\ k = C^{(l-1)},\ v = C^{(l-1)}), \tag{4}$$

$$\hat{h}_i^{(l-1)} = \mathrm{MHA}(q = h_i'^{(l-1)},\ k = C^{(l-1)},\ v = C^{(l-1)}), \quad \forall i \in \{0, \ldots, n_\xi\}, \tag{5}$$

where $q$, $k$, and $v$ represent the Query, Key, and Value vectors in MHA, respectively. Finally, we replace standard additive residual connections [27] with GA [17] to regulate information flow and enhance expressive power [26]. We apply GA at the end of the CA sublayer:

$$\mathrm{GA}(X, Y) = X + \sigma(U_g X - b_g) \odot Y, \tag{6}$$

where $X = \{h_\lambda^{(l-1)}\} \cup \{h_j^{(l-1)}\}_{j=0}^{n_\xi}$ denotes the inputs to the CA sublayer, and $Y = \{\hat{h}_\lambda^{(l-1)}\} \cup \{\hat{h}_j^{(l-1)}\}_{j=0}^{n_\xi}$ denotes the outputs of the MHA block. Besides, $U_g$ and $b_g$ are trainable parameters, $\sigma$ is the Sigmoid function [13], and $\odot$ denotes element-wise multiplication.

F Sublayer. The F sublayer takes the outputs of the CA sublayer as input. It comprises Instance Normalization (IN), a two-layer MLP [24] with ReLU activation [21], and a dedicated GA module, applied sequentially.
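The two non-standard pieces of the CA sublayer, the feature-wise affine conditioning of Eq. (3) and the gated aggregation of Eq. (6), can be sketched in NumPy. Shapes and names here are illustrative assumptions (a toy $d = 8$ instead of the paper's $d_h = 128$), not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # embedding dim (d_h = 128 in the paper)
h_nodes = rng.normal(size=(5, d))        # normalized node embeddings h_i^{*(l-1)}
h_pref = rng.normal(size=(d,))           # normalized preference embedding
W_gamma = rng.normal(size=(d, d))
W_beta = rng.normal(size=(d, d))

# Eq. (3): the preference embedding generates a per-feature scale and shift
# that conditions every node embedding on the current preference.
gamma = W_gamma @ h_pref
beta = W_beta @ h_pref
h_cond = gamma * h_nodes + beta          # broadcast over the node axis

def gated_aggregation(X, Y, U_g, b_g):
    """Eq. (6): a learned sigmoid gate on X decides how much of the
    sublayer output Y is added back, replacing the plain residual X + Y."""
    gate = 1.0 / (1.0 + np.exp(-(X @ U_g.T - b_g)))
    return X + gate * Y

U_g = rng.normal(size=(d, d))
b_g = rng.normal(size=(d,))
out = gated_aggregation(h_nodes, h_cond, U_g, b_g)
assert out.shape == (5, d)               # one gated embedding per node
```

The gate lies in $(0, 1)$ per feature, so GA interpolates between keeping the input unchanged (gate near 0) and a full residual update (gate near 1).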
For the GA in the F sublayer, we set $X$ to the inputs of the F sublayer and $Y$ to the MLP outputs. The final node embeddings $H_\xi^L = \{h_0^L, \ldots, h_{n_\xi}^L\}$ produced by the conditional encoder serve as inputs to the collaborative decoder.

IV-C Collaborative Decoder

The collaborative decoder, composed of an agent-selection module and a node-selection module, takes as input the final node embeddings produced by the conditional encoder, together with a graph embedding $g$ obtained by average pooling over these node embeddings. It constructs the solution auto-regressively. At each decoding step, the decoder sequentially utilizes the agent-selection module and then the node-selection module, iteratively generating a sequence of agent–node tuples to build the complete multi-agent solution.

Agent-selection Module. At decoding step $t$, we construct the current node-sequence embeddings for all agents. Let $\tilde{H}_i^t = \{\tilde{h}_i^0, \ldots, \tilde{h}_i^{t-1}\}$ denote the sequence of node embeddings visited by agent $i$ before step $t$. The set of all sequences is $\mathcal{H}^t = \{\tilde{H}_1^t, \ldots, \tilde{H}_{n_a}^t\}$. This set $\mathcal{H}^t$ is processed via max pooling and a linear layer to obtain an aggregated embedding: $\tilde{\mathcal{H}}^t = \mathrm{maxpool}(\mathcal{H}^t) W_a + b_a$, where $W_a$ and $b_a$ are trainable parameters. Here, $\tilde{\mathcal{H}}^t = \{\tilde{\mathcal{H}}_1^t, \ldots, \tilde{\mathcal{H}}_{n_a}^t\}$ denotes the set of current node-sequence embeddings for all $n_a$ agents at step $t$. Subsequently, we compute a global representation $\tilde{g}^t$ by performing average pooling over these per-agent embeddings and then fusing the result with the graph embedding $g$, that is, $\tilde{g}^t = \frac{1}{n_a} \sum_{i=1}^{n_a} \tilde{\mathcal{H}}_i^t + g$. Concurrently, we define a state context vector $C_i^\xi = (x_i, y_i; r_i)$ for each agent $i$, which captures its current coordinates $(x_i, y_i)$ and cumulative travel distance $r_i$.
The global representation $\tilde{g}^t$ and each agent's state context $C_i^\xi$ are then processed through separate F layers to generate the final global Query vector $G^t$ and the agent-specific Key vectors $\tilde{G}_i^t$, respectively:

$$G^t = \mathrm{F}(\tilde{g}^t); \quad \tilde{G}_i^t = \mathrm{F}(C_i^\xi), \quad \forall i \in \{1, 2, \ldots, n_a\}. \tag{7}$$

These F layers share the same architecture as those in the encoder but utilize independently trained parameters. Finally, we use $G^t$ as the Query and $\tilde{G}_i^t$ as the Key to compute the attention score $s_i^a$ using a single-head attention (SHA) layer:

$$s_i^a = \kappa \cdot \tanh\big((\tilde{G}_i^t)^\top G^t / \sqrt{d_k}\big), \quad \forall i \in \{1, 2, \ldots, n_a\}, \tag{8}$$

where $\kappa$ is a clipping coefficient to ensure numerical stability and $d_k$ is the dimensionality of the Key vector. After calculating the scores $s_i^a$, agents that have already returned to the depot are masked. A softmax function is then applied to the attention scores to yield the agent-selection probabilities $\{p_1^a, p_2^a, \ldots, p_{n_a}^a\}$, thereby determining the agent $a_t$ selected at step $t$. This design enables the module to adapt to varying scales of agents, ensuring model scalability.

Node-selection Module. Once the agent $a_t$ is selected, the node-selection process begins. We construct a node context vector $C_a^t = [c_a^{t-1}; r_a]$, where $c_a^{t-1}$ is the embedding of the node visited by agent $a_t$ at the previous step $t-1$, and $r_a$ is its current travel distance. The selection is determined using an MHA block followed by a SHA layer. In the MHA with $M$ heads, the Query ($q_m^\xi$) is derived from the node context $C_a^t$, while the Key ($k_{i,m}^\xi$) and Value ($v_{i,m}^\xi$) are derived from the final node embeddings $h_i^L$:

$$q_m^\xi = C_a^t W_{q,m}^\xi; \quad k_{i,m}^\xi = h_i^L W_{k,m}^\xi; \quad v_{i,m}^\xi = h_i^L W_{v,m}^\xi, \quad \forall i \in \{0, \ldots, n_\xi\}, \tag{9}$$

where $W_{q,m}^\xi$, $W_{k,m}^\xi$, and $W_{v,m}^\xi$ are the trainable parameter matrices for head $m \in \{1, 2, \ldots, M\}$.
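The clipped scoring and masking of Eq. (8) can be sketched in NumPy (array names and sizes are illustrative assumptions, not the authors' code); the node-selection module applies the same mask-then-softmax pattern to visited nodes:

```python
import numpy as np

def agent_selection_probs(G_t, G_agents, returned, kappa=10.0):
    """Clipped SHA scores as in Eq. (8): kappa * tanh(<key, query> / sqrt(d_k)),
    then mask agents already back at the depot and apply a softmax."""
    d_k = G_t.shape[-1]
    scores = kappa * np.tanh(G_agents @ G_t / np.sqrt(d_k))
    scores[returned] = -np.inf               # masked agents get probability 0
    exp = np.exp(scores - scores[~returned].max())  # stable softmax
    return exp / exp.sum()

rng = np.random.default_rng(1)
G_t = rng.normal(size=(8,))                  # global Query G^t
G_agents = rng.normal(size=(3, 8))           # per-agent Keys
returned = np.array([False, True, False])    # agent 1 is back at the depot
p = agent_selection_probs(G_t, G_agents, returned)
# p[1] == 0 and the remaining probabilities sum to 1
```

The tanh clipping bounds every score to $[-\kappa, \kappa]$, which keeps the softmax well conditioned regardless of embedding magnitude.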
Next, the attention score $e_{i,m}^\xi$ is calculated, and a mask is applied to visited nodes:

$$e_{i,m}^\xi = q_m^\xi (k_{i,m}^\xi)^\top / \sqrt{d_k}, \quad \forall i \in \{0, \ldots, n_\xi\}, \tag{10}$$

$$e_{i,m}^\xi = \begin{cases} e_{i,m}^\xi, & \text{if node } i \text{ is unvisited}, \\ -\infty, & \text{if node } i \text{ is visited}. \end{cases} \tag{11}$$

The scores $e_{i,m}^\xi$ are normalized via softmax to obtain attention weights $\alpha_{i,m}^\xi$. These weights are used to compute the head output $z_m^\xi$. The outputs of all heads are concatenated and passed through a linear mapping $W^\xi$ to yield the advanced node context $q^\xi$:

$$\alpha_{i,m}^\xi = \mathrm{softmax}(e_{i,m}^\xi); \quad z_m^\xi = \sum_{i=0}^{n_\xi} \alpha_{i,m}^\xi v_{i,m}^\xi; \quad q^\xi = [z_1^\xi, z_2^\xi, \ldots, z_M^\xi] W^\xi, \tag{12}$$

where $W^\xi \in \mathbb{R}^{d_h \times d_h}$ is a trainable parameter matrix. The advanced node context $q^\xi$ is then used as the Query, while the final node embeddings serve as the Key vectors in the subsequent SHA layer. The attention score $s_i^\xi$ in SHA is calculated as:

$$s_i^\xi = \kappa \cdot \tanh\big(q^\xi (h_i^L)^\top / \sqrt{d_k}\big), \quad \forall i \in \{0, 1, \ldots, n_\xi\}. \tag{13}$$

A masking mechanism is applied to the scores $s_i^\xi$, and a softmax function is used to obtain the final node-selection probabilities $\{p_0^\xi, p_1^\xi, \ldots, p_{n_\xi}^\xi\}$ for agent $a_t$. The final node $\xi_t$ is chosen based on these probabilities using either a greedy or sampling strategy. During training, we adopt a customized sampling strategy that applies to both agent and node selection, thereby encouraging exploration. During inference, a greedy strategy is employed to maximize the expected reward. Crucially, the parameters in the agent- and node-selection decoders are independent of the number of agents and nodes. This ensures the policy network is scalable and can be applied to problems of varying scales.

IV-D Training Algorithm

We train the policy network $p_\theta$ using a REINFORCE-based objective to obtain approximate Pareto solutions for MOMTSP.
The training procedure is augmented with a customized sampling strategy that, for each batch, samples different agent counts $n_a$, node scales $n_\xi$, and preference vectors $\lambda$. Given an MOMTSP instance $\mathcal{I}$ with preference $\lambda$, we aim to maximize the expected return $J(\theta) = \mathbb{E}_{\pi \sim p_\theta(\pi \mid \mathcal{I}, \lambda, n_a, n_\xi)}[R(\pi)]$, where the reward $R(\pi) = g^{tc}(\pi \mid \lambda_i, z^*)$ is derived from the TC decomposition. To reduce the variance of the policy gradient, we adopt a baseline $R_b$:

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi \sim p_\theta(\pi \mid \mathcal{I}, \lambda, n_a, n_\xi)}\big[(R(\pi) - R_b) \times \nabla_\theta \log p_\theta(\pi \mid \mathcal{I}, \lambda, n_a, n_\xi)\big]. \tag{14}$$

For a batch of $B$ instances, we sample $K$ sequences $\pi_b^k$ for each instance $\mathcal{I}_b$ using the POMO strategy [15], and calculate the approximate gradient as below:

$$\nabla_\theta J(\theta) \approx \frac{1}{BK} \sum_{b=1}^{B} \sum_{k=1}^{K} \big[(R(\pi_b^k) - R_b) \times \nabla_\theta \log p_\theta(\pi_b^k \mid \mathcal{I}_b, \lambda_b, n_a^b, n_\xi^b)\big], \tag{15}$$

where the baseline $R_b = \frac{1}{K} \sum_{k=1}^{K} R(\pi_b^k)$ is the average reward of the $K$ sequences sampled for instance $\mathcal{I}_b$, where each sequence starts with a different first node. The full training process is detailed in Algorithm 1.
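The shared-baseline advantage that weights each log-probability gradient in Eq. (15) can be illustrated numerically (toy rewards of my own, not the full training loop):

```python
import numpy as np

# Rewards for K = 4 POMO rollouts of one instance, each started from a
# different first node (values are illustrative TC scalarizations).
R = np.array([1.40, 1.25, 1.60, 1.35])

R_b = R.mean()                # shared baseline: mean over the K rollouts
advantage = R - R_b           # multiplies each grad log p_theta term

# Centering removes the common offset shared by all rollouts of the same
# instance, so the advantages sum to (numerically) zero: rollouts better
# than the instance average are reinforced, worse ones are suppressed.
assert abs(advantage.sum()) < 1e-12
```

This is the variance-reduction step of Eq. (14): the baseline is unbiased because it does not depend on which rollout the gradient term comes from, only on the instance.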
Algorithm 1 Training Algorithm

Input: number of training epochs $E$, training rounds $T$, batch size $B$, sampling sequences $K$, agent counts set $D_A$, node scales set $D_\Xi$
Output: policy network $p_\theta$

1: Initialize policy network parameters $\theta$
2: for $e = 1$ to $E$ do
3:  for $t = 1$ to $T$ do
4:   $\mathcal{I}_b \leftarrow \mathrm{ProblemSampler}()$, $\forall b \in \{1, \ldots, B\}$
5:   $n_a \leftarrow \mathrm{AgentCountSampler}(D_A)$
6:   $n_\xi \leftarrow \mathrm{NodeScaleSampler}(D_\Xi)$
7:   $\lambda \leftarrow \mathrm{PreferenceSampler}()$
8:   $\pi_b^k \leftarrow \mathrm{TourSampler}(p_\theta(\cdot \mid \mathcal{I}_b, \lambda, n_a, n_\xi))$, $\forall b \in \{1, \ldots, B\}$, $\forall k \in \{1, \ldots, K\}$
9:   $R_b \leftarrow \frac{1}{K} \sum_{k=1}^{K} R(\pi_b^k)$, $\forall b \in \{1, \ldots, B\}$
10:  $\nabla_\theta J(\theta) \leftarrow \frac{1}{BK} \sum_{b=1}^{B} \sum_{k=1}^{K} \big[(R(\pi_b^k) - R_b) \cdot \nabla_\theta \log p_\theta(\pi_b^k \mid \cdot)\big]$
11:  $\theta \leftarrow \mathrm{Adam}(\theta, \nabla_\theta J(\theta))$
12:  end for
13: end for

V Experiments

V-A Experimental Settings

Problems and Instance Generation. We evaluate CAMO on MOMTSP instances where all node and depot locations are randomly generated within the $[0,1]^2$ unit square. We evaluate CAMO under different node scales and agent counts, including 20 nodes with 2 agents, 50 nodes with 3 agents, and 100 nodes with 4 agents, denoted as N20A2, N50A3, and N100A4, respectively. To boost generalization across varying problem sizes, we employ a customized sampling strategy. Specifically, for each training batch, we randomly sample the preference vector $\lambda$ by drawing each element from $[0,1]$ to induce diverse stochastic preferences [10]. When computing rewards, we normalize $\lambda$ to sum to 1, so that it represents the relative importance of the objectives. The node scale $n_\xi$ and agent number $n_a$ are randomly drawn from the integer ranges $[20, 100]$ and $[2, 4]$, respectively.

Hyperparameters. We train CAMO for $E = 400$ epochs, sampling $N = 100{,}000$ instances per epoch. We use the Adam optimizer with a batch size $B = 64$, a learning rate of $10^{-4}$, and a weight decay of $10^{-6}$. The solution sampling size per instance, $K$, is set to the node scale, $K = n_\xi$.
For the network architecture, the embedding dimension is $d_h = 128$. The MHA employs $M = 8$ heads with Query and Key dimensions of 16. The hidden dimension of the F sublayer is $d_f = 512$. The encoder consists of $L = 6$ CA layers, and the clipping coefficient $\kappa$ is 10.

Baselines. We compare CAMO against a range of established baseline methods, including both classical MOEAs and a representative neural solver. Specifically, the selected traditional MOEAs include: 1) MOEA/D [33], a classical decomposition-based MOEA, where the neighborhood size, neighborhood selection probability, and the maximum number of iterations are set to 15, 0.7, and 4,000, respectively; 2) MOGLS [4], a multi-objective genetic local search algorithm with a population size of 140, executed for 10,000 iterations with 30 local search steps per iteration; 3) NSGA-II [8], a widely used Pareto dominance-based genetic algorithm with a population size of 100 and 4,000 iterations; and 4) NSGA-III [7], an extension of NSGA-II incorporating reference directions, also run for 4,000 iterations. Inspired by the work in [16], we apply problem-specific local operators for solving MOMTSP. For the DRL-based category, we benchmark against 5) MO-PARCO: we adapt PARCO [2], a SOTA neural solver originally designed for multi-agent SOCOPs, where agents construct solutions in parallel through coordinated decision-making. To make it applicable to the multi-objective setting, we extend PARCO by conditioning the policy on preference vectors via the same CA mechanism.

Inference and Evaluation Metrics. During inference, we employ an Instance Augmentation (IA) strategy [15] for CAMO (i.e., CAMO-IA), solving 8 transformations of each instance and reporting the solution with the best HV as the final result. We evaluate all methods using three metrics: Hypervolume (HV) [36], Gap, and total runtime (Time). HV is a widely used indicator for measuring the coverage of the PF, where a larger HV value indicates better performance.
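For the two-objective case considered here, the hypervolume dominated by a non-dominated front relative to a reference point reduces to a sum of rectangles; a minimal sketch of the computation (toy numbers of my own, not the paper's evaluation code):

```python
def hypervolume_2d(front, ref):
    """HV of a 2-objective minimization front w.r.t. a reference point
    dominated by every front point: sort by the first objective and sum
    the rectangular slab each point adds below the previous one."""
    pts = sorted(front)            # ascending f1; f2 descends if non-dominated
    hv, prev_f2 = 0.0, ref[1]
    for f1, f2 in pts:
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv

front = [(1.0, 4.0), (2.0, 2.0), (4.0, 1.0)]
hv = hypervolume_2d(front, ref=(5.0, 5.0))
# slabs: 4*1 + 3*2 + 1*1 = 11.0
```

Dividing by the area of the box spanned by the ideal point and the reference point would give the normalized HV in $[0,1]$ reported below.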
Given a set $P \subset \mathbb{R}^n$ and a reference point $r^*$ dominated by all solutions in $P$, the HV is defined as the Lebesgue measure of the region that $P$ dominates up to the reference point $r^*$: $HV(P) = \zeta^n\big(\{r \in \mathbb{R}^n \mid \exists\, u \in P \text{ such that } u \prec r \prec r^*\}\big)$, where $\zeta^n$ denotes the $n$-dimensional Lebesgue measure (i.e., the volume in $\mathbb{R}^n$). We report normalized HV values in $[0,1]$ with the same $r^*$ across all methods. In our experiments, for each problem scale, we set the reference point to 1.1 times the maximum objective value observed across all methods, ensuring that it is dominated by the union of their solution sets. For convenience, we then round the reference point up to the nearest integer. The Gap metric quantifies the relative performance difference between a given method and CAMO-IA in terms of HV, formally defined as $Gap = (HV_{\text{CAMO-IA}} - HV_i)/HV_{\text{CAMO-IA}} \times 100\%$, where $HV_i$ denotes the HV of the $i$-th method. A smaller Gap indicates performance closer to CAMO-IA. All experiments are implemented in Python and conducted on a device equipped with an NVIDIA Ampere A100 GPU and an AMD EPYC 7742 CPU. To ensure reproducibility, our code implementation and datasets will be made publicly available upon paper acceptance.

V-B Experimental Results

Comparison Analysis. We evaluate all algorithms on 100 randomly generated instances for each configuration, including N20A2, N50A3, and N100A4. Since MOMTSP lacks standard a priori reference points for HV calculation, we establish them via conservative estimation from the complete solution pool of all methods. The reference points are set to [7, 6] for N20A2, [14, 11] for N50A3, and [29, 20] for N100A4, respectively. This setup ensures that all obtained non-dominated solutions are covered, preventing evaluation bias in the HV metric.

TABLE I: Baseline comparison results.
Method    | N20A2: HV↑ / Gap / Time  | N50A3: HV↑ / Gap / Time  | N100A4: HV↑ / Gap / Time
MOEA/D    | 0.201 / 16.71% / 32.92m  | 0.374 / 15.75% / 2.58h   | 0.394 / 37.33% / 4.56h
MOGLS     | 0.214 / 11.57% / 7.72h   | 0.207 / 53.38% / 19.88h  | 0.227 / 63.85% / 38.43h
NSGA-II   | 0.236 / 2.48% / 34.89m   | 0.389 / 12.39% / 2.47h   | 0.416 / 33.76% / 4.28h
NSGA-III  | 0.235 / 3.06% / 45.37m   | 0.386 / 13.18% / 2.50h   | 0.402 / 35.94% / 4.34h
MO-PARCO  | 0.229 / 5.46% / 1.53m    | 0.422 / 4.96% / 2.56m    | 0.591 / 5.95% / 3.22m
CAMO      | 0.240 / 0.91% / 10.55s   | 0.439 / 1.02% / 20.77s   | 0.624 / 0.65% / 55.31s
CAMO-IA   | 0.242 / 0.00% / 16.96s   | 0.444 / 0.00% / 59.52s   | 0.628 / 0.00% / 5.44m

The experimental results summarized in Table I show that CAMO consistently outperforms all baseline methods, with its advantage becoming more pronounced as the problem scale increases. Moreover, when equipped with the IA strategy, CAMO-IA achieves further performance improvements. While traditional MOEAs remain competitive in the small-scale N20A2 scenario, their performance degrades substantially on the larger N50A3 and N100A4 instances. MO-PARCO remains competitive across all problem scales, yet consistently underperforms our CAMO. In terms of efficiency, the neural solvers significantly reduce computation time.

Generalization Analysis. To evaluate generalization, we test CAMO and all baselines on 100 randomly generated instances for each larger-scale configuration, including N100A6, N150A4, N150A6, N200A4, and N200A6, which are unseen during training. For fairness, MO-PARCO is evaluated using the checkpoint trained on problem size 100 to ensure its best performance. We set the HV reference points to [40, 21] for N100A6, [43, 32] for N150A4, [49, 29] for N150A6, [57, 41] for N200A4, and [62, 37] for N200A6. The results in Table II show that CAMO-IA achieves the highest HV in every case, demonstrating strong generalization to larger-scale MOMTSPs.

TABLE II: Generalization experiment results.
Method    | N100A6: HV↑ / Gap / Time | N150A4: HV↑ / Gap / Time | N150A6: HV↑ / Gap / Time | N200A4: HV↑ / Gap / Time | N200A6: HV↑ / Gap / Time
MOEA/D    | 0.536 / 24.36% / 4.51h   | 0.348 / 50.41% / 5.83h   | 0.426 / 41.54% / 5.97h   | 0.265 / 64.02% / 7.26h   | 0.342 / 54.82% / 7.47h
MOGLS     | 0.393 / 44.49% / 50.96h  | 0.214 / 69.47% / 52.39h  | 0.289 / 60.36% / 67.88h  | 0.186 / 74.80% / 65.65h  | 0.243 / 67.86% / 81.76h
NSGA-II   | 0.528 / 25.42% / 4.45h   | 0.379 / 45.93% / 5.91h   | 0.416 / 42.94% / 5.94h   | 0.327 / 55.69% / 7.53h   | 0.348 / 53.97% / 7.75h
NSGA-III  | 0.518 / 26.84% / 4.51h   | 0.356 / 49.17% / 5.97h   | 0.396 / 45.75% / 6.29h   | 0.288 / 61.04% / 7.61h   | 0.315 / 58.37% / 7.82h
MO-PARCO  | 0.669 / 5.47% / 3.46m    | 0.664 / 5.22% / 4.95m    | 0.699 / 4.05% / 5.49m    | 0.682 / 7.55% / 6.35m    | 0.720 / 4.71% / 7.23m
CAMO      | 0.704 / 0.59% / 1.39m    | 0.698 / 0.53% / 2.48m    | 0.724 / 0.63% / 3.39m    | 0.734 / 0.48% / 6.00m    | 0.752 / 0.53% / 7.15m
CAMO-IA   | 0.708 / 0.00% / 7.62m    | 0.701 / 0.00% / 16.44m   | 0.729 / 0.00% / 23.16m   | 0.738 / 0.00% / 35.42m   | 0.756 / 0.00% / 49.06m

Benchmark Evaluation. We further evaluate CAMO’s generalization on out-of-distribution benchmark instances adapted from TSPLIB [23]. We denote these instances as “Xi-Aj”, where X is the TSPLIB instance name, i is the number of nodes, and j is the number of agents in our modified setting.

TABLE III: Performance comparison between NSGA-II and CAMO on benchmark instances.

Instance     | Obj.Sum↓: NSGA-II / CAMO | Obj.Max↓: NSGA-II / CAMO
ulysses22-A2 | 4.00 / 3.95              | 2.28 / 2.33
swiss42-A2   | 5.61 / 5.14              | 3.11 / 2.84
gr21-A3      | 3.78 / 3.78              | 2.01 / 2.08
berlin52-A4  | 6.14 / 5.54              | 2.06 / 1.82
att48-A5     | 5.84 / 5.48              | 1.91 / 1.93
brazil58-A6  | 5.03 / 4.19              | 2.07 / 2.06
eil76-A7     | 11.91 / 8.30             | 2.78 / 2.26
pr76-A10     | 11.73 / 6.90             | 2.74 / 2.47
kroA150-A15  | 28.76 / 9.49             | 3.44 / 2.27
kroB200-A20  | 41.83 / 10.99            | 4.09 / 2.56

For rigorous evaluation and fair comparison, we apply a unified preprocessing procedure to the TSPLIB instances that normalizes all node coordinates into the [0, 1]² unit square. For instances that only provide edge-weight distance matrices (i.e., swiss42, gr21, and brazil58), we first reconstruct node coordinates using multidimensional scaling (MDS) [28], and then apply the same normalization.
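The paper cites MDS [28] for recovering coordinates from distance-matrix-only instances but does not specify the variant. Below is a sketch using classical (Torgerson) MDS plus a per-axis min-max normalization into [0, 1]²; both the MDS variant and the choice of per-axis (rather than isotropic) scaling are our assumptions:

```python
import numpy as np

def classical_mds(D: np.ndarray, k: int = 2) -> np.ndarray:
    """Embed an (n, n) distance matrix into k dimensions (Torgerson MDS)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n       # centering matrix
    B = -0.5 * J @ (D ** 2) @ J               # double-centered Gram matrix
    w, V = np.linalg.eigh(B)                  # eigenvalues in ascending order
    idx = np.argsort(w)[::-1][:k]             # keep the top-k eigenpairs
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

def to_unit_square(coords: np.ndarray) -> np.ndarray:
    """Min-max normalize each coordinate axis into [0, 1]."""
    lo, hi = coords.min(axis=0), coords.max(axis=0)
    return (coords - lo) / (hi - lo)
```

For a distance matrix that truly comes from points in the plane, classical MDS recovers the configuration exactly (up to rotation and translation), so pairwise distances are preserved. Note that per-axis min-max changes the aspect ratio; an isotropic alternative would divide both axes by the larger coordinate range.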
We compare CAMO with NSGA-II in terms of Obj.Sum and Obj.Max, which denote the minimum values attained on the PF along the two objectives, where smaller is better. As shown in the benchmark comparison table, CAMO achieves superior performance on most benchmark instances. In terms of Obj.Sum, CAMO obtains lower values than NSGA-II on 9 out of 10 instances, with significant gains on medium- and large-scale cases such as pr76-A10, kroA150-A15, and kroB200-A20. For Obj.Max, CAMO outperforms NSGA-II on 6 out of 10 instances, indicating that it also reduces makespan in most cases. These results demonstrate that CAMO maintains strong generalization on benchmark instances whose distributions differ from those used in training.

Ablation Study. We conduct an ablation study to assess the contributions of key components in CAMO, including the GA module, the CA mechanism, and the customized sampling strategy. Note that we do not ablate the collaborative decoder, as removing it prevents CAMO from handling the multi-agent setting. First, we replace the GA module with a residual connection and replace the conditional attention (CA) mechanism with standard self-attention, resulting in the variants CAMO w/o GA and CAMO w/o CA, respectively. The performance of the original CAMO and its variants on N20A2, N50A3, and N100A4 is illustrated in Fig. 2. The results show that CAMO's performance deteriorates significantly when any of these components is removed, confirming the effectiveness of each key component in the model architecture.

Figure 2: Ablation study results of CAMO and its variants. Lower values of Gap (%) indicate better performance.

Next, we train a CAMO variant without the customized sampling strategy by fixing the training instances to the N100A4 configuration, denoted as CAMO (N100A4). We compare the training time and HV across problem scales (N50A3, N100A4, and N200A4).
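The two objectives underlying Obj.Sum and Obj.Max are the total travel cost and the makespan of a multi-agent solution. They can be computed directly from a set of agent tours as sketched below; the convention that each agent starts and ends at a shared depot is our assumption for illustration:

```python
import math

def tour_length(coords, tour, depot=0):
    """Euclidean length of the closed route depot -> tour nodes -> depot."""
    path = [depot, *tour, depot]
    return sum(
        math.dist(coords[path[i]], coords[path[i + 1]])
        for i in range(len(path) - 1)
    )

def momtsp_objectives(coords, tours):
    """Return (total_cost, makespan) for one multi-agent solution.

    total_cost sums all agents' tour lengths; makespan is the longest
    single tour, i.e., the time until the last agent finishes.
    """
    lengths = [tour_length(coords, t) for t in tours]
    return sum(lengths), max(lengths)
```

Minimizing the first component with preference λ = [1, 0] recovers a pure total-cost objective, while λ = [0, 1] recovers a pure makespan objective, which matches the two extreme preference vectors used in the experiments.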
Table IV shows that our customized sampling strategy reduces training time by more than 50% while achieving competitive performance on all scales. It is only slightly inferior to CAMO (N100A4) on the N100A4 setting, demonstrating the effectiveness of our training strategy.

TABLE IV: CAMO vs. CAMO (N100A4).

Method        | Training Time↓ | HV↑: N50A3 / N100A4 / N200A4
CAMO (N100A4) | 168.62 h       | 0.4308 / 0.6263 / 0.7339
CAMO          | 70.42 h        | 0.4394 / 0.6241 / 0.7340

V-C Real-World Robot Validation

We implemented our proposed CAMO framework on a physical robot platform to validate its effectiveness in real-world robot-node planning scenarios. We deployed our algorithm on 3 mobile robots operating in an indoor environment, where the robots were required to traverse 51 nodes, comprising one depot and 50 task nodes.

Figure 3: Experimental results in simulation (top) and real-world (bottom) scenarios: (a) simulation for λ = [1, 0]; (b) simulation for λ = [0, 1]; (c) real-world test for λ = [1, 0]; (d) real-world test for λ = [0, 1].

Fig. 3 shows our experimental simulation and real-robot results, where the robots successfully visited all 50 task nodes. The paths represent the solutions generated by our algorithm, corresponding to the fixed preference vectors λ1 = [1, 0] and λ2 = [0, 1], respectively, demonstrating CAMO's ability to effectively solve MOMTSP instances in practical settings.

VI Conclusion

In this paper, we propose CAMO, a conditional neural solver for MOMTSP. CAMO fuses preference and instance information via a conditional encoder and coordinates multi-agent decision making through a collaborative decoder. We further train CAMO with a REINFORCE-based objective over a mixed distribution of problem sizes to improve robustness to unseen agent and node configurations. Empirically, CAMO delivers strong performance, generalizes well to out-of-distribution instances and larger scales, and demonstrates practical applicability through real-world deployment on a mobile robot platform.
In future work, we will investigate how to address the imbalanced objectives commonly encountered in multi-agent MOCOPs and explore solving large-scale MOMTSP instances through advanced generative models or divide-and-conquer principles. ACKNOWLEDGMENT We used ChatGPT to assist in the linguistic refinement of the Introduction and Conclusion sections. The generated text was critically reviewed and revised by the authors to ensure alignment with the research findings and academic standards. References [1] T. Bektas (2006) The multiple traveling salesman problem: an overview of formulations and solution procedures. Omega 34 (3), p. 209–219. Cited by: §I. [2] F. Berto, C. Hua, L. Luttmann, J. Son, J. Park, K. Ahn, C. Kwon, L. Xie, and J. Park (2025) PARCO: parallel autoregressive models for multi-agent combinatorial optimization. In Advances in Neural Information Processing Systems, Cited by: §V-A. [3] R. F. Carpio, J. Maiolini, C. Potena, E. Garone, G. Ulivi, and A. Gasparri (2021) MP-STSP: a multi-platform Steiner traveling salesman problem formulation for precision agriculture in orchards. In 2021 IEEE International Conference on Robotics and Automation (ICRA), p. 2345–2351. External Links: Document Cited by: §I. [4] B. Chen, W. Zeng, Y. Lin, and D. Zhang (2014) A new local search-based multiobjective optimization algorithm. IEEE Transactions on Evolutionary Computation 19 (1), p. 50–73. Cited by: §V-A. [5] J. Chen, Z. Cao, J. Wang, Y. Wu, H. Qin, Z. Zhang, and Y. Gong (2025) Rethinking neural multi-objective combinatorial optimization via neat weight embedding. In The Thirteenth International Conference on Learning Representations, Cited by: §I. [6] C. A. C. Coello, G. B. Lamont, and D. A. Van Veldhuizen (2007) Evolutionary algorithms for solving multi-objective problems. Springer. Cited by: §I. [7] K. Deb and H.
Jain (2013) An evolutionary many-objective optimization algorithm using reference-point-based nondominated sorting approach, part I: solving problems with box constraints. IEEE Transactions on Evolutionary Computation 18 (4), p. 577–601. Cited by: §I, §V-A. [8] K. Deb, A. Pratap, S. Agarwal, and T. Meyarivan (2002) A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6 (2), p. 182–197. Cited by: §I, §V-A. [9] K. Deb (2011) Multi-objective optimisation using evolutionary algorithms: an introduction. In Multi-objective evolutionary optimisation for product design and manufacturing, p. 3–34. Cited by: §I. [10] M. Fan, Y. Wu, Z. Cao, W. Song, G. Sartoretti, H. Liu, and G. Wu (2024) Conditional neural heuristic for multiobjective vehicle routing problems. IEEE Transactions on Neural Networks and Learning Systems 36 (3), p. 4677–4689. Cited by: §I, §V-A. [11] M. Fan, J. Zhou, Y. Zhang, Y. Wu, J. Chen, and G. A. Sartoretti (2025) Preference-driven multi-objective combinatorial optimization with conditional computation. In Advances in Neural Information Processing Systems, Cited by: §I, §I. [12] Y. Guo, Z. Ren, and C. Wang (2024) iMTSP: solving min-max multiple traveling salesman problem with imperative learning. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), p. 10245–10252. External Links: Document Cited by: §I, §I. [13] J. J. Hopfield (1984) Neurons with graded response have collective computational properties like those of two-state neurons. Proceedings of the National Academy of Sciences 81 (10), p. 3088–3092. Cited by: §IV-B. [14] W. Kool, H. van Hoof, and M. Welling (2019) Attention, learn to solve routing problems! In International Conference on Learning Representations, Cited by: §IV-B. [15] Y. Kwon, J. Choo, B. Kim, I. Yoon, Y. Gwon, and S. Min (2020) POMO: policy optimization with multiple optima for reinforcement learning.
Advances in Neural Information Processing Systems 33, p. 21188–21198. Cited by: §IV-D, §V-A. [16] P. Lacomme, C. Prins, and M. Sevaux (2006) A genetic algorithm for a bi-objective capacitated arc routing problem. Computers & Operations Research 33 (12), p. 3473–3493. Cited by: §V-A. [17] S. Li, Z. Wang, Z. Liu, C. Tan, H. Lin, D. Wu, Z. Chen, J. Zheng, and S. Z. Li (2024) MogaNet: multi-order gated aggregation network. In International Conference on Learning Representations, Cited by: §IV-B. [18] W. Li, S. Zhang, S. Dai, H. Huang, R. Hu, X. Chen, and K. Xu (2024) Synchronized dual-arm rearrangement via cooperative mTSP. In 2024 IEEE International Conference on Robotics and Automation (ICRA), p. 9242–9248. Cited by: §I. [19] J. Liang, Y. Cao, Y. Ma, H. Zhao, and G. Sartoretti (2025) HDPlanner: advancing autonomous deployments in unknown environments through hierarchical decision networks. IEEE Robotics and Automation Letters 10 (1), p. 256–263. External Links: Document Cited by: §I. [20] X. Lin, Z. Yang, X. Zhang, and Q. Zhang (2022) Pareto set learning for expensive multi-objective optimization. Advances in Neural Information Processing Systems 35, p. 19231–19247. Cited by: §I, §I-B. [21] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), p. 807–814. Cited by: §IV-B. [22] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville (2018) FiLM: visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32. Cited by: §IV-B. [23] G. Reinelt (1991) TSPLIB—a traveling salesman problem library. ORSA Journal on Computing 3 (4), p. 376–384. Cited by: §V-B. [24] N. Shazeer (2020) GLU variants improve transformer. arXiv preprint arXiv:2002.05202. Cited by: §IV-B. [25] V. A. Shim, K. C. Tan, and K. K.
Tan (2012) A hybrid estimation of distribution algorithm for solving the multi-objective multiple traveling salesman problem. In 2012 IEEE Congress on Evolutionary Computation, p. 1–8. Cited by: §I, §I. [26] R. K. Srivastava, K. Greff, and J. Schmidhuber (2015) Training very deep networks. In Advances in Neural Information Processing Systems, Vol. 28. Cited by: §IV-B. [27] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. Alemi (2017) Inception-v4, inception-resnet and the impact of residual connections on learning. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31. Cited by: §IV-B. [28] J. B. Tenenbaum, V. d. Silva, and J. C. Langford (2000) A global geometric framework for nonlinear dimensionality reduction. Science 290 (5500), p. 2319–2323. Cited by: §V-B. [29] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022. Cited by: §IV-B. [30] Y. Wu, M. Fan, Z. Cao, R. Gao, Y. Hou, and G. Sartoretti (2024) Collaborative deep reinforcement learning for solving multi-objective vehicle routing problems. In 23rd International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2024, p. 1956–1965. Cited by: §I. [31] Y. Xu, M. Fang, L. Chen, G. Xu, Y. Du, and C. Zhang (2021) Reinforcement learning with multiple relational attention for solving vehicle routing problems. IEEE Transactions on Cybernetics 52 (10), p. 11107–11120. Cited by: §IV-A. [32] R. Yang and C. Fan (2024) A hierarchical framework for solving the constrained multiple depot traveling salesman problem. IEEE Robotics and Automation Letters 9 (6), p. 5536–5543. Cited by: §I. [33] Q. Zhang and H. Li (2007) MOEA/D: a multiobjective evolutionary algorithm based on decomposition. IEEE Transactions on Evolutionary Computation 11 (6), p. 712–731. Cited by: §I, §I, §V-A. [34] A. Zhou, B. Qu, H. Li, S. Zhao, P. N. Suganthan, and Q. 
Zhang (2011) Multiobjective evolutionary algorithms: a survey of the state of the art. Swarm and Evolutionary Computation 1 (1), p. 32–49. Cited by: §I, §I-B. [35] E. Zitzler and S. Künzli (2004) Indicator-based selection in multiobjective search. In International Conference on Parallel Problem Solving from Nature, p. 832–842. Cited by: §I. [36] E. Zitzler, L. Thiele, M. Laumanns, C. M. Fonseca, and V. G. Da Fonseca (2003) Performance assessment of multiobjective optimizers: an analysis and review. IEEE Transactions on Evolutionary Computation 7 (2), p. 117–132. Cited by: §V-A.