Paper deep dive
MoEless: Efficient MoE LLM Serving via Serverless Computing
Hanfei Yu, Bei Ouyang, Shwai He, Ang Li, Hao Wang
Abstract
Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs while advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
Tags
Links
- Source: https://arxiv.org/abs/2603.06350v1
- Canonical: https://arxiv.org/abs/2603.06350v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 95%
Last extracted: 3/13/2026, 12:18:50 AM
Summary
MoEless is a serverless Mixture-of-Experts (MoE) serving framework designed to mitigate expert load imbalance and reduce inference latency and costs in Large Language Models (LLMs). By decoupling experts from MoE models and integrating them as serverless functions, MoEless employs layer-aware predictors to identify stragglers and dynamically scales/places experts to optimize GPU utilization and function locality.
Entities (5)
Relation Signals (3)
MoEless → prototyped on → Megatron-LM
confidence 98% · MoEless is prototyped on top of Megatron-LM
MoEless → mitigates → Expert load imbalance
confidence 95% · MoEless, the first serverless MoE serving framework that mitigates expert load imbalance
MoEless → uses → Expert Parallelism
confidence 90% · MoEless decouples experts from MoE models under the EP paradigm
Cypher Suggestions (2)
List all software dependencies for MoEless · confidence 95% · unvalidated
MATCH (f:Framework {name: 'MoEless'})-[:PROTOTYPED_ON]->(s:Software) RETURN s.name
Find all frameworks that address expert load imbalance · confidence 90% · unvalidated
MATCH (f:Framework)-[:MITIGATES]->(p:Problem {name: 'Expert load imbalance'}) RETURN f.name
Full Text
73,842 characters extracted from source content.
Expand or collapse full text
MoEless: Efficient MoE LLM Serving via Serverless Computing

Hanfei Yu* (hyu42@stevens.edu), Stevens Institute of Technology · Bei Ouyang* (bei_ouyang@outlook.com), Stevens Institute of Technology · Shwai He (shwaihe@umd.edu), University of Maryland College Park · Ang Li (angliece@umd.edu), University of Maryland College Park · Hao Wang (hwang9@stevens.edu), Stevens Institute of Technology

Abstract

Large Language Models (LLMs) have become a cornerstone of AI, driving progress across diverse domains such as content creation, search and recommendation systems, and AI-assisted workflows. To alleviate extreme training costs while advancing model scales, Mixture-of-Experts (MoE) has become a popular backbone for modern LLMs, which are commonly served in distributed deployments using expert parallelism (EP). However, MoE's sparse activation mechanism leads to severe expert load imbalance, where a few experts become overloaded while others remain idle, resulting in expert stragglers that inflate inference latency and serving cost. Existing expert load balancing solutions assume static resource configurations on serverful infrastructures, limiting expert scalability and elasticity, and resulting in either costly real-time expert swapping or degraded generation quality. We present MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates inference via serverless experts. MoEless employs lightweight, layer-aware predictors to accurately estimate incoming expert load distributions and proactively identify stragglers. We design optimized expert scaling and placement strategies to maximize function locality, improve GPU utilization, and balance loads across experts and GPUs. MoEless is prototyped on top of Megatron-LM and deployed on an eight-GPU testbed. Experiments with open-source MoE models and real-world workloads show that MoEless reduces inference latency by 43% and inference cost by 84% compared to state-of-the-art solutions.
1 Introduction

Motivation. Large Language Models (LLMs) have revolutionized Natural Language Processing (NLP) research and driven breakthroughs across diverse application domains, such as content generation [2,9,15,57], information retrieval and recommendation [37,80], and Artificial Intelligence (AI)-assisted decision-making [30,36,49]. To mitigate the extreme training costs, modern LLMs increasingly adopt the Mixture-of-Experts (MoE) architecture [1,14,29,64,71,73] as their core design. MoE layers replace the conventional feed-forward network (FFN) layers in Transformer blocks [69] with a gating network and a collection of experts, where only a small subset is activated during computation. This design significantly reduces the number of floating point operations (FLOPs), enabling MoE-based LLMs to deliver comparable or even superior performance to dense LLMs at a fraction of the training cost [14,29,73].

* Both authors contributed equally to this work.

[Figure 1 plots omitted: per-layer expert workloads vs. expert ID for Layers 0, 15, and 31, distinguishing straggler, high-load, and low-load experts.] Figure 1: Expert load imbalance across layers for different MoE models and datasets: (a) Mixtral-8×7B on ShareGPT and (b) Phi-3.5-MoE on LMSYS-Chat-1M.

Due to the immense model scale, serving MoE-based LLMs requires distributed deployment under the expert parallelism (EP) paradigm [38]. However, expert load imbalance has been identified as a fundamental challenge in distributed serving [22,38,42], where certain experts become highly popular and receive overwhelming loads compared to others. Figure 1 illustrates the expert load imbalance in two representative MoE models, Mixtral-8×7B and Phi-3.5-MoE, across two real-world datasets. Such imbalance leads to the expert straggler problem [24], which severely increases inference latency and serving cost in MoE serving.
Limitations of state-of-the-art approaches. Expert load balancing [24,35,38] has been extensively explored to mitigate the inherent load imbalance among experts in MoE serving. However, existing approaches assume static resource configurations on serverful infrastructures, resulting in either costly real-time expert swapping [22,35,38] with limited effectiveness, or lossy expert re-routing [24] that compromises generation quality. Achieving expert load balance without generation performance loss requires fine-grained, elastic, and accurate expert scaling in MoE serving.

Key insights and contributions. Serverless computing offers a promising alternative and has emerged as a new computing paradigm for modern AI infrastructures. Serverless Machine Learning (ML) inference and LLM serving have gained significant attention from both academia [20,45,67,68,76] and industry [5–7,52]. In this paper, we propose MoEless, the first serverless MoE serving framework that mitigates expert load imbalance and accelerates MoE inference via serverless experts. MoEless decouples experts from MoE models under the EP paradigm and integrates them with serverless functions to enable scalable and elastic execution. To proactively identify expert stragglers, we design lightweight predictors that accurately estimate upcoming expert load distributions with layer awareness. Based on the predicted load characteristics, MoEless dynamically scales expert replicas to eliminate stragglers and balance workloads across both the expert and GPU levels, thereby minimizing inference latency. Furthermore, we design optimized expert placement strategies to maximize function locality and GPU utilization while reducing all-to-all communication overheads in EP deployments.

arXiv:2603.06350v1 [cs.DC] 6 Mar 2026 · Preprint, March 5, 2026
In summary, we make the following contributions:
• We propose MoEless, the first serverless MoE serving framework that accelerates inference by mitigating expert load imbalance via serverless experts.
• We design layer-aware and lightweight expert load predictors that accurately estimate incoming expert load distributions across different layers.
• We develop dynamic expert scaling and placement strategies that efficiently balance workloads across experts and GPUs to eliminate straggler problems.
• We prototype MoEless on top of Megatron-LM [63], deploy it on an eight-GPU testbed, and conduct extensive evaluation against state-of-the-art (SOTA) approaches.

Experimental methodology. We prototype MoEless on top of the Megatron-LM framework [63]. All experiments are conducted on a testbed with eight NVIDIA A6000 GPUs, with a total of 384 GB of GPU memory, interconnected via pairwise NVLinks. We evaluate MoEless using three representative MoE-based LLMs, Mixtral-8×7B [29], Phi-3.5-MoE [1], and Llama-4-Scout [48], across two real-world datasets, LMSYS-Chat-1M [81] and ShareGPT [53]. We compare our approach against SOTA expert load balancing methods, including Megatron-LM [63], Expert Parallelism Load Balancer (EPLB) [38], and an Oracle baseline [24]. Extensive experiments show that MoEless reduces inference latency by 43% and inference cost by 84% compared to SOTA baselines.

Limitations of the proposed approach. In this work, we adopt standard serverless function management schemes, such as pre-warming and fixed-duration keep-alive periods, to mitigate expert function cold starts. System parameters (e.g., prediction distance and load-balancing thresholds) are primarily determined through offline profiling, rather than being automatically or dynamically adapted across models and datasets. We leave the design of more advanced runtime optimizations to future work.
2 Background and Motivation

2.1 Large Language Model Serving

LLMs have been extensively studied and deployed in both industrial production [3,32,54,56] and academic research [16,33,43,82]. Modern LLMs typically adopt the Transformer decoder-only architecture [69], which operates in two consecutive stages: prefill and decode. Figure 2 illustrates the auto-regressive inference process, where an LLM processes an input prompt batch and incrementally generates new tokens.

[Figure 2 diagram omitted: a prompt batch flows through an MoE LLM (embedding layer, Transformer blocks, language modeling head) across prefill and decode iterations; per-layer gates route tokens to experts on different GPUs via all-to-all scatter and gather, with one straggler expert delaying the rest.] Figure 2: Illustration of serving Mixture-of-Experts (MoE) based Large Language Models under expert parallelism, where tokens are routed by per-layer gate networks to a sparse set of experts distributed across GPUs. Expert load imbalance triggers inefficient resource provisioning (e.g., over-scaling hot experts or under-utilizing cold ones), thereby increasing serving cost. Embed: embedding layer, TB: Transformer Block, Head: language modeling head, Attention: attention layer, Gate: gate networks, DP: data parallelism, MP: model parallelism.

During the prefill stage, the model processes all input prompt sequences in parallel and outputs the first new tokens for the batch in a single iteration.¹ This stage is typically compute-intensive because it runs full attention over the entire prompt length, and it also initializes the key-value (KV) cache that will be reused for subsequent generation. In contrast, the decode stage is latency-sensitive and often memory-bandwidth-bound: at each iteration, the model generates one new token per sequence while attending to all previously generated tokens via the KV cache.
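The two-stage inference loop described above can be sketched in a few lines. This is an illustrative stand-in, not MoEless code: `fake_forward` and its token arithmetic are invented to make the example runnable, but the control flow (one prefill iteration seeding the KV cache, then one token per decode iteration) mirrors the process in the text.

```python
# Minimal sketch of auto-regressive LLM inference: one prefill
# iteration over the whole prompt, then one decode iteration per
# generated token. "fake_forward" is a hypothetical stand-in model
# that caches one KV entry per processed token.

def fake_forward(tokens, kv_cache):
    """Hypothetical forward pass: extends the KV cache with the
    processed tokens and returns a deterministic fake next token."""
    kv_cache.extend(tokens)          # KV cache grows with context
    return (sum(kv_cache) % 50) + 1  # stand-in for argmax(logits)

def generate(prompt, max_new_tokens, eos=0):
    kv_cache = []
    # Prefill: process all prompt tokens in parallel (one iteration),
    # producing the first new token and initializing the KV cache.
    next_tok = fake_forward(prompt, kv_cache)
    output = [next_tok]
    # Decode: one new token per iteration, attending to the growing
    # KV cache instead of recomputing attention over the full prompt.
    for _ in range(max_new_tokens - 1):
        if next_tok == eos:
            break
        next_tok = fake_forward([next_tok], kv_cache)
        output.append(next_tok)
    return output

out = generate([5, 17, 3], max_new_tokens=4)
```

Note that per-iteration work in decode is dominated by reading the ever-growing `kv_cache`, which is exactly the memory-bandwidth-bound behavior the paragraph above describes.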
As the context grows, the amount of KV-cache reads increases monotonically, making per-iteration performance increasingly dominated by cache access and communication rather than raw compute. To improve throughput, serving systems commonly batch requests and dynamically adjust batch composition across iterations; however, batching also introduces trade-offs between utilization and tail latency, especially when sequences have diverse prompt lengths and output lengths. In the decode stage, the model generates one new token per sequence in each iteration until all responses are completed, and the overall end-to-end latency is therefore determined by both the iteration time and the total number of decode iterations required.

2.2 Mixture of Experts and Expert Parallelism

The MoE architecture has emerged as a dominant paradigm for scaling LLMs beyond dense models, enabling hyper-scale parameters without proportionally increasing computation [14,17,19,29,61]. As illustrated in Figure 2, MoE-based LLMs replace the standard FFN layer in each Transformer block [69] with an MoE layer, which consists of a gating network and multiple expert networks.

¹ An iteration refers to one inference step that generates a new token. The iteration time represents the end-to-end step latency.

[Figure 3 plots omitted: request arrivals, aggregated token loads, and active expert counts over a trace window around 12:23–12:24.] Figure 3: Serving Phi-3.5-MoE on LMSYS-Chat-1M using Azure LLM traces: (a) request arrivals, (b) aggregated token loads, and (c) total number of active experts.

Within each block, the attention module first computes token-level attention [69] from the input hidden states. The gating network then routes tokens to specific experts, and each token activates only its assigned expert for computation.
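The gate-then-route step can be sketched as follows. This is a simplified, framework-free illustration (the logits are made-up numbers; real gates are learned linear layers over the hidden states), showing how top-k routing produces the per-expert token loads whose skew this section discusses.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def route_tokens(gate_logits, top_k):
    """For each token, pick its top-k experts by gate score and
    tally the resulting per-expert token load."""
    num_experts = len(gate_logits[0])
    loads = [0] * num_experts
    assignments = []
    for logits in gate_logits:               # one row per token
        probs = softmax(logits)
        chosen = sorted(range(num_experts),
                        key=lambda e: probs[e], reverse=True)[:top_k]
        assignments.append(chosen)
        for e in chosen:
            loads[e] += 1
    return assignments, loads

# Three tokens, four experts, top-2 routing (Mixtral's 2-of-8 scaled
# down): expert 0 wins every token's gate and ends up overloaded.
logits = [[2.0, 0.1, 0.0, -1.0],
          [1.5, -0.5, 0.2, 0.0],
          [1.8, 0.0, -0.2, 0.3]]
assignments, loads = route_tokens(logits, top_k=2)
```

Even in this tiny example the load vector is skewed (expert 0 receives every token), which is the seed of the straggler problem analyzed in §2.3.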
Compared with traditional dense LLMs, MoE-based models activate only a subset of parameters during training and inference, substantially reducing computation while achieving superior generation quality with a comparable total parameter count. Given the extreme scale, MoE-based LLMs are typically served through distributed deployment to meet their intensive computational and memory demands. Existing MoE serving systems [34,35,62] employ EP, which integrates both data parallelism (DP) and model parallelism (MP) as shown in Figure 2. In each Transformer block, non-expert modules (e.g., attention layers and gating networks) are replicated across GPUs for parallel data processing, while experts are uniquely distributed across GPUs to accommodate their large memory footprints (e.g., each expert in Mixtral-8×7B occupies 0.33 GB of GPU memory). Consequently, two all-to-all communication operations are required between non-expert modules and experts to realize token-to-expert assignments: a scatter operation that distributes tokens to their designated experts, and a gather operation that collects and reorders the outputs from experts [22,35,38].

2.3 Expert Load Imbalance in MoE Serving

Expert load imbalance poses a fundamental challenge for MoE serving under EP [24,35,46]. Figure 1 illustrates the workload distribution across three representative layers from two MoE models, Mixtral-8×7B [29] and Phi-3.5-MoE [1], evaluated on two real-world datasets, ShareGPT [53] and LMSYS-Chat-1M [81]. Prior studies [8,24,26] have consistently observed that expert popularity is highly skewed, with certain experts receiving disproportionately higher loads than others. GPUs hosting these overloaded experts take significantly longer to complete their computations, leading to the straggler problem [24], where lightly loaded experts must wait for overloaded ones to finish.
Moreover, such imbalance exacerbates all-to-all communication latency in EP deployments, as GPUs with popular experts handle larger data transfers, further amplifying the straggler effect. Consequently, expert load imbalance can substantially increase MoE inference latency, elevating inference cost and degrading overall serving performance.

[Figure 4 plots omitted: CDFs of MoE layer forward time and stacked expert/non-expert inference costs for Megatron-LM, EPLB, and MoEless.] Figure 4: Inference performance of three approaches when serving Phi-3.5-MoE on ShareGPT: (a) MoE layer forward time, and (b) total inference cost (GB×sec).

In MoE serving, the dynamic expert demands arise from a combination of workload-level and model-level factors:

Varying request arrivals and token loads. Figure 3(a) shows the request arrivals of the real-world Azure LLM inference traces [54,66]. We replay the peak traffic observed around noon from the traces. During MoE serving, varying request arrivals naturally result in dynamic usage of experts. Figure 3(b) shows the total request token loads aggregated over the same traces, where we batch and sum the tokens of requests within the same second. With varying input tokens, MoE models unavoidably assign different loads to their experts.

Intrinsically skewed expert popularity in MoE. Existing research [8,24,31,83] has extensively demonstrated the highly skewed expert popularity, which stems from the MoE architecture and the differences between prefill and decode stages. To illustrate the dynamic expert demands, we replay the same Azure LLM traces to serve Phi-3.5-MoE with Megatron-LM [63] using LMSYS-Chat-1M. Experimental details can be found in §6.1. Figure 3(c) shows the fluctuation in the number of active experts over time.

2.4 Motivating Serverless MoE Serving

To mitigate expert load imbalance in MoE serving, existing solutions [22,35,38,70] selectively replicate popular experts for distributing the loads.
However, they rely on fixed configurations of expert scaling on serverful infrastructures. For example, EPLB [38] from DeepSeek assumes a fixed number of expert replicas on a fixed number of devices. Serverful MoE serving fails to satisfy the dynamic expert demands (§2.3). Serverless computing has emerged as a transformative paradigm in modern AI infrastructures, offering agile scalability, pay-as-you-go pricing, and simplified management to enable scalable ML inference [4,28] and elastic LLM serving [20,45,76]. In contrast, traditional serverful serving systems (e.g., Megatron-LM [63]) deploy experts on fixed compute nodes with static EP configurations, often suffering from severe straggler effects and expert load imbalance. Existing serverful expert load balancing approaches [22,38] attempt to mitigate this issue by swapping low-usage experts with replicas of popular ones. However, their expert scalability and elasticity remain constrained by fixed resource allocations, resulting in inflated per-layer latency and higher inference costs. In contrast, a serverless MoE architecture dynamically scales experts on demand, effectively eliminating hidden stragglers and achieving balanced workloads across experts and GPUs. As shown in Figure 4, the serverless design significantly reduces both MoE layer forward latency and inference cost over serverful baselines.

3 Overview

3.1 Objectives and Challenges

We design MoEless to achieve three goals:

Accelerate MoE inference by mitigating expert load imbalance. Existing MoE-based LLMs inevitably experience expert load imbalance from both workload and model perspectives, causing the expert straggler problem [24] that prolongs inference latency. We aim to alleviate the expert load imbalance and accelerate MoE inference by eliminating expert stragglers.

Enable expert scalability via serverless computing.
Existing serverful MoE serving frameworks [22,35,38] rely on swapping out low-usage experts for popular ones, limited by fixed resource allocations. By leveraging serverless computing, we aim to unlock expert scalability and elasticity to efficiently accommodate dynamic expert demands.

Minimize MoE inference cost with serverless experts. Since expert computations dominate the inference latency and cost [1,29,38,48], we aim to reduce the inference cost with serverless experts.

We must address three challenges to realize the goals:

How to integrate serverless computing into existing MoE serving? Unlike dense LLMs, the scale of MoE-based LLMs is prohibitively large, making it infeasible to encapsulate the entire model within serverless functions [20,25,45,76]. Therefore, our design must be fine-grained and generic to seamlessly integrate with existing MoE serving frameworks.

How to accurately predict expert load distributions? To enable proactive expert load balancing, we must effectively capture and predict dynamic expert load distributions in advance, providing accurate guidance for asynchronous expert management and scaling decisions.

How to efficiently scale and place expert functions? Given the predicted expert loads, we must construct execution plans that efficiently scale serverless expert functions and place them across devices to maximize GPU utilization and function locality, and minimize communication overheads.

3.2 Architecture and Workflow

Unlike existing serverless LLM serving approaches [20,25,45,76], which encapsulate the entire LLM into serverless functions, we decouple the experts from the MoE model and package them as independent functions, while keeping the non-expert modules in DP. This design offers two key advantages. First, since expert computation dominates inference latency and cost, packaging experts as functions maximizes the benefits of serverless execution while avoiding unnecessary coordination among non-expert modules.
Second, because experts are inherently decoupled and follow all-to-all communication patterns, the stateless nature of serverless functions can be naturally hidden under EP, ensuring seamless integration with existing MoE serving workflows.

[Figure 5 diagram omitted: prompts flow through the MoE LLM while the Expert Load Predictor, Expert Scaler, and Expert Placer cooperate to detect stragglers and deploy expert functions across GPUs.] Figure 5: The architecture and workflow of MoEless.

Figure 5 shows the architecture and workflow of MoEless, consisting of three main components: Expert Load Predictor, Expert Scaler, and Expert Placer. MoEless serves MoE models in four steps:

Step 1: Expert load prediction. Unlike existing expert selection predictors [18,27,65,72,74], which operate at the single-request level, our Expert Load Predictor accurately estimates expert load distributions across request batches and identifies batch-level expert stragglers for each MoE layer (§4.1). The prediction results are then passed to the Expert Scaler for subsequent expert management.

Step 2: Expert scaling. The Expert Scaler receives the predicted expert loads and determines expert scaling decisions under inference budget constraints (§4.2). It first analyzes expert stragglers and sets a target forward latency for each predicted layer. Next, it trims excessive straggler loads to meet the latency target and allocates additional replicas to handle the overflow. The resulting expert scaling plan is then forwarded to the Expert Placer for deployment.

Step 3: Expert placement. With EP, each expert instance must be assigned to a GPU for execution. The Expert Placer generates an optimized GPU placement strategy based on prior expert states and hardware characteristics, given the new scaling plan (§4.3).
The placement is designed to maximize function locality, improve GPU utilization, and minimize communication overhead (e.g., expert migration between GPU and CPU, or all-to-all data transfers).

Step 4: Expert serving. The inference process consists of one iteration in the prefill stage and multiple iterations in the decode stage. For each MoE layer in every iteration, we evenly distribute each expert's load across its replicas, achieving dynamic load balancing and eliminating stragglers through parallel processing.

3.3 Problem Formulation

Following prior MoE serving research [74], we consider serving an MoE-based LLM composed of $L$ MoE layers on a cluster of $G$ homogeneous GPUs.

[Figure 6 plots omitted: per-layer curves for prediction distances $d = 1, \dots, 4$.] Figure 6: Characterizing Phi-3.5-MoE on LMSYS-Chat-1M: (a) cosine similarity of gate network inputs, and (b) expert load prediction accuracy across layers with different prediction distances.

Each MoE layer consists of one gating network and $E$ experts. The model processes a batch of request prompts $B$, resulting in a total expert workload (i.e., token count) of $W$. Let $[L] := \{1, \dots, l, \dots, L\}$ denote the set of layers and $[E] := \{1, \dots, e, \dots, E\}$ the set of experts in each layer. Each request prompt undergoes multiple iterations across the prefill and decode stages. During each iteration, we must make two decisions: how many replicas of each expert to instantiate and on which GPU each replica should be placed. Let $[R^{(i,l,e)}] := \{1, \dots, r^{(i,l,e)}, \dots, R^{(i,l,e)}\}$ denote the set of replicas of expert $e$ at layer $l$ during iteration $i$, where $l \in [L]$, $e \in [E]$, and $i \in B$. Let $p^{(i,l,e)}_{r,g} \in \{0, 1\}$ denote whether replica $r^{(i,l,e)}$ is placed on GPU $g \in [G]$. Since experts in an MoE model typically share the same parameter size, we assume expert memory footprints $M_e$ to be homogeneous.
The expert workload $W_{l,e}$ is evenly divided among its replicas, yielding per-replica load $W_{l,e,r} := W_{l,e} / R^{(i,l,e)}$. The processing time of a replica scales linearly with its assigned workload: $T_{l,e,r} := \alpha \cdot W_{l,e,r}$, where $\alpha$ is a processing coefficient. Each GPU incurs additional communication latency due to all-to-all scatter and gather operations. Since the input and output data of an MoE layer typically have the same size (hidden dimension), the one-time communication time of GPU $g$ is given by $T_g := \beta \cdot \sum_{p^{(i,l,e)}_{r,g}=1} W_{l,e,r}$, where $\beta$ is a communication coefficient. Each layer's forward time mainly consists of the expert processing time and two rounds of all-to-all communication. Given the definitions, we optimize two objectives: total inference latency $T$ and cost $C$. The total inference latency is defined as

$$T := \sum_{i \in B} \sum_{l \in [L]} \Big( \max_{e,r}(T_{l,e,r}) + 2 \cdot \max_{g}(T_g) \Big) + T_{\text{misc}},$$

where $T_{\text{misc}}$ is a constant for non-MoE latency. The total inference cost is computed as the product of GPU memory consumption and inference latency aggregated over all iterations:

$$C := \sum_{i \in B} \sum_{l \in [L]} \Big( \max_{e,r}(T_{l,e,r}) + 2 \cdot \max_{g}(T_g) \Big) \cdot \sum_{e \in [E]} \sum_{R^{(i,l,e)}} M_e + T_{\text{misc}} \cdot M_{\text{misc}},$$

where $M_{\text{misc}}$ is a constant for non-MoE memory footprints.

[Figure 7 plots omitted: accuracy vs. prediction distance (3 to 18), with and without fine-tuning.] Figure 7: Expert load prediction accuracy on LMSYS-Chat-1M with and without fine-tuning at different prediction distances: (a) Mixtral-8×7B, and (b) Phi-3.5-MoE.

Finally, we formulate the expert load balancing problem as a multi-objective integer linear programming (ILP) optimization:

$$\min_{r^{(i,l,e)},\, p^{(i,l,e)}_{r,g}} (T, C), \quad \text{s.t.} \sum_{p^{(i,l,e)}_{r,g}=1} M_e \le M_g, \ \forall i \in B, \ \forall g \in [G].$$

The objectives are minimizing the total inference latency and cost while ensuring that the total memory footprint of expert replicas on any GPU $g$ satisfies its available memory $M_g$.
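The per-layer forward-time model (expert compute plus two all-to-all rounds) can be exercised numerically. The sketch below is a toy evaluation of that model for one layer on two GPUs; the coefficients and loads are illustrative values, not measurements from the paper.

```python
# Per-layer forward time under the Section 3.3 model:
#   T_layer = max over replicas of (alpha * per-replica load)
#             + 2 * beta * max over GPUs of aggregated load.

def layer_time(expert_loads, replicas, placement, alpha, beta, num_gpus):
    """expert_loads[e]: token load of expert e; replicas[e]: replica
    count for expert e; placement[(e, r)]: GPU hosting replica r."""
    per_replica = {}
    for e, load in enumerate(expert_loads):
        for r in range(replicas[e]):
            per_replica[(e, r)] = load / replicas[e]  # even split
    compute = max(alpha * w for w in per_replica.values())
    gpu_load = [0.0] * num_gpus
    for (e, r), w in per_replica.items():
        gpu_load[placement[(e, r)]] += w
    comm = 2 * beta * max(gpu_load)   # two all-to-all rounds
    return compute + comm

# Two experts on two GPUs; expert 0 is a straggler (load 90 vs 10).
loads, alpha, beta = [90, 10], 1.0, 0.1
t_unbalanced = layer_time(loads, [1, 1], {(0, 0): 0, (1, 0): 1},
                          alpha, beta, 2)
t_scaled = layer_time(loads, [2, 1], {(0, 0): 0, (0, 1): 1, (1, 0): 1},
                      alpha, beta, 2)
```

Replicating the straggler halves the dominant compute term and also shrinks the worst per-GPU communication load, which is exactly the lever the scaling and placement decision variables control in the ILP.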
Solving this ILP is NP-hard [13], and real-world workloads exhibit dynamic expert demands that further complicate the problem. Therefore, we opt for a heuristic-based design for MoEless.

4 Design

4.1 Expert Load Predictor

For each MoE layer, ideal expert scaling requires waiting for the gate network outputs to reveal the actual expert loads. However, due to the synchronization overheads of on-demand scaling and placement [22,42], prior studies have explored asynchronously predicting and prefetching experts ahead of time with a prediction distance [18,65,72,74,79]. The prediction distance $d$ refers to the number of layers in advance that scaling and placement decisions are made before the corresponding layer activates its experts. An optimal prediction distance should fully overlap the prediction, scaling, and placement overheads with the ongoing inference process. To this end, we design the Expert Load Predictor to accurately identify overloaded expert stragglers and mitigate them in upcoming expert executions asynchronously.

Speculative prediction. Due to the common use of residual connections [23] in Transformer architectures, the hidden states output by each Transformer block remain highly similar to their inputs. Leveraging this property, we use the input hidden states of the $l$-th layer as input to the gate network of the $(l+d)$-th layer to speculate its expert load distribution $W_{l+d} := \{w_{(l+d,1)}, \dots, w_{(l+d,e)}, \dots, w_{(l+d,E)}\}$, where $d$ is the prediction distance and $w_{l+d,e}$ is the token count for expert $e \in [E]$. Figure 6(a) illustrates the cosine similarity of gate network inputs between Layers $l$ and $l+d$, where $d \in [1, 4]$, showing consistently high similarity across layers.
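This speculative prediction can be sketched as a toy example. All gates and hidden states below are hand-written illustrative values (real gates are learned linear layers over high-dimensional states); the point is the mechanism: routing layer $l$'s inputs through layer $(l+d)$'s gate approximates that layer's true expert loads because residual connections keep the states similar.

```python
# Speculative expert-load prediction: feed layer l's input hidden
# states into layer (l+d)'s gate to estimate that layer's expert
# loads before it runs. Gates are toy linear scorers here.

def gate_topk(hidden, weights, top_k=1):
    """Score experts for one token via dot(hidden, w_e), return the
    top-k expert indices."""
    scores = [sum(h * w for h, w in zip(hidden, we)) for we in weights]
    return sorted(range(len(weights)), key=lambda e: scores[e],
                  reverse=True)[:top_k]

def predict_loads(states, gate, num_experts, top_k=1):
    """Estimated per-expert token counts from routing the given
    hidden states through the given gate."""
    loads = [0] * num_experts
    for h in states:
        for e in gate_topk(h, gate, top_k):
            loads[e] += 1
    return loads

# Two experts, 2-d hidden states. Layer l's inputs (states_l) are
# close to layer (l+d)'s true inputs (states_ld), so the speculative
# loads match the true loads.
gate_ld   = [[1.0, 0.0], [0.0, 1.0]]               # gate of layer l+d
states_l  = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.4]]   # layer l inputs
states_ld = [[1.0, 0.0], [0.1, 0.9], [0.8, 0.3]]   # layer l+d inputs
predicted = predict_loads(states_l, gate_ld, 2)
actual    = predict_loads(states_ld, gate_ld, 2)
```

As the prediction distance $d$ grows, the states drift further apart and such agreement degrades, matching the accuracy trend reported in Figure 7.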
Figure 7 further presents the average expert load prediction accuracy across all layers at different prediction distances for two models. Intuitively, a larger prediction distance offers greater opportunity to overlap asynchronous expert operations, but at the cost of reduced prediction accuracy.

Algorithm 1: Expert Scaling
Input: Predicted expert loads $W_l$, expert memory $M_e$, per-layer memory cap $M_{\text{cap}}$, CV threshold $V$
Output: Replica sets $[R^{(l,e)}]$ for all $e \in [E]$ in layer $l$
1: Initialize replica count $R^{(l,e)} \leftarrow 1$, for all $e \in [E]$
2: Initialize allocated memory $M_{\text{alloc}} \leftarrow 0$
3: while $M_{\text{alloc}} < M_{\text{cap}}$ and $\mathrm{CV}(W_l) > V$ do
4:   Select straggler $e^* \leftarrow \arg\max_{e \in [E]} W_l$; remove its loads: $W_l \leftarrow W_l - \{w^{(l,e^*)}_r \mid r \in [R^{(l,e^*)}]\}$
5:   Add one replica: $R^{(l,e^*)} \leftarrow R^{(l,e^*)} + 1$; $M_{\text{alloc}} \leftarrow M_{\text{alloc}} + M_{e^*}$
6:   Split the load: $w^{(l,e^*)}_r \leftarrow w^{(l,e^*)}_r / R^{(l,e^*)}$; $W_l \leftarrow W_l + \{w^{(l,e^*)}_r \mid r \in [R^{(l,e^*)}]\}$
7: return $[R^{(l,e)}]$

Gate network fine-tuning with layer awareness. Existing approaches either directly reuse the original gate networks as predictors, leading to unsatisfactory prediction accuracy [18], or train large external predictors from scratch, introducing substantial computational overhead [65]. In contrast, we replicate and fine-tune the original gate networks as our predictors, preserving their inherent knowledge of expert selection while improving prediction accuracy over larger prediction distances. Our key observation is that not all layers require the same degree of fine-tuning. Figure 6(a) shows that early layers have lower input similarity across gate networks, while later layers maintain higher input similarity and yield more reliable predictions. Figure 6(b) presents the prediction accuracy across layers under different prediction distances. Early layers tend to exhibit more unstable expert load distributions and lower accuracy, whereas later layers are more stable and predictable.
This observation aligns with prior research [11, 44], which indicates that early layers are generally more plastic and less stable in their learning dynamics. Leveraging this observation, we first profile each layer's prediction accuracy prior to deployment and define a target threshold h (e.g., 80%). Layers with accuracy below h are selectively fine-tuned until they exceed the threshold. Figure 7 shows that our layer-aware fine-tuning consistently improves prediction accuracy across varying prediction distances.

4.2 Expert Scaler

Upon receiving the predicted expert load distribution W^l for Layer l, MoEless's Expert Scaler determines the set of replicas [R^{(l,e)}] := (1, ..., r^{(l,e)}, ..., R^{(l,e)}) for each expert e ∈ [E]. As shown in Algorithm 1, we employ a greedy heuristic to iteratively cut off the expert stragglers with high loads and converge to a balanced load distribution for each layer. For each layer l ∈ [L], we first initialize all experts with a single instance and a per-layer memory cap M_cap. Then, we repeatedly identify the most overloaded expert using a max heap, add a replica to that expert, and evenly split its load until either the coefficient of variation (CV) of expert loads falls below the threshold V (e.g., CV ≤ 0.2) or the per-layer memory cap M_cap is reached. This process is performed iteratively across all layers, ensuring that each layer independently achieves a balanced expert load distribution within its allocated memory budget.

Table 1: Characterizations of MoE models used in the evaluation.

MoE Model     | Parameters (active / total) | Experts Per Layer (active / total) | Num. of Layers
Mixtral-8×7B  | 12.9B / 46.7B               | 2 / 8                              | 32
Phi-3.5-MoE   | 6.6B / 42B                  | 2 / 16                             | 32
Llama-4-Scout | 17B / 109B                  | 1 / 16                             | 48

4.3 Expert Placer

After the number of replicas for each expert in a layer is determined, the Expert Placer decides where these replicas should be placed across available GPUs. We aim to minimize placement overheads while maintaining balanced GPU loads across the system. Our key objectives are two-fold:

Reuse expert replicas for warm-starts. If a previously placed expert replica is kept alive on a GPU, we immediately reuse it to avoid data transfer and initialization delays (i.e., function warm-starts [20, 45, 67, 68]). Thus, we eliminate expert cold-start overheads with pre-warming and keep-alive techniques [59, 60, 68, 75] from serverless computing.

Balance per-GPU loads. Since each GPU's computation and communication times are proportional to its aggregated expert load, balancing these loads minimizes straggler effects and ensures efficient parallel execution among GPUs. Hence, we distribute replicas across GPUs to achieve balanced utilization while satisfying per-GPU capacity constraints via a classic Join-the-Shortest-Queue algorithm [21].

Algorithm 2: Expert Placement
Input: Replica sets [R^{(l,e)}] for all e ∈ [E], expert loads W^l, GPU set G, last placement results [R'^{(l,e)}]
Output: P^l := {p^{(l,e)}_{r,g} | e ∈ [E], r ∈ [R^{(l,e)}], g ∈ [G]}
1: Initialize per-GPU loads W_g ← 0 for all g ∈ [G];
2: Initialize the placement matrix P^l ← 0;
3: while W^l ≠ ∅ do
4:   Select the most-loaded replica r* ← argmax_{r ∈ [R^{(l,e)}]} W^l; W^l ← W^l − w^{(l,e)}_{r*};
5:   if r* ∈ [R'^{(l,e)}] and r* is alive on some GPU g ∈ [G] then
6:     Select the warm-start GPU g* ← g;
7:   else
8:     Select the least-loaded GPU g* ← argmin_{g ∈ [G]} W_g;
9:   Update the placement p^{(l,e)}_{r*,g*} ← 1;
10:  Update the load W_{g*} ← W_{g*} + w^{(l,e)}_{r*};
11: return P^l

[Figure 8 here: CDFs of MoE layer forward time (ms) for Megatron-LM, Oracle, EPLB, and MoEless; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]
Figure 8: MoE layer forward time of four approaches across three models on LMSYS-Chat-1M.

[Figure 9 here: CDFs of MoE layer forward time (ms) for Megatron-LM, Oracle, EPLB, and MoEless; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]

Figure 9: MoE layer forward time of four approaches across three models on ShareGPT.

As shown in Algorithm 2, we first initialize the per-GPU load tracker and placement matrix. Then, we select the most-loaded expert replica from all replicas, checking whether any alive replicas can be reused from the previous placement results [R'^{(l,e)}]. If reuse is not possible, we greedily assign the most-loaded replica to the GPU with the lowest current aggregated load. After assignment, the GPU load is updated to reflect the new placement, and the process continues until all replicas are assigned.

5 Implementation

We prototype MoEless on top of the Megatron-LM framework [63]. The implementation details of each component are described as follows.

Implementing predictors. We implement the predictors as lightweight neural networks (NNs) that share the same architecture and parameter size as the MoE model's gate networks, using PyTorch [55]. To eliminate prediction overheads, we invoke predictors asynchronously from the main model computation using dedicated CUDA streams [51]. For each Transformer layer, a separate CUDA stream is launched to execute the predictor on the current hidden states concurrently with the MoE layer's forward pass. This design ensures that prediction is fully overlapped with computation, introducing no blocking or additional latency.

Fine-tuning predictors. We collect input hidden states and the corresponding gate network outputs from each MoE layer to construct the fine-tuning dataset.
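The warm-start-plus-JSQ loop of Algorithm 2 (§4.3) can likewise be sketched in a few lines. This is an illustrative Python simplification, not the paper's implementation: identifiers are ours, and per-GPU capacity checks are omitted.

```python
def place_replicas(replica_loads, prev_placement, num_gpus):
    """Warm-start + join-the-shortest-queue placement, as in Algorithm 2.

    replica_loads:  {replica_id: predicted load}
    prev_placement: {replica_id: gpu} for replicas still alive from the
                    previous step (warm-start candidates)
    Returns {replica_id: gpu}.
    """
    gpu_load = [0.0] * num_gpus
    placement = {}
    # place the heaviest replicas first
    for rid, load in sorted(replica_loads.items(), key=lambda kv: -kv[1]):
        if rid in prev_placement:
            g = prev_placement[rid]      # reuse the alive replica (warm start)
        else:
            # join-the-shortest-queue: pick the least-loaded GPU
            g = min(range(num_gpus), key=lambda j: gpu_load[j])
        placement[rid] = g
        gpu_load[g] += load              # update the per-GPU load tracker
    return placement
```

Placing heavy replicas first is what makes the shortest-queue fallback effective: the lighter replicas then fill in around the warm-started ones.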
The dataset is split into training and testing subsets with a 7:3 ratio. For each layer, we fine-tune multiple predictors with varying inter-layer prediction distances for flexible deployment. The fine-tuning process is lightweight, completing within five minutes on a single GPU. To further improve training efficiency, we parallelize fine-tuning across all layers.

Scaling and placing serverless experts. We support two types of expert scaling: (1) intra-GPU scaling, which replicates or removes expert weights within a single GPU, and (2) inter-GPU scaling, which adjusts the number of GPUs that host expert replicas. Both expert scaling and placement are implemented asynchronously to minimize inference delay. We use the NCCL [50] library for all-to-all communication between serverless experts and the Megatron-LM framework. We integrate Megatron-LM's EP with Docker containers [47], where experts are distributed across GPU-enabled containers for execution. Additionally, we adopt standard pre-warming and keep-alive techniques [59, 60, 68, 75] from serverless computing to eliminate expert cold-start latency.

6 Evaluation

This section presents extensive experiments that evaluate MoEless, including the experimental setup (§6.1), overall performance against SOTA baselines (§6.2), effectiveness of the expert load predictors (§6.3), sensitivity analysis (§6.4), ablation study (§6.5), and system overheads of MoEless (§6.6).

6.1 Experimental Setup

We describe the details of the experimental setup for evaluation.

Models. We evaluate our system using three representative MoE-based LLMs: Mixtral-8×7B [29], Phi-3.5-MoE [1], and Llama-4-Scout [48]. Table 1 characterizes the three MoE models, including the number of parameters, MoE layers, and experts per layer.
Together, these models span recent architectural scales and expert configurations, allowing evaluation across diverse MoE designs.

[Figure 10 here: bar charts of total cost (GB×sec) for Megatron-LM, Oracle, EPLB, and MoEless across the three models; panels (a) LMSYS-Chat-1M, (b) ShareGPT.]

Figure 10: Total inference cost of four approaches across three models on two datasets.

[Figure 11 here: accuracy vs. prediction distance for Mixtral-offloading, ProMoE, and MoEless; panels (a) Mixtral-8×7B on LMSYS-Chat-1M, (b) Phi-3.5-MoE on LMSYS-Chat-1M, (c) Mixtral-8×7B on ShareGPT, (d) Phi-3.5-MoE on ShareGPT.]

Figure 11: Expert load prediction accuracy for three prediction methods at different prediction distances.

Workloads and datasets. We employ two real-world prompt datasets widely used for LLM evaluation, LMSYS-Chat-1M [81] and ShareGPT [53], for input requests. We adopt real-world LLM inference traces released by Microsoft Azure [54] to drive the request arrivals, where requests are randomly sampled from the datasets and sent to each baseline at trace timestamps. Since Megatron-LM does not natively support continuous batching, we emulate this behavior by aggregating all requests arriving within each second into a single input batch, resulting in time-varying batches that better align with real-world serving scenarios. Similar to existing LLM system evaluation methodologies [3, 54, 66, 74, 82], we configure the MoE models to process and generate the exact number of tokens specified in the traces to ensure consistency.

Hardware and system environment. We conduct all experiments on an eight-GPU testbed, with GPUs interconnected via pairwise NVLinks.
Each GPU is an NVIDIA A6000 equipped with 48 GB of GPU memory and connected to the host CPUs through PCIe 5.0 links, providing up to 64 GB/s of bidirectional bandwidth per GPU. The system is provisioned with 256 vCPUs backed by AMD EPYC 7C13 processors and 500 GB of system memory.

[Figure 12 here: heatmaps of Pearson coefficients across layers; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE.]

Figure 12: Correlations between predicted and actual expert load distributions across layers of two models. Heavier color means more correlation results fall in the slot.

Software stack and configuration. The experiments of all serverful baselines are conducted inside an Ubuntu 22.04.5 LTS (Jammy) virtual machine (VM) running Linux kernel 5.15.0-164-generic. The system uses NVIDIA driver 570.211.01 with CUDA 12.8 support, the CUDA runtime library libcudart.so.12, and PyTorch 2.10 built with CUDA 12.8, together with cuDNN 9.1.0. Collective communication relies on NCCL 2.23.4. For MoEless, we install Docker Engine v29.1.3 to host serverless experts using containers.

Baselines and comparisons. We compare MoEless against three SOTA serverful MoE serving baselines: 1) Megatron-LM [63], a basic EP-enabled MoE inference system without expert load balancing; 2) EPLB [38], the load balancer proposed by DeepSeek [38], which periodically (e.g., every ten minutes) creates redundant experts based on historical expert usage to mitigate high expert loads; and 3) Oracle [24], an upper-bound baseline that ignores gate network outputs and performs perfect expert load balancing. Note that Oracle directly affects model generation quality, as it ignores a subset of the original expert routing decisions during computation. We integrate EPLB and Oracle into Megatron-LM for fair comparison.

Methodology and experimental protocol. We measure the MoE layer forward latency to directly evaluate the overheads caused by expert stragglers.
We also report the overall serving cost, estimated as the product of memory consumption and inference latency aggregated across all input batches, representing the total monetary cost of serving the entire workload. In addition, we evaluate detailed metrics, including expert load prediction accuracy and system overheads.

[Figure 13 here: mean layer forward time (ms) and average number of expert replicas per layer vs. prediction distance; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]

Figure 13: Sensitivity analysis of MoEless's prediction distance on LMSYS-Chat-1M.

[Figure 14 here: the same metrics vs. prediction distance on ShareGPT; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]

Figure 14: Sensitivity analysis of MoEless's prediction distance on ShareGPT.

6.2 Overall Performance

We evaluate the overall performance of Megatron-LM, Oracle, EPLB, and MoEless by running three MoE models on the LMSYS-Chat-1M and ShareGPT datasets, measuring MoE layer forward latency and total inference cost. Due to continuous batching, batch-level latency varies across iterations. Therefore, we record the per-layer forward latency for all layers across all input batches and aggregate them into a unified cumulative distribution function (CDF) for comparison. Figures 8 and 9 present the CDF of MoE layer forward latency for the four approaches across three models on LMSYS-Chat-1M and ShareGPT, respectively.
With scalable and elastic serverless experts, MoEless achieves significant performance improvements, reducing the average MoE layer forward latency by 43.19% and 21.89% compared to Megatron-LM and EPLB, respectively. While Oracle is a lossy baseline that achieves perfect load balance by affecting generation quality, MoEless consistently stays closest to Oracle across all cases, indicating superior performance over existing SOTA expert load balancing methods.

Figure 10 reports the total inference cost aggregated across the entire experiment for all four approaches. All other baselines incur significantly higher costs due to serverful expert execution. In contrast, by leveraging serverless expert execution, MoEless consistently delivers higher serving efficiency, reducing overall inference cost by 92.68%, 84.06%, and 95.11% compared to Megatron-LM, Oracle, and EPLB, respectively.

6.3 Prediction Accuracy

We compare MoEless's predictor against two SOTA expert prediction solutions: 1) Mixtral-offloading [18], which directly employs the original gate networks to predict expert selections of subsequent layers, and 2) ProMoE [65], which trains a large, layer-specific predictor from scratch to model the mapping between gate inputs and expert selections. We use the same experimental setup as described in the previous evaluation.

As shown in Figure 11, MoEless consistently outperforms both baselines across different prediction distances, demonstrating greater robustness and accuracy. On average, MoEless improves prediction accuracy by up to 18% and 15% over Mixtral-offloading and ProMoE, respectively. Figure 12 presents the heatmap comparing the predicted and actual expert load distributions. We collect pairwise predicted–actual data points across all layers and compute the Pearson correlation coefficients [12] for each pair.
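The per-layer check behind Figure 12 reduces to a Pearson correlation between predicted and actual per-expert token counts. A minimal sketch, with toy numbers that are purely illustrative:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# toy per-expert token counts for one layer (illustrative numbers only)
predicted = [120, 40, 35, 5]
actual = [110, 45, 40, 5]
r = pearson(predicted, actual)
```

A value of r near 1 means the predictor ranks and weights the experts almost exactly as the real gate does, which is the property the placement and scaling decisions rely on.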
The results reveal a strong positive correlation between the predicted and actual distributions, indicating that our predictor effectively captures expert load patterns across layers under real-world workloads.

[Figure 15 here: mean layer forward time (ms) and average number of expert replicas per layer vs. CV threshold; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]

Figure 15: Sensitivity analysis of MoEless's CV threshold on LMSYS-Chat-1M.

[Figure 16 here: the same metrics vs. CV threshold on ShareGPT; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE, (c) Llama-4-Scout.]

Figure 16: Sensitivity analysis of MoEless's CV threshold on ShareGPT.

6.4 Sensitivity Analysis

We evaluate the sensitivity of two key system parameters in MoEless: 1) the prediction distance, which determines the overlap overhead (§4.1), and 2) the CV threshold, which governs expert replica scaling (§4.2). For both parameters, we measure the average MoE layer forward time and the average number of expert replicas per layer to analyze their impact on inference latency and expert serving cost.

Prediction distance. Figures 13 and 14 show the average MoE layer forward time and the average number of expert replicas per layer as the prediction distance varies, evaluated across three models and two datasets. We increase the prediction distance from 1 to 5 in increments of 1.
As the prediction distance increases, the MoE layer forward time rises due to less accurate estimation of future expert load distributions, while the number of expert replicas decreases as load predictions become coarser. Based on this trade-off, we set the prediction distance of MoEless to 1 in all evaluations, achieving high prediction accuracy with negligible overhead to inference latency.

CV threshold. Figures 15 and 16 present the sensitivity of the average MoE layer forward time and the average number of expert replicas per layer to different CV thresholds, evaluated across three models and two datasets. We vary the CV threshold from 0.2 to 1.0 in increments of 0.2. Larger CV thresholds permit greater load imbalance and trigger less aggressive expert scaling, reducing the number of expert replicas per layer but increasing the MoE layer forward time due to straggler effects. We therefore set the CV threshold of MoEless to 0.2 in our evaluation, minimizing inference latency while still achieving lower expert costs compared to other baselines.

6.5 Ablation Study

We present an ablation study of MoEless by disabling all three critical components, denoted as MoEless w/o pred + scale + place. This variant replaces our Expert Load Predictor with EPLB's periodic expert load estimation based on historical windows, disables the scaling of serverless experts, and removes our expert placement and load-balancing strategies. We follow the same evaluation setup as in §6.2 and serve Mixtral-8×7B and Phi-3.5-MoE on LMSYS-Chat-1M with each baseline to conduct this ablation study.

Figure 17 shows the CDF of MoE layer forward latency for MoEless and the ablated variant. The results demonstrate that the Expert Load Predictor (§4.1), Expert Scaler (§4.2), and Expert Placer (§4.3) are all essential to MoEless's overall performance improvements over serverful baselines.
Specifically, removing the predictor reduces the accuracy of expert load estimation, while disabling expert scaling and placement diminishes the effectiveness of load balancing across experts and GPUs, thereby jointly increasing inference latency.

6.6 System Overheads

We report fine-grained system overheads of MoEless.

Predictor fine-tuning overheads. The fine-tuning process is computationally lightweight across the three models, as the largest predictor contains only 80K parameters. Across all three MoE models, the complete set of predictors can be fine-tuned within five minutes on a single GPU, incurring negligible fine-tuning overhead.

[Figure 17 here: CDFs of MoE layer forward time (ms) for MoEless and the w/o pred+scale+place variant; panels (a) Mixtral-8×7B, (b) Phi-3.5-MoE.]

Figure 17: Ablation study of MoEless on LMSYS-Chat-1M.

Table 2: Predictor memory footprints across different models and methods.

Model         | Mixtral-offloading | ProMoE    | Ours
Mixtral-8×7B  | 1.92 MB            | 128.32 MB | 1.92 MB
Phi-3.5-MoE   | 4.16 MB            | 128.64 MB | 4.16 MB
Llama-4-Scout | 3.84 MB            | 120.48 MB | 3.84 MB

Predictor memory footprints. Table 2 reports the total GPU memory consumption of all predictors across the three models for different methods. Our predictors introduce minimal memory overhead, occupying less than 2% of the footprint required by ProMoE.

Expert operation overheads. The prediction delay is below 0.2 ms per layer, and nearly all expert scaling and placement operations are warm-started without runtime delays. Moreover, all expert-related operations are executed asynchronously, ensuring minimal impact on inference latency.

7 Related Work

Mitigating expert load imbalance in MoE systems. FasterMoE [22] introduces a dynamic shadowing mechanism to mitigate skewed expert selection during training.
Prophet [70] performs fine-grained load balancing for parallel MoE training by coordinating token dispatch and expert placement to reduce imbalance-induced stalls. DeepSeek proposes EPLB [38], which periodically swaps low-usage experts with replicas of popular ones to balance expert loads. Capacity-Aware Inference [24] studies the straggler effect in MoE inference and mitigates it by explicitly accounting for capacity limits when allocating expert workloads. MoE-GPS [46] provides practical guidelines for prediction strategies in dynamic expert duplication, highlighting the importance of accurate look-ahead estimation under shifting loads. In addition, expert routing and dispatch optimizations can indirectly alleviate imbalance by reducing communication and synchronization overheads, such as Tutel [26] (adaptive MoE at scale) and Pre-Gated MoE [27] (algorithm–system co-design for faster MoE inference). NetMoE [42] improves expert routing efficiency by considering the locality of training samples and dynamically rearranging their placements to reduce all-to-all communication costs. Lina [35] alleviates communication bottlenecks during inference through dynamic resource scheduling. Unlike these serverful approaches, MoEless is the first serverless MoE serving framework that mitigates expert load imbalance through serverless experts.

Optimizing expert parallelism in MoE training. SmartMoE [77] adaptively combines multiple parallelism strategies to accelerate MoE training. ScMoE [10] overlaps different forms of parallelism to effectively reduce all-to-all communication latency. Comet [78] achieves fine-grained communication–computation overlap by leveraging data dependency analysis and task rescheduling. MoE Parallel Folding [39] decouples attention and MoE layers in Transformer blocks, allowing each to independently select optimal parallelism strategies.
Beyond parallelism mappings, large-scale MoE training systems such as GShard [34] and Switch Transformers [19] highlight how routing, capacity factors, and sharding strategies shape both efficiency and load balance. Orthogonal to these training optimizations, MoEless focuses on mitigating expert load imbalance in MoE serving; these parallelism techniques can be seamlessly integrated into our framework.

Expert offloading for resource-limited MoE serving. MoE-Infinity [72] traces expert selection patterns to offload inactive experts and prefetch important ones based on predictions. FineMoE [74] proposes a fine-grained offloading system that predicts, prefetches, and caches experts to reduce resource pressure. ProMoE [65] proactively predicts and prefetches experts using NN-based predictors. DAOP [79] further explores data-aware offloading and predictive pre-calculation for efficient MoE inference, emphasizing that accurate forecasting can reduce both transfer overhead and tail latency. DeepSpeed-Inference [58] offloads parameters at the layer level without considering expert awareness. Mixtral-offloading [18] leverages gate network inputs from preceding layers to speculate expert selections. Complementary to expert offloading, a growing body of serverless LLM serving work targets general cold-start and elasticity challenges (e.g., ServerlessLLM [20], ParaServe [45], Medusa [76], DeepServe [25]), but these systems primarily treat the model as a whole and do not address MoE-specific expert stragglers under expert parallelism. MoEless complements offloading approaches by focusing on distributed MoE serving environments (where all-to-all communication and expert stragglers dominate) rather than single-node, resource-constrained scenarios.

MoE serving in serverless computing. Recent works have explored serving MoE models using serverless computing to reduce inference cost. Liu et al. [40] optimize the deployment cost of MoE models on serverless platforms using Bayesian optimization.
Remoe [41] offloads expert modules to CPUs to reduce memory overhead and inference cost on heterogeneous hardware. However, none of the existing works focus on addressing expert load imbalance by leveraging the elasticity of serverless functions.

8 Conclusion

This paper proposes MoEless, the first serverless MoE serving framework that explicitly targets expert load imbalance while accelerating inference through fine-grained serverless expert execution. MoEless employs lightweight, low-overhead predictors to accurately capture incoming expert load distributions and proactively identify expert stragglers with layer-level awareness, enabling timely and informed resource management decisions during inference. Guided by these predictions, we design optimized expert scaling and placement strategies that dynamically adjust expert instances across GPUs, improving function locality, increasing effective GPU utilization, and balancing workloads across both experts and devices. By decoupling expert execution from rigid deployment boundaries, MoEless bridges serverless computing principles with large-scale MoE inference, enabling elastic and efficient resource allocation without sacrificing latency. We prototype MoEless on top of Megatron-LM and deploy it on an eight-GPU testbed to evaluate its effectiveness under realistic serving conditions. Experiments with open-source MoE models and real-world workloads demonstrate that MoEless reduces inference latency by up to 43% and inference cost by up to 84% compared to SOTA solutions, highlighting its practical benefits for scalable, cost-efficient MoE serving systems.

References

[1] Marah Abdin, Jyoti Aneja, Hany Awadalla, Ahmed Awadallah, Ammar Ahmad Awan, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Jianmin Bao, Harkirat Behl, et al. 2024. Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv e-prints (2024), arXiv–2404.
[2] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. GPT-4 Technical Report. arXiv preprint arXiv:2303.08774 (2023).
[3] Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. 2024. Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 117–134.
[4] Ahsan Ali, Riccardo Pinciroli, Feng Yan, and Evgenia Smirni. 2020. BATCH: Machine Learning Inference Serving on Serverless Platforms with Adaptive Batching. In International Conference for High Performance Computing, Networking, Storage and Analysis (SC). IEEE.
[5] AWS Bedrock. 2023. Amazon Bedrock - Generative AI. https://aws.amazon.com/bedrock/.
[6] AWS SageMaker. 2022. AWS SageMaker. https://aws.amazon.com/sagemaker/.
[7] Azure Machine Learning. 2022. Azure Machine Learning. https://azure.microsoft.com/en-us/products/machine-learning/.
[8] Oana Balmau, Anne-Marie Kermarrec, Rafael Pires, André Loureiro Espírito Santo, Martijn de Vos, and Milos Vujasinovic. 2025. Accelerating MoE Model Inference with Expert Sharding. In Proceedings of the 5th Workshop on Machine Learning and Systems. 192–199.
[9] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems (2020).
[10] Weilin Cai, Juyong Jiang, Le Qin, Junwei Cui, Sunghun Kim, and Jiayi Huang. 2024. Shortcut-Connected Expert Parallelism for Accelerating Mixture-of-Experts. arXiv preprint arXiv:2404.05019 (2024).
[11] Yixiong Chen, Alan Yuille, and Zongwei Zhou. 2023. Which Layer is Learning Faster? A Systematic Exploration of Layer-wise Convergence Rate for Deep Neural Networks. In The Eleventh International Conference on Learning Representations (ICLR).
[12] Israel Cohen, Yiteng Huang, Jingdong Chen, and Jacob Benesty. 2009. Pearson Correlation Coefficient. Noise Reduction in Speech Processing (2009).
[13] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2022. Introduction to Algorithms. MIT Press.
[14] Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. 2024. DeepSeekMoE: Towards Ultimate Expert Specialization in Mixture-of-Experts Language Models. arXiv preprint arXiv:2401.06066 (2024).
[15] Sunhao Dai, Yuqi Zhou, Liang Pang, Weihao Liu, Xiaolin Hu, Yong Liu, Xiao Zhang, Gang Wang, and Jun Xu. 2024. Neural Retrievers are Biased Towards LLM-Generated Content. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
[16] Kuntai Du, Bowen Wang, Chen Zhang, Yiming Cheng, Qing Lan, Hejian Sang, Yihua Cheng, Jiayi Yao, Xiaoxuan Liu, Yifan Qiao, et al. 2025. PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles. 399–414.
[17] Nan Du, Yanping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Yuanzhong Xu, Maxim Krikun, Yanqi Zhou, Adams Wei Yu, Orhan Firat, et al. 2022. GLaM: Efficient Scaling of Language Models with Mixture-of-Experts. In International Conference on Machine Learning. PMLR, 5547–5569.
[18] Artyom Eliseev and Denis Mazur. 2023. Fast Inference of Mixture-of-Experts Language Models with Offloading. arXiv preprint arXiv:2312.17238 (2023).
[19] William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. Journal of Machine Learning Research 23, 120 (2022), 1–39.
[20] Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. 2024. ServerlessLLM: Low-Latency Serverless Inference for Large Language Models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 135–153.
[21] Varun Gupta, Mor Harchol-Balter, Karl Sigman, and Ward Whitt. 2007. Analysis of Join-the-Shortest-Queue Routing for Web Server Farms. Performance Evaluation (2007).
[22] Jiaao He, Jidong Zhai, Tiago Antunes, Haojie Wang, Fuwen Luo, Shangfeng Shi, and Qin Li. 2022. FasterMoE: Modeling and Optimizing Training of Large-Scale Dynamic Pre-Trained Models. In Proceedings of the 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
[23] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
[24] Shwai He, Weilin Cai, Jiayi Huang, and Ang Li. 2026. Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts. In International Conference on Learning Representations (ICLR).
[25] Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Jie Meng, Baoquan Zhang, Shining Wan, et al. 2025. DeepServe: Serverless Large Language Model Serving at Scale. arXiv preprint arXiv:2501.14417 (2025).
[26] Changho Hwang, Wei Cui, Yifan Xiong, Ziyue Yang, Ze Liu, Han Hu, Zilong Wang, Rafael Salas, Jithin Jose, Prabhat Ram, et al. 2023. Tutel: Adaptive Mixture-of-Experts at Scale. Proceedings of Machine Learning and Systems 5 (2023), 269–287.
[27] Ranggi Hwang, Jianyu Wei, Shijie Cao, Changho Hwang, Xiaohu Tang, Ting Cao, and Mao Yang. 2024. Pre-Gated MoE: An Algorithm-System Co-Design for Fast and Scalable Mixture-of-Expert Inference. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA).
[28] Jananie Jarachanthan, Li Chen, Fei Xu, and Bo Li. 2021. AMPS-Inf: Automatic Model Partitioning for Serverless Inference with Cost Efficiency. In Proceedings of the 50th International Conference on Parallel Processing. 1–12.
[29] Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. 2024. Mixtral of Experts. arXiv preprint arXiv:2401.04088 (2024).
[30] Zhihan Jiang, Jinyang Liu, Zhuangbin Chen, Yichen Li, Junjie Huang, Yintong Huo, Pinjia He, Jiazhen Gu, and Michael R Lyu. 2024. LILAC: Log Parsing using LLMs with Adaptive Parsing Cache. Proceedings of the ACM on Software Engineering (2024).
[31] Yechan Kim, Hwijoon Lim, and Dongsu Han. 2024. Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training. In Forty-first International Conference on Machine Learning.
[32] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles. 611–626.
[33] Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 155–172.
[34] Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2020. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. arXiv preprint arXiv:2006.16668 (2020).
[35] Jiamin Li, Yimin Jiang, Yibo Zhu, Cong Wang, and Hong Xu. 2023. Accelerating Distributed MoE Training and Inference with Lina. In 2023 USENIX Annual Technical Conference (USENIX ATC 23).
[36] Yichen Li, Yintong Huo, Renyi Zhong, Zhihan Jiang, Jinyang Liu, Junjie Huang, Jiazhen Gu, Pinjia He, and Michael R Lyu. 2024. Go Static: Contextualized Logging Statement Generation. Proceedings of the ACM on Software Engineering (2024).
[37] Xinyu Lin, Wenjie Wang, Yongqi Li, Shuo Yang, Fuli Feng, Yinwei Wei, and Tat-Seng Chua. 2024. Data-efficient Fine-tuning for LLM-based Recommendation. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval.
[38] Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. 2024. DeepSeek-V3 Technical Report. arXiv preprint arXiv:2412.19437 (2024).
[39] Dennis Liu, Zijie Yan, Xin Yao, Tong Liu, Vijay Korthikanti, Evan Wu, Shiqing Fan, Gao Deng, Hongxiao Bai, Ashwath Aithal, et al. 2025. MoE Parallel Folding: Heterogeneous Parallelism Mappings for Efficient Large-Scale MoE Model Training with Megatron Core. arXiv preprint arXiv:2504.14960 (2025).
[40] Mengfan Liu, Wei Wang, and Chuan Wu. 2025. Optimizing Distributed Deployment of Mixture-of-Experts Model Inference in Serverless Computing. In IEEE INFOCOM Conference on Computer Communications.
[41] Wentao Liu, Yuhao Hu, Ruiting Zhou, Baochun Li, and Ne Wang. 2025. Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing. arXiv preprint arXiv:2512.18674 (2025).
[42] Xinyi Liu, Yujie Wang, Fangcheng Fu, Xupeng Miao, Shenhan Zhu, Xiaonan Nie, and Bin Cui. 2025. NetMoE: Accelerating MoE Training through Dynamic Sample Placement. In The Thirteenth International Conference on Learning Representations.
[43] Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. In Proceedings of the ACM SIGCOMM 2024 Conference. 38–56.
[44] Ziyin Liu, Isaac Chuang, Tomer Galanti, and Tomaso Poggio. 2024.
Formation of Representations in Neural Networks. arXiv preprint arXiv:2410.03006 (2024). [45]Chiheng Lou, Sheng Qi, Chao Jin, Dapeng Nie, Haoran Yang, Xuanzhe Liu, and Xin Jin. 2025. Towards Swift Serverless LLM Cold Starts with ParaServe. arXiv preprint arXiv:2502.15524 (2025). [46]Haiyue Ma, Zhixu Du, and Yiran Chen. 2025. MoE-GPS: Guidlines for Prediction Strategy for Dynamic Expert Duplication in MoE Load Balancing. arXiv preprint arXiv:2506.07366 (2025). [47] Dirk Merkel et al.2014. Docker: Lightweight Linux Containers for Consistent Development and Deployment. Linux Journal (2014). [48] Meta AI. 2025.The Llama 4 Herd: The Beginning of a New Era of Na- tively Multimodal AI Innovation. https://ai.meta.com/blog/llama-4-multimodal- intelligence/. [49]Daye Nam, Andrew Macvean, Vincent Hellendoorn, Bogdan Vasilescu, and Brad Myers. 2024. Using an LLM to Help with Code Understanding. In Proceedings of the IEEE/ACM 46th International Conference on Software Engineering. [50]NVIDIA. 2015. NVIDIA Collective Communications Library (NCCL) . https: //developer.nvidia.com/nccl. [51]NVIDIA. 2024. CUDA Runtime API :: CUDA Toolkit Documentation. https: //docs.nvidia.com/cuda/cuda-runtime-api/index.html. [52]NVIDIA NIM. 2024. AI Agents: Built to Reason, Plan, Act. https://w.nvidia. com/en-us/ai/. [53]OpenAI Community. 2025. ShareGPT: Share Your Wildest ChatGPT Conversa- tions. https://sharegpt.com/. [54] Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. 2024. Splitwise: Efficient generative llm inference using phase splitting. In 2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA). IEEE, 118–132. [55] PyTorch. 2018. PyTorch: Tensors and Dynamic Neural Networks in Python with Strong GPU Acceleration. https://pytorch.org. [56] Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, et al.2025. 
Modserve: Scalable and resource-efficient large multimodal model serving. arXiv preprint arXiv:2502.00937 (2025). [57] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al.2019. Language Models are Unsupervised Multitask Learners. OpenAI blog (2019). [58]Samyam Rajbhandari, Conglong Li, Zhewei Yao, Minjia Zhang, Reza Yazdani Am- inabadi, Ammar Ahmad Awan, Jeff Rasley, and Yuxiong He. 2022. Deepspeed-moe: Advancing mixture-of-experts inference and training to power next-generation ai scale. In International conference on machine learning. PMLR, 18332–18346. [59]Rohan Basu Roy, Tirthak Patel, and Devesh Tiwari. 2022. IceBreaker: Warming Serverless Functions Better with Heterogeneity. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). [60]Vaishaal Shankar, Karl Krauth, Kailas Vodrahalli, Qifan Pu, Benjamin Recht, Ion Stoica, Jonathan Ragan-Kelley, Eric Jonas, and Shivaram Venkataraman. 2020. Serverless Linear Algebra. In the 11th ACM Symposium on Cloud Computing (SoCC). [61]Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. 2017. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538 (2017). [62]Shaohuai Shi, Xinglin Pan, Qiang Wang, Chengjian Liu, Xiaozhe Ren, Zhongzhe Hu, Yu Yang, Bo Li, and Xiaowen Chu. 2024. Schemoe: An extensible mixture-of- experts distributed training system with tasks scheduling. In Proceedings of the Nineteenth European Conference on Computer Systems. 236–249. [63]Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2019. Megatron-LM: Training Multi-Billion Parameter Language Models using Model Parallelism. arXiv preprint arXiv:1909.08053 (2019). [64]Snowflake. 2024. Snowflake Arctic: The Best LLM for Enterprise AI. https: //w.snowflake.com/en/data-cloud/arctic/. 
[65]Xiaoniu Song, Zihang Zhong, Rong Chen, and Haibo Chen. 2024. Promoe: Fast moe-based llm serving using proactive caching. arXiv preprint arXiv:2410.22134 (2024). [66]Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. 2025. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA). IEEE, 1348–1362. [67]Yifan Sui, Hao Wang, Hanfei Yu, Yitao Hu, and Jianxun Li. 2025. ServerlessLoRA: Minimizing Latency and Cost in Serverless Inference for LoRA-Based LLMs. arXiv preprint arXiv:2505.14468 (2025). [68] Yifan Sui, Hanfei Yu, Yitao Hu, Jianxun Li, and Hao Wang. 2024. Pre-Warming is Not Enough: Accelerating Serverless Inference with Opportunistic Pre-Loading. In Proceedings of the 2024 ACM Symposium on Cloud Computing (SoCC). [69]Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems 30 (2017). [70]Wei Wang, Zhiquan Lai, Shengwei Li, Weijie Liu, Keshi Ge, Yujie Liu, Ao Shen, and Dongsheng Li. 2023. Prophet: Fine-grained load balancing for parallel train- ing of large-scale moe models. In 2023 IEEE International Conference on Cluster Computing (CLUSTER). IEEE, 82–94. [71] xAI. 2023. Announcing Grok. https://x.ai/blog/grok. [72]Leyang Xue, Yao Fu, Zhan Lu, Luo Mai, and Mahesh Marina. 2024. MoE-Infinity: Offloading-Efficient MoE Model Serving. arXiv preprint arXiv:2401.14361 (2024). [73]An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Cheng- peng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al.2024. Qwen2 Technical Report. arXiv preprint arXiv:2407.10671 (2024). [74]Hanfei Yu, Xingqi Cui, Hong Zhang, and Hao Wang. 2025. Taming Latency- Memory Trade-Off in MoE-Based LLM Serving via Fine-Grained Expert Offload- ing. In European Conference on Computer Systems (EuroSys). 
[75] Hanfei Yu, Rohan Basu Roy, Fontenot Christian, Devesh Tiwari, Jian Li, Hong Zhang, Hao Wang, and Seung-Jong Park. 2024. RainbowCake: Mitigating Cold- starts in Serverless with Layer-wise Container Caching and Sharing. In ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). [76]Shaoxun Zeng, Minhui Xie, Shiwei Gao, Youmin Chen, and Youyou Lu. 2025. Medusa: Accelerating serverless LLM inference with materialization. In Pro- ceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1. 653–668. [77] Mingshu Zhai, Jiaao He, Zixuan Ma, Zan Zong, Runqing Zhang, and Jidong Zhai. 2023. SmartMoE: Efficiently training Sparsely-Activated models through combining offline and online parallelization. In 2023 USENIX Annual Technical Conference (USENIX ATC 23). [78]Shulai Zhang, Ningxin Zheng, Haibin Lin, Ziheng Jiang, Wenlei Bao, Chengquan Jiang, Qi Hou, Weihao Cui, Size Zheng, Li-Wen Chang, et al.2025. Comet: Fine- grained Computation-communication Overlapping for Mixture-of-Experts. arXiv preprint arXiv:2502.19811 (2025). [79] Yujie Zhang, Shivam Aggarwal, and Tulika Mitra. 2025. DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference. In Design Automation and Test in Europe (DATE). [80]Yuyue Zhao, Jiancan Wu, Xiang Wang, Wei Tang, Dingxian Wang, and Maarten de Rijke. 2024. Let Me Do It for You: Towards LLM Empowered Recommendation via Tool Learning. In Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval. [81] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhang- hao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P Xing, et al.2023. Lmsys- chat-1m: A large-scale real-world llm conversation dataset. arXiv preprint arXiv:2309.11998 (2023). 
[82]Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. 2024.DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24). 193–210. [83] Yanqi Zhou, Tao Lei, Hanxiao Liu, Nan Du, Yanping Huang, Vincent Zhao, An- drew M Dai, Quoc V Le, James Laudon, et al.2022. Mixture-of-experts with expert choice routing. Advances in Neural Information Processing Systems 35 (2022), 7103–7114. 13