
Paper deep dive

RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering

Gaia A. Bertolino, Yuwei Zhang, Tong Xia, Domenico Talia, Cecilia Mascolo

Year: 2026 · Venue: arXiv preprint · Area: cs.SD · Type: Preprint · Embeddings: 84

Abstract

Conversational generative AI is rapidly entering healthcare, where general-purpose models must integrate heterogeneous patient signals and support diverse interaction styles while producing clinically meaningful outputs. In respiratory care, non-invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across devices, environments, and acquisition protocols, and questions span multiple intents and question formats. Existing biomedical audio-language QA systems are typically monolithic, without any specialization mechanisms for tackling diverse respiratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they handle the shifts encountered in real-world settings. To address these limitations, we introduce RAMoEA-QA, a hierarchically routed generative model for respiratory audio question answering that unifies multiple question types and supports both discrete and continuous targets within a single multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializing both acoustic representations and generation behaviour per example, RAMoEA-QA consistently outperforms strong baselines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibiting the strongest generalization for diagnosis under domain, modality, and task shifts.

Tags

ai-safety (imported, 100%) · cssd (suggested, 92%) · preprint (suggested, 88%)

Links

PDF not stored locally. Use the link above to view on the source site.

Intelligence

Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%

Last extracted: 3/13/2026, 12:19:58 AM

Summary

RAMoEA-QA is a hierarchically routed generative model designed for respiratory audio question answering. It utilizes a two-stage conditional specialization approach: an Audio Mixture-of-Experts (Audio-MoE) to route recordings to specific audio encoders, and a Language Mixture-of-Adapters (LoRA-MoA) to select task-specific adapters on a frozen LLM. This architecture improves performance on diverse respiratory datasets and tasks while maintaining parameter efficiency and robustness under domain, modality, and task shifts.
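To make the two-stage routing concrete, the sketch below shows hard top-1 expert selection trained with straight-through Gumbel-Softmax, in the spirit of the paper's description. It is a minimal PyTorch-style illustration, not the authors' implementation: the module names (TopOneRouter, TwoStageRoutedQA), the pooled-feature inputs, and all dimensions are assumptions, and the frozen-LLM generation step is omitted.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopOneRouter(nn.Module):
    # Scores a routing summary and activates exactly one expert (hard top-1).
    def __init__(self, dim, num_experts):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts)

    def forward(self, summary, tau=1.0):
        logits = self.gate(summary)
        # Straight-through Gumbel-Softmax: one-hot selection in the forward pass,
        # soft gradients for training the router.
        one_hot = F.gumbel_softmax(logits, tau=tau, hard=True)
        return one_hot, logits.softmax(dim=-1)

class TwoStageRoutedQA(nn.Module):
    # Stage A picks one audio encoder expert; stage B picks one language adapter.
    def __init__(self, audio_encoders, adapters, proxy_dim, llm_dim):
        super().__init__()
        self.audio_encoders = nn.ModuleList(audio_encoders)  # frozen backbones in the paper
        self.adapters = nn.ModuleList(adapters)               # lightweight LoRA-style experts
        self.audio_router = TopOneRouter(proxy_dim, len(audio_encoders))
        self.lang_router = TopOneRouter(llm_dim, len(adapters))

    def forward(self, audio_feats, proxy_summary, question_summary):
        # Stage A: route each example using a cheap pre-encoder proxy summary.
        a_hot, a_probs = self.audio_router(proxy_summary)
        a_idx = a_hot.argmax(dim=-1)
        audio_prefix = torch.stack(
            [self.audio_encoders[i](x) for i, x in zip(a_idx.tolist(), audio_feats)])
        # Stage B: route on a fused question + aligned-audio summary.
        l_hot, l_probs = self.lang_router(question_summary + audio_prefix)
        return audio_prefix, l_hot.argmax(dim=-1), (a_probs, l_probs)

# Toy usage with two hypothetical encoders and two adapters (dims are illustrative).
encoders = [nn.Linear(64, 32) for _ in range(2)]
adapters = [nn.Linear(32, 32) for _ in range(2)]
model = TwoStageRoutedQA(encoders, adapters, proxy_dim=16, llm_dim=32)
prefix, adapter_idx, _ = model(torch.randn(4, 64), torch.randn(4, 16), torch.randn(4, 32))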

Entities (5)

RAMoEA-QA · model · 100%
Audio-MoE · component · 95%
LoRA-MoA · component · 95%
RA-QA · dataset · 95%
Respiratory Disease · medical-condition · 90%

Relation Signals (3)

RAMoEA-QA evaluated on RA-QA

confidence 100% · We evaluate our approach on the RA-QA collection

RAMoEA-QA uses Audio-MoE

confidence 100% · RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording

RAMoEA-QA uses LoRA-MoA

confidence 100% · and a Language Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM

Cypher Suggestions (2)

Find all components of the RAMoEA-QA model · confidence 90% · unvalidated

MATCH (m:Model {name: 'RAMoEA-QA'})-[:USES]->(c:Component) RETURN m.name, c.name

Identify datasets used for evaluating the model · confidence 90% · unvalidated

MATCH (m:Model {name: 'RAMoEA-QA'})-[:EVALUATED_ON]->(d:Dataset) RETURN d.name

Full Text

83,689 characters extracted from source content.


Preprint: Under Review 2026 RAMoEA-QA: Hierarchical Specialization for Robust Respiratory Audio Question Answering Gaia A. Bertolinogab62@cam.ac.uk University of Cambridge, United Kingdom Yuwei Zhangyz798@cam.ac.uk University of Cambridge, United Kingdom Tong Xiatongxia@mail.tsinghua.edu.cn Tsinghua University, China Domenico Taliatalia@dimes.unical.it University of Calabria, Italy Cecilia Mascolocm542@cam.ac.uk University of Cambridge, United Kingdom Abstract Conversational generative AI is rapidly en- tering healthcare, where general-purpose models must integrate heterogeneous pa- tient signals and support diverse interac- tion styles while producing clinically mean- ingful outputs. In respiratory care, non- invasive audio, such as recordings captured via mobile microphones, enables scalable screening and longitudinal monitoring, but the heterogeneity challenge is particularly acute: recordings vary widely across de- vices, environments, and acquisition pro- tocols, and questions span multiple intents and question formats. Existing biomed- ical audio-language QA systems are typ- ically monolithic, without any specializa- tion mechanisms for tackling diverse res- piratory corpora and query intents. They are also only validated in limited settings, leaving it unclear how reliably they han- dle the shifts encountered in real-world set- tings. To address these limitations, we in- troduce RAMoEA-QA, a hierarchically routed generative model for respiratory au- dio question answering that unifies multi- ple question types and supports both dis- crete and continuous targets within a sin- gle multimodal system. RAMoEA-QA ap- plies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained au- dio encoder, and a Language Mixture- of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. By specializ- ing both acoustic representations and gen- eration behaviour per example, RAMoEA- QA consistently outperforms strong base- lines and routing ablations with minimal parameter overhead, improving in-domain test accuracy to 0.72 (vs. 0.61 and 0.67 for state-of-the-art baselines) and exhibit- ing the strongest generalization for diagno- sis under domain, modality, and task shifts. Data and Code Availability We use respi- ratory audio question-answer pairs from the RA- QA collection (Bertolino et al., 2026), which ag- gregates multiple open-source respiratory audio datasets and their associated metadata. Data access follows the original dataset licenses. Our code is available at this link. Institutional Review Board (IRB) This study obtained IRB approval from the relevant boards to work with the public and controlled access data described. 1. Introduction Respiratory diseases remain a leading cause of global morbidity and mortality, motivating early, accessible, and scalable screening tools, espe- cially for telemedicine and low-resource settings (Labaki and Han, 2020). In clinical practice, respiratory sound assessment provides diagnos- © 2026 G.A. Bertolino, Y. Zhang, T. Xia, D. Talia & C. Mascolo. arXiv:2603.06542v1 [cs.SD] 6 Mar 2026 RAMoEA-QA tic cues (e.g., wheezes, crackles, timing pat- terns) for conditions such as asthma, pneu- monia, and chronic obstructive pulmonary dis- ease (COPD). 
Recent machine learning work has achieved strong results for respiratory sound classification from recordings and has enabled reusable pretrained audio backbones for down- stream respiratory tasks (Reichert et al., 2008; Sovij ̈arvi et al., 1999; Zhang et al., 2024). Despite the progress, existing systems typically produce a single, task-specific output (e.g., dis- ease label or abnormality detection), which lim- its their usefulness in real clinical interactions. In practice, respiratory assessment is inher- ently driven by different questions iteratively: clinicians and patients seek different, context- dependent information from the same recording (e.g., Is wheezing present?, Which diagnosis is most likely?, How severe is it? ) Enabling flexible question answering over respiratory audio would allow interactive respiratory assessment beyond a single, static prediction. Although general- purpose multimodal assistants are increasingly explored for conversational healthcare support (Munikoti et al., 2024; OpenAI, 2026; Anthropic, 2026; Google AI, n.d.), and clinical QA has been widely studied for modalities such as electronic health records, medical imaging, and biosignals (Pampari et al., 2018; Nguyen et al., 2023; Oh et al., 2023), conversational AI for respiratory au- dio remains under-explored. Existing large-scale audio-language foundation models (Deshmukh et al., 2024b; Das et al., 2024; Chu et al., 2023) have proven to be strong for generic audio understanding, yet they are unre- liable for respiratory assessment (Wang et al., 2025).This limitation stems from the lack of training on respiratory acoustics and clinical knowledge, where cues are subtle, recordings are uncontrolled, and sound-label relationships are noisy and context-dependent.Early biomedi- cal audio-language QA models partially address these limitations but are developed and validated in relatively constrained settings (e.g., limited question formats and target types), and rarely study how to specialize across diverse respira- tory corpora and query intents, which is crucial because real-world respiratory audio question- answering must generalize across question-types, modality (cough/breath/speech), dataset/device, and task shifts, thus leaving it unclear whether a single monolithic pathway can reliably support the diversity of clinically structured questions en- countered in practice (Wang et al., 2025). To address these limitations, we introduce RAMoEA-QA, a hierarchically routed genera- tive model for respiratory audio question answer- ing that supports multiple question types and both discrete and continuous targets within a sin- gle multimodal system. RAMoEA-QA applies two-stage conditional specialization: an Audio Mixture-of-Experts routes each recording to a suitable pre-trained audio encoder, and a Lan- guage Mixture-of-Adapters selects a LoRA adapter on a shared frozen LLM to match the query intent and answer format. We evaluate our approach on the RA-QA collection (Bertolino et al., 2026), a multi- dataset benchmark spanning three question for- mats (open-ended, single-verify, multiple-choice) and two task families (diagnosis, including sever- ity assessment, and regression).Beyond in- distribution evaluation, we conduct controlled ro- bustness tests under modality, dataset, and task shift, and provide routing analyses that characterize specialization patterns and diag- nose/mitigate routing collapse via expert hetero- geneity and routing stabilization. Contributions. 
This paper makes the follow- ing contributions: • We propose RAMoEA-QA, a two-stage routed architecture that activates one au- dio encoder expert (Audio-MoE) and one LoRA adapter on a shared frozen LLM (Language-MoA) per example.This en- ables hierarchical specialization with min- imal overhead, supporting multi-dataset, multi-task, and multi-format respiratory- audio question-answering. • We systematically study respiratory audio QA by benchmarking against a generic audio QA model and single-path biomedical QA baselines on the RA-QA collection, cover- ing open-ended, single-verify, and multiple- choice questions across diagnosis, severity, and regression targets. • Our experiments show consistent gains over strong baselines on both discriminative and regression tasks, improving discriminative 2 RAMoEA-QA accuracy by 12.5% on average, while ex- hibiting stronger generalization for diagno- sis under domain, modality, and task shifts. Our routing analysis reveals that the learned router selectively leverages the benefits of different paths within a single unified model. 2. Related Work Audio ML for respiratory health. Machine learning for respiratory acoustics has primarily focused on supervised classification and detec- tion using cough and breath recordings, targeting tasks such as disease screening, symptom identi- fication, and severity estimation (Gairola et al., 2021; Fraiwan et al., 2021a; Srivastava et al., 2021; Reichert et al., 2008; Sovij ̈arvi et al., 1999). More recently, respiratory audio foundation mod- els have been introduced to leverage large-scale pretraining for improved transfer across datasets and recording conditions (Zhang et al., 2024). While these approaches provide strong acoustic representations and competitive performance on standard prediction tasks, they are typically de- signed for fixed outputs (labels or scores) and do not directly support question-conditioned clini- cal interactions where the same recording may be queried in multiple ways (e.g., diagnosis, symp- tom verification, or severity/regression targets). QA systems in healthcare. Clinical ques- tion answering has been extensively studied in text-centric settings, including QA over clinical notes and biomedical literature (Pampari et al., 2018; Jin et al., 2019; Nguyen et al., 2023; Tsat- saronis et al., 2015), as well as structured EHR query answering (Wang et al., 2020; Lee et al., 2023). Beyond text, multimodal medical QA benchmarks have emerged for modalities such as medical imaging and biosignals (Abacha et al.; Hu et al., 2024; Oh et al., 2023), alongside approaches that integrate large language mod- els with modality-specific encoders for instruc- tion following and multimodal reasoning (Sing- hal et al., 2022; Wu et al., 2024a). More broadly, recent work emphasizes a shift toward generalist assistants that aim to handle diverse modalities, tasks, and interaction styles within a single inter- face, including emerging healthcare-facing con- versational systems (Munikoti et al., 2024; Ope- nAI, 2026; Anthropic, 2026; Google AI, n.d.). In the audio domain, general-purpose audio- language models have demonstrated broad ca- pabilities across tasks (Deshmukh et al., 2024b; Das et al., 2024; Chu et al., 2023; Gong et al., 2024; Ghosh et al., 2024; Huang et al., 2023). 
More recently, respiratory-focused resources such as CareAQA (Wang et al., 2025) have targeted biomedical audio question answering, but they typically emphasize a narrower set of respira- tory objectives rather than robust, multi-format clinical querying.This leaves a gap between generalist conversational systems and clinically grounded respiratory audio QA: practical assis- tants must handle heterogeneous question styles (open-ended explanations, verification, multiple- choice) and task families (discriminative and regression-style severity targets) under dataset and recording-condition shift. Prior respiratory audio QA systems remain limited in scale and scope, and they rarely study conditional special- ization across heterogeneous respiratory data and query types, nor integrated uncertainty signals that could support safer clinical behaviour. MoE/MoA architectures for specializa- tion, efficiency, and robustness. Mixture- of-Experts (MoE) enable conditional computa- tion by routing each input to a subset of ex- perts, improving scalability and capacity while controlling computation (Jacobs et al., 1991; Lep- ikhin et al., 2020; Fedus et al., 2022). In paral- lel, parameter-efficient fine-tuning methods such as adapters and LoRA have made it practical to adapt large pretrained models without up- dating all parameters (Pfeiffer et al., 2021; Hu et al., 2021). Building on these ideas, mixture- of-adapters (often referred to as MoA) train multiple lightweight adapters that are dynami- cally selected at inference time, enabling modu- lar specialization with limited trainable parame- ters (Lee et al., 2024; Buehler and Buehler, 2024; Wu et al., 2024b; Wang and Li, 2024). Beyond efficiency, conditional specialization can improve robustness under distribution shifts by activat- ing experts aligned to the current input domain (e.g., device, modality, cohort) or query formats, rather than forcing a single parameterization to fit all conditions (Guo et al., 2018). A known challenge in mixture-based systems is routing im- balance and collapse, where a small subset of experts receives most of the traffic and parame- 3 RAMoEA-QA ters are effectively under-utilized; prior work has studied load-balancing, entropy regularization, and related stabilization strategies in sparse mix- tures (Fedus et al., 2022; Lepikhin et al., 2020). Despite their success in NLP, MoE and MoA have been less explored as joint mechanisms for spe- cialization in multimodal clinical QA, particu- larly for respiratory audio, where both acoustic variability and heterogeneous question formats motivate hierarchical specialization. 3. Method This section presents RAMoEA-QA as a two- stage conditional-specialization model for respi- ratory audio question answering (RA-QA). Given an audio recording and a natural-language ques- tion, the model generates a short answer whose format and target are defined by the question (e.g., diagnosis/severity as discrete labels or spirometry/physiology as a numeric value). 3.1. Overview Our key principle is hierarchical specialization under constrained resources: we activate one acoustic pathway and one generation pathway per example (hard top-1 routing), achieving con- ditional computation without mixing experts. Figure 1 summarizes the full flow and highlights the two routing stages: (A) Audio-MoE and (B) LoRA-MoA. 
Concretely, (A) we compute a lightweight routing representation from the input and select a single pre-trained audio encoder ex- pert; its output is aligned to the LLM hidden size and injected as a selected audio prefix (aligned audio embeddings concatenated into the LLM in- put). (B) conditioned on the question and the selected aligned audio prefix, we select a single LoRA adapter attached to a shared frozen LLM backbone and generate the answer. 3.2. Routing configuration and policies In our setting, routing denotes the expert selec- tion mechanism: which audio encoder (Audio- MoE) and which language adapter (MoA) are ac- tivated for a given input pair. The routing con- figuration refers to the router-input policy used to compute gating logits: audio, question, or fused. This choice does not change what the ex- perts themselves receive (audio encoders always process the spectrogram; the LLM always pro- cesses the prompt plus the selected aligned audio prefix). Instead, it only specifies which repre- sentation is fed to each router: an audio-only summary, a question-only summary, or a fused question-audio summary. Routing is trained with straight-through Gumbel-Softmax for expert exploration and eval- uated in inference with argmax selection; we use a short balanced warmup to prevent routing col- lapse and keep router temperatures and auxiliary weights fixed across experiments unless explicitly ablated. 3.3. Audio Mixture-of-Experts (Audio-MoE) Lightweight routing proxy. As illustrated in the orange block in Fig. 1, the router directs the audio recording towards one of the avail- able experts. To avoid running all audio en- coders, the Audio-MoE first computes a low-cost routing proxy from the spectrogram and (op- tionally) the question, and uses it to select a single audio expert. We support three router- input policies (audio/question/fused); fused enables question-conditioned expert selection us- ing the proxy rather than any full expert output. Expert activation and alignment. Let E a 1 ,...,E a N a be heterogeneous pre-trained au- dio encoders used as frozen foundation back- bones.After the router selects an expert, only the chosen encoder processes the spectro- gram, producing an embedding sequence that is mapped to the LLM hidden size via an expert- specific aligner and injected into the LLM input as a selected audio prefix (aligned audio embed- dings concatenated into the LLM context). We do not update the audio encoders’ base parame- ters; instead, the audio encoding layer is wrapped with a lightweight LoRA-style adaptation layer so it can be specialized to RA-QA without full au- dio fine-tuning. Together with hard top-1 routing over a single LoRA adapter on the language side, this yields example-specific specialization while keeping inference compute close to a single-path model. 4 RAMoEA-QA Figure 1: Two-stage routing for RA-QA. (A) The MoE selects an audio expert (encoder). The resulting aligned audio embeddings are injected as a selected audio prefix. (B) The MoA selects a LoRA adapter for the language model during generation. 3.4. Language Mixture-of-Adapters (LoRA-MoA) Shared LLM with adapter experts. As il- lustrated in the green block in Fig. 1, we use a shared frozen autoregressive language backbone and attach N ℓ LoRA adapters as lightweight lan- guage experts. The MoA router selects exactly one adapter per example, enabling specialization across question formats and task families while keeping the backbone fixed. Router input without re-embedding. 
Af- ter audio expert selection and alignment, the MoA router computes its routing vec- tor using the same three router-input policies (audio/question/fused), instantiated at the language stage. Here, audio summarizes the se- lected aligned audio prefix, question summarizes the prompt/question tokens, and fused performs a lightweight fusion between them. Complementary two-stage fusion. Impor- tantly, even if both routers use a fused con- figuration, they do not condition on the same signal: (A) fuses the question with a cheap pre-encoder proxy before selecting an audio ex- pert, whereas (B) fuses the prompt with expert- produced aligned audio embeddings after expert selection. This staged conditioning makes rout- ing decisions complementary (coarse acoustic/- domain choice vs. generation/style/intent refine- ment), rather than redundant. 3.5. Training objective and routing regularization. We train with teacher forcing and compute the negative log-likelihood only over answer tokens (prompt and audio-prefix positions are masked). Let L main denote this masked language-model loss. For routing, let p a ∈R N a and p ℓ ∈R N ℓ be the softmax routing probabilities over the N a au- dio experts and N ℓ LoRA adapters, and let u a , u ℓ be the corresponding hard one-hot assignments (from straight-through Gumbel-Softmax during training). We add load-balancing regularization L LB (·) to encourage non-degenerate expert uti- lization, and optionally an entropy termL ENT (·) to control confidence/sharpness. The full objec- tive is: L =L main + λ a L LB (p a , u a ) + λ ℓ L LB (p ℓ , u ℓ ) − β a L ENT (p a )− β ℓ L ENT (p ℓ ), (1) where λ a ,λ ℓ ≥ 0 weight the load-balancing terms for the audio and language routers, and β a ,β ℓ ≥ 0 weight the entropy regularizers. More imple- mentation details are reported in Appendix C. 5 RAMoEA-QA 4. Experimental Setup 4.1. Datasets and tasks. We conduct all experiments on the RA-QA collection, which harmonizes multiple publicly available respiratory audio datasets into a uni- fied question-answering (QA) setting. Our train- ing mixture comprises 7 datasets belonging to the collection, spanning different acquisition settings, modalities, and labeling schemes (Table 7). To explicitly evaluate generalization, we further de- fine controlled held-out settings along three axes (Table 4): (i ) modality shift (∆M ), where we evaluate on audio modalities not used dur- ing training for a dataset (e.g., Coswara asthma trained on cough and tested on breathing/count- ing/vowel); (i ) dataset shift (∆D), where an entire dataset is held out from training and used only at evaluation (e.g., UK COVID-19); and (i ) task shift (∆T ), where we evaluate on tasks not seen in training (e.g., pneumonia). Task families. We organize questions into two task families: Diagnosis (predicting a categori- cal respiratory condition), including presence of disease and Severity (predicting an ordinal/dis- crete severity level), and Regression (predicting continuous physiological values such as spirome- try measures, FEV1, FVC, FEV1/FVC, and res- piratory rate). Question formats We formulate respiratory assessment as question-conditioned predic- tion from audio.Each sample consists of a recording x and a natural-language question q, and the model produces an answer ˆy appropriate to the query. We evaluate three question formats: • Open-ended (OE). The model generates a free-form textual answer. 
This format is used for descriptive responses (e.g., likely condition, short clinically grounded sum- mary) and for natural-language reporting of numeric targets. • Single-verify (SV). The model answers a yes/no verification question (e.g., ’Does the patient suffer from asthma?’). We treat this as binary decision making with standardized outputs (Yes/No) for evaluation. • Multiple-choice (MC) Unlike open-ended QA, multiple-choice questions explicitly enu- merate the candidate labels/options in the prompt. This is a form of prompt engineer- ing that constrains generation to a closed set and makes the task a selection problem rather than free-form answering. For discriminative tasks, when a dataset provides multiple question formats for the same recording (e.g., open-ended, single-verify, and/or multiple- choice), we treat each (audio, question) pair as a separate example and include all available for- mats. For regression tasks, we only present ques- tion in the open-ended format since it is the most suitable. 4.2. Baselines. We compare our model against CareAQA-style (Wang et al., 2025) single-path baselines (one LoRA adapter on a frozen LLM). In addition, we include one generic audio-language QA baseline, Pengi (Deshmukh et al., 2024a), that contextu- alize respiratory-audio question-answering within broader audio-text modelling. We provide imple- mentation details of the baselines in Appendix D. 4.3. Evaluation Metrics We follow the RA-QA protocol (Bertolino et al., 2026) and report both text-level and task-level metrics to separately capture (i) how well the model matches the reference form and semantics, and (i) whether it predicts the correct clinical la- bel or value. Semantic-level metrics. To evaluate answer quality at the word and semantic level, we com- pute Exact Match (strict normalized match), TokenF1 (token overlap), BERTScore (seman- tic similarity), and METEOR (lexical align- ment) between generated and reference answers after standard normalization.These metrics reflect complementary aspects:EM/TokenF1 emphasize surface fidelity, METEOR captures word-level matching under minor morphologi- cal variation, and BERTScore provides a se- mantic view that is less sensitive to paraphras- ing. We report overall averages and stratify by question-type. Label-correctness metrics. For classifica- tion tasks (the diagnosis task family), we eval- uate label correctness by extracting the target la- bel from the model output and comparing it to 6 RAMoEA-QA Table 1: Overview of the datasets used in our training, grouped by task family, with their target tasks, available audio modalities, and supported question formats. Task Family DatasetTaskAudio conditionQuestion-types Discriminative CoswaraAsthmaDeep and shallow breathingSV COUGHVIDDiagnosis (Healthy, Symptomatic, Covid-19)CoughingSV, OE, MC ICBHIDiagnosis (URTI, COPD)Four clinical recording devicesSV, OE, MC KAUHDiagnosis (Healthy, LRTI, COPD)Sound types and chest positionsSV, OE, MC Respiratory@TRCOPD SeverityRecording positionsOE, MC Regression M-LungSpirometry (FEV1)Deep breathingOE M-LungSpirometry (FVC)Deep breathingOE NosemicRespiratory rateBefore and after runningOE the ground truth. For open-ended and multiple- choice questions, we parse the generated text to recover the predicted diagnosis label (e.g., by nor- malizing case/punctuation and matching against the dataset label set), whereas for single-verify questions we parse the output intoYes, No. We then compute macro-F1 and accuracy over the extracted labels. 
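As a concrete illustration of this label-extraction step, a short Python sketch follows. It is not the exact RA-QA parsing protocol: the normalization rule, the Yes/No handling for single-verify questions, and the token match against the dataset label set are simplifying assumptions.

import re

def normalize(text):
    # Lower-case and strip punctuation so surface variants map to the same tokens.
    return re.sub(r"[^a-z0-9\s]", " ", text.lower()).split()

def extract_label(generated, label_set, question_type):
    tokens = normalize(generated)
    if question_type == "SV":                 # single-verify: standardized Yes/No
        if "yes" in tokens:
            return "Yes"
        if "no" in tokens:
            return "No"
        return None
    for label in label_set:                   # open-ended / multiple-choice
        if all(tok in tokens for tok in normalize(label)):
            return label
    return None

def accuracy_and_macro_f1(preds, golds):
    # Accuracy plus unweighted mean of per-class F1 over the extracted labels.
    acc = sum(p == g for p, g in zip(preds, golds)) / len(golds)
    f1s = []
    for c in sorted(set(golds)):
        tp = sum(p == c and g == c for p, g in zip(preds, golds))
        fp = sum(p == c and g != c for p, g in zip(preds, golds))
        fn = sum(p != c and g == c for p, g in zip(preds, golds))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return acc, sum(f1s) / len(f1s)

# Example: extract_label("The most likely diagnosis is COPD.", ["Healthy", "LRTI", "COPD"], "MC") -> "COPD"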
For regression tasks, we extract the first valid numeric value from the generated answer (if present) and compare it to the ground-truth tar- get. We report MAE and RMSE (lower is bet- ter) computed over successfully parsed examples. We further study, and additionally report pars- ing coverage, i.e., the fraction of test instances for which numeric extraction succeeds. 4.4. Implementation details All models use the same audio preprocessing pipeline, following the OPERA (Zhang et al., 2024) protocol for respiratory audio. Audio is resampled, trimmed to remove leading and trail- ing silence, and padded/truncated to a fixed duration; log-mel spectrograms are then com- puted following parameters in OPERA. Dataset splits follow the provided metadata and are subject-independent when subject IDs are avail- able, stratified by disease label. Optimization and decoding. We optimize with AdamW (learning rate 2 · 10 −5 , gradient clipping 1.0) for up to 100 epochs (early stopping on the validation set). Unless stated otherwise, we train with a batch size of 4 and evaluate with a batch size of 1. Model setting (default). Our default con- figuration uses two OPERA audio experts (OPERA-CT and OPERA-GT) and a shared GPT-2 backbone with LoRA adapters. Audio experts are kept frozen and their outputs are mapped to the GPT-2 hidden size with a train- able MLP aligner. Checkpoint selection. We select checkpoints by best in-distribution validation performance using the primary metric appropriate to the task family (MacroF1 for discriminative; MAE for re- gression). 5. Results We first compare RAMoEA-QA against baselines on in-distribution RA-QA performance across question formats and task families. We then study in-domain routing behaviour, including routing distributions, uncertainty-based absten- tion via routing entropy, and routing-collapse ablations, and analyze how the model behaves across different question types. Finally, we eval- uate robustness and generalization under con- trolled modality, dataset, and task shifts. Un- less otherwise stated, metrics are macro-averaged across datasets within each question format and averaged over multiple training runs; per-dataset and further results are deferred to the Ap- pendix E. 5.1. Main results Table 2 summarizes overall performance, mea- sured by answer correctness, of our model and the baseline methods, aggregated across discrim- inative and regression task families. Specifically, we compare the generic audio QA model, Pengi, single-path baselines CaReAQA-style (with OPERA-CT and OPERA-GT), and RAMoEA- QA (our proposed model).Dataset- and question-type specific results are provided in Ap- pendix E. Table 3 complements this compari- 7 RAMoEA-QA Table 2: Main results. We compare the strongest single-path baseline to our best two-stage routing model. NA indicates that no valid numeric value could be parsed from the generated answer. Bold indicates the best result and underlining indicates the second best. ModelDiscriminative tasksRegression tasks Accuracy ↑Macro F1 ↑Token F1 ↑Exact Match ↑MAE↓RMSE↓Acc@0.5 ↑ PENGI0.22 ± 0.000.21 ± 0.000.02 ± 0.000.000 ± 0.00NANANA CareAQA-operaCT0.61± 0.010.53± 0.060.83± 0.010.49± 0.012.61± 0.163.93±0.140.23±0.00 CareAQA-operaGT 0.67±0.150.59±0.200.86±0.060.55±0.142.61±0.094.26± 0.140.26± 0.04 RAMoEA-QA (ours) 0.72 ± 0.02 0.67 ± 0.03 0.88 ± 0.000.60 ± 0.032.29 ± 0.31 3.77 ± 0.53 0.32± 0.04 Table 3: String-level response similarity by question type (discriminative tasks). 
We report BERTScore / METEOR computed on the full generated text, stratified by question format. Bold indicates the best result and underlining indicates the second best. ModelGlobalSingle-VerifyOpen-endedMultiple-choice BERTscore ↑METEOR ↑BERTscore ↑METEOR ↑BERTscore ↑METEOR ↑BERTscore ↑METEOR ↑ PENGI-0.01 ± 0.002.61 ± 0.040.03 ±0.004.11 ± 0.06-0.09 ± 0.000.00 ± 0.00-0.08 ± 0.000.00 ± 0.00 CareAQA-operaCT0.87±0.0184.89 ± 0.900.87 ± 0.0286.08 ± 1.740.86±0.0083.15±0.870.85 ± 0.0282.96±1.83 CareAQA-operaGT0.89 ± 0.0587.05±5.850.91±0.0189.95±9.360.87 ± 0.00 83.22 ± 1.320.84±0.0081.72 ± 0.74 RAMoEA-QA (ours) 0.90 ± 0.01 88.38 ± 0.920.94 ± 0.00 92.64 ± 0.530.83 ± 0.0181.02 ± 1.630.85 ± 0.02 83.17 ± 2.68 son with a fine-grained semantic-level analysis (BERTScore/METEOR) across question types. Discriminative tasks. First, general audio QA models like Pengi fail to perform effectively on respiratory audio QA questions due to lim- ited exposure to respiratory sounds and domain knowledge. As shown in Table 3, Pengi often produces irrelevant responses, resulting in ex- tremely low word-overlap and semantic similarity scores. Moreover, within models adapted for the RA-QA collection, two-stage routing consistently improves diagnosis performance over single-path baselines. As shown in Table 2, our best config- uration increases MacroF1 from 0.53 (OPERA- CT plus 1 LoRA adapter) and 0.59 (OPERA-GT plus 1 LoRA adapter) to 0.67, while also improv- ing Accuracy (0.61 and 0.67 → 0.72) and Ex- act Match (0.49 and 0.55 → 0.60). Token-level agreement also increases (0.83 and 0.86→ 0.88), indicating more faithful short-answer generation beyond label extraction. Regression tasks. Generic audio QA models again fail in the regression setting: Pengi is unable to produce meaningful numerical predic- tions, and no valid numeric values can be reli- ably parsed from its outputs. Focusing instead on respiratory audio QA models, our two-stage routing mechanism also consistently improves re- gression performance. As shown in Table 2, the strongest single-path baselines achieve an MAE of 2.61, whereas our routed model reduces error to an MAE of 2.29. Following common practice in biomedical mea- surement validation (O’Brien et al., 2001; Elgendi et al., 2024), we complement MAE/RMSE with tolerance accuracy Acc@τ , i.e., the cumulative fraction of predictions whose absolute error falls within a clinically meaningful threshold τ . Fig- ure 2 shows that two-stage routing attains higher Acc@τ at stricter tolerances on multiple spirom- etry and physiology targets, while converging to similar performance at looser tolerances. To- gether, these results indicate that conditional specialization improves both average regression quality and reliability under clinically relevant er- ror budgets. Response-form fidelity across question types. Table 3 evaluates format alignment across question types using BERTScore and METEOR. RAMoEA-QA achieves the high- est global similarity scores, with notable gains in single-verify tasks (0.94 vs 92.64) due to more decisive, format-appropriate short answers.While remaining competitive on 8 RAMoEA-QA Figure 2: Tolerance accuracy for regression. Accuracy as a function of absolute-error tolerance ε for spirometry targets (FVC/FEV1 on M-Lung) and respiratory rate (NoseMic), com- paring a single-path baseline against two-stage routing. Two-stage routing reaches higher accuracy at tighter tolerances, indicating fewer large prediction errors. 
multiple-choice, the model shows a slight trade-off in open-ended tasks, likely due to more paraphrastic generations. Overall, in addition to improving prediction correctness, routing also enhances structured response alignment demon- strated by comparative or improved semantic similarity across formats. 5.2. Robustness and generalization We assess robustness under controlled distribu- tion shifts by comparing the strongest single-path baseline against our RAMoEA-QA model. Modality shift. We evaluate unseen modal- ity on a seen dataset and task to probe robust- ness to changes of respiratory audio type within a dataset.Specifically, we test on Coswara asthma, trained on breathing while tested here on cough, vowels and counting, and, while all models transfer reasonably, RAMoEA-QA re- mains competitive and more stable across runs. On Coswara COPD it improves Accuracy/- MacroF1 to 0.91/0.92 versus 0.68/0.75 (best baseline). Dataset shift. We evaluate task seen dur- ing train but on an unseen dataset to quantify transfer under cohort/device shift while keep- ing the question semantics fixed.Concretely, we test on UK COVID-19 using asthma and COVID questions. RAMoEA-QA improves both tasks under this shift, reaching 0.88/0.88 on asthma and 0.84/0.84 on COVID, outperform- ing the strongest single-path baselines in both cases. This suggests that conditional special- ization helps mitigate cohort/device differences when transferring to a new acquisition domain. Task and modality shift. We evaluate an un- seen task on unseen modalities to measure zero- shot task transfer enabled by the shared gener- ative interface. Concretely, we test Coswara on unseen symptom attributes (pneumonia), and our model reaches the best performance of 0.83 accuracy while one of the single-path baselines degrades to chance-level performance. 5.3. Routing analysis To understand why routing helps, we analyze empirical routing distributions for our best two- stage model. Figure 3 reports the fraction of sam- ples routed to each audio expert (OPERA-CT vs. OPERA-GT) and each language expert (LoRA adapter 1 vs. 2), stratified by dataset, question format and task. Dataset- and format-dependent specializa- tion. Routing is non-uniform: the model con- sistently prefers different experts across datasets, question formats, tasks, and diseases. This sup- ports the hypothesis that heterogeneous respi- ratory corpora (acquisition setups, noise condi- tions, and labelling conventions) benefit from conditional pathways rather than a single aver- aged representation. Coupled but non-redundant two-stage routing. A notable pattern is that adapter routing largely mirrors audio routing (LoRA- 9 RAMoEA-QA Table 4: Robustness under controlled shifts. We compare single-path baselines to RAMoEA-QA under modality, dataset, and task shifts. Bold indicates the best result and underlining indicates the second best. Legend: ∆M = unseen modality; ∆D = unseen dataset; ∆T = task never seen in training. ShiftDatasetTaskCareAQA-operaCTCareAQA-operaGTRAMoEA-QA (ours) Accuracy ↑MacroF1 ↑Accuracy ↑MacroF1 ↑Accuracy ↑MacroF1 ↑ ∆MCoswara Asthma0.90± 0.13 0.90± 0.130.66± 0.470.75± 0.350.87±0.020.87±0.02 COPD0.68±0.440.75±0.340.66± 0.470.75± 0.350.91± 0.02 0.92± 0.02 ∆DUK COVID-19 Asthma0.84±0.210.84±0.210.66± 0.470.74± 0.350.88± 0.07 0.88± 0.06 COVID0.56± 0.320.65± 0.20 0.66±0.470.74±0.350.84± 0.04 0.84± 0.04 ∆T + ∆MCoswaraPneumonia0.77±0.310.79± 0.280.33± 0.000.50± 0.000.83± 0.05 0.84± 0.04 Disc. 
Avg.0.75±0.190.75±0.190.67± 0.200.59± 0.290.78± 0.04 0.79± 0.04 1/LoRA-2trackOPERA-CT/OPERA-GT closely), suggesting that training discovers two coherent end-to-end pipelines (audio expert plus generation adapter) rather than mixing components arbitrarily.Importantly, this does not make the routers redundant: even when both use a fused policy, they condition on different signals at different stages.The Audio-MoE router uses a cheap pre-encoder proxy computed before expert selection, whereas the MoA router conditions on expert-produced aligned audio embeddings (the selected audio prefix concatenated into the LLM input). This staged conditioning lets the first router make coarse acoustic/domain decisions at low cost, while the second refines generation behaviour using richer, expert-specific representations. Routing collapse analysis. We further study routing collapse in the MoE and MoA lay- ers and its downstream impact on discrimina- tive performance (MacroF1) and regression er- ror (MAE). We disable routing at inference by fixing an end-to-end path (audio expert ∈ OPERA-CT, OPERA-GT, LoRA ∈L1, L2). Table 5 reports MacroF1/MAE for forced routes (fixed audio expert × fixed LoRA). While some collapsed paths are locally competitive for certain tasks, they have a high variance across dataset- question-type combinations and typically under- perform the best routed configuration. We also note that the forced routes here have a better performance on certain tasks compared to single- path models in Table 2 on their strongest tasks (e.g. OPERA-GT for classification), which shows the strength brought by specialization. Table 5: Forced-route evaluation of RAMoEA- QA under expert collapse. We report MacroF1 (discriminative) / MAE (re- gression) by dataset and question type. Full is the fully routed model. DatasetQTForced routing operaCToperaGTFull L1L2L1L2 KAUH SV0.750.751.001.001.00 OE0.330.330.500.440.33 MC0.330.330.440.500.38 CoughVID SV0.500.550.910.880.77 OE0.160.160.160.000.00 MC0.330.440.500.500.61 ICBHI SV0.750.751.001.000.71 OE0.310.250.250.250.43 MC0.500.500.500.500.43 Resp.@TR SV0.650.700.600.500.57 OE1.001.001.001.001.00 MC1.001.001.001.001.00 CoswaraSV0.730.530.920.900.85 M LungOE (↓)3.152.422.423.612.29 Reg. Avg.OE (↓)3.152.424.153.622.29 Disc. Avg.ALL0.540.490.730.710.67 Our two-stage routing consistently yields the lowest MAE and remains competitive on discrim- inative tasks. By selectively leveraging expert- adapter pairs, the router balances performance across datasets; for instance, it selects the high- performing OPERA-GT/L1 path for 88% of KAUH samples.Crucially, sample-level rout- ing can outperform even the best fixed path, as demonstrated by the superior regression results on the M Lung dataset. In nearly all dataset- question-type combinations, the full model with routing is either the best option or a strong com- promise among forced paths. 10 RAMoEA-QA Figure 3: Unified routing heatmap across datasets, question formats, and diagnosis categories. Columns are grouped into Datasets (green), Question types (red), Diagnosis labels (blue), and Tasks (orange), while rows correspond to the four experts (operaCT, operaGT, LoRA expert 1, LoRA expert 2). 6. 
Discussion We presented RAMoEA-QA, a hierarchical and modular model for respiratory audio ques- tion answering built around a two-stage condi- tional specialization scheme: an Audio Mixture- of-Experts (Audio-MoE) selects one frozen au- dio encoder expert per example, and a Lan- guage Mixture-of-Adapters (MoA) selects one LoRA-specialized language expert on top of a shared frozen LLM, with both decisions made by lightweight routers. The central empirical takeaway is that this two-stage routing (audio- expert selection followed by LoRA-adapter selec- tion) improves accuracy and robustness across heterogeneous question formats and task fami- lies, with the clearest gains on diagnosis-style questions where the model benefits from both acoustic diversity and language specialization. Importantly, RAMoEA-QA also generalizes better than non-specialized baselines: without modifying the frozen backbones, it transfers more reliably to new datasets, recording conditions, and previously unseen task formulations. Our evaluation spans both discriminative (classifica- tion/decision) and regression question families, and we observe consistent benefits across these supervision types. Future directions include exploiting router un- certainty as a confidence signal for selective an- swering (abstention) and studying how the pro- posed architecture scales with larger backbones, more experts, and additional data. References Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Joey Liu, and Henning Muller. VQA- Med: Overview of the Medical Visual Question Answering Task at ImageCLEF 2019. Gokhan Altan, Yakup Kutlu, Yusuf Garbi, Adnan Ozhan Pekmezci, and Serkan Nural. Multimedia respiratory database (respiratory- database@tr): Auscultation sounds and chest x-rays, 2021. URL https://arxiv.org/abs/ 2101.10946. Anthropic. Advancing claude in healthcare and the life sciences. https://w.anthropic. com/news/healthcare-life-sciences, Jan- uary 2026. Accessed 2026-02-01. Gaia A. Bertolino, Yuwei Zang, Tong Xia, Domenico Talia, and Cecilia Mascolo. Ra-qa: Towards respiratory audio-based health ques- tion answering. February 2026. doi: 10.5281/ zenodo.18489541. URL https://zenodo.org/ records/18489541. Debarpan Bhattacharya, Neeraj Kumar Sharma, Debottam Dutta, Srikanth Raj Chetupalli, Pravin Mote, Sriram Ganapathy, C. Chan- drakiran, Sahiti Nori, K. K. Suhail, Sadhana 11 RAMoEA-QA Gonuguntla, and Murali Alagesan. Coswara: A respiratory sounds and symptoms dataset for remote screening of SARS-CoV-2 infection. Scientific Data, 10(1):397, June 2023. ISSN 2052-4463. doi: 10.1038/s41597-023-02266-0. URL https://w.nature.com/articles/ s41597-023-02266-0. Eric L. Buehler and Markus J. Buehler.X- lora: Mixture of low-rank adapter experts, a flexible framework for large language mod- els with applications in protein mechanics and molecular design.(arXiv:2402.07148), March 2024.doi:10.48550/arXiv.2402. 07148. URL http://arxiv.org/abs/2402. 07148. arXiv:2402.07148 [cond-mat]. Kayla-Jade Butkow,Ting Dang,Andrea Ferlini, Dong Ma, Yang Liu, and Cecilia Mascolo.An evaluation of heart rate monitoring with in-ear microphones under motion.Pervasive and Mobile Computing, 100:101913, 2024.ISSN 1574-1192.doi: https://doi.org/10.1016/j.pmcj.2024.101913. URL https://w.sciencedirect.com/ science/article/pii/S1574119224000397. Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou.Qwen-Audio: Advanc- ing Universal Audio Understanding via Uni- fied Large-Scale Audio-Language Models, De- cember 2023. 
URL http://arxiv.org/abs/ 2311.07919. Issue: arXiv:2311.07919 arXiv: 2311.07919 [eess]. Harry Coppock, George Nicholson, Ivan Kiskin, Vasiliki Koutra, Kieran Baker, Jobie Budd, Richard Payne, Emma Karoune, David Hur- ley, Alexander Titcomb, Sabrina Egglestone, Ana Tendero Ca ̃nadas, Lorraine Butler, Radka Jersakova, Jonathon Mellor, Selina Patel, Tracey Thornley, Peter Diggle, Sylvia Richard- son, Josef Packham, Bj ̈orn W. Schuller, Da- vide Pigoli, Steven Gilmour, Stephen Roberts, and Chris Holmes. Audio-based ai classifiers show no evidence of improved covid-19 screen- ing over simple symptoms checkers, 2023. URL https://arxiv.org/abs/2212.08570. Nilaksh Das, Saket Dingliwal, Srikanth Ronanki, Rohit Paturi, Zhaocheng Huang, Prashant Mathur, Jie Yuan, Dhanush Bekal, Xing Niu, Sai Muralidhar Jayanthi, Xilai Li, Karel Mundnich, Monica Sunkara, Sundararajan Srinivasan, Kyu J. Han, and Katrin Kirch- hoff. SpeechVerse: A Large-scale Generaliz- able Audio Language Model, May 2024. URL http://arxiv.org/abs/2405.08295.Issue: arXiv:2405.08295 arXiv: 2405.08295 [cs]. Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang.Pengi:An audio language model for audio tasks. (arXiv:2305.11834),January2024a. doi:10.48550/arXiv.2305.11834.URL http://arxiv.org/abs/2305.11834. arXiv:2305.11834 [eess]. Soham Deshmukh, Benjamin Elizalde, Rita Singh, and Huaming Wang. Pengi: An Au- dio Language Model for Audio Tasks, January 2024b. URL http://arxiv.org/abs/2305. 11834. arXiv:2305.11834 [eess]. Mohamed Elgendi, Fridolin Haugg, Richard Ri- bon Fletcher, John Allen, Hangsik Shin, Ay- men Alian, and Carlo Menon. Recommen- dations for evaluating photoplethysmography- based algorithms for blood pressure assess- ment.Communications Medicine, 4(1):140, 2024.ISSN 2730-664X.doi:10.1038/ s43856-024-00555-2. William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion param- eter models with simple and efficient sparsity, 2022. URL https://arxiv.org/abs/2101. 03961. Mohammad Fraiwan, Luay Fraiwan, Mohanad Alkhodari, and Omnia Hassanin. Recognition of pulmonary diseases from lung sounds using convolutional neural networks and long short- term memory. Journal of Ambient Intelligence and Humanized Computing, 13:4759–4771, 04 2021a. doi: 10.1007/s12652-021-03184-y. Mohammad Fraiwan, Luay Fraiwan, Basheer Khassawneh, and Ali Ibnian. A dataset of lung sounds recorded from the chest wall using an electronic stethoscope.Data in Brief, 35:106913, 2021b. ISSN 2352-3409. doi: https://doi.org/10.1016/j.dib.2021.106913. URL https://w.sciencedirect.com/ science/article/pii/S2352340921001979. 12 RAMoEA-QA Siddhartha Gairola, Francis Tom, Nipun Kwa- tra, and Mohit Jain.Respirenet: A deep neural network for accurately detecting abnor- mal lung sounds in limited data setting. In 2021 43rd Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC), pages 527–530, 2021. doi: 10.1109/EMBC46164.2021.9630091. Sreyan Ghosh, Sonal Kumar, Ashish Seth, Chan- dra Kiran Reddy Evuru, Utkarsh Tyagi, S Sak- shi, Oriol Nieto, Ramani Duraiswami, and Di- nesh Manocha. Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities, 2024. URL https: //arxiv.org/abs/2406.11768. Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James Glass. Listen, think, and understand, 2024. URL https: //arxiv.org/abs/2305.10790. Google AI. Health ai. https://ai.google/ health/, n.d. Accessed 2026-02-01. Jiang Guo, Darsh J. 
Shah, and Regina Barzi- lay.Multi-source domain adaptation with mixture of experts.(arXiv:1809.02256), October 2018.doi:10.48550/arXiv.1809. 02256. URL http://arxiv.org/abs/1809. 02256. arXiv:1809.02256 [cs]. Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models, 2021. URL https://arxiv.org/abs/2106.09685. Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Om- niMedVQA: A New Large-Scale Comprehen- sive Evaluation Benchmark for Medical LVLM, April 2024.URL http://arxiv.org/abs/ 2402.09181. arXiv:2402.09181 [eess]. Rongjie Huang, Mingze Li, Dongchao Yang, Jia- tong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, Yi Ren, Zhou Zhao, and Shinji Watanabe. Au- diogpt: Understanding and generating speech, music, sound, and talking head, 2023. URL https://arxiv.org/abs/2304.12995. Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton. Adaptive mixtures of lo- cal experts. Neural Computation, 3:79–87, 03 1991. doi: 10.1162/neco.1991.3.1.79. Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W. Cohen, and Xinghua Lu. Pub- MedQA: A Dataset for Biomedical Research Question Answering, September 2019. URL http://arxiv.org/abs/1909.06146.Issue: arXiv:1909.06146 arXiv: 1909.06146 [cs]. Wassim Labaki and Meilan Han. Chronic res- piratory diseases: a global view. The Lancet Respiratory Medicine, 8:531–533, 06 2020. doi: 10.1016/S2213-2600(20)30157-0. Gyubok Lee, Hyeonji Hwang, Seongsu Bae, Yeonsu Kwon, Woncheol Shin, Seongjun Yang, Minjoon Seo, Jong-Yeup Kim, and Edward Choi. Ehrsql: A practical text-to-sql bench- mark for electronic health records, 2023. URL https://arxiv.org/abs/2301.07695. Gyuseong Lee, Wooseok Jang, Jinhyeon Kim, Jaewoo Jung, and Seungryong Kim. Domain generalization using large pretrained models with mixture-of-adapters. (arXiv:2310.11031), December 2024.doi: 10.48550/arXiv.2310. 11031. URL http://arxiv.org/abs/2310. 11031. arXiv:2310.11031 [cs]. Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. Gshard: Scaling giant models with conditional computation and automatic sharding, 2020.URL https://arxiv.org/ abs/2006.16668. Mohammed Mosuily, Lindsay Welch, and Jag- mohan Chauhan. Mmlung: Moving closer to practical lung health estimation using smart- phones. In Interspeech 2023, pages 2333–2337, 2023. doi: 10.21437/Interspeech.2023-721. Sai Munikoti,Ian Stewart,Sameera Ho- rawalavithana, Henry Kvinge, Tegan Emerson, Sandra E. Thompson, and Karl Pazdernik. Generalist multimodal ai: A review of archi- tectures, challenges and opportunities, 2024. Vincent Nguyen, Sarvnaz Karimi, Maciej Rybin- ski, and Zhenchang Xing. MedRedQA for Med- ical Consumer Question Answering: Dataset, 13 RAMoEA-QA Tasks, and Neural Baselines. In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Con- ference of the Asia-Pacific Chapter of the As- sociation for Computational Linguistics (Vol- ume 1: Long Papers), pages 629–648, Nusa Dua, Bali, 2023. Association for Computa- tional Linguistics.doi: 10.18653/v1/2023. ijcnlp-main.42. URL https://aclanthology. org/2023.ijcnlp-main.42. Jungwoo Oh, Gyubok Lee, Seongsu Bae, Joon- myoung Kwon, and Edward Choi.ECG- QA: A Comprehensive Question Answering Dataset Combined With Electrocardiogram, October 2023. URL http://arxiv.org/abs/ 2306.15681. 
Issue: arXiv:2306.15681 arXiv: 2306.15681 [q-bio]. OpenAI.Introducingchatgpt health. https://openai.com/index/ introducing-chatgpt-health/,January 2026. Accessed 2026-02-01. Lara Orlandic, Tomas Teijeiro, and David Atienza.The COUGHVID crowdsourcing dataset, a corpus for the study of large-scale cough analysis algorithms. Scientific Data, 8 (1):156, June 2021.ISSN 2052-4463.doi: 10.1038/s41597-021-00937-4. URL https:// doi.org/10.1038/s41597-021-00937-4. Eoin O’Brien, Bernard Waeber, Gianfranco Parati, Jan Staessen, and Martin G Myers. Blood pressure measuring devices: recommen- dations of the european society of hyperten- sion. BMJ: British Medical Journal, 322(7285): 531–536, March 2001. ISSN 0959-8138. doi: 10.1136/bmj.322.7285.531. Anusri Pampari, Preethi Raghavan, Jennifer Liang, and Jian Peng. emrqa: A large cor- pus for question answering on electronic med- ical records, 2018. URL https://arxiv.org/ abs/1809.00732. Jonas Pfeiffer, Aishwarya Kamath, Andreas R ̈uckl ́e, Kyunghyun Cho, and Iryna Gurevych. Adapterfusion: Non-destructive task compo- sition for transfer learning. In Paola Merlo, Jorg Tiedemann, and Reut Tsarfaty, edi- tors, Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics:Main Volume, page 487–503, Online, April 2021. Associa- tion for Computational Linguistics. doi: 10. 18653/v1/2021.eacl-main.39. URL https:// aclanthology.org/2021.eacl-main.39/. Sandra Reichert, Raymond Gass, Christian Brandt, and Emmanuel Andr`es. Analysis of respiratory sounds: state of the art.Clin. Med. Circ. Respirat. Pulm. Med., 2:45–58, May 2008. Bruno M Rocha, Dimitris Filos, Lu ́ıs Mendes, Gorkem Serbes, Sezer Ulukaya, Yasemin P Kahya, Nikˇsa Jakovljevic, Tatjana L Tu- rukalo, Ioannis M Vogiatzis, Eleni Peran- toni, Evangelos Kaimakamis, Pantelis Natsi- avas, Ana Oliveira, Cristina J ́acome, Alda Marques, Nicos Maglaveras, Rui Pedro Paiva, Ioanna Chouvarda, and Paulo de Carvalho. An open access database for the evalua- tion of respiratory sound classification algo- rithms.Physiological Measurement, 40(3): 035001, March 2019. doi: 10.1088/1361-6579/ ab03ea. URL https://dx.doi.org/10.1088/ 1361-6579/ab03ea. Publisher: IOP Publish- ing. Karan Singhal, Shekoofeh Azizi, Tao Tu, S. Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole- Lewis, Stephen Pfohl, Perry Payne, Mar- tin Seneviratne, Paul Gamble, Chris Kelly, Nathaneal Scharli, Aakanksha Chowdhery, Philip Mansfield, Blaise Aguera y Arcas, Dale Webster, Greg S. Corrado, Yossi Matias, Katherine Chou, Juraj Gottweis, Nenad Toma- sev, Yun Liu, Alvin Rajkomar, Joelle Barral, Christopher Semturs, Alan Karthikesalingam, and Vivek Natarajan. Large language models encode clinical knowledge, 2022. URL https: //arxiv.org/abs/2212.13138. Anssi Sovij ̈arvi, F. Dalmasso, J. Vanderschoot, Leo Malmberg, G. Righini, and S.A.T. Stone- man. Definition of terms for applications of res- piratory sounds. Eur Respir Rev, 10, 11 1999. Arpan Srivastava, Sonakshi Jain, Ryan Mi- randa, Shruti Patil, Sharnil Pandya, and Ke- tan Kotecha. Deep learning based respiratory sound analysis for detection of chronic obstruc- 14 RAMoEA-QA tive pulmonary disease. PeerJ Computer Sci- ence, 7, 02 2021. doi: 10.7717/peerj-cs.369. George Tsatsaronis, Georgios Balikas, Prodro- mos Malakasiotis, Ioannis Partalas, Matthias Zschunke, Michael R. 
Alvers, Dirk Weis- senborn, Anastasia Krithara, Sergios Petridis, Dimitris Polychronopoulos, Yannis Almiran- tis, John Pavlopoulos, Nicolas Baskiotis, Patrick Gallinari, Thierry Arti ́eres, Axel- Cyrille Ngonga Ngomo,Norman Heino, Eric Gaussier, Liliana Barrio-Alvers, Michael Schroeder, Ion Androutsopoulos, and Georgios Paliouras. An overview of the bioasq large- scale biomedical semantic indexing and ques- tion answering competition. BMC Bioinfor- matics, 16(1):138, April 2015. ISSN 1471-2105. doi: 10.1186/s12859-015-0564-6. Ping Wang, Tian Shi, and Chandan K. Reddy. Text-to-sql generation for question answering on electronic medical records, 2020.URL https://arxiv.org/abs/1908.01839. Renzhi Wang and Piji Li. Memoe: Enhancing model editing with mixture of experts adap- tors. (arXiv:2405.19086), 2024. doi: 10.48550/ arXiv.2405.19086. URL http://arxiv.org/ abs/2405.19086. arXiv:2405.19086 [cs]. Tsai-Ning Wang, Lin-Lin Chen, Neil Zeghi- dour, and Aaqib Saeed.Careaqa: A car- diac and respiratory audio question answer- ing model for open-ended diagnostic reason- ing. (arXiv:2505.01199), 2025. doi: 10.48550/ arXiv.2505.01199. URL http://arxiv.org/ abs/2505.01199. arXiv:2505.01199 [cs]. Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm, 2024a. URL https://arxiv. org/abs/2309.05519. Xun Wu, Shaohan Huang, and Furu Wei. Mix- ture of lora experts.(arXiv:2404.13628), April 2024b.doi:10.48550/arXiv.2404. 13628. URL http://arxiv.org/abs/2404. 13628. arXiv:2404.13628 [cs]. Tianyi Zhang, Varsha Kishore, Felix Wu, Kil- ian Q. Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert, 2020. URL https://arxiv.org/abs/1904.09675. Yuwei Zhang, Tong Xia, Jing Han, Yu Wu, Geor- gios Rizos, Yang Liu, Mohammed Mosuily, Jagmohan Chauhan, and Cecilia Mascolo. To- wards open respiratory acoustic foundation models: Pretraining and benchmarking, 2024. URL https://arxiv.org/abs/2406.16148. Appendix A. Datasets overview The RA-QA collection is built upon 11 distinct datasets, each contributing to the overall diver- sity and richness of the resource (dataset avail- ability is reported in Table 6). This section pro- vides a more detailed overview of these datasets. Each dataset comprises two core components: a metadata file and a corresponding audio set. The metadata encompasses structured clinical and contextual information, while the audio data consists of respiratory-related recordings. All au- dio recordings are fully anonymized and the as- sociated metadata have been carefully curated to exclude any personally identifiable informa- tion or inappropriate content, ensuring compli- ance with ethical standards and data protection regulations. This unified structure supports con- sistent integration and facilitates downstream re- search tasks. Following this introduction, a more detailed description is provided for each dataset from the RA-QA collection used in this paper. Table 7 reports the number of recordings and QA pairs used per dataset, split, modality, and task family within training. UK COVID-19 (Coppock et al., 2023). 
The UK COVID-19 Vocal Audio Dataset is designed for training and evaluating machine learning models to classify SARS-CoV-2 infection status and associated respiratory symptoms using vocal audio. The dataset was collected by the UK Health Security Agency from March 2021 to March 2022, during the prevalence of the Alpha and Delta SARS-CoV-2 variants, as well as some Omicron sublineages. Participants were recruited through the national Test and Trace programme and the REACT-1 survey. Audio recordings of voluntary coughs and exhalations were gathered through the 'Speak up to help beat coronavirus' digital survey, alongside demographic data, self-reported symptoms and respiratory conditions (see Figure 4). These recordings were linked to SARS-CoV-2 test results, although speech recordings are not included in the open access version of the dataset and were not used in this study. The study was approved by The National Statistician's Data Ethics Advisory Committee (reference NSDEC(21)01), the Cambridge South NHS Research Ethics Committee (reference 21/E/0036) and the Nottingham NHS Research Ethics Committee (reference 21/EM/0067). Informed consent was obtained from all participants.

Table 6: Dataset Availability. The ICBHI and HF Lung datasets are sourced from multiple origins; please refer to the detailed description in the text below. The MMLung and NoseMic datasets are available upon request; the specific licensing terms are outlined in the Data Transfer Agreement (DTA).

Dataset | Source | Access | License
UK COVID-19 | IC | https://zenodo.org/records/10043978 | OGL 3.0
CoughVID | EPFL | https://zenodo.org/records/4048312 | CC BY 4.0
ICBHI* | – | https://bhichallenge.med.auth.gr | CC0
HF Lung* | – | https://gitlab.com/techsupportHF/HF_Lung_V1 | CC BY 4.0
HF Lung* | – | https://gitlab.com/techsupportHF/HF_Lung_V1_IP | CC BY-NC 4.0
Coswara | IISc | https://github.com/iiscleap/Coswara-Data | CC BY 4.0
KAUH | KAUH | https://data.mendeley.com/datasets/jwyy9np4gv/3 | CC BY 4.0
Respiratory@TR | ITU | https://data.mendeley.com/datasets/p9z4h98s6j/1 | CC BY 4.0
MMLung | UoS | https://github.com/MohammedMosuily/mmlung | Custom license
NoseMic | UoC | https://github.com/evelyn0414/OPERA/tree/main/datasets/nosemic | Custom license

Table 7: Statistics of the RA-QA collection used in the training. For each dataset we report the task family, target task, audio modalities, and the number of derived QA pairs per split (Train/Val/Test). An asterisk (*) denotes datasets where the diagnosis task is available only in the single-verify format.

Task Family | Dataset | Task | Modalities | QA N (T/V/Te)
Discriminative | Coswara | Asthma* | Deep and shallow breathing | 204 / 64 / 64
Discriminative | COUGHVID | Diagnosis (Healthy, Symptomatic, Covid-19) | Coughing | 225 / 45 / 45
Discriminative | ICBHI | Diagnosis (URTI, COPD) | Four clinical recording devices | 423 / 32 / 32
Discriminative | KAUH | Diagnosis (Healthy, LRTI, COPD) | 6 sound types and 7 chest positions | 225 / 45 / 45
Discriminative | Respiratory@TR | COPD Severity | 11 recording positions | 210 / 70 / 70
Regression | MMLung | Spirometry (FEV1) | Deep breathing | 32 / 4 / 4
Regression | MMLung | Spirometry (FVC) | Deep breathing | 32 / 4 / 4
Regression | NoseMic | Respiratory rate | Before and after running | 50 / 9 / 9

COUGHVID (Orlandic et al., 2021). The COUGHVID dataset consists of over 25,000 crowdsourced cough recordings, collected from a diverse group of participants across different ages, genders, geographic locations, and COVID-19 statuses (see Figure 5). All data collection and annotation processes adhered to the relevant ethical guidelines, with informed consent obtained from all participants who submitted their cough recordings and associated metadata.
ICBHI (Rocha et al., 2019). The ICBHI Respiratory Sound Database includes audio samples collected by two independent research teams from different countries over several years. Ethical approval was granted by the relevant ethics committees of the respective institutions. The majority of the database comprises audio samples recorded by the School of Health Sciences, University of Aveiro (ESSUA) research team at the Respiratory Research and Rehabilitation Laboratory (Lab3R) in Aveiro, Portugal and Hospital Infante D. Pedro. The second team, from Aristotle University of Thessaloniki (AUTH) and the University of Coimbra (UC), collected respiratory sounds at Papanikolaou General Hospital in Thessaloniki and the General Hospital of Imathia (Health Unit of Naousa), Greece. The database contains a total of 5.5 hours of recordings across 920 annotated audio samples from 126 subjects (see Figure 6).

MMLung (Mosuily et al., 2023). The MMLung dataset was collected from 40 participants (20 male, 20 female) aged 18 to 85 years, all of whom are English speakers from the UK. Among the participants, 12 were healthy, while the others included seven self-reported COPD patients, seven self-reported asthma patients, and 14 individuals with other long-term conditions. Ethical approval for this study was granted by the University of Southampton. Data were gathered using three devices: a Google Pixel 6 smartphone with a custom app for data collection and an Easy on-PC ultrasonic spirometer from ndd Medical Technologies. The audio data was collected in stereo mode at a sampling rate of 44,100 Hz, saved in WAV format and recorded in a quiet room to ensure optimal conditions. The data collection included four audio modalities: cough, vowels, mobile spirometry and speech, through a series of tasks performed in a single session by each participant. For this paper, only the deep breath and the vowel sound of 'o' are included (see Figure 7). Ground truth data were obtained using a medical-grade spirometer, with measurements taken by a healthcare professional according to American Thoracic Society/European Respiratory Society (ATS/ERS) clinical standards. However, it is important to note that objective measurements can be subject to individual effort, which may introduce some errors (e.g., effort-dependent blows). This dataset is available upon request.

Coswara (Bhattacharya et al., 2023). The Coswara dataset consists of respiratory sounds recorded between April 2020 and February 2022 from 2,635 individuals, including 1,819 SARS-CoV-2 negative, 674 positive and 142 recovered participants. The dataset includes nine categories of respiratory sounds related to breathing, coughing, and speech. Metadata accompanying the recordings contains demographic details such as age, gender and geographic location, as well as health-related information including symptoms, pre-existing respiratory conditions, comorbidities and SARS-CoV-2 test status (see Figure 8). The data collection was approved by the Institutional Human Ethics Committee at the Indian Institute of Science, Bangalore. Informed consent was obtained from all participants who uploaded their data and the collected data was anonymized to exclude personally identifiable information.
KAUH (Fraiwan et al., 2021b). The KAUH dataset includes audio recordings of lung sounds from patients with seven different respiratory conditions: asthma, heart failure, pneumonia, bronchitis, pleural effusion, lung fibrosis and chronic obstructive pulmonary disease (COPD), as well as normal breathing sounds (see Figure 9). The recordings were made using an electronic stethoscope, with the chest wall examined at various points by a specialist physician. Each sound was recorded three times with different frequency filters to highlight specific bodily sounds. This dataset is valuable for the development of automated systems designed to detect pulmonary diseases or classify lung sounds. All participants (or their parents in the case of minors) provided written informed consent for their inclusion in the study and the sharing of their data. The study was approved by the Institutional Review Board at King Abdullah University Hospital and Jordan University of Science and Technology (Ref. 91/136/2020). Data collection adhered to all relevant ethical guidelines and regulations and the authors have the right to share the data publicly.

NoseMic (Butkow et al., 2024). The NoseMic dataset is a subset of data collected for a respiratory rate estimation project. The audio recordings were captured using microphones positioned close to the participants' noses, while respiratory dynamics were measured using a Zephyr pressure sensor on the chest. Data collection took place in a stationary environment, both before and after physical exercise. A total of 21 participants were involved in the study, with data from some participants excluded due to poor sensing quality. The benchmark includes audio recordings from before and after running, with each recording segmented into 30-second windows that overlap by 15 seconds. The average respiratory rate for each window was used as the ground truth.

Respiratory@TR (Altan et al., 2021). The Respiratory@TR dataset contains lung sounds recorded from both the left and right sides of the posterior and anterior chest wall, as well as the back, using two digital stethoscopes at Antakya State Hospital. In addition to the lung sound recordings, the dataset includes chest X-rays, pulmonary function test (PFT) variables, spirometric curves and the St. George Respiratory Questionnaire (SGRQ-C) as multimedia and clinical functional analysis data. The lung sounds were captured across 12 channels, focusing on the upper, middle and lower lung regions, as well as the costophrenic angles on both the posterior and anterior chest sides. The recordings were validated and labeled by two pulmonologists, who assessed the chest X-rays, PFT results and auscultation sounds of the subjects. The labels correspond to five levels of COPD severity (COPD0, COPD1, COPD2, COPD3, COPD4). The dataset was released by Iskenderun Technical University, Turkey. Participation was voluntary and patients, aged 38 to 68, were selected from various occupational groups, socio-economic backgrounds and genders to ensure a diverse representation of the disorders.

Figure 4: Label distributions for viral load categories (multiclass), as well as binary labels for symptoms such as runny or blocked nose and conditions like asthma, from the UK COVID-19 dataset. As with other datasets used in this work, these labels are highly imbalanced and require preprocessing and reduction strategies to ensure meaningful training and evaluation.

Figure 5: Label distributions from the CoughVID dataset, illustrating examples of audio-related attributes such as cough type, presence of stridor and associated diagnoses. This highlights that, beyond clinical metadata, some datasets also include perceptual or acoustic labels (e.g., wheezes, stridors), which are directly linked to the audio signal and can support more fine-grained sound analysis.

Figure 6: Label distributions from the ICBHI dataset, showing the annotated diagnosis categories, presence of crackles and recording positions (e.g., trachea, anterior left, posterior right). This exemplifies how some datasets provide detailed contextual metadata, such as auscultation position, which can be crucial for interpreting respiratory sounds and modeling location-sensitive acoustic features.

Figure 7: Key characteristics of the MMLung dataset. The Type of Audio distinguishes between different types of respiratory recordings, such as vowel phonation and deep breathing, which capture varied aspects of lung sound dynamics.

Figure 8: Distributions from the Coswara dataset of the selected attributes health status, COVID test result and mask usage at the time of recording. The Health Status attribute distinguishes between healthy individuals and COVID-19 positive cases with varying symptom severity, categorized as asymptomatic, mild or moderate.

Figure 9: Key attributes of the KAUH dataset: sound type, recording position and diagnosis. The Sound Type plot shows the presence of multiple annotated labels per recording.

Figure 10: Recording phase attribute distribution from the NoseMic dataset. This captures whether the respiratory sounds were recorded before or after physical exertion (e.g., running), which is important for evaluating exertion-induced changes in breathing patterns.

Figure 11: Two key attributes from the Respiratory@TR dataset: recording position and COPD severity. The recording position captures the recording point, while the COPD severity is scored on a scale from 0 (no COPD) to 4 (very severe), providing clinically relevant gradation to analyze how disease progression correlates with acoustic features in the recordings.

Appendix B. Metrics overview

We follow the RA-QA evaluation protocol (Bertolino et al., 2026) and report both text-level and task-level metrics to separately capture (i) how well the model matches the reference form and semantics, and (ii) whether it predicts the correct clinical label or value.
Text-level metrics are computed directly between the generated answer and the reference after standard normalization, while task-level metrics operate on an extracted label/value from the output and therefore ignore any additional explanatory context.

B.1. Metrics for discriminative tasks. We group discriminative evaluation into response form metrics and label-correctness metrics.

B.1.1. Response form metrics. These metrics assess surface fidelity and semantic similarity of the generated answer:

• Exact Match (EM) evaluates whether the normalized predicted response matches the normalized reference string exactly (strict match). EM is informative for structured outputs but can underestimate performance for valid paraphrases.

• TokenF1 measures token-level overlap between prediction and reference using the harmonic mean of token precision and token recall. TokenF1 rewards partial matches and is less brittle than EM (see the sketch after this list).

• METEOR measures lexical alignment between prediction and reference, designed to be robust to minor morphological variation and alternate word forms via flexible word-level matching.

• BERTScore (Zhang et al., 2020) measures semantic similarity by computing cosine similarity between contextual token embeddings from a pre-trained BERT model and aggregating alignment scores across tokens. It provides a semantic view that is less sensitive to paraphrasing than n-gram overlap metrics.

Together, EM/TokenF1 emphasize surface-form fidelity, METEOR captures word-level matching under small lexical variations, and BERTScore provides a semantics-oriented complement for open-ended answers.
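To make the surface-form metrics concrete, the following minimal Python sketch shows one plausible implementation of normalization, Exact Match, and TokenF1 as described above. The exact normalization rules of the RA-QA protocol are not specified here, so the lowercasing, punctuation-stripping, and whitespace choices below are assumptions for illustration only.

```python
import re
import string

def normalize(text: str) -> str:
    """Assumed normalization: lowercase, strip punctuation, collapse whitespace."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", text).strip()

def exact_match(prediction: str, reference: str) -> float:
    """EM: 1.0 iff the normalized strings are identical."""
    return float(normalize(prediction) == normalize(reference))

def token_f1(prediction: str, reference: str) -> float:
    """TokenF1: harmonic mean of token precision and recall on normalized tokens."""
    pred_tokens = normalize(prediction).split()
    ref_tokens = normalize(reference).split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    # Count overlapping tokens (multiset intersection).
    ref_counts: dict[str, int] = {}
    for tok in ref_tokens:
        ref_counts[tok] = ref_counts.get(tok, 0) + 1
    common = 0
    for tok in pred_tokens:
        if ref_counts.get(tok, 0) > 0:
            common += 1
            ref_counts[tok] -= 1
    if common == 0:
        return 0.0
    precision = common / len(pred_tokens)
    recall = common / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: a valid paraphrase scores 0 on EM but receives partial credit on TokenF1.
print(exact_match("The diagnosis is COPD.", "COPD"))  # 0.0
print(token_f1("The diagnosis is COPD.", "COPD"))     # 0.4
```

The example illustrates why the paper treats EM as a strict, structured-output metric and TokenF1 as a softer complement.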
B.1.2. Label-correctness metrics. For diagnosis-style questions, we evaluate label correctness by extracting the target label from the model output and comparing it to the ground truth, ignoring extra answer context. For open-ended and multiple-choice questions, we parse the generated text to recover the predicted diagnosis label (e.g., by normalizing case/punctuation and matching against the dataset label set), whereas for single-verify questions we parse the output into Yes or No. We then compute:

• Accuracy as the fraction of examples whose extracted label matches the ground truth. While interpretable, accuracy can be misleading under class imbalance.

• Macro F1-score (MacroF1) as the unweighted mean of per-class F1-scores, where each class contributes equally. MacroF1 is robust to imbalance and reflects the precision–recall trade-off, which is important in clinical settings where both missed cases (false negatives) and false alarms (false positives) carry cost.

We emphasize MacroF1 as the primary discriminative metric due to its robustness to skewed label distributions and its balanced accounting of false positives and false negatives.

B.2. Metrics for regression tasks. For numeric targets, we treat the model output as free-form text and parse a scalar prediction by extracting the first valid numeric value in the generated answer (if present). We then compare this extracted value $\hat{y}$ to the ground-truth target $y$ and compute error metrics over successfully parsed examples (lower is better). Concretely, we report:

• Mean Absolute Error (MAE): $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{y}_i - y_i|$, which measures the average absolute deviation and is less sensitive to occasional outliers.

• Root Mean Squared Error (RMSE): $\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (\hat{y}_i - y_i)^2}$, which penalizes larger errors more strongly and is therefore sensitive to rare but severe mis-estimates.

Because regression evaluation depends on successful numeric extraction from free-form generations, we additionally report parsing coverage, i.e., the fraction of test instances for which numeric extraction succeeds. Unless otherwise stated, MAE/RMSE are computed on the subset of examples with valid numeric parses.
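As a concrete illustration of this regression protocol, the sketch below extracts the first numeric value from each generated answer and computes MAE, RMSE, and parsing coverage over the parsed subset. The regular expression and the decision to let unparseable outputs only reduce coverage are assumptions; the paper does not prescribe a specific parsing implementation.

```python
import math
import re
from typing import Optional

NUMBER_RE = re.compile(r"[-+]?\d+(?:\.\d+)?")

def parse_first_number(answer: str) -> Optional[float]:
    """Return the first numeric value in a free-form answer, or None if absent."""
    match = NUMBER_RE.search(answer)
    return float(match.group()) if match else None

def regression_metrics(answers: list[str], targets: list[float]) -> dict:
    """MAE/RMSE over successfully parsed examples, plus parsing coverage."""
    errors = []
    parsed = 0
    for answer, target in zip(answers, targets):
        value = parse_first_number(answer)
        if value is None:
            continue  # unparseable outputs only lower coverage
        parsed += 1
        errors.append(value - target)
    coverage = parsed / len(targets) if targets else 0.0
    if not errors:
        return {"mae": float("nan"), "rmse": float("nan"), "coverage": coverage}
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse, "coverage": coverage}

# Hypothetical model outputs for respiratory-rate targets (values are illustrative).
answers = ["The estimated respiratory rate is 18 breaths per minute.", "Unable to estimate."]
targets = [20.0, 16.0]
print(regression_metrics(answers, targets))  # {'mae': 2.0, 'rmse': 2.0, 'coverage': 0.5}
```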
Appendix C. Routing and conditioning details

Audio-MoE routing vector. We denote the Audio-MoE routing vector by $r_a \in \mathbb{R}^{d_r}$ and instantiate the router-input policy as:

$$r_a \in \{ z^a_{\mathrm{audio}},\; z^a_{\mathrm{question}},\; z^a_{\mathrm{fused}} \}, \tag{2}$$

where $z^a_{\mathrm{audio}}$ is obtained from a shallow spectrogram proxy extractor $\mathcal{F}$ and pooling, $z^a_{\mathrm{question}}$ is a pooled question embedding projected to $d_r$, and $z^a_{\mathrm{fused}}$ is computed via lightweight cross-attention between question tokens and proxy audio tokens. The router MLP produces logits over $N_a$ experts and selects a single expert via straight-through Gumbel-Softmax in training and argmax at inference.

Audio expert encoding and alignment. Given the selected expert index $e_a$, only $E^a_{e_a}$ encodes the spectrogram to produce $h^a_{e_a}$, which is aligned to the LLM hidden size $d_\ell$ with an expert-specific aligner $A_{e_a}$:

$$\tilde{h}^a_{e_a} = A_{e_a}(h^a_{e_a}) \in \mathbb{R}^{T_{e_a} \times d_\ell}. \tag{3}$$

Selected audio prefix and LLM input. We inject the aligned audio embeddings as a selected audio prefix by concatenating them into the LLM input embeddings:

$$X = [H_p;\; \tilde{h}^a_{e_a};\; H_{\mathrm{Ans}};\; H_y], \tag{4}$$

where $H_p$ encodes the prompt/question, $H_{\mathrm{Ans}}$ is an "Answer:" tag, and $H_y$ are teacher-forced answer embeddings during training.

LoRA-MoA routing vector. Analogously, the MoA routing vector $r_\ell \in \mathbb{R}^{d_\ell}$ instantiates the same router-input policy at the language stage:

$$r_\ell \in \{ z^\ell_{\mathrm{audio}},\; z^\ell_{\mathrm{question}},\; z^\ell_{\mathrm{fused}} \}, \tag{5}$$

with $z^\ell_{\mathrm{audio}} = \mathrm{Pool}(\tilde{h}^a_{e_a})$, $z^\ell_{\mathrm{question}} = \mathrm{Pool}(H_p)$, and $z^\ell_{\mathrm{fused}}$ computed by lightweight fusion between prompt and the selected aligned audio embeddings. The adapter router outputs logits over $N_\ell$ LoRA adapters and selects one adapter via straight-through Gumbel-Softmax (train) and argmax (eval).
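As a rough illustration of the conditional routing described above (not the authors' released implementation), the PyTorch sketch below shows a single-choice router that applies straight-through Gumbel-Softmax during training and argmax at inference, together with the audio-prefix concatenation of Eq. (4). All module names, dimensions, and the temperature value are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleChoiceRouter(nn.Module):
    """Routes each example to one expert/adapter from a routing vector r."""

    def __init__(self, routing_dim: int, num_experts: int, tau: float = 1.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(routing_dim, routing_dim),
            nn.GELU(),
            nn.Linear(routing_dim, num_experts),
        )
        self.tau = tau  # Gumbel-Softmax temperature (assumed value)

    def forward(self, r: torch.Tensor) -> torch.Tensor:
        logits = self.mlp(r)  # (batch, num_experts)
        if self.training:
            # Straight-through Gumbel-Softmax: hard one-hot forward pass, soft gradients.
            return F.gumbel_softmax(logits, tau=self.tau, hard=True)
        # Inference: deterministic argmax, expressed as a one-hot selection.
        return F.one_hot(logits.argmax(dim=-1), logits.size(-1)).float()

def build_llm_input(h_prompt, h_audio_aligned, h_ans_tag, h_answer):
    """Eq. (4): concatenate prompt, selected audio prefix, 'Answer:' tag, and
    (teacher-forced) answer embeddings along the sequence dimension."""
    return torch.cat([h_prompt, h_audio_aligned, h_ans_tag, h_answer], dim=1)

# Toy usage with made-up shapes: route a batch of 2 routing vectors to 4 experts.
router = SingleChoiceRouter(routing_dim=256, num_experts=4)
router.eval()
selection = router(torch.randn(2, 256))  # one-hot over experts, shape (2, 4)
```

The one-hot selection can then be used to dispatch each example to exactly one audio encoder (Audio-MoE) or one LoRA adapter (LoRA-MoA), which is the property the ablations in the main text isolate.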
Appendix D. Baselines

We compare RAMoEA-QA against two representative baselines: (i) a general-purpose audio-language model used as-is (Pengi), and (ii) a domain-specific audio-QA architecture implemented as a monolithic (single-path) counterpart to our routed model (CaReAQA-style).

D.1. Pengi (general-purpose audio-language model)

Pengi is a general audio-language model originally designed for broad audio captioning and QA-style generation across diverse audio domains (see Figure 12). It formulates a wide range of audio tasks as audio+text → text generation by conditioning a (largely) frozen causal language model on a fixed-length prefix built from an audio embedding and a text-prompt embedding, each mapped through lightweight trainable networks (Deshmukh et al., 2024b). In our experiments, we use Pengi without additional fine-tuning on RA-QA: we provide the respiratory recording and question as input and decode answers using the model's default generation procedures. Because Pengi is not trained for clinically structured respiratory assessment (nor specialized to RA-QA labels, question formats, or respiratory acoustics), it serves as a reference point for the limitations of general multimodal audio-language models when transferred to respiratory clinical QA. This baseline therefore tests how far a strong, task-general audio-language model transfers under dataset, modality, and task shifts.

Figure 12: Overview of the Pengi architecture used as a baseline in our experiments.

D.2. CaReAQA-style monolithic audio-QA model

CaReAQA is an audio-language QA framework for cardiac and respiratory diagnostic reasoning that combines an audio encoder with a decoder-only LLM to produce open-ended answers (Wang et al., 2025). Its core design maps audio features into the LLM hidden space via an audio mapper and injects the resulting aligned embeddings as a soft prefix for autoregressive generation. Prior work in this line is typically trained and evaluated primarily in a monolithic setting and on a single question format (open-ended), and does not explicitly analyze conditional specialization mechanisms (e.g., expert routing or format-/intent-dependent pathways) across heterogeneous datasets and question types.

To isolate the effect of routing in a controlled comparison, we implement a monolithic baseline that mirrors the CaReAQA-style pipeline: (i) a single audio encoder processes every spectrogram, (ii) a single audio-to-LLM aligner produces aligned audio embeddings, and (iii) a single LLM pathway, i.e., one adapter (one set of trainable parameters) on a frozen backbone matching our training recipe, generates the answer from the prompt plus aligned audio prefix. This baseline keeps the multimodal interface (audio-prefix injection and autoregressive objective) comparable to RAMoEA-QA, while forcing all examples through one fixed acoustic representation and one fixed generation behavior.

Appendix E. Further results

This appendix provides the complete breakdown of results for both discriminative and regressive RA-QA tasks. For discriminative tasks, we report per-dataset performance across question types (Global, Single-verify, Open-ended, and Multiple-choice) using standard classification and string-level metrics (e.g., Accuracy, MacroF1, TokenF1, Exact Match), as well as semantic similarity scores (BERTScore and METEOR). For regressive tasks, we report per-dataset performance for each target variable (e.g., MMLung: FEV1/FVC; NoseMic: respiratory rate) using error-based metrics (MAE and RMSE). Tables in this section are organized to facilitate direct comparison across datasets and across question types, highlighting how performance varies depending on both the underlying data source and the QA formulation.

Table 8: Per-dataset breakdown for in-domain testing (mean ± std over runs) for the baseline configuration of operaGT on discriminative tasks.
Metric | Subset | Global | Coswara | CoughVid | ICBHI | KAUH | Respiratory@TR
Accuracy ↑ | Global | 0.68 ± 0.16 | 0.75 ± 0.35 | 0.62 ± 0.06 | 0.50 ± 0.00 | 0.54 ± 0.18 | 0.88 ± 0.04
Accuracy ↑ | Single-verify | 0.70 ± 0.21 | 0.75 ± 0.35 | 0.67 ± 0.00 | 0.50 ± 0.00 | 0.67 ± 0.24 | 0.75 ± 0.07
Accuracy ↑ | Open-ended | 0.60 ± 0.06 | - | 0.33 ± 0.00 | 0.50 ± 0.00 | 0.50 ± 0.24 | 1.00 ± 0.00
Accuracy ↑ | Multiple-choice | 0.68 ± 0.06 | - | 0.83 ± 0.24 | 0.50 ± 0.00 | 0.33 ± 0.00 | 1.00 ± 0.00
Macro F1 ↑ | Global | 0.59 ± 0.21 | 0.67 ± 0.47 | 0.55 ± 0.08 | 0.33 ± 0.00 | 0.43 ± 0.23 | 0.87 ± 0.05
Macro F1 ↑ | Single-verify | 0.62 ± 0.28 | 0.67 ± 0.47 | 0.62 ± 0.00 | 0.33 ± 0.00 | 0.58 ± 0.35 | 0.74 ± 0.09
Macro F1 ↑ | Open-ended | 0.50 ± 0.06 | - | 0.18 ± 0.02 | 0.33 ± 0.00 | 0.40 ± 0.22 | 1.00 ± 0.00
Macro F1 ↑ | Multiple-choice | 0.59 ± 0.08 | - | 0.78 ± 0.31 | 0.33 ± 0.00 | 0.17 ± 0.00 | 1.00 ± 0.00
Token F1 ↑ | Global | 0.86 ± 0.06 | 0.87 ± 0.19 | 0.90 ± 0.02 | 0.81 ± 0.00 | 0.85 ± 0.05 | 0.88 ± 0.02
Token F1 ↑ | Single-verify | 0.89 ± 0.10 | 0.87 ± 0.19 | 0.93 ± 0.01 | 0.82 ± 0.00 | 0.89 ± 0.08 | 0.97 ± 0.01
Token F1 ↑ | Open-ended | 0.81 ± 0.01 | - | 0.79 ± 0.01 | 0.79 ± 0.00 | 0.83 ± 0.04 | 0.83 ± 0.00
Token F1 ↑ | Multiple-choice | 0.82 ± 0.01 | - | 0.97 ± 0.05 | 0.79 ± 0.00 | 0.79 ± 0.00 | 0.76 ± 0.09
Exact Match ↑ | Global | 0.56 ± 0.14 | 0.75 ± 0.35 | 0.62 ± 0.06 | 0.50 ± 0.00 | 0.54 ± 0.18 | 0.47 ± 0.04
Exact Match ↑ | Single-verify | 0.70 ± 0.21 | 0.75 ± 0.35 | 0.67 ± 0.00 | 0.50 ± 0.00 | 0.67 ± 0.24 | 0.75 ± 0.07
Exact Match ↑ | Open-ended | 0.38 ± 0.06 | - | 0.33 ± 0.00 | 0.50 ± 0.00 | 0.50 ± 0.24 | 0.20 ± 0.00
Exact Match ↑ | Multiple-choice | 0.46 ± 0.06 | - | 0.83 ± 0.24 | 0.50 ± 0.00 | 0.33 ± 0.00 | 0.20 ± 0.00
BERTScore (F1) ↑ | Global | 0.89 ± 0.00 | 0.94 ± 0.04 | 0.86 ± 0.03 | 0.88 ± 0.04 | 0.90 ± 0.01 | 0.84 ± 0.01
BERTScore (F1) ↑ | Single-verify | 0.93 ± 0.02 | 0.94 ± 0.04 | 0.97 ± 0.02 | 0.93 ± 0.07 | 1.00 ± 0.00 | 0.82 ± 0.01
BERTScore (F1) ↑ | Open-ended | 0.78 ± 0.00 | - | 0.61 ± 0.01 | 0.83 ± 0.00 | 0.79 ± 0.01 | 0.87 ± 0.01
BERTScore (F1) ↑ | Multiple-choice | 0.85 ± 0.02 | - | 0.89 ± 0.08 | 0.81 ± 0.03 | 0.81 ± 0.03 | 0.87 ± 0.01
METEOR ↑ | Global | 0.88 ± 0.01 | 0.94 ± 0.04 | 0.86 ± 0.05 | 0.85 ± 0.06 | 0.89 ± 0.01 | 0.82 ± 0.01
METEOR ↑ | Single-verify | 0.92 ± 0.02 | 0.94 ± 0.04 | 0.93 ± 0.05 | 0.93 ± 0.08 | 1.00 ± 0.00 | 0.81 ± 0.01
METEOR ↑ | Open-ended | 0.76 ± 0.01 | - | 0.67 ± 0.03 | 0.77 ± 0.03 | 0.78 ± 0.01 | 0.83 ± 0.01
METEOR ↑ | Multiple-choice | 0.82 ± 0.02 | - | 0.91 ± 0.08 | 0.76 ± 0.05 | 0.79 ± 0.03 | 0.83 ± 0.01

Table 9: Per-dataset breakdown for in-domain testing (mean ± std over runs) for the baseline configuration of operaCT on discriminative tasks.
Metric | Subset | Global | Coswara | CoughVid | ICBHI | KAUH | Respiratory@TR
Accuracy ↑ | Global | 0.61 ± 0.02 | 0.59 ± 0.12 | 0.65 ± 0.14 | 0.56 ± 0.09 | 0.47 ± 0.04 | 0.79 ± 0.02
Accuracy ↑ | Single-verify | 0.60 ± 0.03 | 0.59 ± 0.12 | 0.72 ± 0.16 | 0.62 ± 0.18 | 0.53 ± 0.20 | 0.58 ± 0.04
Accuracy ↑ | Open-ended | 0.60 ± 0.06 | - | 0.33 ± 0.00 | 0.50 ± 0.00 | 0.50 ± 0.24 | 1.00 ± 0.00
Accuracy ↑ | Multiple-choice | 0.68 ± 0.06 | - | 0.83 ± 0.24 | 0.50 ± 0.00 | 0.33 ± 0.00 | 1.00 ± 0.00
Macro F1 ↑ | Global | 0.54 ± 0.06 | 0.50 ± 0.24 | 0.58 ± 0.18 | 0.43 ± 0.14 | 0.40 ± 0.03 | 0.77 ± 0.01
Macro F1 ↑ | Single-verify | 0.54 ± 0.10 | 0.50 ± 0.24 | 0.69 ± 0.20 | 0.53 ± 0.28 | 0.51 ± 0.17 | 0.54 ± 0.02
Macro F1 ↑ | Open-ended | 0.49 ± 0.05 | - | 0.17 ± 0.00 | 0.33 ± 0.00 | 0.40 ± 0.22 | 1.00 ± 0.00
Macro F1 ↑ | Multiple-choice | 0.59 ± 0.08 | - | 0.78 ± 0.31 | 0.33 ± 0.00 | 0.17 ± 0.00 | 1.00 ± 0.00
Token F1 ↑ | Global | 0.84 ± 0.02 | 0.78 ± 0.07 | 0.91 ± 0.04 | 0.83 ± 0.03 | 0.82 ± 0.02 | 0.89 ± 0.00
Token F1 ↑ | Single-verify | 0.84 ± 0.03 | 0.78 ± 0.07 | 0.93 ± 0.05 | 0.87 ± 0.06 | 0.83 ± 0.06 | 0.96 ± 0.00
Token F1 ↑ | Open-ended | 0.81 ± 0.01 | - | 0.79 ± 0.00 | 0.79 ± 0.00 | 0.83 ± 0.04 | 0.83 ± 0.00
Token F1 ↑ | Multiple-choice | 0.84 ± 0.01 | - | 0.97 ± 0.05 | 0.79 ± 0.00 | 0.79 ± 0.00 | 0.83 ± 0.00
Exact Match ↑ | Global | 0.50 ± 0.02 | 0.59 ± 0.12 | 0.65 ± 0.14 | 0.56 ± 0.09 | 0.47 ± 0.04 | 0.39 ± 0.02
Exact Match ↑ | Single-verify | 0.60 ± 0.03 | 0.59 ± 0.12 | 0.72 ± 0.16 | 0.62 ± 0.18 | 0.53 ± 0.20 | 0.58 ± 0.04
Exact Match ↑ | Open-ended | 0.38 ± 0.06 | - | 0.33 ± 0.00 | 0.50 ± 0.00 | 0.50 ± 0.24 | 0.20 ± 0.00
Exact Match ↑ | Multiple-choice | 0.46 ± 0.06 | - | 0.83 ± 0.24 | 0.50 ± 0.00 | 0.33 ± 0.00 | 0.20 ± 0.00
BERTScore (F1) ↑ | Global | 0.87 ± 0.01 | 0.82 ± 0.07 | 0.90 ± 0.03 | 0.87 ± 0.03 | 0.86 ± 0.02 | 0.89 ± 0.02
BERTScore (F1) ↑ | Single-verify | 0.90 ± 0.02 | 0.82 ± 0.07 | 0.96 ± 0.03 | 0.90 ± 0.06 | 0.89 ± 0.04 | 0.93 ± 0.01
BERTScore (F1) ↑ | Open-ended | 0.82 ± 0.01 | - | 0.72 ± 0.00 | 0.83 ± 0.00 | 0.88 ± 0.01 | 0.85 ± 0.03
BERTScore (F1) ↑ | Multiple-choice | 0.86 ± 0.02 | - | 0.96 ± 0.06 | 0.83 ± 0.00 | 0.79 ± 0.00 | 0.85 ± 0.03
METEOR ↑ | Global | 0.85 ± 0.01 | 0.81 ± 0.06 | 0.89 ± 0.05 | 0.84 ± 0.03 | 0.83 ± 0.02 | 0.88 ± 0.01
METEOR ↑ | Single-verify | 0.88 ± 0.02 | 0.81 ± 0.06 | 0.91 ± 0.07 | 0.89 ± 0.05 | 0.86 ± 0.06 | 0.94 ± 0.00
METEOR ↑ | Open-ended | 0.80 ± 0.01 | - | 0.77 ± 0.00 | 0.79 ± 0.02 | 0.83 ± 0.05 | 0.81 ± 0.01
METEOR ↑ | Multiple-choice | 0.83 ± 0.02 | - | 0.96 ± 0.06 | 0.77 ± 0.00 | 0.77 ± 0.00 | 0.81 ± 0.01

Table 10: Per-dataset breakdown for in-domain testing (mean ± std over runs) for RAMoEA-QA on discriminative tasks.
Metric | Subset | Global | Coswara | CoughVid | ICBHI | KAUH | Respiratory@TR
Accuracy ↑ | Global | 0.72 ± 0.02 | 0.86 ± 0.09 | 0.54 ± 0.14 | 0.58 ± 0.15 | 0.68 ± 0.02 | 0.79 ± 0.02
Accuracy ↑ | Single-verify | 0.81 ± 0.00 | 0.86 ± 0.09 | 0.78 ± 0.16 | 0.72 ± 0.22 | 1.00 ± 0.00 | 0.57 ± 0.04
Accuracy ↑ | Open-ended | 0.46 ± 0.02 | - | 0.00 ± 0.00 | 0.44 ± 0.09 | 0.33 ± 0.00 | 1.00 ± 0.00
Accuracy ↑ | Multiple-choice | 0.62 ± 0.10 | - | 0.61 ± 0.24 | 0.44 ± 0.09 | 0.39 ± 0.08 | 1.00 ± 0.00
Macro F1 ↑ | Global | 0.67 ± 0.03 | 0.86 ± 0.09 | 0.51 ± 0.18 | 0.47 ± 0.18 | 0.60 ± 0.02 | 0.75 ± 0.01
Macro F1 ↑ | Single-verify | 0.79 ± 0.01 | 0.86 ± 0.09 | 0.76 ± 0.19 | 0.67 ± 0.29 | 1.00 ± 0.00 | 0.51 ± 0.03
Macro F1 ↑ | Open-ended | 0.38 ± 0.02 | - | 0.00 ± 0.00 | 0.26 ± 0.11 | 0.17 ± 0.00 | 1.00 ± 0.00
Macro F1 ↑ | Multiple-choice | 0.53 ± 0.12 | - | 0.51 ± 0.34 | 0.30 ± 0.04 | 0.23 ± 0.09 | 1.00 ± 0.00
Token F1 ↑ | Global | 0.88 ± 0.00 | 0.93 ± 0.05 | 0.88 ± 0.04 | 0.84 ± 0.06 | 0.90 ± 0.00 | 0.83 ± 0.01
Token F1 ↑ | Single-verify | 0.92 ± 0.01 | 0.93 ± 0.05 | 0.95 ± 0.04 | 0.90 ± 0.08 | 1.00 ± 0.00 | 0.83 ± 0.01
Token F1 ↑ | Open-ended | 0.78 ± 0.00 | - | 0.71 ± 0.02 | 0.77 ± 0.02 | 0.79 ± 0.00 | 0.84 ± 0.01
Token F1 ↑ | Multiple-choice | 0.83 ± 0.02 | - | 0.93 ± 0.06 | 0.76 ± 0.04 | 0.80 ± 0.02 | 0.84 ± 0.01
Exact Match ↑ | Global | 0.58 ± 0.02 | 0.86 ± 0.09 | 0.54 ± 0.14 | 0.58 ± 0.15 | 0.68 ± 0.02 | 0.34 ± 0.02
Exact Match ↑ | Single-verify | 0.79 ± 0.00 | 0.86 ± 0.09 | 0.78 ± 0.16 | 0.72 ± 0.22 | 1.00 ± 0.00 | 0.47 ± 0.04
Exact Match ↑ | Open-ended | 0.24 ± 0.02 | - | 0.00 ± 0.00 | 0.44 ± 0.09 | 0.33 ± 0.00 | 0.20 ± 0.00
Exact Match ↑ | Multiple-choice | 0.40 ± 0.10 | - | 0.61 ± 0.24 | 0.44 ± 0.09 | 0.39 ± 0.08 | 0.20 ± 0.00
BERTScore (F1) ↑ | Global | 0.89 ± 0.00 | 0.94 ± 0.04 | 0.86 ± 0.03 | 0.88 ± 0.04 | 0.90 ± 0.01 | 0.84 ± 0.01
BERTScore (F1) ↑ | Single-verify | 0.93 ± 0.02 | 0.94 ± 0.04 | 0.97 ± 0.02 | 0.93 ± 0.07 | 1.00 ± 0.00 | 0.82 ± 0.01
BERTScore (F1) ↑ | Open-ended | 0.78 ± 0.00 | - | 0.61 ± 0.01 | 0.83 ± 0.00 | 0.79 ± 0.01 | 0.87 ± 0.01
BERTScore (F1) ↑ | Multiple-choice | 0.85 ± 0.02 | - | 0.89 ± 0.08 | 0.81 ± 0.03 | 0.81 ± 0.03 | 0.87 ± 0.01
METEOR ↑ | Global | 0.88 ± 0.01 | 0.94 ± 0.04 | 0.86 ± 0.05 | 0.85 ± 0.06 | 0.89 ± 0.01 | 0.82 ± 0.01
METEOR ↑ | Single-verify | 0.92 ± 0.02 | 0.94 ± 0.04 | 0.93 ± 0.05 | 0.93 ± 0.08 | 1.00 ± 0.00 | 0.81 ± 0.01
METEOR ↑ | Open-ended | 0.76 ± 0.01 | - | 0.67 ± 0.03 | 0.77 ± 0.03 | 0.78 ± 0.01 | 0.83 ± 0.01
METEOR ↑ | Multiple-choice | 0.82 ± 0.02 | - | 0.91 ± 0.08 | 0.76 ± 0.05 | 0.79 ± 0.03 | 0.83 ± 0.01

Table 11: Per-dataset breakdown for in-domain testing (mean ± std over runs) for the baseline using operaCT on regressive tasks.

Question-type | Metric | MM Lung FEV1 | MM Lung FVC | NoseMic Respiratory rate
Open-ended | MSE ↓ | 1.11 ± 0.30 | 3.42 ± 0.28 | 27.79 ± 2.34
Open-ended | RMSE ↓ | 1.05 ± 0.14 | 1.85 ± 0.08 | 5.27 ± 0.22

Table 12: Per-dataset breakdown for in-domain testing (mean ± std over runs) for the baseline using operaGT on regressive tasks.

Question-type | Metric | MM Lung FEV1 | MM Lung FVC | NoseMic Respiratory rate
Open-ended | MSE ↓ | 1.13 ± 0.28 | 2.73 ± 0.69 | 33.23 ± 2.01
Open-ended | RMSE ↓ | 1.06 ± 0.13 | 1.65 ± 0.21 | 5.76 ± 0.17

Table 13: Per-dataset breakdown for in-domain testing (mean ± std over runs) for RAMoEA-QA on regressive tasks.

Question-type | Metric | MM Lung FEV1 | MM Lung FVC | NoseMic Respiratory rate
Open-ended | MSE ↓ | 0.96 ± 0.13 | 2.42 ± 1.00 | 26.11 ± 8.09
Open-ended | RMSE ↓ | 0.98 ± 0.07 | 1.54 ± 0.33 | 5.08 ± 0.80