Paper deep dive
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
Jun Rekimoto, Yu Nishimura, Bojian Yang
Abstract
Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
Tags
Links
- Source: https://arxiv.org/abs/2603.10324v1
- Canonical: https://arxiv.org/abs/2603.10324v1
PDF not stored locally. Use the link above to view on the source site.
Intelligence
Status: succeeded | Model: google/gemini-3.1-flash-lite-preview | Prompt: intel-v1 | Confidence: 96%
Last extracted: 3/13/2026, 1:09:19 AM
Summary
NasoVoce is a nose-bridge-mounted speech interface for smart glasses that integrates a MEMS microphone and a vibration sensor. By fusing air-conducted acoustic signals with bone/skin-conducted vibration signals, the system enables robust, low-audibility, and noise-tolerant speech interaction, specifically optimized for whispered speech which is typically difficult for standard microphones to capture in noisy environments.
Entities (5)
Relation Signals (4)
NasoVoce → integrates → MEMS microphone
confidence 100% · NasoVoce... integrates a microphone and a vibration sensor.
NasoVoce → integrates → MEMS vibration sensor
confidence 100% · NasoVoce... integrates a microphone and a vibration sensor.
D-DCCRN → processes input from → MEMS microphone
confidence 95% · D-DCCRN... accepts composite inputs from a conventional microphone (Mic) and a vibration sensor (Vib).
D-DCCRN → processes input from → MEMS vibration sensor
confidence 95% · D-DCCRN... accepts composite inputs from a conventional microphone (Mic) and a vibration sensor (Vib).
Cypher Suggestions (2)
Find all sensors integrated into the NasoVoce device. · confidence 90% · unvalidated
MATCH (d:Device {name: 'NasoVoce'})-[:INTEGRATES]->(s:Sensor) RETURN s.name
Identify the neural network model used for processing sensor data. · confidence 90% · unvalidated
MATCH (m:Model)-[:PROCESSES_INPUT_FROM]->(s:Sensor) WHERE s.name CONTAINS 'MEMS' RETURN m.name, s.name
Full Text
54,307 characters extracted from source content.
NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction
Jun Rekimoto∗ (Sony CSL - Kyoto, Kyoto-shi, Kyoto, Japan; The University of Tokyo, Bunkyo-ku, Tokyo, Japan; rekimoto@acm.org), Yu Nishimura (Sony CSL, Shinagawa-ku, Tokyo, Japan; nishimura@csl.sony.co.jp), Bojian Yang (Sony CSL, Shinagawa-ku, Tokyo, Japan; bojian.yang@sony.com)
Figure 1: NasoVoce integrates a microphone and a vibration sensor at the nasal bridge for capturing normal and whispered speech. This location enables effective capture of both air-conducted and bone/skin-conducted signals even in very noisy environments, while remaining unobtrusive for daily wear.
Abstract
Silent and whispered speech offer promise for always-available voice interaction with AI, yet existing methods struggle to balance vocabulary size, wearability, silence, and noise robustness. We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, it unobtrusively captures both acoustic and vibration signals. The nasal bridge, close to the mouth, allows access to bone- and skin-conducted speech and enables reliable capture of low-volume utterances such as whispered speech. While the microphone captures high-quality audio, it is highly sensitive to environmental noise. Conversely, the vibration sensor is robust to noise but yields lower signal quality. By fusing these complementary inputs, NasoVoce generates high-quality speech robust against interference. Evaluation with Whisper Large-v2, PESQ, STOI, and MUSHRA ratings confirms improved recognition and quality. NasoVoce demonstrates the feasibility of a practical interface for always-available, continuous, and discreet AI voice conversations.
∗ Corresponding author. This work is licensed under a Creative Commons Attribution 4.0 International License.
CHI ’26, April 13–17, 2026, Barcelona, Spain. © 2026 Copyright held by the owner/author(s). ACM ISBN 979-8-4007-2278-3/2026/04. https://doi.org/10.1145/3772318.3791397
CCS Concepts
• Human-centered computing → Sound-based input / output; Interface design prototyping; Mobile devices; • Computing methodologies → Neural networks.
Keywords
speech interaction, wearable computing, nose-mounted device, whispered voice, whispered voice conversion, silent speech, neural networks
ACM Reference Format:
Jun Rekimoto, Yu Nishimura, and Bojian Yang. 2026. NasoVoce: A Nose-Mounted Low-Audibility Speech Interface for Always-Available Speech Interaction. In Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems (CHI ’26), April 13–17, 2026, Barcelona, Spain. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3772318.3791397
1 Introduction
In daily life, the usage of generative AI is increasing. Beyond supporting document writing and programming, generative AI is also effective as a tool for assisting thought processes, such as idea exploration and information retrieval. This function is particularly valuable in voice mode. For instance, users can record spontaneous ideas verbally, or query the AI for opinions and related information, enabling its use as an always-available voice agent.
Recent advances in glasses-type devices, which integrate microphones, speakers, cameras, and in-lens displays, provide an ideal platform for such continuous interaction. The combination of voice and image input allows users to ask questions about surrounding objects or scenery, or inquire about the operation of nearby equipment, enabling context-aware conversations with AI in real-world settings.
However, voice-based interfaces face several challenges: social acceptability (speaking aloud in public), privacy (difficulty vocalizing confidential information in public), and noise robustness (avoiding interference from environmental noise or nearby speech). Wearability is also a significant issue.
Previous research on “silent speech” interfaces, which attempt to recognize speech utterances using sensors, has primarily been limited to small command sets and remains insufficient for open-ended AI conversation. Lip-reading techniques have an extended vocabulary size, but camera-based systems covering the face impose high wearability costs, limiting their suitability for daily use.
As a method related to silent speech, “whispered speech” has also been proposed. Input using sufficiently soft whispers shares characteristics with silent speech in that it minimizes disturbance to the surrounding environment. Moreover, by employing speech recognition models optimized for whispered speech, the system can support an effectively unrestricted vocabulary, offering a major advantage [6,11,12,26,35]. However, because whispered speech is inherently low in volume, it is highly susceptible to interference from ambient noise.
We identify the following requirements for a practical always-available voice interface:
- Wearability: Devices must impose minimal burden during daily and continuous use. In-ear canal devices, such as earbuds, cause discomfort over long durations, and noise-canceling earphones may isolate users from environmental sounds, leading to safety concerns. Approaches requiring surface-mounted sensors (e.g., intraoral or EEG electrodes) or a lip-reading camera in front of the user’s face are also unsuitable for continuous everyday use.
- Silence: User speech should not be externally audible, avoiding both disturbance to others in public and leakage of private information.
- Noise Robustness: Recognition must remain accurate under environmental noise and avoid misrecognition of nearby speech.
- Vocabulary: The system should support a vocabulary comparable to standard speech recognition. Silent speech systems restricted to dozens of commands are inadequate for general AI conversation, and multilingual support is desirable.
- Speech Rate: Input should be possible at normal conversational speed. Letter-by-letter spelling or telegraphic input cannot support natural dialogue.
To the best of our knowledge, no existing system simultaneously satisfies all of these requirements, as summarized in Table 1. In this work, we propose NasoVoce, a silent speech interface that integrates a microphone and a vibration sensor into the nasal bridge. This anatomical location, close to the mouth and nasal cavity, is well suited to capture air-conducted sounds of both normal and whispered speech via a microphone. Unlike throat microphones or bone-conduction sensors placed on the skull, which primarily rely on vocal cord vibrations (voiced signals), the nasal bridge possesses a unique acoustic property: it effectively captures the aerodynamic turbulence and cavity resonance generated even during unvoiced whispering. Simultaneously, speech vibrations can be acquired using a vibration sensor, enabling robust speech recognition against environmental noise. This advantage allows NasoVoce to detect whispered speech signals that are physically undetectable by conventional throat or skull-contact sensors, establishing a distinct advantage for silent voice interaction. This configuration is also particularly well-suited for integration into smart glasses, which are expected to become increasingly widespread. Embedding the system in the nasal component of smart glasses allows them to function as silent AI communication devices.
The contributions of this research are:
- A whisper input mechanism that integrates a microphone and a vibration sensor into the nose pads. This design reduces ambient noise for both normal and whispered speech while maintaining the smart glasses’ appearance and wearability.
- A deep learning model, a newly constructed dataset, and verification of the actual effectiveness of the proposed method.
Unlike prior work on bone conduction and microphone input, this study focuses on whispered speech. We emphasize this modality to enable AI conversation in any situation, similar to silent speech.
2 Related Work
This study proposes a system capable of capturing normal and whispered speech that can be integrated into the nose pads of eyewear. In this context, we review related work from four perspectives—silent speech, whispered speech recording, speech recording via bone and skin conduction, and the fusion processing of air- and bone-conducted speech—and compare them with our proposed approach.
Silent Speech
Silent Speech Interfaces (SSIs) enable communication when an acoustic signal is unavailable or undesirable by decoding articulatory, neuromuscular, or other biosignals instead of air-borne speech [5]. Various sensing families have been investigated, including surface electromyography (sEMG) [18], articulatory imaging (ultrasound tongue imaging and lip video), contactless RF, and neural pathways, with persistent challenges such as speaker/session variability, latency, and deployment ergonomics. sEMG (neuromuscular) systems, such as AlterEgo [18], capture articulator muscle activity from the face/neck to recognize or resynthesize speech. Classical systems established feasibility and highlighted mode mismatch between audible vs. silent articulation; recent deep models improve “voicing” silent EMG and sequence-to-sequence resynthesis, yet cross-session robustness remains a core problem.
Ultrasound tongue imaging (UTI), often paired with lip video, supports articulatory-to-acoustic mapping and silent recognition.
Table 1: Comparison of silent and whispered speech / wearable speech technologies
| name | method | silence | noise tolerance | command/speech |
| AlterEgo [18] | neuromuscular | silent | good | command |
| SottoVoce [20] | ultrasound | silent | good | limited vocab. |
| SilentVoice [7] | ingressive speech | low | unknown | speech |
| WESPER [26] | mic | normal/whisper | unknown | speech |
| LipLearner [29] | lip-reading | silent | good | command |
| V-Speech [23] | BC on nosepad | normal voice | good | speech |
| VibVoice [9] | mic/IMU on HMD | normal voice | good | speech |
| Unvoiced [28] | jawbone motion + LLM | silent | good | phrase |
| AccCall [39] | mic + smartphone IMU | normal | good | speech |
| EchoSpeech [44] | ultrasound on eyeglasses frame | normal | good | command |
| AirPods [2] | beamforming/signal processing | normal | good | speech |
| NasoVoce (ours) | mic/vibration on nose-pad | normal/whisper | good | speech |
SottoVoce demonstrated the feasibility of connecting existing voice-aware devices [26]. STN modules refine continuous vocoding, adaptation, and multi-speaker recognition under silent vs. modal mismatches. EchoSpeech employs multiple pairs of ultrasonic transducers and receivers on a glasses-type device to estimate speech based on cheek skin deformations during silent articulation [44]. While it shares our goal of being integratable into a glasses form factor, it is restricted to recognizing 31 speech commands and does not function as a general speech interface. Unvoiced senses jaw motion during silent speech using an Inertial Measurement Unit (IMU) [28]. This system relies on the jaw (a secondary articulator), which suffers from inherent physical ambiguity where multiple sounds map to the same movement (one-to-many mapping).
Consequently, it has the limitation of being unable to physically distinguish words with similar jaw motion profiles without strictly relying on LLM-based contextual inference to resolve the ambiguity. Deep end-to-end lip-reading (for example, LipNet [3] and LipLearner [29]) decodes speech from images of the lips or mouth. Lip-reading by a depth image sensor uses depth information of the speech act [41]. The main challenges of lip-reading are the need to position a camera where it can capture the lips and the high sensitivity of the system to ambient lighting, making mobile use difficult. At the same time, as lip-reading technology advances, a new concern may emerge: even when users input speech through other silent speech methods, the capture of lip images could undermine privacy and confidentiality.
Whispered Speech
Although not completely silent, unvoiced whispering—produced without vocal cord vibration—possesses characteristics similar to silent speech, particularly in its minimal impact on the surrounding environment. Unlike silent speech approaches, whispered speech recognition enables performance comparable to general ASR with an unconstrained vocabulary. WESPER [26], SilentWhisper [12], and DistillW2N [35] use self-supervised learning to obtain whisper recognition capability based on HuBERT [13]. Recent speech recognition systems trained on large-scale vocabularies, such as OpenAI Whisper [25], are also capable of recognizing whispered speech. Farhadipour et al. also reported that fine-tuning OpenAI Whisper on whispered utterances can increase recognition accuracy [6]. On the other hand, a significant challenge for whispered speech recognition is its high susceptibility to external noise. WhisperMask addresses this with an input device designed to record the wearer’s whispers while reducing external noise [11].
Yet, because it necessitates wearing a mask, it is intended not for everyday use, but rather for environments where mask-wearing is already a prerequisite, such as medical facilities or construction sites. Commercially available headsets, such as the Apple AirPods Pro, offer a “Voice Isolation” feature that utilizes multi-microphone beamforming and signal processing to capture the wearer’s speech [2]. However, our evaluation revealed that while this function works effectively for normal speech, it fails to capture whispered speech entirely (as demonstrated in the supplemental video). This suggests that distinguishing between whispered speech and external noise remains a significant challenge for standard signal processing algorithms. Overall, many of these SSI studies do not address fully free-form speech but are instead designed for systems with limited vocabularies or for silent command input. In contrast, low-voice and whispered speech have reduced sound pressure, yet with speech transformation systems they can maintain vocabulary size and recognition accuracy comparable to ordinary speech, allowing usage similar to SSI. However, since low-voice speech is more vulnerable to environmental noise, robust noise countermeasures are essential.
Voice Sensing with Bone/Skin Conduction
Early NAM (non-audible murmur) work showed that skin-coupled stethoscopic/silicone microphones behind the ear can capture whispered or “non-audible murmur” speech robustly to ambient noise, enabling ASR and even whisper-to-speech enhancement. Subsequent studies improved sensors and explored model adaptation and enhancement from body-conducted resonances. Moon investigated simultaneous bilateral recordings with mini-accelerometers on the nasal bones, showing that nasal airflow patency (resistance) affects the detectability of nasal-bone vibration [24]. Measurement with an accelerometer over the nose is suitable for capturing nasal-origin energy but is modulated by nasal patency.
Yiu et al. also examined the correlation between nasal-bridge bone vibration and perceptual auditory ratings [42]. They found a moderate correlation between the magnitude of nasal-bridge vibration and the perceived resonance. Chen et al. reported that resonant voice training increases facial, particularly nasal-bridge, bone vibration and suggested that these vibrations are likely to contribute to resonant voice production [4]. These findings support the usefulness of the nasal bridge as a measurement point. However, they did not examine the involvement of nasal-bridge vibration in whispered speech. Kitamura used scanning laser Doppler vibrometry to map facial vibration velocity during speech [21]. He reports strong vibrations at the nose and surrounding area for nasal sounds, and even for vowels, the lateral sides of the nose show prominent vibration. The area around the nose is visualized as a hot spot for speech-induced vibration. V-Speech places a vibration sensor on the nose pads of smart glasses to capture speech [23]. Compared with an air microphone, the SNR improves substantially, making it practical for ASR and calls. However, “nasal distortion” can occur because nasal consonants become overamplified, so compensation must be provided. In summary, while nasal sensing appears promising for speech acquisition, no prior work has attempted recognition by fusing bone/skin-conducted signals with conventional airborne speech, and whispered speech recognition using nasal sensing has not been explored.
Fused Air Conduction and Bone/Skin Conduction Utterance Recognition
Zhou et al. use bone conduction (BC) for robust voice activity detection (VAD), adaptive filtering, and wind noise handling [45]. Their work demonstrates gains over an acoustic conduction (AC)-only array and describes a real-time embedded prototype. Yu et al.
use a fully convolutional time-domain architecture with early vs. late fusion of BC and AC [43], reporting that late fusion performs better and surpasses single-modality enhancement on a Mandarin corpus. Wang et al. propose an attention-based fusion of BC and noisy AC spectrograms [38], also introducing a semi-supervised training scheme (CycleGAN-style) to leverage unpaired AC/BC data. On the EMSB corpus it outperforms a time-domain baseline and single-sensor systems, with especially strong gains at low SNRs. Wang et al. introduce an end-to-end time-domain model (MMINet) that jointly uses noisy AC + BC to enhance speech under low-SNR conditions, yielding consistent improvements over AC-only and BC-only models [40]. VibVoice targets wearables by fusing microphone audio and on-device IMU sensors [9]. It shows that BC vibrations are insensitive to external speakers and mostly below 800 Hz, then proposes a Bone Conduction Function for data augmentation and trains a multi-modal DNN that improves enhancement under competing-speaker and motion conditions. However, whispered voice input is not achieved by this work. AccCall also utilizes an audio microphone and the IMU built into a smartphone to reduce speech noise [39]. However, this study has also not realized the recording of whispered speech.
Figure 2: NasoVoce sensor configuration: a MEMS microphone (Syntiant SPH0141LM4H-1) and a MEMS vibration sensor (Syntiant V2S200D) can be acquired as time-synchronized digital data as the left and right channels of a TDM audio interface. An example usage is shown with the sensor mounted on the nose pad of a smart glasses frame.
Huang et al. highlight that conventional BC + AC fusion can degrade sharply with even tiny mismatches [15]. They propose an online adaptive fusion method suitable for real-time communication scenarios, useful for deployment where sensor latencies/placements drift. Kuang et al.
build a DenGCAN backbone with attention-based feature fusion and attention-gated skip connections [22]. They report improvement over noisy AC, showing that compact BC+AC fusion models can still be highly effective.
3 NasoVoce
The NasoVoce sensor configuration is illustrated in Fig. 2. It is composed of a MEMS microphone (Syntiant SPH0141LM4H-1 [31]) and a MEMS vibration sensor (Syntiant V2S200D [32]). Both devices provide a pulse density modulation (PDM) interface with a 16 kHz sampling rate and 16-bit digitization, allowing them to be driven by the same clock. This configuration enables the acquisition of accurately time-synchronized acoustic signals from both sensors in the digital domain, eliminating the time-mismatch problem pointed out by Huang et al. [15]. By mapping the vibration sensor signal to the left channel and the microphone signal to the right channel, the sensor output can be treated as a conventional stereo audio stream, thus maintaining compatibility with a wide range of existing audio interfaces. The V2S200D vibration sensor is capable of measuring at higher frequencies (up to 10 kHz) than typical IMU (Inertial Measurement Unit) sensors, making it well suited for capturing skin- or bone-conducted speech signals, including whispered utterances. Fig. 2 (right) shows an example in which the sensor is mounted on the nose pad of a commercially available smart glasses frame. The design incorporates a silicone pad that mechanically isolates vibrations from the frame while ensuring a discreet appearance suitable for continuous wear. This configuration provides two key advantages: (1) speech can be captured via bone/skin conduction from the nasal bones, and (2) proximity to the mouth makes it effective for capturing air-conducted speech. Although whispered speech tends to propagate poorly through bone conduction, we will show it can still be acquired through the air-conduction channel of the microphone.
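The stereo channel mapping described above (vibration sensor on the left channel, microphone on the right) can be sketched in a few lines of Python. This is an illustrative deinterleaving example, not the authors' code; the function names and the `interleave` helper are ours.

```python
import struct

def split_vib_mic(pcm_bytes: bytes):
    """Deinterleave a 16-bit little-endian stereo PCM buffer into the
    vibration-sensor (left) and microphone (right) channels, following
    the channel mapping described in the text."""
    n = len(pcm_bytes) // 2
    samples = struct.unpack("<%dh" % n, pcm_bytes[: n * 2])
    vib = list(samples[0::2])   # left channel: vibration sensor
    mic = list(samples[1::2])   # right channel: microphone
    return vib, mic

def interleave(vib, mic):
    """Hypothetical inverse helper, used here only for illustration."""
    out = []
    for v, m in zip(vib, mic):
        out.extend((v, m))
    return struct.pack("<%dh" % len(out), *out)
```

Because the two sensors share one clock, a single deinterleave like this yields sample-accurate alignment with no resampling or cross-correlation step.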
For private speech input, users can cover their mouth with a hand, which prevents sound leakage into the environment (Fig. 3). This gesture not only protects the secrecy of speech, but also functions as a socially interpretable cue, signaling to bystanders that the user is engaging in voice-based interaction with a device.
Figure 3: By covering the mouth and nose with one’s hand, the spoken content cannot be inferred through lip reading. This posture further serves as a social signal, indicating that the person is engaged in a conversation with the device.
4 Recognition Model
We constructed and evaluated a neural network recognition model that combines the vibration sensor (Vib) and microphone (Mic) inputs: D-DCCRN, an extension of DCCRN (Fig. 4). DCCRN (Deep Complex Convolution Recurrent Network) [14] is a deep learning model for audio enhancement that combines a complex-valued convolutional encoder-decoder with a complex-valued LSTM (Long Short-Term Memory), allowing effective handling of both magnitude and phase. We adopted the DCCRN architecture because whispered speech lacks harmonic structure and closely resembles noise, making it difficult for conventional magnitude-based networks to separate from environmental noise. DCCRN utilizes complex-valued networks to process both magnitude and phase information. Since the “phase” structure of the near-field whisper turbulence (captured by the vibration sensor) differs significantly from far-field environmental noise, we expect that preserving phase information is crucial for effective enhancement. The extended model is named D-DCCRN (Dual-DCCRN); it accepts composite inputs from a conventional microphone (Mic) and a vibration sensor (Vib).
While the original DCCRN applies a short-time Fourier transform (STFT) and processes the real and imaginary parts using complex CNN layers, D-DCCRN generalizes this design to jointly process the real and imaginary components of both Mic and Vib signals. As in DCCRN, the output of D-DCCRN is enhanced speech, enabling direct listening, communication use, or input to any ASR system. The main advantage is improved audio quality by leveraging both Vib and Mic features. D-DCCRN has 438.25M parameters, which is 28% of the parameter size of OpenAI Whisper large-v2 (1543.3M). The average processing time for D-DCCRN is 136.9 ms, while the average processing time for OpenAI Whisper is 429.96 ms; that is, D-DCCRN requires 31.8% of Whisper’s processing time. We consider this computational cost acceptable in light of the benefit of noise-robust whispered speech recognition. However, fully streaming speech processing and integration into smartphones have not yet been realized and remain as future work.
4.1 Training Method
We collected a dataset of paired clean speech and vibration sensor signals. 45 participants (English fluent, gender balanced, 25–55 years old) each read English text obtained from the Free ST American English Corpus [30] for approximately 2.3 hours, recorded simultaneously with a MEMS microphone and a vibration sensor, resulting in a total of 104 hours. To simulate noisy environments, clean speech audio signals were mixed with samples from the DEMAND noise dataset [36]. Noise was added at root mean square (RMS) levels ranging from −10 dB to 10 dB relative to the clean speech RMS level, with noise instances randomly selected from the noise dataset for each utterance to ensure diverse corruption patterns. For D-DCCRN, the objective was to reconstruct clean speech from Vib + noisy audio (Mic) input. We trained the model using a reconstruction loss targeting clean audio.
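The RMS-relative noise mixing described above can be sketched as follows. This is a minimal pure-Python illustration under our own assumptions (function names are ours, and the noise is tiled to the clean-speech length), not the authors' data pipeline.

```python
import math
import random

def rms(x):
    """Root mean square of a sample sequence."""
    return math.sqrt(sum(s * s for s in x) / len(x))

def mix_noise(clean, noise, level_db):
    """Scale `noise` so its RMS sits `level_db` dB relative to the RMS of
    `clean`, then add it sample-by-sample. level_db = 0 means equal RMS;
    level_db = -10 means the noise is 10 dB below the clean speech."""
    target = rms(clean) * (10.0 ** (level_db / 20.0))
    g = target / (rms(noise) or 1.0)  # guard against silent noise clips
    return [c + g * noise[i % len(noise)] for i, c in enumerate(clean)]
```

Sweeping `level_db` over the stated −10 dB to 10 dB range, with a freshly drawn noise clip per utterance, reproduces the kind of diverse corruption the paper describes.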
In addition, we introduced a loss based on the OpenAI Whisper decoder, inspired by knowledge distillation methods [10] such as Distil-Whisper [8]. The recognition results from the Vib + noisy input are trained to match those from the clean input. We applied both a soft loss (aligning decoder output distributions) and a hard loss (improving token-level accuracy), following standard distillation practice. Fig. 6 shows an example of audio enhancement from noisy audio and the vibration sensor. At the audio enhancement level, the loss is defined as:

enc = AE_dual(Noise(mic), vib)
L_ae = MSE(Mel(mic), Mel(enc)) + λ_si · SI-SDR(mic, enc)   (1)

where mic is the microphone signal, vib is the vibration signal, AE_dual(·,·) is the dual-input audio enhancement model (D-DCCRN), Noise(·) is a noise-adding function, MSE(·) is the mean square error, Mel(·) is the mel-spectrogram, and SI-SDR is the Scale-Invariant Signal-to-Distortion Ratio. The goal is to minimize the difference between the clean signal mic and the enhanced output enc.
To ensure consistency at the ASR level, we additionally use a knowledge distillation loss L_kd, following the method proposed by Distil-Whisper [8]. In Distil-Whisper, the output of the OpenAI Whisper decoder is used as a teacher signal to enhance the capability of the student decoder. In contrast, our objective is to improve the ASR performance of the dual-input audio enhancement output. To this end, we freeze both the encoder and decoder of OpenAI Whisper and use them solely as an evaluator, while employing them in the training process of the audio enhancement model. To achieve this objective, we define two losses: a hard loss L_hard and a soft loss L_soft [10]. Using the OpenAI Whisper encoder, we obtain a hidden state representation H from the clean Mic input and a hidden state representation H′ from the noise-enhanced input enc. Then L_hard is defined as:

L_hard = − Σ_{i=1}^{N′} P(y_i | ŷ_{<i}, H′)   (2)
where ŷ are the pseudo-labels [19] generated from clean speech. This loss represents how accurately enhanced audio from noisy voices can generate tokens compared to clean voice tokens. L_soft can be defined as the Kullback-Leibler (KL) divergence between the token probability distributions of clean audio (Q_i) and enhanced audio (R_i), so that R_i is trained to match the full distribution of the teacher (Q_i) by minimizing the KL divergence over the entire set of token probabilities at position i:

L_soft = Σ_{i=1}^{N} KL(Q_i, R_i)   (3)

Then we get the combined knowledge distillation (KD) objective:

L_kd = λ_soft · L_soft + λ_hard · L_hard   (4)

Finally, the total loss is given by:

L_total = L_ae + λ · L_kd   (5)

Figure 4: D-DCCRN (Dual-DCCRN) accepts composite inputs from a microphone (Mic) and a vibration sensor (Vib). D-DCCRN generalizes the design of the DCCRN audio enhancement model to jointly process the real and imaginary components of both Mic and Vib signals.
Figure 5: The NasoVoce training method: it combines the audio enhancement loss (L_ae) and the knowledge distillation loss (L_kd).
5 Evaluation
To evaluate the proposed approach, we conducted (i) ASR accuracy tests, (ii) objective audio-quality measurements, and (iii) participant-based quality ratings. The evaluation data were disjoint from the training set; for each utterance we prepared a clean source and a corresponding noise-corrupted version.
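As a toy illustration of how the distillation terms combine into the total training objective of Section 4.1, the sketch below operates on small hand-made token distributions. The weights λ, λ_soft, and λ_hard are placeholder values (the paper does not report them here), and real training would use framework tensors; the hard loss follows the equation as printed, which sums pseudo-label probabilities rather than the log-probabilities of standard cross-entropy.

```python
import math

def kl_div(q, r, eps=1e-12):
    """KL(Q_i || R_i) between two discrete token distributions."""
    return sum(qi * math.log((qi + eps) / (ri + eps)) for qi, ri in zip(q, r))

def soft_kd_loss(teacher, student):
    # Soft loss: sum of per-position KL divergences.
    return sum(kl_div(q, r) for q, r in zip(teacher, student))

def hard_kd_loss(student, labels):
    # Hard loss as printed: negative sum of pseudo-label token probabilities.
    return -sum(dist[y] for dist, y in zip(student, labels))

def total_loss(l_ae, teacher, student, labels,
               lam=1.0, lam_soft=0.5, lam_hard=0.5):
    """Combine audio-enhancement and KD losses (placeholder weights)."""
    l_kd = lam_soft * soft_kd_loss(teacher, student) \
         + lam_hard * hard_kd_loss(student, labels)
    return l_ae + lam * l_kd
```

When the student distributions equal the teacher's, the soft term vanishes and only the hard term and L_ae remain, which is the intended fixed point of the distillation.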
Figure 6: Audio quality improvement examples (normal and whispered speech) through the combined use of microphone (Mic) input and vibration sensor (Vib) input. While the Mic input contains external noise, the audio enhancement results obtained from both Mic and Vib inputs (Enhanced) closely approximate the Ground Truth. This outcome is demonstrated through a simulation in which noise is added to the Ground Truth to form the Mic input, and an audio enhancement model is subsequently applied.
Figure 7: Speech recognition accuracy (WER, CER) for normal and whispered speech, plotted against signal-to-noise ratio: MEMS microphone (Mic), MEMS vibration sensor (Vib), and audio enhancement by D-DCCRN.
Figure 8: PESQ and STOI scores of the MEMS microphone (Mic), the MEMS vibration sensor (Vib), and the enhanced voice produced by the D-DCCRN model combining Mic and Vib, under various noise conditions.
ASR and objective quality were computed on 1,000 items. For the subjective study, participants rated five sets, each comprising: a clean reference, a simultaneously
recorded MEMS vibration sensor (Vib) channel, microphone signals obtained by mixing the clean speech with various noises (Mic conditions), and signals enhanced from Vib+Mic by the proposed audio-enhancement model.

5.1 ASR (WER / CER)

Fig. 7 reports ASR recognition accuracy in terms of Word Error Rate (WER) and Character Error Rate (CER). We conducted the measurements using data consisting of 200 utterances from each of the evaluators (n = 4), who were distinct from the participants involved in the training dataset collection. As in the training phase, noise was superimposed using the DEMAND noise dataset [36]. Mic denotes recognition from the microphone audio; Vib denotes recognition directly from the vibration sensor signal; Enhanced denotes recognition from an enhanced signal produced by audio enhancement that fuses Mic and Vib.

Figure 9: MUSHRA-based audio quality evaluation: We assessed the Clean speech (Hidden Reference), the simultaneously recorded MEMS vibration sensor (Vib), a microphone signal with added noise (Mic; SNR 10 dB, 0 dB, −10 dB), and an Enhanced signal obtained by fusing Mic and Vib and processing with D-DCCRN.

As shown in the figure, for both normal and whispered speech, Mic recognition accuracy degrades as the noise level increases. While Vib remains relatively stable, its recognition accuracy for whispered speech is worse than for normal speech. Enhanced is comparatively robust to noise and, in particular, consistently achieves better recognition accuracy than Vib for whispered speech. Moreover, when the noise level is 0 dB or higher, Enhanced outperforms Mic in terms of recognition accuracy. From these results, we conclude that NasoVoce can reliably recognize whispered speech even under noisy conditions.
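For reference, WER is the word-level edit distance (substitutions, insertions, and deletions) between the hypothesis and the reference, divided by the reference word count; CER is the same quantity computed over characters. A minimal sketch with our own helper names (published evaluations typically use equivalent library implementations):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (DP, two rows)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution (0 if match)
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    """Word Error Rate: word-level edit distance / reference word count."""
    ref_w, hyp_w = ref.split(), hyp.split()
    return edit_distance(ref_w, hyp_w) / len(ref_w)

def cer(ref, hyp):
    """Character Error Rate: character-level edit distance / reference length."""
    return edit_distance(list(ref), list(hyp)) / len(ref)
```

For example, `wer("turn the light on", "turn light on")` counts one deletion over four reference words, giving 0.25.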
On the other hand, for normal speech, depending on the situation, Vib can sometimes achieve the highest recognition accuracy. In previous mixed-modality speech recognition studies, the IMU sensing bandwidth was around 0–400 Hz, which is not directly suitable for speech recognition. In contrast, our system employs a MEMS vibration sensor (V2S200D [32]) with a sensing range much higher than that of normal IMUs (10 Hz–2000 Hz), suggesting that Vib alone may be sufficient for speech recognition. As future work, we plan to develop a recognition model that can dynamically select the input configuration (Mic, Mic+Vib, or Vib alone) that yields the highest recognition accuracy under a given condition.

5.2 Voice Quality (PESQ / STOI)

We then evaluated voice enhancement quality with PESQ (Perceptual Evaluation of Speech Quality) [1, 27] and STOI (Short-Time Objective Intelligibility) [33, 34]. PESQ is an intrusive, reference-based metric (ITU-T P.862/P.862.2) that predicts perceived quality by comparing a processed/degraded signal to a clean reference using a psychoacoustic model. STOI is an intrusive metric that estimates speech intelligibility by correlating short-time temporal envelopes across one-third-octave bands between clean and processed speech. In our case, the MEMS microphone (Mic), the MEMS vibration sensor (Vib), and the voice enhanced by the D-DCCRN model combining Mic and Vib (Enhanced) are evaluated (Fig. 8) under various noise conditions (20 dB, 10 dB, 0 dB, and −10 dB). As shown in the graph, the enhanced results outperform the Mic condition at all noise levels. Under very strong noise (−10 dB), the Vib condition is better than the Enhanced model.

5.3 User Evaluation (MUSHRA)

Fig. 9 shows the subjective evaluation of speech quality with MUSHRA (MUltiple Stimuli with Hidden Reference and Anchor) [37]. MUSHRA is a method to evaluate audio quality defined by ITU-R Recommendation BS.1534.
It presents a hidden reference alongside the other stimuli, and participants give each stimulus a score (from 0 to 100) relative to the reference. Due to the use of a hidden reference, MUSHRA is considered more reliable than the mean opinion score (MOS). In our case, a clean voice was used as the reference. The participants rated the Clean voice (Hidden Reference), Vib (audio from the MEMS vibration sensor), Mic (audio from the MEMS microphone), and Enhanced (audio enhanced by the proposed D-DCCRN model with Vib and Mic input) under various noise conditions (−10 dB, 0 dB, and 10 dB). 50 English-fluent, gender-balanced participants over 18 years of age were recruited using the Prolific online survey system [16], with a JavaScript-based MUSHRA online tool [17]. For each rating, participants could play each stimulus as many times as they wanted. As shown in the graph, the Enhanced condition consistently outperformed the Mic condition and remained superior to Vib until the noise level reached 0 dB. Under the very high noise condition (−10 dB), Vib was slightly higher. We hypothesize that in high-noise environments the microphone input contributes minimally to the Enhanced model's output, thereby limiting improvements in perceived quality.

5.4 In-the-Wild Test

The evaluations described above were conducted with artificially superimposed noise. We also evaluated the acoustic performance of NasoVoce in real-world environments. Specifically, we recorded both normal and whispered speech in four scenarios: inside a café, on a busy roadside, while walking outdoors, and inside a train car. For each setting, we captured speech simultaneously with a conventional air-conduction microphone and with NasoVoce, and compared the resulting signals. Fig. 10 illustrates the experimental setup; representative audio is included in the supplementary video. In the in-train condition, we additionally compared NasoVoce against the "voice isolation" feature of Apple AirPods Pro 2.
The voice isolation function performs multi-microphone beamforming and iOS-side signal processing to suppress background sounds other than the wearer's speech. The comparison showed that, while Voice Isolation effectively attenuated environmental noise for normal speech, it almost completely suppressed whispered speech, presumably because whisper components were treated as background noise. In contrast, NasoVoce consistently captured whispered speech across all environments while robustly attenuating external noise.

Figure 10: In-the-wild trial of NasoVoce: We recorded normal and whispered speech in four everyday environments (inside a café, on a busy roadside, while walking, and inside a train car) and compared signals obtained with a conventional microphone and with NasoVoce.

Overall, our results show that the proposed dual-sensor method (Vibration + Mic) achieves higher recognition accuracy than Mic alone under diverse noise conditions and remains superior to vibration alone until the noise level reaches 0 dB (i.e., noise and speech at parity). This is supported by recognition accuracy on the trial users' utterances (WER, CER), objective quality/intelligibility metrics (PESQ, STOI), and user ratings using the MUSHRA procedure.

6 Discussions

Use of Whisper, Vibration, and Enhanced Speech

According to our evaluation results, under low-noise conditions the accuracy of Enhanced speech and Mic speech is comparable. However, as noise levels increase, the accuracy of Mic speech drops sharply. In contrast, under very high-noise conditions, the performance of Enhanced speech becomes roughly equivalent to that of the Vib input alone. For whispered speech, however, the performance of Enhanced speech degrades considerably under noisy conditions.
We attribute this to the small vibration signals produced during whispering, which reduce the quality of the Vib input. On the other hand, in quiet environments whisper input is particularly valuable, while in environments with substantial external noise whisper input is less necessary and normal speech can be used instead. We argue that it is desirable to provide either a user interface that allows flexible selection of the optimal input mode or an automatic mechanism that determines the most suitable input method depending on context.

Adaptive Sensor Fusion Strategy

While our dual-input model (D-DCCRN) achieves good performance across most conditions, we observed that under extreme noise conditions (e.g., −10 dB), the recognition accuracy using the vibration sensor (Vib) alone exceeds that of the enhanced signal. This indicates that when the acoustic signal-to-noise ratio (SNR) drops below a critical threshold, the air-conducted microphone signal may introduce more noise artifacts than useful speech features into the fusion network, counteracting the benefits of multi-modal integration.

This finding provides a design implication for real-world deployment: a robust always-available speech interface should not rely on a static fusion weight. Instead, we propose an SNR-adaptive gating mechanism. Such a system would monitor environmental noise levels in real time and dynamically transition between modes. The ability to discard the microphone signal entirely in favor of the mechanically isolated vibration sensor validates the necessity of our heterogeneous sensor configuration, ensuring communication remains possible even when air conduction fails.

Alternative Sensor Configurations

We expect the proposed dual-sensor approach, which combines a vibration sensor (Vib) and a microphone (Mic), to remain effective at contact sites other than the nose. Fig.
11 (left) illustrates use with the device placed against the philtrum (the region between the upper lip and the nose). In a non-wearable, voice-memo-style configuration, the device can operate as a conventional microphone during normal phonation and be pressed against the skin in noisy environments or when whispered speech is required. Fig. 11 (right) also depicts a throat-microphone configuration in which the Vib and Mic are co-located; compared with throat microphones that rely solely on skin conduction, this design is expected to improve audio quality.

Physiological Variability

Previous work suggests that nasal patency modulates accelerometric detectability on the nasal bridge. Given our nose-pad mounting, future versions should incorporate per-user calibration and online adaptation to mitigate day-to-day variability (e.g., rhinitis, mask wear).

7 Conclusion

We present NasoVoce, a nose-bridge-mounted interface that integrates a microphone and a vibration sensor. Positioned at the nasal pads of smart glasses, NasoVoce unobtrusively captures normal and whispered utterances in everyday use, through acoustic and vibration sensors.

Figure 11: Alternative use cases of the proposed Vib-Mic sensor: (Left) Placed against the skin between the upper lip and the nose, it can reliably capture whispered speech via skin conduction. (Right) Applied as a throat microphone, it acquires both skin-conducted and vocal signals through the neck.

The nasal bridge's proximity to the mouth facilitates the acquisition of bone- and skin-conducted speech. Compared with earphone or neckband devices, it can also capture low-volume utterances, such as whispered speech, more reliably. While the microphone provides high-quality audio, it is susceptible to environmental noise; the vibration sensor, in contrast, captures speech through bone and skin conduction with strong noise robustness, but lower quality.
By combining these complementary inputs, we obtain high-quality speech that is robust to external noise. We developed a dual-input model, D-DCCRN, an extension of the DCCRN audio enhancement model. Recognition experiments with the OpenAI Whisper Large-v2 ASR confirm that the proposed model improves accuracy under noisy conditions. Objective metrics (PESQ, STOI) and subjective ratings (MUSHRA) further validate the benefits of multimodal input. This work demonstrates the feasibility of NasoVoce as a practical interface for continuous and discreet AI voice conversations.

Acknowledgments

This work was supported by JST Moonshot R&D Grant JPMJMS2012 and JSPS KAKENHI Grant Numbers JP25097211 and JP24984259. We thank Yoko Rekimoto for her assistance with the demonstration video production.

References

[1] 2001. Perceptual evaluation of speech quality (PESQ): An objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs.
[2] Apple Inc. 2025. Use Voice Isolation, Wide Spectrum, or Automatic Mic Mode on your iPhone and iPad. https://support.apple.com/en-us/101993 Accessed: 2025-11-29.
[3] Yannis M. Assael, Brendan Shillingford, Shimon Whiteson, and Nando de Freitas. 2016. LipNet: End-to-End Sentence-level Lipreading. arXiv:1611.01599 [cs.LG] https://arxiv.org/abs/1611.01599
[4] Fei C. Chen, Estella P.-M. Ma, and Edwin M.-L. Yiu. 2014. Facial Bone Vibration In Resonant Voice Production. Journal of Voice 28, 5 (2014), 596–602. doi:10.1016/j.jvoice.2013.12.014
[5] Adiba Tabassum Chowdhury, Mehrin Newaz, Purnata Saha, Mohannad Natheef AbuHaweeleh, Sara Mohsen, Diala Bushnaq, Malek Chabbouh, Raghad Aljindi, Shona Pedersen, and Muhammad E. H. Chowdhury. 2025. Decoding silent speech: a machine learning perspective on data, methods, and frameworks. Neural Computing and Applications 37, 10 (2025), 6995–7013. doi:10.1007/s00521-024-10456-z
[6] Aref Farhadipour, Homa Asadi, and Volker Dellwo.
2024. Leveraging Self-Supervised Models for Automatic Whispered Speech Recognition. arXiv:2407.21211 [eess.AS] https://arxiv.org/abs/2407.21211
[7] Masaaki Fukumoto. 2018. SilentVoice: Unnoticeable Voice Input by Ingressive Speech. In Proceedings of the 31st Annual ACM Symposium on User Interface Software and Technology (Berlin, Germany) (UIST ’18). Association for Computing Machinery, New York, NY, USA, 237–246. doi:10.1145/3242587.3242603
[8] Sanchit Gandhi, Patrick von Platen, and Alexander M. Rush. 2023. Distil-Whisper: Robust Knowledge Distillation via Large-Scale Pseudo Labelling. arXiv:2311.00430 [cs.CL] https://arxiv.org/abs/2311.00430
[9] Lixing He, Haozheng Hou, Shuyao Shi, Xian Shuai, and Zhenyu Yan. 2023. Towards Bone-Conducted Vibration Speech Enhancement on Head-Mounted Wearables. In Proceedings of the 21st Annual International Conference on Mobile Systems, Applications, and Services (MobiSys ’23). ACM, Helsinki, Finland, 14 pages. doi:10.1145/3581791.3596832
[10] Geoffrey E. Hinton, Oriol Vinyals, and Jeffrey Dean. 2015. Distilling the Knowledge in a Neural Network. CoRR abs/1503.02531 (2015). http://dblp.uni-trier.de/db/journals/corr/corr1503.html#HintonVD15
[11] Hirotaka Hiraki, Shusuke Kanazawa, Takahiro Miura, Manabu Yoshida, Masaaki Mochimaru, and Jun Rekimoto. 2024. WhisperMask: a noise suppressive mask-type microphone for whisper speech. In Proceedings of the Augmented Humans International Conference 2024 (Melbourne, VIC, Australia) (AHs ’24). Association for Computing Machinery, New York, NY, USA, 1–14. doi:10.1145/3652920.3652925
[12] Hirotaka Hiraki and Jun Rekimoto. 2025. SilentWhisper: inaudible faint whisper speech input for silent speech interaction. In Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems (CHI EA ’25). Association for Computing Machinery, New York, NY, USA, Article 746, 6 pages.
doi:10.1145/3706599.3721185
[13] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. (June 2021). arXiv:2106.07447 [cs.CL]
[14] Yanxin Hu, Yun Liu, Shubo Lv, Mengtao Xing, Shimin Zhang, Yihui Fu, Jian Wu, Bihong Zhang, and Lei Xie. 2020. DCCRN: Deep Complex Convolution Recurrent Network for Phase-Aware Speech Enhancement. In Proc. Interspeech 2020. 2472–2476. doi:10.21437/Interspeech.2020-2537
[15] Boyan Huang, Baiyu Liu, Shuai Zhang, Zhijun Zhang, Tao Zhang, Wenqi Jia, Shiming Zhang, Yifeng Lin, and Tetsuya Shimamura. 2024. Online bone/air-conducted speech fusion in the presence of strong narrowband noise. Signal Processing 225 (2024), 109615. doi:10.1016/j.sigpro.2024.109615
[16] Prolific inc. 2014. Prolific. https://w.prolific.co
[17] jfsantos. 2019. mushraJS. https://github.com/jfsantos/mushraJS
[18] Arnav Kapur, Shreyas Kapur, and Pattie Maes. 2018. AlterEgo: A Personalized Wearable Silent Speech Interface. In 23rd International Conference on Intelligent User Interfaces (Tokyo, Japan) (IUI ’18). Association for Computing Machinery, New York, NY, USA, 43–53. doi:10.1145/3172944.3172977
[19] Yoon Kim and Alexander M. Rush. 2016. Sequence-Level Knowledge Distillation. arXiv:1606.07947 [cs.CL] https://arxiv.org/abs/1606.07947
[20] Naoki Kimura, Michinari Kono, and Jun Rekimoto. 2019. SottoVoce: An Ultrasound Imaging-Based Silent Speech Interaction Using Deep Neural Networks. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (Glasgow, Scotland, UK) (CHI ’19). Association for Computing Machinery, New York, NY, USA, 1–11. doi:10.1145/3290605.3300376
[21] Tatsuya Kitamura. 2012. Measurement of vibration velocity pattern of facial surface during phonation using scanning vibrometer. Acoustical Science and Technology 33, 2 (2012), 126–128.
[22] Kelan Kuang, Feiran Yang, and Jun Yang. 2024. A lightweight speech enhancement network fusing bone- and air-conducted speech. The Journal of the Acoustical Society of America 156, 2 (2024), 1355–1366. doi:10.1121/10.0028339
[23] Héctor A. Cordourier Maruri, Paulo Lopez-Meyer, Jonathan Huang, Willem Marco Beltman, Lama Nachman, and Hong Lu. 2018. V-Speech: Noise-Robust Speech Capturing Glasses Using Vibration Sensors. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 2, 4, Article 180 (Dec. 2018), 23 pages. doi:10.1145/3287058
[24] J. Moon. 1990. The influence of nasal patency on accelerometric transduction of nasal bone vibration. The Cleft Palate Journal 27, 3 (1990), 266–274. doi:10.1597/1545-1569(1990)027<0266:tionpo>2.3.co;2
[25] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. arXiv:2212.04356 [eess.AS] https://arxiv.org/abs/2212.04356
[26] Jun Rekimoto. 2023. WESPER: Zero-shot and Realtime Whisper to Normal Voice Conversion for Whisper-based Speech Interactions. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 700, 12 pages. doi:10.1145/3544548.3580706
[27] Antony W. Rix, John G. Beerends, Michael P. Hollier, and Andries P. Hekstra. 2001. Perceptual Evaluation of Speech Quality (PESQ): A New Method for Speech Quality Assessment of Telephone Networks and Codecs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 2. 749–752. doi:10.1109/ICASSP.2001.941023
[28] Tanmay Srivastava, Prerna Khanna, Shijia Pan, Phuc Nguyen, and Shubham Jain. 2024. Unvoiced: Designing an LLM-assisted Unvoiced User Interface using Earables.
In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (Hangzhou, China) (SenSys ’24). Association for Computing Machinery, New York, NY, USA, 784–798. doi:10.1145/3666025.3699374
[29] Zixiong Su, Shitao Fang, and Jun Rekimoto. 2023. LipLearner: Customizable Silent Speech Interactions on Mobile Devices. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23). Association for Computing Machinery, New York, NY, USA, Article 696, 21 pages. doi:10.1145/3544548.3581465
[30] surfing.ai. 2018. ST-AEDS-20180100_1 Free ST American English Corpus. https://openslr.org/45/
[31] Syntiant. 2024. SiSonic Surface Mount MEMS Microphones. https://w.syntiant.com/mems
[32] Syntiant. 2024. V2S Voice Vibration Sensor. https://w.syntiant.com/v2s
[33] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. 2010. A Short-Time Objective Intelligibility Measure for Time-Frequency Weighted Noisy Speech. In Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP). 4214–4217. doi:10.1109/ICASSP.2010.5495701
[34] Cees H. Taal, Richard C. Hendriks, Richard Heusdens, and Jesper Jensen. 2011. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing 19, 7 (2011), 2125–2136. doi:10.1109/TASL.2011.2114881
[35] Tianyi Tan, Haoxin Ruan, Xinan Chen, Kai Chen, Zhibin Lin, and Jing Lu. 2025. DistillW2N: A Lightweight One-Shot Whisper to Normal Voice Conversion Model Using Distillation of Self-Supervised Features. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 1–5. doi:10.1109/ICASSP49660.2025.10888480
[36] J. Thiemann, N. Ito, and E. Vincent. 2013. DEMAND: a collection of multi-channel recordings of acoustic noise in diverse environments. 21st International Congress on Acoustics (ICA 2013) (2013). https://doi.org/10.5281/zenodo.1227121
[37] International Telecommunication Union. 2013.
BS.1534: Method for the subjective assessment of intermediate quality level of audio systems. https://w.itu.int/rec/R-REC-BS.1534/en
[38] Heming Wang, Xueliang Zhang, and DeLiang Wang. 2022. Fusing Bone-Conduction and Air-Conduction Sensors for Complex-Domain Speech Enhancement. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2022), 3134–3143. doi:10.1109/TASLP.2022.3209943
[39] Lei Wang, Xingwei Wang, Xi Zhang, Xiaolei Ma, Yu Zhang, Fusang Zhang, Tao Gu, and Haipeng Dai. 2025. AccCall: Enhancing Real-time Phone Call Quality with Smartphone’s Built-in Accelerometer. Proc. ACM Interact. Mob. Wearable Ubiquitous Technol. 9, 3, Article 133 (Sept. 2025), 33 pages. doi:10.1145/3749463
[40] Mou Wang, Junqi Chen, Xiaolei Zhang, Zhiyong Huang, and Susanto Rahardja. 2022. Multi-modal speech enhancement with bone-conducted speech in time domain. Applied Acoustics 200 (2022), 109058. doi:10.1016/j.apacoust.2022.109058
[41] Xue Wang, Zixiong Su, Jun Rekimoto, and Yang Zhang. 2024. Watch Your Mouth: Silent Speech Recognition with Depth Sensing. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (Honolulu, HI, USA) (CHI ’24). Association for Computing Machinery, New York, NY, USA, Article 323, 15 pages. doi:10.1145/3613904.3642092
[42] E. M. Yiu, F. C. Chen, G. Lo, and G. Pang. 2012. Vibratory and perceptual measurement of resonant voice. Journal of Voice 26, 5 (2012). doi:10.1016/j.jvoice.2012.02.005
[43] Cheng Yu, Kuo-Hsuan Hung, Syu-Siang Wang, Szu-Wei Fu, Yu Tsao, and Jeih-Weih Hung. 2020. Time-Domain Multi-modal Bone/air Conducted Speech Enhancement. IEEE Signal Processing Letters 27 (2020), 1035–1039. doi:10.1109/LSP.2020.3000968
[44] Ruidong Zhang, Ke Li, Yihong Hao, Yufan Wang, Zhengnan Lai, François Guimbretière, and Cheng Zhang. 2023. EchoSpeech: Continuous Silent Speech Recognition on Minimally-obtrusive Eyewear Powered by Acoustic Sensing. In Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems (Hamburg, Germany) (CHI ’23).
Association for Computing Machinery, New York, NY, USA, Article 852, 18 pages. doi:10.1145/3544548.3580801
[45] Yi Zhou, Yufan Chen, Yongbao Ma, and Hongqing Liu. 2020. A Real-Time Dual-Microphone Speech Enhancement Algorithm Assisted by Bone Conduction Sensor. Sensors 20, 18 (2020), 5050. doi:10.3390/s20185050