Paper deep dive
Spectrogram features for audio and speech analysis
Ian McLoughlin, Lam Pham, Yan Song, Xiaoxiao Miao, Huy Phan, Pengfei Cai, Qing Gu, Jiang Nan, Haoyu Song, Donny Soh
Abstract
Spectrogram-based representations have grown to dominate the feature space for deep learning audio analysis systems, and are often adopted for speech analysis also. Initially, the primary motivator for spectrogram-based representations was their ability to present sound as a two dimensional signal in the time-frequency plane, which not only provides an interpretable physical basis for analysing sound, but also unlocks the use of a wide range of machine learning techniques such as convolutional neural networks, that had been developed for image processing. A spectrogram is a matrix characterised by the resolution and span of its two dimensions, as well as by the representation and scaling of each element. Many possibilities for these three characteristics have been explored by researchers across numerous application areas, with different settings showing affinity for various tasks. This paper reviews the use of spectrogram-based representations and surveys the state-of-the-art to question how front-end feature representation choice allies with back-end classifier architecture for different tasks.
Links
- Source: https://arxiv.org/abs/2603.14917v1
- Canonical: https://arxiv.org/abs/2603.14917v1
Full Text
Received: December 2025. Accepted: 2026. Published: 2026.
Citation: McLoughlin, I.; Lastname, F.; Lastname, F. Spectrogram features for audio and speech analysis. Appl. Sci. 2025, 1, 0. https://doi.org/
Copyright: © 2026 by the authors. Submitted to Appl. Sci. for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Article: Spectrogram features for audio and speech analysis
Ian McLoughlin 1,†*, Lam Pham 2, Yan Song 3, Xiaoxiao Miao 1, Huy Phan 4, Pengfei Cai 3, Qing Gu 3, Jiang Nan 3, Haoyu Song 1, Donny Soh 1
1 Singapore Institute of Technology; ian.mcloughlin@, 2303822@sit., donny.soh@singaporetech.edu.sg
2 Austrian Institute of Technology; lam.pham@ait.ac.at
3 The University of Science and Technology of China; songy@, cqi525@mail., jiang_nan@mail., qinggu6@mail.ustc.edu.cn
4 Meta Inc., Reality Labs; huy.phan@ieee.org
* Correspondence: ian.mcloughlin@singaporetech.edu.sg
† 1 Punggol Coast Road, Singapore 828608

Featured Application: Spectrogram-based input features have become the most popular choice for deep learning models that classify audio and speech, yet there are many settings related to resolution and representation type. This article surveys those choices and discusses their suitability for different application areas.

Keywords: spectrogram; spectrogram image feature; Mel-frequency spectrogram; Mel frequency cepstral coefficient (MFCC); constant-Q transform; audio analysis; speech classification

1. Introduction

The spectrogram, considered to have been invented at Bell Labs in the 1940s [1], was initially generated by a sound spectrograph machine as a stylus-on-paper plot to visualise the distribution of sound energy in a time-frequency plane. Then, as now, it transforms a one-dimensional sound waveform into a two dimensional image. Originally popular for ease of visualisation, allowing identification of important structures within a sound signal by eye, the spectrogram became useful in phonetics and various branches of acoustics. Well established by the late 1970s, its information-carrying abilities were highlighted by Victor Zue and Ron Cole, who demonstrated that it could be used for speech recognition [2].

Version March 17, 2026 submitted to Appl. Sci. https://doi.org/ arXiv:2603.14917v1 [eess.AS] 16 Mar 2026

Figure 1. Illustration of spectrogram creation from input audio data, as a stack of frequency vectors.

2. Taxonomy of spectrograms

At its heart, a spectrogram is nothing more than a two-dimensional picture of sound, usually with one axis representing frequency and the other axis representing time.
Individual pixel intensity represents in some way the strength of each frequency element at a particular time instant. In the earliest systems [1] the frequency axis and intensity were non-linear. Advances in sensors, high quality analogue-to-digital converters and improvements in signal processing led to the ability to form linear spectrograms [3]. More recently, non-linear representations have again become prevalent, tuned for different tasks, as we will explore in the remainder of this paper.

2.1. Basic spectrogram

Spectrograms are typically formed as a matrix of stacked frequency vectors, each of which represents the frequency magnitude over a short duration of time, referred to as an analysis frame. The frequency magnitude vectors are obtained from an orthogonal time-frequency transform such as a discrete Fourier transform (DFT), fast Fourier transform (FFT) or filterbank (FB). Discrete cosine transform (DCT), modified DCT, discrete wavelet transform (DWT) and many other transforms have been employed [3]. Fixed-size analysis frames are typically slices of an input audio waveform of between 10 and 30 ms for audible signals. Hence an auditory spectrogram is formed from a stack of frequency vectors obtained from successive frames, as illustrated in Fig. 1.

In almost all cases, to avoid issues with spectral leakage and edge effects (e.g. the Gibbs phenomenon [4]), the slices of input audio are windowed [5] prior to the time-frequency transform. Because the window functions usually taper to zero or near-zero at the edges of each frame, the frames are overlapped to ensure that frequencies from all regions of the input audio waveform (i.e. including the tapered edge regions) contribute to the frequency representation [3]. The overlap between frames is specified either as a percentage (often 50% overlap), or as a step between frames, e.g. 256-sample windows advanced by 128 samples between frames, or a 30 ms window advancing 10 ms each step.
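The framing, windowing and stacking just described can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the paper: the frame length, hop and choice of Hann window are illustrative defaults.

```python
import numpy as np

def linear_spectrogram(x, frame_len=256, hop=128):
    """Stack magnitude spectra of windowed, overlapping frames.

    hop = frame_len // 2 gives the 50% overlap discussed above.
    """
    window = np.hanning(frame_len)           # tapered window to limit spectral leakage
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[f * hop : f * hop + frame_len] * window
                       for f in range(n_frames)])
    # Real-input FFT keeps bins 0 .. Nyquist; taking the magnitude discards phase
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (time, frequency)

# A 1 kHz tone sampled at 16 kHz should peak in bin 1000 / (16000 / 256) = 16
sr = 16000
x = np.sin(2 * np.pi * 1000 * np.arange(sr) / sr)
S = linear_spectrogram(x)
```

Each row of `S` is one frequency vector; stacking rows over time yields the spectrogram matrix of the following equations.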
The step can be referred to as a hop, advance or shift. The maximum frequency resolution is limited by the number of samples in the analysis window, and the spectrogram time axis resolution is defined by the step size.

Let us denote an input audio waveform as x(n) and set the frame length to be w_s samples for current analysis frame f. With 50% overlap between frames, the analysis frame is x_f(n) = x[f.w_s/2 : f.w_s/2 + w_s]. Given a length-w_s window function w(n), the spectral magnitude representation X_f(k) is then

    X_f(k) = \sum_{n=0}^{w_s - 1} w(n) x_f(n) e^{-j 2\pi n k / w_s}    for k = 0 ... w_s - 1    (1)

Spectrogram S is obtained by stacking the frequency vectors directly into a rectangular matrix, i.e. for a time duration of F frames,

    S_{F,w_s} = [ X_0  X_1  X_2  ...  X_F ]    (2)

When used as an input feature to a deep learning system, it is also common that frequency downsampling or pooling happens at this point [6] to reduce the frequency dimension. The pooling process is discussed further in Section 2.9.

Numerous alternative methods of forming spectrograms exist, with the main variants shown in Table 1 along with their dimensions, element scaling and frequency span. The top three are the linear spectrogram (LS) as described above, followed by variants in which each element of the matrix has been scaled using log, A- or μ-law. The next two variants use Mel and log-Mel scaling, discussed in Section 2.3, while the derivations of the bottom three are explored subsequently.

Table 1. Taxonomy of spectrograms.
| Description | Dimensions | Element scale | Frequency span |
|---|---|---|---|
| Linear spectrogram (LS) | time, frequency (T, F) | scalar (0, 1) | (0, Nyquist) |
| Log-scaled spectrogram (LSS) | T, F | log (−100 dB, 0) | (0, Nyquist) |
| A/μ-law scaling | T, F | log (0, 255) | (0, Nyquist) |
| Mel-spectrogram (MS) | T, Mel-F | linear | (0, Mel(Nyquist)) |
| Log-Mel-spectrogram (LMS) | T, Mel-F | log (−100 dB, 0) | (0, Mel(Nyquist)) |
| Gammatonegram (GTG) | trapezoidal/squared-T, F | log | (ERB(0), ERB(Nyquist)) |
| Constant-Q transform (CQT) | trapezoidal/squared-T, F | log/linear | (0, Nyquist) |
| Stabilised auditory image (SAI) | non-linear F, lag | scaled (0, 1) | (0, 35 ms by default) |

2.2. Spectrograms are not pictures

While spectrograms allow audio and speech to be processed in a deep learning system using techniques that were originally developed for image processing, caution should be observed for the following three aspects in which spectrograms and picture images differ significantly:

2.2.1. Colour and greyscale

Basic linear spectrograms are greyscale with pixel values that are typically scaled to the range [0, 1], but are often colourised for ease of viewing. Colourisation maps scalar pixel values to RGB values [3]. In MATLAB (The MathWorks Inc.), which is often used for visualisation of spectrograms, each pixel is scaled from 0 to 1. Prior to MATLAB release R2014b, the 'Jet' colourmap was used by default to scale a spectrum through blue-green-yellow-orange-red across the range 0 to 1. More recent versions of MATLAB use the 'parula' colourmap that scales blue-green-yellow. The popular audio handling tool Audacity maps from −100 dB in black, through purple-magenta-light orange to white, for pixel values above −20 dB. Both can be modified to display in either greyscale or using other colourmaps. Python-based tools also impose a colourmap, which may differ based on the library used. In matplotlib, the pcolormesh and pcolor functions both default to the 'viridis' colourmap, which scales blue-green-yellow across the range 0 to 1.
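Rather than letting a plotting library impose an RGB colourmap, a spectrogram destined for a network can be kept as a single greyscale channel. A minimal sketch, assuming magnitudes pre-scaled so that full scale is 1.0; the −100 dB floor follows the Audacity convention mentioned above, and the function name is illustrative:

```python
import numpy as np

def to_greyscale(S_mag, floor_db=-100.0):
    """Map a magnitude spectrogram to one [0, 1] channel: 0.0 at the dB
    floor, 1.0 at full scale -- no colourmap involved."""
    S_db = 20.0 * np.log10(np.maximum(S_mag, 10.0 ** (floor_db / 20.0)))
    S_db = np.clip(S_db, floor_db, 0.0)
    return (S_db - floor_db) / -floor_db
```

The resulting single-channel array can be fed to a one-channel CNN input, avoiding the tripled front-end discussed next.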
While colour scaling produces pretty plots, many researchers simply input a spectrogram, or a spectrogram patch, into a convolutional neural network (CNN) that has been designed for image processing, and thus assumes 3 input channels to handle RGB components separately. Since the mapping from spectral magnitude to RGB depends arbitrarily on the kind of spectrogram used, and the version of tool used to produce it, there is no logical justification for processing using colour spectrograms. Networks such as CNNs can learn a mapping from any scaling, but it may be at the cost of three times the front-end complexity of a greyscale spectrogram input.

2.2.2. Translation invariance and scaling

Structures or objects in pictures can very easily be translated to different locations in the image, while remaining the same object, so classification is usually location-invariant. In a spectrogram, translation of structures along the time axis does not change their fundamental nature, but significant translation along the frequency axis can completely change the sound that those structures represent. Unlike in a picture, relationships in the X axis and the Y axis have very different meanings in a spectrogram.

Furthermore, scaling an object in a picture does not change the nature of the object, it only changes its size, i.e. making it appear closer or further away. In a spectrogram, scaling a structure that represents a sound event yields a very different result. It adjusts both the time duration of the event and also its frequency span, and has the potential to completely change the audible nature of the event. Importantly, audio deep learning systems need careful matching of scaling between training and inference.

2.2.3. Local features

Advanced image processing techniques can exploit both local and global regional characteristics to interpret the content of an image.
To do this, systems perform neighbourhood correlations, as well as global texture correlations across an image, and this is part of the motivation behind the use of CNNs [7]. While both local and global correlations are also important in audio tasks such as sound event detection, the nature of those correlations will be very different. For example, similar 'textures' in the 0-50 Hz and 16-18 kHz frequency ranges of a spectrogram are unlikely to be significant to understanding the content, whereas in a picture, similar regions of the same texture might be patches of grass at the bottom left and top right of an image – which also relates to the translation invariance noted above. Local correlations across the time axis of spectrograms may be more akin to the frame-to-frame difference in video frames than they are to physically proximate points in a picture.

2.3. Mel-spectrogram

Mel-spectrograms were inspired by the Mel scale, which utilises human equal-loudness data to map frequencies in Hertz to a non-linear scale corresponding to human auditory perception. The mapping from linear frequency f_hz (in Hertz) to Mel frequency f_mel [8] is generally computed as:

    f_mel = 2595 \log_{10}(1 + f_hz / 700)    (3)

A short time Fourier transform (STFT) output vector (i.e. a vector of instantaneous power values for uniformly sampled frequency bins) is transformed into a Mel-scale representation vector via a set of bandpass filters, normally with triangular shape. The bandpass filters are centred at Mel scale frequencies based on eqn. 3. Each triangular filter accumulates the weighted power spectrum sum along the frequency dimension [8]. Just as a linear spectrogram is constructed from a stack of linear frequency vectors, a Mel-spectrogram is constructed from a stack of Mel-frequency vectors obtained from successive analysis frames.
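Eqn. 3 and the triangular filterbank it defines can be sketched as follows. This follows the common construction rather than a specific implementation from the paper; the filter count, FFT size and sample rate are illustrative defaults.

```python
import numpy as np

def hz_to_mel(f_hz):
    return 2595.0 * np.log10(1.0 + f_hz / 700.0)     # eqn. 3

def mel_to_hz(f_mel):
    return 700.0 * (10.0 ** (f_mel / 2595.0) - 1.0)  # inverse of eqn. 3

def mel_filterbank(n_filters=26, n_fft=512, sr=16000):
    """Triangular filters centred at equally spaced Mel frequencies.

    Returns an (n_filters, n_fft // 2 + 1) matrix; multiplying a magnitude
    or power spectrum by it accumulates each band, as described above.
    """
    mel_pts = np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for m in range(1, n_filters + 1):
        lo, mid, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, mid):                     # rising edge of the triangle
            fb[m - 1, k] = (k - lo) / max(mid - lo, 1)
        for k in range(mid, hi):                     # falling edge
            fb[m - 1, k] = (hi - k) / max(hi - mid, 1)
    return fb
```

Applying `mel_filterbank() @ spectrum` to each frame's power spectrum, then stacking over frames, yields the Mel-spectrogram.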
Since the Mel-spectrogram is based on Mel filters that were developed from human auditory perception experiments, both individual Mel-frequency feature vectors, and their stack into a Mel-spectrogram, have proven effective for various tasks related to human speech analysis. State-of-the-art systems proposed for speaker identification [9], speech-to-text [10], speech emotion detection [11], and so on, have used Mel-based spectrograms for the pre-processing feature engineering.

Given that applying Mel filter banks to the STFT spectrogram across the frequency dimension is effective at capturing distinct features in an audio signal (i.e., Mel filters are widely used in human speech analysis), several similarly-inspired filter bank representations have also been proposed. These include the Gammatone filter [12], inspired by cochlea simulation, and the Nearest Neighbour filter [13], inspired by image pre-processing. Several are illustrated in Fig. 2.

2.4. Constant-Q spectrogram

The Constant-Q spectrogram is generated by applying a constant-Q transform (CQT), which was first introduced in [14] and is closely related to the Fourier transform. Like the Fourier transform, the CQT is formed from a bank of filters, but with the difference that the centre frequencies of each CQT element are spaced in a geometrical tonal space as:

    f_k = f_min . 2^{k/b}    for 1 ≤ k ≤ K    (4)

where f_k denotes the centre frequency of the k-th filter, f_min is the minimum frequency, and b is the number of filters per octave. As the name suggests, the Q value, which is the ratio of centre frequency to bandwidth, is constant.
It is computed as:

    Q = f_k / Δf_k = f_k / (f_{k+1} − f_k) = (2^{1/b} − 1)^{−1}    (5)

In musical analysis, by setting f_min and b to directly correspond to musical notes (i.e., choosing b = 12 and f_min as the frequency of, for example, MIDI note 0 or C−1), the centre frequencies in the CQT will correspond to musical note frequencies, making it effective at capturing musical tones. As a result, the Constant-Q spectrogram has been widely used for musical analysis [15,16], but has also been applied to more general sound event detection. Since it is effectively a triangular representation, it is often transformed into a rectangular matrix prior to use.

2.5. Correlogram

The correlogram utilises autocorrelation to capture the similarity between an audio signal and itself at a given time lag. Autocorrelation vectors are computed for a range of different time lags [17]. Given a long audio signal, it is first separated into short audio segments. A correlogram (or auto-correlation vector) is obtained for each audio segment from the auto-correlation coefficients of frequency components along the time axis. As a result, the long audio signal is represented by a matrix of auto-correlation variables, with each matrix column being an auto-correlation vector representing a given time lag. In structure, it presents similarly to the stabilised auditory image (SAI) of Section 2.6.

2.6. Stabilised auditory image

Patterson et al. [18] proposed the auditory image model (AIM) in 1995, aiming to simulate the frequency discrimination and amplitude sensitivity of neural activity patterns from hearing. Essentially, an AIM models the function of the basilar membrane, which is part of the organ of hearing within the human cochlea [3], when exposed to a pure tone. Walters [19] integrated this in time (strobed temporal integration, essentially a type of correlogram) to yield the stabilised auditory image (SAI), aiming to improve noise robustness and enhance the detection of periodicity compared to the AIM.
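Returning to the constant-Q transform of Section 2.4, the geometric spacing of eqn. 4 and the constant Q of eqn. 5 are easy to verify numerically. A minimal sketch, using the musical settings noted above (b = 12, f_min ≈ 8.176 Hz for MIDI note 0):

```python
import numpy as np

def cqt_frequencies(f_min=8.176, b=12, K=120):
    """Centre frequencies f_k = f_min * 2^(k/b), k = 1 .. K (eqn. 4)."""
    return f_min * 2.0 ** (np.arange(1, K + 1) / b)

def q_factor(b=12):
    """Q = (2^(1/b) - 1)^-1 (eqn. 5): centre frequency over bandwidth."""
    return 1.0 / (2.0 ** (1.0 / b) - 1.0)

f = cqt_frequencies()
# Each bin sits one semitone above the last, 12 bins span exactly one
# octave, and f_k / (f_{k+1} - f_k) is the same value for every k.
```

With these settings Q ≈ 16.8, so higher-frequency filters are proportionally wider, which is what gives the raw CQT its triangular shape.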
A single SAI is a two-dimensional representation similar to a classical linear spectrogram, but where the y-axis is frequency and the x-axis represents lag or periodicity. As such it captures the nature of sound in a fixed time window, e.g. 35 ms in Fig. 2(a).

Figure 2. Illustrations of two dimensional time-frequency spectrograms based on (a) stabilised auditory image, (b) Constant-Q transform, (c) Mel-scaled spectrogram, (d) stacked MFCC, (e) linear magnitude spectrogram.

SAIs were used as input features in several of Google's early audio recall systems as developed by Lyon et al. [20–23], specifically employing PAMIR (passive-aggressive model for image retrieval), a pre-deep-learning ordering algorithm based on statistics obtained from regions within the SAI. Early attempts at using SAIs with deep learning architectures for sound event detection [24] were outperformed by linear spectrogram equivalents, probably because of the limited short-time window represented in single SAIs.

Figure 3. High level system diagram showing spectrogram features (a) being extracted from an input waveform as a stack of scaled transforms from windowed speech regions, then (b) features gathered from patches, pooled regions or a downsampled spectrogram image, for (c) input to a deep learning classification pipeline.

2.7. Patches and regions

Object detection from images, where the characteristic shape of an item can be in any location within the image, and of any size from small to large, often benefits from techniques where the image is divided into randomly scaled and located patches, each of which is processed independently by the deep learning model [25]. When the image is a spectrogram, such a process no longer has a physical justification (see Section 2.2 above). The exception is when slices of the spectrogram can be selected in the time domain, spanning the entire frequency range.
However there seems to be limited benefit in allowing the time windows to have different durations, so in practice fixed-size regions are usually inferenced independently, as in the audio spectrogram transformer (AST) [26].

2.8. Scaling and number representation

Audio samples, such as in the WAVE file format, are typically represented as 16-bit signed linear fixed-point numbers with range [−32768, 32767]. During computation they would generally be converted to 32-bit floating point and then divided by 2^15 so that they are scaled to a range of [−1, 1). An FFT of those samples will, by default, also retain the same 32-bit floating point number format. Although a true Fourier transform yields a complex spectrum, spectrograms are usually formed from the magnitude spectrum, hence each pixel value is always positive, and can be scaled to a range of [0, 1]. As mentioned above, log encoding is often used to provide a perceptually relevant emphasis to the samples. For lower complexity, μ- or A-law encoding is used to convert samples to 8-bit fixed-point scaled values in the range [0, 255]. This can help to substantially reduce downstream computational complexity, at the cost of higher quantisation noise.

2.9. Pooling and downsampling

State-of-the-art deep learning architectures for performing sound event detection, audio analysis and related tasks, including language identification, speaker verification and speech emotion recognition, generally utilise front-end learned layers to compute a one-dimensional representation vector (e.g. an embedding) from a raw feature input. Thus, whatever input feature is ingested, an intermediate representation – a fixed-dimension embedding – is produced. A time-stepped series of features yields a time-stepped series of embeddings for analysis. This stack of embeddings obtained over time can reveal statistics of how the underlying feature, and hence underlying audio signal, varies over time (e.g.
over an utterance, or a sound event). The two-dimensional block of embeddings from a well-trained front-end is usually amenable to classification.

As we have seen, spectrograms are computed from overlapping frames of speech which are windowed and then transformed to a magnitude spectrum; magnitude spectra from successive frames are stacked into a two-dimensional spectrogram image. This was discussed in Section 2.1, and shown in eqns. 1 and 2. In early machine hearing systems that pre-dated deep learning approaches, meta-features were extracted from the two-dimensional spectrogram and those features were classified. For example, Dennis et al. [27] divided a spectrogram into nine equal-sized regions, and classified the zero and first order statistics from each of the nine regions using an SVM. Lyon et al. [23] classified the marginal statistics from rows and columns of an SAI.

The advent of deep learning allowed neural networks to become capable of classifying raw spectrograms directly, but not at full resolution. Thus, downsampling (of samples prior to forming the spectrogram) or pooling of the frequency representation vector (i.e. combining frequency bins) have been common approaches since the very first DNN spectrogram classifier [28]. In fact a very similar process happens in many deep learning systems that classify non-spectrogram features too. Examples include MFCC, perceptual linear prediction (PLP) coefficients and filterbank coefficients that are used in tasks such as LID [29–31], discussed further in Section 4.1. In almost all cases, one-dimensional features are extracted from overlapping input audio frames and then stacked into a two-dimensional time-frequency block for classification. The frequency-domain features can be pooled at that point, or may have been already downsampled.
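Fixed-factor pooling along the frequency axis can be sketched as follows; a hypothetical minimal helper, not code from the cited systems, with the factor of 8 as one of the typical values:

```python
import numpy as np

def pool_frequency(S, factor=8, mode="mean"):
    """Reduce the frequency axis of spectrogram S (time, freq) by pooling
    blocks of `factor` neighbouring bins; a trailing partial block is dropped."""
    T, N = S.shape
    n_out = N // factor
    blocks = S[:, :n_out * factor].reshape(T, n_out, factor)
    return blocks.max(axis=2) if mode == "max" else blocks.mean(axis=2)
```

For example, a (time, 2048) spectrogram pooled with `factor=8` becomes (time, 256), the reduction used as an example in Section 2.10 below.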
Pooling or downsampling along the frequency axis involves taking the mean of, typically, 2, 4 or 8 neighbouring spectral magnitudes to reduce dimensionality by the same factor. In some cases, particularly for MFCC features, in addition to averaging, either max-pooling or standard deviation are computed.

Delta-coefficients are derived in the time domain to capture changes from one frame to the next. MFCC features are then concatenated with delta-MFCC, and even delta-delta-MFCC features to capture acceleration characteristics [32]. Shifted delta cepstral (SDC) coefficients are commonly used in speech analysis to expand the context window of a classifier. These are formed from a few sequential cepstral delta coefficients per block. For example, [33] concatenated coefficients over 7 blocks, with a shift of 3 between them (called a 7-1-3-7 arrangement), the aim being to capture statistics in a way that mean-pooling in time would not. The same kind of delta computation, shift and concatenation have also been used with other features like filterbanks and PLPs. The same functionality could be learned within a neural network, particularly a recurrent neural network for time-based changes, but at the cost of additional parameters and training time.

2.10. Variance normalised features

Intuitively speaking, when attempting to classify features, the more their statistics differ between two classes, compared to the within-class difference, the more discriminative the feature is likely to be. This is essentially Fisher's criterion [34] restated. Applying this viewpoint to the downsampling or feature pooling operations used in almost all neural network classifiers (noted above), three of the current authors sought a data-driven approach to maximise Fisher's criterion – using between-class and within-class variance difference over a development data set – to identify optimal spectral pooling rules.
Instead of mean pooling fixed blocks of spectral bins (e.g. 8) to reduce the frequency dimension (e.g. from 2048 to 256), the size of the pool is varied across the spectral range based on the variance difference between/within classes. The aim is to normalise the variance contribution of each downsampled feature point; thus the technique is called variance normalised features (VNF).

Both standard pooling and VNF begin with an identical high resolution spectrum, and aim to reduce the dimensionality before stacking into a spectrogram. For standard pooling, the low-resolution N'-point spectrum X'(k) is obtained from the high resolution N-point spectrum X(k), where the downsampling factor is D_s = ⌊N/N'⌋. As noted, this is usually accomplished via mean-pooling:

    X'(k) = (1/D_s) \sum_{n=kD_s}^{(k+1)D_s} X(n)    for k = 0 ... N'    (6)

Alternatively, max-pooling would be X'(k) = max(X(kD_s) ... X((k+1)D_s)) for k = 0 ... N'.

To obtain VNFs, a pre-processing step is required. In that step, the spectrum X is computed over every analysis frame F from all examples in each of C classes in the development data set. The bin-wise spectral mean S_c and variance S̃_c are obtained for each class c in that set,

    S_c(k) = (1/F) \sum_{f=0}^{F∈c} X_f(k)    (7)

    S̃_c(k) = (1/(F−1)) \sum_{f=0}^{F∈c} (X_f(k) − S_c(k))^2    (8)

for all N spectral bins, 0 ≤ k < N, and for every class c ∈ C. Given the variance and mean spectral characteristics of each class, the per-bin variance is accumulated over all C classes. This is referred to as the total variance budget,

    V_c = \sum_{k=0}^{N} (S̃_c(k) − S_c(k))    (9)

In standard downsampling, the variance contribution of each downsampled point differs depending upon the variance difference across the underlying region. VNF attempts to normalise it so that each downsampled point contributes approximately equal variance difference. This is done by changing from fixed-size pooling regions, with equally spaced partitions, to different-sized pooling regions defined by data-driven partition rules. Those partition rules are computed iteratively from the development set data to specify pooling regions with near-equal amounts of variance contribution. The sum of the variance contributions equals the total budget. One possible partition-setting heuristic is outlined in [31]. Once all partitions are defined, the pre-processing stage has completed.

During operation (i.e. model training or inference) using the VNF pooled features, pooling is applied by obtaining the mean spectral magnitude within each of the pooling partitions. The difference between VNF and standard features is that the former uses data-driven pooling regions computed as discussed, whereas the latter uses a fixed pooling size to compute all downsized elements.

Table 2. Performance of variance normalised features (VNF) compared to standard pooling for three tasks, aiming for higher accuracy and lower C_avg.

| Task | Details | Fixed pooling | VNFs |
|---|---|---|---|
| SED | 50 class RWCP, 20 dB SNR | 94.8% accuracy | 96.3% accuracy |
| SED | 50 class RWCP, 0 dB SNR | 75.1% accuracy | 84.0% accuracy |
| LID | NIST LRE07, DNN x-vector, 3 s | 10.17 C_avg | 8.80 C_avg |
| LID | NIST LRE07, CLSTM, 3 s | 7.15 C_avg | 6.70 C_avg |
| DID | Arabic dialect challenge | 3.20 C_avg | 2.62 C_avg |

The performance of VNFs for three different tasks is shown in Table 2. Sound event detection (SED) on real-world computing partnership (RWCP) test data [30], language identification (LID) on NIST Language Recognition Evaluation 2007 challenge data, and dialect identification (DID) for spoken Arabic [31], are performed by models trained from standard fixed pooling inputs, as well as identical models trained with VNF pooled inputs. The aim is a higher accuracy score or a lower C_avg score. In most tested systems, VNF based pooling outperformed the standard method of mean or max pooling, but it cannot compensate for architectural deficiencies, i.e.
it is more important to use a good classifier architecture than to optimise the features; but having found a good classifier architecture, VNF has the potential to improve results compared to fixed pooling.

Essentially, any system where spectral bins are mean or max-pooled before classification could potentially benefit from the data-driven VNF approach, as long as a representative development dataset exists from which the one-time pre-processing step can infer suitable partition rules.

3. Audio analysis

Audio analysis refers to the detection and classification of sounds that lie within the range of human hearing (approx. 20 Hz to 20 kHz) [3]. It is related to the field of machine hearing [23], which involves endowing computers with the ability to detect and interpret sound in ways analogous to humans. Generally, we use the term 'audio analysis' to refer to non-speech sounds, since speech analysis involves additional techniques which will be considered separately in Section 4 – although there is considerable overlap in the types of features used. This section will first present a taxonomy of audio analysis, before briefly describing three application areas: audio event detection, anomalous sound detection and the related area of bioacoustics.

3.1. Taxonomy of audio analysis

Audio-based classification systems tend to follow a sequential taxonomy as shown in Table 3, although much depends upon the task being performed. For example, in clip detection or acoustic scene analysis (ASA), a short recording may be analysed, whereas animal call detection in bioacoustics, where segmenting the input into separate animal calls may be difficult, could involve analysing an entire recording.

3.1.1. Feature extraction

Input audio needs to be processed and transformed into features suitable for classification. These could be raw waveform segments [35], spectra (including spectrograms), or statistical or timbral [36] features.
As we have seen in Section 2, there are many variants of stacked spectra, which could be clipped, segmented, or pooled. Features such as MFCC or perceptual linear prediction (PLP) coefficients can also be stacked to form time-frequency features, shown in Table 3 as 'named features'. There may be a natural affinity of certain classification models to particular feature types, but a common alternative is to train a data-driven feature extractor. This is a front-end feature extraction network, such as a few CNN layers, that produces features suitable for classification by a back-end classifier [37]. The front-end and back-end networks can be trained separately, or end-to-end if appropriate loss functions can be defined. Authors also increasingly make use of feature extractors that have been effectively pre-trained by other authors to extract discriminative features for related tasks. Prominent examples include the AST [26] as mentioned in Section 2.7, PaSST [38] or HTS-AT [39]. These can be fine-tuned for use in different tasks, or coupled with domain adapter layers/blocks [40,41].

Table 3. Taxonomy of audio analysis. The input length (first column) can vary from a short frame to a continuous signal; one or more features are extracted from this, and then an output class, or timing, is obtained.

Input | Feature extraction | Stack & Classify | Output
continuous, | raw waveform/spectrum, | | one-hot class per instance,
full recording, | named features, | | posterior probabilities,
utterance, | trained feature extractor or | | vote over multiple instances,
segment or frame | pre-trained feature extractor | | average/threshold over time or localisation in time

The stacked features are then classified by a back-end classifier, which typically outputs a one-hot class prediction per instance (e.g. in a detector system), or posterior probabilities in a multi-sound classifier. As shown in Table 3, many other possibilities exist for output processing.
For example, where a clip of audio to be classified has been split into multiple classification instances, majority voting or some kind of weighted averaging provides a single per-class score over multiple classifications for the whole clip. In continuous audio, thresholding of posterior probabilities over a sliding window can yield an activation signal (e.g. for a wake word system [42]). Some tasks are not concerned with clip-level classifications, but require precise detection of the timing of events, in terms of start and end timestamps [41].

A very wide variety of tasks use this basic method of audio analysis from time-frequency spectrogram features. These include sound scene detection (SSD) and auditory scene analysis (ASA) [43], clip recall and recognition [20], sound event detection (SED) [44], anomalous sound detection (ASD) [45], and acoustic classification of speech for purposes such as language identification (LID) [32], dialect identification (DID) [46], speaker identification (SID) [47], diarization [48,49], and speaker verification (SV) [50]. There are also medical uses for spectrogram-based auditory analysis, including lung auscultation (stethoscope signal) analysis [51,52], disease diagnosis from speech [53], and breathing and non-speech vocalisation, from humans [54] as well as animals [55]. Music classification or retrieval [56], analysis [57,58] and even beat tracking [59] utilise the same basic steps. While spectrogram-based methods either predominate or show excellent performance in most of these fields, alternative approaches exist. Most prominent are those based on direct time-domain waveform analysis [60], as well as bag-of-features approaches using statistical indicators [20,58,61].

3.1.2. Overlap and occluded sounds

In simple environments, targeted sound events may occur in isolation, with at most one such occurrence at any time point.
This is the basis of clip-level recognition systems, which assume that an audio clip contains at most one kind of sound, such as the song of a single bird species, the sound of a gunshot, or a clip from a music recording. However, in complex real-world environments, sound events often coincide with other sounds. These include other target sounds (i.e. known sound classes to be recognised) or non-target sounds (i.e. out-of-set sound classes and background acoustic noise). The former situation is sometimes referred to as 'polyphonic', meaning 'many sounds'; however, the term polyphonic is already used in the audio literature to refer to the existence of multiple audio channels – something that we are not considering here. Almost all sound analysis tasks assume a single channel of audio that potentially captures many sounds. It is thus best described as having "overlapping or occluded sounds" [62,63] to avoid confusion with the terminology of multi-channel audio 1. Real-world sounds never occur in isolation, and always have at least some acoustic background noise, so in a sense there are always 'many sounds' present in audio collected in the wild. Research has shown that the performance of audio classification systems at even very low levels of noise (i.e. real-world scenarios in a quiet environment) can be very different from performance with sounds recorded under anechoic conditions of almost zero background noise [24,28]. Hence real-world deployments of sound classifiers require careful attention to several different techniques that may not always be found in challenge competitions [65]. The temporal occlusion and overlap could be partial or full. Co-occurring sound events have their frequency-temporal content mixed together: unlike image occlusion, which implies masking of one object by another, coincidental audio events are recorded as the linear complex sum of the two events.
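This linear summation is also how training and evaluation mixtures are commonly synthesised: a background is scaled so that the mixture attains a chosen signal-to-noise ratio. A minimal sketch (function name and parameters are illustrative):

```python
import numpy as np

def mix_at_snr(target, background, snr_db):
    """Linearly mix a background into a target signal, scaling the
    background to achieve the requested SNR in dB."""
    pt = np.mean(target ** 2)            # target power
    pb = np.mean(background ** 2)        # background power
    scale = np.sqrt(pt / (pb * 10 ** (snr_db / 10)))
    return target + scale * background   # linear sum, as in real recordings

rng = np.random.default_rng(1)
sig = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1s tone at 16kHz
noise = rng.normal(size=16000)
mixed = mix_at_snr(sig, noise, snr_db=0)
# at 0 dB SNR, target and scaled-background powers are equal by construction
```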
A visual inspection of sound mixture spectrograms can sometimes reveal indications of coinciding sounds that are unlike each other (e.g. a long slow low-frequency background during which several short high-frequency squeaks or wideband snapping sounds appear). However, sounds with similar characteristics can be difficult to discern visually as separate instances, and hence machine learning classifiers or detectors likewise have difficulty detecting similar coinciding sounds [62,65].

In general, classification/detection of occluded or overlapping sound events is more challenging than that of isolated ones. In the research literature, three alternative approaches can be taken:

• Recognise from mixed sounds by implicitly learning from non-occluded sounds as well as all kinds of overlapping sounds. It is also necessary to reframe the classification problem from multi-class to multi-label or 1-vs-rest [6,66,67].
• Separate first using a source separation framework, then operate classification/detection on the separate channels [68–70].
• Introduce mixed classes, where potential mixtures are essentially tagged as new combined classes [41,71].

1 If multiple audio channels are available for analysis, this can improve noise removal [3] as well as enhance localisation and classification performance [64].

Separation-based methods are a natural choice for multi-channel audio recordings, in which they can leverage spatial localisation to separate sources, but have shown limited success for single-channel audio. Introducing mixed classes can be useful for commonly mixed sounds, but where there are different degrees and sequences of overlap (e.g. sound 1 occurs first and sound 2 occurs during sound 1, or sound 1 occurs after sound 2, plus more variations), this can negate any benefits of introducing new combined classes.
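The reframing from multi-class to multi-label mentioned in the first approach above amounts to replacing a softmax-and-argmax output with independent per-class sigmoid decisions, trained with binary cross-entropy. A minimal NumPy sketch:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def multilabel_predict(logits, thresh=0.5):
    """Independent per-class decisions: several classes may be active
    at once, unlike argmax over a softmax output."""
    return (sigmoid(logits) > thresh).astype(int)

def bce_loss(logits, targets):
    """Binary cross-entropy averaged over classes (multi-label training)."""
    p = sigmoid(logits)
    return -np.mean(targets * np.log(p) + (1 - targets) * np.log(1 - p))

logits = np.array([2.0, -1.5, 1.2])   # e.g. two co-occurring target sounds
print(multilabel_predict(logits))     # [1 0 1]
targets = np.array([1.0, 0.0, 1.0])
print(bce_loss(logits, targets))
```

The same spectrogram input can thus yield several simultaneous positive detections, which a multi-class softmax cannot express.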
Hence most current approaches are trained in the presence of random mixtures in order to improve robustness and generality. Interestingly, this may be similar to the robustness benefits gained from the widely-used mixup technique for training classifiers [72].

3.1.3. Early sound event classification

Sound events have a time duration that can range from around a hundred milliseconds (e.g. transient events like door knocking or hand clapping) to dozens of seconds (like a car passing by or a baby crying), or may even be continual (e.g. power line hum). Sound classification systems, initially driven by datasets of individually labelled sound clips, were first trained to classify at clip level [20]. Yet many deployment scenarios involve monitoring of live feeds. Hence the sound event detection task (the identification of what sound is present as well as when the sound is present – see Section 3.2 below) acknowledges that reality. The nature of the task also means that the output from any SED system must come after the sound has ended. However, it is easy to see that sometimes classification needs to occur before a sound event has ended. This is obviously the case for continuous sounds, but also for something like an alarm, or for turning off music when someone starts to speak. Some sounds can be classified at frame level, but where the characteristic frequency patterns used to identify a sound have a time span extending beyond a frame (e.g. beyond 10 to 30ms), a different paradigm is necessary. This was the motivation behind investigations into the timeliness perspective of sound event detection and classification systems; specifically, the question of 'how early can a system reliably detect ongoing sound events' from a partial observation of the initial section of an event. So-called "early detection" systems [73–76] require a classifier to fulfil a monotonicity property on continual input.
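One simple way to obtain such a monotonicity property is to accumulate a running maximum over per-segment classifier scores, so that detection confidence can only grow as more of an event is observed. This is an illustrative sketch only; the early-detection systems cited above use more principled formulations:

```python
import numpy as np

def early_detection_scores(segment_scores):
    """Monotone detection confidence: the running maximum over
    per-segment scores never decreases as more of the ongoing
    event is observed."""
    return np.maximum.accumulate(segment_scores)

s = np.array([0.1, 0.4, 0.3, 0.7, 0.6])   # per-segment classifier scores
print(early_detection_scores(s))          # [0.1 0.4 0.4 0.7 0.7]
```

A fixed decision threshold applied to this monotone score then fires at the earliest segment with sufficient evidence, and never retracts a detection.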
Early sound event detection systems that scan a sliding window of spectrogram features to dynamically classify segments of input [77] are particularly useful in surveillance and safety-related applications which require a low-latency response from a continuous feed.

3.2. Sound event detection

Sound event detection (SED) aims to identify and temporally localise sound events in audio recordings. It outputs either onset-offset pairs or frame-level activity probabilities for each class [78], and is essential for applications such as environmental monitoring, surveillance and multimedia analysis [41]. Early SED systems used handcrafted features (e.g. MFCC), while recent approaches rely on deep learning with spectrogram inputs. Spectrograms, especially log-Mel representations, are widely adopted due to their alignment with human auditory perception and their ability to capture the time-frequency evolution of sound events [28,79]. CNNs and CRNNs [37,80] model local correlations effectively, whereas transformer-based models such as PaSST [38] and HTS-AT [39] better capture long-range dependencies.

Different spectrogram variants have been explored for task-specific benefits. Mel-spectrograms are compact and perceptually motivated; CQT spectrograms suit tonal event detection [81]; gammatonegrams offer robustness at low SNR [82,83]; and PCEN [84] improves invariance to background noise through dynamic compression.

Overlapping sound events remain a major challenge. As we have noted in Section 2.2, spectrograms are not translation-invariant, and co-occurring sounds may occupy similar frequency regions. Thus, multi-label classification strategies, attention mechanisms, and source separation methods are often employed [85].
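The PCEN transform mentioned above normalises each frequency channel by a slowly varying estimate of its own energy, followed by root compression. A simplified single-recording sketch, with illustrative parameter defaults:

```python
import numpy as np

def pcen(E, s=0.025, alpha=0.98, delta=2.0, r=0.5, eps=1e-6):
    """Per-channel energy normalisation (simplified): a first-order IIR
    smoother M tracks slow per-channel energy, which gain-normalises E
    before root compression. E: mel energies [frames, bands].
    Parameter values are illustrative, not tuned."""
    M = np.zeros_like(E)
    M[0] = E[0]
    for t in range(1, len(E)):
        M[t] = (1 - s) * M[t - 1] + s * E[t]
    return (E / (eps + M) ** alpha + delta) ** r - delta ** r

E = np.abs(np.random.default_rng(0).normal(size=(200, 64))) ** 2
P = pcen(E)
print(P.shape)
```

Because the smoother M adapts to slowly varying background energy, stationary noise is suppressed while onsets (where E momentarily exceeds M) are emphasised, which is the source of PCEN's noise invariance.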
In summary, spectrograms are central to recent SED research due to their compatibility with modern deep models and their descriptive time-frequency structure, though challenges like overlap resolution and domain generalisation persist.

Table 4. Prominent sound event detection methods that utilise spectrogram features, including linear (LS), log-mel (LMS), mel scale (MS), constant-Q (CQT) and gammatonegram (GTG). Tasks include the Real World Computing Partnership (RWCP) sounds, the TUT sound events database, and the Domestic Environment Sound Event Detection Dataset (DESED).

Year | Ref. | Task | Spectrogram type | Resolution & settings | Pooling
2014 | [28] | 50 class RWCP | LS | [30×24], 16kHz | vote
2017 | [86] | 6 class TUT events [87] 1 | LMS+LS | [240×256], 44.1kHz | max
2019 | [88] | 10 class from [89]+[90] | LMS | [64×500], 16kHz | median
2020 | [91] | 50 class RWCP | LS, GTG, CQT | [52×40], 16kHz | mean
2022 | [92] | 10 class DESED [93] 2 | LMS | [128×960], 16kHz | mean
2023 | [94] | 10 class DESED [93] 2 | MS | [128×1001], 16kHz | mean
2024 | [95] | 11 class DCASE24 task 4 [96] | MS | [128×100] 3 in AST+fPaSST, 16kHz | ensemble-mean

1 DCASE 2017 task 3. 2 DCASE 2022/3/4 task 4. 3 In 16×16 patches.

Table 4 samples approximately a decade of advances in the SED field since the first published use of spectrograms with deep learning [28]. Numerous spectrogram variants have been applied to this field, with many recent systems favouring Mel spectrograms or log-Mel spectrograms. The common evaluation tasks have largely been driven by DCASE (Detection and Classification of Acoustic Scenes and Events) challenges and workshops 2, which has also resulted in a tendency towards datasets with a relatively small number of classes. Finally, we note that embeddings from pre-trained transformers operating on spectrogram inputs, as mentioned above, can be utilised to improve performance, while effectively reducing the overall training resource required through model re-use and adaptation.

2 DCASE (Detection and Classification of Acoustic Scenes and Events): https://dcase.community/

3.3. Anomalous Sound Detection

Anomalous Sound Detection (ASD) is the task of determining whether a given audio signal contains abnormal or anomalous sounds that deviate from patterns typically observed under normal conditions [45,55,97]. This task is of significant importance in scenarios such as industrial monitoring and security surveillance. For instance, in factory environments, the early detection of machine failures, system anomalies, or unexpected environmental events can effectively prevent accidents, reduce downtime, and enhance overall safety [98].

Unlike conventional audio classification tasks, ASD is typically conducted in an unsupervised manner, as anomalous events are rare and diverse, making them difficult to define and annotate in advance during training. As a result, most methods rely on modelling the distribution of normal sounds, and identify anomalies by evaluating reconstruction errors, likelihood scores, or deviations in the feature embedding space [99]. For example, in DCASE 2025 Challenge Task 2 (similar to the 2020-2024 task 2), only normal audio data is provided for training. Models are required to learn the feature distribution of normal audio samples, and perform classification during testing by comparing the characteristics of normal and abnormal audio samples.

Current ASD systems can be broadly categorised into two main approaches: generative and discriminative. Generative methods, grounded in the paradigm of self-supervised learning, tend to employ autoencoder-based models, like AE [100], VAE [101], PAE [102], to learn the feature distribution of normal audio. Anomalies are identified by computing a reconstruction error between the generated and original samples.
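The reconstruction-error principle can be sketched compactly if a rank-k PCA model stands in for the trained autoencoder (an illustrative simplification; real ASD systems learn the reconstruction on spectrogram features):

```python
import numpy as np

def fit_pca(X, k):
    """Fit a rank-k linear model of normal data (autoencoder stand-in)."""
    mu = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    return mu, Vt[:k]

def anomaly_score(x, mu, V):
    """Reconstruction error of x under the normal-data subspace:
    large error suggests x was not drawn from the normal distribution."""
    recon = mu + (x - mu) @ V.T @ V
    return np.sum((x - recon) ** 2)

rng = np.random.default_rng(0)
# synthetic 'normal' features concentrated along one direction
normal = rng.normal(size=(500, 2)) @ np.array([[1.0, 0.9], [0.0, 0.1]])
mu, V = fit_pca(normal, k=1)
s_norm = anomaly_score(normal[0], mu, V)            # low: on the manifold
s_anom = anomaly_score(np.array([-5.0, 5.0]), mu, V)  # high: off the manifold
print(s_norm < s_anom)
```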
The underlying assumption is that normal samples result in low reconstruction errors, whereas anomalies yield significantly higher errors. However, the strong generalisation ability of generative models, even in mismatched domains, means that they can be capable of reconstructing some anomalous samples, leading to false negatives [99].

Recent discriminative learning methods use an Outlier Exposure (OE) strategy [103]. In this case, additional meta-information obtained during the data collection process (e.g. machine ID and attributes such as operating condition) is utilised to train a classifier on the acoustic features of normal samples. Normal samples from different categories are treated as pseudo-anomalies relative to the target category. A compact normal feature space is then constructed using both this meta-information and a feature extractor, such as ResNet [104], MobileFaceNet [105], or Transformer-based models [106,107]. During the inference phase, the feature distance between a test sample and the normal training samples is regarded as a proxy indicator of the degree of anomaly.

Table 5. Spectrograms used in ASD, primarily log-Mel (LMS), with their resolution, sample rate and scaling.

Method | Spectrogram | Pixels | Sample rate | Scale
Chakrabarty et al. [108] | LMS | 128×T | 8kHz | log
Zeng et al. [99] | LMS | 128×T | 16kHz | inverted log
Li et al. [109] | LNS | 128×T | 16kHz | log
Liu et al. [105] | LMS+Tgram | 128×T | 16kHz | log
Yin et al. [110] | LMS | 128×T | 16kHz | normalised

Table 5 summarises representative spectrogram-based approaches used in ASD, where LNS denotes log non-uniform spectrum. Chakrabarty et al. [108] were the first to apply spectrograms for anomalous detection, utilising log-mel spectrograms (LMS) with 10-frame concatenation as input to Restricted Boltzmann Machines. Zeng et al. [99] employed log-mel spectrograms with transposed filters, where filters are sparse in the low-frequency region and dense in the high-frequency region.
For machine sounds, high-frequency components often contain richer and more discriminative information, while the low-frequency part tends to be more noise-prone. Li et al. [109] further advanced this direction by computing F-ratios to analyse the distribution of information across the spectrum, and designed machine-specific non-uniform filterbanks. Recently, more studies in ASD have begun combining both spectrogram and time-domain information as input features. For example, Liu et al. [105] proposed the STgram-MFN method, which concatenates temporal features with a log-Mel spectrogram for classification. The time-domain features (called a 'Tgram') are derived from a trained CNN network, with an ArcFace-derived loss. Interestingly, this revealed that the Tgram feature was able to provide useful, and complementary, information alongside the log-Mel spectrogram. Taking a different approach, Yin et al. [110] applied a diffusion model to synthesise log-mel spectrograms for data augmentation, achieving a state-of-the-art (SOTA) result with an official score of 67.12% in DCASE 2024 Challenge Task 2 using a discriminative approach.

Despite the promising performance of spectrogram-based methods in ASD, several challenges remain. First, spectrograms are sensitive to noise and machine type, requiring tuning. This includes different parameter settings for different machine categories, which limits model generalisation. Second, under domain shift scenarios, reconstruction-based spectrogram methods may fail to detect anomalies due to misleadingly low reconstruction errors. Finally, in the absence of machine metadata, discriminative models struggle to construct a compact normal sound space, resulting in significant performance degradation. There is thus significant potential for ongoing research in this area.

3.4. Bioacoustics

Bioacoustics is primarily applied in three core tasks: species classification, call segmentation, and sound event detection. The purpose of these tasks is to automate the analysis of animal vocalisations recorded in natural environments. Species classification involves identifying a species from audio recordings [111–113]. Call segmentation aims to isolate individual vocalisations (e.g. bird syllables, frog calls, whale units) within recorded audio streams [111,114]. We also note increasing research relating to animal vocalisations for purposes such as health monitoring [55], emotion recognition [115] and potentially communications. This includes human-to-animal voice conversion techniques such as "Speak like a dog" [116], or wider species-to-species conversion using feature fusion that includes spectrograms [117]. Animal call segmentation, by identifying when and which biological sounds occur, is a subset of sound event detection (SED). This is characterised by long, noisy recordings, often with multiple overlapping species [111].

Each of these tasks presents distinct challenges. For classification, models must differentiate highly similar calls across species, sometimes with very few labelled examples [112]. Segmentation is complicated by overlapping sounds, variable call durations, and background noise [111,114] (sometimes with the background noise inextricably correlated to the species). Detectors and classifiers must deal with complex real-world soundscapes, requiring highly robust models to generalise across time and different environments. Despite these difficulties, recent spectrogram-based approaches have shown strong performance across all three main task categories [111,113].

Table 6.
Deep learning front-ends in bioacoustic analysis that make use of raw waveforms (top three), linear and log-Mel spectrograms (LS, LMS) and stabilized auditory image (SAI) inputs (middle four), and hybrid approaches (bottom two).

Technique | Input type | Task(s) | Taxa
SincNet [118] | Raw waveform | Species classification | Birds
SampleCNN [119] | Raw waveform | Music auto-tagging | Music
RawNet [120] | Raw waveform | Speaker verification | Humans
CNN/ResNet [111,121] | LMS | Species classification, SED | Birds, Frogs, Whales
PCEN-enhanced CNN [84,122,123] | PCEN-Mel | Low-SNR event detection | Birds, Whales
CNN on STFT [124] | LS | Call segmentation | Bats
CNN with spectrogram and stabilized auditory image input [28] | LS + SAI | Sound event classification | General sounds
LEAF [122] | Learned spectrogram from raw waveform | Species classification, detection in noise | Birds, Whales
Wavegram-Logmel-CNN [125] | Wavegram + LMS | General classification | Various

Table 6 summarises recent trends in bioacoustic deep learning, highlighting a strong preference for spectrograms that are either linear (LS) or log-Mel (LMS). PCEN (per-channel energy normalisation) enhanced Mel spectrograms address background noise by essentially performing scaling and auto-gain control on each frequency bin [126]. However, the baseline linear spectrogram remains relevant in segmentation tasks – particularly in high-frequency domains such as bat echolocation – due to its simplicity and fine temporal resolution [124]. It is also true that the Mel scale, based on human hearing, would be inappropriate for spectrograms encompassing the ultrasonic region. At audible frequencies, log-Mel spectrograms remain widely adopted for species classification, given their well-demonstrated predictive performance when paired with CNNs. For example, on a bird classification task, a ResNet50 trained on Mel spectrograms achieved an accuracy of 0.77 and ROC AUC of 0.80, outperforming a raw waveform CNN baseline, which yielded 0.71 and 0.76, respectively [121].
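As a concrete reference point for the log-Mel front end used by most of these systems, a self-contained NumPy sketch is given below. The parameter values (16kHz sampling, 25ms window, 10ms hop, 64 Mel bands) and the HTK-style Mel formula are illustrative choices; production code would normally use a library such as librosa or torchaudio:

```python
import numpy as np

def hz_to_mel(f):
    return 2595 * np.log10(1 + f / 700)   # HTK-style Mel scale

def mel_to_hz(m):
    return 700 * (10 ** (m / 2595) - 1)

def log_mel_spectrogram(x, sr=16000, n_fft=400, hop=160, n_mels=64):
    # frame, window and take the magnitude STFT
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hanning(n_fft)
    mag = np.abs(np.fft.rfft(frames, axis=1))        # [frames, n_fft//2+1]
    # triangular Mel filterbank between DC and Nyquist
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(mag @ fb.T + 1e-6)                 # [frames, n_mels]

sr = 16000
t = np.arange(sr) / sr
lms = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))  # 1s 440Hz tone
print(lms.shape)   # (98, 64)
```

The resulting [frames × bands] matrix is exactly the kind of time-frequency image that the CNN and ResNet systems in Table 6 consume.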
PCEN-enhanced spectrograms have gained traction for their robustness in noisy settings, enabling better detection of faint or distant calls across taxa [111,122]. Although hybrid models such as LEAF attempt to combine the strengths of raw waveforms and spectral representations, real-world applications still show a clear advantage for spectrogram-based features, which have been shown to be robust, interpretable, and scalable across taxa and datasets.

Spectrogram-based methods are not without limitations, however. In species classification, they often struggle with fine-grained distinctions between species that produce acoustic calls in overlapping frequency bands [111,124]. Variability in vocal structure, such as regional 'dialects' or age-related changes, can also reduce accuracy. One structural limitation may be the use of fixed-size spectrogram windows, which can truncate short calls or blend closely spaced vocalisations in recordings with many overlapping signals [111,114]. This is also problematic for insect chirps and brief bat calls. Although preprocessing using PCEN improves robustness to loudness variation and background noise [84,122], it does not fully resolve these segmentation and overlap challenges. Across all tasks, significant training challenges exist due to class imbalance, sparse labels for rare species, and taxonomic bias in training data. These issues limit generalisation and deployment, especially when extending models to new ecosystems or to poorly studied taxa [111] for which, ironically, they may be most needed.

4. Speech analysis

Speech analysis differs from pure auditory analysis due to the linguistic and semantic nature of the underlying speech signal. It not only conveys different information, but our understanding of it (i.e. labels) has greater complexity, and allows more resources to be applied for speech analysis compared to general audio analysis.
Speech analysed at frame-level primarily captures acoustic properties of the human vocal system during production of the current senone or phonetic unit [32], or non-verbal vocalisation. It provides a snapshot view of both how the speech is being produced, which reveals information about the speaker, as well as the nature of what is being produced, which reveals information about the current pronunciation unit or sound.

Speech analysed at utterance level captures linguistic content, which reveals information about the semantic meaning conveyed by the speech. The dynamics of frame-level changes also reveal information about the speaker, including their identity, mood, gender and age, as well as potentially reflecting several physical and mental conditions. The analysis often makes use of the statistical variation in time as the frame-level features evolve [32]. Stacking frequency-domain frame-level features creates a time-frequency image, which is a type of spectrogram. Considering the audio analysis framework taxonomy of Table 3, front-end features tend to be at frame level, whereas the back-end classification tends to be at utterance level. Many variants of this simple understanding exist, such as front-end features spanning several frames, or word-level, chunk-level [127] and entire-recording analysis. The following subsections survey three specific speech tasks that encompass that range, albeit with different objectives: language and dialect identification, speaker verification, and speech emotion recognition.

4.1. Language and dialect identification

The objective in language and dialect identification (LID/DID) is to extract information from recorded speech utterances – often with different linguistic content, spoken by different speakers, and captured in varying acoustic environments – and to develop methods that can reliably determine which language or dialect is spoken, typically from among a closed set of known alternatives.
While many approaches have been proposed over the years, state-of-the-art systems rely on the evidence that acoustic features carry robust language-specific cues that are suitable for front-end feature extraction [128]. Extracted embeddings are then typically stacked and classified as noted above, using deep neural network-based back-end architectures.

The well-established MFCC [33] features represent the short-time power cepstrum of speech mapped onto a Mel scale – essentially the discrete cosine transform (DCT) of log-Mel filterbank features. They involve weighted pooling across overlapping spectral regions (i.e. Mel coefficients). Because MFCC are extracted from short (e.g. 25ms) frames, they are limited in their ability to capture longer-term temporal dependencies in speech, hence they are typically stacked with their delta and delta-delta (framewise difference, and difference between framewise differences) across an utterance. This captures first- and second-order temporal derivatives to improve the modelling of speech dynamics [37,129–131]. Shifted delta coefficients [132], as discussed in Section 2.9, similarly help to capture patterns that extend beyond individual frames, which is important for modelling the sequential nature of speech.

More recently, research has moved beyond handcrafted features and instead demonstrated the effectiveness of directly using raw log-Mel spectrograms as input [133,134]. Because spectrograms better preserve the time-frequency structures of speech, convolutional or recurrent neural networks can learn discriminative representations from the data, without relying heavily on engineered features.

Table 7. Prominent LID research showing various kinds of spectrogram, including linear (LS) and log-Mel spectrograms (LMS). EER is equal error rate.

Method | Spectrogram | Resolution | Task
Ma et al. [135] | PLP + bottleneck | 48×21 | 23 languages, EER 4.38% 1
Kaiyr et al. [136] | LS, CNN-RNN | 116×200, 5-10s segments | 7 languages, acc. 94.3% 2
Liu et al. [130] | MFCC+delta+delta-delta | 39+39+39, 25ms window, 10ms hop | 14 languages, EER 3.82% 3
Miao et al. [129] | MFCC+D-MONA | 23×5 frames | 14 languages, EER 1.15% 3
Tjanda et al. [137] | LMS | 80×4, 25ms window, 10ms hop | 26 languages, acc. 90.3% 4

1 Evaluated on 10s utterances using NIST LRE2009. 2 Evaluated on 5-10s clips. 3 Evaluated on 10s utterances using NIST LRE2017. 4 Evaluated on 6s utterances.

Table 7 presents some representative works for LID. Evaluation tasks vary widely in terms of the number of languages, the utterance length and the variety of speakers. Features also vary between approaches, and clearly there are trade-offs between feature size and resolution. Performance is generally measured by equal error rate (EER) or accuracy (and by Cavg in newer works). In general, the more languages and the shorter the evaluation clip, the more difficult the task becomes. Confusion matrices [135] reveal that the inclusion of similar languages can significantly reduce average performance scores. All of the systems in Table 7 contain deep recurrent networks, hence even input features with a short context length are able to benefit from time-domain context within the network to perform well.

4.2. Speaker verification

Speaker verification (SV) is the task of determining whether an input speech signal matches a claimed speaker identity from a set of enrolled speakers. SV serves as a fundamental component in biometric authentication, forensic analysis, and secure access control systems – for example, enabling user verification for banking transactions and for unlocking mobile devices [138,139]. The related 'speaker validation' task computes the probability that a given speaker is who they claim to be, while 'forensic speaker identification' aims to discern as much information as possible, including identity, of an unknown speaker.
Unlike most conventional automatic speech classification tasks, SV operates in an open-set setting because speakers being analysed may not have been seen during training. However, training can be conducted in a supervised manner, since labeled speaker data are readily available for enrolment. During training, models learn discriminative representations of target and non-target speakers, while during inference they compare embeddings of test utterances against enrolled models to make acceptance or rejection decisions [29,140]. In benchmarks such as the NIST Speaker Recognition Evaluations (SRE) and the VoxCeleb challenges, thousands of labeled utterances recorded over telephone (8kHz) and “in the wild” (16kHz) conditions are provided for system development. Evaluation is performed by scoring the similarity between enrolment and test embeddings, often using cosine distance or probabilistic linear discriminant analysis (PLDA) [141]. Performance is generally measured by Equal Error Rate (EER) and Detection Cost Function (DCF) [142].

Current SV systems can be broadly categorised into generative embedding methods and discriminative embedding methods. Generative embedding methods, such as GMM-UBM and total variability (i-vector) frameworks, model speaker and channel variability via statistical supervectors and employ PLDA for scoring [138,139]. Discriminative embedding methods leverage deep neural networks to directly learn fixed-dimensional speaker embeddings: the d-vector approach averages frame-level DNN activations [140], the x-vector architecture uses TDNNs with statistics pooling [29], and enhanced variants like ECAPA-TDNN incorporate channel attention and hierarchical feature aggregation to further improve robustness [143].

Table 8. Prominent SV research showing a progression of spectrogram use, where LMS refers to log-Mel spectrogram and FB are filterbanks.

| Method | Spectrogram | Resolution |
|---|---|---|
| Reynolds et al. [138] | MFCC+context | 13 × T |
| Dehak et al. [139] | i-vector from 60d MFCC | 200 WCCN |
| Variani et al. [140] | trained from 40d FB+context | 256 d-vector |
| Snyder et al. [29] | 60d MFCC+delta+delta-delta | 150 x-vector |
| Desplanques et al. [143] | LMS | 80 × 80 |
| Liu et al. [141] | LMS | 128 × 304 |

Table 8 summarises several representative works for SV, including recent spectrogram-based feature representations. Despite the promising performance of these methods, several challenges remain. First, channel and domain mismatch between training and test recordings leads to performance degradation under cross-corpus and cross-device conditions [139]. Second, short-duration utterances often yield unreliable embeddings, increasing error rates. Finally, spectrogram parameter tuning for different languages, noise environments, and recording devices remains a manual and time-consuming process. The need to re-tune systems for a new task, coupled with limitations in both robustness and generalisation, hinders the large-scale deployment of this technology [144]. It also presents opportunities for future research.

4.3. Speech emotion recognition

Speech Emotion Recognition (SER) seeks to infer affective states from analysis of speech. It generally models prosodic, spectral, and temporal variations correlated with arousal and valence, as well as with discrete emotion categories [145]. In contrast to speaker verification, a task that benefits from representations that are stable across a speaker's vocal space, SER requires speaker-invariant embeddings that encode emotion-related acoustic fluctuations. Much past research has indicated that emotional expressions often correlate with changes in fundamental frequency, harmonic-to-noise ratio, spectral tilt, and the bandwidth of formants. Spectrograms have gained widespread use in SER research in recent years, given the time-frequency viewpoint they provide into evolving speech signals.
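Before moving further into SER, the SV scoring step described above (cosine similarity between enrolment and test embeddings, evaluated by EER) can be sketched in a few lines of numpy. This is a minimal illustration only: the embeddings and trial scores are invented, and real evaluations use calibrated score sets and the official DCF parameters.

```python
import numpy as np

def cosine_score(enrol: np.ndarray, test: np.ndarray) -> float:
    """Cosine similarity between an enrolment and a test embedding."""
    return float(np.dot(enrol, test) /
                 (np.linalg.norm(enrol) * np.linalg.norm(test)))

def equal_error_rate(target_scores: np.ndarray, nontarget_scores: np.ndarray):
    """Sweep the decision threshold over all observed scores until the
    false-rejection rate first meets or exceeds the false-acceptance rate,
    then report their mean as the EER, along with that threshold."""
    for thr in np.sort(np.concatenate([target_scores, nontarget_scores])):
        far = np.mean(nontarget_scores >= thr)  # impostor trials accepted
        frr = np.mean(target_scores < thr)      # genuine trials rejected
        if frr >= far:
            return (far + frr) / 2.0, thr
    return 1.0, None

# Illustrative trial scores (e.g. cosine similarities of embedding pairs)
targets = np.array([0.9, 0.8, 0.75, 0.6])     # same-speaker trials
nontargets = np.array([0.5, 0.4, 0.65, 0.2])  # different-speaker trials
eer, threshold = equal_error_rate(targets, nontargets)  # EER = 0.25 here
```

In practice PLDA scoring replaces the plain cosine distance when channel variability must be modelled, but the threshold sweep and EER summary are the same.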
Unlike raw waveform inputs, spectrograms reveal clear energy distributions across frequency bands over time, and these can be used to describe affective correlates. Early SER systems generally employed hand-crafted descriptors such as MFCC, pitch- and energy-based prosodic features, as well as jitter/shimmer measures. These would often be combined into utterance-level statistics, e.g. using the openSMILE toolkit [146,147]. While these features have been shown to be effective for acted emotional speech (i.e. databases of actors representing emotions on demand), they inherently compress some spectral detail. For example, MFCC decorrelate and smooth spectral envelopes using a DCT. High-frequency cues, which have been associated with emotional arousal, can be lost or misaligned in the process. Utterance-level pooling further removes useful information regarding temporal evolution, which limits the ability to capture brief spectral features, and may hide discriminative distributions. This probably contributes to a widely observed performance degradation in which models trained using one corpus perform much less well when they are evaluated using another, i.e. cross-corpus evaluation, or generalisation testing [148].

The transition to explicit use of spectrograms enabled richer modelling of time-frequency features. Linear, log-Mel and Mel-spectrograms preserve local spectral–temporal patterns, allowing convolutional neural networks (CNNs) to learn filters sensitive to emotion-associated frequency characteristics as noted above [149]. Time domain cues that unfold over several successive frames, for example the evolution of pitch and intensity profiles and spectral modulation over time, can be effectively captured by recurrent architectures such as BLSTMs [150].
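As a concrete illustration, a log-Mel spectrogram of the kind used by these systems can be computed from first principles with numpy. The parameter values below (25 ms window, 10 ms hop, 40 Mel bands, 512-point FFT at 16 kHz) are illustrative defaults, not recommendations drawn from the surveyed papers.

```python
import numpy as np

def hz_to_mel(f):  # HTK-style Mel scale
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(x, sr=16000, win_ms=25, hop_ms=10, n_mels=40, n_fft=512):
    win = int(sr * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sr * hop_ms / 1000)   # 160 samples
    n_frames = 1 + (len(x) - win) // hop
    window = np.hanning(win)
    # Short-time Fourier transform: one windowed power spectrum per frame
    frames = np.stack([x[i * hop: i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2

    # Triangular Mel filterbank with centres equally spaced on the Mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fb[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fb[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge

    return np.log(power @ fb.T + 1e-10)  # shape: (n_frames, n_mels)
```

Libraries such as librosa (mentioned later in this paper) implement the same pipeline with many more options; this sketch simply exposes the three settings the text discusses: window length, hop size and number of Mel bands.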
Attention mechanisms further refine the time-domain sensitivity by weighting frames that contribute more strongly to emotional perception, while down-weighting linguistically dominant or neutral segments [151]. Importantly, spectrogram configuration choices such as window length, hop size, and number of Mel bands can affect the emotional cue representation granularity in both time and frequency dimensions. Smaller hop sizes, allied with a recurrent network, increase sensitivity to rapid prosodic change, whereas higher Mel-band resolution allows finer modelling of high-frequency structures. As with other spectrogram-based audio analysis tasks, there are trade-offs to be made between granularity and context, especially in the time domain.

As with SV and, to some extent, LID, recent SER systems integrate high-resolution spectrograms with deep sequence models such as CNN–Transformer hybrids. Transformers capture long-range dependencies and global contextual structures, which can complement the ability of CNNs to extract localised patterns. Meanwhile self-supervised learning (SSL) models such as wav2vec 2.0, HuBERT, and WavLM provide contextualised frame-level embeddings that have been learned from large unlabeled speech corpora [152,153]. Although WavLM operates on raw waveforms, there is some evidence that intermediate layers can encode spectrally-relevant information, such as fundamental frequency trajectories and amplitude modulation patterns, attributes that overlap with those observable in spectrograms [154]. As a result, SSL embeddings can serve as either an alternative to, or a complement for, spectrogram features in modern SER pipelines.

Table 9. Representative SER systems illustrating the evolution from hand-crafted features to high-resolution spectrograms and SSL-based embeddings. FB refers to filterbank, LMS is log-Mel spectrogram.

| Method | Feature Type | Representation |
|---|---|---|
| Schuller et al. [147] | MFCC + prosody + energy | 1582-d openSMILE |
| Satt et al. [149] | LMS | ∼40–64 Mel bands, 25ms window, 10ms hop |
| Mirsamadi et al. [151] | FB with frame attention | 40dim FB, 25ms window, 10ms hop |
| Trigeorgis et al. [150] | LMS + channel attention | 40dim FB with 40ms frame, 5ms hop |
| Pepino et al. [152] | wav2vec 2.0 SSL embedding | 768dim contextual frames |
| Chen et al. [153] | WavLM SSL embedding | 1024dim contextual frames |
| Chowdhury et al. [155] | LMS + 5 other features | 64dim LMS 20-30ms, and 126dim other features |

Table 9 summarises representative SER approaches and highlights the shift from coarse statistical descriptors to high-resolution spectrograms, and to spectrogram-informed latent embeddings from pre-trained models.

Despite many recent advances, several challenges persist in SER research using spectrograms. First, emotional correlates vary across speakers, languages, speaking styles and recording conditions. Spectral tilt, harmonic structure and prosodic patterns are inconsistent across corpora, which remain highly influenced by their recording conditions, task (e.g. spontaneous, scripted etc.) and labelling methodology. This contributes to domain mismatch [156]. Second, many emotion-relevant cues occur at short temporal scales that are sensitive to spectrogram settings, where inappropriate frame sizes, windowing or Mel resolution may obscure rapid spectral transitions. Other emotion cues may evolve over a long timescale that is sensitive to the recurrence length or context size of features. Third, while SSL features appear to offer robustness, their lack of explicit frequency structure makes it difficult to model multi-resolution emotional cues, or to interpret how spectral information influences predictions. Addressing these issues may require frameworks that can integrate interpretable time-frequency structures obtained from spectrograms with the robustness and abstraction provided by waveform-based analysis.
5. Conclusion

This paper has surveyed the nature and application of spectrogram time-frequency features when used for audio and speech analysis. Beginning with the definition of a spectrogram, we considered element scalings such as Mel, log-Mel, A-law and μ-law, as well as alternative transforms including Gammatonegrams, stabilised auditory images (SAI) and constant-Q transforms. We also considered spectrograms formed by stacking other kinds of frequency vectors in time, such as MFCC, PLP, filterbanks and embeddings from pre-trained models. Settings including frequency resolution, time span, range, frame size and hop were considered alongside the related task of pooling or downsampling resolution, and the need to vote, or otherwise process individual frames, to obtain a per-utterance/chunk/recording classification, as well as timestamps for the start and end of events where detection is required.

In each of the analysis domains sampled within this paper (namely SED, ASD, bioacoustics, LID/DID, SV and SER), the past decade has seen a shift away from statistical features (e.g. variance, shimmer, skewness), through handcrafted features such as MFCC and filterbanks, to spectrograms. Early spectrogram-based deep learning classifiers [24,28] used small rectangular patches. This was because limited training datasets restricted the model complexity that could be effectively trained, which in turn limited the size of input features. As more training data became available, larger models were possible and spectrogram patch sizes tended to increase. Recurrent networks enabled deep neural networks to exploit time-domain statistics, allowing the size of spectrogram patches to reduce, or to reduce in time span but increase in frequency resolution (often through larger analysis windows with smaller hop sizes).

While spectrogram features have been shown to be effective in many audio analysis domains, pooling to downsample tends to obscure fine detail.
Variance normalised features (VNF) were proposed to define more nuanced pooling rules that maximise between-class variance compared to within-class variance. However, the complexity of optimising front-end features, and the need for a large representative dataset, has led researchers increasingly to adopt pre-trained foundation models. These have often been well trained for a different but related task, such as automatic speech recognition, where large high quality datasets are readily available. For speech systems, models are often pre-trained as ASR phone detectors. For audio, models such as AST [26], PaSST [38] and HTS-AT [39] are trained as classifiers using large-scale audio datasets. Adaptation methods or fine tuning are then used to harness these models for target tasks such as SER [39], SV [157] and more using pre-trained speech models, or SED [41], bioacoustics [158] and more for audio models. In each case, the performance of the adaptation techniques is crucial to the resulting system performance.

Apart from reasons of training efficiency, a strong motivation for the use of pre-trained models is the example of the human auditory system: a fixed external capture system (pinna, outer and middle ear) and a fixed feature transformer (inner ear, auditory nerve) [3], both of which handle all auditory tasks. However, there are specialised back-end processes operating within physically separate areas of the brain for different target tasks.

It is likely that the ability of auditory analysis foundation models will continue to improve in the coming years, while adaptation and fine-tuning methods will likewise improve. Since the time duration of sounds and events has been cited as a problematic issue for balancing feature size and resolution, it seems likely that advances will be made in the area of more effective multi-scale analysis methods.
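The patch-based classification pipeline discussed above, slicing a spectrogram into fixed-size patches and then voting across per-patch predictions to obtain a clip-level label, can be sketched as follows. This is a minimal sketch: the patch dimensions are illustrative, and a real system would feed each patch to a trained CNN rather than the placeholder posteriors used here.

```python
import numpy as np

def extract_patches(spec: np.ndarray, patch_t: int = 40, hop_t: int = 20) -> np.ndarray:
    """Slice a (n_freq, n_time) spectrogram into overlapping
    full-height patches of width patch_t frames."""
    n_freq, n_time = spec.shape
    starts = range(0, n_time - patch_t + 1, hop_t)
    return np.stack([spec[:, s:s + patch_t] for s in starts])
    # result shape: (n_patches, n_freq, patch_t)

def vote(per_patch_probs: np.ndarray) -> int:
    """Combine per-patch class posteriors into one clip-level label
    by averaging probabilities across patches (soft voting)."""
    return int(np.argmax(per_patch_probs.mean(axis=0)))

# A 64-band spectrogram of 100 frames yields 4 patches of 40 frames
# each at a 20-frame hop; a classifier would score each patch,
# and vote() would then produce the clip-level decision.
patches = extract_patches(np.random.rand(64, 100))
```

The same slicing logic also underlies multi-scale analysis: running it at several values of `patch_t` yields views of the same recording at different temporal contexts.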
5.1. Future directions

As we have explored, deep learning methods that make use of spectrogram features have gained prominence across the field of audio analysis, and for several speech analysis tasks too. However, there are general aspects in which performance needs to improve substantially before widespread deployment becomes possible. These include the following:

• Noise robustness, particularly towards overlapping sounds and reverberation.
• Model complexity and real-time operation on edge devices.
• Robust separation of intertwined sounds, particularly for polyphonic (multi-recording-channel) audio sources.
• Timeliness: early detection before a sound has completed.
• Generalisation to unseen sounds, e.g. few- and zero-shot classification, including from multimodal prompts [41].

Beyond this, determining optimal spectrogram settings for a particular back-end architecture and task is currently a largely empirical process. These settings refer to the dimension of spectrogram patches, their resolution in the time and frequency domains, and whether the frequency dimension is linear or non-linear. Then each element (pixel) in a spectrogram must be scaled, such as log/A-law, perceptually scaled or otherwise. Data-driven methods of determining the optimal settings are required, but this conflicts with the aim of adopting generalised pre-trained foundation models, unless those can be multi-resolution and multi-scaled, or make use of data fusion techniques. We also note that many researchers rely on Python libraries such as librosa 3, adopting default settings, so any future techniques should ideally be just as stable and easy to use.

Author Contributions: Authors contributed mainly to the sections relating to their respective domain expertise, with all authors contributing equally to the remaining sections of the manuscript. All authors have read and agreed to the published version of the manuscript.

Funding: This research received no external funding.
Conflicts of Interest: The authors declare no conflicts of interest.

3 https://librosa.org

Abbreviations

The following abbreviations are used in this manuscript:

AE  Autoencoder
AIM  Auditory Image Model
ASA  Acoustic Scene Analysis
ASD  Anomalous Sound Detection
AST  Audio Spectrogram Transformer
CNN  Convolutional Neural Network
CQT  Constant-Q Transform
DCASE  Detection and Classification of Acoustic Scenes and Events
DCT  Discrete Cosine Transform
DFT  Discrete Fourier Transform
DID  Dialect Identification
DWT  Discrete Wavelet Transform
ERB  Equivalent Rectangular Banks
FB  Filterbanks
FFT  Fast Fourier Transform
GTG  Gammatonegram
LID  Language Identification
LMS  Log-Mel Spectrogram
LNS  Log Non-uniform Spectrum
LS  Linear Spectrogram
LSS  Log-Scaled Spectrogram
LSTM  Long-Short Term Memory
MFCC  Mel-frequency Cepstral Coefficients
MS  Mel Spectrogram
OE  Outlier Exposure
PAMIR  Passive-aggressive Model for Image Retrieval
PLP  Perceptual Linear Prediction
PSDS  Polyphonic Sound Detection Score
RWCP  Real World Computing Partnership
RNN  Recurrent Neural Network
SAI  Stabilised Auditory Image
SDC  Shifted Delta Cepstra
SED  Sound Event Detection
SID  Speaker Identification
SSA  Sound Scene Analysis
SSD  Sound Scene Detection
SER  Speech Emotion Recognition
SNR  Signal to Noise Ratio
SV  Speaker Verification
SVM  Support Vector Machine
STFT  Short Time Fourier Transform
VAE  Variational Autoencoder
VNF  Variance Normalised Features

References

1. Koenig, W.; Dunn, H.K.; Lacy, L.Y. The Sound Spectrograph. The Journal of the Acoustical Society of America 1946, 18, 19–49. https://doi.org/10.1121/1.1916342.
2. Zue, V.W.; Cole, R.A. Experiments on spectrogram reading. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'79). IEEE, 1979, Vol. 4, p. 116–119.
3. McLoughlin, I.V. Speech and Audio Processing: a MATLAB-based approach; Cambridge University Press, 2016.
4. Gibbs, J.W. Fourier Series. Nature 1899, 59.
5. Ifeachor, E.C.; Jervis, B.W.
Digital Signal Processing: A Practical Approach; Addison-Wesley, 1993.
6. McLoughlin, I.; Zhang, H.; Xie, Z.; Song, Y.; Xiao, W.; Phan, H. Continuous robust sound event classification using time-frequency features and deep learning. PloS one 2017, 12, e0182309.
7. Wang, Z.J.; Turko, R.; Shaikh, O.; Park, H.; Das, N.; Hohman, F.; Kahng, M.; Polo Chau, D.H. CNN Explainer: Learning Convolutional Neural Networks with Interactive Visualization. IEEE Transactions on Visualization and Computer Graphics 2021, 27, 1396–1406. https://doi.org/10.1109/TVCG.2020.3030418.
8. McLoughlin, I.V. Applied Speech and Audio Processing; Cambridge University Press, 2009.
9. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proc. INTERSPEECH, 2020, p. 3830–3834.
10. Radford, A.; Kim, J.W.; Xu, T.; Brockman, G.; McLeavey, C.; Sutskever, I. Robust speech recognition via large-scale weak supervision. In Proc. ICML, 2023.
11. Kim, J.Y.; Lee, S.H. Accuracy enhancement method for speech emotion recognition from spectrogram using temporal frequency correlation and positional information learning through knowledge transfer. IEEE Access 2024.
12. Ellis, D.P. Gammatone-like spectrograms. Web resource: http://www.ee.columbia.edu/dpwe/resources/matlab/gammatonegram, 2009.
13. FitzGerald, D. Vocal separation using nearest neighbours and median filtering. In Proceedings of the IET Irish Signals and Systems Conference (ISSC 2012), 2012, p. 98G.
14. Brown, J.C. Calculation of a constant Q spectral transform. The Journal of the Acoustical Society of America 1991, 89, 425–434.
15. Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proc. ICLR, 2023.
16. Huang, H.; Man, J.; Li, L.; Zeng, R. Musical timbre style transfer with diffusion model. PeerJ Computer Science 2024, 10, e2194.
17. Ma, N.; Green, P.; Barker, J.; Coy, A. Exploiting correlogram structure for robust speech recognition with multiple speech sources. Speech Communication 2007, 49, 874–891.
18. Patterson, R.D.; Allerhand, M.H.; Giguere, C. Time-domain modeling of peripheral auditory processing: A modular architecture and a software platform. The Journal of the Acoustical Society of America 1995, 98, 1890–1894.
19. Walters, T.C. Auditory-based processing of communication sounds. PhD thesis, University of Cambridge, Cambridge, UK, 2011.
20. Lyon, R.F.; Rehn, M.; Bengio, S.; Walters, T.C.; Chechik, G. Sound retrieval and ranking using sparse auditory representations. Neural computation 2010, 22, 2390–2416.
21. Lyon, R.F.; Rehn, M.; Walters, T.; Bengio, S.; Chechik, G. Audio classification for information retrieval using sparse features, 2013. US Patent 8,463,719.
22. Lyon, R.F.; Ponte, J.; Chechik, G. Sparse coding of auditory features for machine hearing in interference. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2011, p. 5876–5879.
23. Lyon, R.F. Machine hearing: Audio analysis by emulation of human hearing. In Proceedings of the Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA). IEEE, 2011, p. viii–viii.
24. Zhang, H.; McLoughlin, I.; Song, Y. Robust Sound Event Recognition using Convolutional Neural Networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2015, number 2635, p. 559–563.
25. Nowak, E.; Jurie, F.; Triggs, B. Sampling Strategies for Bag-of-Features Image Classification. In Proceedings of Computer Vision – ECCV 2006; Leonardis, A.; Bischof, H.; Pinz, A., Eds., Berlin, Heidelberg, 2006; p. 490–503. https://doi.org/10.1007/11744085_38.
26. Gong, Y.; Chung, Y.A.; Glass, J.
AST: Audio Spectrogram Transformer. In Proceedings of Interspeech 2021, 2021, p. 571–575. https://doi.org/10.21437/Interspeech.2021-698.
27. Dennis, J.; Tran, H.D.; Chng, E.S. Image feature representation of the subband power distribution for robust sound event classification. IEEE Transactions on Audio, Speech, and Language Processing 2013, 21, 367–377.
28. McLoughlin, I.; Zhang, H.M.; Xie, Z.P.; Song, Y.; Xiao, W. Robust Sound Event Classification using Deep Neural Networks. IEEE Transactions on Audio, Speech, and Language Processing 2015, 23, 540–552.
29. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust DNN Embeddings for Speaker Recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, p. 5329–5333. https://doi.org/10.1109/ICASSP.2018.8461375.
30. Xie, Z.; McLoughlin, I.; Zhang, H.; Song, Y.; Xiao, W. A new variance-based approach for discriminative feature extraction in machine hearing classification using spectrogram features. Digital Signal Processing 2016, 54, 119–128.
31. Miao, X.; McLoughlin, I.; Song, Y. Variance Normalised Features for Language and Dialect Discrimination. Circuits, Systems, and Signal Processing 2021, 40, 3621–3638. https://doi.org/10.1007/s00034-020-01641-1.
32. Jin, M.; Song, Y.; McLoughlin, I. LID-senone Extraction via Deep Neural Networks for End-to-End Language Identification. In Proceedings of Odyssey, June 2016.
33. Davis, S.; Mermelstein, P. Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Transactions on Acoustics, Speech, and Signal Processing 1980, 28, 357–366.
34. Malina, W. On an Extended Fisher Criterion for Feature Selection. IEEE Transactions on Pattern Analysis and Machine Intelligence 1981, PAMI-3, 611–614. https://doi.org/10.1109/TPAMI.1981.4767154.
35. Oord, A.v.d.; Dieleman, S.; Zen, H.; Simonyan, K.; Vinyals, O.; Graves, A.; Kalchbrenner, N.; Senior, A.; Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio, 2016. arXiv:1609.03499 [cs], https://doi.org/10.48550/arXiv.1609.03499.
36. Rafsanjani, M.A.H.; Mawalim, C.O.; Lestari, D.P.; Sakti, S.; Unoki, M. Unsupervised Anomalous Sound Detection Using Timbral and Human Voice Disorder-Related Acoustic Features. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2024, p. 1–6.
37. Miao, X.; McLoughlin, I.; Yan, Y. A New Time-Frequency Attention Mechanism for TDNN and CNN-LSTM-TDNN, with Application to Language Identification. In Proceedings of Interspeech, 2019.
38. Koutini, K.; Schlüter, J.; Eghbal-zadeh, H.; Widmer, G. Efficient Training of Audio Transformers with Patchout. In Proceedings of Interspeech, 2022, p. 2753–2757. https://doi.org/10.21437/Interspeech.2022-227.
39. Chen, K.; Du, X.; Zhu, B.; Ma, Z.; Berg-Kirkpatrick, T.; Dubnov, S. HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection. In Proceedings of ICASSP 2022.
40. Zheng, X.; Song, Y.; Dai, L.R.; McLoughlin, I.; Liu, L. An Effective Mutual Mean Teaching Based Domain Adaptation Method for Sound Event Detection. In Proceedings of Interspeech, 2021, p. 556–560. https://doi.org/10.21437/Interspeech.2021-281.
41. Cai, P.; Song, Y.; Gu, Q.; Jiang, N.; Song, H.; McLoughlin, I. Detect Any Sound: Open-Vocabulary Sound Event Detection with Multi-Modal Queries. In Proceedings of the 33rd ACM International Conference on Multimedia, New York, NY, USA, 2025; p. 582–591. https://doi.org/10.1145/3746027.3755574.
42. Chen, T.; Yang, Y.; Qiu, C.; Fan, X.; Guo, X.; Shangguan, L. Enabling Hands-Free Voice Assistant Activation on Earphones.
In Proceedings of the 22nd Annual International Conference on Mobile Systems, Applications and Services (MOBISYS), New York, NY, USA, 2024; p. 155–168. https://doi.org/10.1145/3643832.3661890.
43. Bregman, A.S. Auditory scene analysis: The perceptual organization of sound; MIT Press, 1994.
44. Mesaros, A.; Heittola, T.; Virtanen, T.; Plumbley, M.D. Sound Event Detection: A tutorial. IEEE Signal Processing Magazine 2021, 38, 67–83. https://doi.org/10.1109/MSP.2021.3090678.
45. Zeng, X.M.; Song, Y.; Zhuo, Z.; Zhou, Y.; Li, Y.H.; Xue, H.; Dai, L.R.; McLoughlin, I. Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p. 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095568.
46. Miao, X.; McLoughlin, I. LSTM-TDNN with convolutional front-end for Dialect Identification in the 2019 Multi-Genre Broadcast Challenge, 2019, [arXiv:eess.AS/1912.09003].
47. Jiang, Y.; Song, Y.; McLoughlin, I.; Gao, Z.; Dai, L.R. An Effective Deep Embedding Learning Architecture for Speaker Verification. In Proceedings of INTERSPEECH, 2019, p. 4040–4044.
48. Xu, Y.; McLoughlin, I.; Song, Y.; Wu, K. Improved i-vector representation for speaker diarization. Circuits, Systems, and Signal Processing 2016, 35, 3393–3404.
49. Sun, L.; Du, J.; Jiang, C.; Zhang, X.; He, S.; Yin, B.; Lee, C.H. Speaker Diarization with Enhancing Speech for the First DIHARD Challenge. In Proceedings of Interspeech, 2018, p. 2793–2797.
50. Gao, Z.; Song, Y.; McLoughlin, I.; Li, P.; Jiang, Y.; Dai, L.R. Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of Interspeech, 2019, p. 361–365.
51. Pham, L.; Phan, H.; Palaniappan, R.; Mertins, A.; McLoughlin, I. CNN-MoE Based Framework for Classification of Respiratory Anomalies and Lung Disease Detection.
IEEE Journal of Biomedical and Health Informatics 2021, 25, 2938–2947. https://doi.org/10.1109/JBHI.2021.3064237.
52. Nguyen, T.; Pernkopf, F. Lung Sound Classification Using Co-Tuning and Stochastic Normalization. IEEE Transactions on Biomedical Engineering 2022, 69, 2872–2882. https://doi.org/10.1109/TBME.2022.3156293.
53. Milling, M.; Pokorny, F.B.; Bartl-Pokorny, K.D.; Schuller, B.W. Is Speech the New Blood? Recent Progress in AI-Based Disease Detection From Audio in a Nutshell. Frontiers in Digital Health 2022, Volume 4. https://doi.org/10.3389/fdgth.2022.886615.
54. Rashid, M.M.; Li, G.; Du, C. Nonspeech7k dataset: Classification and analysis of human non-speech sound. IET Signal Processing 2023, 17, e12233. https://doi.org/10.1049/sil2.12233.
55. Kim, S.Y.; Lee, H.M.; Lim, C.Y.; Kim, H.W. Detection of Abnormal Symptoms Using Acoustic-Spectrogram-Based Deep Learning. Applied Sciences 2025, 15. https://doi.org/10.3390/app15094679.
56. Moysis, L.; Iliadis, L.A.; Sotiroudis, S.P.; Boursianis, A.D.; Papadopoulou, M.S.; Kokkinidis, K.I.D.; Volos, C.; Sarigiannidis, P.; Nikolaidis, S.; Goudos, S.K. Music Deep Learning: Deep Learning Methods for Music Signal Processing—A Review of the State-of-the-Art. IEEE Access 2023, 11, 17031–17052. https://doi.org/10.1109/ACCESS.2023.3244620.
57. Chen, R.; Ghobakhlou, A.; Narayanan, A. Hierarchical Residual Attention Network for Musical Instrument Recognition Using Scaled Multi-Spectrogram. Applied Sciences 2024, 14. https://doi.org/10.3390/app142310837.
58. Buisson, M.; McFee, B.; Essid, S.; Crayencour, H.C. Self-Supervised Learning of Multi-Level Audio Representations for Music Segmentation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2024, 32, 2141–2152. https://doi.org/10.1109/TASLP.2024.3379894.
59. Thapa, N.; Lee, J.
Dual-Path Beat Tracking: Combining Temporal Convolutional Networks and Transformers in Parallel. Applied Sciences 2024, 14. https://doi.org/10.3390/app142411777.
60. Verma, P.; Berger, J. Audio transformers: Transformer architectures for large scale audio understanding. adieu convolutions. arXiv preprint arXiv:2105.00335 2021, 2.
61. Grzeszick, R.; Plinge, A.; Fink, G.A. Bag-of-features methods for acoustic event detection and classification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2017, 25, 1242–1252.
62. Dennis, J.; Tran, H.D.; Chng, E.S. Overlapping sound event recognition using local spectrogram features and the generalised hough transform. Pattern Recognition Letters 2013, 34, 1085–1093.
63. Adavanne, S.; Politis, A.; Nikunen, J.; Virtanen, T. Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks. IEEE Journal of Selected Topics in Signal Processing 2019, 13.
64. Xia, W.; Koishida, K. Sound Event Detection in Multichannel Audio using Convolutional Time-Frequency-Channel Squeeze and Excitation, 2019, [arXiv:eess.AS/1908.01399].
65. Alcázar, J.N.; Zuccarello, P.; Cobos, M. Classification of sound scenes and events in real-world scenarios with deep learning techniques, 2020.
66. Wisdom, S.; Tzinis, E.; Erdogan, H.; Weiss, R.; Wilson, K.; Hershey, J. Unsupervised Sound Separation Using Mixture Invariant Training. In Proceedings of Advances in Neural Information Processing Systems; Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.; Lin, H., Eds. Curran Associates, Inc., 2020, Vol. 33, p. 3846–3857.
67. Kahl, S.; Wood, C.M.; Eibl, M.; Klinck, H. BirdNET: A deep learning solution for avian diversity monitoring. Ecological Informatics 2021, 61, 101236. https://doi.org/10.1016/j.ecoinf.2021.101236.
68. Nath, K.; Sarma, K.K. Separation of overlapping audio signals: A review on current trends and evolving approaches. Signal Processing 2024, 221, 109487.
69. Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Environmental sound segmentation utilizing Mask U-Net. In Proceedings of the 2019 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2019, p. 5340–5345. https://doi.org/10.1109/IROS40897.2019.8967954.
70. Sudo, Y.; Itoyama, K.; Nishida, K.; Nakadai, K. Multi-channel Environmental sound segmentation. In Proceedings of the IEEE/SICE International Symposium on System Integration (SII), 2020, p. 820–825. https://doi.org/10.1109/SII46433.2020.9025963.
71. Baelde, M.; Biernacki, C.; Greff, R. A mixture model-based real-time audio sources classification method. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, p. 2427–2431. https://doi.org/10.1109/ICASSP.2017.7952592.
72. Xu, K.; Feng, D.; Mi, H.; Zhu, B.; Wang, D.; Zhang, L.; Cai, H.; Liu, S. Mixup-Based Acoustic Scene Classification Using Multi-channel Convolutional Neural Network. In Proceedings of Advances in Multimedia Information Processing – PCM 2018; Hong, R.; Cheng, W.H.; Yamasaki, T.; Wang, M.; Ngo, C.W., Eds., Cham, 2018; p. 14–23.
73. Phan, H.; Maass, M.; Mazur, R.; Mertins, A. Early event detection in audio streams. In Proceedings of the 2015 IEEE International Conference on Multimedia and Expo (ICME), June 2015, p. 1–6. https://doi.org/10.1109/ICME.2015.7177439.
74. Phan, H.; Koch, P.; McLoughlin, I.; Mertins, A. Enabling Early Audio Event Detection with Neural Networks. In Proceedings of the 43rd IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, p. 141–145.
75. McLoughlin, I.; Song, Y.; Pham, L.; Palaniappan, R.; Phan, H.; Lang, Y. Early Detection of Continuous and Partial Audio Events Using CNN. In Proceedings of Interspeech, 2018, p. 3314–3318.
76. Zhao, X.; Zhang, X.; Zhao, C.; Cho, J.H.; Kaplan, L.; Jeong, D.H. Multi-Label Temporal Evidential Neural Networks for Early Event Detection.
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. https://doi.org/10.1109/ICASSP49357.2023.10096305.
77. Zhao, X.; Zhang, X.; Cheng, W.; Yu, W.; Chen, Y.; Chen, H.; Chen, F. Seed: Sound event early detection via evidential uncertainty. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, p. 3618–3622.
78. Mesaros, A.; Heittola, T.; Virtanen, T. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), 2016, p. 1128–1132. https://doi.org/10.1109/EUSIPCO.2016.7760424.
79. Stowell, D.; Giannoulis, D.; Benetos, E.; Lagrange, M.; Plumbley, M.D. Detection and Classification of Acoustic Scenes and Events. IEEE Transactions on Multimedia 2015, 17, 1733–1746. https://doi.org/10.1109/TMM.2015.2428998.
80. Çakir, E.; Virtanen, T. End-to-End Polyphonic Sound Event Detection Using Convolutional Recurrent Neural Networks with Learned Time-Frequency Representation Input. In Proceedings of the 2018 International Joint Conference on Neural Networks (IJCNN), 2018, p. 1–7. https://doi.org/10.1109/IJCNN.2018.8489470.
81. Bittner, R.; McFee, B.; Salamon, J.; Li, P.; Bello, J. Deep Salience Representations for F0 Estimation in Polyphonic Music. In Proceedings of the 18th Int. Soc. for Music Info. Retrieval Conf., Suzhou, China, Oct. 2017.
82. Leng, Y.R.; Tran, H.D.; Kitaoka, N.; Li, H. Selective gammatone filterbank feature for robust sound event recognition. In Proceedings of Interspeech, 2010, p. 2246–2249. https://doi.org/10.21437/Interspeech.2010-617.
83. Pham, L.; Phan, H.; Nguyen, T.; Palaniappan, R.; Mertins, A.; McLoughlin, I. Robust acoustic scene classification using a multi-spectrogram encoder-decoder framework. Digital Signal Processing 2021, 110, 102943.
https://doi.org/https://doi.org/10 .1016/j.dsp.2020.102943. 84.Wang, Y.; Getreuer, P.; Hughes, T.; Lyon, R.F.; Saurous, R.A. Trainable frontend for robust and far-field keyword spotting. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, p. 5670–5674. 85.Yin, H.; Bai, J.; Xiao, Y.; Wang, H.; Zheng, S.; Chen, Y.; Das, R.K.; Deng, C.; Chen, J. Exploring Text-Queried Sound Event Detection with Audio Source Separation. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, p. 1–5. https://doi.org/10.1109/ICASSP49660.2025.10889789. 86. Adavanne, S.; Virtanen, T. A Report on Sound Event Detection with Different Binaural Features. Technical report, DCASE2017 Challenge, 2017. 87. Heittola, T.; Mesaros, A. DCASE 2017 Challenge Setup: Tasks, Datasets and Baseline System. Technical report, DCASE2017 Challenge, 2017. 88. Lin, L.; Wang, X. Guided Learning Convolution System For Dcase 2019 Task 4. Technical report, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China, 2019. 89. Gemmeke, J.F.; Ellis, D.P.W.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R.C.; Plakal, M.; Ritter, M. Audio Set: An ontology and human-labeled dataset for audio events. In Proceedings of the Proc. IEEE ICASSP 2017, New Orleans, LA, 2017. 90. Fonseca, E.; Pons, J.; Favory, X.; Font, F.; Bogdanov, D.; Ferraro, A.; Oramas, S.; Porter, A.; Serra, X. Freesound Datasets: a platform for the creation of open audio datasets. In Proceedings of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017; p. 486–493. 91.McLoughlin, I.; Xie, Z.; Song, Y.; Phan, H.; Palaniappan, R. Time–Frequency Feature Fusion for Noise Robust Audio Event Classification. Circuits, Systems, and Signal Processing 2020, 39, 1672–1687. https://doi.org/10.1007/s00034-019-01203-0. 92.Ebbers, J.; Haeb-Umbach, R. 
Pre-Training And Self-Training For Sound Event Detection In Domestic Environments. Technical report, DCASE2022 Challenge, 2022. 93.Turpault, N.; Serizel, R.; Parag Shah, A.; Salamon, J. Sound event detection in domestic environments with weakly labeled data and soundscape synthesis. In Proceedings of the Workshop on Detection and Classification of Acoustic Scenes and Events, New York City, United States, 2019. 94.Kim, J.W.; Son, S.W.; Song, Y.; Kim, Hong Kook1, .; Song, I.H.; Lim, J.E. Semi-Supervised Learning-Based Sound Event Detection Using Frequency Dynamic Convolution With Large Kernel Attention For Dcase Challenge 2023 Task 4. Technical report, DCASE2023 Challenge, 2023. 95.Schmid, F.; Primus, P.; Morocutti, T.; Greif, J.; Widmer, G. Improving Audio Spectrogram Transformers For Sound Event Detection Through Multi-Stage Training. Technical report, DCASE2024 Challenge, 2024. 96.Martín-Morató, I.; Mesaros, A. Strong Labeling of Sound Events Using Crowdsourced Weak Labels and Annotator Competence Estimation. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2023, 31, 902–914. https://doi.org/10.1109/TASLP. 2022.3233468. 97.Tang, L.; Tian, H.; Huang, H.; Shi, S.; Ji, Q. A survey of mechanical fault diagnosis based on audio signal analysis. Measurement 2023, 220, 113294. https://doi.org/https://doi.org/10.1016/j.measurement.2023.113294. 98.Qurthobi, A.; Maskeli ̄ unas, R.; Damaševiˇcius, R. Detection of Mechanical Failures in Industrial Machines Using Overlapping Acoustic Anomalies: A Systematic Literature Review. Sensors 2022, 22. https://doi.org/10.3390/s22103888. 99.Zeng, X.M.; Song, Y.; McLoughlin, I.; Liu, L.; Dai, L.R. Robust Prototype Learning for Anomalous Sound Detection. In Proceedings of the Interspeech, 2023, p. 261–265. https://doi.org/10.21437/Interspeech.2023-1173. 100.Koizumi, Y.; Kawaguchi, Y.; Imoto, K.; et al.. 
Description and Discussion on DCASE2020 Challenge Task2: Unsupervised Anomalous Sound Detection for Machine Condition Monitoring. In Proceedings of the DCASE, November 2020, p. 81–85. 101.Suefusa, K.; Nishida, T.; Purohit, H.; Tanabe, R.; Endo, T.; Kawaguchi, Y. Anomalous Sound Detection Based on Interpolation Deep Neural Network. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, p. 271–275. https://doi.org/10.1109/ICASSP40776.2020.9054344. 102.Zeng, X.M.; Song, Y.; Dai, L.R.; Liu, L. Predictive AutoEncoders Are Context-Aware Unsupervised Anomalous Sound Detectors. In Proceedings of the Man-Machine Speech Communication; Zhenhua, L.; Jianqing, G.; Kai, Y.; Jia, J., Eds., Singapore, 2023; p. 101–113. 103. Hendrycks, D.; Mazeika, M.; Dietterich, T. Deep Anomaly Detection with Outlier Exposure. In Proceedings of the Proc. of IXLR, 2019. Version March 17, 2026 submitted to Appl. Sci.28 of 30 104.Zeng, X.M.; Song, Y.; Zhuo, Z.; Zhou, Y.; Li, Y.H.; Xue, H.; Dai, L.R.; McLoughlin, I. Joint Generative-Contrastive Representation Learning for Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2023, p. 1–5. https://doi.org/10.1109/ICASSP49357.2023.10095568. 105.Liu, Y.; Guan, J.; Zhu, Q.; Wang, W. Anomalous Sound Detection Using Spectral-Temporal Information Fusion. In Proceedings of the ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022. https: //doi.org/10.1109/ICASSP43922.2022.9747868. 106.Han, B.; Lv, Z.; Jiang, A.; Huang, W.; Chen, Z.; Deng, Y.; Ding, J.; Lu, C.; Zhang, W.Q.; Fan, P.; et al. Exploring Large Scale Pre-Trained Models for Robust Machine Anomalous Sound Detection. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, 2024, p. 1326–1330. https://doi.org/10.1109/ICASSP48485.2024.10447183. 107. 
Jiang, A.; Han, B.; Lv, Z.; Deng, Y.; Zhang, W.Q.; Chen, X.; Qian, Y.; Liu, J.; Fan, P. AnoPatch: Towards Better Consistency in Machine Anomalous Sound Detection. In Proceedings of the Interspeech, 2024, p. 107–111. https://doi.org/10.21437 /Interspeech.2024-1761. 108.Chakrabarty, D.; Elhilali, M. Abnormal sound event detection using temporal trajectories mixtures. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, p. 216–220. https://doi.org/10.1109/ ICASSP.2016.7471668. 109. Li, K.; Zaman, K.; Li, X.; Akagi, M.; Unoki, M. Machine Anomalous Sound Detection Using Spectral-temporal Modulation Representations Derived from Machine-specific Filterbanks. arXiv preprint arXiv:2409.05319 2024. 110. Yin, J.; Gao, Y.; Zhang, W.; Wang, T.; Zhang, M. Diffusion Augmentation Sub-center Modeling for Unsupervised Anomalous Sound Detection with Partially Attribute-Unavailable Conditions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, p. 1–5. https://doi.org/10.1109/ICASSP49660.2025.10890695. 111. Stowell, D. Computational bioacoustics with deep learning: a review and roadmap. PeerJ 2022, 10, e13152. 112. Tosato, G.; Shehata, A.; Janssen, J.; Kamp, K.; Jati, P.; Stowell, D. Auto deep learning for bioacoustic signals. arXiv preprint arXiv:2311.04945 2023. 113.Heinrich, R.; Sick, B.; Scholz, C. AudioProtoPNet: An Interpretable Deep Learning Model for Bird Sound Classification. arXiv preprint arXiv:2404.10420 2024. 114.Hershey, J.R.; Chen, Z.; Le Roux, J.; Watanabe, S. Deep clustering: Discriminative embeddings for segmentation and separation. arXiv preprint arXiv:1508.04306 2015. 115.Dang, T.M.; Wang, T.S.; Lekhak, H.; Zhu, K.Q. EmotionalCanines: A Dataset for Analysis of Arousal and Valence in Dog Vocalization. In Proceedings of the ACM International Conference on Multimedia. Association for Computing Machinery, 2025, p. 13281–13288. 
https://doi.org/10.1145/3746027.3758286. 116.Suzuki, K.; Sakamoto, S.; Taniguchi, T.; Kameoka, H. Speak like a dog: Human to non-human creature voice conversion. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), 2022, p. 1388–1393. 117.Kang, M.; Lee, S.; Lee, C.; Cho, N. When Humans Growl and Birds Speak: High-Fidelity Voice Conversion from Human to Animal and Designed Sounds. arXiv preprint arXiv:2505.24336 2025. 118. Ravanelli, M.; Bengio, Y. Speaker recognition from raw waveform with SincNet. In Proceedings of the IEEE Spoken Language Technology Workshop (SLT). IEEE, 2018, p. 1021–1028. 119.Lee, J.; Park, J.; Nam, J. Sample-level deep convolutional neural networks for music auto-tagging using raw waveforms. In Proceedings of the Proc. ISMIR, 2017. 120.Jung, J.w.; Kim, H.S.; Kim, M.J.; Yoon, S.H.; Lee, B.J.; Kim, H. RawNet: Advanced end-to-end deep neural network using raw waveforms for text-independent speaker verification. Proc. Interspeech 2019, p. 1268–1272. 121.Bravo Sánchez, V.; Stowell, D.; Drossos, K.; Virtanen, T. Bioacoustic classification of avian calls from raw sound waveforms with an open-source deep learning architecture. Scientific Reports 2021, 11, 15740. 122.Zeghidour, N.; Luebs, F.; Synnaeve, G.; Collobert, R. LEAF: A learnable frontend for audio classification. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. 123.Allen, A.N.; et al. A CNN for humpback whale song detection in diverse long-term datasets. Frontiers in Marine Science 2021, 8, 653740. 124.Hexeberg, S.; Leite, R.; Ewers, R.M.; Stowell, D. Semi-supervised classification of bird vocalizations using spatiotemporal features. Scientific Reports 2023, 13, 12345. 125.Kong, Q.; Cao, Y.; Iqbal, T.; Wang, Y.; Wang, W.; Plumbley, M.D. PANNs: Large-scale pretrained audio neural networks for audio pattern recognition. 
IEEE/ACM Transactions on Audio, Speech, and Language Processing 2020, 28, 2880–2894. 126. Lostanlen, V.; Salamon, J.; Cartwright, M.; McFee, B.; Farnsworth, A.; Kelling, S.; Bello, J.P. Per-Channel Energy Normalization: Why and How. IEEE Signal Processing Letters 2019, 26, 39–43. https://doi.org/10.1109/LSP.2018.2878620. 127.Song, H.; McLoughlin, I.; Gu, Q.; Jiang, N.; Song, Y. An Efficient Transfer Learning Method Based on Adapter with Local Attributes for Speech Emotion Recognition, 2025, [arXiv:cs.SD/2509.23795]. Version March 17, 2026 submitted to Appl. Sci.29 of 30 128.O’Shaughnessy, D. Spoken language identification: An overview of past and present research trends. Speech Communication 2025, 167, 103167. https://doi.org/https://doi.org/10.1016/j.specom.2024.103167. 129. Miao, X.; McLoughlin, I.; Wang, W.; Zhang, P. D-MONA: A dilated mixed-order non-local attention network for speaker and language recognition. Neural Networks 2021, 139, 201–211. 130. Liu, H.; Perera, L.P.G.; Khong, A.W.; Chng, E.S.; Styles, S.J.; Khudanpur, S. Efficient self-supervised learning representations for spoken language identification. IEEE Journal of Selected Topics in Signal Processing 2022, 16, 1296–1307. 131.Dey, S.; Sahidullah, M.; Saha, G. Towards Cross-Corpora Generalization for Low-Resource Spoken Language Identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2024. 132.Kohler, M.A.; Kennedy, M. Language identification using shifted delta cepstra. In Proceedings of the 45th Midwest Symposium on Circuits and Systems (MWSCAS). IEEE, 2002, Vol. 3, p. I–69. 133.Cai, W.; Cai, Z.; Zhang, X.; Wang, X.; Li, M. A novel learnable dictionary encoding layer for end-to-end language identification. In Proceedings of the 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2018, p. 5189–5193. 134. Alumäe, T.; Kukk, K.; Le, V.B.; Barras, C.; Messaoudi, A.; Ben Kheder, W. 
Exploring the impact of pretrained models and web-scraped data for the 2022 NIST language recognition evaluation. In Proceedings of the Proceedings of the Interspeech, 2023, p. 516–520. 135.Jin, M.; Song, Y.; McLoughlin, I.; Dai, L.R. LID-Senones and Their Statistics for Language Identification. IEEE/ACM Transactions on Audio, Speech, and Language Processing 2018, 26, 171–183. 136.Kaiyr, A.; Kadyrov, S.; Bogdanchikov, A. Automatic Language Identification from Spectorgam Images.In Proceedings of the 2021 IEEE International Conference on Smart Information Systems and Technologies (SIST), 2021, p. 1–4.https: //doi.org/10.1109/SIST50301.2021.9465996. 137.Tjandra, A.; Choudhury, D.G.; Zhang, F.; Singh, K.; Conneau, A.; Baevski, A.; Sela, A.; Saraf, Y.; Auli, M. Improved language identification through cross-lingual self-supervised learning. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, p. 6877–6881. 138.Reynolds, D.A.; Quatieri, T.F.; Dunn, R.B. Speaker Verification Using Adapted Gaussian Mixture Models. Digital Signal Processing 2000, 10, 19–41. https://doi.org/http://doi.org/10.1006/dspr.1999.0361. 139.Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front-End Factor Analysis for Speaker Verification. IEEE Transactions on Audio, Speech, and Language Processing 2011, 19, 788–798. https://doi.org/http://doi.org/10.1109/TASL.2010.2064307. 140.Variani, E.; Lei, X.; McDermott, E.; Lopez-Moreno, I.; Gonzalez-Dominguez, J. Deep Neural Networks for Small-Footprint Text-Dependent Speaker Verification. In Proceedings of the Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, p. 4052–4056. https://doi.org/http://doi.org/10.1109/ICASSP.2014.6854338. 141.Liu, Z.L.; Song, Y.; Zeng, X.M.; Dai, L.R.; McLoughlin, I. DP-MAE: A dual-path masked autoencoder based self-supervised learning method for anomalous sound detection. 
In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, p. 1481–1485. 142.Nagrani, A.; Chung, J.S.; Zisserman, A. VoxCeleb: A Large-Scale Speaker Identification Dataset. In Proceedings of the Proc. Interspeech, 2017, p. 2616–2620. https://doi.org/http://doi.org/10.21437/Interspeech.2017-950. 143.Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Proc. Interspeech, 2020, p. 3830–3834. 144.Campbell, W.M.; Campbell, J.P.; Reynolds, D.A.; Singer, E.; Torres-Carrasquillo, P.A. Support Vector Machines for Speaker and Language Recognition. Computer Speech & Language 2006, 20, 210–229. https://doi.org/http://doi.org/10.1016/j.csl.2005.06.003. 145. Schuller, B.; Steidl, S.; Batliner, A. The interspeech 2009 emotion challenge 2009. 146.Eyben, F.; Wöllmer, M.; Schuller, B. Opensmile: the munich versatile and fast open-source audio feature extractor. In Proceedings of the Proceedings of the 18th ACM international conference on Multimedia, 2010, p. 1459–1462. 147.Schuller, B.W.; Zhang, Z.; Weninger, F.; Rigoll, G. Using multiple databases for training in emotion recognition: To unite or to vote? In Proceedings of the Interspeech, 2011, p. 1553–1556. 148.Parry, J.; Palaz, D.; Clarke, G.; Lecomte, P.; Mead, R.; Berger, M.; Hofer, G. Analysis of Deep Learning Architectures for Cross-Corpus Speech Emotion Recognition. In Proceedings of the Interspeech, 2019, p. 1656–1660. 149.Satt, A.; Rozenberg, S.; Hoory, R. Efficient emotion recognition from speech using deep convolutional neural networks. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2017. 150.Trigeorgis, G.; Ringeval, F.; Brueckner, R.; Marchi, E.; Nicolaou, M.A.; Schuller, B.; Zafeiriou, S. Adieu features? 
end-to-end speech emotion recognition using a deep convolutional recurrent network. In Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2016, p. 5200–5204. 151.Mirsamadi, S.; Barsoum, E.; Zhang, C. Automatic speech emotion recognition using recurrent neural networks with local attention. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, p. 2227–2231. https://doi.org/10.1109/ICASSP.2017.7952552. Version March 17, 2026 submitted to Appl. Sci.30 of 30 152.Pepino, L.; Riera, P.; Ferrer, L. Emotion Recognition from Speech Using wav2vec 2.0 Embeddings. In Proceedings of the INTERSPEECH 2021. ISCA, 2021, p. 161–165. 153. Chen, S.; Wang, C.; Chen, Z.; Wu, Y.; Liu, S.; Chen, Z.; Li, J.; Kanda, N.; Yoshioka, T.; Xiao, X.; et al. Wavlm: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing 2022, 16, 1505–1518. 154.Diatlova, D.; Udalov, A.; Shutov, V.; Spirin, E. Adapting WavLM for Speech Emotion Recognition, 2024, [arXiv:cs.LG/2405.04485]. 155.Chowdhury, S.Y.; Banik, B.; Hoque, M.T.; Banerjee, S. A Novel Hybrid Deep Learning Technique for Speech Emotion Detection using Feature Engineering, 2025, [arXiv:cs.SD/2507.07046]. 156. Amjad, A.; Khuntia, S.; Chang, H.T.; Tai, L.C. Multi-Domain Emotion Recognition Enhancement: A Novel Domain Adaptation Technique for Speech-Emotion Recognition. IEEE Transactions on Audio, Speech and Language Processing 2025, 33, 528–541. https://doi.org/10.1109/TASLP.2024.3498694. 157.Chen, Z.; Wang, J.; Hu, W.; Li, L.; Hong, Q. Unsupervised Speaker Verification Using Pre-Trained Model and Label Correction. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, p. 1–5. https://doi.org/10.1109/ICASSP49357.2023.10094610. 158. Ghaffari, H.; Devos, P. 
Robust Weakly Supervised Bird Species Detection via Peak Aggregation and PIE. IEEE Transactions on Audio, Speech and Language Processing 2025. Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.