Journal Information
IEEE/ACM Transactions on Audio, Speech, and Language Processing
Institute of Electrical and Electronics Engineers
Bimonthly

ISSN: 2329-9290

Officially published
Indexed Years

    Hierarchical Regulated Iterative Network for Joint Task of Music Detection and Music Relative Loudness Estimation

    Bijue Jia, Jiancheng Lv, Xi Peng, Yao Chen...
    pp. 1-13
    Abstract: One practical requirement of music copyright management is the estimation of music relative loudness, which is mostly ignored in existing music detection works. To solve this problem, we study the joint task of music detection and music relative loudness estimation. Specifically, we observe that the joint task has two characteristics, i.e., temporality and hierarchy, which can facilitate obtaining a solution. For example, a tiny fragment of audio is temporally related to its neighboring fragments because they may all belong to the same event, and the event classes of the fragment in the two tasks have a hierarchical relationship. Based on the above observation, we reformulate the joint task as a hierarchical event detection and localization problem. To solve this problem, we further propose the Hierarchical Regulated Iterative Network (HRIN), which includes two variants, HRIN-r and HRIN-cr, based on recurrent and convolutional recurrent modules, respectively. To exploit the joint task's characteristics, our models employ an iterative framework to achieve encouraging capability in temporal modeling, while designing three hierarchical violation penalties to regulate the hierarchy. Extensive experiments on the currently largest dataset (i.e., OpenBMAT) show the promising performance of our HRIN in segment-level and event-level evaluations.
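    As a toy illustration of the hierarchy-regulation idea (hypothetical class names and a minimal penalty form, not the authors' three penalties), one can penalize any frame whose predicted child-class probability exceeds that of its parent class in the coarser task:

```python
# Toy sketch of a hierarchical violation penalty (hypothetical class
# names; the paper designs three such penalties). The idea: a class in
# the fine-grained loudness task should never be more probable than its
# parent class in the coarse music-detection task.

HIERARCHY = {            # child (loudness task) -> parent (detection task)
    "fg-music": "music",
    "bg-music": "music",
    "no-music": "non-music",
}

def hierarchy_violation_penalty(parent_probs, child_probs):
    """Sum of max(0, p_child - p_parent) over all child classes."""
    return sum(
        max(0.0, child_probs[c] - parent_probs[p])
        for c, p in HIERARCHY.items()
    )

parent = {"music": 0.7, "non-music": 0.3}
child = {"fg-music": 0.5, "bg-music": 0.3, "no-music": 0.4}
penalty = hierarchy_violation_penalty(parent, child)  # only no-music violates
```

    Adding such a term to the training loss pushes the two tasks' predictions toward mutual consistency.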

    Novel Architectures for Unsupervised Information Bottleneck Based Speaker Diarization of Meetings

    Nauman Dawalatabad, Srikanth Madikeri, C. Chandra Sekhar, Hema A. Murthy...
    pp. 14-27
    Abstract: Speaker diarization is a topical and important problem, especially useful as a preprocessor for conversational-speech applications. The objective of this article is two-fold: (i) segment initialization by uniformly distributing speaker information across the initial segments, and (ii) incorporating speaker-discriminative features within the unsupervised diarization framework. In the first part of the work, a varying-length segment initialization technique for Information Bottleneck (IB) based speaker diarization, using phoneme rate as the side information, is proposed. This initialization distributes speaker information uniformly across the segments and provides a better starting point for IB-based clustering. In the second part of the work, we present a Two-Pass Information Bottleneck (TPIB) based speaker diarization system that incorporates speaker-discriminative features during the process of diarization. The TPIB-based system shows improvement over the baseline IB-based system. During the first pass of the TPIB system, a coarse segmentation is performed using IB-based clustering. The alignments obtained are used to generate speaker-discriminative features using a shallow feed-forward neural network and linear discriminant analysis. The discriminative features obtained are used in the second pass to obtain the final speaker boundaries. In the final part of the paper, variable segment initialization is combined with the TPIB framework. This leverages the advantages of better segment initialization and speaker-discriminative features, resulting in an additional improvement in performance. An evaluation on standard meeting datasets shows significant absolute improvements of 3.9% and 4.7% on the NIST and AMI datasets, respectively.

    Block-Based High Performance CNN Architectures for Frame-Level Overlapping Speech Detection

    Midia Yousefi, John H. L. Hansen
    pp. 28-40
    Abstract: Speech technology systems such as automatic speech recognition (ASR), speaker diarization, speaker recognition, and speech synthesis have advanced significantly with the emergence of deep learning techniques. However, none of these voice-enabled systems performs well in natural environmental circumstances, specifically in situations where one or more potential interfering talkers are involved. Overlapping speech detection has therefore become an important front-end triage step for speech technology applications. This is crucial for large-scale datasets where manual labeling is not possible. A block-based CNN architecture is proposed to model overlapping speech in audio streams with frames as short as 25 ms. The proposed architecture is robust to both: (i) shifts in the distribution of network activations due to changes in network parameters during training, and (ii) local variations in the input features caused by feature extraction, environmental noise, or room interference. We also investigate the effect of alternate input features, including spectral magnitude, MFCC, MFB, and pyknogram, on both computational time and classification performance. Evaluation is performed on simulated overlapping speech signals based on the GRID corpus. The experimental results highlight the capability of the proposed system in detecting overlapping speech frames with 90.5% accuracy, 93.5% precision, 92.7% recall, and 92.8% F-score on same-gender overlapped speech. For opposite-gender cases, the network exceeds 95% on all classification metrics.
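    The reported frame-level numbers follow the standard confusion-matrix definitions; a minimal sketch with toy counts (not the paper's actual tallies):

```python
# Standard frame-level classification metrics computed from counts of
# true positives (tp), false positives (fp), false negatives (fn), and
# true negatives (tn). Toy counts for illustration only.

def frame_metrics(tp, fp, fn, tn):
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_score = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_score

acc, prec, rec, f1 = frame_metrics(tp=90, fp=10, fn=10, tn=90)
```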

    A Deep Adaptation Network for Speech Enhancement: Combining a Relativistic Discriminator With Multi-Kernel Maximum Mean Discrepancy

    Jiaming Cheng, Ruiyu Liang, Zhenlin Liang, Li Zhao...
    pp. 41-53
    Abstract: In deep-learning-based speech enhancement (SE) systems, trained models are often used to handle unseen noise types and language environments in real-life scenarios. However, since production environments differ from training conditions, mismatch problems arise that may seriously degrade the performance of an SE system. In this study, a domain-adaptive method combining two adaptation strategies is proposed to improve generalization to unlabeled noisy speech. In the proposed encoder-decoder-based SE framework, a domain discriminator and a domain confusion adaptation layer are introduced to conduct adversarial training. The model has two main innovations. First, the algorithm optimizes adversarial training by introducing a relativistic discriminator that operates on relative differences between domains, thus avoiding possible bias and better reflecting domain differences. Second, the multi-kernel maximum mean discrepancy (MK-MMD) between domains is taken as the regularization term of the domain adversarial loss, thereby further decreasing the marginal distribution distance between domains. The proposed model improves adaptability to unseen noises by encouraging the feature encoder to generate domain-invariant features. The model was evaluated in cross-noise and cross-language-and-noise experiments, and the results show that the proposed method provides considerable improvements over the non-adapted baseline in the perceptual evaluation of speech quality (PESQ), short-time objective intelligibility (STOI), and frequency-weighted signal-to-noise ratio (FWSNR).
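    A minimal sketch of the MK-MMD regularizer mentioned above, for scalar features and placeholder Gaussian bandwidths (the paper applies it to encoder feature batches):

```python
import math

# Illustrative MK-MMD estimate between two 1-D feature batches using a
# sum of Gaussian kernels. Bandwidths are placeholder choices, not the
# paper's; real usage operates on high-dimensional encoder features.

def gaussian_kernel(x, y, bandwidth):
    return math.exp(-((x - y) ** 2) / (2 * bandwidth ** 2))

def mk_mmd(xs, ys, bandwidths=(0.5, 1.0, 2.0)):
    def mean_kernel(a, b):
        return sum(
            gaussian_kernel(x, y, h)
            for x in a for y in b for h in bandwidths
        ) / (len(a) * len(b) * len(bandwidths))
    # biased V-statistic estimate: k(X,X) + k(Y,Y) - 2 k(X,Y)
    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)

same = mk_mmd([0.1, 0.2, 0.3], [0.1, 0.2, 0.3])  # near zero: identical batches
far = mk_mmd([0.1, 0.2, 0.3], [5.1, 5.2, 5.3])   # large: shifted batches
```

    Minimizing this quantity between source-domain and target-domain features pushes the encoder toward domain-invariant representations.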

    Comparison of Artificial Neural Network Types for Infant Vocalization Classification

    Franz Anders, Mario Hlawitschka, Mirco Fuchs
    pp. 54-67
    Abstract: In this study we compared various neural network types for the task of automatic infant vocalization classification, i.e., convolutional, recurrent, and fully-connected networks, as well as combinations thereof. The goal was first to determine the optimal configuration for each network type and then to identify the type with the highest overall performance. This investigation helps to employ neural networks more effectively for infant vocalization classification tasks, which typically offer low amounts of training data. To this end, we defined a unified neural network architecture scheme for audio classification, from which we derived various network types. For each type we performed a semi-random hyperparameter search, which employed regression trees to both focus the search space and derive insights into the most influential parameters. We finally compared the test performances of the best-performing configurations in a contest-like setup. Our key findings are: (1) Networks with convolutional stages reached the highest performance, regardless of whether they were combined with fully-connected or recurrent layers. (2) For all types, the most influential architectural hyperparameter was the integration operation for reducing tensor dimensionality between network stages. The best-performing configurations reached a test performance of 75% unweighted average recall, surpassing previously published benchmarks.

    Harmonic-Temporal Factor Decomposition for Unsupervised Monaural Separation of Harmonic Sounds

    Tomohiko Nakamura, Hirokazu Kameoka
    pp. 68-82
    Abstract: We address the problem of separating a monaural mixture of harmonic sounds into the audio signals of individual semitones in an unsupervised manner. Unsupervised monaural audio source separation has thus far mainly been addressed by two approaches: one rooted in computational auditory scene analysis (CASA) and the other based on non-negative matrix factorization (NMF). These approaches focus on different clues for making source separation possible. A CASA-based method called harmonic-temporal clustering (HTC) focuses on the local time-frequency structure of individual sources, whereas NMF focuses on the global time-frequency structure of music spectrograms. These clues do not conflict with each other and can be combined to achieve a more reliable audio source separation algorithm. Hence, we propose a monaural audio source separation framework, harmonic-temporal factor decomposition (HTFD), by developing a spectrogram model that encompasses the features of the models used in the NMF and HTC approaches. We further incorporate a source-filter model to build an extension of HTFD, source-filter HTFD (SF-HTFD). We derive efficient parameter estimation algorithms for HTFD and SF-HTFD based on the auxiliary function principle. We show, through music source separation experiments, the efficacy of HTFD and SF-HTFD compared with conventional methods. Furthermore, we demonstrate the effectiveness of HTFD and SF-HTFD for automatic musical key transposition.
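    For readers unfamiliar with the NMF building block referenced above, here is a compact pure-Python sketch of the classic multiplicative updates for V ≈ WH under squared Euclidean loss (a toy rank-1 example; HTFD's spectrogram model is considerably richer):

```python
import random

# Multiplicative-update NMF (Lee & Seung style) on a tiny matrix.
# V is factorized into nonnegative W (n x r) and H (r x m).

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

def nmf(V, rank, iters=500, eps=1e-9):
    random.seed(0)
    n, m = len(V), len(V[0])
    W = [[random.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    H = [[random.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iters):
        WT = transpose(W)
        num, den = matmul(WT, V), matmul(matmul(WT, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(rank)]
        HT = transpose(H)
        num, den = matmul(V, HT), matmul(W, matmul(H, HT))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(n)]
    return W, H

# V is an exact rank-1 outer product, so rank-1 NMF reconstructs it closely.
V = [[1, 2, 4], [2, 4, 8], [3, 6, 12]]
W, H = nmf(V, rank=1)
R = matmul(W, H)
```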

    Computation of Spherical Harmonic Representations of Source Directivity Based on the Finite-Distance Signature

    Jens Ahrens, Stefan Bilbao
    pp. 83-92
    Abstract: The measurement of directivity for sound sources that are not electroacoustic transducers is fundamentally limited because the source cannot be driven with arbitrary signals. A consequence is that directivity can only be measured at a sparse set of frequencies, for example at the stable partial oscillations of a steady tone played by a musical instrument or produced by the human voice. This limitation prevents the data from being used in certain applications, such as time-domain room acoustic simulations, where the directivity needs to be available at all frequencies in the frequency range of interest. We demonstrate in this article that imposing the signature of the directivity obtained at a given distance onto a spherical wave allows for all the interpolation required to obtain a complete spherical harmonic representation of the source's directivity, i.e., a representation that is viable at any frequency, in any direction, and at any distance. Our approach is inspired by the far-field signature of exterior sound fields. It is not capable of incorporating the phase of the directivity directly. We argue, based on directivity measurements of musical instruments, that the phase of such measurement data is too unreliable or too ambiguous to be useful. We incorporate the numerically-derived directivity into the example application of finite-difference time-domain simulation of the acoustic field, which has not previously been possible.

    Improving Automatic Speech Recognition and Speech Translation via Word Embedding Prediction

    Shun-Po Chuang, Alexander H. Liu, Tzu-Wei Sung, Hung-yi Lee...
    pp. 93-105
    Abstract: In this article, we target speech translation (ST). We propose lightweight approaches that generally improve either ASR or end-to-end ST models. We leverage continuous representations of words, known as word embeddings, to improve ASR in cascaded systems as well as end-to-end ST models. The benefit of word embeddings is that they can be obtained easily by training on pure textual data, which alleviates the data scarcity issue; they also provide additional contextual information to speech models. This motivates us to distill the knowledge from word embeddings into speech models. In ASR, we use word embeddings as a regularizer to reduce the word error rate (WER), and further propose a novel decoding method that fuses the semantic relations among words for further improvement. In the end-to-end ST model, we propose leveraging word embeddings as an intermediate representation to enhance translation performance. Our analysis shows that it is possible to map speech signals to the semantic space, which motivates future work on applying the proposed methods to spoken language processing tasks.
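    As a loose illustration of fusing embedding-space semantics into decoding (toy 2-D embeddings and a hypothetical interpolation weight, not the paper's method), candidate words can be re-scored by combining the acoustic score with cosine similarity to a context vector:

```python
import math

# Toy re-scoring sketch: interpolate an ASR score with embedding-space
# similarity to the sentence context. Embeddings, scores, and the
# weight are made up for illustration.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def fuse(asr_scores, embeddings, context_vec, weight=0.3):
    return {
        w: (1 - weight) * s + weight * cosine(embeddings[w], context_vec)
        for w, s in asr_scores.items()
    }

emb = {"see": [1.0, 0.0], "sea": [0.0, 1.0]}
context = [0.1, 0.9]                              # context is close to "sea"
scores = fuse({"see": 0.55, "sea": 0.50}, emb, context)
```

    Here the semantically compatible homophone overtakes the acoustically preferred one, which is the kind of correction semantic fusion aims for.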

    A Cross-Entropy-Guided Measure (CEGM) for Assessing Speech Recognition Performance and Optimizing DNN-Based Speech Enhancement

    Li Chai, Jun Du, Qing-Feng Liu, Chin-Hui Lee...
    pp. 106-117
    Abstract: A new cross-entropy-guided measure (CEGM) is proposed to indirectly assess the accuracy of automatic speech recognition (ASR) on degraded speech with a speech enhancement front-end, without directly performing ASR experiments. The proposed CEGM is calculated in three steps: (1) low-level representation via feature extraction, (2) high-level nonlinear mapping using an acoustic model, and (3) a final CEGM calculation between the high-level representations of clean and enhanced speech. Specifically, state posterior probabilities from the outputs of the conventional hybrid acoustic model of the target ASR system are adopted as the high-level representations, and a cross-entropy criterion is used to calculate the CEGM. Because the CEGM is differentiable, it can also replace the conventional minimum mean squared error (MMSE) criterion as an objective function for deep neural network (DNN)-based speech enhancement. The front-end enhancement model can therefore be optimized towards improving the accuracy of the back-end ASR system. Experiments on the single-channel CHiME-4 Challenge show that CEGM consistently yields the highest correlations with word error rate (WER), which is often costly to compute, and achieves the most accurate assessment of ASR performance compared to the perceptual evaluation metrics commonly used for assessing speech enhancement. Furthermore, CEGM-optimized speech enhancement effectively reduces the WER on the CHiME-4 real test set compared to unprocessed noisy speech and to MMSE-optimized enhancement, for ASR systems with fixed multi-condition acoustic models in various deep architectures.
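    The final step above reduces to a cross-entropy between frame-level state posteriors; a toy 3-state sketch (real acoustic models have thousands of states):

```python
import math

# Cross-entropy between the state posterior of clean speech (p) and
# that of enhanced speech (q) for one frame. Toy 3-state posteriors;
# lower values indicate the enhanced frame behaves like clean speech.

def cross_entropy(p_clean, q_enhanced, eps=1e-12):
    return -sum(p * math.log(q + eps) for p, q in zip(p_clean, q_enhanced))

clean = [0.8, 0.1, 0.1]
matched = [0.8, 0.1, 0.1]    # enhancement preserved the posterior
mismatch = [0.2, 0.6, 0.2]   # enhancement distorted the posterior
ce_good = cross_entropy(clean, matched)
ce_bad = cross_entropy(clean, mismatch)
```

    Because every operation here is differentiable, the same quantity can serve directly as a training objective for the enhancement network, as the abstract notes.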

    Passive Geometry Calibration for Microphone Arrays Based on Distributed Damped Newton Optimization

    De Hu, Zhe Chen, Fuliang Yin
    pp. 118-131
    Abstract: Geometry calibration is an inherent challenge in distributed acoustic sensor networks. To mitigate this problem, a passive geometry calibration approach based on distributed damped Newton optimization is proposed. Specifically, a geometric cost function incorporating directions of arrival (DoAs) and time differences of arrival (TDoAs) is first formulated, and its identifiability conditions are given. Next, to achieve distributed geometry calibration, the cost function is split into multiple local cost functions that are assigned to each node. A distributed damped Newton optimization is then presented to retrieve the geometry of the microphone nodes and synchronize the internal delay between every pair of neighboring nodes. Finally, the computational complexity and transmission bandwidth requirements are analyzed. Compared with existing approaches, the proposed method estimates the geometry of the microphone network in a distributed manner and requires only a small number of acoustic sources. Experimental results show the validity of the proposed method.
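    The optimization primitive named above can be illustrated on a 1-D cost (the paper applies it to a multivariate geometric cost over node positions and internal delays, distributed across nodes):

```python
# Damped Newton iteration on a scalar cost: the full Newton step
# grad/hess is scaled by a damping factor for stability. Toy quadratic
# cost; damping value is a placeholder, not the paper's choice.

def damped_newton(grad, hess, x0, damping=0.5, iters=50):
    x = x0
    for _ in range(iters):
        x = x - damping * grad(x) / hess(x)
    return x

# toy cost f(x) = (x - 3)^2, minimized at x = 3
x_star = damped_newton(grad=lambda x: 2 * (x - 3),
                       hess=lambda x: 2.0,
                       x0=10.0)
```

    Damping trades convergence speed for robustness when the local quadratic model is unreliable, which matters in a distributed setting where each node sees only its local cost.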