Journal Information
IEEE/ACM transactions on audio, speech, and language processing
Institute of Electrical and Electronics Engineers

Bimonthly

ISSN: 2329-9290

Officially published

    Statistical Analysis for Speaker Recognition Evaluation With Data Dependence and Three Score Distributions

    Jin Chu Wu, Raghu N. Kacker
    pp. 1-14
    Abstract: The speaker recognition evaluation is conducted in a framework that employs three score distributions and two decision thresholds; the statistic of interest is an average of the two weighted sums of the probabilities of type I and type II errors at the two thresholds. Data dependence, caused by the repeated use of the same subjects to generate more samples from limited resources, is ubiquitous. Under such circumstances, statistical analysis is carried out. First, the standard error (SE) of the measure is estimated using the nonparametric three-sample two-layer bootstrap algorithm on a two-layer data structure, constructed after dataset optimization to account for data dependence, building on our prior rigorous statistical research in ROC analysis on large datasets with data dependence. Second, only with such SEs can one-classifier and two-classifier significance testing be carried out to provide quantitative information in terms of the significance level, i.e., the p-value, when evaluating and comparing classifiers. For the comparison, the positive correlation coefficient, computed using a synchronized resampling algorithm, must be taken into account; otherwise, the likelihood of detecting a statistically significant difference between the performance levels of two classifiers can be wrongly reduced.
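The two-layer bootstrap idea above — resample subjects first (the data-dependence layer), then resample scores within each picked subject — can be sketched as follows. This is an illustrative single-set version, not the paper's three-sample algorithm (which resamples target, known non-target, and unknown non-target score sets separately); the function and parameter names are assumptions.

```python
import numpy as np

def two_layer_bootstrap_se(groups, statistic, n_boot=1000, rng=None):
    """Estimate the standard error of `statistic` with a two-layer bootstrap.

    `groups` is a list of 1-D score arrays, one per subject. Layer 1
    resamples subjects with replacement; layer 2 resamples the scores
    within each picked subject. Illustrative sketch only.
    """
    rng = np.random.default_rng(rng)
    reps = np.empty(n_boot)
    for b in range(n_boot):
        # Layer 1: resample subjects with replacement.
        picked = rng.integers(0, len(groups), size=len(groups))
        # Layer 2: resample scores within each picked subject.
        resampled = [rng.choice(groups[i], size=len(groups[i]), replace=True)
                     for i in picked]
        reps[b] = statistic(np.concatenate(resampled))
    # SE of the statistic = standard deviation of the bootstrap replicates.
    return reps.std(ddof=1)
```

The returned SE is exactly what the abstract says the one- and two-classifier significance tests are built on.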

    Operation-Augmented Numerical Reasoning for Question Answering

    Yongwei Zhou, Junwei Bao, Youzheng Wu, Xiaodong He, et al.
    pp. 15-28
    Abstract: Question answering that requires numerical reasoning, which generally involves symbolic operations such as sorting, counting, and addition, is a challenging task. To address this problem, existing mixture-of-experts (MoE)-based methods design several specific answer predictors to handle different types of questions and achieve promising performance. However, they ignore the modeling and exploitation of fine-grained reasoning-related operations to support numerical reasoning, which limits their reasoning capability and interpretability. To alleviate this issue, we propose OPERA, an operation-augmented numerical reasoning framework. Concretely, we systematically define a scalable operation set to model numerical reasoning. We first identify reasoning-related operations based on context and then softly execute them to imitate the answer reasoning procedure via an operation-aware cross-attention mechanism. Finally, we utilize the operation-augmented semantic representation of the execution results to support answer prediction. We verify the effectiveness and generalization of OPERA in two scenarios with different knowledge sources and reasoning capabilities. Specifically, we conduct extensive experiments on two textual datasets, DROP and RACENum, and a table-text hybrid dataset, TAT-QA. Experimental results show that OPERA outperforms previous strong methods on the DROP, RACENum, and TAT-QA datasets. Further, we statistically and visually analyze its interpretability.
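The "soft execution via operation-aware cross-attention" step can be pictured as each context vector attending over a small set of operation embeddings (e.g. sort/count/add) and mixing the attended operation representation back in. The shapes and the fusion-by-addition choice below are assumptions for illustration, not OPERA's exact architecture.

```python
import numpy as np

def soft_operation_execution(context, op_embeds):
    """Sketch of operation-aware cross-attention.

    context:   (T, d) contextual token states
    op_embeds: (n_ops, d) embeddings of the operation set
    Returns operation-augmented states of shape (T, d).
    """
    scores = context @ op_embeds.T                   # (T, n_ops) relevance
    scores = scores - scores.max(axis=-1, keepdims=True)
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over operations
    # "Softly execute": mix the attended operation representation into
    # each token state (residual-style fusion, assumed here).
    return context + attn @ op_embeds
```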

    Speech Dereverberation With Frequency Domain Autoregressive Modeling

    Anurenjan Purushothaman, Debottam Dutta, Rohit Kumar, Sriram Ganapathy, et al.
    pp. 29-38
    Abstract: Speech applications in far-field real-world settings often deal with signals that are corrupted by reverberation. Dereverberation is an important step for improving audible quality and for reducing error rates in applications like automatic speech recognition (ASR). We propose a unified framework of speech dereverberation for improving speech quality and ASR performance using the envelope-carrier decomposition provided by an autoregressive (AR) model. The AR model is applied in the frequency domain of the sub-band speech signals to separate the envelope and carrier parts. A novel neural architecture based on a dual-path long short-term memory (DPLSTM) model is proposed, which jointly enhances the sub-band envelope and carrier components. The dereverberated envelope-carrier signals are modulated, and the sub-band signals are synthesized to reconstruct the audio signal. The DPLSTM model for dereverberation of envelope and carrier components also allows joint learning of the network weights for the downstream ASR task. On ASR tasks on the REVERB challenge dataset as well as on the VOiCES dataset, we illustrate that joint learning of the speech dereverberation network and the E2E ASR model yields significant performance improvements over the baseline ASR system trained on log-mel spectrograms, as well as over other dereverberation benchmarks (average relative improvements of 10-24% over the baseline system). Speech quality improvements, evaluated using subjective listening tests, further highlight the improved quality of the reconstructed audio.
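The envelope-carrier split described above is classically obtained with frequency-domain linear prediction (FDLP): fit an AR model to the DCT of a sub-band signal, so that the AR "spectrum" approximates the signal's temporal envelope, and divide it out to get the carrier. The sketch below is a minimal numpy-only version of that idea (function names, model order, and the unnormalized DCT scaling are assumptions; the paper's exact processing may differ).

```python
import numpy as np

def dct2(x):
    """Unnormalized DCT-II via an even extension and the FFT."""
    n = len(x)
    v = np.concatenate([x, x[::-1]])
    k = np.arange(n)
    return np.real(np.fft.fft(v)[:n] * np.exp(-1j * np.pi * k / (2 * n)))

def levinson(r, order):
    """Levinson-Durbin recursion: AR coefficients from autocorrelation r."""
    a = np.array([1.0])
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + np.dot(a[1:], r[i - 1:i - len(a):-1])
        k = -acc / err
        a = np.concatenate([a, [0.0]])
        a = a + k * a[::-1]
        err *= (1.0 - k * k)
    return a, err

def fdlp_envelope(subband, order=20):
    """Sketch of AR envelope-carrier decomposition of one sub-band."""
    n = len(subband)
    c = dct2(subband)                        # frequency-domain coefficients
    r = np.correlate(c, c, 'full')[n - 1:] / n   # sample autocorrelation
    a, err = levinson(r, order)
    h = np.fft.rfft(a, 2 * n)                # A(e^{jw}) on a dense grid
    env = err / np.abs(h[:n]) ** 2           # squared-magnitude envelope (time axis)
    carrier = subband / np.sqrt(env + 1e-12)
    return env, carrier
```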

    Disentangling Prosody Representations With Unsupervised Speech Reconstruction

    Leyuan Qu, Taihao Li, Cornelius Weber, Theresa Pekarek-Rosin, et al.
    pp. 39-54
    Abstract: Human speech can be characterized by different components, including semantic content, speaker identity, and prosodic information. Significant progress has been made in disentangling representations for semantic content and speaker identity in speech recognition and speaker verification tasks, respectively. However, extracting prosodic information remains an open and challenging problem, because of the intrinsic association of different attributes, such as timbre and rhythm, and because of the need for supervised training schemes to achieve robust speech recognition. The aim of this article is to address the disentanglement of emotional prosody based on unsupervised reconstruction. Specifically, we identify, design, implement, and integrate three crucial components in our proposed model, Prosody2Vec: (1) a unit encoder that transforms speech signals into discrete units for semantic content, (2) a pretrained speaker verification model that generates speaker identity embeddings, and (3) a trainable prosody encoder that learns prosody representations. We first pretrain Prosody2Vec on unlabelled emotional speech corpora, then fine-tune the model on specific datasets to perform Speech Emotion Recognition (SER) and Emotional Voice Conversion (EVC) tasks. Both objective and subjective evaluations on the EVC task suggest that Prosody2Vec effectively captures general prosodic features that can be smoothly transferred to other emotional speech. In addition, our SER experiments on the IEMOCAP dataset reveal that the prosody features learned by Prosody2Vec are complementary and beneficial to the performance of widely used speech pretraining models, and surpass state-of-the-art methods when Prosody2Vec is combined with HuBERT representations. Audio samples can be found on our demo website.

    Data-Driven Non-Intrusive Speech Intelligibility Prediction Using Speech Presence Probability

    Mathias Bach Pedersen, Søren Holdt Jensen, Zheng-Hua Tan, Jesper Jensen, et al.
    pp. 55-67
    Abstract: Time-consuming Speech Intelligibility (SI) listening tests with human subjects can be replaced by algorithmic SI predictors. In recent years, data-driven SI predictors have shown promising results. A major limiting factor in the advancement of data-driven SI prediction is the scarcity of SI listening test data available to train the data-driven methods. In this article we propose a data-driven SI predictor that does not require access to an underlying noise-free reference signal, i.e., it is non-intrusive, and that does not require listening test data for training. Instead, the proposed method exploits a hypothesized link between SI and Speech Presence Probability (SPP). We show that a neural network can be trained on easily obtainable speech-in-additive-noise data to estimate SPP, and that a simple post-processing stage can be applied to map the estimated SPP to SI predictions with high accuracy. The proposed method is evaluated and compared to other state-of-the-art non-intrusive SI predictors, and achieves the highest performance even in the presence of processed noisy speech, which the SPP estimator has not been trained on.
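The "simple post-processing stage" mapping SPP to an SI prediction could, for instance, pool frame-wise SPP over an utterance and pass the result through a fitted logistic curve. The sketch below is purely hypothetical: the pooling choice and the parameters `a` (slope) and `b` (midpoint) are assumptions, not the paper's actual mapping, and would in practice be fit on anchor data.

```python
import numpy as np

def si_from_spp(spp_frames, a=10.0, b=0.5):
    """Map frame-wise speech presence probabilities to an SI score in (0, 1).

    Hypothetical post-processing: average the SPP over time, then apply a
    logistic function with slope `a` and midpoint `b`.
    """
    m = float(np.mean(spp_frames))
    return 1.0 / (1.0 + np.exp(-a * (m - b)))
```

The key property such a mapping needs is monotonicity: utterances with higher average speech presence receive higher predicted intelligibility.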

    Cooperative Scene-Event Modelling for Acoustic Scene Classification

    Yuanbo Hou, Bo Kang, Andrew Mitchell, Wenwu Wang, et al.
    pp. 68-82
    Abstract: Acoustic scene classification (ASC) can be helpful for creating context awareness for intelligent robots. Humans naturally use the relations between acoustic scenes (AS) and audio events (AE) to understand and recognize their surrounding environments. However, in most previous works, ASC and audio event classification (AEC) are treated as independent tasks, with a focus primarily on audio features shared between scenes and events, not on their implicit relations. To address this limitation, we propose a cooperative scene-event modelling (cSEM) framework that automatically models the intricate scene-event relation with an adaptive coupling matrix to improve ASC. Compared with other scene-event modelling frameworks, the proposed cSEM offers the following advantages. First, it reduces confusion between similar scenes by aligning the information of coarse-grained AS and fine-grained AE in the latent space and by reducing the redundant information between the AS and AE embeddings. Second, it exploits the relation information between AS and AE to improve ASC, which is shown to be beneficial even when the AE information is derived from unverified pseudo-labels. Third, it uses a regression-based loss function for cooperative modelling of scene-event relations, which is shown to be more effective than classification-based loss functions. Instantiated in four models based on either Transformers or convolutional neural networks, cSEM is evaluated on real-life and synthetic datasets. Experiments show that cSEM-based models work well in real-life scene-event analysis, offering competitive ASC results compared with other multi-feature or multi-model ensemble methods. The ASC accuracy achieved on the TUT2018, TAU2019, and JSSED datasets is 81.0%, 88.9%, and 97.2%, respectively.
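The core coupling idea — converting audio-event evidence into scene evidence through a learned matrix — can be sketched in a few lines. The matrix `coupling` (n_scenes × n_events) stands in for the paper's adaptive coupling matrix, which is learned jointly with the networks; the shapes and softmax normalisation here are illustrative assumptions.

```python
import numpy as np

def scene_from_events(event_post, coupling):
    """Sketch of a scene-event coupling step.

    event_post: (n_events,) audio-event posteriors (possibly pseudo-labels)
    coupling:   (n_scenes, n_events) scene-event relation matrix
    Returns a probability distribution over scenes.
    """
    logits = coupling @ event_post           # scene evidence via the relation
    e = np.exp(logits - logits.max())        # numerically stable softmax
    return e / e.sum()
```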

    Exploring Scope Detection for Aspect-Based Sentiment Analysis

    Xiaotong Jiang, Peiwen You, Chen Chen, Zhongqing Wang, et al.
    pp. 83-94
    Abstract: Aspect-based sentiment analysis (ABSA) aims to extract aspect terms from review text and to predict the polarity towards each aspect term. Although neural models have achieved competitive results, there are still many challenges in this task. First, review text contains irrelevant and noisy information, and the boundary offsets of aspect terms are hard to determine. In addition, sentiment is often expressed implicitly or shifted by negation and rhetorical words. To tackle these limitations, we propose a scope detection model that distinguishes whether words in the review text are relevant to the aspect term and filters out irrelevant and noisy information. In addition, we investigate a biaffine-based model to constrain the scope detection process of aspect term extraction. We further generate a simplified clause based on the scope of the aspect term and predict the polarity from the simplified clause. Empirical studies show the effectiveness of our proposed model over several strong baselines. The results also justify the importance of scope detection for aspect-based sentiment analysis.
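A biaffine scorer, as mentioned above, rates token pairs with a bilinear form S[i, j] = hᵢᵀ W hⱼ; in a scope-detection setting the score can be read as how strongly token j belongs to the scope headed by token i. The sketch below shows only this scoring form under assumed shapes, not the paper's full model (W would be learned, and real biaffine layers usually append a bias feature to each token vector).

```python
import numpy as np

def biaffine_scores(H, W, bias=0.0):
    """Biaffine pairwise scoring sketch.

    H: (T, d) token representations; W: (d, d) learned interaction matrix.
    Returns S of shape (T, T) with S[i, j] = h_i^T W h_j + bias.
    """
    return H @ W @ H.T + bias
```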

    Beyond the Status Quo: A Contemporary Survey of Advances and Challenges in Audio Captioning

    Xuenan Xu, Zeyu Xie, Mengyue Wu, Kai Yu, et al.
    pp. 95-112
    Abstract: Automated audio captioning (AAC), a task that mimics human perception and innovatively links audio processing with natural language processing, has seen much progress over the last few years. AAC requires recognizing content such as the environment, sound events, and the temporal relationships between sound events, and describing these elements in a fluent sentence. Currently, an encoder-decoder-based deep learning framework is the standard approach to this problem. Many works have proposed novel network architectures and training schemes, including extra guidance, reinforcement learning, audio-text self-supervised learning, and diverse or controllable captioning. Effective data augmentation techniques, especially those based on large language models, have also been explored. Benchmark datasets and AAC-oriented evaluation metrics further accelerate progress in this field. This article presents a comprehensive survey covering the comparison between AAC and related tasks, existing deep learning techniques, datasets, and evaluation metrics in AAC, with insights provided to guide potential future research directions.

    Deep Prior-Based Audio Inpainting Using Multi-Resolution Harmonic Convolutional Neural Networks

    Federico Miotello, Mirco Pezzoli, Luca Comanducci, Fabio Antonacci, et al.
    pp. 113-123
    Abstract: In this manuscript, we propose a novel method for audio inpainting, i.e., the restoration of audio signals presenting multiple missing parts. Audio inpainting can be interpreted, in the context of inverse problems, as the task of reconstructing an audio signal from its corrupted observation. For this reason, our method is based on the deep prior approach, a recently proposed technique that has proved effective for many inverse problems, including image inpainting. Deep prior treats the structure of a neural network as an implicit prior and adopts it as a regularizer. Unlike the classical deep learning paradigm, deep prior performs single-element training, and can thus be applied to corrupted audio signals independently of any available training dataset. In the context of audio inpainting, a network embodying relevant audio priors can generate a restored version of an audio signal when provided only with its corrupted observation. Our method exploits a time-frequency representation of audio signals and uses a multi-resolution convolutional autoencoder, enhanced to perform the harmonic convolution operation. Results show that the proposed technique provides a coherent and meaningful reconstruction of the corrupted audio and outperforms the methods considered for comparison within its domain of application.

    Decomposition-Based Wiener Filter Using the Kronecker Product and Conjugate Gradient Method

    Cristian-Lucian Stanciu, Jacob Benesty, Constantin Paleologu, Ruxandra-Liana Costea, et al.
    pp. 124-138
    Abstract: The identification of long-length impulse responses represents a challenge in the context of many applications, like echo cancellation. Recently, the problem has been addressed in the framework of low-rank systems, using a decomposition of the impulse response based on the nearest Kronecker product and low-rank approximations. As a result, the original system identification problem, which involves a long-length finite impulse response filter, is reshaped as a combination of two (much) shorter filters, which leads to significant advantages. In this context, the benchmark Wiener filter can be formulated as an iterative algorithm in which the estimates of the two component filters are sequentially updated. However, matrix inversion operations are required within this algorithm. In this article, we develop a new version of the decomposition-based iterative Wiener filter, which relies on the conjugate gradient (CG) method and avoids matrix inversion. Simulations performed in the framework of echo cancellation indicate the good performance of the proposed solution, which outperforms the conventional Wiener filter (implemented using CG updates) and inherits the advantages of the decomposition-based approach.
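The two building blocks named in the abstract can each be sketched in a few lines: (1) the nearest-Kronecker-product decomposition, which in its standard form reduces to a truncated SVD of the reshaped impulse response, and (2) a CG solver for the Wiener-Hopf normal equations R w = p that avoids explicit inversion of the covariance matrix R. These sketches show the standard textbook forms, not the paper's exact algorithm; function names and the rank/iteration parameters are assumptions.

```python
import numpy as np

def nearest_kronecker(h, L1, L2, rank=1):
    """Low-rank nearest-Kronecker-product decomposition of an impulse response.

    Reshapes the length-L1*L2 vector h into an L2 x L1 matrix and keeps the
    top `rank` SVD components, so that h ~ sum_p kron(h1_p, h2_p),
    with h1_p of length L1 and h2_p of length L2.
    """
    H = h.reshape(L1, L2).T
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    return [(np.sqrt(s[p]) * Vt[p], np.sqrt(s[p]) * U[:, p])
            for p in range(rank)]

def conjugate_gradient(R, p, n_iter=50, tol=1e-10):
    """Solve R w = p (Wiener-Hopf normal equations) by conjugate gradients.

    R must be symmetric positive definite (a covariance matrix is);
    no matrix inversion is performed.
    """
    w = np.zeros_like(p)
    r = p - R @ w                 # residual
    d = r.copy()                  # search direction
    rs = r @ r
    for _ in range(n_iter):
        Rd = R @ d
        alpha = rs / (d @ Rd)
        w += alpha * d
        r -= alpha * Rd
        rs_new = r @ r
        if rs_new < tol:
            break
        d = r + (rs_new / rs) * d
        rs = rs_new
    return w
```

With a full-rank decomposition the Kronecker reconstruction is exact; the decomposition-based filter exploits the fact that, for low-rank echo paths, a small `rank` already captures the response while each component filter is far shorter than the original.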