Publications

Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation

Published in Proc. Interspeech, 2024

In this study, we introduce a timed text-based regularization method to enhance speech separation models by aligning audio and word embeddings using pretrained WavLM and BERT models, leading to improved separation performance without needing auxiliary text data during testing.

Download here

On the Importance of Neural Wiener Filter for Resource Efficient Multichannel Speech Enhancement

Published in Proc. ICASSP, 2024

In this work, we present a low-latency, computationally efficient time-domain framework for multichannel speech enhancement, featuring two compact deep neural networks (DNNs) surrounding a multichannel neural Wiener filter (NWF), which together achieve superior performance with fewer parameters and reduced computational demands.

Download here

Inference and Denoise: Causal Inference-based Neural Speech Enhancement

Published in MLSP, 2023

This study introduces a causal inference-based speech enhancement (CISE) framework that models noise presence as an intervention, using a noise detector and mask-based enhancement modules to perform noise-conditional speech enhancement, demonstrating improved performance and efficiency compared to non-causal and more complex SE models.

Download here

Improving Perceptual Quality by Phone-Fortified Perceptual Loss using Wasserstein Distance for Speech Enhancement

Published in Proc. Interspeech, 2021

In this study, we propose a phone-fortified perceptual loss (PFPL) for speech enhancement, leveraging phonetic information from the wav2vec model and utilizing the Wasserstein distance to improve speech quality and intelligibility, demonstrating superior performance compared to signal-level losses on standardized evaluations.

Download here

Boosting Objective Scores of a Speech Enhancement Model by MetricGAN Post-processing

Published in APSIPA ASC, 2020

In this study, we apply a modified Transformer architecture to speech enhancement by replacing positional encoding with convolutional layers and fine-tuning the model using a MetricGAN framework to boost perceptual quality (PESQ) scores, achieving significant improvements over the baseline in both subjective and objective evaluations on the DNS challenge datasets.

Download here

WaveCRN: An Efficient Convolutional Recurrent Neural Network for End-to-end Speech Enhancement

Published in IEEE Signal Processing Letters, 2020

In this letter, we propose an efficient end-to-end speech enhancement model, WaveCRN, which combines a CNN module for capturing speech locality features with a stacked SRU module for modeling sequential properties, using a novel restricted feature masking approach to achieve state-of-the-art performance with reduced complexity and faster inference.

Download here

Tsun-An Hsieh
