Multimodal Representation Loss Between Timed Text and Audio for Regularized Speech Separation
Published in Proc. Interspeech, 2024
In this study, we introduce a timed text-based regularization method for speech separation models. It aligns audio embeddings from a pretrained WavLM model with word embeddings from a pretrained BERT model during training, which improves separation performance without requiring any auxiliary text data at test time.
Download here
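
A minimal sketch of the idea, assuming mean-pooled WavLM frame embeddings, mean-pooled BERT token embeddings, a learned linear projection into a shared space, and a cosine-distance alignment term; these specific choices (pooling, projection, loss form) are illustrative assumptions, not the exact formulation from the paper.

```python
# Illustrative sketch: alignment regularizer between audio and timed-text embeddings.
# Pooling, projection, and loss form are assumptions made for this example.
import torch
import torch.nn.functional as F
from transformers import WavLMModel, BertModel, BertTokenizer

wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base")
bert = BertModel.from_pretrained("bert-base-uncased")
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# Hypothetical learned projection from the audio space to the text space.
audio_to_text = torch.nn.Linear(wavlm.config.hidden_size, bert.config.hidden_size)

def alignment_loss(separated_wav: torch.Tensor, transcript: str) -> torch.Tensor:
    """Cosine-distance between pooled audio and text embeddings (illustrative)."""
    # Audio embedding: mean over WavLM frame representations of the separated signal.
    audio_emb = wavlm(separated_wav).last_hidden_state.mean(dim=1)   # (B, D_audio)
    audio_emb = audio_to_text(audio_emb)                             # (B, D_text)

    # Text embedding: mean over BERT token representations of the transcript.
    tokens = tokenizer(transcript, return_tensors="pt")
    text_emb = bert(**tokens).last_hidden_state.mean(dim=1)         # (1, D_text)

    # 1 - cosine similarity, averaged over the batch.
    return (1.0 - F.cosine_similarity(audio_emb, text_emb)).mean()

# Training-only usage: the regularizer is added to the usual separation objective,
# so no text is needed at inference time.
# total_loss = separation_loss + lambda_reg * alignment_loss(est_source, transcript)
```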