Abstract. This paper proposes a new unsupervised audio-visual speech enhancement (AVSE) approach that combines a diffusion-based audio-visual speech generative model with a non-negative matrix factorization (NMF) noise model. First, the diffusion model is pre-trained on clean speech, conditioned on the corresponding video data, to learn the generative distribution of clean speech. This pre-trained model is then paired with the NMF-based noise model to iteratively estimate the clean speech. Specifically, a diffusion-based posterior sampling approach is implemented within the reverse diffusion process: after each iteration, a speech estimate is obtained and used to update the noise parameters. Experimental results confirm that the proposed AVSE approach not only outperforms its audio-only counterpart but also generalizes better than a recent supervised generative AVSE method. Additionally, the new inference algorithm offers a better trade-off between inference speed and enhancement performance than the previous diffusion-based method, UDiffSE [1].
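To make the alternating inference loop concrete, below is a minimal, self-contained Python sketch of the idea: each reverse-diffusion step combines the score of the pre-trained conditional speech model with a Gaussian likelihood gradient weighted by the current NMF noise variance, and the resulting speech estimate is then used to refresh the NMF parameters. Everything here is illustrative: `score_model`, `nmf_update`, and `avse_posterior_sampling` are hypothetical names, the score network is stubbed with zeros, the data are toy real-valued arrays rather than complex STFT coefficients, and the Euclidean NMF updates with a simple Langevin-style step stand in for the actual SDE solver and divergence used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)


def score_model(x, t, v_emb):
    """Stub for the pre-trained audio-visual score network s_theta(x_t, t, v).
    Returns zeros so the sketch runs without trained weights."""
    return np.zeros_like(x)


def nmf_update(V, W, H, eps=1e-8):
    """One multiplicative update of the factorization V ~= W @ H
    (Euclidean flavor shown for brevity)."""
    H *= (W.T @ V) / (W.T @ (W @ H) + eps)
    W *= (V @ H.T) / ((W @ H) @ H.T + eps)
    return W, H


def avse_posterior_sampling(y, v_emb, n_steps=30, rank=4, step_size=0.05):
    """Sketch of the alternating loop: one reverse-diffusion posterior-sampling
    step for the speech, then an NMF update of the noise parameters.

    y     : (F, T) noisy observation (toy real-valued "spectrogram")
    v_emb : visual conditioning, forwarded to the score stub
    """
    F, T = y.shape
    W = rng.random((F, rank)) + 0.1      # noise spectral patterns
    H = rng.random((rank, T)) + 0.1      # noise activations
    x = rng.standard_normal((F, T))      # initial speech sample x_T

    for k in range(n_steps, 0, -1):
        t = k / n_steps
        noise_psd = W @ H                # current NMF noise variance estimate
        # Posterior score = prior score from the conditional diffusion model
        # plus the Gaussian likelihood gradient (y - x) / noise variance.
        score = score_model(x, t, v_emb) + (y - x) / (noise_psd + 1e-8)
        x = x + step_size * score \
            + np.sqrt(2 * step_size) * t * rng.standard_normal((F, T))
        # Re-fit the NMF noise model to the residual power of the estimate.
        residual_power = np.maximum((y - x) ** 2, 1e-8)
        W, H = nmf_update(residual_power, W, H)

    return x, W @ H                      # speech estimate and noise PSD


# Toy usage: random observation, no real visual features.
y = rng.standard_normal((64, 40))
x_hat, noise_psd = avse_posterior_sampling(y, v_emb=None)
```

The point the sketch preserves is the alternation itself: speech sampling and noise-parameter updates interleave within a single reverse pass, which is what keeps the method unsupervised with respect to noise.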
## Speech Samples
Sample (id_noise_SNR[dB]_file) | Clean | Noisy | UDiffSE [1] | AV-UDiffSE | AO-UDiffSE+ (Ours) | AV-UDiffSE+ (Ours) | FlowAVSE [2] |
---|---|---|---|---|---|---|---|
09F_SPSQUARE_-5_sx374 | |||||||
24M_STRAFFIC_-5_sx10 | |||||||
26M_TBUS_-5_sx216 | |||||||
27M_STRAFFIC_-5_sx410 | |||||||
27M_TMETRO_-5_si1759 | |||||||
33F_OOFFICE_-5_si1477 | |||||||
33F_SPSQUARE_-5_sx395 | |||||||
40F_TBUS_5_sx388 | |||||||
49F_TMETRO_-5_sx409 | |||||||
56M_OOFFICE_-5_sx435 |
Sample (id_noise_SNR[dB]_file) | Clean | Noisy | UDiffSE [1] | AV-UDiffSE | AO-UDiffSE+ (Ours) | AV-UDiffSE+ (Ours) | FlowAVSE [2] |
---|---|---|---|---|---|---|---|
0ZfSOArXbGQ_Cafe_-5_00003 | |||||||
1bnzVjOJ6NM_LR_-5_00017 | |||||||
95ovIJ3dsNk_LR_-5_00006 | |||||||
9uOMectkCCs_Babble_-5_00001 | |||||||
fxbCHn6gE3U_Babble_-5_00007 | |||||||
Li4S1yyrsTI_White_-5_00009 | |||||||
Mt0PiXLvYlU_Cafe_-5_00009 | |||||||
Mt0PiXLvYlU_Car_-5_00011 | |||||||
SE97Kgi0sR4_White_5_00002 | |||||||
YyXRYgjQXX0_Car_-5_00002 |
[1] Berné Nortier, Mostafa Sadeghi, and Romain Serizel, “Unsupervised Speech Enhancement with Diffusion-based Generative Models,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024.
[2] Chaeyoung Jung, Suyeon Lee, Ji-Hoon Kim, and Joon Son Chung, “FlowAVSE: Efficient Audio-Visual Speech Enhancement Models with Conditional Flow Matching,” in Interspeech, 2024.