CASC-XVC: Zero-Shot Cross-Lingual Voice Conversion with Content Accordant and Speaker Contrastive Losses

Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling*

National Engineering Research Center of Speech and Language Information Processing,
University of Science and Technology of China, Hefei, P.R. China

{ghj2001,redmist,zysheng}@mail.ustc.edu.cn, {lipchen,yangai,zhling}@ustc.edu.cn

Abstract

Cross-lingual voice conversion (XVC) modifies speaker identity while preserving linguistic content in scenarios where the source and target speakers use different languages. Previous non-parallel disentanglement-based methods suffer from severe training-testing inconsistency in XVC tasks because of language mismatch and the lack of multilingual parallel data, which inevitably compromises the quality of the synthesized speech. In this paper, we propose CASC-XVC, a zero-shot XVC method that incorporates content accordant (CA) and speaker contrastive (SC) losses. Specifically, the method adopts FreeVC-s as its backbone. We design a cross-lingual fine-tuning process that uses pairs of utterances from speakers of different languages to update the modules used at the inference stage. Because no true parallel targets exist during fine-tuning, a CA loss and an SC loss are introduced to supervise the converted speech instead. Moreover, we use self-supervised learning (SSL) representations shared across languages, together with information perturbation, for content disentanglement. Both subjective and objective results on a bilingual (English and Chinese) dataset demonstrate that our approach achieves significant improvements in XVC tasks.
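The abstract does not give the exact form of the CA and SC losses, so the following is only a rough illustration of the general idea, not the formulation used in CASC-XVC. It assumes that the SC loss resembles a generic InfoNCE-style contrastive objective over speaker embeddings and that the CA loss is a mean L1 distance between frame-level content features of the source and converted speech. All function names, the `temperature` parameter, and the toy embeddings are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def speaker_contrastive_loss(converted, target, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumption, not the paper's definition).

    Pulls the converted-speech speaker embedding toward the target speaker
    embedding and pushes it away from embeddings of other speakers.
    """
    pos = cosine(converted, target) / temperature
    logits = [pos] + [cosine(converted, n) / temperature for n in negatives]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos


def content_accordant_loss(source_content, converted_content):
    """Mean L1 distance between frame-level content features (an assumption).

    Both arguments are lists of frames; each frame is a list of floats.
    """
    total, count = 0.0, 0
    for frame_s, frame_c in zip(source_content, converted_content):
        for a, b in zip(frame_s, frame_c):
            total += abs(a - b)
            count += 1
    return total / count


# Toy usage with 2-D "speaker embeddings" and one content frame.
sc_good = speaker_contrastive_loss([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]])
sc_bad = speaker_contrastive_loss([1.0, 0.0], [0.0, 1.0], [[1.0, 0.0]])
ca = content_accordant_loss([[0.0, 0.0]], [[1.0, 3.0]])
```

In this toy setup the SC value is small when the converted embedding matches the target speaker and large when it matches a negative speaker, while the CA value grows as the converted content features drift from the source. In an actual fine-tuning loop the two terms would be computed on learned features and combined with weights that this sketch does not attempt to guess.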

English to Chinese

Source: p246 | CyclePPG-XVC | YourTTS | AutoCycle-VC | CASC-XVC (ours)

Chinese to English

Source: S0660 | CyclePPG-XVC | YourTTS | AutoCycle-VC | CASC-XVC (ours)

English to English

Source: p233 | CyclePPG-XVC | YourTTS | AutoCycle-VC | CASC-XVC (ours)

Chinese to Chinese

Source: S0019 | CyclePPG-XVC | YourTTS | AutoCycle-VC | CASC-XVC (ours)