Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling*
        National Engineering Research Center of Speech and Language Information Processing, University of Science and Technology of China, Hefei, P.R. China
        {ghj2001,redmist,zysheng}@mail.ustc.edu.cn, {lipchen,yangai,zhling}@ustc.edu.cn
        
     
    
        Abstract
        Cross-lingual voice conversion (XVC) is a technology that modifies speaker identity while preserving linguistic content in scenarios where the source and target speakers use different languages. Previous non-parallel disentanglement-based methods face severe training-testing inconsistency issues in XVC tasks due to language mismatch and the lack of multilingual parallel data, which inevitably compromise the quality of the synthesized speech. In this paper, we propose CASC-XVC, a zero-shot XVC method that incorporates content accordant (CA) and speaker contrastive (SC) losses. Specifically, this method adopts FreeVC-s as its backbone. We design a cross-lingual fine-tuning process that employs pairs of utterances from speakers of different languages to update the modules used in the inference stage. A CA loss and an SC loss are introduced to deal with the lack of true parallel targets in the fine-tuning process. Moreover, we use self-supervised learning (SSL) representations shared across languages, along with information perturbation, for content disentanglement. Both subjective and objective results on a bilingual (English and Chinese) dataset demonstrate that our approach achieves significant improvements in XVC tasks.