Han-Jie Guo, Hui-Peng Du, Zheng-Yan Sheng, Li-Ping Chen, Yang Ai, Zhen-Hua Ling*
National Engineering Research Center of Speech and Language Information Processing,
University of Science and Technology of China, Hefei, P.R. China
{ghj2001,redmist,zysheng}@mail.ustc.edu.cn
{lipchen,yangai,zhling}@ustc.edu.cn
Abstract
Cross-lingual voice conversion (XVC) is a technology that modifies speaker identity while preserving
linguistic content in scenarios where the source and target speakers use different languages. Previous
non-parallel disentanglement-based methods suffer from severe training-testing inconsistency in XVC tasks
because of language mismatch and the lack of multilingual parallel data, which inevitably compromises the
quality of the synthesized speech. In this paper, we propose CASC-XVC, a zero-shot XVC method that
incorporates content accordant (CA) and speaker contrastive (SC) losses. Specifically, this method adopts
the framework of FreeVC-s as the backbone. We design a cross-lingual fine-tuning process that employs pairs
of utterances from speakers of different languages to update the modules used in the inference stage. A CA
loss and an SC loss are introduced to compensate for the lack of true parallel targets in the fine-tuning
process. Moreover, we use self-supervised learning (SSL) representations shared across languages, along
with information perturbation, for content disentanglement. Both subjective and objective results on a
bilingual (English and Chinese) dataset demonstrate that our approach achieves significant improvements in
XVC tasks.