Copyright 2024 IEEE. Accepted to ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), scheduled for 14-19 April 2024 in Seoul, Korea. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution to servers or lists, or to reuse any copyrighted component of this work in other works, must be obtained from the IEEE. Contact: Manager, Copyrights and Permissions / IEEE Service Center / 445 Hoes Lane / P.O. Box 1331 / Piscataway, NJ 08855-1331, USA. Telephone: + Intl. 908-562-3966.
CREATING PERSONALIZED SYNTHETIC VOICES FROM ARTICULATION IMPAIRED SPEECH USING AUGMENTED RECONSTRUCTION LOSS
Yusheng Tian, Jingyu Li, Tan Lee
Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong SAR
ABSTRACT

This research is about the creation of personalized synthetic voices for head and neck cancer survivors. It focuses particularly on tongue cancer patients whose speech may exhibit severe articulation impairment. Our goal is to restore normal articulation in the synthesized speech while maximally preserving the target speaker's individuality in terms of both voice timbre and speaking style. This is formulated as a task of learning from noisy labels. We propose to augment the commonly used speech reconstruction loss with two additional terms. The first term is a regularization loss that mitigates the impact of distorted articulation in the training speech. The second term is a consistency loss that encourages correct articulation in the generated speech. These additional loss terms are obtained from frame-level articulation scores of original and generated speech, which are derived using a separately trained phone classifier. Experimental results on a real case of a tongue cancer patient confirm that the synthetic voice achieves articulation quality comparable to unimpaired natural speech, while effectively maintaining the target speaker's individuality. Audio samples are available at https://myspeechproject.github.io/ArticulationRepair/.

Index Terms— Personalized speech synthesis, articulation disorder, learning from noisy labels

1. INTRODUCTION

Tongue cancer is a prevalent form of head and neck cancer, and its incidence rate has been rising in recent years [1]. For advanced or recurrent tongue cancer, surgical intervention may involve the removal of both the tongue and the larynx [2, 3]. As a consequence, the patient permanently loses the ability to speak. Voice is not only an important means of communication, but also an integral part of a person's identity. Personalized text-to-speech (TTS) was proposed to enable people with vocal disabilities to communicate using their own voices [4]. It has been shown that using personalized TTS as an alternative communication method can significantly improve the quality of life of laryngectomees [5].

Personalized TTS models are trained with natural speech from the target speaker. A major challenge in creating personalized synthetic voices for tongue cancer survivors is that the training speech is often impaired. Both the tumor and the treatment process can damage the tongue, resulting in impaired speech production. Such impairments occur most commonly at the articulation level [6, 7]. A TTS model trained with articulation-impaired speech would generate synthetic speech containing similar types of impairment, which may not meet the intelligibility requirement for communication. This motivates the present study on creating personalized synthetic speech from articulation-impaired training data. Our goal is to restore normal articulation in the synthesized speech while maximally maintaining the target speaker's individuality. We consider the target speaker's individuality to be well maintained if both the voice timbre and the good aspects of the original speaking style are kept.

From the machine learning perspective, the above goal can be formulated as a task of learning from noisy labels [8–11]. This formulation is justified by the fact that the degree of articulation impairment in the speech of a tongue cancer patient varies across different types of speech sounds [12]: some sounds remain largely unaffected and can be viewed as having clean labels, while those with distorted articulation have noisy labels. Inspired by the re-weighting approach [13, 14] and the consistency constraint approach [15] developed in studies of learning with noisy labels, we propose to augment the conventional speech reconstruction loss in TTS model training with two additional terms. The first term is a regularization loss that mitigates the negative impact of distorted articulation in the training speech. The second term is a consistency loss that promotes accurate articulation in the output speech. Specifically, a separately trained phone classifier is incorporated during training to provide frame-level articulation scores for both original and generated speech. The articulation score of the original speech serves as the re-weighting criterion for deriving the regularization loss, while the articulation score of the generated speech quantifies the inconsistency between the phone classifier and the TTS model, yielding the consistency loss.
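To make the training objective concrete, below is a minimal PyTorch-style sketch of how such an augmented loss could be assembled. It illustrates the general re-weighting and consistency ideas described above rather than the paper's exact formulation (which is detailed later); the classifier interface, the choice of L1 reconstruction error, the weighting scheme, and the weights lambda_reg and lambda_con are all assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def augmented_reconstruction_loss(
    mel_target,        # (T, n_mels) ground-truth mel frames of the recorded speech
    mel_generated,     # (T, n_mels) mel frames predicted by the TTS model
    phone_labels,      # (T,) intended phone per frame, e.g. from forced alignment
    phone_classifier,  # frozen, separately trained frame-level phone classifier
    lambda_reg=1.0,    # hypothetical weight of the regularization term
    lambda_con=1.0,    # hypothetical weight of the consistency term
):
    """Sketch: reconstruction loss augmented with a regularization term and
    a consistency term derived from frame-level articulation scores."""
    # Conventional reconstruction loss, averaged over all frames.
    loss_rec = F.l1_loss(mel_generated, mel_target)

    with torch.no_grad():
        # Articulation score of the ORIGINAL speech: the classifier's posterior
        # probability of the intended phone at each frame. Low scores flag
        # frames with distorted articulation, i.e. noisy labels.
        post_orig = phone_classifier(mel_target).softmax(dim=-1)       # (T, n_phones)
        score_orig = post_orig.gather(1, phone_labels[:, None])[:, 0]  # (T,)

    # Regularization term: re-weight the per-frame reconstruction error by the
    # articulation score, so that well-articulated frames dominate and frames
    # with distorted articulation contribute little.
    per_frame_err = (mel_generated - mel_target).abs().mean(dim=-1)    # (T,)
    loss_reg = (score_orig * per_frame_err).mean()

    # Consistency term: the frozen classifier should recognize the intended
    # phone in the GENERATED speech; gradients flow back into the TTS model.
    logits_gen = phone_classifier(mel_generated)                       # (T, n_phones)
    loss_con = F.cross_entropy(logits_gen, phone_labels)

    return loss_rec + lambda_reg * loss_reg + lambda_con * loss_con
```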
The proposed approach is validated on a real patient case, the same as the one reported in [16]. A personalized synthetic voice is built for a female Cantonese speaker who was advised to undergo laryngectomy for recurrent tongue cancer. The patient had already undergone a partial glossectomy six years earlier, in which about three quarters of her tongue was surgically removed. As a result, she had difficulties in producing certain speech sounds. The synthetic voice is developed from the articulation-impaired speech of this patient using the augmented reconstruction loss. Objective and subjective evaluations are carried out to demonstrate the effectiveness of the proposed approach.