Model architectures to extrapolate emotional expressions in DNN-based text-to-speech

Inoue, Katsuki; Hara, Sunao; Abe, Masanobu; Hojo, Nobukatsu; Ijima, Yusuke

doi:10.1016/j.specom.2020.11.004

Permalink : https://ousar.lib.okayama-u.ac.jp/61445

ID	61445
FullText URL	k_inoue_journal.pdf 1.02 MB
Author	Inoue, Katsuki (Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University); Hara, Sunao (Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University); Abe, Masanobu (Graduate School of Interdisciplinary Science and Engineering in Health Systems, Okayama University); Hojo, Nobukatsu (NTT Corporation); Ijima, Yusuke (NTT Corporation)
Abstract	This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, to “extrapolate emotional expressions” means to borrow emotional expressions from other speakers, so that emotional speech uttered by the target speakers need not be collected. Although DNNs have the potential to support TTS with emotional expressions, and some DNN-based TTS systems have performed satisfactorily in expressing the diversity of human speech, such systems normally require emotional speech from the target speakers, which is troublesome to collect. To solve this issue, we propose architectures that train the speaker feature and the emotional feature separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and the hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations of the open-emotion test provide insufficient information, because they compare its performance with that of the closed-emotion test even though each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models could convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of >60%.
Keywords	Emotional speech synthesis Extrapolation DNN-based TTS Text-to-speech Acoustic model Phoneme duration model
Published Date	2021-02
Publication Title	Speech Communication
Volume	126
Publisher	Elsevier
Start Page	35
End Page	43
ISSN	0167-6393
NCID	AA10630135
Content Type	Journal Article
Language	English
OAI-PMH Set	Okayama University
File Version	author
DOI	10.1016/j.specom.2020.11.004
Web of Science KeyUT	000608358500004
Related Url	isVersionOf https://doi.org/10.1016/j.specom.2020.11.004
License	https://creativecommons.org/licenses/by-nc-nd/4.0/