start-ver=1.4 cd-journal=joma no-vol=126 cd-vols= no-issue= article-no= start-page=35 end-page=43 dt-received= dt-revised= dt-accepted= dt-pub-year=2021 dt-pub=202102 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=Model architectures to extrapolate emotional expressions in DNN-based text-to-speech en-subtitle= kn-subtitle= en-abstract= kn-abstract=This paper proposes architectures that facilitate the extrapolation of emotional expressions in deep neural network (DNN)-based text-to-speech (TTS). In this study, “extrapolating emotional expressions” means borrowing emotional expressions from other speakers, so that no emotional speech uttered by the target speakers needs to be collected. Although DNNs have the potential to construct TTS with emotional expressions, and some DNN-based TTS systems have demonstrated satisfactory performance in expressing the diversity of human speech, collecting emotional speech uttered by the target speakers is necessary and troublesome. To solve this issue, we propose architectures that train speaker features and emotional features separately and synthesize speech with any combination of speaker and emotion. The architectures are the parallel model (PM), the serial model (SM), the auxiliary input model (AIM), and hybrid models (PM&AIM and SM&AIM). These models are trained on emotional speech uttered by a few speakers and neutral speech uttered by many speakers. Objective evaluations demonstrate that the performances in the open-emotion test provide insufficient information: they are difficult to compare with those in the closed-emotion test because each speaker has their own manner of expressing emotion. However, subjective evaluation results indicate that the proposed models could convey emotional information to some extent. Notably, the PM can correctly convey sad and joyful emotions at a rate of over 60%.
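As a rough illustration of the auxiliary input model (AIM) idea described above, the following Python sketch assumes a simple feed-forward acoustic model in which one-hot speaker and emotion codes are appended to the linguistic input features; the class name, layer sizes, and feature dimensions are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of an auxiliary-input acoustic model (AIM):
# speaker and emotion codes are concatenated to the linguistic features,
# so speaker identity and emotion can be combined freely at synthesis time.
import torch
import torch.nn as nn

class AuxiliaryInputAcousticModel(nn.Module):
    def __init__(self, n_linguistic=300, n_speakers=100, n_emotions=4, n_acoustic=187):
        super().__init__()
        in_dim = n_linguistic + n_speakers + n_emotions  # linguistic features + one-hot codes
        self.net = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, n_acoustic),  # per-frame acoustic features (spectrum, F0, ...)
        )

    def forward(self, linguistic, speaker_onehot, emotion_onehot):
        # All inputs are per-frame tensors of shape (frames, dim).
        x = torch.cat([linguistic, speaker_onehot, emotion_onehot], dim=-1)
        return self.net(x)

# At synthesis time, a neutral-only speaker's code can be paired with an emotion
# code learned from other speakers, i.e., the emotion is "extrapolated".
```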
en-copyright= kn-copyright= en-aut-name=InoueKatsuki en-aut-sei=Inoue en-aut-mei=Katsuki kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= en-aut-name=HojoNobukatsu en-aut-sei=Hojo en-aut-mei=Nobukatsu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=4 ORCID= en-aut-name=IjimaYusuke en-aut-sei=Ijima en-aut-mei=Yusuke kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=5 ORCID= affil-num=1 en-affil=Graduate school of Interdisciplinary Science and Engineering in Health Systems, Okayama University kn-affil= affil-num=2 en-affil=Graduate school of Interdisciplinary Science and Engineering in Health Systems, Okayama University kn-affil= affil-num=3 en-affil=Graduate school of Interdisciplinary Science and Engineering in Health Systems, Okayama University kn-affil= affil-num=4 en-affil=NTT Corporation kn-affil= affil-num=5 en-affil=NTT Corporation kn-affil= en-keyword=Emotional speech synthesis kn-keyword=Emotional speech synthesis en-keyword=Extrapolation kn-keyword=Extrapolation en-keyword=DNN-based TTS kn-keyword=DNN-based TTS en-keyword=Text-to-speech kn-keyword=Text-to-speech en-keyword=Acoustic model kn-keyword=Acoustic model en-keyword=Phoneme duration model kn-keyword=Phoneme duration model END start-ver=1.4 cd-journal=joma no-vol=2019 cd-vols= no-issue= article-no= start-page=143 end-page=147 dt-received= dt-revised= dt-accepted= dt-pub-year=2019 dt-pub=201911 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=Speech-like Emotional Sound Generator by WaveNet en-subtitle= kn-subtitle= en-abstract= kn-abstract=In this paper, we propose a new algorithm to generate Speech-like Emotional Sound (SES). Emotional information plays an important role in human communication, and speech is one of the most useful media for expressing emotions. Although speech generally conveys emotional information as well as linguistic information, we take on the challenge of generating sounds that convey emotional information without any linguistic information; such non-verbal emotional vocalizations can make conversations in human-machine interaction more natural in some situations. We call the generated sounds “speech-like” because they do not contain any linguistic information. For this purpose, we propose to employ WaveNet as a sound generator conditioned only on emotion IDs. The idea is quite different from the WaveNet vocoder, which synthesizes speech using spectral information as auxiliary features. The biggest advantage of this idea is that it reduces the amount of emotional speech data required for training. The proposed algorithm consists of two steps. In the first step, WaveNet is trained on a large speech database to acquire phonetic features; in the second step, WaveNet is re-trained using a small amount of emotional speech. Subjective listening evaluations showed that the SES could convey emotional information and was judged to sound like a human voice.
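The two-step procedure described above could look roughly like the following Python sketch, assuming a toy autoregressive waveform model with a global emotion-ID embedding in place of a full WaveNet; the model, hyperparameters, and the `emotional_loader` data loader are placeholders, not the paper's implementation.

```python
# Hypothetical two-step training sketch: (1) pre-train on a large neutral
# speech corpus, (2) re-train (fine-tune) on a small emotional corpus,
# with a global emotion-ID embedding as the only conditioning input.
import torch
import torch.nn as nn

class TinyConditionedWaveModel(nn.Module):
    """Toy stand-in for WaveNet: dilated causal convolutions over 8-bit
    mu-law sample codes, globally conditioned on an emotion embedding."""
    def __init__(self, n_emotions=5, channels=64, n_layers=8, n_classes=256):
        super().__init__()
        self.embed = nn.Embedding(n_classes, channels)
        self.cond = nn.Embedding(n_emotions, channels)
        self.layers = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size=2, dilation=2 ** i)
            for i in range(n_layers)
        )
        self.out = nn.Conv1d(channels, n_classes, kernel_size=1)

    def forward(self, samples, emotion_id):
        # samples: (batch, time) integer codes; emotion_id: (batch,) integers
        x = self.embed(samples).transpose(1, 2)        # (batch, ch, time)
        g = self.cond(emotion_id).unsqueeze(-1)        # (batch, ch, 1)
        for conv in self.layers:
            pad = (conv.dilation[0], 0)                # causal left-padding
            x = torch.relu(conv(nn.functional.pad(x, pad)) + g)
        return self.out(x)                             # next-sample logits

# Step 1 (pre-training): train on a large corpus, e.g. with a single "neutral"
# emotion ID, so the model learns speech-like waveform structure.
# Step 2 (re-training): continue training on a small emotional corpus with
# per-utterance emotion IDs; `emotional_loader` is assumed to yield
# (samples, emotion_id) batches.
def finetune(model, emotional_loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for samples, emotion_id in emotional_loader:
            logits = model(samples[:, :-1], emotion_id)          # predict next sample
            loss = nn.functional.cross_entropy(logits, samples[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
```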
en-copyright= kn-copyright= en-aut-name=MatsumotoKento en-aut-sei=Matsumoto en-aut-mei=Kento kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= affil-num=1 en-affil=Okayama University kn-affil= affil-num=2 en-affil=Okayama University kn-affil= affil-num=3 en-affil=Okayama University kn-affil= END start-ver=1.4 cd-journal=joma no-vol= cd-vols= no-issue= article-no= start-page=2464 end-page=2468 dt-received= dt-revised= dt-accepted= dt-pub-year=2018 dt-pub=20180902 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=Naturalness Improvement Algorithm for Reconstructed Glossectomy Patient's Speech Using Spectral Differential Modification in Voice Conversion en-subtitle= kn-subtitle= en-abstract= kn-abstract= In this paper, we propose an algorithm to improve the naturalness of reconstructed glossectomy patients' speech, that is, speech generated by voice conversion (VC) to enhance the intelligibility of speech uttered by patients with a wide glossectomy. While existing VC algorithms can improve intelligibility and naturalness, the results are still not satisfactory. To solve the remaining problems, we propose to directly modify the speech waveforms using a spectral differential. The motivation is that glossectomy patients mainly have problems in their vocal tract, not in their vocal cords. The proposed algorithm requires no source-parameter extraction for speech synthesis, so no extraction errors are introduced and the original source characteristics are fully exploited. For spectrum conversion, we evaluate both GMM-based and DNN-based mappings. Subjective evaluations show that our algorithm can synthesize more natural speech than the vocoder-based method. Judging from observations of the spectrograms, the power in the high-frequency bands of fricatives and stops is reconstructed to be similar to that of natural speech.
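A simplified way to picture the waveform modification by a spectral differential is the STFT-domain Python sketch below: the source waveform is filtered frame by frame by the ratio of the converted magnitude spectrum to the source magnitude spectrum, so the original excitation is preserved and only the vocal-tract spectrum is altered. The paper's method works on mel-cepstral differentials rather than raw STFT magnitudes, so this is only an approximation; `apply_spectral_differential` and its parameters are hypothetical.

```python
# STFT-domain sketch of spectral-differential filtering: keep the source
# phase (excitation) and rescale each frame's magnitude toward the
# converted spectrum. Not the mel-cepstrum-based filtering of the paper.
import numpy as np
from scipy.signal import stft, istft

def apply_spectral_differential(source_wav, converted_spec_mag, fs=16000,
                                nperseg=512, noverlap=384, eps=1e-8):
    """source_wav: 1-D waveform; converted_spec_mag: target magnitude
    spectrogram with the same shape as the source STFT magnitude."""
    _, _, src_spec = stft(source_wav, fs=fs, nperseg=nperseg, noverlap=noverlap)
    gain = converted_spec_mag / (np.abs(src_spec) + eps)   # spectral differential
    modified = src_spec * gain                              # source phase is kept
    _, out_wav = istft(modified, fs=fs, nperseg=nperseg, noverlap=noverlap)
    return out_wav
```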
en-copyright= kn-copyright= en-aut-name=MurakamiHiroki en-aut-sei=Murakami en-aut-mei=Hiroki kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= en-aut-name=SatoMasaaki en-aut-sei=Sato en-aut-mei=Masaaki kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=4 ORCID= en-aut-name=MinagiShogo en-aut-sei=Minagi en-aut-mei=Shogo kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=5 ORCID= affil-num=1 en-affil= kn-affil= affil-num=2 en-affil=Graduate School of Natural Science and Technology, Okayama University kn-affil= affil-num=3 en-affil=Graduate School of Natural Science and Technology, Okayama University kn-affil= affil-num=4 en-affil=Graduate School of Medicine Dentistry and Pharmaceutical Sciences, Okayama University kn-affil= affil-num=5 en-affil=Graduate School of Medicine Dentistry and Pharmaceutical Sciences, Okayama University kn-affil= en-keyword=voice conversion kn-keyword=voice conversion en-keyword=speech intelligibility kn-keyword=speech intelligibility en-keyword=glossectomy kn-keyword=glossectomy en-keyword=spectral differential kn-keyword=spectral differential en-keyword=neural network kn-keyword=neural network END start-ver=1.4 cd-journal=joma no-vol= cd-vols= no-issue= article-no= start-page= end-page= dt-received= dt-revised= dt-accepted= dt-pub-year=2016 dt-pub=201607 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=Sound collection systems using a crowdsourcing approach to construct sound map based on subjective evaluation en-subtitle= kn-subtitle= en-abstract= kn-abstract=This paper presents a sound collection system that uses crowdsourcing to gather information for visualizing area characteristics. First, we developed a sound collection system to simultaneously collect physical sounds, their statistics, and subjective evaluations. We then conducted a sound collection experiment with 14 participants using the developed system. We collected 693,582 samples of equivalent A-weighted loudness levels and their locations, and 5,935 samples of sounds and their locations. The data also include subjective evaluations by the participants. In addition, we analyzed the changes in the sound properties of some areas before and after the opening of a large-scale shopping mall in a city. Next, we implemented visualizations on the server system to attract users’ interest. Finally, we published the system, which can receive sounds from any Android smartphone user. The sound data have been collected continuously, and the system has achieved the intended results.
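For reference, an equivalent loudness level over a measurement interval can be computed from calibrated samples as in the short Python sketch below; it assumes the signal has already been A-weighted and calibrated to sound pressure in pascals, which is a simplification of the smartphone-based measurement described above.

```python
# Minimal sketch: equivalent continuous sound level (Leq) over an interval,
# assuming `pressure` holds A-weighted, calibrated sound pressure values in
# pascals sampled uniformly; p0 is the 20 uPa reference pressure.
import numpy as np

def equivalent_level_db(pressure, p0=20e-6):
    return 10.0 * np.log10(np.mean(pressure ** 2) / p0 ** 2)
```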
en-copyright= kn-copyright= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=KobayashiShota en-aut-sei=Kobayashi en-aut-mei=Shota kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= affil-num=1 en-affil=Graduate school of Natural Science and Technology, Okayama University kn-affil=岡山大学大学院自然科学研究科 affil-num=2 en-affil=Graduate school of Natural Science and Technology, Okayama University kn-affil=岡山大学大学院自然科学研究科 affil-num=3 en-affil=Graduate school of Natural Science and Technology, Okayama University kn-affil=岡山大学大学院自然科学研究科 en-keyword=Environmental sound kn-keyword=Environmental sound en-keyword=Crowdsourcing kn-keyword=Crowdsourcing en-keyword=Loudness kn-keyword=Loudness en-keyword=Crowdedness kn-keyword=Crowdedness en-keyword=Smart City kn-keyword=Smart City END start-ver=1.4 cd-journal=joma no-vol= cd-vols= no-issue= article-no= start-page=223 end-page=226 dt-received= dt-revised= dt-accepted= dt-pub-year=2015 dt-pub=201512 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=A Spoken Dialog System with Redundant Response to Prevent User Misunderstanding en-subtitle= kn-subtitle= en-abstract= kn-abstract=We propose a spoken dialog strategy for car navigation systems to facilitate safe driving. To drive safely, drivers need to concentrate on their driving; however, their concentration may be disrupted by disagreements with their spoken dialog system. Therefore, we need to address misunderstandings on the user’s side as well as on the spoken dialog system’s side. For this purpose, we introduced the driver’s workload level into spoken dialog management in order to prevent user misunderstandings. A key strategy of the dialog management is to make the system’s speech redundant when the driver’s workload is too high, assuming that the user will probably misunderstand the system utterance under such conditions. An experiment was conducted with a user simulator to compare the performance of the proposed method with that of a conventional method. The simulator was developed under the assumption of two types of drivers: an experienced driver model and a novice driver model. Experimental results showed that the proposed strategies achieved better performance than the conventional one in terms of task completion time, task completion rate, and the user’s positive speech rate. In particular, these performance differences were greater for novice users than for experienced users.
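The workload-dependent redundancy strategy could be pictured as a simple response-selection rule, as in the toy Python sketch below; the threshold, state fields, and message templates are invented for illustration and are not taken from the paper.

```python
# Toy sketch of workload-aware response selection: when the estimated driver
# workload is high, the system repeats key information (a redundant response)
# to reduce the chance that the driver misunderstands the utterance.
from dataclasses import dataclass

@dataclass
class DialogState:
    destination: str
    workload: float  # estimated driver workload in [0, 1]

def select_response(state: DialogState, high_workload_threshold: float = 0.7) -> str:
    if state.workload >= high_workload_threshold:
        # Redundant response: repeat the slot value and ask for confirmation.
        return (f"I set the destination to {state.destination}. "
                f"To confirm, the destination is {state.destination}. Is that correct?")
    # Concise response when the workload is low.
    return f"Destination set to {state.destination}."

print(select_response(DialogState("Okayama Station", workload=0.8)))
```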
en-copyright= kn-copyright= en-aut-name=YamaokaMasaki en-aut-sei=Yamaoka en-aut-mei=Masaki kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= affil-num=1 en-affil= kn-affil=Okayama University affil-num=2 en-affil= kn-affil=岡山大学大学院自然科学研究科 affil-num=3 en-affil= kn-affil=岡山大学大学院自然科学研究科 END start-ver=1.4 cd-journal=joma no-vol= cd-vols= no-issue= article-no= start-page=390 end-page=395 dt-received= dt-revised= dt-accepted= dt-pub-year=2015 dt-pub=201503 dt-online= en-article= kn-article= en-subject= kn-subject= en-title= kn-title=Sound collection and visualization system enabled participatory and opportunistic sensing approaches en-subtitle= kn-subtitle= en-abstract= kn-abstract=This paper presents a sound collection system to visualize environmental sounds that are collected using a crowdsourcing approach. Sound properties are generally analyzed in terms of physical features; however, human beings not only analyze sounds but also connect to them emotionally. If we want to visualize sounds according to the characteristics of the listener, we need to collect not only the raw sounds but also the subjective feelings associated with them. For this purpose, we developed a sound collection system using a crowdsourcing approach to collect physical sounds, their statistics, and subjective evaluations simultaneously. We then conducted a sound collection experiment with ten participants using the developed system. We collected 6,257 samples of equivalent loudness levels and their locations, and 516 samples of sounds and their locations. Subjective evaluations by the participants are also included in the data. Next, we visualized the sounds on a map: the loudness levels are visualized as a color map, and the sounds are visualized as icons that indicate the sound type. Finally, we conducted a sound discrimination experiment to implement a function that automatically converts sounds into appropriate icons. The classifier is trained on the basis of the GMM-UBM (Gaussian Mixture Model and Universal Background Model) method. Experimental results show that the F-measure is 0.52 and the AUC is 0.79. en-copyright= kn-copyright= en-aut-name=HaraSunao en-aut-sei=Hara en-aut-mei=Sunao kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=1 ORCID= en-aut-name=AbeMasanobu en-aut-sei=Abe en-aut-mei=Masanobu kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=2 ORCID= en-aut-name=SoneharaNoboru en-aut-sei=Sonehara en-aut-mei=Noboru kn-aut-name= kn-aut-sei= kn-aut-mei= aut-affil-num=3 ORCID= affil-num=1 en-affil= kn-affil=Graduate School of Natural Science and Technology Okayama University affil-num=2 en-affil= kn-affil=Graduate School of Natural Science and Technology Okayama University affil-num=3 en-affil= kn-affil=National Institute of Informatics END
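The GMM-UBM classification step could be sketched as follows in Python with scikit-learn: per-class GMMs are initialized from a universal background model fit on pooled features (a rough substitute for MAP adaptation), and test segments are scored by the average log-likelihood ratio against the UBM. The function names, feature layout, and parameters are illustrative, not the paper's implementation.

```python
# Minimal GMM-UBM style classifier sketch with scikit-learn.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmm_ubm(features_by_class, n_components=16, seed=0):
    # `features_by_class` maps a sound-type label to an (n_frames, n_dims) array.
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          random_state=seed)
    ubm.fit(np.vstack(list(features_by_class.values())))   # universal background model
    class_models = {}
    for label, feats in features_by_class.items():
        # Initialize each class model from the UBM parameters (rough MAP substitute).
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              weights_init=ubm.weights_, means_init=ubm.means_,
                              random_state=seed)
        gmm.fit(feats)
        class_models[label] = gmm
    return ubm, class_models

def classify(segment_feats, ubm, class_models):
    # Average per-frame log-likelihood ratio of each class model against the UBM.
    scores = {label: np.mean(gmm.score_samples(segment_feats)
                             - ubm.score_samples(segment_feats))
              for label, gmm in class_models.items()}
    return max(scores, key=scores.get)
```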