Műegyetemi Digitális Archívum
 

Multi-speaker child speech synthesis in low-resource Hungarian language

Date

Type

Conference paper

Language

en

Reading access rights

Open access

Rights Holder

Author

Conference Date

2024-02-05

Conference Place

Budapest

Conference Title

2nd Workshop on Intelligent Infocommunication Networks, Systems and Services (WI2NS2)

ISBN, e-ISBN

978-963-421-944-6

Container Title

2nd Workshop on Intelligent Infocommunication Networks, Systems and Services

Version

Postprint

Faculty

Faculty of Electrical Engineering and Informatics

First Page

19

Subject (OSZKAR)

Text-to-Speech
AutoVocoder
child TTS systems
BIGVGAN
Speech synthesis

Genre

Conference article

University

Budapest University of Technology and Economics

OOC works

Abstract

Current deep learning-based text-to-speech (TTS) models can synthesize speech that sounds remarkably like human voices. Despite recent developments in TTS systems for adults, building TTS systems for children involves numerous considerations, including the scarcity of adequate child speech resources and the unique acoustic and linguistic characteristics specific to children. The main objective of this work is to explore advanced neural vocoders, namely BIGVGAN and AutoVocoder, for synthesizing child speech in Hungarian. In our study, we focused on the Hungarian language to evaluate the efficacy of neural vocoders in capturing the specific linguistic nuances and phonetic characteristics relevant to Hungarian-speaking children. In addition, we examined the fine-tuning and adaptation of vocoders to accurately capture the unique attributes of child speech while minimizing dependency on extensive child speech datasets. The experimental outcomes showcased the high performance of BIGVGAN and AutoVocoder in synthesizing clear and natural-sounding multi-speaker child speech in conversational settings. Although the speech synthesized by BIGVGAN is of high quality, the model has a more complex architecture with a multi-scale discriminator and requires more resources and longer training time, due to its larger batch size, than AutoVocoder. AutoVocoder notably improved the quality and clarity of the generated child speech. Initial findings suggest that the BIGVGAN model successfully produced high-quality synthesized child speech.

Description

Keywords