Műegyetemi Digitális Archívum

Improving speech naturalness and nuance using HiFiGAN-Hubert-Soft vocoder: A case study of the Voicebox TTS model

Date

Type

Conference paper

Language

en

Reading access rights:

Open access

Rights Holder

Author

Conference Date

2024-02-05

Conference Place

Budapest

Conference Title

2nd Workshop on Intelligent Infocommunication Networks, Systems and Services (WI2NS2)

ISBN, e-ISBN

978-963-421-944-6

Container Title

2nd Workshop on Intelligent Infocommunication Networks, Systems and Services

Version

Post print

Faculty

Faculty of Electrical Engineering and Informatics

First Page

25

Subject (OSZKAR)

Speech synthesis
Generative AI model
TTS
Voicebox
Hubert-Soft vocoder

Genre

Conference article

University

Budapest University of Technology and Economics

OOC works

Abstract

Text-to-speech (TTS) technology has significantly transformed human-machine interaction, facilitating seamless communication between humans and computers. However, achieving high-quality TTS remains a formidable challenge, especially when synthesizing natural and nuanced speech. In this study, we investigate the potential of the HiFiGAN-Hubert-Soft (HHS) vocoder to enhance the performance of TTS models, focusing on its integration into the Voicebox TTS model, a versatile and scalable TTS system developed by Meta AI. Through both subjective (mean opinion score) and objective (audio similarity and visualization metrics) evaluations, we show that the HHS vocoder significantly enhances the naturalness and nuance of synthesized speech compared to the baseline HiFiGAN vocoder. The improvement is particularly pronounced where pronunciation variations are subtle or context-dependent. Our findings emphasize the potential of the HHS vocoder to elevate TTS performance and lay the foundation for further advances in TTS technology.
