Improving speech naturalness and nuance using HiFiGAN-Hubert-Soft vocoder: A case study of the Voicebox TTS model
Subject (OSZKAR)
Generative AI model
TTS
Voicebox
Hubert-Soft vocoder
- https://doi.org/10.3311/WINS2024-005
OOC works
Abstract
Text-to-speech (TTS) technology has significantly transformed human-machine interaction, facilitating seamless communication between humans and computers. However, achieving high-quality TTS remains a formidable challenge, especially when synthesizing natural and nuanced speech. In this study, we investigate the potential of the HiFiGAN-Hubert-Soft (HHS) vocoder to enhance the performance of TTS models, focusing on its integration into Voicebox, a versatile and scalable TTS system developed by Meta AI. Through both subjective (mean opinion score) and objective (audio similarity and visualization metrics) evaluations, we show that the HHS vocoder significantly enhances the naturalness and nuance of synthesized speech compared to the baseline HiFiGAN vocoder. This improvement is particularly pronounced in cases where pronunciation variations are subtle or context-dependent. Our findings highlight the potential of the HHS vocoder to improve TTS performance and lay the foundation for further advances in TTS technology.
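
The subjective evaluation mentioned in the abstract relies on the mean opinion score (MOS). As a minimal sketch of how such a score is commonly summarized, the Python snippet below averages listener ratings and attaches a normal-approximation 95% confidence interval. It assumes a 1-5 rating scale; the function name `mean_opinion_score` and the example ratings are hypothetical and are not taken from the study, whose exact protocol is not described on this page.

```python
import math
import statistics


def mean_opinion_score(ratings, z=1.96):
    """Return (mos, half_width) for a list of listener ratings.

    Assumes ratings on a 1-5 scale. half_width is the half-width of a
    normal-approximation confidence interval (z=1.96 gives roughly 95%).
    """
    if len(ratings) < 2:
        raise ValueError("need at least two ratings")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    mos = statistics.mean(ratings)
    # Standard error of the mean, using the sample standard deviation.
    half_width = z * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mos, half_width


# Hypothetical ratings, invented for illustration only (not data from the study).
baseline_ratings = [3, 4, 3, 4, 3, 4]
candidate_ratings = [4, 5, 4, 4, 5, 4]

for name, ratings in [("baseline", baseline_ratings), ("candidate", candidate_ratings)]:
    mos, hw = mean_opinion_score(ratings)
    print(f"{name}: MOS = {mos:.2f} +/- {hw:.2f}")
```

With so few ratings the interval is wide; a real listening test would typically collect many more ratings per system before comparing vocoders.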