Architectural Enhancements and Feature Optimization of AutoVocoder for High-Quality Speech Synthesis
Subject (OSZKAR)
Text To Speech
Acoustic Modeling
Neural Networks
- https://doi.org/10.3311/WINS2025-005
OOC works
Abstract
Neural vocoders are essential for producing high-quality speech in modern Text-to-Speech (TTS) systems because they directly affect how natural and clear the generated speech sounds. This thesis improves AutoVocoder by redesigning its architecture to process speech waveforms more effectively. The first step adapts data preprocessing to the new architecture: the representation of redundant audio features, notably phase and magnitude, is modified to provide cleaner inputs for processing. Next, the architecture is redesigned to include separate encoders and decoders for the individual features. These specialized components are then combined into a unified encoder-decoder structure that learns the relationships between features, enabling deeper analysis and better synthesis. The goal is to achieve these improvements while reducing computational requirements. In the final step, post-processing techniques are applied to enhance the quality of the generated speech. These methods are tested on noisy data to check that the vocoder performs well under varied conditions. The research provides a systematic approach to improving TTS systems, making them more efficient and reliable. The proposed changes aim to deliver clearer, more natural speech while remaining adaptable to different environments and use cases.
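
To make the idea of treating phase and magnitude as separate features concrete, the sketch below splits a complex short-time spectrum into a magnitude stream and a phase stream and then merges them back. It is only an illustration under stated assumptions: the abstract does not specify the thesis's actual feature representation, and the function names (`stft`, `split_features`, `merge_features`), the FFT size, hop length, Hann window, and test tone are all hypothetical choices made for this example.

```python
import numpy as np

def stft(signal, n_fft=1024, hop=256):
    """Hann-windowed STFT; returns a (frames, n_fft // 2 + 1) complex array.

    n_fft and hop are illustrative values, not taken from the thesis.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.fft.rfft(frames, axis=-1)

def split_features(spec):
    """Split a complex spectrogram into magnitude and phase streams."""
    magnitude = np.abs(spec)   # non-negative amplitudes
    phase = np.angle(spec)     # radians, wrapped to [-pi, pi]
    return magnitude, phase

def merge_features(magnitude, phase):
    """Recombine the two streams into a complex spectrogram."""
    return magnitude * np.exp(1j * phase)

# Hypothetical test input: one second of a 220 Hz tone at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
wave = 0.5 * np.sin(2 * np.pi * 220.0 * t)

spec = stft(wave)
mag, ph = split_features(spec)
recon = merge_features(mag, ph)
```

In a design like the one the abstract describes, each stream could be passed to its own encoder before a shared stage models the relationship between them; here the round trip simply confirms that the split itself loses no information.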