Műegyetemi Digitális Archívum

Architectural Enhancements and Feature Optimization of AutoVocoder for High-Quality Speech Synthesis

Date

Type

Book chapter

Language

en

Reading access rights:

Open access

Rights Holder

Author

Conference Date

2025-02-03

Conference Place

Budapest

Conference Title

3rd Workshop on Intelligent Infocommunication Networks, Systems and Services

ISBN, e-ISBN

978-963-421-982-8

Container Title

3rd Workshop on Intelligent Infocommunication Networks, Systems and Services

Version

Post print

Faculty

Faculty of Electrical Engineering and Informatics

First Page

27

Subject (OSZKAR)

Speech Synthesis
Text To Speech
Acoustic Modeling
Neural Networks

Genre

Conference paper

University

Budapest University of Technology and Economics

OOC works

Abstract

Neural vocoders are essential for producing high-quality speech in modern Text-to-Speech (TTS) systems, because they directly affect how natural and clear the generated speech sounds. This thesis focuses on improving the AutoVocoder by redesigning its architecture to better process speech waveforms. The first step adapts data preprocessing to the new architecture by modifying how redundant audio features, notably phase and magnitude, are represented, so that the model receives cleaner inputs. Next, the architecture is redesigned to include separate encoders and decoders for individual features. These specialized components are then combined into a unified encoder-decoder structure that learns the relationships between features, enabling deeper analysis and better synthesis. The goal is to achieve these improvements while reducing computational requirements. In the final step, post-processing techniques are used to enhance the quality of the generated speech. These methods are tested with noisy data to check that the vocoder performs well under varied conditions. This research provides a systematic approach to improving TTS systems, making them more efficient and reliable. The proposed changes aim to deliver clearer, more natural speech while remaining adaptable to different environments and use cases.
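
As an illustration of the magnitude and phase features mentioned in the abstract, the following minimal Python sketch splits one short audio frame into the two representations and merges them back. It is not the authors' preprocessing pipeline: the function names (`split_features`, `merge_features`), the 16-sample synthetic frame, and the plain discrete Fourier transform are all assumptions chosen for readability.

```python
import cmath
import math


def dft(frame):
    """Naive discrete Fourier transform of a real-valued frame."""
    n = len(frame)
    return [
        sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse DFT; returns the real part of each reconstructed sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def split_features(frame):
    """Hypothetical helper: separate a frame into magnitude and phase lists."""
    spectrum = dft(frame)
    magnitude = [abs(c) for c in spectrum]
    phase = [cmath.phase(c) for c in spectrum]
    return magnitude, phase


def merge_features(magnitude, phase):
    """Hypothetical helper: rebuild the frame from magnitude and phase."""
    spectrum = [cmath.rect(m, p) for m, p in zip(magnitude, phase)]
    return idft(spectrum)


# Synthetic 16-sample frame (an assumption; real input would be speech audio).
frame = [
    math.sin(2 * math.pi * 2 * t / 16) + 0.5 * math.cos(2 * math.pi * 5 * t / 16)
    for t in range(16)
]

magnitude, phase = split_features(frame)
restored = merge_features(magnitude, phase)

# Largest sample-wise difference between the original and rebuilt frame.
max_error = max(abs(a - b) for a, b in zip(frame, restored))
print(f"max reconstruction error: {max_error:.2e}")
```

The sketch only shows that magnitude and phase together carry enough information to rebuild the frame, which is why a vocoder can treat them as two separate feature streams. How the paper represents or encodes each stream is not specified in this record.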

Description

Keywords