Műegyetemi Digitális Archívum

A Hybrid Algorithm for Robust Pitch Estimation in Emotional Speech Synthesis

Date

Type

Book chapter

Language

en

Reading access rights:

Open access

Rights Holder

Author

Conference Date

2025-02-03

Conference Place

Budapest

Conference Title

3rd Workshop on Intelligent Infocommunication Networks, Systems and Services

ISBN, e-ISBN

978-963-421-982-8

Container Title

3rd Workshop on Intelligent Infocommunication Networks, Systems and Services

Version

Post print

Faculty

Faculty of Electrical Engineering and Informatics

First Page

81

Subject (OSZKAR)

Synthetic speech
Human-machine interaction
Pitch transitions
Frequency variations

Genre

Conference paper

University

Budapest University of Technology and Economics

OOC works

Abstract

Emotional intelligence in synthetic speech remains a critical challenge in human-machine interaction, despite significant advances in speech synthesis naturalness and intelligibility. Current systems struggle to accurately capture the nuanced emotional expressions characteristic of human speech, including rapid pitch transitions, wide frequency variations, and irregular vibrato patterns. While pitch estimation algorithms like PESTO and FCPE have proven effective for standard speech, their performance on emotional content remains largely unexplored. We present ESCAPE (Emotion Self-Supervised Context-Aware Pitch Estimation), a novel algorithm specifically designed for emotional speech processing. ESCAPE synthesizes PESTO's precise frequency variation handling with FCPE's context-aware processing through a hybrid architecture that achieves robust pitch tracking in expressive vocal content. Our approach maintains computational efficiency while excelling at capturing complex acoustic patterns unique to emotional utterances. This paper provides the first comprehensive evaluation of PESTO and FCPE on emotional speech datasets and introduces ESCAPE as a transformative solution for pitch estimation in emotionally expressive speech synthesis. Our results demonstrate significant progress toward bridging the gap between human-like emotional expression and machine-generated speech, marking an important advancement in emotional speech synthesis technology.

Description

Keywords