A Hybrid Algorithm for Robust Pitch Estimation in Emotional Speech Synthesis
Subject (OSZKAR)
Human-machine interaction
Pitch transitions
Frequency variations
Gender
- https://doi.org/10.3311/WINS2025-014
OOC works
Abstract
Emotional intelligence in synthetic speech remains a critical challenge in human-machine interaction, despite significant advances in speech synthesis naturalness and intelligibility. Current systems struggle to accurately capture the nuanced emotional expressions characteristic of human speech, including rapid pitch transitions, wide frequency variations, and irregular vibrato patterns. While pitch estimation algorithms like PESTO and FCPE have proven effective for standard speech, their performance on emotional content remains largely unexplored. We present ESCAPE (Emotion Self-Supervised Context-Aware Pitch Estimation), a novel algorithm specifically designed for emotional speech processing. ESCAPE synthesizes PESTO's precise frequency variation handling with FCPE's context-aware processing through a hybrid architecture that achieves robust pitch tracking in expressive vocal content. Our approach maintains computational efficiency while excelling at capturing complex acoustic patterns unique to emotional utterances. This paper provides the first comprehensive evaluation of PESTO and FCPE on emotional speech datasets and introduces ESCAPE as a transformative solution for pitch estimation in emotionally expressive speech synthesis. Our results demonstrate significant progress toward bridging the gap between human-like emotional expression and machine-generated speech, marking an important advancement in emotional speech synthesis technology.