Effect of Child Voices on Forensic Voice Comparison using Deep Speaker Embeddings
Subject (OSZKAR)
Forensic voice comparison
language mismatch
deep speaker embedding
Gender
- https://doi.org/10.3311/WINS2025-003
OOC works
Abstract
Language mismatch is considered one of the biggest challenges in achieving adequate speaker verification performance. The number of bilingual speakers worldwide is increasing, which makes speaker verification more challenging for speech technology. The main objective of this study is to examine how language mismatch between training and test conditions affects the performance of a speaker verification model, with a specific focus on children's speech. To this end, we use pre-trained and fine-tuned deep speaker embedding models and investigate two child speech datasets (Samromur and kidsTALC). Two time-delay neural network (TDNN) architectures, X-Vector and Emphasized Channel Attention, Propagation and Aggregation (ECAPA), are used to extract embeddings from the speech samples. Speaker verification performance is evaluated within the likelihood-ratio framework, with likelihood-ratio scores calculated on children's voices and assessed using the log-likelihood ratio cost (Cllr) and the equal error rate (EER). The experimental results indicate that language mismatch between training and test utterances significantly degrades speaker verification performance compared to multi-language training; however, fine-tuned models still outperform pre-trained models, by 11.5% in the best case for ECAPA-TDNN.
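The two evaluation measures named in the abstract, Cllr and EER, can be illustrated with a minimal sketch. It uses the standard textbook definitions and assumes the inputs are natural-log likelihood ratios for same-speaker and different-speaker trials; the function names `cllr` and `eer` and the toy values in the usage note are hypothetical and are not taken from the paper, whose exact scoring and calibration pipeline is not described here.

```python
import numpy as np


def cllr(same_llr, diff_llr):
    """Log-likelihood-ratio cost from natural-log LRs (assumed input format).

    Cllr = 0.5 * ( mean(log2(1 + 1/LR_same)) + mean(log2(1 + LR_diff)) )
    """
    same_llr = np.asarray(same_llr, dtype=float)
    diff_llr = np.asarray(diff_llr, dtype=float)
    # logaddexp(0, x) = ln(1 + e^x), computed without overflow;
    # dividing by ln(2) converts the penalty to bits.
    pen_same = np.logaddexp(0.0, -same_llr) / np.log(2.0)
    pen_diff = np.logaddexp(0.0, diff_llr) / np.log(2.0)
    return 0.5 * (pen_same.mean() + pen_diff.mean())


def eer(same_scores, diff_scores):
    """Equal error rate via a simple sweep over all observed scores."""
    same = np.asarray(same_scores, dtype=float)
    diff = np.asarray(diff_scores, dtype=float)
    thresholds = np.sort(np.concatenate([same, diff]))
    # Miss rate: same-speaker trials scored below the threshold.
    fnr = np.array([(same < t).mean() for t in thresholds])
    # False-alarm rate: different-speaker trials at or above the threshold.
    fpr = np.array([(diff >= t).mean() for t in thresholds])
    # Take the threshold where the two error rates are closest.
    i = int(np.argmin(np.abs(fnr - fpr)))
    return 0.5 * (fnr[i] + fpr[i])
```

As a sanity check on the definitions, a system that always outputs LR = 1 (log LR = 0) carries no information and yields a Cllr of exactly 1, while well-separated and well-calibrated log LRs such as `cllr([10.0], [-10.0])` give a value close to 0. Lower values are better for both measures.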