Abstract
This research project proposes an approach to improving duration prediction in deep-learning text-to-speech (TTS) models by incorporating psychoacoustic cues of speech rhythm. Building on p-center theory, which posits that listeners center their perception of speech rhythm on the acoustic vowel onset, we aim to implement a vowel onset detection module and integrate it into consolidate…