
Audio-visual asynchrony modeling and analysis for speech alignment and recognition


Abstract

This work investigates perceived audio-visual asynchrony, specifically anticipatory coarticulation, in which the visual cues of a speech sound (e.g. lip rounding) may occur before the acoustic cues. This phenomenon often gives the impression that the visual and acoustic signals are asynchronous. The effect can be accounted for using models based on multiple hidden Markov models with synchrony constraints linking states in different modalities, though generally only within phones and not across phone boundaries. In this work, we consider several such models, implemented as dynamic Bayesian networks (DBNs). We study the models' ability to accurately locate phone and viseme (audio and video sub-word units, respectively) boundaries in the audio and video signals, and compare them with human labels of these boundaries. This alignment task is important in its own right for linguistic analysis, where it can serve as both an analysis tool and a convenience tool for linguists. Furthermore, t


Keywords

Coarticulation, Computer science, Speech recognition, Asynchrony (computer programming), Hidden Markov model, Set (abstract data type), Phone, Context (archaeology)
