Acoustic and visual geometry descriptor for multi-modal emotion recognition from videos
Abstract
Recognizing human emotions simultaneously from multiple data modalities (e.g., face and speech) has drawn significant research interest, and numerous contributions have been made in the affective computing community. However, most methods pay little attention to facial alignment and keyframe selection for audio-visual input. Hence, this paper proposes a new audio-visual descriptor that describes the emotion using only a few frames. For this purpose, we propose a new self-similarity distance matrix (SSDM), which computes spatial and temporal distances between landmark points on the facial image. The audio signal is described through a set of composite features, including statistical features, spectral features, formant frequencies, and energies. A support vector machine (SVM) is employed to classify each modality, and the results are fused at the decision level to predict the emotion. The Surrey audio-visual expressed emotion (SAVEE) and Ryerson multimedia research lab (RML) datasets are used for experimental validation, and the proposed method shows significant improvement over state-of-the-art methods.
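As a minimal sketch of the pipeline summarized above, the Python code below illustrates one plausible reading of the SSDM, keyframe selection, audio description, and decision-level fusion steps. It is not the authors' implementation: the landmark array shape, the keyframe heuristic, the particular audio descriptors shown (energy, zero-crossing rate, spectral centroid), and the equal fusion weight are all illustrative assumptions.

import numpy as np
from sklearn.svm import SVC

def ssdm(landmarks):
    # landmarks: (T, L, 2) array of L facial landmark points over T frames
    # (assumed already extracted and aligned). Entry (i, j) is the mean
    # Euclidean distance between corresponding landmarks of frames i and j,
    # so the matrix captures both spatial layout and temporal change.
    T = landmarks.shape[0]
    D = np.zeros((T, T))
    for i in range(T):
        for j in range(T):
            D[i, j] = np.linalg.norm(landmarks[i] - landmarks[j], axis=1).mean()
    return D

def keyframes(D, k=5):
    # One possible keyframe heuristic (hypothetical): keep the k frames
    # that are most dissimilar from the rest of the sequence.
    return np.argsort(D.sum(axis=1))[-k:]

def audio_features(signal, sr=16000):
    # Toy subset of the statistical/spectral descriptors mentioned in the
    # abstract: short-term energy, zero-crossing rate, spectral centroid.
    # (Formant frequencies would require an LPC analysis, omitted here.)
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / sr)
    centroid = float((freqs * spectrum).sum() / (spectrum.sum() + 1e-9))
    energy = float(np.mean(signal ** 2))
    zcr = float(np.mean(np.abs(np.diff(np.sign(signal)))) / 2)
    return np.array([energy, zcr, centroid])

# Example: train one SVM per modality on placeholder features and fuse
# the class probabilities with an (assumed) equal weight of 0.5 each.
rng = np.random.default_rng(0)
X_video = rng.normal(size=(60, 25))     # e.g., flattened SSDM statistics
X_audio = rng.normal(size=(60, 3))      # the three toy audio descriptors
y = np.repeat(np.arange(6), 10)         # six basic emotion labels

svm_v = SVC(probability=True).fit(X_video, y)
svm_a = SVC(probability=True).fit(X_audio, y)
p = 0.5 * svm_v.predict_proba(X_video) + 0.5 * svm_a.predict_proba(X_audio)
pred = p.argmax(axis=1)

Because both SVMs are trained on the same label set, their probability columns align, so a simple weighted average implements decision-level fusion; the 0.5/0.5 weighting is a placeholder that would normally be tuned on validation data.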
Keywords
Acoustic feature; Audio and video; Decision level fusion; Geometric features; Key frames; Multimodal emotion recognition
DOI: http://doi.org/10.11591/ijeecs.v33.i2.pp960-970