Acoustic and visual geometry descriptor for multi-modal emotion recognition from videos

Kummari Ramyasree, Chennupati Sumanth Kumar


Recognizing human emotions simultaneously from multiple data modalities (e.g., face and speech) has drawn significant research interest, and numerous contributions have appeared in the affective computing community. However, most methods pay little attention to facial alignment and keyframe selection for audio-visual input. Hence, this paper proposes a new audio-visual descriptor that describes the emotion through only a few frames. For this purpose, we propose a new self-similarity distance matrix (SSDM), which computes spatial and temporal distances through landmark points on the facial image. The audio signal is described through a set of composite features, including statistical features, spectral features, formant frequencies, and energies. A support vector machine (SVM) classifier is employed for each modality, and the two decisions are fused to predict the emotion. The Surrey audio-visual expressed emotion (SAVEE) and Ryerson multimedia research lab (RML) datasets are utilized for experimental validation, and the proposed method shows significant improvement over state-of-the-art methods.
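The abstract's core visual descriptor, the self-similarity distance matrix (SSDM), can be illustrated with a minimal sketch. The array shape, the use of mean per-landmark Euclidean distance, and the function name below are assumptions for illustration only; the paper's exact construction may differ.

```python
import numpy as np

def ssdm(landmarks):
    """Sketch of a self-similarity distance matrix over video frames.

    Assumed input: `landmarks` of shape (T, L, 2) -- T frames, each with
    L facial landmark points given as (x, y) coordinates. Entry (i, j)
    is the mean Euclidean distance between corresponding landmarks of
    frames i and j, so the matrix reflects both spatial (per-landmark)
    and temporal (frame-to-frame) variation of facial geometry.
    """
    # Pairwise per-landmark differences between every pair of frames:
    # shape (T, T, L, 2).
    diff = landmarks[:, None, :, :] - landmarks[None, :, :, :]
    # Euclidean norm per landmark (T, T, L), averaged over landmarks (T, T).
    return np.linalg.norm(diff, axis=-1).mean(axis=-1)
```

The resulting T-by-T matrix is symmetric with a zero diagonal; frames whose rows differ strongly from the rest are natural keyframe candidates, consistent with the paper's goal of describing the emotion through only a few frames.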


Acoustic feature; Audio and video; Decision level fusion; Geometric features; Key frames; Multimodal emotion recognition






This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

The Indonesian Journal of Electrical Engineering and Computer Science (IJEECS)
p-ISSN: 2502-4752, e-ISSN: 2502-4760
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).
