Emotion Recognition Using Machine Learning and Computer Vision: A Hybrid CNN–ViT Multimodal Framework
by Arpitha G A, Bhumika B K, Brunda U Jajur, Deepa Chandrashekhar Rathod, Usha K
Published: May 19, 2026 • DOI: 10.51244/IJRSI.2026.1304000249
Abstract
Automatic recognition of human emotions from facial expressions and multimodal signals is a foundational challenge in affective computing and human–computer interaction, with applications in healthcare monitoring, autonomous vehicle safety, educational technology, and social robotics. Despite substantial progress driven by deep learning, particularly convolutional neural networks (CNNs), recurrent neural networks (RNNs), and Vision Transformers (ViT), robust emotion recognition in unconstrained, real-world environments remains an open problem. This paper synthesizes over twenty-five state-of-the-art studies on facial and multimodal emotion recognition, covering CNN-based systems trained on FER2013, CK+, RAF-DB, and AffectNet; transformer-based hybrid architectures; and multimodal fusion systems that integrate facial, speech, and electroencephalography (EEG) cues and are evaluated on RAVDESS, IEMOCAP, CMU-MOSEI, eNTERFACE'05, and MAHNOB-HCI. Building on these insights, this work proposes a novel Hybrid CNN–ViT Multimodal Emotion Recognition (HCV-MER) framework comprising: (i) a facial backbone that combines a squeeze-and-excitation ResNet with a Vision Transformer and applies region-specific attention over the eyes and mouth; (ii) a lightweight temporal aggregation unit for video-level inference; and (iii) a cross-modal attention fusion module that integrates the facial and speech streams. Experimental evaluations target FER2013, RAF-DB, CK+, and RAVDESS, using TensorFlow, PyTorch, and OpenCV. Expected improvements over baseline CNN architectures range from five to ten percentage points on challenging in-the-wild benchmarks. The paper further analyzes unresolved challenges, including cross-domain generalization, demographic fairness, micro-expression recognition, and privacy-preserving deployment.
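To make component (iii) concrete, the following is a minimal sketch of cross-modal attention fusion in which facial tokens act as queries over speech tokens. It is an illustration under stated assumptions, not the HCV-MER implementation: the function name `cross_modal_attention`, the token dimensions, the toy inputs, the residual combination, and the omission of learned projection matrices are all hypothetical choices made here. It uses only the Python standard library so that it runs without dependencies; the paper itself targets TensorFlow and PyTorch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cross_modal_attention(face_tokens, speech_tokens):
    """Fuse two modalities with scaled dot-product attention.

    Each facial token is a query; speech tokens serve as keys and values.
    Learned query/key/value projections are omitted (assumed identity),
    which is a simplification for illustration only.
    Returns (fused_tokens, attention_weights).
    """
    dim = len(face_tokens[0])
    scale = math.sqrt(dim)
    fused, weights = [], []
    for query in face_tokens:
        # Similarity of this facial token to every speech token.
        scores = [dot(query, key) / scale for key in speech_tokens]
        attn = softmax(scores)
        # Attention-weighted sum of the speech tokens.
        context = [
            sum(w * value[i] for w, value in zip(attn, speech_tokens))
            for i in range(dim)
        ]
        # Residual connection (an assumption): keep the facial evidence
        # and add the speech context to it.
        fused.append([q + c for q, c in zip(query, context)])
        weights.append(attn)
    return fused, weights


# Toy inputs: 2 facial tokens and 3 speech tokens, each of dimension 4.
face = [[1.0, 0.0, 0.5, 0.2], [0.1, 0.9, 0.3, 0.0]]
speech = [[0.8, 0.1, 0.4, 0.0], [0.0, 1.0, 0.2, 0.1], [0.3, 0.3, 0.3, 0.3]]
fused, weights = cross_modal_attention(face, speech)
```

Each row of `weights` is a probability distribution over the speech tokens, so it can be inspected to see which part of the audio stream a given facial token relied on. A full system would add learned projections, multiple heads, and normalization layers around this core step.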