Cloud-Based Emotion Detection Model Using Convolution Neural Networks on the Affectnet Dataset

by Aryan Verma, Ashish Srivastava, Nevesh Divya

Published: June 3, 2026 • DOI: 10.51244/IJRSI.2026.1305000121

Abstract

Emotion recognition powered by AI and computer vision is changing how people and machines interact: computers can now infer our feelings by analyzing our faces. In this paper, we examine a cloud-based emotion classification system built around the massive AffectNet dataset and a fine-tuned MobileNetV2 deep learning model. Because we needed to process millions of real-world images, we built a distributed MLOps pipeline on Google Cloud Platform using Apache Spark and Vertex AI.
Traditional hardware cannot keep up with data at this scale, which creates a major bottleneck. We solved this with a cloud-first architecture. All images were stored in Google Cloud Storage, which served as a virtually limitless data lake. For preprocessing, we spun up an on-demand Apache Spark cluster with Dataproc and spread the load across machines. For training, we handed off to Vertex AI, orchestrating jobs across a cluster of NVIDIA A100 GPUs. Separating these stages cut both processing time and cost, and the pipeline ran thirty times faster than a single machine.
The AffectNet dataset has an extreme class imbalance: some emotions, like happiness, dominate while others barely show up. We addressed this in the preprocessing step by assigning class weights, which avoids resource-hungry oversampling. For transfer learning, we started by freezing the MobileNetV2 base and using it as a feature extractor, tuning only the top layers at a low learning rate.
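To make the class-weighting idea concrete, here is a minimal sketch. It assumes the common "balanced" heuristic, where each class weight is N / (K * n_c) for N samples, K classes, and n_c samples in class c. The abstract does not state which formula was used, so the formula, the function name `balanced_class_weights`, and the toy labels are all illustrative assumptions rather than the authors' implementation.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return {class: weight} using weight = N / (K * n_c).

    Assumption: the "balanced" heuristic; the paper's exact
    weighting scheme is not given in the abstract.
    """
    counts = Counter(labels)
    total = sum(counts.values())   # N: number of samples
    num_classes = len(counts)      # K: number of distinct classes
    return {cls: total / (num_classes * n) for cls, n in counts.items()}


# Hypothetical toy labels (not real AffectNet counts).
toy_labels = ["happy"] * 6 + ["neutral"] * 3 + ["disgust"] * 1
weights = balanced_class_weights(toy_labels)

for cls, w in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{cls}: {w:.3f}")
```

Rare classes receive larger weights, so their errors count more in the loss without duplicating any images. A dictionary of this shape can typically be passed to a training loop as per-class loss weights.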
The study reports per-class performance metrics and a confusion matrix, and examines the dataset to give a sense of the model’s strengths and weaknesses. The final model reached 68.2% accuracy and a weighted F1-score of 0.67. In the end, this work lays a solid, reproducible MLOps foundation for more advanced research in temporal and multimodal emotion recognition.
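For readers unfamiliar with the weighted F1-score reported above, the sketch below shows how it is usually computed: a per-class F1 is averaged with each class weighted by its share of the true labels. The function name `weighted_f1` and the toy labels are hypothetical and are not taken from the paper's evaluation.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)      # true-label count per class
    total = len(y_true)
    score = 0.0
    for cls in support:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (support[cls] / total) * f1
    return score


# Hypothetical toy predictions (illustrative only).
y_true = ["happy", "happy", "happy", "sad", "sad", "fear"]
y_pred = ["happy", "happy", "sad", "sad", "happy", "fear"]
score = weighted_f1(y_true, y_pred)
print(f"weighted F1 = {score:.3f}")
```

Because the average is weighted by support, frequent classes influence this metric more than rare ones, which is why per-class metrics and the confusion matrix are useful alongside it on an imbalanced dataset.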