Designing a Multimodal Hate Speech Detection Model For X Platform: A Systematic Analysis of Current Approaches

by Edwin Ireri, Josphat Karani, Kennedy Malanga

Published: December 23, 2025 • DOI: 10.51244/IJRSI.2025.12110174

Abstract

The proliferation of hate speech on social media platforms poses significant societal challenges, with the X platform experiencing a 50% overall increase in hate speech, including a 260% rise in transphobic slurs following recent policy changes. Traditional text-based detection models struggle with modern communication patterns, particularly on platforms like X where the 280-character constraint encourages coded language and linguistic compression. This study addresses critical gaps in multimodal hate speech detection through two primary objectives: a systematic analysis of current multimodal models to identify gaps and limitations specific to the X platform, and the design of an innovative multimodal architecture optimized for the X platform's unique communication environment. The analysis of six prominent models—VisualBERT, UNITER, HGAT, Stacked Ensemble Framework, Multimodal Transformers, and Visual Data Augmentation approaches—reveals that none of the existing models address the X platform's 280-character communication patterns, 83% exhibit over-reliance on the text modality, and all fail to meet real-time processing requirements (<500ms). These findings yield performance and gap analyses across multiple evaluation dimensions. In response, this study presents a novel six-layer architecture featuring Dynamic Cross-Modal Attention mechanisms, compression-aware text processing, and lightweight vision transformers specifically optimized for the X platform. The architectural design addresses the identified gaps through platform-specific preprocessing, parallel feature encoding across four specialized components (Platform-Optimized RoBERTa, Lightweight Vision Transformer, Cultural Context Analyzer, and Adaptive Learning Module), and dynamic multimodal fusion that balances processing between textual and visual modalities.
This research contributes to advancing hate speech detection methodologies by providing a gap analysis and presenting an innovative design framework that addresses real-time processing, platform-specific optimization, and balanced multimodal integration, which are critical requirements for practical social media content moderation.
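To make the fusion idea concrete, the sketch below illustrates one plausible reading of the Dynamic Cross-Modal Attention described above: a per-sample gate computes modality weights so that the model can lean on text or image features dynamically rather than concatenating them with fixed importance. All names, dimensions, and the gating formulation here are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dynamic_fusion(text_feat, img_feat, w_gate):
    """Hypothetical dynamic cross-modal fusion.

    A small gating projection (w_gate, assumed learned in practice)
    produces two logits per sample, one per modality; their softmax
    gives the per-sample modality weights used to blend the features.
    """
    logits = np.concatenate([text_feat, img_feat], axis=-1) @ w_gate  # (batch, 2)
    alpha = softmax(logits, axis=-1)                                  # modality weights
    fused = alpha[:, :1] * text_feat + alpha[:, 1:] * img_feat        # weighted blend
    return fused, alpha

# toy batch of 4 samples with 8-dimensional features per modality
d = 8
text = rng.normal(size=(4, d))
img = rng.normal(size=(4, d))
w_gate = rng.normal(size=(2 * d, 2))  # stands in for a learned gate
fused, alpha = dynamic_fusion(text, img, w_gate)
print(fused.shape)  # fused keeps the feature dimension: (4, 8)
```

Because the gate is computed from both modalities jointly, a sample whose image carries the hateful signal can receive a high visual weight even when its text is benign, which is one way to counter the text over-reliance noted in the analysis.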