Evaluation Metrics for Deep Learning–Based Semantic Search: A Critical Review with a User Satisfaction Perspective
by ADESANYA Adetola Joel, AYOADE Akintayo Michael, Folarin Israel Bolaji
Published: May 26, 2026 • DOI: 10.51244/IJRSI.2026.1305000044
Abstract
Background: Semantic search, driven by deep learning models such as BERT and Sentence-BERT (SBERT), has greatly improved information retrieval by shifting from keyword matching to capturing query context and user intent. However, the effectiveness of these systems is still evaluated with traditional system-focused metrics such as precision, recall, Mean Reciprocal Rank (MRR), and Normalized Discounted Cumulative Gain (NDCG). These metrics do not adequately reflect user experience: they often overlook behavioral and contextual factors such as user engagement, search satisfaction, relevance perception, and interaction quality in real-world environments. This review examines existing evaluation metrics for deep learning-based semantic search, identifies their strengths and limitations, and assesses how well they capture real-world user satisfaction. It also explores ways to incorporate user-centered approaches into the evaluation of these systems.
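For readers less familiar with the system-focused metrics named above, the following is a minimal Python sketch of MRR and NDCG, not code taken from the reviewed studies. It assumes each query's results are given as a ranked list of graded relevance labels (0 meaning not relevant), uses the linear-gain form of DCG, and normalizes against the best ordering of the labels in the list itself, which is only exact when every judged document appears in that list. The function names are illustrative.

```python
import math

def reciprocal_rank(relevances):
    """Return 1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, rel in enumerate(relevances, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average the reciprocal rank over a non-empty list of ranked result lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

def dcg(relevances, k):
    """Discounted cumulative gain at cutoff k (linear gain, log2 discount)."""
    return sum(
        rel / math.log2(rank + 1)
        for rank, rel in enumerate(relevances[:k], start=1)
    )

def ndcg(relevances, k):
    """DCG divided by the DCG of the ideal (descending) ordering of the same labels."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Two queries: the first relevant hit is at rank 1 and at rank 2 respectively.
print(mean_reciprocal_rank([[1, 0], [0, 1]]))
# A ranking that places the only relevant document second.
print(ndcg([0, 1], k=2))
```

Both functions depend only on relevance labels and rank positions, which is why they are reproducible across systems and also why they carry no information about how a user actually experienced the results.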
Method: A critical review approach was used to synthesize literature published from 2020 to 2025 across databases including IEEE Xplore, ACM Digital Library, Scopus, and Google Scholar. Studies on semantic search evaluation, deep learning-based retrieval, and user-centered metrics were selected using predefined inclusion and exclusion criteria and analyzed thematically. The analysis categorized evaluation methods into traditional and user-centered approaches.
Findings: The review finds that while traditional metrics provide reproducibility and comparability, they fail to capture important aspects of user experience such as clarity, usability, and satisfaction. Emerging user-oriented alternatives such as click-through rates, dwell time, and satisfaction surveys offer valuable insights, but they remain secondary, fragmented, and unstandardized. The review highlights a persistent gap between the leaderboard performance of search systems and their real-world utility: many high-performing semantic retrieval systems achieve strong benchmark scores while still failing to fully satisfy users in practical search scenarios.
Conclusion: Semantic search evaluation must shift from traditional, system-focused measures to hybrid metrics that integrate algorithmic precision with user-centered measures. By combining traditional metrics with behavioral signals and subjective feedback, future evaluation methods can ensure that semantic search systems are not only technically sound but also practical, usable, and satisfying for end users. The study therefore recommends the development of standardized hybrid evaluation frameworks capable of balancing retrieval accuracy with measurable user experience indicators.
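To make the idea of a hybrid metric concrete, the sketch below blends one offline ranking score with one behavioral signal and one survey rating. It is purely illustrative: the review recommends developing standardized frameworks but does not prescribe a formula, so the `SessionSignals` fields, the weights, the 60-second dwell cap, and the 1 to 5 survey scale are all assumptions chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class SessionSignals:
    """Hypothetical per-session record; all field names are assumptions."""
    ndcg: float           # offline ranking quality in [0, 1]
    clicked: bool         # whether the user clicked any result
    dwell_seconds: float  # time spent on the clicked result
    satisfaction: int     # survey rating on an assumed 1..5 scale

def hybrid_score(s, w_rank=0.5, w_behavior=0.25, w_survey=0.25, dwell_cap=60.0):
    """Weighted blend of ranking, behavioral, and survey components, each in [0, 1].

    The weights and dwell cap are arbitrary illustration values, not
    recommendations from the reviewed literature.
    """
    # Dwell time only counts after a click and saturates at dwell_cap seconds.
    if s.clicked:
        dwell = min(max(s.dwell_seconds, 0.0), dwell_cap) / dwell_cap
    else:
        dwell = 0.0
    # Map the 1..5 survey rating onto [0, 1].
    survey = (s.satisfaction - 1) / 4
    return w_rank * s.ndcg + w_behavior * dwell + w_survey * survey

# A strong benchmark-style ranking that nonetheless left the user unsatisfied.
session = SessionSignals(ndcg=0.95, clicked=True, dwell_seconds=5.0, satisfaction=2)
print(hybrid_score(session))
```

A session like the one above scores well on NDCG alone but noticeably lower on the blended score, which is the kind of gap between benchmark performance and user satisfaction that the review describes. How the components should be weighted, and whether a linear blend is appropriate at all, is exactly what a standardized framework would need to settle.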