HPCM: A Hybrid Multi-Layered Machine Learning Pipeline for Plagiarism Content Matching with Dynamic Threshold Calibration
by Piyush Chavan, Prof. Moushmee Kuri, Pushkar Thombare, Tanvi Bokade
Published: May 23, 2026 • DOI: 10.51584/IJRIAS.2026.11050027
Abstract
Conventional plagiarism detectors rely mainly on string matching and n-gram fingerprinting. These methods catch verbatim copying but miss paraphrasing, synonym substitution, and imitation of writing style. The gap has become more pressing as sophisticated paraphrasing tools and large language models make it easy to evade such detectors. We present HPCM, an end-to-end plagiarism detection system built as a nine-module machine learning pipeline that combines three analysis components: (1) lexical similarity, computed as the cosine similarity of TF-IDF term vectors; (2) semantic similarity, computed from embeddings produced by the all-MiniLM-L6-v2 model; and (3) stylistic similarity, derived from POS distribution, type-token ratio, and sentence-length statistics. The three scores are combined by a weighted-sum fusion function that gives greater weight to the semantic similarity score. In addition, a novel Dynamic Similarity Calibration (DSC) module adjusts the plagiarism score for each document pair according to the documents' relative length, vocabulary richness, and topic similarity. In experiments covering four categories of plagiarism, HPCM scores 69.0% on paraphrase detection, compared with 24.9% for conventional approaches, an improvement of 44.1 percentage points. The system is deployed as a set of microservices on Vercel, Render, Hugging Face Spaces, and MongoDB Atlas, which demonstrates that multi-layered neural models for plagiarism detection are practical even on free-tier cloud resources. The source code and test data are publicly available.
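The three-score fusion described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The function names, the smoothed IDF formula, the style features used (type-token ratio and mean sentence length only, with POS distribution omitted), and the weights `(0.3, 0.5, 0.2)` are all assumptions chosen for illustration. The semantic score is passed in as a plain number, since computing it would require an embedding model such as all-MiniLM-L6-v2, and the DSC adjustment is left out because the abstract does not give its formula.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokens (a simple regex tokenizer, assumed for this sketch)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def tfidf_cosine(doc_a, doc_b):
    """Cosine similarity of TF-IDF vectors fitted on just these two documents."""
    counts = [Counter(tokenize(doc_a)), Counter(tokenize(doc_b))]
    vocab = set(counts[0]) | set(counts[1])
    n_docs = 2
    # Smoothed IDF (assumed): ln((1 + n) / (1 + df)) + 1, so shared terms keep weight.
    idf = {
        w: math.log((1 + n_docs) / (1 + sum(w in c for c in counts))) + 1
        for w in vocab
    }
    vecs = [{w: tf * idf[w] for w, tf in c.items()} for c in counts]
    dot = sum(v * vecs[1].get(w, 0.0) for w, v in vecs[0].items())
    norms = [math.sqrt(sum(v * v for v in vec.values())) for vec in vecs]
    if norms[0] == 0 or norms[1] == 0:
        return 0.0
    return dot / (norms[0] * norms[1])


def type_token_ratio(text):
    """Distinct tokens divided by total tokens; 0.0 for empty text."""
    toks = tokenize(text)
    return len(set(toks)) / len(toks) if toks else 0.0


def mean_sentence_length(text):
    """Average number of tokens per sentence, splitting on ., ! and ?."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(tokenize(s)) for s in sentences) / len(sentences)


def style_similarity(doc_a, doc_b):
    """1 minus the mean relative difference of two style features (assumed measure)."""
    feats_a = (type_token_ratio(doc_a), mean_sentence_length(doc_a))
    feats_b = (type_token_ratio(doc_b), mean_sentence_length(doc_b))
    diffs = []
    for a, b in zip(feats_a, feats_b):
        denom = max(a, b)
        diffs.append(abs(a - b) / denom if denom else 0.0)
    return 1.0 - sum(diffs) / len(diffs)


def fuse(lexical, semantic, style, weights=(0.3, 0.5, 0.2)):
    """Weighted-sum fusion; the default weights are hypothetical, semantic-heavy."""
    w_lex, w_sem, w_sty = weights
    return w_lex * lexical + w_sem * semantic + w_sty * style


if __name__ == "__main__":
    source = "The quick brown fox jumps over the lazy dog. It runs away."
    suspect = "A fast brown fox leaps over a sleepy dog. Then it flees."
    lexical = tfidf_cosine(source, suspect)
    style = style_similarity(source, suspect)
    semantic = 0.85  # placeholder; a real pipeline would use embedding cosine similarity
    print(f"lexical={lexical:.3f} style={style:.3f} fused={fuse(lexical, semantic, style):.3f}")
```

Because the semantic score carries the largest weight in this sketch, a paraphrased pair with little word overlap can still receive a high fused score, which matches the motivation stated in the abstract.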