Interpretable, Expert-Aligned Composite Metric with Domain-Aware Calibration for Evaluating Natural Language Generation
by Allan C. Taracatac, Arnel C. Fajardo
Published: February 27, 2026 • DOI: 10.51244/IJRSI.2026.13020053
Abstract
Automated metrics for natural language generation (NLG) often show weak or unstable alignment with expert judgment in domain-specific settings that require interpretability and tunability. This study therefore designs and validates an interpretable composite metric that can be calibrated to expert consensus while remaining transparent. The researchers propose Comprehensive Quality Scoring (CQS), a hierarchical metric integrating contextual coherence and continuity (C3) with five interpretable linguistic factors: relevance, readability, conciseness, structure, and information density. They also introduce CLARION-G, a constrained calibrator that learns a nonnegative simplex weight vector while preserving factor-level attribution. Evaluation uses 20 agriculture-oriented farmer FAQ items with responses generated by a local LLaMA 3.1 (8B) model and scored by expert panels across Agriculture, Linguistics, and Information Technology using a rubric based on MetricEval. Expert ratings are z-scored per rater and aggregated into a consensus target, with reliability assessed via ICC(2,1). To prevent leakage under n=20, calibration is performed strictly within leave-one-out cross-validation (LOOCV): train on n−1 items, freeze the weights, and score the held-out item, with uncertainty quantified via Fisher-z confidence intervals and bootstrap resampling (B=1000). CLARION-G maximizes a penalized correlation objective with fixed coefficients λ₁=0.01, λ_ent=0.005, and λ_var=0.003, optimized using Differential Evolution (population=15, maxiter=50, tol=10⁻⁴, polish=True) with optional L-BFGS-B refinement (maxiter=300–500, ftol=10⁻⁶–10⁻⁸). In Agriculture, calibrated CQS achieves Pearson's r=0.688 with 95% CI [0.353, 0.867], surpassing baselines (e.g., BERTScore, Prometheus, METEOR) with statistically significant dependent-correlation gains. Learned top-level weights allocate 0.4 to C3 and 0.6 to linguistic quality, emphasizing relevance and information density.
Bland-Altman analysis shows no fixed bias, with limits of agreement ±0.1134, and runtime remains practical (≈1.254 ms/item), supporting CQS/CLARION-G as an interpretable and operationally lightweight framework for expert-aligned NLG evaluation in specialized domains.
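The calibration procedure the abstract describes can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`to_simplex`, `objective`, `loocv_scores`) are hypothetical, and the exact forms of the entropy and variance penalties are assumed. It shows the core pattern stated in the abstract: a nonnegative simplex weight vector over factor scores, a penalized Pearson-correlation objective with coefficients λ₁, λ_ent, λ_var, optimized by Differential Evolution (population=15, maxiter=50, tol=10⁻⁴, polish=True) strictly inside a leave-one-out loop so the held-out item never influences its own weights.

```python
import numpy as np
from scipy.optimize import differential_evolution
from scipy.stats import pearsonr

def to_simplex(w):
    # Map raw nonnegative parameters onto the probability simplex
    # (nonnegative entries summing to 1); fall back to uniform weights.
    w = np.maximum(w, 0.0)
    s = w.sum()
    return w / s if s > 0 else np.full_like(w, 1.0 / len(w))

def objective(w_raw, factors, target, lam1=0.01, lam_ent=0.005, lam_var=0.003):
    # Penalized correlation objective (penalty forms are assumptions):
    # maximize Pearson r of the weighted composite against expert consensus,
    # discounted by L1, entropy, and variance terms on the weights.
    w = to_simplex(w_raw)
    composite = factors @ w
    r = pearsonr(composite, target)[0]
    ent = -np.sum(w * np.log(w + 1e-12))
    penalty = lam1 * np.abs(w).sum() + lam_ent * ent + lam_var * np.var(w)
    return -(r - penalty)  # negated because differential_evolution minimizes

def loocv_scores(factors, target, seed=0):
    # Leakage-free LOOCV: fit weights on n-1 items, freeze them,
    # then score only the held-out item.
    n, k = factors.shape
    held_out = np.empty(n)
    for i in range(n):
        mask = np.arange(n) != i
        res = differential_evolution(
            objective, bounds=[(0.0, 1.0)] * k,
            args=(factors[mask], target[mask]),
            popsize=15, maxiter=50, tol=1e-4, polish=True, seed=seed)
        held_out[i] = factors[i] @ to_simplex(res.x)
    return held_out
```

Normalizing inside the objective (rather than constraining the optimizer directly) is a common way to keep box-bounded Differential Evolution while still searching only simplex-feasible weightings; the optional L-BFGS-B refinement the abstract mentions corresponds to `polish=True`.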