A1285
Title: Handling missing responses and latent dependence with applications to language model evaluation
Authors: Zhenghao Zeng - Carnegie Mellon University (United States)
David Arbour - Adobe Research (United States) [presenting]
Avi Feller - University of California at Berkeley (United States)
Ishita Dasgupta - Adobe Research (United States)
Atanu Sinha - Adobe Research (United States)
Edward Kennedy - Carnegie Mellon University (United States)
Abstract: Human annotations play a crucial role in evaluating the performance of GenAI models. Two common challenges in practice, however, are missing annotations (the response variable of interest) and cluster dependence among human-AI interactions (e.g., questions asked by the same user may be highly correlated). Reliable inference on average scores from human annotations must address both issues to achieve unbiased estimation and appropriately quantify uncertainty. We analyze the doubly robust estimator, a widely used method in missing data analysis and causal inference, in this setting and establish novel theoretical properties under cluster dependence. We further illustrate our findings through simulations and a real-world conversation quality dataset. The theoretical and empirical results underscore the importance of accounting for cluster dependence in missing response problems for valid statistical inference.
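Illustration: the kind of estimator the abstract refers to can be sketched as a doubly robust (augmented inverse probability weighted) estimate of a mean score when some annotations are missing, paired with a cluster-robust standard error. This is a minimal sketch under stated assumptions, not the authors' implementation. The function name dr_mean_cluster, its arguments, and the sum-within-cluster sandwich variance are illustrative choices. The nuisance estimates pi_hat (probability that an annotation is observed) and mu_hat (predicted score) are assumed to have been fitted elsewhere.

```python
import numpy as np

def dr_mean_cluster(y, r, pi_hat, mu_hat, cluster):
    """Doubly robust (AIPW) estimate of a mean with missing responses.

    All names are hypothetical, chosen for illustration only.
    y       : scores; entries where r == 0 are ignored (may be NaN)
    r       : 1 if the annotation is observed, 0 if missing
    pi_hat  : estimated probability that the annotation is observed
    mu_hat  : estimated conditional mean of the score
    cluster : cluster label per row (e.g. a user id)
    """
    r = np.asarray(r, dtype=float)
    pi_hat = np.asarray(pi_hat, dtype=float)
    mu_hat = np.asarray(mu_hat, dtype=float)
    # Replace missing responses with 0 so NaN never enters the sum.
    y = np.where(r == 1, np.asarray(y, dtype=float), 0.0)

    # Per-row doubly robust score: outcome model plus weighted residual.
    phi = mu_hat + r * (y - mu_hat) / pi_hat
    est = phi.mean()
    resid = phi - est
    n = phi.size

    # Cluster-robust SE: sum residuals within each cluster, then square.
    _, idx = np.unique(np.asarray(cluster), return_inverse=True)
    cluster_sums = np.bincount(idx, weights=resid)
    se_cluster = np.sqrt(np.sum(cluster_sums ** 2)) / n

    # Naive SE that treats every row as independent, for comparison.
    se_iid = np.sqrt(np.sum(resid ** 2)) / n
    return est, se_cluster, se_iid

# Toy data (made up): four interactions from two users.
est, se_cluster, se_iid = dr_mean_cluster(
    y=[1.0, 2.0, np.nan, 4.0],
    r=[1, 1, 0, 1],
    pi_hat=[0.75] * 4,
    mu_hat=[2.0] * 4,
    cluster=[0, 0, 1, 1],
)
```

When responses within a cluster are positively correlated, the independence-based standard error tends to understate uncertainty relative to the cluster-robust one, which is the practical concern the abstract raises.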