A0180

Title: How to validate an AI that outperforms humans
Authors: Karl Rohe - University of Wisconsin-Madison (United States) [presenting]
Abstract: In various repetitive tasks, modern language models (e.g., ChatGPT) have the potential to exceed the quality of human-generated data. This creates a fundamental challenge in evaluating and validating language models for these tasks: how can a system be validated when the validation labels (i.e., those from humans) are potentially less accurate than the system's outputs? If humans are treated as the gold standard, then every disagreement is counted as an error "blamed on" the AI. A path forward is provided: a class of statistical models and algorithms that help to better understand and use Cohen's Kappa.
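
For readers unfamiliar with the statistic named in the abstract, the following is a minimal sketch of the standard textbook definition of Cohen's Kappa, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance from each rater's label frequencies. It is not the class of models and algorithms proposed in the talk. The function name `cohens_kappa` and the toy `human` and `model` label lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Standard Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: sum over categories of the product of each
    # rater's marginal frequency for that category.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)

    if p_e == 1.0:
        # Both raters used one identical category; kappa is undefined.
        return float("nan")
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy data: eight items labeled by a human and by a model.
human = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
model = ["pos", "pos", "neg", "pos", "pos", "neg", "neg", "neg"]

kappa = cohens_kappa(human, model)
print(f"kappa = {kappa:.2f}")
```

Note that the statistic is symmetric: swapping `human` and `model` gives the same value. Kappa therefore measures agreement only, and says nothing by itself about which rater is the more accurate one when they disagree.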