EcoSta 2024, Submission A0717
Title: Is this model reliable for everyone? Testing for strong calibration
Authors: Jean Feng - UCSF (United States) [presenting]
Abstract: In well-calibrated risk prediction models, the average predicted probability is close to the true event rate within any given subgroup. Such models are reliable across heterogeneous populations and satisfy strong notions of algorithmic fairness. However, auditing machine learning models for strong calibration is difficult because of the sheer number of potential subgroups, so common practice is to assess calibration only with respect to a few prespecified subgroups. Recent developments in goodness-of-fit testing offer potential solutions but are not designed for settings with weak signals or small, poorly calibrated subgroups. A new testing procedure is introduced based on the following insight: if a poorly calibrated subgroup exists, then ordering observations by their expected residuals should produce a detectable shift in the association between the predicted and observed residuals along this sequence. This reframes calibration testing as a changepoint detection problem, for which powerful methods already exist. A sample-splitting procedure is first introduced, in which a portion of the data is used to train candidate models for predicting the residual and the remaining data are used to perform an adaptive score-based cumulative sum (CUSUM) test. The test is then extended to incorporate cross-validation while maintaining Type I error control. Compared to existing methods, the proposed procedure consistently achieved higher power in simulation studies and real-world data analyses.
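
To make the sample-splitting idea concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the audited model's predicted probabilities are given, fits a generic gradient-boosting regressor as the candidate residual model, uses a plain unweighted CUSUM statistic in place of the adaptive score-based test described above, and approximates the null distribution by permutation. The function names split_cusum_calibration_test and cusum_statistic are hypothetical.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split


def cusum_statistic(values):
    # Maximum absolute standardized partial sum of the centered values.
    centered = values - values.mean()
    partial = np.cumsum(centered)
    scale = values.std(ddof=1) * np.sqrt(len(values)) + 1e-12
    return np.max(np.abs(partial)) / scale


def split_cusum_calibration_test(X, y, predicted_prob, n_perm=1000, seed=0):
    # Sample-splitting audit: train a residual model on one half of the data,
    # order the held-out half by its expected residual, and run a CUSUM test
    # on the observed residuals along that ordering.
    rng = np.random.default_rng(seed)
    resid = np.asarray(y, dtype=float) - np.asarray(predicted_prob, dtype=float)
    X_tr, X_te, r_tr, r_te = train_test_split(
        X, resid, test_size=0.5, random_state=seed
    )

    # Candidate model for the expected residual E[y - p(x) | x].
    resid_model = GradientBoostingRegressor().fit(X_tr, r_tr)
    order = np.argsort(resid_model.predict(X_te))
    stat = cusum_statistic(r_te[order])

    # Permutation null: if the audited model is well calibrated, the learned
    # ordering carries no information about the held-out residuals.
    null_stats = np.array(
        [cusum_statistic(rng.permutation(r_te)) for _ in range(n_perm)]
    )
    p_value = (1 + np.sum(null_stats >= stat)) / (1 + n_perm)
    return stat, p_value

Under these assumptions, a small p-value indicates that the residual model has found an ordering along which the audited model's residuals drift, i.e., evidence of a poorly calibrated subgroup; the cross-validated extension mentioned in the abstract reuses all of the data rather than a single split while maintaining Type I error control.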