CFE-CMStatistics 2025
A0843
Title: Enhanced validation of tabular synthetic data: Assessing propensity score resemblance metrics
Authors: Nora Amama Ben Hassun - Universitat Politecnica de Catalunya, BarcelonaTech (UPC) (Spain) [presenting]
Daniel Fernandez - Universitat Politecnica de Catalunya, BarcelonaTech (UPC) (Spain)
Jordi Cortes Martinez - Universitat Politecnica de Catalunya, BarcelonaTech (UPC) (Spain)
Abstract: Rigorous assessment of validation metrics is a prerequisite for a unified, variable-class-aware framework for tabular synthetic data. In particular, multivariate resemblance metrics quantify the similarity between original data (OD) and synthetic data (SD), guide synthesizer refinement, and standardize method comparison. A simulation study, followed by a real-data case study, was implemented in R using the synthpop package. Synthetic datasets were generated for different sample sizes (n) and numbers of variables (p) under the null hypothesis that OD and SD come from the same population. Null distributions of three propensity score-based metrics were derived to quantify type I error. To evaluate statistical power, alternative scenarios introduced controlled shifts in means, variances, intervariable correlations, and distributional symmetry. Propensity scores were estimated via logistic regression with a train-test split to prevent classifier overfitting. All three metrics controlled type I error; scenarios in which each metric fails are also delineated. The metric with superior statistical power across the alternative scenarios is identified. The findings suggest that certain metrics may be employed to validate the resemblance of SD from a multivariate perspective when the variables are numerical. Further research is required to extend the approach to other classes of variables.
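
The propensity-score idea described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three metrics studied, so the example below assumes one widely used candidate, the propensity mean squared error (pMSE), which averages the squared distance between each predicted propensity score and the share of synthetic rows. The study itself was carried out in R with synthpop; this sketch uses standard-library Python, a hand-written logistic regression, and hypothetical helper names (`fit_logistic`, `pmse_holdout`, `sample`) purely for illustration, and it is not the authors' implementation.

```python
import math
import random


def predict(w, x):
    """Predicted probability that row x is synthetic, given weights [bias, w1..wp]."""
    z = w[0] + sum(wj * xj for wj, xj in zip(w[1:], x))
    z = max(-30.0, min(30.0, z))  # clip to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.1, epochs=300):
    """Fit logistic regression by batch gradient descent (illustrative, not optimized)."""
    p, n = len(X[0]), len(X)
    w = [0.0] * (p + 1)
    for _ in range(epochs):
        grad = [0.0] * (p + 1)
        for xi, yi in zip(X, y):
            err = predict(w, xi) - yi
            grad[0] += err
            for j in range(p):
                grad[j + 1] += err * xi[j]
        w = [wj - lr * gj / n for wj, gj in zip(w, grad)]
    return w


def pmse_holdout(original, synthetic, test_frac=0.3, seed=0):
    """Assumed metric: pMSE evaluated on a held-out test split.

    Rows are labelled 0 (original) or 1 (synthetic), the classifier is fitted
    on the training part, and scores are evaluated on the test part only.
    """
    rng = random.Random(seed)
    rows = [(x, 0) for x in original] + [(x, 1) for x in synthetic]
    rng.shuffle(rows)
    n_test = int(len(rows) * test_frac)
    test, train = rows[:n_test], rows[n_test:]
    w = fit_logistic([x for x, _ in train], [y for _, y in train])
    c = sum(y for _, y in test) / len(test)  # share of synthetic rows in the test set
    return sum((predict(w, x) - c) ** 2 for x, _ in test) / len(test)


def sample(n, p, mean, rng):
    """Draw n rows of p independent normal variables (toy data, not from the study)."""
    return [[rng.gauss(mean, 1.0) for _ in range(p)] for _ in range(n)]


rng = random.Random(42)
od = sample(300, 3, 0.0, rng)        # stand-in for original data
sd_null = sample(300, 3, 0.0, rng)   # same population as OD (null scenario)
sd_shift = sample(300, 3, 1.0, rng)  # mean-shifted alternative scenario

pmse_null = pmse_holdout(od, sd_null)
pmse_shift = pmse_holdout(od, sd_shift)
print(f"pMSE under null: {pmse_null:.4f}, under mean shift: {pmse_shift:.4f}")
```

When OD and SD come from the same population, the classifier cannot separate them and the scores stay close to the synthetic share, so the pMSE is near zero. A shift in means makes the rows distinguishable and pushes the value up. Repeating the null scenario many times would give an empirical null distribution of the kind the abstract uses to assess type I error.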