CMStatistics 2023
B1192
Title: Vine copula based synthetic data generation for classification: A privacy and utility analysis
Authors: Elisabeth Griesbauer - University of Oslo (Norway) [presenting]
Claudia Czado - Technical University of Munich (Germany)
Arnoldo Frigessi - University of Oslo (Norway)
Ingrid Hobaek Haff - University of Oslo (Norway)
Abstract: Synthetic data are faithful copies of real data. They can be used as a substitute for real data when the latter cannot be shared or made public for privacy reasons. Synthetic data preserve privacy if they do not leak specific information on a single observation in the real data, and they achieve utility if they allow answering the research question originally posed to the real data. Commonly used methods for synthetic data generation include generative adversarial networks and variational autoencoders, which are based on neural networks and can be computationally intensive to train. Here, vine copulas are used as a synthetic data generator, and the focus is on the case where the task is classification. In this setting, the synthetic data should allow estimating a classification rule similar to the one that would be estimated on the real data. To increase privacy while maintaining utility, the tree structure and truncation level of the vine copula are exploited. In a privacy and utility analysis, vine copulas outperform differentially private competitor models in terms of utility while achieving comparably high privacy.
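The abstract gives no implementation details, so the following is only a toy sketch of the general idea: fit a copula per class, sample synthetic observations, and check utility by training a classifier on the synthetic data and testing it on held-out real data. It makes several assumptions that are not from the abstract. A bivariate Gaussian copula stands in for the vine copula (a vine copula builds higher-dimensional dependence from pair copulas of this kind), the classifier is a nearest-centroid rule, the data are simulated, and all function names are hypothetical. It does not illustrate vine tree structure selection, truncation, or any privacy measure.

```python
import random
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()

def pseudo_obs(values):
    """Map a sample to the open interval (0, 1) via ranks: rank / (n + 1)."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    u = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        u[idx] = rank / (n + 1)
    return u

def correlation(x, y):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def fit_gaussian_copula(x, y):
    """Estimate the copula correlation from normal scores; keep sorted margins."""
    zx = [STD_NORMAL.inv_cdf(u) for u in pseudo_obs(x)]
    zy = [STD_NORMAL.inv_cdf(u) for u in pseudo_obs(y)]
    rho = max(-1.0, min(1.0, correlation(zx, zy)))
    return {"rho": rho, "x": sorted(x), "y": sorted(y)}

def empirical_quantile(sorted_values, u):
    """Inverse of the empirical distribution function (step function)."""
    idx = min(int(u * len(sorted_values)), len(sorted_values) - 1)
    return sorted_values[idx]

def sample_gaussian_copula(model, n, rng):
    """Draw n synthetic pairs: correlated normals -> uniforms -> margins."""
    rho = model["rho"]
    out = []
    for _ in range(n):
        z1 = rng.gauss(0, 1)
        z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
        u1, u2 = STD_NORMAL.cdf(z1), STD_NORMAL.cdf(z2)
        out.append((empirical_quantile(model["x"], u1),
                    empirical_quantile(model["y"], u2)))
    return out

def make_real(n, rng):
    """Simulated stand-in for real data: two classes with shifted means."""
    data = {0: [], 1: []}
    for label, shift in ((0, 0.0), (1, 3.0)):
        for _ in range(n):
            a = rng.gauss(0, 1)
            b = 0.7 * a + 0.5 * rng.gauss(0, 1)
            data[label].append((a + shift, b + shift))
    return data

def centroids(data):
    """Nearest-centroid 'training': one mean vector per class."""
    return {label: (mean(p[0] for p in pts), mean(p[1] for p in pts))
            for label, pts in data.items()}

def accuracy(cents, data):
    """Share of points whose closest centroid matches their label."""
    correct = total = 0
    for label, pts in data.items():
        for p in pts:
            pred = min(cents, key=lambda c: (p[0] - cents[c][0]) ** 2
                                            + (p[1] - cents[c][1]) ** 2)
            correct += pred == label
            total += 1
    return correct / total

rng = random.Random(0)
real_train = make_real(300, rng)
real_test = make_real(300, rng)

# One copula per class, so the class label is preserved by construction.
synthetic = {}
for label, pts in real_train.items():
    xs, ys = zip(*pts)
    synthetic[label] = sample_gaussian_copula(fit_gaussian_copula(xs, ys), 300, rng)

acc_real = accuracy(centroids(real_train), real_test)
acc_synth = accuracy(centroids(synthetic), real_test)
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```

In this sketch, utility is read off as the gap between the two accuracies: a small gap means the rule learned from synthetic data behaves like the rule learned from real data. Note that the step-function quantile returns marginal values that occur in the real sample, so this toy generator should not be taken as privacy preserving.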