A0517
Title: Synthetic tabular data detection in the wild
Authors: Gaspard Charbel Novixi Kindji - Orange Innovation (France) [presenting]
Abstract: The rapid progress of generative models offers remarkable capabilities but also raises data integrity concerns, especially around distinguishing authentic from synthetic data. This concern has received significant attention for images and text. For tabular data, however, the landscape of generative models keeps growing richer while detection techniques receive little attention. Detecting synthetic tabular data is uniquely difficult because tables are heterogeneous and vary in structure, so the main difficulty lies in data representation rather than in the classifier itself. The challenge of detecting synthetic tabular data in real-world scenarios, where detectors must generalize to unseen table formats, is addressed. A novel datum-wise transformer architecture is introduced, designed to operate across arbitrary tabular structures: it encodes all features as text, with an independent embedding for each feature. The method achieves an AUC of 0.67, compared to 0.60 for the best competitor, and domain adaptation techniques further improve this to 0.69. The approach is a step toward reliable, scalable detection of synthetic tabular data and can be extended to other mainstream predictive tasks involving tabular data.
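As a rough illustration of the ideas named in the abstract, the following minimal sketch shows how rows from tables with different schemas could be turned into one text string per feature, and how an AUC score can be computed from detector outputs. It is not the authors' implementation: the `name: value` text format, the function names `serialize_row` and `auc`, and the example rows are all assumptions made for illustration, and the transformer and embedding layers themselves are omitted.

```python
from typing import Any, Dict, List, Sequence


def serialize_row(row: Dict[str, Any]) -> List[str]:
    """Turn one table row into one text string per feature.

    The "name: value" format is a hypothetical choice; the abstract only
    says that features are encoded as text and embedded independently.
    """
    return [f"{name}: {value}" for name, value in row.items()]


def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive example (label 1)
    scores higher than a random negative one (label 0); ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Rows from two tables with different schemas go through the same serializer,
# so a downstream model never depends on a fixed column layout.
row_a = {"age": 42, "city": "Paris"}
row_b = {"price": 9.99, "in_stock": True, "sku": "A-17"}
tokens_a = serialize_row(row_a)
tokens_b = serialize_row(row_b)
```

Because each feature becomes its own string, a row with two columns and a row with three columns are handled identically, which is the property a detector needs in order to generalize to unseen table formats.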