View Submission - EcoSta2023
A0634
Title: Statistical properties of compression analytics
Authors: Kurtis Shuler - Sandia National Laboratories (United States) [presenting]
Alexander Foss - Sandia National Laboratories (United States)
Christina Ting - Sandia National Laboratories (United States)
Travis Bauer - Sandia National Laboratories (United States)
Richard Field - Sandia National Laboratories (United States)
Abstract: Compression analytics (CA) uses file compression algorithms to perform many predictive and inferential tasks typically associated with statistics and machine learning, such as clustering, anomaly detection, and classification. Unlike more traditional approaches, CA does not require explicitly defined covariates or engineered features and can be applied to any set of arbitrary bitstreams. CA's great flexibility allows it to be rapidly prototyped, tested, and deployed across a wide range of problems and domains, but its black-box nature has hindered connections to existing statistical theory. For lossless or near-lossless compression, this disconnect can be bridged by relating a bitstream's compression ratio to an explicit or implied model likelihood, enabling a wide variety of existing statistical theories and techniques to be applied to CA. As examples of how these connections can be employed, this relationship is exploited to show how existing model selection techniques such as AIC and BIC can be utilized in CA and to develop a novel EM-like CA clustering algorithm. Finally, the efficacy of these algorithms is demonstrated by applying them to both real and simulated datasets.
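
The link between compressed size and likelihood can be illustrated with a minimal Python sketch. It is not the authors' method: it assumes the standard coding-theory reading in which a lossless compressor's output length in bits acts as a negative base-2 log-likelihood under the compressor's implied model. The compressor (zlib) and the helper names `code_length_bits`, `conditional_bits`, and `classify` are all hypothetical choices made for illustration.

```python
import math
import zlib


def code_length_bits(data: bytes) -> int:
    """Compressed size in bits, read here as an approximate -log2 likelihood
    under the model implied by the compressor (an assumption of this sketch)."""
    return 8 * len(zlib.compress(data, 9))


def pseudo_log_likelihood(data: bytes) -> float:
    """Convert the code length from bits to a log-likelihood in nats."""
    return -code_length_bits(data) * math.log(2)


def conditional_bits(x: bytes, context: bytes) -> int:
    """Extra bits needed to encode x once the context has already been encoded."""
    return code_length_bits(context + x) - code_length_bits(context)


def classify(x: bytes, corpora: dict[str, bytes]) -> str:
    """Assign x to the class whose corpus makes x cheapest to encode,
    i.e. the class with the highest implied conditional likelihood."""
    return min(corpora, key=lambda label: conditional_bits(x, corpora[label]))


# Toy corpora, invented purely for this illustration.
corpora = {
    "text": b"the quick brown fox jumps over the lazy dog. " * 20,
    "digits": b"0123456789 9876543210 " * 20,
}
```

Calling `classify(b"the lazy dog jumps over the quick brown fox. ", corpora)` picks the class whose corpus shares the most structure with the input. No features are defined at any point; the compressor operates on the raw bytes.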