Project Overview
Overview
These works apply Rademacher averages to obtain state-of-the-art approximation algorithms for classic data science tasks, including pattern mining and betweenness centrality estimation. Bavarian focuses on approximating Betweenness Centrality (BC) in graphs. By using the MCERA, Bavarian provides strong, sample-dependent approximation guarantees, significantly improving over existing methods by reducing error bounds by 2-4 times. The algorithms offer a unifying framework for comparing various BC estimators and introduce new sample-complexity results based on the graph's vertex-diameter. Similarly, MCRapper applies MCERAs to POSET-structured function families common in pattern mining tasks. It efficiently computes upper bounds on the maximum deviation of sample means, enabling the identification of statistically significant patterns and approximations of high-expectation function sets. MCRapper also leads to the development of TFP-R, an algorithm for True Frequent Pattern mining, which outperforms current methods by offering higher precision and recall.
Both algorithms demonstrate the flexibility and power of MCERAs in providing robust, efficient solutions to complex problems in graph analysis and pattern mining. Their ability to deliver strong statistical guarantees and improve computational efficiency highlights their significant potential applications in various fields requiring precise data analysis and pattern recognition. In machine learning, Rademacher averages are generally used to bound the deviation between the training and test risk of the learned model, but these applications leverage the strength of the Rademacher average more effectively, by simultaneously bounding large families of statistics, which are then used to produce domain-specific approximations. These projects are closely related to my work on empirical centralization, and I have employed similar techniques in adversarial learning and fair adversarial learning settings, as well as empirical game theory, to simultaneously bound many statistics efficiently.