Learning Low-Dimensional Models for Heterogeneous Data
Modern data analysis increasingly involves extracting patterns from large and messy data collected from myriad heterogeneous sources. The scale and diversity present exciting new opportunities for discovery, but also create a need for new statistical techniques and theory tailored to these settings. This defense reviews two projects. The first project considers data with heterogeneous quality, i.e., some samples are noisier than others, and analyzes a weighted variant of the ubiquitous dimensionality reduction method, Principal Component Analysis (PCA). Weights allow the user to give noisier samples less influence, but how to choose these weights optimally was an open question. We characterize the asymptotic performance of any choice of weights and find optimal weights. Surprisingly, the commonly chosen inverse noise variance weights are *not* optimal. The second project generalizes the (increasingly) standard method of Canonical Polyadic (CP) tensor decomposition to allow for general notions of fit beyond the traditional least squares, all within a single algorithmic framework. The flexibility of the Generalized CP (GCP) tensor decomposition makes it easier to quickly look at data through multiple low-rank lenses by using different fit functions, and we illustrate the benefits of this capability on various real datasets arising from social networks, neuroscience studies, and weather patterns.
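
For readers unfamiliar with weighted PCA, the following is a minimal sketch of one common formulation: take the leading eigenvectors of the weighted sample covariance, so that samples with smaller weights contribute less. The function name `weighted_pca`, the synthetic two-noise-level data, and the assumption that samples are already centered are all illustrative choices, not details from the talk. The example uses inverse noise variance weights only because they are the common choice mentioned above; it does not implement the optimal weights derived in the first project.

```python
import numpy as np

def weighted_pca(Y, weights, k):
    """Top-k eigenvectors/eigenvalues of the weighted sample covariance.

    Y       : (n, d) array, one sample per row (assumed already centered).
    weights : (n,) nonnegative weights, one per sample.
    k       : number of components to keep.
    """
    Y = np.asarray(Y, dtype=float)
    w = np.asarray(weights, dtype=float)
    # Weighted covariance: sum_i w_i * y_i y_i^T, normalized by the total weight.
    C = (Y * w[:, None]).T @ Y / w.sum()
    eigvals, eigvecs = np.linalg.eigh(C)          # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]         # indices of the k largest
    return eigvecs[:, order], eigvals[order]

# Hypothetical data: a rank-1 signal observed at two noise levels.
rng = np.random.default_rng(0)
d, k = 50, 1
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
noise_var = np.concatenate([np.full(200, 0.1), np.full(800, 5.0)])
n = noise_var.size
scores = 2.0 * rng.standard_normal(n)
Y = np.outer(scores, u) + rng.standard_normal((n, d)) * np.sqrt(noise_var)[:, None]

# Unweighted PCA versus the common inverse noise variance weighting.
U_unweighted, _ = weighted_pca(Y, np.ones(n), k)
U_inverse_var, _ = weighted_pca(Y, 1.0 / noise_var, k)

# Alignment with the true direction u (1.0 means perfect recovery).
print("unweighted:", abs(U_unweighted[:, 0] @ u))
print("inverse variance:", abs(U_inverse_var[:, 0] @ u))
```

Swapping in a different `weights` vector is all that is needed to compare weighting schemes on the same data.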