Faculty Candidate Seminar
Scalable Metric Learning and Fast Similarity Search
Massive data sets are now commonplace in many fields, and such data
presents a tremendous opportunity for inference and knowledge
discovery. However, the challenges in working with large-scale data
are abundant: the data is often complex and high-dimensional, the
number of objects may be in the millions or more, and the data itself
may change over time. In this talk, I will consider one of the core
problems in this domain: how to efficiently retrieve items similar to
a query from a large database. This seemingly simple problem
is central to a variety of data mining and machine learning tasks
including clustering, classification, anomaly detection, duplicate
detection, and others.
The problem can be broken into two parts, and I will first consider
what it means for two objects to be "similar." For instance, what does
it mean for two images to be similar? How does one compare two program
executions? While most existing applications employ generic measures
or hand-tuned functions (for example, the Euclidean distance), the
distance between objects should ideally depend on the end application
and the data itself. I will describe an approach that learns a
Mahalanobis distance function from items in the database together with
information about the desired distance function; this approach uses
LogDet regularization, scales to large and high-dimensional data sets,
and can be applied in online scenarios where the data or the distance
function changes over time.
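
To make this concrete, here is a minimal numerical sketch of the two
ingredients just mentioned: the Mahalanobis distance
d_A(x, y) = sqrt((x - y)^T A (x - y)), parameterized by a positive
semi-definite matrix A, and a single LogDet-style Bregman projection
that adjusts A so that one pair of items meets a target distance. The
function names, dimensions, and target value here are illustrative
assumptions, not the speaker's implementation; the full method handles
many constraints with slack, which a sketch like this omits.

    import numpy as np

    def mahalanobis(A, x, y):
        # d_A(x, y) = sqrt((x - y)^T A (x - y)); A = I recovers
        # the ordinary Euclidean distance.
        z = x - y
        return np.sqrt(z @ A @ z)

    def logdet_projection(A, z, target):
        # Bregman projection of A (under the LogDet divergence) onto
        # the single constraint z^T A z = target. The update is
        # rank-one, so it stays cheap in high dimensions.
        p = z @ A @ z
        beta = (target - p) / p ** 2
        Az = A @ z
        return A + beta * np.outer(Az, Az)

    # Toy usage: start from the Euclidean metric and pull one
    # "similar" pair in to squared distance 0.5.
    rng = np.random.default_rng(0)
    x, y = rng.normal(size=5), rng.normal(size=5)
    A = np.eye(5)
    print("before:", mahalanobis(A, x, y))
    A = logdet_projection(A, x - y, target=0.5)
    print("after: ", mahalanobis(A, x, y))  # = sqrt(0.5)

A useful property of the LogDet update is visible in the code: each
projection is a rank-one modification of A, and the divergence keeps A
positive definite without an explicit projection onto the PSD cone.
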
The second part of the problem is how to retrieve objects similar to a
query from the database in sub-linear time, given an appropriate
distance function. Because the
distance functions are learned, standard techniques such as
locality-sensitive hashing are not directly applicable. I will
describe a hashing-based technique that significantly speeds up
similarity searches so that only a small percentage of the database is
searched for each query. The resulting scheme integrates the metric
learning with the hashing and features strong theoretical performance
guarantees. I will present applications of this work to several
domains, including fast image search and statistical software
debugging.
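
To sketch why a learned metric complicates hashing and how the two can
be integrated: if the learned matrix factors as A = G^T G, then
d_A(x, y) = ||Gx - Gy||, so sign-based random-hyperplane hash bits can
be computed on the embedded point Gx rather than on x itself. The code
below is a minimal illustration under that assumption; the dimensions,
names, and the Gaussian G are hypothetical stand-ins, not the exact
scheme or guarantees from the talk.

    import numpy as np

    def hash_bits(G, R, x):
        # b sign bits of the embedded point Gx: bit k is
        # sign(r_k . Gx). Points separated by a small angle under the
        # learned metric agree on most bits.
        return (R @ (G @ x)) > 0

    rng = np.random.default_rng(0)
    d, b = 5, 32                  # data dimension, number of hash bits
    G = rng.normal(size=(d, d))   # stands in for a factor of a learned A
    R = rng.normal(size=(b, d))   # b random hyperplanes

    x = rng.normal(size=d)
    near = x + 0.01 * rng.normal(size=d)   # close under d_A
    far = rng.normal(size=d)               # unrelated point
    hx = hash_bits(G, R, x)
    print("bits differing (near):", np.sum(hx != hash_bits(G, R, near)))
    print("bits differing (far): ", np.sum(hx != hash_bits(G, R, far)))

In a real index, these bit signatures key a hash table, so a query
only touches the buckets whose signatures match or nearly match its
own; that is where the sub-linear search time comes from.
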
Brian Kulis is a postdoctoral fellow at UC Berkeley EECS and the
International Computer Science Institute. He obtained his PhD in
computer science from the University of Texas in 2008 and his BA in
computer science and mathematics from Cornell University in 2003. His
research focuses on machine learning, data mining, and
large-scale optimization. For his research, he has won three best
student paper awards at top-tier conferences—two at the
International Conference on Machine Learning (in 2005 and 2007) and
one at the IEEE Conference on Computer Vision and Pattern Recognition
(in 2008). He is also the recipient of an MCD graduate fellowship from
the University of Texas (2003-2007) and an Award of Excellence from
the College of Natural Sciences at the University of Texas. In the
fall of 2007, he was a research fellow at the Institute for Pure and
Applied Mathematics at UCLA.