Dissertation Defense

Harnessing the Power of Multi-source Data: an Exploration of Diversity and Similarity

Yang Liu, PhD Candidate, EE: Systems

This dissertation studies a sequence of problems concerning the collection and utilization of data from disparate sources, e.g., those arising in a crowd-sourcing system. It aims to develop learning methods that enhance the quality of decision-making and the performance of learning tasks by exploiting the diversity, similarity and interdependency inherent in a crowd-sourcing system and among disparate data sources. We start our study with a family of problems on sequential decision-making combined with data collection in a crowd-sourcing system, where the goal is to improve the quality of data input or computational output while reducing the cost of using such a system. In this context, the learning methods we develop are closed-loop and online, i.e., decisions are functions of past data observations, present actions determine future observations, and learning occurs as data inputs arrive. The similarity and disparity among different data sources help us in some cases to speed up the learning process (e.g., in a recommender system), and in other cases to perform quality control over data input for which ground truth may not exist (e.g., in a crowd-sourcing market using Amazon Mechanical Turk (AMT)).

We then apply our algorithms to a large quantity of malicious network activity data collected from a diverse set of sources, with the goal of uncovering interconnectedness and similarity between the malicious behaviors of different network entities. Specifically, we apply the online prediction algorithm presented and analyzed in earlier parts of the dissertation to this data and show its effectiveness in predicting next-day maliciousness. Furthermore, we show that properties specific to this data set allow us to map networks' behavioral similarity to similarity in their topological features. This in turn enables prediction even in the absence of measurement data.

Faculty Host

Mingyan Liu