Improving Deep Representation Learning for Complex and Multimodal Data
Supervised deep learning has achieved breakthroughs in many applications of machine learning, including visual object recognition and detection, speech recognition, text understanding, and machine translation. To build an end-to-end system for structured output prediction, however, one needs to incorporate probabilistic inference, since the task may not be a simple many-to-one function approximation problem (e.g., recognition) but rather one of mapping an input to many possible outputs.
In this thesis, we establish a new representation learning framework for structured output prediction and multimodal learning. First, we propose a conditional graphical model for visual object segmentation that combines the best properties of the CRF, which enforces local consistency, and the RBM, which regularizes with the global shape of an object. Furthermore, in light of advances in variational learning of deep generative models, we develop a stochastic convNet trainable by backpropagation. We demonstrate the importance of the shape prior and of probabilistic inference for visual object segmentation and labeling.

Second, we develop a novel multimodal learning framework that minimizes the variation of information. One important property of a generative model of multimodal data is the ability to reason about a missing modality from partial observation. This motivates us to cast the problem as structured output representation learning, where the missing modality is to be predicted from the rest. We explain how our method can be more effective than maximum likelihood learning, and provide theoretical insight into why it is sufficient to estimate the data-generating distribution of multimodal data. We demonstrate state-of-the-art performance on visual-textual and visual-only recognition tasks.
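As background for the objective named above: the variation of information between two variables (here, two modalities) is VI(X; Y) = H(X|Y) + H(Y|X) = 2H(X, Y) - H(X) - H(Y), which is zero exactly when each modality fully determines the other. The thesis minimizes this quantity within a deep generative model; the snippet below is only a minimal illustration of the quantity itself, computed from a discrete joint distribution (the function name and toy tables are illustrative, not from the thesis).

```python
import numpy as np

def variation_of_information(joint):
    """VI(X;Y) = 2*H(X,Y) - H(X) - H(Y) for a discrete joint distribution.

    `joint[i, j]` is P(X = i, Y = j); entries must sum to 1.
    """
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1)  # marginal P(X)
    py = joint.sum(axis=0)  # marginal P(Y)

    def entropy(p):
        p = p[p > 0]  # 0 * log(0) is taken as 0
        return -np.sum(p * np.log(p))

    return 2 * entropy(joint.ravel()) - entropy(px) - entropy(py)

# Perfectly correlated modalities: each determines the other, so VI = 0.
correlated = [[0.5, 0.0],
              [0.0, 0.5]]

# Independent uniform modalities: nothing is shared, VI = 2*log(2).
independent = [[0.25, 0.25],
               [0.25, 0.25]]

print(variation_of_information(correlated))   # near 0
print(variation_of_information(independent))  # near 1.386
```

Minimizing VI thus pushes the model toward joint representations in which one modality can be reconstructed from the other, which is precisely the missing-modality reasoning the framework targets.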