Dissertation Defense

Grounding Language Learning in Vision for Artificial Intelligence and Brain Research

Yizhen Zhang

Most natural language processing models learn words from text alone, whereas humans learn language by relating it to real-world experience. My research aims to ground language learning in visual perception, taking one step closer to making machines learn language as humans do. To this end, I have designed a two-stream model built on deep neural networks: one stream extracts image features, the other extracts language features, and the two merge to connect image and language features in a joint representation space. Using contrastive learning, I first trained the model to align images with their captions, and then refined it to retrieve visual objects from language queries and infer their visual relations. After training, the model’s language stream serves as a stand-alone language model that embeds words in a visually grounded semantic space. The principal dimensions of this space can be explained with human intuition and neurobiological knowledge. The visually grounded language model also enables compositional language understanding based on visual knowledge, as well as multimodal image search with queries that combine an image and text. In addition, the model can explain human brain activity observed with functional magnetic resonance imaging during natural language comprehension, shedding new light on how the brain stores and organizes concepts.
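
As a rough illustration of the image-caption alignment step, the sketch below computes a symmetric contrastive loss over a small batch of paired embeddings. The abstract does not state the exact objective, so this InfoNCE-style loss, the temperature value, and all function names (l2_normalize, cross_entropy_diagonal, contrastive_loss) are assumptions made for illustration, not the model's actual implementation. The toy vectors stand in for the outputs of the image and language streams.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)

def contrastive_loss(image_feats, text_feats, temperature=0.1):
    """Symmetric contrastive loss (assumed form): image i should match caption i."""
    imgs = [l2_normalize(v) for v in image_feats]
    txts = [l2_normalize(v) for v in text_feats]
    # sim[i][j] = temperature-scaled cosine similarity of image i and caption j
    sim = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
           for i in imgs]
    sim_t = [list(col) for col in zip(*sim)]  # caption-to-image direction
    return 0.5 * (cross_entropy_diagonal(sim) + cross_entropy_diagonal(sim_t))

# Toy 2-D embeddings standing in for the two streams' outputs (hypothetical data).
images = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(images, [[0.9, 0.1], [0.2, 0.8]])
shuffled = contrastive_loss(images, [[0.2, 0.8], [0.9, 0.1]])
print(aligned < shuffled)
```

Under these assumptions, correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls matching images and captions together in the joint space.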

Chair: Professor Zhongming Liu