Dissertation Defense
Connecting Sight and Sound through Space, Time and Language
This event is free and open to the public.

PASSCODE: 6wy3Z8
Sight and sound are interconnected modalities that shape our perception of the world. While semantic and temporal correspondences between vision and audio have been widely studied, many other audio-visual correlations remain underexplored. In this talk, we examine these underexplored correspondences through space, time, and language, demonstrating how they can be leveraged in a self-supervised manner. We first investigate the geometric consistency between visual and spatial audio signals to jointly learn sound localization and camera rotation, and we explore how ambient sounds can be used to predict 3D scene structure. Next, we introduce a video-guided sound generation framework that learns semantic and temporal associations across audio, video, and text. Finally, we study the visual correspondence between images and spectrograms, using diffusion models to create visual spectrograms that resemble images yet can also be played as sound.
CHAIR: Professor Andrew Owens