Faculty Candidate Seminar

Toward Visual Learning with Minimal Human Supervision

Vittorio Ferrari
Assistant Professor
ETH, Zurich

A main goal of computer vision is to interpret images of complex scenes by recognizing objects and persons, localizing them, and understanding their relations. Applications include semantic image search, computational photography, robot navigation, and visual surveillance. To acquire a large number of complex class models, visual learning should require little manual supervision. In this talk I will present two ways of reducing supervision.

The first way is to label images only by the object class they contain, without indicating where the object is. Learning from cluttered images is very challenging in this weakly supervised setting. In the traditional paradigm, each class is learned from scratch. In our work, by contrast, knowledge generic over classes is first learned during a meta-training stage from images of diverse classes with given object locations, and is then used to support learning any new class without location annotation. Generic knowledge simplifies weakly supervised learning because during meta-training the system can learn about localizing objects in general. As demonstrated experimentally, this approach enables learning from more challenging images than was possible before, such as those of PASCAL VOC 2007, which contain extensive clutter and large scale and appearance variations between object instances.

The second way is the multi-modal analysis of news items from the web, consisting of images and text captions. We associate names and action verbs in the captions with the faces and body poses of the persons in the images. We introduce a joint probabilistic model for simultaneously recovering image-caption correspondences and learning appearance models for the face and pose classes occurring in the corpus. As demonstrated experimentally, this joint 'face and pose' model solves the correspondence problem better than earlier models covering only the face, and it can perform recognition on new, uncaptioned images.

I will conclude with an outlook on the idea of visual culture, where new visual concepts are learned incrementally on top of all visual knowledge acquired so far. Besides generic knowledge, visual culture also includes knowledge specific to a class, knowledge of scene structures, and other forms of visual knowledge. Potentially, this approach could considerably extend current visual recognition capabilities and produce an integrated body of visual knowledge.
