Andrew Owens’ research group uses visual illusions to test the limits of diffusion models
Is it a canyon or a cat? Ambiguous images contain two distinct scenes within a single image, revealed to the viewer by a change in perspective. For example, the image below looks like a painting of a canyon; however, when turned 90° counterclockwise, it reveals a Maine Coon cat.
Illusions like these—long found in art and popular culture as entertainment—have been used in psychology to understand human visual perception since the late 19th century. Now, Electrical and Computer Engineering (ECE) Prof. Andrew Owens and members of his research group are purposefully generating ambiguous images to understand the capacity and inner workings of diffusion models.
The development of diffusion models, which generate images by removing noise, has been one of the major breakthroughs in machine learning over the last 10 years. They can remove the noise you might notice as the grainy quality of a photo taken at night, but, perhaps more interestingly, they can also start from completely random noise and generate an image from it.
“Diffusion models are kind of like a sculptor carving a block of marble; they start with pure noise, like TV static, and remove that noise incrementally until you’re left with a clean image,” explained Daniel Geng, a member of Owens’ group and a PhD student in Computer Science and Engineering. “In the beginning, the model is hallucinating things from the structure of the random noise itself, but then, later on, it solidifies and adds all the fine details.”
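For readers who want to see the idea in code, here is a deliberately simplified sketch of that “sculpting” loop. The denoiser function is a hypothetical stand-in for a trained model that predicts the noise in an image, and the update rule is a bare-bones illustration rather than any particular published sampler.

```python
import numpy as np

def generate(denoiser, shape, num_steps=50, seed=0):
    """Toy illustration of diffusion sampling: start from pure noise and
    peel a little of it away at every step until a clean image remains.

    denoiser(x, t) is a hypothetical model that predicts the noise in the
    current image x at step t; the linear schedule here is a simplification.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # "TV static": pure Gaussian noise
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(x, t)      # the model's guess at the noise in x
        x = x - predicted_noise / num_steps   # remove a small fraction of it
    return x                                  # an (approximately) clean image
```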
Geng, ECE master’s student Inbum “Aaron” Park, and Owens prompted pretrained diffusion models to create multi-view optical illusions. Although the models had been trained only on single images, they were able to create these “visual anagrams,” which contain two images at once: a transformation of the first image, such as a rotation, a color inversion, or a jigsaw rearrangement, reveals the second.
The research team also created hybrid images with two or more interpretations, depending on the distance of the viewer from the image (i.e., image size), whether the image is in motion, or whether the image is grayscale or colorized. The latter, which the researchers dubbed “color hybrids,” also change appearance when viewed in dark or bright light, a physical manifestation of our inability to see color in the dark.
“This project started as pure exploration, but I think it provides a lot of insights into how generative models work and their capabilities,” said Owens.
“It’s interesting and surprising that models trained on regular images have the capability to incorporate ambiguity and reuse visual elements to function in multiple capacities,” added Geng.
Diffusion models are successful at this task partially because they work iteratively, denoising an image thousands of times to arrive at the final output. The iterative nature of this process allows both images to be refined and adjusted relative to each other, so a change to one image is also reflected in the alternative image. The illusions also work better with certain image styles and prompts that take advantage of the ways that humans visually interpret the world.
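To make that concrete, here is a minimal sketch of one denoising step that refines two interpretations at once, in the spirit of the joint refinement described above. The denoiser function, the flip transformation, the example prompts, and the step size are all stand-ins chosen for illustration; the group’s published method has its own specifics.

```python
import numpy as np

def flip(img):
    """The view transformation: here, turning the image upside down."""
    return img[::-1, ::-1]

def anagram_step(x, denoiser, prompt_a, prompt_b, t, step_size=0.02):
    """One sketch of a joint denoising step for a two-view illusion.

    denoiser(img, prompt, t) is a hypothetical text-conditioned model that
    predicts the noise in img. The noise estimated for the flipped view is
    flipped back so both estimates share the same pixel coordinates, then
    averaged, so an update that sharpens one scene also shapes the other.
    """
    eps_a = denoiser(x, prompt_a, t)               # e.g., "people around a campfire"
    eps_b = flip(denoiser(flip(x), prompt_b, t))   # e.g., "an old man's face", re-aligned
    eps = 0.5 * (eps_a + eps_b)                    # shared update for both scenes
    return x - step_size * eps
```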
For example, Geng said, the prompt used to generate the image below of people clustered around a campfire, which reveals an image of an old man when flipped upside down, checks all the boxes for a successful prompt. The oil painting style allows for ambiguity in interpretation; the scene of people around a campfire is both abstract and familiar; and humans are very sensitive to human faces.
“People see faces everywhere—in cereal bowls, on the moon, wherever,” Geng noted with a laugh.
Illusions have historically been handcrafted, requiring human supervision to arrange the elements strategically for human interpretation, and even then, viewers can often spot the illusion easily (for example, in early hybrid images). Geng and Park’s method is one of the first to create unique, realistic-looking illusions using a diffusion model. Concurrent work at Stony Brook University has generated similar illusions using different techniques.
So far, Geng and Park have pushed the models to generate up to four transformations within a single image. Geng’s PhD research will continue to explore and expand the classes of illusions he can prompt diffusion models to produce.
“These models are powerful. They can make all of these amazing images, but the way we control them is limited. We just use text,” Geng said. “Part of my work is figuring out useful ways to control them and harness their power.”
In the meantime, he has released the code online so that people can test the models and make their own illusions. To Geng’s delight, one person has already used the code to propose to his fiancé with an illusion-based puzzle.
Another member of Owens’ research group, ECE PhD student Ziyang Chen, has focused on using diffusion models to generate images that can be played as sound: in other words, spectrograms that look like images. The sound information is encoded in the grayscale image as audio frequencies over time, but the image can be colorized to enhance the visual imagery.
“In this work, we demonstrate that, perhaps surprisingly, there is non-trivial overlap between the distribution of natural images and the distribution of natural spectrograms,” said Chen.
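As a rough illustration of how such an image can be “played,” the sketch below treats a grayscale image as a magnitude spectrogram and reconstructs a waveform from it with the Griffin-Lim algorithm. The file names, the contrast exponent, and the sample rate are arbitrary placeholder choices, and this is not the pipeline used in the paper.

```python
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

# Load a grayscale image and scale its pixels to [0, 1].
# "illusion.png" is a placeholder file name.
img = np.asarray(Image.open("illusion.png").convert("L"), dtype=np.float32) / 255.0

# Treat rows as frequency bins and columns as time frames; flip vertically so
# low frequencies sit in the bottom rows, as in a conventional spectrogram.
# The exponent simply stretches the dynamic range and is an arbitrary choice.
magnitude = np.flipud(img) ** 3.0

# Griffin-Lim iteratively estimates the phase the magnitudes are missing,
# then inverts the result to a time-domain waveform.
waveform = librosa.griffinlim(magnitude, n_iter=64)

# Write the result out at an arbitrary sample rate so it can be listened to.
sf.write("illusion.wav", waveform, 16000)
```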
The following three papers describing the research have been accepted to or presented at professional conferences:
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models, by Daniel Geng, Inbum Park, and Andrew Owens, Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Factorized Diffusion: Perceptual Illusions by Noise Decomposition, by Daniel Geng, Inbum Park, and Andrew Owens, European Conference on Computer Vision (ECCV 2024)
Images that Sound: Composing Images and Sounds on a Single Canvas, by Ziyang Chen, Daniel Geng, and Andrew Owens, Conference on Neural Information Processing Systems (NeurIPS 2024)