Andrew Owens’ research group uses visual illusions to test the limits of diffusion models
Is it a canyon or a cat? Ambiguous images contain two distinct scenes within a single image, revealed to the viewer by a change in perspective. For example, the image below looks like a painting of a canyon; however, when turned 90° counterclockwise, it reveals a Maine Coon cat.
Illusions like these—long found in art and popular culture as entertainment—have been used in psychology to understand human visual perception since the late 19th century. Now, Electrical and Computer Engineering (ECE) Prof. Andrew Owens and members of his research group are purposefully generating ambiguous images to understand the capacity and inner workings of diffusion models.
The development of diffusion models, which generate images by removing noise, has been one of the major breakthroughs in machine learning over the last 10 years. They can remove the noise you might notice as the grainy quality of a photo taken at night, but, perhaps more interestingly, they can also start from completely random noise and generate an image from it.
“Diffusion models are kind of like a sculptor carving a block of marble; they start with pure noise, like TV static, and remove that noise incrementally until you’re left with a clean image,” explained Daniel Geng, a member of Owens’ group and a PhD student in Computer Science and Engineering. “In the beginning, the model is hallucinating things from the structure of the random noise itself, but then, later on, it solidifies and adds all the fine details.”
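For readers who want to see the idea in code, here is a deliberately simplified sketch of that “sculpting” loop. The denoiser function is a hypothetical stand-in for a trained model that predicts the noise in an image, and the update rule is a bare-bones illustration rather than any particular published sampler.

```python
import numpy as np

def generate(denoiser, shape, num_steps=50, seed=0):
    """Toy illustration of diffusion sampling: start from pure noise and
    peel a little of it away at every step until a clean image remains.

    denoiser(x, t) is a hypothetical model that predicts the noise in the
    current image x at step t; the linear schedule here is a simplification.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # "TV static": pure Gaussian noise
    for t in reversed(range(num_steps)):
        predicted_noise = denoiser(x, t)      # the model's guess at the noise in x
        x = x - predicted_noise / num_steps   # remove a small fraction of it
    return x                                  # an (approximately) clean image
```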
Geng, ECE master’s student Inbum “Aaron” Park, and Owens prompted pretrained diffusion models to create multi-view optical illusions. Although the models had been trained only on single images, they were able to create these “visual anagrams,” which contain two images at once: a transformation of the first image, such as a rotation, a color inversion, or a jigsaw rearrangement, reveals the second.
The research team also created hybrid images with two or more interpretations, depending on the distance of the viewer from the image (i.e., image size), whether the image is in motion, or whether the image is grayscale or colorized. The latter, which the researchers dubbed “color hybrids,” also change appearance when viewed in dark or bright light, a physical manifestation of our inability to see color in the dark.
“This project started as pure exploration, but I think it provides a lot of insights into how generative models work and their capabilities,” said Owens.
“It’s interesting and surprising that models trained on regular images have the capability to incorporate ambiguity and reuse visual elements to function in multiple capacities,” added Geng.
Diffusion models are successful at this task partially because they work iteratively, denoising an image thousands of times to arrive at the final output. The iterative nature of this process allows both images to be refined and adjusted relative to each other, so a change to one image is also reflected in the alternative image. The illusions also work better with certain image styles and prompts that take advantage of the ways that humans visually interpret the world.
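To make that concrete, here is a minimal sketch of one denoising step that refines two interpretations at once, in the spirit of the joint refinement described above. The denoiser function, the flip transformation, the example prompts, and the step size are all stand-ins chosen for illustration; the group’s published method has its own specifics.

```python
import numpy as np

def flip(img):
    """The view transformation: here, turning the image upside down."""
    return img[::-1, ::-1]

def anagram_step(x, denoiser, prompt_a, prompt_b, t, step_size=0.02):
    """One sketch of a joint denoising step for a two-view illusion.

    denoiser(img, prompt, t) is a hypothetical text-conditioned model that
    predicts the noise in img. The noise estimated for the flipped view is
    flipped back so both estimates share the same pixel coordinates, then
    averaged, so an update that sharpens one scene also shapes the other.
    """
    eps_a = denoiser(x, prompt_a, t)               # e.g., "people around a campfire"
    eps_b = flip(denoiser(flip(x), prompt_b, t))   # e.g., "an old man's face", re-aligned
    eps = 0.5 * (eps_a + eps_b)                    # shared update for both scenes
    return x - step_size * eps
```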
For example, Geng said, the prompt used to generate the image below of people clustered around a campfire, which reveals an image of an old man when flipped upside down, checks all the boxes for a successful prompt. The oil painting style allows for ambiguity in interpretation; the scene of people around a campfire is both abstract and familiar; and humans are very sensitive to human faces.
“People see faces everywhere—in cereal bowls, on the moon, wherever,” Geng noted with a laugh.
Illusions have historically been handcrafted, requiring human supervision to arrange the elements strategically for human interpretation, and even then, viewers can often spot the illusion easily (for example, in early hybrid images). Geng and Park’s method is one of the first to create unique, realistic-looking illusions using a diffusion model. Concurrent work at Stony Brook University has generated similar illusions using different techniques.
So far, Geng and Park have pushed the models to generate up to four transformations within a single image. Geng’s PhD research will continue to explore and expand the classes of illusions he can prompt diffusion models to produce.
“These models are powerful. They can make all of these amazing images, but the way we control them is limited. We just use text,” Geng said. “Part of my work is figuring out useful ways to control them and harness their power.”
In the meantime, he has released the code online so that people can test the models and make their own illusions. To Geng’s delight, one person has already used the code to propose to his fiancé with an illusion-based puzzle.
Another member of Owens’ research group, ECE PhD student Ziyang Chen, has focused on using diffusion models to generate images that can be played as sound: in other words, spectrograms that look like images. The sound information is encoded in the grayscale image as audio frequencies over time, but the image can be colorized to enhance the visual imagery.
“In this work, we demonstrate that, perhaps surprisingly, there is non-trivial overlap between the distribution of natural images and the distribution of natural spectrograms,” said Chen.
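As a rough illustration of how such an image can be “played,” the sketch below treats a grayscale image as a magnitude spectrogram and reconstructs a waveform from it with the Griffin-Lim algorithm. The file names, the contrast exponent, and the sample rate are arbitrary placeholder choices, and this is not the pipeline used in the paper.

```python
import numpy as np
import librosa
import soundfile as sf
from PIL import Image

# Load a grayscale image and scale its pixels to [0, 1].
# "illusion.png" is a placeholder file name.
img = np.asarray(Image.open("illusion.png").convert("L"), dtype=np.float32) / 255.0

# Treat rows as frequency bins and columns as time frames; flip vertically so
# low frequencies sit in the bottom rows, as in a conventional spectrogram.
# The exponent simply stretches the dynamic range and is an arbitrary choice.
magnitude = np.flipud(img) ** 3.0

# Griffin-Lim iteratively estimates the phase the magnitudes are missing,
# then inverts the result to a time-domain waveform.
waveform = librosa.griffinlim(magnitude, n_iter=64)

# Write the result out at an arbitrary sample rate so it can be listened to.
sf.write("illusion.wav", waveform, 16000)
```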
The following three papers describing the research have been accepted to or presented at professional conferences:
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models, by Daniel Geng, Inbum Park, and Andrew Owens, Conference on Computer Vision and Pattern Recognition (CVPR 2024)
Factorized Diffusion: Perceptual Illusions by Noise Decomposition, by Daniel Geng, Inbum Park, and Andrew Owens, European Conference on Computer Vision (ECCV 2024)
Images that Sound: Composing Images and Sounds on a Single Canvas, by Ziyang Chen, Daniel Geng, and Andrew Owens, Conference on Neural Information Processing Systems (NeurIPS 2024)