Generative AI Research Empowers Creators with Guided Image Structure Control

A new research study expands the creative potential of generative AI with a text-guided image-editing tool. The framework uses plug-and-play diffusion features (PnP DFs) to guide the generation of realistic, precise images. With it, visual content creators can transform an image into a new composition using just a single guidance image and a few descriptive words.

The newfound ability to reliably edit and generate content has the potential to greatly expand creative possibilities for artists, designers, and creators. Furthermore, it could bolster industries that heavily rely on animation, visual design, and image editing.

“Navigating the realm of text-to-image generative models marks a significant milestone in digital content creation. However, the primary obstacle hindering their application in real-world scenarios has been the lack of user control, limited to guiding the generation solely through textual input. Our research represents one of the initial approaches that empowers users with control over the layout of the generated image,” stated Narek Tumanyan, a lead author and Ph.D. candidate at the Weizmann Institute of Science.

Recent advancements in generative AI have paved the way for novel methods for developing powerful text-to-image models. Nonetheless, existing rendering techniques are constrained by complexities, ambiguity, and the requirement for customized content.

This study introduces a fresh approach that employs PnP DFs to enhance the process of image editing and generation, granting creators greater authority over their final output.

The researchers begin by addressing a fundamental question: How do diffusion models represent and capture the shape or outline of an image? The study delves into the internal representations of evolving images throughout the generation process, exploring how these representations encode both shape and semantic information.

The method enables control over the generated layout without training or fine-tuning a new diffusion model. Instead, it exploits the way spatial information is encoded in a pre-trained text-to-image model. During generation, the method extracts diffusion features from a provided guidance image and injects them at each step, enabling fine-grained control over the structure of the resulting image.

By incorporating these spatial features, the diffusion model refines the newly generated image to match the layout of the guidance image. This refinement occurs iteratively, with the image features being updated until a final image is achieved that not only preserves the structure of the guide image but also aligns with the provided text prompt.
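The control flow described above can be illustrated with a deliberately tiny sketch. It is not the authors' implementation: `toy_layer`, `toy_denoise_step`, `generate`, and the numeric constants are hypothetical stand-ins for a real pre-trained diffusion network, and the text prompt is reduced to a single `prompt_bias` number. The sketch only shows the two-pass pattern: record intermediate features while processing the guidance image, then overwrite the corresponding features at every step of a new generation.

```python
import random

def toy_layer(values, weight):
    """Hypothetical stand-in for a network layer: scale, then clip negatives to zero."""
    return [max(0.0, weight * v) for v in values]

def toy_denoise_step(x, prompt_bias, injected=None, record=None):
    """One toy denoising step (an assumption, not a real diffusion model).

    The first layer produces the 'spatial features'. They are appended to
    `record` if it is given, and replaced by `injected` if it is given.
    """
    features = toy_layer(x, 0.9)
    if record is not None:
        record.append(list(features))
    if injected is not None:
        features = list(injected)  # feature injection from the guidance pass
    # The second stage mixes the features with a prompt-dependent term.
    return [0.5 * f + prompt_bias for f in features]

def generate(x_start, prompt_bias, steps, injected_per_step=None):
    """Run `steps` toy denoising steps, optionally injecting features at each one."""
    x = list(x_start)
    recorded = []
    for t in range(steps):
        injected = None if injected_per_step is None else injected_per_step[t]
        x = toy_denoise_step(x, prompt_bias, injected=injected, record=recorded)
    return x, recorded

STEPS = 5
guidance_image = [0.0, 1.0, 4.0, 1.0, 0.0]  # a 1-D "image" with a central peak

# Pass 1: process the guidance image and record its features at every step.
guidance_out, guidance_features = generate(guidance_image, prompt_bias=0.1, steps=STEPS)

# Pass 2: start from noise with a different "prompt" and inject the recorded features.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in guidance_image]
edited, _ = generate(noise, prompt_bias=0.7, steps=STEPS,
                     injected_per_step=guidance_features)

# The edited result keeps the guidance layout; only a uniform "appearance" shift differs.
offsets = [e - g for e, g in zip(edited, guidance_out)]
```

In this toy, every element of `offsets` equals the difference between the two prompt biases, so the edited output keeps the peak of the guidance image while its overall level follows the new prompt. A real model has many layers and a far richer notion of appearance, so the separation is much less clean than it looks here.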

“This approach offers a straightforward and efficient solution, where features obtained from the guidance image are directly incorporated into the image generation process, requiring no training or fine-tuning,” the authors explain.

This methodology sets the stage for more advanced techniques in controlled generation and manipulation.

The researchers developed and tested the PnP framework using the cuDNN-accelerated PyTorch framework on a single NVIDIA A100 GPU, which they were awarded as recipients of the NVIDIA Applied Research Accelerator Program. The team emphasizes that the GPU's ample capacity enabled them to focus on method development.

Running on the A100, the framework generates a new image from the guidance image and text prompt in approximately 50 seconds.

Not only is the process effective, but it also produces highly accurate and visually striking imagery. Moreover, it extends beyond images, allowing for the translation of sketches, drawings, and animations, while enabling modifications in lighting, color, and backgrounds.

Their approach surpasses existing text-to-image models, striking a superior balance between preserving the guidance layout and deviating from its appearance.

However, the model does have limitations. It struggles to edit image sections with arbitrary colors since it cannot extract semantic information from the input image.

The researchers are currently working on extending this approach to text-guided video editing. Furthermore, their work has proven valuable in other research endeavors that leverage the analysis of internal image representations in diffusion models.

For instance, one study utilizes the insights from this research to enhance computer vision tasks like semantic point correspondence. Another focuses on expanding controls for text-to-image generation, encompassing object shape, placement, and appearance.

The research team from the Weizmann Institute of Science will present their study at CVPR 2023. The work is also available as an open-source project on GitHub.
