How to Guide Stable Diffusion with VGG Features, Style Loss, and Latent MAE¶
Some time ago I started an experiment combining Stable Diffusion with more traditional computer vision techniques. The goal? To create images that weren't just prompted by text, but also guided by a starting image, a visual style, and reference content. Here are some of the results, along with a working implementation you can try yourself.
What is Stable Diffusion?¶
Before diving into the details, let's briefly explain Stable Diffusion. It's a collection of neural network models that together create images from text descriptions. The process works by gradually transforming random noise into a coherent image, guided by your text prompt. Think of it like a digital artist that starts with a canvas of static and progressively refines it into a detailed image.
What makes Stable Diffusion particularly interesting is that it works in a "latent space" - a compressed representation of images where similar visual concepts are mapped close together. This is different from working directly with pixels, and it's one of the reasons the model can generate such high-quality results.
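For a concrete picture of that latent space, here is a minimal sketch of encoding an image into it with the Hugging Face diffusers library. The model id is just an example checkpoint, and the 0.18215 scaling factor is the standard Stable Diffusion v1 convention rather than anything specific to this project:

```python
import numpy as np
import torch
from diffusers import AutoencoderKL
from PIL import Image

# The VAE that Stable Diffusion v1 uses to map images to and from latents
vae = AutoencoderKL.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="vae")
vae.eval()

def to_latents(image: Image.Image) -> torch.Tensor:
    """Encode a 512x512 RGB image into Stable Diffusion's 4x64x64 latent space."""
    x = torch.from_numpy(np.array(image)).float() / 127.5 - 1.0  # scale pixels to [-1, 1]
    x = x.permute(2, 0, 1).unsqueeze(0)                          # HWC -> 1xCxHxW
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample() * 0.18215   # SD v1 scaling factor
    return latents                                               # shape: (1, 4, 64, 64)
```

A 512x512 image becomes a 4x64x64 tensor - a much smaller representation, which is part of why the diffusion process can run efficiently.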
The Challenge¶
Imagine trying to recreate Van Gogh's vision of Saint-Rémy-de-Provence, but starting with a modern photograph. The challenge isn't just about generating an image that looks like a Van Gogh painting - it's about maintaining the essential features of the location while adopting his distinctive style. This requires working at multiple levels of representation, from raw latent space to high-level visual features.
An example with inputs of a photograph of Saint-Rémy-de-Provence (noisy latents), a style reference of The Starry Night, and a prompt describing the painting.
The Technical Approach¶
The solution combines four key elements:
- Stable Diffusion's text-to-image capabilities
- VGG network's ability to extract meaningful features from images
- Style transfer techniques using gram matrices
- Direct latent space comparison between generated and reference images
What makes this approach interesting is how it steers the diffusion process at multiple levels. The VGG network provides a hierarchy of features - from basic textures to complex patterns. Meanwhile, the latent space comparison ensures coherence at a fundamental level. By combining these different forms of guidance, we can achieve more nuanced control over the generation process.
How It Works¶
At each step of the denoising process, we compute four types of loss:
- The standard diffusion loss guiding the image towards the prompt
- A content loss comparing VGG features with our reference photograph
- A style loss based on gram matrices from Van Gogh's painting
- A latent space MAE loss measuring the direct similarity between the current image and style reference in Stable Diffusion's latent space
Each of these losses provides different guidance. The content loss helps maintain the structural integrity of the scene, while the style loss captures Van Gogh's characteristic brush strokes and color relationships. The latent space loss adds an additional constraint, encouraging the overall structure to align with the style reference at a fundamental level.
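To make that combination concrete, here is a minimal sketch of how the three extra guidance terms might be applied to the current latents at a given step. The function and parameter names are illustrative rather than the exact code behind the Space, and the standard prompt guidance is assumed to happen in the usual UNet denoising step (see The Core Process below):

```python
import torch
import torch.nn.functional as F

def apply_guidance(latents, decode_fn, vgg_feats_fn,
                   content_feats_ref, style_feats_ref, style_latents,
                   content_loss_fn, style_loss_fn,
                   w_content=1.0, w_style=1.0, w_latent=1.0, step_size=0.1):
    """Nudge the current latents using content, style and latent-MAE guidance.

    decode_fn (VAE decode), vgg_feats_fn (VGG feature extractor) and the two
    loss functions are illustrative stand-ins; sketches of the latter appear
    under "Feature-Level Control" below.
    """
    latents = latents.detach().requires_grad_(True)

    image = decode_fn(latents)          # back to pixel space so VGG can see it
    feats = vgg_feats_fn(image)         # activations from the chosen VGG layers

    loss = (w_content * content_loss_fn(feats, content_feats_ref)  # keep the scene's structure
            + w_style * style_loss_fn(feats, style_feats_ref)      # adopt the painting's style
            + w_latent * F.l1_loss(latents, style_latents))        # latent-space MAE

    grad = torch.autograd.grad(loss, latents)[0]
    return (latents - step_size * grad).detach()                   # move the latents downhill
```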
The Core Process¶
Stable Diffusion starts with a noised version of the input image and gradually refines it through multiple steps. At each step, the text prompt helps guide the denoising process toward the desired output.
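A minimal sketch of that setup using the Hugging Face diffusers library is below. The `unet`, `text_embeddings` and `init_latents` objects are assumed to come from a standard Stable Diffusion pipeline, the strength value is just an example, and classifier-free guidance is omitted for brevity:

```python
import torch
from diffusers import DDIMScheduler

scheduler = DDIMScheduler.from_pretrained("CompVis/stable-diffusion-v1-4", subfolder="scheduler")
scheduler.set_timesteps(50)

strength = 0.6                                               # how far to move away from the photo
start_step = int(len(scheduler.timesteps) * (1 - strength))
timesteps = scheduler.timesteps[start_step:]

# Start from a noised version of the photograph's latents rather than pure noise
noise = torch.randn_like(init_latents)
latents = scheduler.add_noise(init_latents, noise, timesteps[:1])

for t in timesteps:
    with torch.no_grad():
        # The text prompt guides the denoising through the UNet's prediction
        noise_pred = unet(latents, t, encoder_hidden_states=text_embeddings).sample
    latents = scheduler.step(noise_pred, t, latents).prev_sample
    # ...the content, style and latent-MAE guidance from above would be applied here...
```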
Feature-Level Control¶
The VGG network extracts meaningful features from both images. Gram matrices capture artistic style, while direct feature comparison maintains content structure. These two types of comparison provide different forms of guidance for the diffusion process.
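A small sketch of the two comparisons, assuming the feature maps have already been extracted from VGG (the function names here are illustrative):

```python
import torch
import torch.nn.functional as F

def gram_matrix(features: torch.Tensor) -> torch.Tensor:
    """Channel-by-channel correlations of a feature map: these capture texture
    and colour relationships (style) while discarding spatial layout."""
    b, c, h, w = features.shape
    f = features.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_loss(gen_feats, style_feats):
    """Compare Gram matrices layer by layer (artistic style guidance)."""
    return sum(F.mse_loss(gram_matrix(g), gram_matrix(s))
               for g, s in zip(gen_feats, style_feats))

def content_loss(gen_feats, content_feats):
    """Compare raw activations layer by layer (content/structure guidance)."""
    return sum(F.mse_loss(g, c) for g, c in zip(gen_feats, content_feats))
```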
Direct Comparison¶
In Stable Diffusion's native latent space, we directly compare the current image with the style reference. This helps maintain overall structural coherence at a fundamental level, complementing the feature-based guidance.
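In code this is just the mean absolute error between two latent tensors - a minimal sketch:

```python
import torch

def latent_mae(current_latents: torch.Tensor, style_latents: torch.Tensor) -> torch.Tensor:
    """Mean absolute error between two sets of latents (e.g. shape 1x4x64x64)."""
    return (current_latents - style_latents).abs().mean()
```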
The Results¶
The progression from left to right shows:
- Standard stable diffusion output
- Van Gogh's painting (style reference)
- Output with style guidance
- Original photograph (content reference)
- Output with content guidance
- Final output combining all guidance
The final image maintains the essential structure of Saint-Rémy while adopting Van Gogh's distinctive style. The combination of feature-based guidance and latent space comparison helps ensure the result feels coherent at both micro and macro scales.
Try It Yourself¶
I've made the implementation available as a Hugging Face Space. You can experiment with:
- Different style images
- Various content references
- Fine-tuning the balance between style and content preservation
- Adjusting the influence of different VGG layers
- Controlling the strength of latent space guidance
Technical Details¶
For those interested in the implementation details, the key innovation is in how we integrate multiple forms of guidance into the diffusion process. Each component contributes differently:
The content and style losses use activations from the VGG-16 model. VGG is a CNN architecture devised in 2014 and pretrained on ImageNet. Normally VGG is used as a classifier to tell you what an image contains - for example, whether it shows a person, a dog or a cat. Here the head of the VGG model is ignored and the loss functions use the intermediate activations in the backbone of the network, which represent feature detections. Those activations can be found by looking through the VGG model for the max pooling layers - the points where the grid size changes and features have just been detected. These losses are used to evaluate the image and steer the gradients at specific steps as the latents are denoised.
The important activations before the Max Pooling layers in the VGG neural network.
Source
This well-known image from the Zeiler and Fergus paper "Visualizing and Understanding Convolutional Networks" shows a visualization of features in a trained model:
Content loss typically uses higher-level feature maps, whereas style loss usually uses multiple layers throughout the network, computing Gram matrices of the feature maps. The VGG layers provide feature-level control (a sketch of extracting them follows the list below):
- Early layers (Texture Fundamentals and Pattern Assembly) primarily influence style transfer
- Middle layers (Style Motifs and Compositional Grammar) help maintain structural integrity
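As a sketch of how those activations can be collected from a pretrained VGG-16 (using torchvision; the exact layers and per-layer weights used in the Space may differ), one option is to capture the feature map just before each max pooling layer:

```python
import torch
import torchvision

# Pretrained VGG-16 backbone, frozen - we only read its activations
vgg = torchvision.models.vgg16(
    weights=torchvision.models.VGG16_Weights.IMAGENET1K_V1
).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)

def vgg_activations(image: torch.Tensor) -> list[torch.Tensor]:
    """Return the feature maps captured right before every MaxPool2d layer.

    `image` is expected to be a 1x3xHxW tensor normalised with ImageNet statistics.
    """
    feats, x = [], image
    for layer in vgg:
        if isinstance(layer, torch.nn.MaxPool2d):
            feats.append(x)   # grid size is about to halve: grab the detected features
        x = layer(x)
    return feats              # five feature maps for VGG-16
```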
Meanwhile, the latent space MAE loss works in Stable Diffusion's native representation space, providing a more direct form of guidance. This helps ensure that the overall composition maintains coherence with the style reference, while still allowing the VGG-based losses to fine-tune the details.
The interface allows you to adjust all these influences individually, letting you find the sweet spot between faithful reproduction and artistic interpretation. You can control:
- The strength of each VGG layer's contribution to style and content
- The weight of the latent space MAE loss
- The overall balance between different types of guidance
Looking Forward¶
This implementation opens up interesting possibilities for controlled image generation. While style transfer isn't new, integrating it directly into the diffusion process at multiple levels - from latent space to high-level features - gives fine-grained control over the output.
I'm particularly interested in seeing how this multi-level guidance technique could be applied to other use cases, where maintaining structural accuracy while adopting specific artistic styles could be valuable.
More examples¶
- Using noisy latents from an initial image of myself and a self-portrait by Rembrandt as the style reference.
- The first example with a higher weighting on latent-space closeness to the style image and a higher weighting on the "Style Signature" taken from the layer activations prior to the last max pooling layer of the VGG network.
- An example with input (noisy latents) of a photograph of a person looking out of Van Gogh's window at Saint-Rémy-de-Provence, a style reference of The Starry Night, and a prompt describing the painting.
Acknowledgments¶
This work builds on several key innovations in the field, particularly the original stable diffusion implementation and the insights from classical style transfer papers. The VGG feature analysis draws inspiration from super-resolution and style transfer work in the fast.ai community, while the latent space guidance approach emerged from experiments with Stable Diffusion's internal representations.
Chris Thomas is an AI consultant helping organizations validate and implement practical AI solutions.