Updated on August 27, 2024 • Written By Sherlock Xu
In my previous article, I talked about the world of Large Language Models (LLMs), introducing some of the most advanced open-source text generation models over the past year. However, LLMs are only one of the important players in today’s rapidly evolving AI world. Equally transformative and innovative are the models designed for visual creation, like text-to-image, image-to-image, and image-to-video models. They have opened up new opportunities for creative expression and visual communication, enabling us to generate beautiful visuals, change backgrounds, inpaint missing parts, replicate compositions, and even turn simple scribbles into professional images.
One of the most mentioned names in this field is Stable Diffusion, which refers to a family of open-source visual generation models, such as Stable Diffusion 1.4, 2.0, and XL, developed largely by Stability AI. However, in the expansive universe of AI-driven image generation, these models are only part of the picture, and choosing the right one for serving and deployment can get complicated quickly. A quick search on Hugging Face returns over 30,000 text-to-image models alone.
In this blog post, we will provide a curated list of open-source models that stand out for their ability to generate creative visuals. Just like the previous blog post, we will also answer frequently asked questions to help you navigate this exciting yet complex domain, providing insights into using these models in production.
Stable Diffusion (SD) has quickly become a household name in generative AI since its launch in 2022. It is capable of generating photorealistic images from both text and image prompts. You might often hear the term “diffusion models” mentioned alongside Stable Diffusion; diffusion is the underlying technique that powers it. Simply put, diffusion models generate images by starting from a pattern of random noise and gradually shaping it into a coherent image, learning to reverse a process that progressively adds noise. This process is computationally intensive, but Stable Diffusion makes it much more efficient by performing the diffusion in latent space.
Latent space is like a compact, simplified map of all the possible images that the model can create. Instead of dealing with every tiny detail of an image (which takes a lot of computing power), the model uses this map to find and create new images more efficiently. It's a bit like sketching out the main ideas of a picture before filling in all the details.
In addition to static images, Stable Diffusion can also produce videos and animations, making it a comprehensive tool for a variety of creative tasks.
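To get a feel for how this works in practice, here is a minimal sketch of generating an image with a Stable Diffusion checkpoint via the Hugging Face diffusers library. The model ID, prompt, and parameter values are only illustrative; any compatible SD checkpoint works the same way.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load an example Stable Diffusion checkpoint (any compatible SD 1.x model works similarly).
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

# Denoising happens in latent space; the VAE then decodes the result into pixels.
image = pipe(
    "a photorealistic photo of an astronaut riding a horse on the beach",
    num_inference_steps=30,  # more steps: slower, usually cleaner
    guidance_scale=7.5,      # how strongly the output follows the prompt
).images[0]
image.save("astronaut.png")
```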
Why should you use Stable Diffusion:
Points to be cautious about:
Note: Stable Diffusion 3 Medium was released in June 2024. See our blog post Stable Diffusion 3: Text Master, Prone Problems? to learn how it performs compared with SD 2 and SDXL and how you can improve its generated images.
In August 2024, Black Forest Labs introduced the FLUX.1 model family, a new state-of-the-art suite of models that sets a new benchmark in image detail, prompt adherence, style diversity, and scene complexity for text-to-image synthesis.
The suite includes three variants: [pro], [dev], and [schnell]. Each is designed for a specific use case, from high-performance professional use, to efficient non-commercial applications, to rapid local development. In less than a month, the fastest model, FLUX.1 [schnell], achieved over 1.5 million downloads on Hugging Face, positioning it fourth among all text-to-image models (the first three are all SD models).
An interesting fact here is that the creators behind FLUX.1 are the original developers of SD. After leaving Stability AI, they founded Black Forest Labs with a vision to innovate beyond their previous work.
Why should you use FLUX.1:
Points to be cautious about:
Commercial licensing options: The choice of FLUX.1 variant impacts how you use the model in commercial contexts.
At this stage, there is still much to explore with FLUX.1. I recommend starting with FLUX.1 [schnell], as it offers easy access with minimal restrictions, enabling you to thoroughly explore its capabilities without significant initial investment or complex setup requirements.
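Here is a minimal sketch of running FLUX.1 [schnell] with the FluxPipeline in a recent diffusers release that includes Flux support. The prompt and parameter values are illustrative; [schnell] is a distilled model, so it is designed to run in very few steps and without classifier-free guidance.

```python
import torch
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell", torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # helps if the full model does not fit in VRAM

image = pipe(
    "a tiny astronaut hatching from an egg on the moon",
    guidance_scale=0.0,      # schnell is distilled and does not use guidance
    num_inference_steps=4,   # schnell is built for very few steps
    max_sequence_length=256,
    generator=torch.Generator("cpu").manual_seed(0),  # fixed seed for reproducibility
).images[0]
image.save("flux_schnell.png")
```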
DeepFloyd IF is a text-to-image generation model developed by Stability AI and the DeepFloyd research lab. It stands out for its ability to produce images with remarkable photorealism and nuanced language understanding.
DeepFloyd IF's architecture is particularly noteworthy for performing diffusion directly in pixel space. Specifically, it consists of a text encoder and three cascaded pixel diffusion modules, each with a distinct role: Stage 1 creates a base 64x64 px image, which Stages 2 and 3 progressively upscale to 1024x1024 px. This distinguishes it from latent diffusion models like Stable Diffusion: pixel-level processing lets DeepFloyd IF generate and enhance visuals directly, without translating into and out of a compressed latent representation.
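A rough sketch of the first two stages of this cascade with diffusers is shown below. You need to accept the DeepFloyd license on Hugging Face and be logged in before the weights can be downloaded; the model IDs and prompt here are examples, and Stage 3 would apply a further 4x upscaler on top.

```python
import torch
from diffusers import DiffusionPipeline

# Stage 1: text-conditioned generation of a base 64x64 image in pixel space.
stage_1 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
stage_1.enable_model_cpu_offload()

# Stage 2: upscale the 64x64 output to 256x256, reusing the prompt embeddings.
stage_2 = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-II-L-v1.0", text_encoder=None, variant="fp16", torch_dtype=torch.float16
)
stage_2.enable_model_cpu_offload()

prompt = "a teddy bear reading a book under a tree, ultra detailed"
prompt_embeds, negative_embeds = stage_1.encode_prompt(prompt)

image = stage_1(
    prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds, output_type="pt"
).images
image = stage_2(
    image=image, prompt_embeds=prompt_embeds, negative_prompt_embeds=negative_embeds
).images[0]
image.save("deepfloyd_stage2.png")
# Stage 3 (a 4x upscaler) would take this result up to 1024x1024 px.
```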
Why should you use DeepFloyd IF:
Points to be cautious about:
ControlNet can be used to enhance the capabilities of diffusion models like Stable Diffusion, allowing for more precise control over image generation. It works by duplicating neural network blocks into a "locked" copy and a "trainable" copy: the trainable copy learns the specific condition you provide (such as edge maps, depth maps, or human poses), while the locked copy preserves the integrity of the original model. This structure allows you to train with small datasets without compromising the base model's performance, making it practical even on personal devices.
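As an example of what this looks like at inference time, here is a minimal sketch pairing a Canny-edge ControlNet with a Stable Diffusion checkpoint via diffusers. The file names, model IDs, and prompt are illustrative.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

# Turn a reference image into a Canny edge map; the edges constrain the composition.
reference = Image.open("reference.png").convert("RGB")  # replace with your own image
gray = cv2.cvtColor(np.array(reference), cv2.COLOR_RGB2GRAY)
edges = cv2.Canny(gray, 100, 200)
edge_image = Image.fromarray(np.stack([edges] * 3, axis=-1))

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

# The prompt decides style and content; the edge map decides layout.
result = pipe("a futuristic city at dusk, highly detailed", image=edge_image).images[0]
result.save("controlled.png")
```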
Why should you use ControlNet:
Points to be cautious about:
Text-to-image AI models hold significant potential for the animation industry. Artists can quickly generate concept art from simple descriptions, allowing for rapid exploration of visual styles and themes. In this area, Animagine XL is one of the important players leading the innovation. It is a series of open-source anime text-to-image generation models built upon Stable Diffusion XL. Its latest release, Animagine XL 3.1, uses tag ordering for prompts, which means the order of tags in your prompt significantly impacts the output. To keep the generated results aligned with your intention, you may need to follow a certain template, since the model was trained this way.
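Below is a minimal sketch of what such a structured prompt might look like with diffusers. Check the model card for the exact repository ID, the recommended tag template, and the suggested quality and negative tags; the values here are only illustrative.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "cagliostrolab/animagine-xl-3.1", torch_dtype=torch.float16
).to("cuda")

# Tag order matters: subject first, then character and series, then details and quality tags.
prompt = (
    "1girl, souryuu asuka langley, neon genesis evangelion, "
    "solo, upper body, looking at viewer, masterpiece, best quality"
)
negative_prompt = "lowres, bad anatomy, bad hands, worst quality, low quality"

image = pipe(
    prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image.save("anime.png")
```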
Why should you use Animagine XL:
Points to be cautious about:
Stable Video Diffusion (SVD) is a video generation model from Stability AI designed to produce high-quality videos from still images. As mentioned above, this model is part of Stability AI's suite of AI tools and represents its first foray into open video model development.
Stable Video Diffusion is capable of generating 14 or 25 frames at customizable frame rates between 3 and 30 frames per second. According to this evaluation graph, SVD won more human preference votes for video quality than GEN-2 and PikaLabs.
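If you want to try it, here is a minimal sketch using the StableVideoDiffusionPipeline in diffusers to animate a still image with the 25-frame (XT) checkpoint; the input image path and parameter values are illustrative.

```python
import torch
from diffusers import StableVideoDiffusionPipeline
from diffusers.utils import load_image, export_to_video

pipe = StableVideoDiffusionPipeline.from_pretrained(
    "stabilityai/stable-video-diffusion-img2vid-xt",  # 25-frame variant
    torch_dtype=torch.float16,
    variant="fp16",
)
pipe.enable_model_cpu_offload()

# Condition on a still image, resized to the resolution the model expects.
image = load_image("input.png").resize((1024, 576))

frames = pipe(
    image,
    decode_chunk_size=8,  # trade VRAM for speed when decoding frames
    generator=torch.Generator().manual_seed(42),
).frames[0]
export_to_video(frames, "generated.mp4", fps=7)
```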
In fact, Stability AI is still working to improve both its safety and quality. The company emphasized that “this model is not intended for real-world or commercial applications at this stage and it is exclusively for research”. That said, it is one of the few open-source video generation models available in this industry. If you just want to play around with it, pay attention to the following:
Now let’s answer some of the frequently asked questions for open-source image generation models. Questions like “Why should I choose open-source models over commercial ones?” and “What should I consider when deploying models in production?” are already covered in my previous blog post, so I do not list them here.
LoRA, or Low-Rank Adaptation, is an advanced technique designed for fine-tuning machine learning models, including generative models like Stable Diffusion. It works by using a small number of trainable parameters to fine-tune these models on specific tasks or to adapt them to new data. As it significantly reduces the number of parameters that need to be trained, it does not require extensive computational resources.
With LoRA, you can enhance Stable Diffusion models by customizing generated content with specific themes and styles. If you don’t want to create LoRA weights yourself, check out the LoRA resources on Civitai.
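As a sketch of how this works in practice, diffusers lets you load LoRA weights on top of a base pipeline with load_lora_weights. The repository and file names below are placeholders for whichever LoRA you pick.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Apply LoRA weights on top of the base model; substitute any SDXL-compatible
# LoRA from Hugging Face or Civitai (the names below are placeholders).
pipe.load_lora_weights("your-username/your-style-lora", weight_name="style.safetensors")

image = pipe("a watercolor painting of a lighthouse at dawn").images[0]
image.save("lora_style.png")
```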
Creating high-quality images with image generation models involves a blend of creativity, precision, and technical understanding. Some key strategies to improve your outcomes:
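One practical way to experiment is to control the generation parameters explicitly, for example fixing a seed so results are reproducible, adding a negative prompt to suppress common artifacts, and tuning the guidance scale and step count. A minimal sketch follows; the checkpoint and values are only examples.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

generator = torch.Generator("cuda").manual_seed(1234)  # fixed seed: reproducible output

image = pipe(
    prompt="a cozy cabin in a snowy forest at night, warm light in the windows, detailed",
    negative_prompt="blurry, low quality, distorted, extra limbs",  # steer away from artifacts
    guidance_scale=6.5,       # higher follows the prompt more strictly, lower is more creative
    num_inference_steps=40,   # more steps usually means finer detail, at the cost of speed
    generator=generator,
).images[0]
image.save("cabin.png")
```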
The short answer is YES.
Copyright concerns are a significant aspect to consider when using image generation models, whether open-source or commercial. There have been lawsuits against companies behind popular image generation models like this one.
Many models are trained on vast datasets that include copyrighted images. This raises questions about the legality of using these images as part of the training process.
Another thing is that determining the copyright ownership of AI-generated images can be complex. If you're planning to use these images commercially, it's important to consider who holds the copyright — the user who inputs the prompt, the creators of the AI model, or neither.
So, what can you do?
At this stage, the best suggestion I can give to someone using these models and the images they create is to stay informed. The legal landscape around AI-generated images is still evolving. Keep abreast of ongoing legal discussions and rulings related to AI and copyright law. Understanding your rights and the legal status of AI-generated images is crucial for using these tools ethically and legally.
Deploying LLMs and image generation models in production requires similar considerations around factors like scalability and observability, but each also has its own unique challenges and requirements.
Just like with LLMs, choosing the right image generation model requires us to understand each model's strengths and limitations. Each brings unique capabilities to the table, supporting different real-world use cases. Currently, I believe the biggest challenges for image generation models are ethical and copyright concerns. As we embrace their potential to augment our creative process, it's equally important to use these tools responsibly and respect copyright laws, privacy rights, and ethical guidelines.