
LLMs are only one of the important players in today’s rapidly evolving AI world. Equally transformative and innovative are the models designed for visual creation, like text-to-image, image-to-image, and image-to-video models. They have opened up new opportunities for creative expression and visual communication, enabling us to generate beautiful visuals, change backgrounds, inpaint missing parts, replicate compositions, and even turn simple scribbles into professional images.
One of the most frequently mentioned names in this field is Stable Diffusion, a series of open-source visual generation models (such as Stable Diffusion 1.4, XL, and 3.5 Large) developed mostly by Stability AI. However, in the expansive universe of AI-driven image generation, these models are only part of the picture, and choosing the right one for serving and deployment can get complicated quickly. A quick search on Hugging Face returns over 87,000 text-to-image models alone.
In this blog post, we will provide a featured list of open-source models that stand out for their ability to generate creative visuals. We will then answer frequently asked questions to help you navigate this exciting yet complex domain and offer insights into using these models in production.
Stable Diffusion (SD) has quickly become a household name in generative AI since its launch in 2022. It is capable of generating photorealistic images from both text and image prompts.
You might often hear the term “diffusion models” mentioned alongside Stable Diffusion; diffusion is the base AI technology that powers it. Simply put, diffusion models are trained by gradually adding noise to images and learning to reverse that process; to generate an image, they start from random noise and progressively denoise it into a coherent picture. This process is computationally intensive, but Stable Diffusion optimizes it by working in latent space.
Latent space is like a compact, simplified map of all the possible images that the model can create. Instead of dealing with every tiny detail of an image (which takes a lot of computing power), the model uses this map to find and create new images more efficiently. It's a bit like sketching out the main ideas of a picture before filling in all the details.
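To make “latent space” a bit more concrete, here is a minimal sketch using the diffusers library and the VAE that Stable Diffusion 1.x pipelines ship with (the checkpoint ID and image path are illustrative assumptions). A 512x512 photo is compressed to a 4x64x64 latent, roughly 48 times smaller, and the diffusion process operates on that compact representation before it is decoded back to pixels.

```python
import torch
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from diffusers.utils import load_image

# The VAE used by Stable Diffusion 1.x pipelines (checkpoint ID is illustrative).
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse")
processor = VaeImageProcessor()

# Preprocess a local 512x512 photo into a normalized tensor of shape (1, 3, 512, 512).
pixels = processor.preprocess(load_image("photo.png"))

with torch.no_grad():
    latents = vae.encode(pixels).latent_dist.sample()  # (1, 4, 64, 64): the compact "map"
    decoded = vae.decode(latents).sample               # back to (1, 3, 512, 512)

processor.postprocess(decoded)[0].save("roundtrip.png")  # near-identical to the input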
In addition to static images, Stable Diffusion can also produce videos and 3D objects, making it a comprehensive tool for a variety of creative tasks.
Why should you use Stable Diffusion:
Multiple variants: Stable Diffusion comes with a variety of popular base models, such as Stable Diffusion 1.4, 1.5, 2.0, and 3.5 (Medium, Large, and Turbo), Stable Diffusion XL, Stable Diffusion XL Turbo, and Stable Video Diffusion. There are also versions optimized specifically for NVIDIA and AMD GPUs.
According to this evaluation graph, the SDXL base model performs significantly better than previous variants. Nevertheless, it is hard to say definitively which model generates better images, because results can be impacted by various factors, like the prompt, the number of inference steps, and LoRA weights. Some models also have more LoRAs available, which is an important factor when choosing the right model. For beginners, I recommend starting with SD 1.5 or SDXL 1.0. They're user-friendly, rich in features, and perfect for exploring without getting into the technical details.
Customization and fine-tuning: Stable Diffusion base models can be fine-tuned with as few as five images to generate visuals in specific styles or of particular subjects, enhancing the relevance and uniqueness of generated images. One of my favorites is SDXL-Lightning, built upon Stable Diffusion XL; it is known for its ability to generate high-quality images in just a few steps (1, 2, 4, or 8).
Controllable: Stable Diffusion gives you extensive control over the image generation process. For example, you can adjust the number of steps the model takes during the diffusion process, set the image size, specify the seed for reproducibility, and tweak the guidance scale to influence how closely the output adheres to the input prompt (see the sketch after this list).
Future potential: There's vast potential for integration with animation and video AI systems, promising even more expansive creative possibilities.
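As a quick illustration of the controls mentioned above, here is a minimal text-to-image sketch with the diffusers library; the SDXL checkpoint and the parameter values are illustrative, not recommendations.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the SDXL base model (checkpoint ID is illustrative).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=30,                             # diffusion steps: more is slower, often cleaner
    height=1024, width=768,                             # output image size
    guidance_scale=7.5,                                 # how strictly to follow the prompt
    generator=torch.Generator("cuda").manual_seed(42),  # fixed seed for reproducibility
).images[0]
image.save("lighthouse.png")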
Points to be cautious about:
Note: See our blog post Stable Diffusion 3: Text Master, Prone Problems? to learn how it performs compared with SD 2 and SDXL and how you can improve its generated images.
Here is a code example of serving Stable Diffusion models with BentoML:
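The snippet below is a simplified sketch, assuming BentoML 1.2+ and the diffusers library; the checkpoint ID, resource settings, and default parameters are illustrative rather than recommendations, and our example projects contain the complete, production-ready version.

```python
import bentoml
from PIL.Image import Image

MODEL_ID = "stabilityai/stable-diffusion-xl-base-1.0"  # illustrative checkpoint

@bentoml.service(
    resources={"gpu": 1, "memory": "16Gi"},
    traffic={"timeout": 300},
)
class StableDiffusionXL:
    def __init__(self) -> None:
        import torch
        from diffusers import StableDiffusionXLPipeline

        # Load the pipeline once when the service starts.
        self.pipe = StableDiffusionXLPipeline.from_pretrained(
            MODEL_ID, torch_dtype=torch.float16
        ).to("cuda")

    @bentoml.api
    def txt2img(
        self,
        prompt: str,
        num_inference_steps: int = 30,
        guidance_scale: float = 7.5,
    ) -> Image:
        # Each API call runs one round of diffusion and returns a PIL image.
        return self.pipe(
            prompt,
            num_inference_steps=num_inference_steps,
            guidance_scale=guidance_scale,
        ).images[0]
```

You can serve this locally with the bentoml serve command, call the txt2img endpoint over HTTP, and then containerize it or deploy it to BentoCloud when you are ready.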

Deploy Stable Diffusion
In August 2024, Black Forest Labs introduced the FLUX.1 model family, a state-of-the-art suite of models that sets a new benchmark in image detail, prompt adherence, style diversity, and scene complexity for text-to-image synthesis.
The initial suite included three variants: [pro], [dev], and [schnell]. Each is designed for a specific use case, from high-performance professional use to efficient non-commercial applications and rapid local development. Within a month of its release, the fastest model, FLUX.1 [schnell], had surpassed 1.5 million downloads on Hugging Face.
An interesting fact: the creators behind FLUX.1 are the original developers of Stable Diffusion. After leaving Stability AI, they founded Black Forest Labs to innovate beyond their previous work.
Why should you use FLUX.1:
State-of-the-art performance: FLUX.1 models claim to surpass popular models like Midjourney v6.0 and DALL·E 3 in visual quality, prompt adherence, and output diversity. In benchmark tests, the [pro] and [dev] variants outperformed competitors like SD3-Ultra and Ideogram. This makes them an attractive choice for creative workers seeking the highest standards in image generation.
Long text rendering: Generating text remains a significant challenge for image generation models. While other models struggle with this, FLUX.1 has demonstrated exceptional text rendering capability, especially with lengthy text.
Architecture: All public FLUX.1 models are based on a hybrid architecture of multimodal and parallel diffusion transformer blocks, scaled to 12B parameters. This structure incorporates sophisticated techniques like flow matching and rotary positional embeddings, which not only boost image fidelity but also enhance hardware efficiency.
Growing ecosystem: Black Forest Labs is actively working to expand the FLUX.1 ecosystem:
FLUX.1 tools: A suite of open-access models to modify and recreate both real and generated images. Integrated into the FLUX.1 [dev] model series, the suite includes:
FLUX.1 Kontext [dev]: An image editing model with 12B parameters, designed for iterative and precise edits, including strong character preservation and local/global transformations. It offers near-proprietary editing quality and is available under the FLUX.1 Non-Commercial License for free research and non-commercial use.
Points to be cautious about:
Commercial licensing options: The choice of FLUX.1 variant impacts how you use the model in commercial contexts.
At this stage, there is still much to explore with FLUX.1. I recommend starting with FLUX.1 [schnell], as it offers easy access with minimal restrictions, enabling you to thoroughly explore its capabilities without significant initial investment or complex setup requirements.
HiDream-I1 is a powerful open-source image generation foundation model developed by HiDream.ai. Featuring 17 billion parameters, it delivers state-of-the-art visual quality across a wide range of styles, from photorealistic to artistic images. Since its release in April 2025, it has quickly become a strong player in the AI art ecosystem, consistently outperforming SDXL, DALL·E 3, and FLUX.1 on key benchmarks.
HiDream-I1 is built on a Sparse Diffusion Transformer (Sparse DiT) architecture combined with Sparse Mixture-of-Experts (MoE). In simple terms, this allows the model to dynamically route input through specialized expert blocks. This results in better performance with lower compute costs, especially during inference.
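To give a feel for what sparse expert routing means, here is a generic, heavily simplified sparse-MoE layer in PyTorch. It is not HiDream-I1's actual implementation; the dimensions, expert count, and routing details are illustrative only.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """A toy sparse Mixture-of-Experts layer: each token is routed to its top-k experts."""

    def __init__(self, dim: int = 512, num_experts: int = 8, top_k: int = 2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)  # scores each token against each expert
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)]
        )
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, dim)
        scores = self.router(x).softmax(dim=-1)
        weights, indices = scores.topk(self.top_k, dim=-1)     # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept weights
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for expert_id in indices[:, k].unique():
                mask = indices[:, k] == expert_id
                out[mask] += weights[mask, k, None] * self.experts[int(expert_id)](x[mask])
        return out

# Only 2 of the 8 expert MLPs run for each token, so compute per token stays low
# even though the total parameter count is large.
layer = SparseMoE()
tokens = torch.randn(16, 512)
print(layer(tokens).shape)  # torch.Size([16, 512])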
Why should you use HiDream-I1:
Exceptional prompt adherence: HiDream-I1 excels at understanding complex descriptions and nuanced prompts, thanks to its use of Llama-3.1-8B-Instruct as the text encoder.
Flexible variants: The HiDream-I1 family includes three versions to support different needs:
Natural-language image editing: The HiDream ecosystem includes HiDream-E1, an advanced editing model built on top of HiDream-I1. It lets you modify images using plain-language instructions, with no need for masks or manual tuning. For example, you can simply say "Convert the image into Ghibli style" and the model handles the rest, similar to prompting ChatGPT.
Caution: When using HiDream-E1, the input image must be resized to 768×768. Sticking to this resolution tends to give better results, though further experimentation is needed.
Open source: Both HiDream-I1 and HiDream-E1 are released under the MIT License, allowing free use in personal, academic, and commercial applications. This makes them an excellent choice for building private, fully customized image generation services.
ControlNet can be used to enhance the capabilities of diffusion models like Stable Diffusion, allowing for more precise control over image generation. It operates by dividing neural network blocks into "locked" and "trainable" copies, where the trainable copy learns specific conditions you set, and the locked one preserves the integrity of the original model. This structure allows you to train the model with small datasets without compromising its performance, making it ideal for personal or small-scale device use.
Why should you use ControlNet:
Points to be cautious about:
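To see what this control looks like in practice, here is a minimal sketch of Canny-edge conditioning with the diffusers library; the checkpoint IDs and the reference image path are illustrative assumptions.

```python
import cv2
import numpy as np
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from PIL import Image

# 1. Turn a reference photo into an edge map: this is the "condition".
source = load_image("reference.png")  # your own reference image
edges = cv2.Canny(np.array(source), 100, 200)
edges = Image.fromarray(np.stack([edges] * 3, axis=-1))

# 2. Pair a Canny-trained ControlNet with a Stable Diffusion base model.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# 3. The prompt decides the style; the edge map constrains the composition.
image = pipe("a cyberpunk city street at night, neon lights", image=edges).images[0]
image.save("controlled.png")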
Deploy ControlNet
Developed by the Qwen team at Alibaba, Qwen-Image is the image generation foundation model in the Qwen series. It stands out as a next-generation diffusion model that brings together text-aware visual generation, intelligent editing, and vision understanding. It is released under the Apache 2.0 license, making it an excellent choice for commercial-ready image generation.
Why should you use Qwen-Image:
Note that the image editing version is Qwen-Image-Edit, which is built upon the 20B Qwen-Image model. The latest iteration, Qwen-Image-Edit-2509, further enhances editing consistency and introduces multi-image editing, supporting operations across one to three input images (e.g., “person + product” or “person + scene”). It also adds ControlNet-based conditioning (depth, edge, and keypoint maps) for more structured and controllable results.
Points to be cautious about:
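Before moving on, if you want to try Qwen-Image quickly, here is a minimal sketch. It assumes a recent diffusers release that can load the Qwen/Qwen-Image repository as a standard pipeline (check the model card for the exact requirements); the parameters are illustrative.

```python
import torch
from diffusers import DiffusionPipeline

# DiffusionPipeline resolves the correct pipeline class from the model repository.
pipe = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

image = pipe(
    'A storefront sign that reads "Fresh Coffee, Open Daily", photorealistic',
    num_inference_steps=50,
).images[0]
image.save("qwen_image.png")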
Developed by Tencent’s Hunyuan team, HunyuanImage-3.0 is a native multimodal autoregressive image generation model. Unlike traditional DiT-style pipelines, it models text and image tokens in a single framework, improving world-knowledge reasoning and prompt adherence. It’s also the largest open-source image-generation MoE model to date, with 80B total parameters and 64 experts (~13B active per token).
Why should you use HunyuanImage-3.0:
Points to be cautious about:
Now let’s answer some of the FAQs for open-source image generation models. Questions like “Why should I choose open-source models over proprietary ones” are already covered in my previous blog post, so they are not listed here.
LoRA, or Low-Rank Adaptation, is an advanced technique designed for fine-tuning machine learning models, including generative models like Stable Diffusion. It works by using a small number of trainable parameters to fine-tune these models on specific tasks or to adapt them to new data. As it significantly reduces the number of parameters that need to be trained, it does not require extensive computational resources.
With LoRA, you can enhance Stable Diffusion models by customizing generated content with specific themes and styles. If you don’t want to create LoRA weights yourself, check out the LoRA resources on Civitai.
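As an example of how little code this takes, here is a minimal sketch of loading LoRA weights into an SDXL pipeline with the diffusers library; the base checkpoint is illustrative and the LoRA path is a placeholder for weights you download (for example from Civitai) or train yourself.

```python
import torch
from diffusers import StableDiffusionXLPipeline

# Load the base model (checkpoint ID is illustrative).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Apply LoRA weights on top of the base model (placeholder directory and file name).
pipe.load_lora_weights("path/to/lora_dir", weight_name="your_lora.safetensors")

image = pipe("a portrait of an astronaut, in the style the LoRA was trained on").images[0]
image.save("lora_styled.png")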
ComfyUI is a powerful, node-based interface for creating images with diffusion models. Unlike traditional interfaces, ComfyUI gives users advanced control over the image generation process by allowing them to customize workflows visually, using "nodes" to link different parts of the pipeline. I highly recommend it for anyone who wants more control and precision in their AI artwork. Read this blog post about ComfyUI custom nodes.
However, sharing ComfyUI workflows with others and deploying them as scalable APIs can be challenging due to missing custom nodes, incorrect model files, or Python dependencies. A simple solution is comfy-pack. It packages everything you need into a .cpack.zip file for easy sharing. It also allows you to serve and deploy ComfyUI workflows as scalable and secure APIs with just one click.

Serve ComfyUI workflows as APIs
A1111, short for AUTOMATIC1111’s Stable Diffusion Web UI, is one of the most popular open-source interfaces for running Stable Diffusion locally.
It uses a Gradio-based interface, which makes it very beginner-friendly. You can easily switch between common workflows such as text-to-image, image-to-image, and outpainting/inpainting directly from the UI without coding.
Compared to ComfyUI, the biggest advantage of A1111 is simplicity. You can install it, load a model, and start generating AI art in minutes. However, it lacks features for users who need advanced customization.
My suggestion is:
Creating high-quality images with image generation models involves a blend of creativity, precision, and technical understanding. Here are some key strategies to improve your outcomes:
The short answer is YES.
Copyright concerns are a significant aspect to consider when using image generation models, not just open-source ones but commercial models as well. There have already been lawsuits, like this one, against the companies behind popular image generation models.
Many models are trained on vast datasets that include copyrighted images. This raises questions about the legality of using these images as part of the training process.
Another thing is that determining the copyright ownership of AI-generated images can be complex. If you're planning to use these images commercially, it's important to consider who holds the copyright — the user who inputs the prompt, the creators of the AI model, or neither.
So, what can you do?
At this stage, the best suggestion I can give to someone using these models and the images they create is to stay informed. The legal landscape around AI-generated images is still evolving. Keep abreast of ongoing legal discussions and rulings related to AI and copyright law. Understanding your rights and the legal status of AI-generated images is crucial for using these tools ethically and legally.
Deploying LLMs and image generation models in production involves similar considerations around factors like scalability and observability, but each also has its own unique challenges and requirements.
Choosing the right image generation model requires understanding each model's strengths and limitations. Each brings its unique capabilities to the table, supporting different real-world use cases. Currently, I believe the biggest challenges for image generation models are ethical and copyright concerns. As we embrace their potential to augment our creative process, it's equally important to use these tools responsibly and to respect copyright laws, privacy rights, and ethical guidelines.
At Bento, we work to help enterprises build scalable AI systems with production-grade reliability using any model (including the diffusion models mentioned above). Our unified inference platform lets developers bring their custom inference code and libraries to build AI systems 10x faster without the infrastructure complexity. You can scale your application efficiently in your cloud and maintain full control over security and compliance.
To deploy diffusion models, explore our examples and sign up for BentoCloud for enterprise-grade security, scalability and deployment management. Additionally, learn to choose the right NVIDIA or AMD GPUs and the right deployment patterns (e.g., BYOC, multi-cloud and cross-region, on-prem and hybrid) for your use case.
Still have questions? Schedule a call with our support engineers or join our Slack community to get expert guidance.