Dive into the transformative world of AI application development with us! From expert insights to innovative use cases, we bring you the latest in building AI systems at scale.
Explore the top open-source VLMs and find answers to some FAQs about them.
Deploying the Llama 3.2 Vision model step by step with OpenLLM and BentoCloud.
Explore the top open-source TTS models and find answers to some FAQs about them.
Explore function calling with open-source LLMs: benefits, use cases, challenges, and more.
Best practices for tuning TensorRT-LLM inference configurations to improve the serving performance of LLMs with BentoML.
Understand the differences between serverless and dedicated LLM deployments, focusing on cost analysis, and explore strategies for optimizing LLM cost and scaling.
Explore the trend towards compound AI and how BentoML can help you build and scale compound AI systems.
Build a production-ready AI coding assistant with features like auto code completion and explanations.
Learn about the key features and enhancements in BentoML 1.3.
Explore the performance of Ollama and OpenLLM in running LLMs in the cloud.
Discover how BentoCloud can help you differentiate your AI solutions through infrastructure excellence as well as enhanced control and customization.
Comparing Stable Diffusion 3 with Stable Diffusion 2 and Stable Diffusion XL, and improving the results.
Join our global community
Over 1 million new deployments a month · 5,000+ community members · 200+ open-source contributors
Start a free trial
Get in touch
Subscribe to our newsletter