Unified Inference Platform for any model, on any cloud

Build scalable AI systems with unparalleled speed and flexibility. Deploy in your cloud, iterate faster, and scale at a lower cost.

Trusted by visionary AI teams worldwide


Accelerate time to market for your business-critical AI applications


Open-Source Serving Engine

  • Build Inference APIs, Job Queues, and Compound AI Systems
  • Local development and debugging
  • Open ecosystem with hundreds of integrations
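As a rough illustration of the framework (the summarization model and service name here are hypothetical examples, not taken from BentoML's docs), a minimal inference API looks like this:

import bentoml


@bentoml.service(resources={"cpu": "2"})
class Summarizer:
    def __init__(self) -> None:
        # Any model works here: an open-source checkpoint, a fine-tune, or custom code.
        from transformers import pipeline

        self.pipeline = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

    @bentoml.api
    def summarize(self, text: str) -> str:
        return self.pipeline(text)[0]["summary_text"]

Running "bentoml serve" in the project directory starts this API locally for development and debugging before anything is deployed.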

Performance

  • High throughput and low latency LLM inference
  • Balance cost, speed, and throughput to your needs
  • Fully utilize your GPU resources
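One concrete way to tune that balance is adaptive batching. The sketch below assumes BentoML 1.2+, where the api decorator accepts batchable, max_batch_size, and max_latency_ms; the embedding model is only an illustrative choice:

import bentoml
import numpy as np


@bentoml.service(resources={"gpu": 1})
class Encoder:
    def __init__(self) -> None:
        from sentence_transformers import SentenceTransformer

        self.model = SentenceTransformer("all-MiniLM-L6-v2")

    # Adaptive batching groups concurrent requests into a single forward pass,
    # trading a little latency for much higher GPU utilization.
    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=50)
    def encode(self, texts: list[str]) -> np.ndarray:
        return self.model.encode(texts)

Lowering max_latency_ms favors response speed; raising max_batch_size favors throughput and cost efficiency, which is the trade-off described above.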

Auto-Scaling

  • Automatic horizontal scaling based on traffic
  • Blazing fast cold starts
  • Modular scaling for multi-model pipelines
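As a sketch of how these controls appear in code: the concurrency target below is what drives replica autoscaling on BentoCloud, and the scaling_min / scaling_max parameters follow recent BentoML deployment docs (treat the exact names as an assumption to verify against your version):

import bentoml


@bentoml.service(
    # Each replica handles up to 32 in-flight requests; when traffic pushes past
    # this concurrency target, additional replicas are started automatically.
    traffic={"concurrency": 32},
    resources={"gpu": 1},
)
class Transcriber:
    @bentoml.api
    def transcribe(self, audio_url: str) -> str:
        # A real implementation would run a speech-to-text model here.
        return ""


# Deploy programmatically with explicit replica bounds;
# scaling_min=0 allows scale-to-zero between traffic spikes.
if __name__ == "__main__":
    bentoml.deployment.create(bento=".", scaling_min=0, scaling_max=5)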

Rapid Iteration

  • Build and debug with Cloud GPUs
  • Sync and preview local changes instantly
  • Seamlessly promote changes to production

Your Cloud, Your Control
Built for Enterprise AI

Our BYOC (Bring Your Own Cloud) offering runs BentoML's inference infrastructure inside your own cloud account, giving you full control over your AI workloads.

Deploy on your own cloud: AWS, GCP, Azure, and more

Efficient provisioning across multiple clouds and regions

Leverage existing cloud commitment and credits

SOC 2 certified, ensuring your models and data remain secure

From Models to AI Systems

BentoML is the most flexible way to build production-grade AI systems with any open-source or custom fine-tuned models. We run the infrastructure, so you can focus on innovating.

01. Build 10x Faster

Bring your models and code to create inference APIs, job queues, and multi-model pipelines. BentoML's Open-Source framework offers customizable scaling, queuing, batching, and model composition to accelerate production-grade AI system development.
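To make model composition concrete, here is a rough sketch of a two-service pipeline. The service names and method bodies are hypothetical, and it assumes BentoML 1.2+, where bentoml.depends wires services together and to_async calls a synchronous API from async code:

import bentoml


@bentoml.service(resources={"gpu": 1})
class Embedder:
    @bentoml.api
    def embed(self, text: str) -> list[float]:
        # A real implementation would run an embedding model here.
        return [0.0] * 384


@bentoml.service(resources={"cpu": "2"})
class RAGService:
    # Each dependent service is scheduled and scaled independently,
    # but can be called like a local object.
    embedder = bentoml.depends(Embedder)

    @bentoml.api
    async def answer(self, question: str) -> str:
        vector = await self.embedder.to_async.embed(question)
        # Retrieve context with the vector and call an LLM (omitted in this sketch).
        return f"Retrieved a {len(vector)}-dimensional embedding for: {question}"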

Llama example (other templates include RAG, Function Calling, LLM Structured Outputs, and ControlNet):
import uuid
from typing import Annotated, AsyncGenerator, Optional

import bentoml
from annotated_types import Ge, Le

# MODEL_ID, MAX_TOKENS, SYSTEM_PROMPT, PROMPT_TEMPLATE, and the openai_endpoints
# decorator are defined elsewhere in the example project.


@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(
    name="bentovllm-llama3.1-405b-instruct-awq-service",
    traffic={
        "timeout": 1200,
        "concurrency": 256,  # Matches the default max_num_seqs in the vLLM engine
    },
    resources={
        "gpu": 4,
        "gpu_type": "nvidia-a100-80gb",
    },
)
class VLLM:
    def __init__(self) -> None:
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model=MODEL_ID,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
            tensor_parallel_size=4,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.stop_token_ids = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors in plain English",
        system_prompt: Optional[str] = SYSTEM_PROMPT,
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(
            max_tokens=max_tokens,
            stop_token_ids=self.stop_token_ids,
        )
        if system_prompt is None:
            system_prompt = SYSTEM_PROMPT
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt, system_prompt=system_prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)

02. Scale with Confidence

Seamlessly transition from local prototypes to secure, scalable production deployment, with a single command.

$ bentoml deploy .
🍱 Built bento "vllm:7ftwkpztah74bdwk"
✅ Pushed Bento "vllm:7ftwkpztah74bdwk"
✅ Created deployment "vllm:7ftwkpztah74bdwk" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/vllm-t1y6

03. AI APIs Made Easy

Simplify access to deployed AI applications with an auto-generated web UI, Python client, and REST API. Enable secure, controlled access for client applications with token-based authorization.

Example request with curl:
curl -s -X POST \
  'https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
        "max_tokens": 4096,
        "prompt": "Explain superconductors in plain English",
        "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don'"'"'t know the answer to a question, please don'"'"'t share false information."
      }'
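The Python client mentioned above makes the same call without hand-writing HTTP. The endpoint URL is reused from the curl example; the token keyword for protected endpoints is assumed from recent BentoML client docs:

import bentoml

# SyncHTTPClient exposes each @bentoml.api method as a plain Python call;
# streaming endpoints yield text chunks as they are generated.
client = bentoml.SyncHTTPClient(
    "https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai",
    token="YOUR_API_TOKEN",  # placeholder; issue a real token from the BentoCloud console
)

for chunk in client.generate(
    prompt="Explain superconductors in plain English",
    max_tokens=1024,
):
    print(chunk, end="", flush=True)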

Supercharge Production AI Operations

Power your mission-critical AI with BentoML's optimized inference infrastructure.

Fast GPU auto-scaling with minimal cold starts

Low-latency, high-throughput model serving

Intelligent resource management for cost efficiency

Real-time monitoring and logging for reliable deployments

What our customers say

"BentoML provides our research teams a streamlined way to quickly iterate on their POCs and when ready, deploy their AI services at scale. In addition, the flexible architecture allows us to showcase and deploy many different types of models and workflows from Computer Vision to LLM use cases."

Thariq Khalid, Senior Manager, Computer Vision, ELM Research Center

"Koo started to adopt BentoML more than a year ago as a platform of choice for model deployments and monitoring. From our early experience it was clear that deploying ML models, a statistic that most companies struggle with, was a solved problem for Koo. The BentoML team works closely with their community of users like I've never seen before. Their AMAs, the advocacy on Slack and getting on calls with their customers, are much appreciated by early-adopters and seasoned users."

Harsh Singhal, Head of Machine Learning, Koo

"BentoML is helping us future-proof our machine learning deployment infrastructure at Mission Lane. It is enabling us to rapidly develop and test our model scoring services, and to seamlessly deploy them into our dev, staging, and production Kubernetes clusters."

Mike Kuhlen, Data Science & Machine Learning Solutions and Strategy, Mission Lane

"BentoML is an excellent tool for saving resources and running ML at scale in production"

Woongkyu Lee, Data and ML Engineer, LINE

"Bento have given us the tools and confidence to build our own Voice AI Agent solution. We are really excited to be working with Bento. They have made our development path to production much easier."

Mark Brooker, CEO, MBit