Run Any Model in the Cloud

Inference platform for building fast, secure, and scalable AI apps

Trusted by the best AI teams


How it works

Our open-source model serving framework, BentoML, offers a unified standard for AI inference, model packaging, and serving optimizations.

01. Build

Our open-source model serving framework, BentoML, makes it easy to create high-performance AI API services with custom code.

Llama 7B
Stable Diffusion XL
WhisperX
ControlNet
XTTS
import uuid
from typing import Annotated, AsyncGenerator

import bentoml
from annotated_types import Ge, Le

# Values assumed for this snippet; the original page does not show them.
MAX_TOKENS = 1024
PROMPT_TEMPLATE = "<s>[INST] {user_prompt} [/INST] "  # Llama-2 chat prompt format

@bentoml.service(
    traffic={"timeout": 300},
    resources={"gpu": 1, "gpu_type": "nvidia-l4"},
)
class VLLM:
    def __init__(self) -> None:
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model="meta-llama/Llama-2-7b-chat-hf",
            max_model_len=MAX_TOKENS,
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors like I'm five years old",
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(max_tokens=max_tokens)
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt)
        # Stream tokens from vLLM, yielding only the newly generated suffix.
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
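You can try the service locally before deploying; a quick check with the BentoML CLI, assuming the code above is saved as service.py in the current directory:

bentoml serve service:VLLM

This starts an HTTP server on localhost:3000 that exposes the generate endpoint.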

02. Deploy

A prototype running locally is just one command away from a reliable and secure cloud deployment, ready for scaling in production.

bentoml deploy .

03. Run

Call the deployed endpoint via the auto-generated web UI, the Python client, or the REST API.

curl
python
curl -s -X POST \
  'https://vllm-llama-7b-e3c1c7db.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 1024,
    "prompt": "Explain superconductors like I'"'"'m five years old"
  }'
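Under the python tab, the same call goes through BentoML's Python client. A minimal sketch, assuming BentoML 1.2+ and the example deployment URL above; swap in your own endpoint:

import bentoml

# Streaming endpoints yield chunks of text as they are generated.
with bentoml.SyncHTTPClient("https://vllm-llama-7b-e3c1c7db.mt-guc1.bentoml.ai") as client:
    for chunk in client.generate(
        prompt="Explain superconductors like I'm five years old",
        max_tokens=1024,
    ):
        print(chunk, end="", flush=True)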

With BentoML, you can deploy

From pre-trained models on Hugging Face and models fine-tuned with your own data to custom models and inference code, all run seamlessly on one platform. Here are some example models you can run with BentoML.


Large Language Models

Llama2 70B

Mixtral 8x7B

DBRX

Solar

Gemma


Image/Video Generation

StableDiffusionXL-turbo

LCM-LoRA

Stable Video Diffusion

ComfyUI Pipeline


AI Applications

Private RAG

Multi-Model Inference Graph

Computer Vision Pipelines

Real-time LLM Chat


Bring Your Own Model

Fine-tuned LLM

Dynamic LoRA loading

Custom models trained with PyTorch, TensorFlow, JAX, XGBoost, and more (see the sketch below)
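To make the last item concrete, here is a minimal sketch of serving a custom PyTorch model with BentoML; the weights file, input shape, and class name are illustrative assumptions, not part of the original:

import bentoml
import torch

@bentoml.service(resources={"cpu": "2"})
class CustomModel:
    def __init__(self) -> None:
        # Hypothetical weights file: load whatever model you trained yourself.
        self.model = torch.jit.load("model.pt")
        self.model.eval()

    @bentoml.api
    def predict(self, features: list[float]) -> list[float]:
        # Run a single example through the model without tracking gradients.
        with torch.inference_mode():
            batch = torch.tensor(features).unsqueeze(0)
            return self.model(batch).squeeze(0).tolist()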

You own the models. We run the infrastructure.

We've built the fastest cloud infrastructure for AI inference. It comes with everything you need to streamline the path to production AI.

Automatic scaling up and down, all the way to zero; pay only for what you use

Flexible APIs for deploying online API services, batch inference jobs, and async job queues (see the sketch after this list)

See how your models are performing and troubleshoot issues with built-in observability tools
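To illustrate the async job queue item above, here is a minimal sketch of a BentoML task endpoint; the service, endpoint name, and placeholder logic are assumptions, and the exact client API may vary by BentoML version:

import bentoml

@bentoml.service()
class BatchScorer:
    # A task endpoint runs in the background: clients submit a job,
    # then poll or block for the result instead of holding a connection open.
    @bentoml.task
    def score(self, records: list[list[float]]) -> list[float]:
        # Placeholder scoring logic; substitute your own model call.
        return [sum(r) for r in records]

# Client side (with the service running, e.g. locally on port 3000):
#   with bentoml.SyncHTTPClient("http://localhost:3000") as client:
#       task = client.score.submit(records=[[1.0, 2.0], [3.0, 4.0]])
#       result = task.get()  # waits for the background job to finish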

Bring Your Own Cloud

Self-host BentoCloud runtime in your own cloud account

Let your models and data live within your Virtual Private Cloud (VPC)

Full visibility and control over the compute resources and network access

Easily utilize compute across multiple clouds

What our customers say

“Koo started to adopt BentoML more than a year ago as a platform of choice for model deployments and monitoring. From our early experience it was clear that deploying ML models, a statistic that most companies struggle with, was a solved problem for Koo. The BentoML team works closely with their community of users like I've never seen before. Their AMAs, the advocacy on Slack and getting on calls with their customers, are much appreciated by early-adopters and seasoned users.”

Harsh Singhal, Head of Machine Learning, Koo

“BentoML is helping us future-proof our machine learning deployment infrastructure at Mission Lane. It is enabling us to rapidly develop and test our model scoring services, and to seamlessly deploy them into our dev, staging, and production Kubernetes clusters.”

Mike Kuhlen, Data Science & Machine Learning Solutions and Strategy, Mission Lane

“BentoML enables us to deliver business value quickly by allowing us to deploy ML models to our existing infrastructure and scale the model services easily.”

Shihgian Lee, Senior Machine Learning Engineer, Porch

"BentoML is an excellent tool for saving resources and running ML at scale in production"

Woongkyu Lee, Data and ML Engineer, LINE

“BentoML has helped us scale the way we help our users package and test their models. Their framework is a core piece of our product. Really happy to be a part of the BentoML community.”

Gabriel Bayomi, CEO, OpenLayer