Build and Scale Compound AI Systems

Compute orchestration platform for rapid and reliable GenAI adoption, from model inference to advanced AI applications.

Trusted by the best AI teams


From Models to AI Systems

Building compound AI systems requires efficient model inference, fully customizable application code, and scalable compute infrastructure working seamlessly together. The BentoML platform unifies the development and deployment experience across these three layers, so AI teams can focus on building the core product and ship faster.

01. Rapid Iterations

BentoML’s code-first approach gives developers the flexibility to architect multi-model, multi-component distributed systems, providing the key building blocks for defining custom AI APIs, inference workers, async task queues, scaling behavior, batching optimization, and model composition.

Example: RAG (other templates include Function Calling, LLM Structured Outputs, and ControlNet)
import typing as t
from typing import AsyncGenerator, Optional

import bentoml
import numpy as np


@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-l4"})
class Llama:

    @bentoml.api
    async def generate(
        self,
        prompt: str,
        system_prompt: Optional[str] = SYSTEM_PROMPT,
    ) -> AsyncGenerator[str, None]:
        """Omitted for simplicity, see
        https://github.com/bentoml/BentoVLLM/tree/main/llama3.1-8b-instruct
        for details"""
        ...


@bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-t4"})
class SentenceTransformer:

    @bentoml.api
    def embed(self, sentences: t.List[str]) -> np.ndarray:
        """Omitted for simplicity, see
        https://github.com/bentoml/BentoSentenceTransformers for details"""
        ...


@bentoml.service(resources={"cpu": 1})
class WallStreetJournal:
    # Compose the two model services above; BentoML injects client proxies.
    embedding_service = bentoml.depends(SentenceTransformer)
    llm_service = bentoml.depends(Llama)

    def __init__(self) -> None:
        # Initialize the vector database
        self.vectordb = init_vector_db()
        ...

    @bentoml.api
    async def query(self, query: str) -> AsyncGenerator[str, None]:
        # Embed the input query
        embeddings = self.embedding_service.embed([query])
        # Retrieve related documents from the vector database
        docs = self.vectordb.query(embeddings)
        # Fill the system prompt with the retrieved documents
        system_prompt = SYSTEM_PROMPT_TEMPLATE.format(docs=docs)
        # Stream the LLM response; generate yields the cumulative text,
        # so track a cursor and emit only the newly generated part
        stream = self.llm_service.generate(query, system_prompt)
        cursor = 0
        async for text in stream:
            yield text[cursor:]
            cursor = len(text)
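Services can also opt into the batching optimization mentioned above. A minimal sketch (the Embedder service here is illustrative, not part of the example above; batchable=True asks BentoML to merge concurrent requests into a single batched call along batch_dim, bounded by max_batch_size and max_latency_ms):

import typing as t

import bentoml
import numpy as np


@bentoml.service(resources={"gpu": 1})
class Embedder:
    # Concurrent embed() calls are grouped into one batch before hitting
    # the model, improving GPU utilization under load.
    @bentoml.api(batchable=True, batch_dim=0, max_batch_size=32, max_latency_ms=200)
    def embed(self, sentences: t.List[str]) -> np.ndarray:
        # Model loading and inference omitted; returns one row per sentence.
        ...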

02. Launch with Confidence

BentoML lets you build with production in mind. A prototype running locally is just one command away from a secure and scalable cloud deployment, ready for end-to-end evaluation or scaling in production.

bentoml deploy .

🍱 Built bento "wall_street_journal:6huvjoswig65acvj"
✅ Pushed Bento "wall_street_journal:6huvjoswig65acvj"
✅ Created deployment "wall_street_journal-t1y6" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/wall_street_journal-t1y6

03. AI APIs Made Easy

BentoML standardizes how users and client applications access deployed services via an auto-generated web UI, a Python client, or a REST API, with the option to enable token-based authorization and network security controls.

Using curl:

curl -s -X POST \
  'https://wall-street-journal-t1y6.bentoml.ai/query' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What are the key risk factors mentioned in the latest quarterly earnings report for XYZ Corporation, and how do they compare to the previous earnings report?"
  }'
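Using the Python client (a minimal sketch: it assumes the deployment URL above and uses bentoml.SyncHTTPClient, which maps the service's query endpoint to a client method and yields chunks for streaming responses):

import bentoml

# Connect to the deployment shown above; the client exposes the
# service's `query` API as a method.
with bentoml.SyncHTTPClient("https://wall-street-journal-t1y6.bentoml.ai") as client:
    # Streaming endpoints return a generator of response chunks.
    for chunk in client.query(
        query="What are the key risk factors mentioned in the latest "
        "quarterly earnings report for XYZ Corporation, and how do they "
        "compare to the previous earnings report?"
    ):
        print(chunk, end="", flush=True)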

Accelerate time to market for your business-critical AI applications

Unlock the power of Compound AI with BentoML! We’ve optimized every layer of the inference stack, streamlined the development workflow for fast iteration cycles, and simplified production operations for robust and secure AI deployments.


Streamlined Developer Workflow

  • Best open-source serving engine
  • Local development with instant remote test runs on cloud GPUs
  • Open ecosystem with hundreds of integrations

Inference Performance

  • Up to 20x inference throughput for open-source LLMs
  • Fully customizable inference tailored to your needs
  • Support for NVIDIA and AMD GPUs, Google TPUs, and AWS Inferentia

Scalable Infrastructure

  • Auto-scale to meet demand and reduce cost (see the sketch below)
  • Blazing fast cold starts
  • Complete infrastructure for online and batch inference
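For example, a deployment's autoscaling bounds can be set at deploy time (a sketch assuming the --scaling-min and --scaling-max options of the bentoml deploy command):

# Scale to zero when idle, up to 10 replicas under load
bentoml deploy . --scaling-min 0 --scaling-max 10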

Control & Privacy

  • Own your model and data
  • Run within your virtual private cloud
  • Flexible multi-cloud and multi-region deployments

Your Cloud, Your Control: Built for Enterprise AI

Our BYOC (Bring Your Own Cloud) offering brings the best AI infrastructure to your cloud, giving you full control over AI workloads and security along with the benefits of a managed service.

Future-Proof Strategy: Stay ahead with an extensible, open-source-based platform

Multi-Cloud Flexibility: Easily utilize compute across AWS, GCP, Azure, Oracle, or on-premises infrastructure

Uncompromised Security: Your models and data never leave your secured network and environment

Supercharge Production AI Operations

Compute Resource Optimization: Efficient deployments across multiple clusters and regions, with intelligent resource provisioning.

Instant Scaling: Minimize latency and improve utilization with rapid cold starts and flexible scaling.

Full Observability: Real-time monitoring and logging for reliable AI operations.

What our customers say

“Koo started to adopt BentoML more than a year ago as a platform of choice for model deployments and monitoring. From our early experience it was clear that deploying ML models, a task that most companies struggle with, was a solved problem for Koo. The BentoML team works closely with their community of users like I've never seen before. Their AMAs, the advocacy on Slack, and getting on calls with their customers are much appreciated by early adopters and seasoned users.”

Harsh Singhal, Head of Machine Learning, Koo

“BentoML is helping us future-proof our machine learning deployment infrastructure at Mission Lane. It is enabling us to rapidly develop and test our model scoring services, and to seamlessly deploy them into our dev, staging, and production Kubernetes clusters.”

Mike Kuhlen, Data Science & Machine Learning Solutions and Strategy, Mission Lane

“BentoML enables us to deliver business value quickly by allowing us to deploy ML models to our existing infrastructure and scale the model services easily.”

Shihgian Lee, Senior Machine Learning Engineer, Porch

"BentoML is an excellent tool for saving resources and running ML at scale in production"

Woongkyu Lee, Data and ML Engineer, LINE

“BentoML has helped us scale the way we help our users package and test their models. Their framework is a core piece of our product. Really happy to be a part of the BentoML community.”

Gabriel Bayomi, CEO, OpenLayer