Building compound AI systems requires efficient model inference, fully customizable application code, and scalable compute infrastructure working seamlessly together. The BentoML platform unifies the development and deployment experience across these three layers, freeing AI teams to focus on their core product and ship faster.
BentoML’s code-first approach gives developers the flexibility to architect multi-model, multi-component distributed systems, offering the key building blocks for defining custom AI APIs, inference workers, async task queues, scaling behavior, batching optimization, and model composition.
from __future__ import annotations

import typing as t
from typing import AsyncGenerator, Optional

import bentoml
import numpy as np

from bentovllm_openai.utils import openai_endpoints

# MODEL_ID, SYSTEM_PROMPT, SYSTEM_PROMPT_TEMPLATE and init_vector_db are
# defined elsewhere in the project and elided here for brevity.


@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-l4"})
class Llama:

    @bentoml.api
    async def generate(
        self,
        prompt: str,
        system_prompt: Optional[str] = SYSTEM_PROMPT,
    ) -> AsyncGenerator[str, None]:
        """Omitted for simplicity, see
        https://github.com/bentoml/BentoVLLM/tree/main/llama3.1-8b-instruct for details"""
        ...


@bentoml.service(resources={"gpu": 1, "gpu_type": "nvidia-t4"})
class SentenceTransformers:

    @bentoml.api
    def embed(self, sentences: t.List[str]) -> np.ndarray:
        """Omitted for simplicity, see
        https://github.com/bentoml/BentoSentenceTransformers for details"""
        ...


@bentoml.service(resources={"cpu": 1})
class WallStreetJournal:
    # compose the two GPU services above; each scales independently
    embedding_service = bentoml.depends(SentenceTransformers)
    llm_service = bentoml.depends(Llama)

    def __init__(self) -> None:
        # init vector_db
        self.vectordb = init_vector_db()
        ...

    @bentoml.api
    async def query(self, query: str) -> AsyncGenerator[str, None]:
        # get the embedding of the input query
        embeddings = self.embedding_service.embed([query])
        # retrieve related documents from the vector db
        docs = self.vectordb.query(embeddings)
        # fill the system prompt with the retrieved docs
        system_prompt = SYSTEM_PROMPT_TEMPLATE.format(docs=docs)
        # stream the generated text back to the caller as it arrives
        async for chunk in self.llm_service.generate(
            prompt=query, system_prompt=system_prompt
        ):
            yield chunk
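The same decorators expose the other building blocks mentioned above. Below is a minimal sketch, not a tuned configuration: the service name and every parameter value are illustrative assumptions. batchable=True turns an endpoint into an adaptively batched one, workers sets the number of inference workers per instance, and @bentoml.task registers a long-running job on an async task queue.

import typing as t

import bentoml
import numpy as np


@bentoml.service(
    workers=2,                                    # inference workers per service instance
    traffic={"concurrency": 32, "timeout": 300},  # concurrency target used for routing and scaling
    resources={"gpu": 1, "gpu_type": "nvidia-t4"},
)
class BatchedEmbedder:

    # adaptive batching: concurrent requests are grouped into a single
    # batch, bounded by max_batch_size and max_latency_ms
    @bentoml.api(batchable=True, max_batch_size=64, max_latency_ms=50)
    def embed(self, sentences: t.List[str]) -> np.ndarray:
        ...

    # async task queue: clients submit the job and poll for the result
    # later instead of holding a connection open
    @bentoml.task
    def embed_corpus(self, document_ids: t.List[str]) -> int:
        ...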
BentoML lets you build with production in mind. A prototype running locally is just one command away from a secure and scalable cloud deployment, ready for end-to-end evaluation or scaling in production.
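During development, that local loop is a single CLI command as well; a minimal sketch, assuming the service code above lives in service.py:

bentoml serve service:WallStreetJournal  # starts a local HTTP server on port 3000

From there, deploying to BentoCloud is one command: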
bentoml deploy .
╭──────────────────────────────────────────────────────────────────────────────────────────────╮
│ 🍱 Built bento "wall_street_journal:6huvjoswig65acvj"                                        │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────────────────────────────────────────╮
│ Successfully pushed Bento "wall_street_journal:6huvjoswig65acvj"                             │
│ ✅ Pushed Bento "wall_street_journal:6huvjoswig65acvj"                                       │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────────────────────────────────────────╮
│ ✅ Created deployment "wall_street_journal-t1y6" in cluster "gcp-us-central-1"               │
│ 💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/wall_street_journal-t1y6  │
╰──────────────────────────────────────────────────────────────────────────────────────────────╯
BentoML standardizes how users and client applications access deployed services: via an auto-generated web UI, a Python client, or a REST API, with the option to enable token-based authorization or network-level security.
curl -s -X POST \
  'https://wall-street-journal-t1y6.bentoml.ai/query' \
  -H 'Content-Type: application/json' \
  -d '{
    "query": "What are the key risk factors mentioned in the latest quarterly earnings report for XYZ Corporation, and how do they compare to the previous earnings report?"
  }'
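The same deployment is callable from Python. A minimal sketch using bentoml.SyncHTTPClient, with the URL taken from the deployment above; for deployments with token-based authorization enabled, a BentoCloud API token would also be supplied when constructing the client:

import bentoml

# endpoint methods on the client mirror the @bentoml.api names on the service
with bentoml.SyncHTTPClient("https://wall-street-journal-t1y6.bentoml.ai") as client:
    # `query` is a streaming endpoint, so the call returns an iterator of text chunks
    for chunk in client.query(
        query="What are the key risk factors mentioned in the latest "
        "quarterly earnings report for XYZ Corporation?"
    ):
        print(chunk, end="", flush=True)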
Our BYOC (Bring Your Own Cloud) offering brings best-in-class AI infrastructure to your own cloud, providing full control over AI workloads and security with the benefits of a managed service.
Future-Proof Strategy: Stay ahead with an extensible, open-source-based platform.
Multi-Cloud Flexibility: Easily utilize compute across AWS, GCP, Azure, Oracle, or on-premises infrastructure.
Uncompromised Security: Your models and data never leave your secured network and environment.
Compute Resource Optimization: Efficient deployments across multiple clusters and regions, with intelligent resource provisioning.
Instant Scaling: Minimize latency and improve utilization with rapid cold starts and flexible scaling.
Full Observability: Real-time monitoring and logging for reliable AI operations.