Our BYOC offering brings the leading inference infrastructure to your cloud, giving you full control over your AI workloads.
Deploy in your own cloud - AWS, GCP, Azure, and more
Efficient provisioning across multiple clouds and regions
Leverage existing cloud commitments and credits
SOC 2 certified, ensuring your models and data remain secure
BentoML is the most flexible way to build production-grade AI systems with any open-source or custom fine-tuned model. We run the infrastructure, so you can focus on innovating.
Bring your models and code to create inference APIs, job queues, and multi-model pipelines. BentoML's open-source framework offers customizable scaling, queuing, batching, and model composition to accelerate the development of production-grade AI systems.
import uuid
from typing import Annotated, AsyncGenerator, Optional

import bentoml
from annotated_types import Ge, Le

# Helper from the BentoVLLM example project that adds
# OpenAI-compatible endpoints on top of the service.
from bentovllm_openai.utils import openai_endpoints

MODEL_ID = "hugging-quants/Meta-Llama-3.1-405B-Instruct-AWQ-INT4"  # example checkpoint
MAX_TOKENS = 4096
SYSTEM_PROMPT = "You are a helpful, respectful and honest assistant."  # illustrative default

# Llama 3.1 chat template used to assemble the final prompt.
PROMPT_TEMPLATE = """<|begin_of_text|><|start_header_id|>system<|end_header_id|>

{system_prompt}<|eot_id|><|start_header_id|>user<|end_header_id|>

{user_prompt}<|eot_id|><|start_header_id|>assistant<|end_header_id|>

"""


@openai_endpoints(
    model_id=MODEL_ID,
    default_chat_completion_parameters=dict(stop=["<|eot_id|>"]),
)
@bentoml.service(
    name="bentovllm-llama3.1-405b-instruct-awq-service",
    traffic={
        "timeout": 1200,
        "concurrency": 256,  # Matches the default max_num_seqs in the vLLM engine
    },
    resources={
        "gpu": 4,
        "gpu_type": "nvidia-a100-80gb",
    },
)
class VLLM:
    def __init__(self) -> None:
        from transformers import AutoTokenizer
        from vllm import AsyncEngineArgs, AsyncLLMEngine

        ENGINE_ARGS = AsyncEngineArgs(
            model=MODEL_ID,
            max_model_len=MAX_TOKENS,
            enable_prefix_caching=True,
            tensor_parallel_size=4,  # Shard the model across the 4 GPUs
        )
        self.engine = AsyncLLMEngine.from_engine_args(ENGINE_ARGS)
        tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
        self.stop_token_ids = [
            tokenizer.eos_token_id,
            tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        ]

    @bentoml.api
    async def generate(
        self,
        prompt: str = "Explain superconductors in plain English",
        system_prompt: Optional[str] = SYSTEM_PROMPT,
        max_tokens: Annotated[int, Ge(128), Le(MAX_TOKENS)] = MAX_TOKENS,
    ) -> AsyncGenerator[str, None]:
        from vllm import SamplingParams

        SAMPLING_PARAM = SamplingParams(
            max_tokens=max_tokens,
            stop_token_ids=self.stop_token_ids,
        )
        if system_prompt is None:
            system_prompt = SYSTEM_PROMPT
        prompt = PROMPT_TEMPLATE.format(user_prompt=prompt, system_prompt=system_prompt)
        stream = await self.engine.add_request(uuid.uuid4().hex, prompt, SAMPLING_PARAM)

        # Stream back only the newly generated text on each iteration.
        cursor = 0
        async for request_output in stream:
            text = request_output.outputs[0].text
            yield text[cursor:]
            cursor = len(text)
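Model composition follows the same pattern: one service can declare another as a dependency and call it like a local function, while each scales independently. Below is a minimal sketch of a two-model pipeline, assuming the BentoML 1.2+ service API; Embedder and Pipeline are hypothetical placeholder services, not part of the example above.

import bentoml


@bentoml.service(resources={"gpu": 1})
class Embedder:
    @bentoml.api
    def embed(self, text: str) -> list[float]:
        # Placeholder: a real service would run an embedding model here.
        return [float(len(text))]


@bentoml.service
class Pipeline:
    # Declare the dependency; BentoML wires the services together and
    # scales each one independently.
    embedder = bentoml.depends(Embedder)

    @bentoml.api
    def classify(self, text: str) -> str:
        vector = self.embedder.embed(text)
        return "long" if vector[0] > 100 else "short"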
Seamlessly transition from local prototypes to secure, scalable production deployments with a single command.
bentoml deploy .

🍱 Built bento "vllm:7ftwkpztah74bdwk"
✅ Pushed Bento "vllm:7ftwkpztah74bdwk"
✅ Created deployment "vllm:7ftwkpztah74bdwk" in cluster "gcp-us-central-1"
💻 View Dashboard: https://ss-org-1.cloud.bentoml.com/deployments/vllm-t1y6
Simplify access to deployed AI applications with an auto-generated web UI, Python client, and REST API. Enable secure, controlled access for client applications with token-based authorization.
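For example, the deployment above can be called with the BentoML Python client. A minimal sketch, where the URL matches the deployment shown below and YOUR_API_TOKEN is a placeholder for a token issued in your organization:

import bentoml

# Connect to the deployed service; the token satisfies the deployment's
# token-based authorization.
client = bentoml.SyncHTTPClient(
    "https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai",
    token="YOUR_API_TOKEN",
)

# Streaming endpoints yield text chunks as they are generated.
for chunk in client.generate(prompt="Explain superconductors in plain English"):
    print(chunk, end="", flush=True)

The same endpoint is also available over plain HTTP: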
curl -s -X POST \
  'https://bentovllm-llama3-1-405b-instruct-awq-service.mt-guc1.bentoml.ai/generate' \
  -H 'Content-Type: application/json' \
  -d '{
    "max_tokens": 4096,
    "prompt": "Explain superconductors in plain English",
    "system_prompt": "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.\n\nIf a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don'"'"'t know the answer to a question, please don'"'"'t share false information."
  }'
Power your mission-critical AI with BentoML's optimized inference infrastructure.
Fast GPU auto-scaling with minimal cold-start times (see the sketch after this list)
Low-latency, high-throughput model serving
Intelligent resource management for cost efficiency
Real-time monitoring and logging for reliable deployments
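As a concrete example of the scaling controls: per-replica concurrency targets are set on the service (as in the vLLM service above), while replica bounds can be managed programmatically. A minimal sketch using the deployment management API, assuming BentoML 1.2+ with a configured cloud context; the Bento tag reuses the one built earlier:

import bentoml

# Create a deployment with explicit replica bounds; the autoscaler adds or
# removes GPU replicas to keep each one near its concurrency target.
bentoml.deployment.create(
    bento="vllm:7ftwkpztah74bdwk",
    scaling_min=0,  # scale to zero when idle to save GPU cost
    scaling_max=4,  # cap replicas for cost control
)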