February 20, 2025 • Written By Sherlock Xu
From RAG applications to multi-model pipelines, AI systems are growing increasingly complex. When deploying these systems, traditional development workflows often feel like they're holding us back. We've heard your pain points, especially around managing complex environments, battling slow iteration time, and overcoming deployment bottlenecks.
Today, we're excited to announce BentoML 1.4, a major release packed with new features and enhancements, including:
- BentoML Codespaces for fast, cloud-powered development iteration
- A new Python SDK for defining runtime environments alongside your Service code
- Faster model downloading and loading
- External dependencies with bentoml.depends() for calling remote Deployments
These features were built from real feedback within our community, directly addressing the challenges you face when building AI applications. Now, let’s dive into how these improvements will empower your development workflow!
Picture this: You're building a cutting-edge RAG application with multiple models, and every time you make a small code change, you wait... and wait... and wait. Your local GPU struggles to keep up, dependencies conflict with each other, and what works on your machine mysteriously breaks in production. It’s like trying to build a spaceship in your garage: technically doable, but painfully inefficient.
We’ve seen how these day-to-day dev pains slow the deployment of modern AI applications like RAG and AI agents. That's why we're thrilled to introduce BentoML Codespaces — our solution to put an end to those pains for good.
With BentoML Codespaces, you can develop and iterate on AI applications up to 20 times faster than before. Getting started is as easy as running the following command:
bentoml code
This command connects your local development environment to a remote sandbox on BentoCloud, our unified AI Inference Platform. Every change you make locally synchronizes instantly to the cloud sandbox, so you can build and debug against powerful cloud compute and a consistent environment without leaving your local editor.

In short, you get up to 20x faster iteration than with traditional local development, letting you focus on building great AI applications instead of fighting your development environment.
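To make this concrete, here's a minimal Service you might iterate on inside a Codespace; the Summarizer name and placeholder logic are purely illustrative. Run bentoml code from the project directory, keep editing the file locally, and the sandbox picks up each change immediately:

import bentoml

# A minimal, illustrative Service to iterate on in a Codespace
@bentoml.service
class Summarizer:
    @bentoml.api
    def summarize(self, text: str) -> str:
        # Placeholder logic; swap in real model inference as you iterate
        return text[:100]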
To learn more about how it all works, see our blog post Accelerate AI Application Development with BentoML Codespaces and BentoML documentation.
In previous versions, you needed to set runtime configurations for a Bento in separate files like bentofile.yaml or pyproject.toml. While this works, it means juggling multiple files and contexts when building your AI services. We heard your feedback about this scattered configuration approach, and we're excited to introduce a more intuitive, unified solution with BentoML’s new Python SDK.

Now, you can define the entire runtime environment directly alongside your Service code in a single Python file, service.py. This update makes runtime configuration simpler, cleaner, and more dynamic.
Let’s take a quick look at how this works:
import bentoml

# Set the Bento Python version and required dependencies
my_image = bentoml.images.PythonImage(python_version='3.11') \
    .python_packages("torch", "transformers")

@bentoml.service(image=my_image)
class MyService:
    # Service implementation
    ...
With everything in one place, you can take full advantage of the dynamic features in Python, such as subclassing, to customize your runtime environment with just a few lines of code.
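For example, here's a rough sketch of building the image definition programmatically; the package lists and the USE_GPU environment variable are made-up placeholders, not part of the release:

import os
import bentoml

# Illustrative: assemble the dependency list dynamically before defining the image
BASE_PACKAGES = ["torch", "transformers"]
EXTRA_PACKAGES = ["accelerate"] if os.getenv("USE_GPU") == "1" else []

my_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages(*BASE_PACKAGES, *EXTRA_PACKAGES)

@bentoml.service(image=my_image)
class MyService:
    # Service implementation
    ...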
Another benefit of this approach is caching. If the image definition doesn’t change, BentoML intelligently caches the layers, avoiding unnecessary rebuilds. Plus, we've drastically improved the image pulling speed — what used to take minutes now happens in seconds.
But that’s not all. The new API gives you fine-grained control over every aspect of your runtime environment. You can specify parameters like the operating system distro and system packages. Here are more available methods and parameters:
# Specify the distro and system packages, and use a requirements.txt file
my_image = bentoml.images.PythonImage(python_version='3.11', distro='alpine') \
    .system_packages("curl", "git") \
    .requirements_file("requirements.txt")
One of the most powerful features of the new API is the ability to run custom commands during the build process. The run() method can be chained together with other configuration options, giving you total control over your build pipeline:
import bentoml

image = bentoml.images.PythonImage(python_version='3.11') \
    .run('echo "Starting build process..."') \
    .system_packages("curl", "git") \
    .run('echo "System packages installed"') \
    .python_packages("pillow", "fastapi") \
    .run('echo "Python packages installed"')
run() is context-sensitive, meaning commands execute in the correct order. For instance, commands placed before .python_packages() will run before the Python dependencies are installed.
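As a more practical sketch (the repository URL and commands are hypothetical), you could clone build-time assets right after git becomes available, then verify the Python dependencies once they're installed:

import bentoml

# Hypothetical build steps showing how command ordering follows the chain
image = bentoml.images.PythonImage(python_version="3.11") \
    .system_packages("git") \
    .run("git clone --depth 1 https://github.com/example/assets.git /tmp/assets") \
    .python_packages("fastapi") \
    .run("python -c 'import fastapi'")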
Note: If you prefer the old approach, we still fully support configuration through bentofile.yaml and pyproject.toml.
For details, check out the full documentation.
If you've ever worked with large models from repositories like Hugging Face (HF), you're all too familiar with the pain of waiting for them to load. Whether you're building a new image or scaling up your deployment, those precious minutes spent waiting can feel like hours. That's why we introduced significant improvements to the model loading mechanism in BentoML 1.4.
In this release, BentoML accelerates model loading in two ways:
Here’s an example of loading HF models with the new HuggingFaceModel API. By default, it returns the downloaded model path as a string, which you can directly pass into libraries like transformers:
import bentoml
from bentoml.models import HuggingFaceModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

@bentoml.service(resources={"cpu": "200m", "memory": "512Mi"})
class MyService:
    # Specify a model from HF with its ID
    model_path = HuggingFaceModel("google-bert/bert-base-uncased")

    def __init__(self):
        # Load the actual model and tokenizer within the instance context
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)
For models hosted in private HF repositories, simply specify your custom endpoint:
model_path = HuggingFaceModel("your_model_id", endpoint="https://my.huggingface.co/")
For models from sources other than HF, use BentoModel:
import bentoml
from bentoml.models import BentoModel
import joblib

@bentoml.service(resources={"cpu": "200m", "memory": "512Mi"})
class MyService:
    # Define the model reference at the class level to
    # load a model from the Model Store or BentoCloud
    iris_ref = BentoModel("iris_sklearn:latest")

    def __init__(self):
        self.iris_model = joblib.load(self.iris_ref.path_of("model.pkl"))
Note: Always define your model references in the class scope of your Service. This ensures they're properly tracked as dependencies and available when your Service is deployed.
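Putting the pieces together, a Service along these lines could expose the loaded HF model through an inference endpoint; the BertClassifier name and the classify logic are illustrative, not something prescribed by the release:

import bentoml
import torch
from bentoml.models import HuggingFaceModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

@bentoml.service(resources={"cpu": "200m", "memory": "512Mi"})
class BertClassifier:
    # Model reference defined in the class scope, as recommended above
    model_path = HuggingFaceModel("google-bert/bert-base-uncased")

    def __init__(self):
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)

    @bentoml.api
    def classify(self, text: str) -> int:
        # Tokenize, run a forward pass, and return the predicted class index
        inputs = self.tokenizer(text, return_tensors="pt")
        with torch.no_grad():
            logits = self.model(**inputs).logits
        return int(logits.argmax(dim=-1).item())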
For more information, see model loading and management.
The ability to run distributed Services has always been a key feature in BentoML. From orchestrating CPU and GPU processing to configuring multi-model pipelines, BentoML makes it easy to connect Services, optimize resource utilization, and scale models efficiently.
The core way to set up these interactions is through bentoml.depends(). It allows one Service to call another Service’s methods as if they were local calls. BentoML abstracts away the complexities of network communication, serialization, and deserialization, so you don’t need to worry about how data travels between Services.
Previously, you could set up dependencies only on internal BentoML Services:
import bentoml
import numpy as np

@bentoml.service
class ServiceA:
    service_b = bentoml.depends(ServiceB)

    @bentoml.api
    def predict(self, input: np.ndarray) -> int:
        # Call the predict function from ServiceB
        return int(self.service_b.predict(input)[0][0])
This works great for local Services, but what about calling AI services already deployed on BentoCloud or running on other servers? We heard your need for more flexibility, and now we're extending the power of bentoml.depends() to reach beyond your local environment.
With BentoML 1.4, you can easily depend on any deployed model, whether it’s hosted on BentoCloud or running on your own infrastructure. Here's how simple it is:
import bentoml
import numpy as np

@bentoml.service
class MyService:
    # `cluster` is only needed if your Deployment is in a non-default cluster
    iris = bentoml.depends(deployment="iris-classifier-x6dewa", cluster="my_cluster_name")

    # Call the model deployed on BentoCloud by specifying its URL
    # iris = bentoml.depends(url="https://iris.example-url.bentoml.ai")

    # Call the model served elsewhere
    # iris = bentoml.depends(url="http://192.168.1.1:3000")

    @bentoml.api
    def predict(self, input: np.ndarray) -> int:
        # Call the predict function from the remote Deployment
        return int(self.iris.predict(input)[0][0])
The syntax stays exactly the same whether you're calling a local Service or a remote Deployment. All the complexity of network communication, serialization, and deserialization is handled automatically by BentoML.
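As a quick sanity check (the input values are arbitrary and the port assumes a default local bentoml serve), you can call the composed Service through BentoML's HTTP client and let it resolve the remote dependency for you:

import bentoml
import numpy as np

# Illustrative client call; assumes MyService is running locally on port 3000
client = bentoml.SyncHTTPClient("http://localhost:3000")
print(client.predict(input=np.array([[5.1, 3.5, 1.4, 0.2]])))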
For details, check out the BentoML documentation.
BentoML 1.4 brings significant enhancements to your AI development and deployment process. From accelerating your iteration cycle to simplifying runtime configurations, each feature was designed to address the pain points you face daily.
As always, BentoML is built with your needs in mind, and we can’t wait to hear how these new features help you bring your AI applications to life!
Check out the following resources to learn more and stay connected: