Announcing BentoML 1.4

February 20, 2025 • Written By Sherlock Xu

From RAG applications to multi-model pipelines, AI systems are growing increasingly complex. When deploying these systems, traditional development workflows often feel like they're holding us back. We've heard your pain points, especially around managing complex environments, battling slow iteration time, and overcoming deployment bottlenecks.

Today, we're excited to announce BentoML 1.4, a major release packed with new features and enhancements, including:

  • Codespaces: Iterate 20x faster with a cloud development platform
  • New runtime configuration SDK: Define your entire Bento runtime using pure Python
  • Model download acceleration: Load AI models faster than ever
  • External dependencies: Seamlessly connect to any deployed model

These features were built from real feedback within our community, directly addressing the challenges you face when building AI applications. Now, let’s dive into how these improvements will empower your development workflow!

20x faster iteration with Codespaces

Picture this: You're building a cutting-edge RAG application with multiple models, and every time you make a small code change, you wait... and wait... and wait. Your local GPU struggles to keep up, dependencies conflict with each other, and what works on your machine mysteriously breaks in production. It’s like trying to build a spaceship in your garage: technically doable, but painfully inefficient.

We’ve seen how these day-to-day dev pains slow the deployment of modern AI applications like RAG and AI agents. That's why we're thrilled to introduce BentoML Codespaces — our solution to put an end to those pains for good.

With BentoML Codespaces, you can develop and iterate on AI applications up to 20 times faster than before. Getting started is as easy as running the following command:

bentoml code

This command connects your local development environment to a remote sandbox on BentoCloud, our unified AI Inference Platform. Every change you make locally synchronizes instantly in the cloud. Here’s what you can do:

  • Develop and iterate on AI applications with your favorite IDE
  • Leverage powerful cloud GPUs for building modern AI applications with any open-source model
  • Eliminate dependency management headaches with auto-provisioned environments
  • Debug with real-time updates mirroring production and live logs on a cloud dashboard

In short, you get up to 20x faster iteration than with traditional local development, so you can focus on building great AI applications instead of fighting your development environment.
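To make this concrete, here is a minimal sketch of the kind of service.py you might iterate on inside a Codespace. The Service name and API below are placeholders for illustration only and are not part of this release:

import bentoml

# Hypothetical example Service; edit it locally and the change syncs
# to the remote sandbox created by `bentoml code`.
@bentoml.service
class EchoService:
    @bentoml.api
    def echo(self, text: str) -> str:
        # Swap this body for your model call; the cloud GPU runs the updated code
        return text

Running bentoml code from the project directory attaches it to a remote sandbox, so each save is reflected in the cloud environment and in the live logs on the dashboard.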

To learn more about how it all works, see our blog post Accelerate AI Application Development with BentoML Codespaces and BentoML documentation.

New runtime SDK: Define Bento specifications in pure Python

In previous versions, you had to define runtime configurations for a Bento in separate files like bentofile.yaml or pyproject.toml. While this works, it means juggling multiple files and contexts when building your AI services. We heard your feedback about this scattered configuration approach, and we're excited to introduce a more intuitive, unified solution with BentoML’s new Python SDK.

Now, you can define the entire runtime environment directly alongside your Service code in a single Python file, service.py. This update makes runtime configuration simpler, cleaner, and more dynamic.

Let’s take a quick look at how this works:

import bentoml

# Set the Bento Python version and required dependencies
my_image = bentoml.images.PythonImage(python_version='3.11') \
    .python_packages("torch", "transformers")

@bentoml.service(image=my_image)
class MyService:
    # Service implementation

With everything in one place, you can take full advantage of the dynamic features in Python, such as subclassing, to customize your runtime environment with just a few lines of code.
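For example, here is a minimal sketch of how ordinary Python logic could drive the image definition. The DEBUG_BUILD flag and the extra package are hypothetical, and we assume each chained call returns the updated image object, as in the example above:

import os
import bentoml

# Hypothetical flag controlling the build (illustration only)
DEBUG_BUILD = os.environ.get("DEBUG_BUILD") == "1"

my_image = bentoml.images.PythonImage(python_version='3.11') \
    .python_packages("torch", "transformers")

if DEBUG_BUILD:
    # Pull in an extra debugging tool only when DEBUG_BUILD is set
    my_image = my_image.python_packages("ipdb")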

Another benefit of this approach is caching. If the image definition doesn’t change, BentoML intelligently caches the layers, avoiding unnecessary rebuilds. Plus, we've drastically improved the image pulling speed — what used to take minutes now happens in seconds.

But that’s not all. The new API gives you fine-grained control over every aspect of your runtime environment. You can specify parameters like the operating system distro and system packages. Here are more available methods and parameters:

# Specify distro, system packages, use a requirements.txt file
my_image = bentoml.images.PythonImage(python_version='3.11', distro='alpine') \
    .system_packages("curl", "git") \
    .requirements_file("requirements.txt")

One of the most powerful features of the new API is the ability to run custom commands during the build process. The run() method can be chained together with other configuration options, giving you total control over your build pipeline:

import bentoml

image = bentoml.images.PythonImage(python_version='3.11') \
    .run('echo "Starting build process..."') \
    .system_packages("curl", "git") \
    .run('echo "System packages installed"') \
    .python_packages("pillow", "fastapi") \
    .run('echo "Python packages installed"')

run() is context-sensitive, meaning commands execute in the correct order. For instance, commands placed before .python_packages() will run before the Python dependencies are installed.
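As a practical illustration of this ordering, the hedged sketch below fetches a file during the build before the Python dependencies are installed; the URL and target path are placeholders:

import bentoml

# Hypothetical build step: download an asset with curl before installing
# Python packages. The URL and file path are placeholders.
image = bentoml.images.PythonImage(python_version='3.11') \
    .system_packages("curl") \
    .run("curl -L -o /tmp/labels.txt https://example.com/labels.txt") \
    .python_packages("pillow", "fastapi")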

Note: If you prefer the old approach, we still fully support configuration through bentofile.yaml and pyproject.toml.

For details, check out the full documentation.

Lightning-fast model loading

If you've ever worked with large models from repositories like Hugging Face (HF), you're all too familiar with the pain of waiting for them to load. Whether you're building a new image or scaling up your deployment, those precious minutes spent waiting can feel like hours. That's why we introduced significant improvements to the model loading mechanism in BentoML 1.4.

In this release, BentoML accelerates model loading in two ways:

  • Build-time downloads: BentoML now downloads models during image building, not at Service startup. These models are then cached and mounted directly into containers. This means they’re readily available as soon as your Service starts, reducing cold start time and improving scaling performance.
  • Parallel loading with safetensors: Instead of loading model weights one part at a time, BentoML now loads multiple parts of the model simultaneously. This is especially beneficial for large models, where sequential loading would cause significant delays.

Here’s an example of loading HF models with the new HuggingFaceModel API. By default, it returns the downloaded model path as a string, which you can pass directly to libraries like transformers:

import bentoml
from bentoml.models import HuggingFaceModel
from transformers import AutoModelForSequenceClassification, AutoTokenizer

@bentoml.service(resources={"cpu": "200m", "memory": "512Mi"})
class MyService:
    # Specify a model from HF with its ID
    model_path = HuggingFaceModel("google-bert/bert-base-uncased")

    def __init__(self):
        # Load the actual model and tokenizer within the instance context
        self.model = AutoModelForSequenceClassification.from_pretrained(self.model_path)
        self.tokenizer = AutoTokenizer.from_pretrained(self.model_path)

For models hosted in private HF repositories, simply specify your custom endpoint:

model_path = HuggingFaceModel("your_model_id", endpoint="https://my.huggingface.co/")

For models from sources other than HF, use BentoModel:

import bentoml
from bentoml.models import BentoModel
import joblib

@bentoml.service(resources={"cpu": "200m", "memory": "512Mi"})
class MyService:
    # Define the model reference at the class level
    # Load a model from the Model Store or BentoCloud
    iris_ref = BentoModel("iris_sklearn:latest")

    def __init__(self):
        self.iris_model = joblib.load(self.iris_ref.path_of("model.pkl"))

Note: Always define your model references in the class scope of your Service. This ensures they're properly tracked as dependencies and available when your Service is deployed.
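To make the note above concrete, here is an illustrative sketch contrasting the two placements; the class names are hypothetical:

import bentoml
from bentoml.models import BentoModel

@bentoml.service
class TrackedService:
    # Class scope: the reference is tracked as a dependency when the Bento is built
    iris_ref = BentoModel("iris_sklearn:latest")

@bentoml.service
class UntrackedService:
    def __init__(self):
        # Avoid this: the reference is only created at runtime,
        # so it is not tracked as a dependency of the Service
        self.iris_ref = BentoModel("iris_sklearn:latest")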

For more information, see model loading and management.

Call any AI service with external deployment dependencies

The ability to run distributed Services has always been a key feature in BentoML. From orchestrating CPU and GPU processing to configuring multi-model pipelines, BentoML makes it easy to connect Services, optimize resource utilization, and scale models efficiently.

The core way to set up these interactions is through bentoml.depends(). It allows one Service to call another Service’s methods as if they were local calls. BentoML abstracts away the complexities of network communication, serialization, and deserialization, so you don’t need to worry about how data travels between Services.

Previously, you could set up dependencies only on internal BentoML Services:

# ServiceB is another BentoML Service defined in the same project
@bentoml.service
class ServiceA:
    service_b = bentoml.depends(ServiceB)

    @bentoml.api
    def predict(self, input: np.ndarray) -> int:
        # Call the predict function from ServiceB
        return int(self.service_b.predict(input)[0][0])

This works great for local Services, but what about calling AI services already deployed on BentoCloud or running on other servers? We heard your need for more flexibility, and now we're extending the power of bentoml.depends() to reach beyond your local environment.

With BentoML 1.4, you can easily depend on any deployed model, whether it’s hosted on BentoCloud or running on your own infrastructure. Here's how simple it is:

import bentoml
import numpy as np

@bentoml.service
class MyService:
    # Specify `cluster` only if your Deployment is in a non-default cluster
    iris = bentoml.depends(deployment="iris-classifier-x6dewa", cluster="my_cluster_name")

    # Call the model deployed on BentoCloud by specifying its URL
    # iris = bentoml.depends(url="https://iris.example-url.bentoml.ai")

    # Call the model served elsewhere
    # iris = bentoml.depends(url="http://192.168.1.1:3000")

    @bentoml.api
    def predict(self, input: np.ndarray) -> int:
        # Call the predict function from the remote Deployment
        return int(self.iris.predict(input)[0][0])

The syntax stays exactly the same whether you're calling a local Service or a remote Deployment. All the complexity of network communication, serialization, and deserialization is handled automatically by BentoML.

For details, check out the BentoML documentation.

Conclusion

BentoML 1.4 brings significant enhancements to your AI development and deployment process. From accelerating your iteration cycle to simplifying runtime configurations, each feature was designed to address the pain points you face daily.

As always, BentoML is built with your needs in mind, and we can’t wait to hear how these new features help you bring your AI applications to life!

Check out the following resources to learn more and stay connected: