November 9, 2023 • Written By Eric Liu
Note: This blog post is converted from Eric Liu’s speech at KubeCon China 2023.
Back in 2015 and 2016, Chaoyu Yang, CEO and co-founder of BentoML, was an early Databricks engineer and the first part-time PM of the MLflow project. While assisting clients such as Riot Games and Capital One, he recognized the challenges of taking machine learning models to production. The process was fraught with friction because it involved so many personas: data engineers, data scientists, ML engineers, DevOps engineers, and product managers.
For example, before data scientists could train models, data engineers might need to clean the data; after training, data scientists had to ask ML engineers to turn the models into callable web service APIs. The ML engineers, in turn, needed to work with DevOps engineers to make sure the service was reliable and scalable. Meanwhile, the PM wanted access to the data produced by the machine learning services.
The evolution of machine learning wasn't just about the code; it also involved data and models, making the CI/CD process complex. This led to the emergence of a new concept back then: MLOps. It was designed to ensure that machine learning assets are treated like other software assets, such as code within a CI/CD environment. This required making ML assets versionable and testable, with an automated and reproducible process in place to leverage them.
When we started BentoML, we interviewed different users deploying various solutions. Each company had its own implementation, which we categorized based on two dimensions: ease of use and flexibility.
In terms of ease of use, the most direct solutions come from ML training frameworks. For example, TensorFlow offers TensorFlow Serving, PyTorch comes with TorchServe, and NVIDIA provides the Triton Inference Server. The key advantage of such off-the-shelf solutions is their simplicity, allowing data scientists to get started quickly for simple use cases. Specialized runtimes also ensure low-latency serving, especially when tailored to specific frameworks.
However, as ML organizations expand, the limitations of these solutions become evident:
Our investigation revealed a trend: many technology-driven companies with strong engineering teams develop their own in-house solutions. Such solutions offer high flexibility, adapting not just to different frameworks but also aligning with the company’s specific infrastructure. Nevertheless, not every enterprise, especially non-tech ones, has the required engineering resources or expertise. Building such infrastructure can take 6 to 9 months, which means lost opportunities and revenue.
Moreover, many ML organizations begin with data teams comprising data engineers and data scientists. Due to the skill gap, most data scientists, with their backgrounds in statistics or mathematics, need considerable time to acquire the necessary DevOps or engineering expertise.
Our solution emerged from these insights. At that time, we envisioned a framework balancing flexibility with user-friendliness.
BentoML, our open-source AI application framework, is the realization of this vision with the following highlighted features.
Today, with over 3,000 community members, BentoML serves billions of predictions daily, empowering over 1,000 organizations in production. Here's what our users share:
"BentoML enables us to deliver business value quickly by allowing us to deploy ML models to our existing infrastructure and scale the model services easily." - Shihgian Lee, Senior Machine Learning Engineer, Porch
"We achieved big speedups migrating from FastAPI to BentoML recently, 5-10x with some very basic tuning." - Axel Goblet, Machine Learning Engineer, BigData Republic
"BentoML 1.0 is helping us future-proof our ML deployment infrastructure at Mission Lane. It is enabling us to rapidly develop and test our model scoring services, and to seamlessly deploy them into our dev, staging, and production Kubernetes clusters." - Mike Kuhlen, Data Science & ML Lead, Mission Lane
When expanding into the Asian market, particularly with the open-source community, having a champion in the region is vital. A prime example is Sungjun Kim, a software engineer from LINE. In 2019, Mr. Kim came across BentoML and began implementing it to build what they called the "ML universe" within LINE. A year later, BentoML had been integrated into at least three use cases (shopping-related search, content recommendations, and user targeting) within the LINE application and its associated organizations. Later on, we found that another individual, Woongkyu Lee from LINE (though not from the same division as Mr. Kim, given LINE's vast organizational structure) also adopted BentoML with their machine learning team, mainly to calculate credit scores.
The popularity of BentoML continued to grow, particularly in South Korea, attracting a large number of internet companies, including the tech giant, NAVER.
Founded in 1999, NAVER stands as South Korea's largest search engine and is the country's most frequented website. Globally, NAVER boasts over 6,000 employees, with over half of them based in its headquarters in Seongnam, South Korea. In 2020, NAVER generated revenues surpassing 5.5 trillion Korean won (approximately US$4.8 billion), making it the largest internet company in Korea by market capitalization.
At NAVER, each team selects their own framework based on their specific needs. SoungRyoul Kim, an MLOps engineer at the AI Serving Dev team of NAVER, shares their story of how and why they use BentoML.
Kim’s team selects BentoML for the following reasons:
Simplicity. Kim’s team follows the standard BentoML workflow to build a Bento and containerize it into a Docker image. The process only takes a handful of commands, such as bentoml build and bentoml containerize. After that, they can easily retrieve the Docker image and deploy their AI application anywhere.
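As a sketch of that workflow, assuming a project directory with a bentofile.yaml and a saved model (the Bento tag my_service:latest is a placeholder, not from the talk):

```shell
# Package the service code, dependencies, and model references into a Bento
bentoml build

# Turn the newly built Bento into a Docker image
bentoml containerize my_service:latest

# The image can then run anywhere Docker is available
# (3000 is BentoML's default serving port)
docker run --rm -p 3000:3000 my_service:latest
```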
Inference flexibility. BentoML Runners can work in either distributed or embedded mode.
When they create a Runner in embedded mode via .to_runner(embedded=True), the API Server and the Runner run within the same process, reducing communication overhead and achieving better performance. See Embedded Runners for more information.
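The difference between the two modes can be illustrated with a small pure-Python sketch. This is an analogy for the concept only, not BentoML's actual Runner implementation; the class names and the predict function are invented for illustration, and a thread with queues stands in for a remote Runner process.

```python
import queue
import threading

def predict(x):
    # Stand-in for model inference: double every value
    return [v * 2 for v in x]

class EmbeddedRunner:
    """Embedded mode: inference runs in the caller's process as a plain call."""
    def run(self, x):
        return predict(x)

class DistributedRunner:
    """Distributed mode (sketch): requests cross a process boundary to a
    worker, modeled here with queues and a background thread."""
    def __init__(self):
        self.requests, self.replies = queue.Queue(), queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        while True:
            x = self.requests.get()
            self.replies.put(predict(x))

    def run(self, x):
        self.requests.put(x)        # serialize and send in a real deployment
        return self.replies.get()   # wait for the reply: communication overhead

print(EmbeddedRunner().run([1, 2, 3]))  # [2, 4, 6]
```

Both modes produce the same result; embedded mode simply avoids the round trip, which is why it can lower latency for co-located workloads.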
One of their use cases runs distributed Runners for online serving. In this scenario, Kim’s team spawns two identical Runners and distributes inference requests across both. “In a Kubernetes environment where each Runner is deployed independently, this approach can efficiently improve the performance of the model server, especially useful for inference requests with a large batch size,” Kim explains.
The image below shows a simple example of using two distributed Runners, each handling 100 rows when the batch size is 200. To further reduce latency, you can spawn more Runners to handle inference requests. “When you use this strategy, there are only a few code changes,” Kim says.
For Kim’s team, BentoML is more than a unified model serving framework. It also allows them to implement preprocessing logic. In practical ML serving scenarios, incoming inference requests often don't contain data in an ideal format. The raw data is rarely a neatly packaged tensor ready for the model. This means you often need to preprocess the data before it can be fed into the model for inference.
To support this, BentoML provides an easy way to implement preprocessing logic in the Service with Python. “We need to add preprocessing logic and Feature Store connections. In this case, you may need FastAPI and other server tools with preprocessing logic, but BentoML supports this,” Kim says.
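As a minimal illustration of the pattern in plain Python (the field names and the model function are invented for the example; in a BentoML Service the composition below would live inside an API function, with predict backed by a Runner):

```python
def preprocess(raw: dict) -> list:
    """Turn a raw JSON-like request into the feature vector a model expects."""
    # Coerce free-form fields and fill defaults before inference
    return [
        float(raw.get("age", 0)),
        1.0 if str(raw.get("country", "")).upper() == "KR" else 0.0,
    ]

def predict(features: list) -> float:
    # Stand-in for the actual model call (e.g. a Runner's run method)
    return sum(features)

def serve(raw_request: dict) -> float:
    # Preprocessing and inference composed in one serving endpoint
    return predict(preprocess(raw_request))

print(serve({"age": "42", "country": "kr"}))  # 43.0
```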
We believe that an ideal solution for production model serving should have optimized runtimes and hardware scheduling, which allows for independent and automatic scaling. This is why we created the Runner architecture.
Reflecting on BentoML's journey from its inception to its current standing is a testament to the power of community-driven development and the necessity for a robust, flexible ML serving solution. Today, we are glad to see significant contributions from adopters like LINE and NAVER who not only utilize the framework but also enrich it.
This year, we've continued to push the boundaries of what the BentoML ecosystem can do. With OpenLLM, users can serve and deploy any large language model with ease and efficiency. OneDiffusion allows us to tackle the challenges of deploying Stable Diffusion models in production. Moreover, the launch of BentoCloud marks a significant milestone, offering a serverless platform tailored for AI application production deployment.
We invite practitioners, enthusiasts, and organizations to join our thriving community, shaping the future of MLOps. Let's build, share, and innovate together.
To learn more about BentoML and its ecosystem tools, check out the following resources: