Dec 8, 2022 • Written By Sean Sheng
The value of an ML model is not realized until it is deployed and served in production. Building an ML application is more challenging than building a traditional application because models and data add complexity on top of the application code. Using a web serving framework (e.g. FastAPI) can work for simple cases but falls short on performance and efficiency. Alternatively, a pre-packaged model server (e.g. Triton Inference Server) can be ideal for low-latency serving and resource utilization, but it lacks flexibility in defining custom logic and dependencies. BentoML abstracts away these complexities by creating separate runtimes for IO-intensive preprocessing logic and compute-intensive model inference logic. At the same time, BentoML offers an intuitive and flexible Python-first SDK for defining custom preprocessing logic, orchestrating multi-model inference, and integrating with other frameworks in the MLOps ecosystem.
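To make that separation concrete, here is a minimal sketch of a BentoML service in the 1.0-style Python SDK: preprocessing runs in the API server process, while inference is dispatched to a runner running in its own runtime. The model tag `iris_clf:latest` and the scikit-learn framework are assumptions for illustration only.

```python
import numpy as np
import bentoml
from bentoml.io import NumpyNdarray

# Load a previously saved model (hypothetical tag) and wrap it in a runner,
# the separate runtime BentoML uses for compute-intensive inference.
iris_runner = bentoml.sklearn.get("iris_clf:latest").to_runner()

svc = bentoml.Service("iris_classifier", runners=[iris_runner])

@svc.api(input=NumpyNdarray(), output=NumpyNdarray())
async def classify(features: np.ndarray) -> np.ndarray:
    # IO-intensive preprocessing stays in the API server process...
    features = features.astype("float64")
    # ...while inference runs in the runner's dedicated runtime.
    return await iris_runner.predict.async_run(features)
```

Because the API function is plain Python, custom preprocessing or calls to additional runners can be composed in the same place, which is how multi-model inference is orchestrated.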