February 27, 2025 • Written By Chaoyu Yang
Machine learning projects involve many moving parts, from experimentation to production deployment. Two tools that work wonderfully together to streamline this process are MLflow and BentoML. In this tutorial, we'll demonstrate how to use MLflow for experiment tracking and BentoML for model serving and production deployment.
Specifically, you'll learn to:

- Train a model and track the experiment with MLflow
- Import the logged model into the BentoML Model Store
- Serve the model as a REST API with input validation and adaptive batching
- Package and deploy the service to production
You can find all the source code in the BentoMLflow repository.
Let’s get started!
Install the necessary packages:
pip install bentoml mlflow scikit-learn
Note: While we use scikit-learn for demo purposes, both MLflow and BentoML support a wide variety of frameworks, such as PyTorch, TensorFlow and XGBoost.
Start your MLflow tracking server:
mlflow server --host 127.0.0.1 --port 8080
This server will track our experiments and store our model artifacts.
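If you want a quick sanity check that the server is running before moving on, the MLflow tracking server exposes a simple health endpoint (this assumes the default server setup above):

curl http://127.0.0.1:8080/health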
Let's train a simple classification model using the Iris dataset and log the results with MLflow:
import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
Next, log the model parameters and metrics, and save the model artifact to MLflow:
from datetime import date

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow+BentoML Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the accuracy metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="iris_demo",
    )
    model_uri = mlflow.get_artifact_uri("iris_model")
At this point, MLflow has:

- Logged the hyperparameters and the accuracy metric for this run
- Stored the model artifact together with its inferred signature and input example
- Registered the model under the name iris_demo
You can view all the information in the MLflow UI by visiting http://127.0.0.1:8080.
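Before handing the model to BentoML, you can optionally load it back from the tracking server to confirm the artifact was logged correctly. A minimal sketch that reuses the model_uri and X_test variables from the snippets above:

import mlflow.sklearn

# Reuses `model_uri` and `X_test` from the earlier snippets; a quick check
# that the logged artifact round-trips and produces predictions
loaded_model = mlflow.sklearn.load_model(model_uri)
print(loaded_model.predict(X_test[:3]))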
Once you're happy with its performance, import the model into the BentoML Model Store for deployment.
import bentoml

bento_model = bentoml.mlflow.import_model(
    'iris',
    model_uri=model_uri,
    labels={
        "team": "bento",
        "stage": "dev",
        "accuracy": accuracy,
        "training_date": str(date.today())
    }
)
Note that:

- bentoml.mlflow.import_model downloads the MLflow model artifact and stores a copy in the local BentoML Model Store, so it can be served without reaching back to the tracking server.
- The first argument ('iris') is the name the model will have in the Model Store.
- labels are optional key-value metadata, useful for recording things like the owning team, stage, and training date.
Verify the model is saved to the Model Store:
$ bentoml models list

Tag                    Module          Size       Creation Time
iris:hu5d7xxs3oxmnuqj  bentoml.mlflow  11.75 KiB  2025-02-24 10:14:51
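To inspect the labels and metadata attached during import, you can also query a single model with the CLI (assuming the models get subcommand available in recent BentoML releases):

bentoml models get iris:latest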
You can test loading the model from the Model Store:
import numpy as np
import bentoml

# Load the latest version of iris model:
iris_model = bentoml.mlflow.load_model("iris:latest")

# Alternatively, load the model by specifying the model tag
# iris_model = bentoml.mlflow.load_model("iris:hu5d7xxs3oxmnuqj")

input_data = np.array([[5.9, 3, 5.1, 1.8]])
res = iris_model.predict(input_data)
print(res)
Now that the model is ready, create a BentoML Service to serve it. By convention, you define a file called service.py to implement the model serving logic.
import bentoml
import numpy as np
from bentoml.models import BentoModel

# Define the runtime environment for your Bento
demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    # Declare the model as a class attribute
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Define an API endpoint
    @bentoml.api
    def predict(self, input_data: np.ndarray) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
Note that:

- PythonImage is used to define the runtime environment for a Bento, the unified distribution format in BentoML. You can customize the build by setting the required Python version, dependencies, run commands, and more.
- @bentoml.service marks a Python class as a BentoML Service. It allows you to specify configurations like request timeouts and resource requirements.

Serve the model using the BentoML CLI:
$ bentoml serve service.py:IrisClassifier

[INFO] [cli] Starting production HTTP BentoServer from "service:IrisClassifier" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:IrisClassifier:1] Service IrisClassifier initialized
The model is now running at http://localhost:3000. Query the endpoint:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 0.1, 0.2, 0.1, 0.1 ]]
    }'

# ["setosa"]
Alternatively, use the BentoML Python client:
import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[5.9, 3, 5.1, 1.8]]))

# ['virginica']
A common problem is handling unexpected data formats or types from clients. For example, if a client sends integer values instead of floats:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 5,3,5,2 ]]
    }'
This would result in an error:
# client side error:
# {"error":"An unexpected error has occurred, please check the server log."}

# server side log:
# mlflow.exceptions.MlflowException: Failed to enforce schema of data '[[5 3 5 2]]' with schema '[Tensor('float64', (-1, 4))]'. Error: dtype of input int64 does not match expected dtype float64
Using the BentoML Python client with an explicit float dtype solves this:
import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[1, 1, 1, 1]], dtype='float64'))
However, this poses challenges when integrating ML services with downstream services. To further help with input validation, BentoML extends Pydantic to handle common ML data types (e.g., images, text streams, floats). You can define a strict schema in your BentoML Service:
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enforce and validate input schemas for the API
    @bentoml.api
    def predict(
        self,
        input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]]),
    ) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
Now, integer input is automatically validated (and converted where possible). You can try it with a generic HTTP client:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 5,3,5,2 ]]
    }'

# ["virginica"]
You can also inspect the OpenAPI documentation to see the required schema for your service:
curl localhost:3000/docs.json
This returns a JSON schema that describes the input and output formats of the API.
"paths": { ... "/predict": { "post": { "responses": { "200": { "description": "Successful Response", "content": { "application/json": { "schema": { "type": "array", "items": { "type": "number" } } } } }, "400": { "description": "Bad Request", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InvalidArgument" } } } }, "500": { "description": "Internal Server Error", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InternalServerError" } } } } }, "requestBody": { "content": { "application/json": { "schema": { "type": "object", "title": "Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "items": { "items": { "type": "number" }, "type": "array" }, "title": "Input Data", "type": "array" } } } } } }, "operationId": "IrisClassifier__predict" } } }, ... "components": { "schemas": { "predict__Input": { "type": "object", "title": "predict__Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "dim": -4, "dtype": "float64", "format": "numpy-array", "shape": [ -1, 4 ], "title": "Input Data", "type": "tensor" } } }, ...
BentoML can optimize performance through adaptive batching, which combines multiple individual requests into a single batch for more efficient processing.
Let's update our Service to support batching:
import bentoml
import numpy as np
from bentoml.models import BentoModel

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enable adaptive batching
    @bentoml.api(batchable=True)
    def predict(self, input_data: np.ndarray) -> list[str]:
        print(f"batch_size: {len(input_data)}")
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
You can test it with a script that simulates multiple concurrent clients:
import random
import time
from concurrent.futures import ThreadPoolExecutor

import bentoml
import numpy as np
from sklearn.datasets import load_iris

CONCURRENCY = 20       # Number of threads (concurrent requests)
TOTAL_REQUESTS = 1000  # Total number of requests to send

client = bentoml.SyncHTTPClient("http://localhost:3000")

iris = load_iris()
data_samples = iris.data.tolist()
payloads = [random.choice(data_samples) for _ in range(TOTAL_REQUESTS)]

def send_request(index, data):
    """Send a single HTTP request and print the result."""
    try:
        start_time = time.time()
        response = client.predict(np.array([data]))
        duration = time.time() - start_time
        print(f"Request {index}: {response} ({duration:.3f}s)")
    except Exception as e:
        print(f"Request {index}: Error -> {e}")

print(f"Sending {TOTAL_REQUESTS} requests to {client.url} with concurrency {CONCURRENCY}...")

with ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    for i, data in enumerate(payloads, start=1):
        executor.submit(send_request, i, data)

print("Done.")
Although each client sends a single data point, you'll notice from the server logs that BentoML dynamically batches multiple requests together. This improves throughput and increases computational efficiency.
You can also monitor the batch size metrics at http://localhost:3000/metrics. Here are some example metrics after running the above script:
# HELP bentoml_service_adaptive_batch_size Service adaptive batch size
# TYPE bentoml_service_adaptive_batch_size histogram
bentoml_service_adaptive_batch_size_sum{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 1000.0
bentoml_service_adaptive_batch_size_bucket{le="1.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 66.0
bentoml_service_adaptive_batch_size_bucket{le="2.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 111.0
bentoml_service_adaptive_batch_size_bucket{le="4.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 171.0
bentoml_service_adaptive_batch_size_bucket{le="8.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 227.0
bentoml_service_adaptive_batch_size_bucket{le="16.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 249.0
bentoml_service_adaptive_batch_size_bucket{le="32.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="64.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="100.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="+Inf",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_count{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
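If you need tighter control over how requests are grouped, the same decorator accepts additional batching options. As a hedged sketch (the values below are illustrative, not recommendations; check the adaptive batching docs for the options supported by your BentoML version), you could replace the predict method in the Service above with:

    # Illustrative settings only; tune max_batch_size and max_latency_ms for your workload
    @bentoml.api(
        batchable=True,
        batch_dim=0,          # dimension along which requests are concatenated
        max_batch_size=32,    # upper bound on the size of an assembled batch
        max_latency_ms=5000,  # latency budget the batcher tries to respect
    )
    def predict(self, input_data: np.ndarray) -> list[str]:
        print(f"batch_size: {len(input_data)}")
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]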
For larger teams collaborating on multiple models and projects, BentoML provides tools to standardize ML service development.
First, create a common.py file that defines shared components:
# common.py
import bentoml
import numpy as np
import numpy.typing as npt
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

my_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

class MyInputParams(bentoml.IODescriptor):
    input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]])
    client_id: str
Then, use these components in your Service:
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    @bentoml.api(input_spec=MyInputParams)
    def predict(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        print(f"processing request from user {client_id}")
        rv = self.model.predict(input_data)
        return np.asarray(rv)
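A client then supplies both fields defined in MyInputParams. A minimal sketch, assuming the Service above is running locally on port 3000:

import bentoml
import numpy as np

# Call the endpoint with the fields declared in MyInputParams;
# `client_id` here is just an example value
client = bentoml.SyncHTTPClient("http://localhost:3000")
result = client.predict(
    input_data=np.array([[5.9, 3.0, 5.1, 1.8]]),
    client_id="demo-user",
)
print(result)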
BentoML makes it easy to serve multiple models in a single Service (or distributed Services).
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model_1 = BentoModel("iris:v1")
    bento_model_2 = BentoModel("iris:v2")

    def __init__(self):
        self.model_1 = bentoml.mlflow.load_model(self.bento_model_1)
        self.model_2 = bentoml.mlflow.load_model(self.bento_model_2)

    @bentoml.api(route="/v1/predict", input_spec=MyInputParams)
    def predict_1(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_1.predict(input_data)
        return np.asarray(rv)

    @bentoml.api(route="/v2/predict", input_spec=MyInputParams)
    def predict_2(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_2.predict(input_data)
        return np.asarray(rv)

    # Combine predictions
    @bentoml.api(input_spec=MyInputParams)
    def predict_combined(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv_a = self.model_1.predict(input_data)
        rv_b = self.model_2.predict(input_data)
        return np.asarray([rv_a, rv_b])
This approach allows you to:

- Serve multiple model versions side by side under separate routes (/v1/predict and /v2/predict)
- Compare or combine predictions from different models behind a single endpoint
- Reuse shared input specifications and runtime images across Services
For more information, see the BentoML documentation about multi-model composition.
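If you later want to split the models into separate Services, for example to scale them independently, composition follows a similar pattern. A rough sketch, assuming bentoml.depends for inter-Service calls (the class names here are illustrative):

import bentoml
import numpy as np
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(image=my_image)
class IrisModelV1:
    bento_model = BentoModel("iris:v1")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    @bentoml.api
    def predict(self, input_data: np.ndarray) -> np.ndarray:
        return np.asarray(self.model.predict(input_data))

@bentoml.service(image=my_image)
class Gateway:
    # Declare a dependency so the Gateway can call IrisModelV1,
    # whether it runs in the same process or as a separate deployment
    model_v1 = bentoml.depends(IrisModelV1)

    @bentoml.api(input_spec=MyInputParams)
    def predict(self, input_data, client_id) -> np.ndarray:
        return self.model_v1.predict(input_data)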
BentoML provides multiple options for production deployment:
Containerization: Build an OCI-compliant image for your ML service for deployment on any container platform:
bentoml build
bentoml containerize iris_classifier:latest
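Once built, the image can be run with any OCI-compatible runtime. A hedged example, substituting the actual tag printed by bentoml containerize:

docker run --rm -p 3000:3000 iris_classifier:latest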
Refer to the containerization guide for more details.
BentoCloud: Sign up for BentoCloud and deploy directly to the unified inference platform for easy management, fast autoscaling, enterprise-grade security, and comprehensive observability:
bentoml deploy
Refer to the cloud deployment guide for more details.
In this tutorial, we've seen how MLflow and BentoML work together to create a seamless workflow from experimentation to production:

- MLflow tracks experiments and logs model parameters, metrics, and artifacts.
- BentoML imports the tracked model, serves it behind a validated API with adaptive batching, and packages it for deployment via containers or BentoCloud.
The integration allows data scientists to focus on model development while ensuring their models can be reliably deployed to production. Check out the following to learn more: