From Experiments to Production: Building ML Pipelines with MLflow and BentoML

February 27, 2025 • Written By Chaoyu Yang

Machine learning projects involve many moving parts, from experimentation to production deployment. Two tools that work wonderfully together to streamline this process are MLflow and BentoML. In this tutorial, we'll demonstrate how to use MLflow for experiment tracking and BentoML for model serving and production deployment.

Specifically, you’ll learn to:

  • Log and track your models and experiments with MLflow.
  • Save those models into BentoML for unified management and serving.
  • Turn an MLflow model into a production-ready API.
  • Enforce input data validation and leverage adaptive batching with BentoML.
  • Standardize deployment workflows for large teams.

You can find all the source code in the BentoMLflow repository.

Let’s get started!

Setting Up the Environment

Install the necessary packages:

pip install bentoml mlflow scikit-learn

Note: While we use scikit-learn for demo purposes, both MLflow and BentoML support a wide variety of frameworks, such as PyTorch, TensorFlow and XGBoost.

Start your MLflow tracking server:

mlflow server --host 127.0.0.1 --port 8080

This server will track our experiments and store our model artifacts.

Training a Model with MLflow

Let's train a simple classification model using the Iris dataset and log the results with MLflow:

import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)

Tracking Experiments with MLflow

Here, we log model parameters, track metrics, and save the model artifact to MLflow.

from datetime import date

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow+BentoML Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the accuracy metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="iris_demo",
    )

    model_uri = mlflow.get_artifact_uri("iris_model")

At this point, MLflow has:

  • Recorded the model hyperparameters
  • Tracked the model accuracy
  • Saved the model artifacts
  • Created a signature that defines the model's input and output formats

You can view all the information in the MLflow UI by visiting http://127.0.0.1:8080.
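
If you want to double-check the logged artifact before moving on, one option (not part of the original workflow, just a quick sanity check) is to load it back from the tracking server via the model_info object returned by mlflow.sklearn.log_model above:

import mlflow

# Load the model we just logged, straight from the MLflow tracking server
loaded = mlflow.sklearn.load_model(model_info.model_uri)

# The reloaded model should reproduce the accuracy we computed earlier
print(loaded.score(X_test, y_test))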

Saving and Versioning the Model with BentoML

Once you’re happy with the performance, register the model into the BentoML Model Store for deployment.

import bentoml

bento_model = bentoml.mlflow.import_model(
    'iris',
    model_uri=model_uri,
    labels={
        "team": "bento",
        "stage": "dev",
        "accuracy": accuracy,
        "training_date": str(date.today())
    }
)

Note that:

  • The MLflow Model Registry is designed for tracking and storing model artifacts created during experimentation. It's helpful for selecting and comparing models, reproducing training runs, or continuously training models.
  • The BentoML Model Store manages approved models for application development and production deployment. It focuses on streamlining the productionization workflow and improving the efficiency of model distribution and loading. The Model Store also supports versioning and links models to their MLflow experimentation metadata.
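
For example, because the labels attached at import time capture the MLflow run context (accuracy, training date), they can be read back from the Model Store later. A minimal sketch, assuming the import_model call above has already run:

import bentoml

# Look up the imported model and inspect the labels attached at import time
model_ref = bentoml.models.get("iris:latest")
print(model_ref.tag)
print(model_ref.info.labels)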

Verify the model is saved to the Model Store:

$ bentoml models list

Tag                      Module           Size        Creation Time
iris:hu5d7xxs3oxmnuqj    bentoml.mlflow   11.75 KiB   2025-02-24 10:14:51

You can test loading the model from the Model Store:

import numpy as np
import bentoml

# Load the latest version of the iris model:
iris_model = bentoml.mlflow.load_model("iris:latest")

# Alternatively, load the model by specifying the model tag
# iris_model = bentoml.mlflow.load_model("iris:hu5d7xxs3oxmnuqj")

input_data = np.array([[5.9, 3, 5.1, 1.8]])
res = iris_model.predict(input_data)
print(res)

Serving the MLflow Model with BentoML

Now that the model is ready, create a BentoML Service to serve it. By convention, you define a file called service.py to implement the model serving logic.

import bentoml
import numpy as np
from bentoml.models import BentoModel

# Define the runtime environment for your Bento
demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    # Declare the model as a class attribute
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Define an API endpoint
    @bentoml.api
    def predict(self, input_data: np.ndarray) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]

Note that:

  • PythonImage is used to define the runtime environment for a Bento, the unified distribution format in BentoML. You can customize the build by setting the required Python version, dependencies, run commands, and more (see the sketch after this list).
  • @bentoml.service marks a Python class as a BentoML Service. It allows you to specify configurations like request timeouts and resource requirements.
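
As an illustration of that customization, here is a hypothetical build definition. The helper methods shown (system_packages, requirements_file, run) are assumptions based on the PythonImage API and may differ slightly depending on your BentoML version:

import bentoml

# Hypothetical, more customized runtime image (adjust to your own project)
custom_image = (
    bentoml.images.PythonImage(python_version="3.11")
    .system_packages("curl")                 # OS-level dependency
    .requirements_file("requirements.txt")   # pinned Python dependencies
    .run("echo 'extra build step'")          # arbitrary command during the image build
)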

Serve the model using the BentoML CLI:

$ bentoml serve service.py:IrisClassifier

[INFO] [cli] Starting production HTTP BentoServer from "service:IrisClassifier" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:IrisClassifier:1] Service IrisClassifier initialized

The model is now running at http://localhost:3000. Query the endpoint:

curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[0.1, 0.2, 0.1, 0.1]]
    }'

# ["setosa"]

Alternatively, use the BentoML Python client:

import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[5.9, 3, 5.1, 1.8]]))

# ['virginica']

Validating Input Data with BentoML

A common problem is handling unexpected data formats or types from clients. For example, a client may send integer values instead of floats:

curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[5, 3, 5, 2]]
    }'

This would result in an error:

# client side error:
# {"error":"An unexpected error has occurred, please check the server log."}

# server side log:
# mlflow.exceptions.MlflowException: Failed to enforce schema of data '[[5 3 5 2]]' with schema '[Tensor('float64', (-1, 4))]'. Error: dtype of input int64 does not match expected dtype float64

Using the BentoML Python client with an explicit float dtype works around this:

import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[1, 1, 1, 1]], dtype='float64'))

However, this poses challenges when integrating ML services with downstream services. To further help with input validation, BentoML extends Pydantic to handle common ML data types (e.g., images, text streams, floats). You can define a strict schema in your BentoML Service:

import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enforce and validate input schemas for the API
    @bentoml.api
    def predict(
        self,
        input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]])
    ) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]

Now, integer input is automatically validated (and converted where possible). You can try it with a generic HTTP client:

curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[5, 3, 5, 2]]
    }'

# ["virginica"]
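
Inputs that cannot be coerced into the declared shape should now be rejected before they ever reach MLflow. As a rough sketch (using the requests library, which is not part of the original tutorial), a row with only three features would be expected to trigger a client-side validation error rather than an opaque internal server error:

import requests

# Only 3 features instead of the 4 required by Shape((-1, 4))
resp = requests.post(
    "http://localhost:3000/predict",
    json={"input_data": [[5, 3, 5]]},
)

# Expect a validation failure reported to the client (typically a 400 response)
print(resp.status_code)
print(resp.json())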

You can also inspect the OpenAPI documentation to see the required schema for your service:

curl localhost:3000/docs.json

This returns a JSON schema that describes the input and output formats of the API.

"paths": { ... "/predict": { "post": { "responses": { "200": { "description": "Successful Response", "content": { "application/json": { "schema": { "type": "array", "items": { "type": "number" } } } } }, "400": { "description": "Bad Request", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InvalidArgument" } } } }, "500": { "description": "Internal Server Error", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InternalServerError" } } } } }, "requestBody": { "content": { "application/json": { "schema": { "type": "object", "title": "Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "items": { "items": { "type": "number" }, "type": "array" }, "title": "Input Data", "type": "array" } } } } } }, "operationId": "IrisClassifier__predict" } } }, ... "components": { "schemas": { "predict__Input": { "type": "object", "title": "predict__Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "dim": -4, "dtype": "float64", "format": "numpy-array", "shape": [ -1, 4 ], "title": "Input Data", "type": "tensor" } } }, ...

Advanced Use Case: Enable Adaptive Batching

BentoML can optimize performance through adaptive batching, which combines multiple individual requests into a single batch for more efficient processing.

Let's update our Service to support batching:

import bentoml
import numpy as np
from bentoml.models import BentoModel

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enable adaptive batching
    @bentoml.api(batchable=True)
    def predict(
        self,
        input_data: np.ndarray
    ) -> list[str]:
        print(f"batch_size: {len(input_data)}")
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]

You can test it with a script that simulates multiple concurrent clients:

import random
import time
from concurrent.futures import ThreadPoolExecutor

import bentoml
import numpy as np
from sklearn.datasets import load_iris

CONCURRENCY = 20        # Number of threads (concurrent requests)
TOTAL_REQUESTS = 1000   # Total number of requests to send

client = bentoml.SyncHTTPClient("http://localhost:3000")

# Build a pool of request payloads sampled from the Iris dataset
iris = load_iris()
data_samples = iris.data.tolist()
payloads = [random.choice(data_samples) for _ in range(TOTAL_REQUESTS)]

def send_request(index, data):
    """Send a single request and print the result."""
    try:
        start_time = time.time()
        result = client.predict(np.array([data]))
        duration = time.time() - start_time
        print(f"Request {index}: {result} ({duration:.3f}s)")
    except Exception as e:
        print(f"Request {index}: Error -> {e}")

print(f"Sending {TOTAL_REQUESTS} requests to {client.url} with concurrency {CONCURRENCY}...")

with ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    for i, data in enumerate(payloads, start=1):
        executor.submit(send_request, i, data)

print("Done.")

Although each client sends a single data point, you'll notice from the server logs that BentoML dynamically batches multiple requests together. This improves throughput and increases computational efficiency.
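
If you need finer control over batching behavior, the @bentoml.api decorator also accepts tuning parameters such as max_batch_size and max_latency_ms. Below is a sketch of the same Service with illustrative values (the image and resource settings are omitted for brevity); tune them against your own latency and throughput targets:

import bentoml
import numpy as np
from bentoml.models import BentoModel

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 10})
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Tune adaptive batching: cap the batch size and the extra latency
    # the server may add while it waits to form a batch (illustrative values).
    @bentoml.api(
        batchable=True,
        batch_dim=0,          # requests are concatenated along the first axis
        max_batch_size=64,
        max_latency_ms=500,
    )
    def predict(self, input_data: np.ndarray) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]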

You can also monitor the batch size metrics at http://localhost:3000/metrics. Here are some example metrics after running the above script:

# HELP bentoml_service_adaptive_batch_size Service adaptive batch size
# TYPE bentoml_service_adaptive_batch_size histogram
bentoml_service_adaptive_batch_size_sum{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 1000.0
bentoml_service_adaptive_batch_size_bucket{le="1.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 66.0
bentoml_service_adaptive_batch_size_bucket{le="2.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 111.0
bentoml_service_adaptive_batch_size_bucket{le="4.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 171.0
bentoml_service_adaptive_batch_size_bucket{le="8.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 227.0
bentoml_service_adaptive_batch_size_bucket{le="16.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 249.0
bentoml_service_adaptive_batch_size_bucket{le="32.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="64.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="100.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="+Inf",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_count{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0

Standardizing Your Deployment Workflow

For larger teams collaborating on multiple models and projects, BentoML provides tools to standardize ML service development.

Example: Enforcing Environment Dependencies and API Specifications Across Multiple Projects

First, create a common.py file that defines shared components:

# common.py
import bentoml
import numpy as np
import numpy.typing as npt
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

my_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

class MyInputParams(bentoml.IODescriptor):
    input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]])
    client_id: str

Then, use these components in your Service:

import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    @bentoml.api(input_spec=MyInputParams)
    def predict(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        print(f"Processing request from user {client_id}")
        rv = self.model.predict(input_data)
        return np.asarray(rv)

Example: Serving Multiple Models

BentoML makes it easy to serve multiple models in a single Service (or distributed Services).

import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model_1 = BentoModel("iris:v1")
    bento_model_2 = BentoModel("iris:v2")

    def __init__(self):
        self.model_1 = bentoml.mlflow.load_model(self.bento_model_1)
        self.model_2 = bentoml.mlflow.load_model(self.bento_model_2)

    @bentoml.api(route="/v1/predict", input_spec=MyInputParams)
    def predict_1(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_1.predict(input_data)
        return np.asarray(rv)

    @bentoml.api(route="/v2/predict", input_spec=MyInputParams)
    def predict_2(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_2.predict(input_data)
        return np.asarray(rv)

    # Combine predictions from both model versions
    @bentoml.api(input_spec=MyInputParams)
    def predict_combined(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv_a = self.model_1.predict(input_data)
        rv_b = self.model_2.predict(input_data)
        return np.asarray([rv_a, rv_b])

This approach allows you to:

  • Serve multiple model versions under different endpoints
  • Create ensemble models that combine predictions from multiple models
  • Implement A/B testing between model versions

For more information, see the BentoML documentation about multi-model composition.
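
For example, a client could target each versioned route separately, or get both predictions in one call from the combined endpoint. A rough sketch using the BentoML Python client, assuming the Service above is running locally and that client methods follow the API method names:

import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
sample = np.array([[5.9, 3.0, 5.1, 1.8]])

# Call each model version through its own endpoint
v1_result = client.predict_1(input_data=sample, client_id="demo-user")
v2_result = client.predict_2(input_data=sample, client_id="demo-user")

# Or fetch both predictions at once for comparison / ensembling
both = client.predict_combined(input_data=sample, client_id="demo-user")
print(v1_result, v2_result, both)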

Deploying to Production

BentoML provides multiple options for production deployment:

  1. Containerization: Build an OCI-compliant image of your ML service so it can be deployed on any container platform:

    bentoml build
    bentoml containerize iris_classifier:latest

    Refer to the containerization guide for more details.

  2. BentoCloud: Sign up for BentoCloud and deploy directly to the unified inference platform for easy management, fast autoscaling, enterprise-grade security, and comprehensive observability:

    bentoml deploy

    Refer to the cloud deployment guide for more details.

Conclusion

In this tutorial, we've seen how MLflow and BentoML work together to create a seamless workflow from experimentation to production:

  • MLflow handles experiment tracking, model metrics, and artifact storage during the development phase
  • BentoML takes care of the production aspects: model serving, validation, batching, and deployment

The integration allows data scientists to focus on model development while ensuring their models can be reliably deployed to production. Check out the following to learn more: