February 27, 2025 • Written By Chaoyu Yang
Machine learning projects involve many moving parts, from experimentation to production deployment. Two tools that work wonderfully together to streamline this process are MLflow and BentoML. In this tutorial, we'll demonstrate how to use MLflow for experiment tracking and BentoML for model serving and production deployment.
Specifically, you'll learn to:

- Train a model and track the experiment with MLflow
- Import the logged model into the BentoML Model Store
- Serve the model as a REST API with input validation and adaptive batching
- Package and deploy the service to production
You can find all the source code in the BentoMLflow repository.
Let’s get started!
Install the necessary packages:
pip install bentoml mlflow scikit-learn
Note: While we use scikit-learn for demo purposes, both MLflow and BentoML support a wide variety of frameworks, such as PyTorch, TensorFlow and XGBoost.
Start your MLflow tracking server:
mlflow server --host 127.0.0.1 --port 8080
This server will track our experiments and store our model artifacts.
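If you want a quick sanity check that the server is running before moving on, the MLflow tracking server exposes a simple health endpoint (this assumes the default server setup above):

curl http://127.0.0.1:8080/health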
Let's train a simple classification model using the Iris dataset and log the results with MLflow:
import mlflow
from mlflow.models import infer_signature

import pandas as pd
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Load the Iris dataset
X, y = datasets.load_iris(return_X_y=True)

# Split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Define the model hyperparameters
params = {
    "solver": "lbfgs",
    "max_iter": 1000,
    "random_state": 8888,
}

# Train the model
lr = LogisticRegression(**params)
lr.fit(X_train, y_train)

# Predict on the test set
y_pred = lr.predict(X_test)

# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
print(accuracy)
Next, log the model parameters and metrics, and save the model artifact to MLflow:
from datetime import date

# Set our tracking server uri for logging
mlflow.set_tracking_uri(uri="http://127.0.0.1:8080")

# Create a new MLflow Experiment
mlflow.set_experiment("MLflow+BentoML Quickstart")

# Start an MLflow run
with mlflow.start_run():
    # Log the hyperparameters
    mlflow.log_params(params)

    # Log the accuracy metric
    mlflow.log_metric("accuracy", accuracy)

    # Set a tag that we can use to remind ourselves what this run was for
    mlflow.set_tag("Training Info", "Basic LR model for iris data")

    # Infer the model signature
    signature = infer_signature(X_train, lr.predict(X_train))

    # Log the model
    model_info = mlflow.sklearn.log_model(
        sk_model=lr,
        artifact_path="iris_model",
        signature=signature,
        input_example=X_train,
        registered_model_name="iris_demo",
    )
    model_uri = mlflow.get_artifact_uri("iris_model")
At this point, MLflow has:

- Logged the hyperparameters and the accuracy metric for this run
- Stored the model artifact together with its inferred signature and input example
- Registered the model under the name iris_demo
You can view all the information in the MLflow UI by visiting http://127.0.0.1:8080.
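Before handing the model to BentoML, you can optionally load it back from the tracking server to confirm the artifact was logged correctly. A minimal sketch that reuses the model_uri and X_test variables from the snippets above:

import mlflow.sklearn

# Reuses `model_uri` and `X_test` from the earlier snippets; a quick check
# that the logged artifact round-trips and produces predictions
loaded_model = mlflow.sklearn.load_model(model_uri)
print(loaded_model.predict(X_test[:3]))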
Once you're happy with its performance, import the model into the BentoML Model Store for deployment.
import bentoml

bento_model = bentoml.mlflow.import_model(
    'iris',
    model_uri=model_uri,
    labels={
        "team": "bento",
        "stage": "dev",
        "accuracy": accuracy,
        "training_date": str(date.today())
    }
)
Note that:

- bentoml.mlflow.import_model downloads the MLflow model artifact and stores a copy in the local BentoML Model Store, so it can be served without reaching back to the tracking server.
- The first argument ('iris') is the name the model will have in the Model Store.
- labels are optional key-value metadata, useful for recording things like the owning team, stage, and training date.
Verify the model is saved to the Model Store:
$ bentoml models list

Tag                    Module          Size       Creation Time
iris:hu5d7xxs3oxmnuqj  bentoml.mlflow  11.75 KiB  2025-02-24 10:14:51
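To inspect the labels and metadata attached during import, you can also query a single model with the CLI (assuming the models get subcommand available in recent BentoML releases):

bentoml models get iris:latest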
You can test loading the model from the Model Store:
import numpy as np
import bentoml

# Load the latest version of iris model:
iris_model = bentoml.mlflow.load_model("iris:latest")

# Alternatively, load the model by specifying the model tag
# iris_model = bentoml.mlflow.load_model("iris:hu5d7xxs3oxmnuqj")

input_data = np.array([[5.9, 3, 5.1, 1.8]])
res = iris_model.predict(input_data)
print(res)
Now that the model is ready, create a BentoML Service to serve it. By convention, you define a file called service.py to implement the model serving logic.
import bentoml
import numpy as np
from bentoml.models import BentoModel

# Define the runtime environment for your Bento
demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    # Declare the model as a class attribute
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Define an API endpoint
    @bentoml.api
    def predict(self, input_data: np.ndarray) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
Note that:

- PythonImage is used to define the runtime environment for a Bento, the unified distribution format in BentoML. You can customize the build by setting the required Python version, dependencies, run commands, and more.
- @bentoml.service marks a Python class as a BentoML Service. It allows you to specify configurations like request timeouts and resource requirements.

Serve the model using the BentoML CLI:
$ bentoml serve service.py:IrisClassifier

[INFO] [cli] Starting production HTTP BentoServer from "service:IrisClassifier" listening on http://localhost:3000 (Press CTRL+C to quit)
[INFO] [entry_service:IrisClassifier:1] Service IrisClassifier initialized
The model is now running at http://localhost:3000. Query the endpoint:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 0.1, 0.2, 0.1, 0.1 ]]
    }'

# ["setosa"]
Alternatively, use the BentoML Python client:
import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[5.9, 3, 5.1, 1.8]]))

# ['virginica']
A common problem is handling unexpected data formats or types from clients. For example, if a client sends integer values instead of floats:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 5,3,5,2 ]]
    }'
This would result in an error:
# client side error:
# {"error":"An unexpected error has occurred, please check the server log."}

# server side log:
# mlflow.exceptions.MlflowException: Failed to enforce schema of data '[[5 3 5 2]]' with schema '[Tensor('float64', (-1, 4))]'. Error: dtype of input int64 does not match expected dtype float64
Using the BentoML Python client with an explicit float dtype solves this:
import bentoml
import numpy as np

client = bentoml.SyncHTTPClient("http://localhost:3000")
client.predict(np.array([[1, 1, 1, 1]], dtype='float64'))
However, this poses challenges when integrating ML services with downstream services. To further help with input validation, BentoML extends Pydantic to handle common ML data types (e.g., images, text streams, floats). You can define a strict schema in your BentoML Service:
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enforce and validate input schemas for the API
    @bentoml.api
    def predict(
        self,
        input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]]),
    ) -> list[str]:
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
Now, integer input is automatically validated (and converted where possible). You can try it with a generic HTTP client:
curl -X 'POST' \
    'http://localhost:3000/predict' \
    -H 'accept: application/json' \
    -H 'Content-Type: application/json' \
    -d '{
        "input_data": [[ 5,3,5,2 ]]
    }'

# ["virginica"]
You can also inspect the OpenAPI documentation to see the required schema for your service:
curl localhost:3000/docs.json
This returns a JSON schema that describes the input and output formats of the API.
"paths": { ... "/predict": { "post": { "responses": { "200": { "description": "Successful Response", "content": { "application/json": { "schema": { "type": "array", "items": { "type": "number" } } } } }, "400": { "description": "Bad Request", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InvalidArgument" } } } }, "500": { "description": "Internal Server Error", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/InternalServerError" } } } } }, "requestBody": { "content": { "application/json": { "schema": { "type": "object", "title": "Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "items": { "items": { "type": "number" }, "type": "array" }, "title": "Input Data", "type": "array" } } } } } }, "operationId": "IrisClassifier__predict" } } }, ... "components": { "schemas": { "predict__Input": { "type": "object", "title": "predict__Input", "properties": { "input_data": { "default": [ [ 0.1, 0.4, 0.2, 1 ] ], "dim": -4, "dtype": "float64", "format": "numpy-array", "shape": [ -1, 4 ], "title": "Input Data", "type": "tensor" } } }, ...
BentoML can optimize performance through adaptive batching, which combines multiple individual requests into a single batch for more efficient processing.
Let's update our Service to support batching:
import bentoml
import numpy as np
from bentoml.models import BentoModel

demo_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

target_names = ['setosa', 'versicolor', 'virginica']

@bentoml.service(
    image=demo_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    # Enable adaptive batching
    @bentoml.api(batchable=True)
    def predict(self, input_data: np.ndarray) -> list[str]:
        print(f"batch_size: {len(input_data)}")
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]
You can test it with a script that simulates multiple concurrent clients:
import random
import time
from concurrent.futures import ThreadPoolExecutor

import bentoml
import numpy as np
from sklearn.datasets import load_iris

CONCURRENCY = 20       # Number of threads (concurrent requests)
TOTAL_REQUESTS = 1000  # Total number of requests to send

client = bentoml.SyncHTTPClient("http://localhost:3000")

iris = load_iris()
data_samples = iris.data.tolist()
payloads = [random.choice(data_samples) for _ in range(TOTAL_REQUESTS)]

def send_request(index, data):
    """Send a single HTTP request and print the result."""
    try:
        start_time = time.time()
        response = client.predict(np.array([data]))
        duration = time.time() - start_time
        print(f"Request {index}: {response} ({duration:.3f}s)")
    except Exception as e:
        print(f"Request {index}: Error -> {e}")

print(f"Sending {TOTAL_REQUESTS} requests to {client.url} with concurrency {CONCURRENCY}...")

with ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    for i, data in enumerate(payloads, start=1):
        executor.submit(send_request, i, data)

print("Done.")
Although each client sends a single data point, you'll notice from the server logs that BentoML dynamically batches multiple requests together. This improves throughput and increases computational efficiency.
You can also monitor the batch size metrics at http://localhost:3000/metrics. Here are some example metrics after running the above script:
# HELP bentoml_service_adaptive_batch_size Service adaptive batch size
# TYPE bentoml_service_adaptive_batch_size histogram
bentoml_service_adaptive_batch_size_sum{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 1000.0
bentoml_service_adaptive_batch_size_bucket{le="1.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 66.0
bentoml_service_adaptive_batch_size_bucket{le="2.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 111.0
bentoml_service_adaptive_batch_size_bucket{le="4.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 171.0
bentoml_service_adaptive_batch_size_bucket{le="8.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 227.0
bentoml_service_adaptive_batch_size_bucket{le="16.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 249.0
bentoml_service_adaptive_batch_size_bucket{le="32.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="64.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="100.0",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_bucket{le="+Inf",method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
bentoml_service_adaptive_batch_size_count{method_name="predict",runner_name="IrisClassifier",service_name="IrisClassifier",service_version="not available",worker_index="1"} 251.0
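If you need tighter control over how requests are grouped, the same decorator accepts additional batching options. As a hedged sketch (the values below are illustrative, not recommendations; check the adaptive batching docs for the options supported by your BentoML version), you could replace the predict method in the Service above with:

    # Illustrative settings only; tune max_batch_size and max_latency_ms for your workload
    @bentoml.api(
        batchable=True,
        batch_dim=0,          # dimension along which requests are concatenated
        max_batch_size=32,    # upper bound on the size of an assembled batch
        max_latency_ms=5000,  # latency budget the batcher tries to respect
    )
    def predict(self, input_data: np.ndarray) -> list[str]:
        print(f"batch_size: {len(input_data)}")
        preds = self.model.predict(input_data)
        return [target_names[i] for i in preds]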
For larger teams collaborating on multiple models and projects, BentoML provides tools to standardize ML service development.
First, create a common.py file that defines shared components:
# common.py
import bentoml
import numpy as np
import numpy.typing as npt
from pydantic import Field
from bentoml.validators import Shape, DType
from typing import Annotated

my_image = bentoml.images.PythonImage(python_version="3.11") \
    .python_packages("mlflow", "scikit-learn")

class MyInputParams(bentoml.IODescriptor):
    input_data: Annotated[npt.NDArray[np.float64], Shape((-1, 4)), DType("float64")] = Field(default=[[0.1, 0.4, 0.2, 1.0]])
    client_id: str
Then, use these components in your Service:
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model = BentoModel("iris:latest")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    @bentoml.api(input_spec=MyInputParams)
    def predict(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        print(f"processing request from user {client_id}")
        rv = self.model.predict(input_data)
        return np.asarray(rv)
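A client then supplies both fields defined in MyInputParams. A minimal sketch, assuming the Service above is running locally on port 3000:

import bentoml
import numpy as np

# Call the endpoint with the fields declared in MyInputParams;
# `client_id` here is just an example value
client = bentoml.SyncHTTPClient("http://localhost:3000")
result = client.predict(
    input_data=np.array([[5.9, 3.0, 5.1, 1.8]]),
    client_id="demo-user",
)
print(result)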
BentoML makes it easy to serve multiple models in a single Service (or distributed Services).
import bentoml
import numpy as np
import numpy.typing as npt
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(
    image=my_image,
    resources={"cpu": "2"},
    traffic={"timeout": 10},
)
class IrisClassifier:
    bento_model_1 = BentoModel("iris:v1")
    bento_model_2 = BentoModel("iris:v2")

    def __init__(self):
        self.model_1 = bentoml.mlflow.load_model(self.bento_model_1)
        self.model_2 = bentoml.mlflow.load_model(self.bento_model_2)

    @bentoml.api(route="/v1/predict", input_spec=MyInputParams)
    def predict_1(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_1.predict(input_data)
        return np.asarray(rv)

    @bentoml.api(route="/v2/predict", input_spec=MyInputParams)
    def predict_2(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv = self.model_2.predict(input_data)
        return np.asarray(rv)

    # Combine predictions
    @bentoml.api(input_spec=MyInputParams)
    def predict_combined(
        self,
        input_data,
        client_id,
    ) -> np.ndarray:
        rv_a = self.model_1.predict(input_data)
        rv_b = self.model_2.predict(input_data)
        return np.asarray([rv_a, rv_b])
This approach allows you to:

- Serve multiple model versions side by side under separate routes (/v1/predict and /v2/predict)
- Compare or combine predictions from different models behind a single endpoint
- Reuse shared input specifications and runtime images across Services
For more information, see the BentoML documentation about multi-model composition.
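If you later want to split the models into separate Services, for example to scale them independently, composition follows a similar pattern. A rough sketch, assuming bentoml.depends for inter-Service calls (the class names here are illustrative):

import bentoml
import numpy as np
from bentoml.models import BentoModel

from common import MyInputParams, my_image

@bentoml.service(image=my_image)
class IrisModelV1:
    bento_model = BentoModel("iris:v1")

    def __init__(self):
        self.model = bentoml.mlflow.load_model(self.bento_model)

    @bentoml.api
    def predict(self, input_data: np.ndarray) -> np.ndarray:
        return np.asarray(self.model.predict(input_data))

@bentoml.service(image=my_image)
class Gateway:
    # Declare a dependency so the Gateway can call IrisModelV1,
    # whether it runs in the same process or as a separate deployment
    model_v1 = bentoml.depends(IrisModelV1)

    @bentoml.api(input_spec=MyInputParams)
    def predict(self, input_data, client_id) -> np.ndarray:
        return self.model_v1.predict(input_data)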
BentoML provides multiple options for production deployment:
Containerization: Build an OCI-compliant image for your ML service for deployment on any container platform:
bentoml build
bentoml containerize iris_classifier:latest
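Once built, the image can be run with any OCI-compatible runtime. A hedged example, substituting the actual tag printed by bentoml containerize:

docker run --rm -p 3000:3000 iris_classifier:latest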
Refer to the containerization guide for more details.
BentoCloud: Sign up for BentoCloud and deploy directly to the unified inference platform for easy management, fast autoscaling, enterprise-grade security, and comprehensive observability:
bentoml deploy
Refer to the cloud deployment guide for more details.
In this tutorial, we've seen how MLflow and BentoML work together to create a seamless workflow from experimentation to production:

- MLflow tracks experiments and logs model parameters, metrics, and artifacts.
- BentoML imports the tracked model, serves it behind a validated API with adaptive batching, and packages it for deployment via containers or BentoCloud.
The integration allows data scientists to focus on model development while ensuring their models can be reliably deployed to production. Check out the following to learn more: