October 19, 2023 • Written By Sherlock Xu
In the ever-evolving world of MLOps, the ability to observe, monitor, and analyze the operations of machine learning (ML) models has become increasingly important. Observability in MLOps is not just a buzzword — it's the foundation upon which we ensure the reliability, performance, and efficiency of deployed ML solutions. As such, having the right tools and processes in place to monitor model performance and system health becomes indispensable. Without proper observability, teams may find themselves flying blind, unable to identify potential issues until it's too late.
Now, how does BentoML fit into this picture? BentoML integrates seamlessly with Prometheus and Grafana, two of the most popular and powerful tools in the observability space. More specifically, BentoML automatically collects some default metrics for all API Servers and Runners with the help of Prometheus, such as the total number of requests; users can then visualize them through Grafana to set up alerts and create custom dashboards for more in-depth analysis. With the integration, ML practitioners can ensure that their models are not only performant but also robust and reliable.
So, how exactly do we achieve this integration? In this article, I will demonstrate how to monitor a BentoML project with Prometheus, create a custom histogram with a custom Runner, and create a custom dashboard in Grafana. The model involved in this project will be trained on the widely-used MNIST dataset of handwritten digits.
Let’s get started!
This project's source code is available in the BentoML repository. Start by cloning the repository and navigating to the project directory:
Let’s have a quick look at the key files in this project.
train.py: Trains an image classification model on the MNIST dataset, a collection of handwritten digits, and saves the trained model to the BentoML local Model Store.
service.py: Defines the BentoML Service, including the model serving logic, API endpoint configuration, and preprocessing and postprocessing steps. You can start a server with it in either HTTP or gRPC modes. More details of this file will be given in the next section.
requirements.txt: The required dependencies for this project.
bentofile.yaml: Contains a set of configurations for building the Bento for this project, such as the Service, Python files, and dependencies. See the BentoML documentation to learn more about this file.
net.py: Defines the convolutional neural network (CNN) model used for image classification in this project.
utils.py: Provides utility functions that are used in other parts of the project.
locustfile.py: Locust is an open-source load-testing tool that lets you define user behavior in Python code and simulate traffic from many simultaneous users swarming your system. This file sends image requests (`./mnist_png/testing/9/1000.png`) to the `/predict` endpoint, pausing for a random duration (between 0.05 and 2 seconds) between requests.
prometheus: This directory contains Prometheus configuration files for monitoring gRPC and HTTP traffic respectively. In both configurations, Prometheus scrapes metric targets every 5 seconds and evaluates established rules every 15 seconds.
Note: The machine I used in this demo runs Ubuntu 20.04. As the trained model is small, you don’t need high-performance hardware to run this project.
Before diving into the monitoring part, it's crucial to ensure the BentoML Service is running correctly. Here's a step-by-step guide to test it:
Install the required Python packages.
Train a CNN model on the MNIST dataset with the `train.py` script. This saves the model to the BentoML local Model Store.
Download the test data, a set of images of handwritten digits ranging from `0` to `9`. You can use them later for sending requests to the Service.
Before starting the Service, let’s take a look at the code.
This Service file does the following:

Retrieves the model with `bentoml.pytorch.get` and creates a Runner from the custom `_BuiltinRunnable` object, a Runnable that wraps the model.

Defines the `inference_duration` histogram metric, which measures the duration of inference and carries labels for the torch version and the device ID. The `exponential_buckets` function defines the bucket intervals for this histogram, determining the granularity of tracking. See the Prometheus Histogram documentation for more details.

Implements `_BuiltinRunnable`, a custom Runnable that adds extra logging and metrics information. Its `__init__` method logs the device and torch version being used, while its `__call__` method measures and records the duration of each inference using the `inference_duration` metric defined earlier. Every time the model makes a prediction (that is, every time `__call__` is invoked), the time taken is recorded in the histogram. This allows you to monitor the model's inference performance over time and see how its response time varies under different conditions or loads.
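The timing logic described above can be sketched with `prometheus_client` directly. This is a simplified stand-in for BentoML's metrics API, not the project's actual code: the `exponential_buckets` helper and the `TimedRunnable` class names here are illustrative.

```python
import time

from prometheus_client import Histogram

def exponential_buckets(start: float, factor: float, count: int):
    # Bucket boundaries that grow geometrically, similar to the
    # exponential-bucket helpers in other Prometheus client libraries.
    return [start * factor**i for i in range(count)]

# Histogram for inference latency, labeled by torch version and device ID.
inference_duration = Histogram(
    "inference_duration",
    "Time taken for one model inference call, in seconds",
    labelnames=["torch_version", "device_id"],
    buckets=exponential_buckets(0.001, 1.5, 14),
)

class TimedRunnable:
    """Wraps a model callable and records each call's duration."""

    def __init__(self, model, torch_version: str, device_id: str):
        self.model = model
        self.torch_version = torch_version
        self.device_id = device_id

    def __call__(self, x):
        start = time.perf_counter()
        output = self.model(x)
        inference_duration.labels(
            torch_version=self.torch_version, device_id=self.device_id
        ).observe(time.perf_counter() - start)
        return output
```

Each call observes one sample into the histogram, so Prometheus can later compute rates and quantiles over the recorded durations.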
Launch the BentoML Service locally. You can then interact with the server at http://0.0.0.0:3000.
Note: Adding the `--reload` option allows the Service to be reloaded automatically when code changes are detected.
Send a request to the `/predict` endpoint. Make sure you are in the project root directory when sending it. The expected output is `3`, which means the model thinks the handwritten digit is most likely a 3.
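A request along these lines can be sent with `curl` (the image path comes from the Locust file described earlier; the content type is an assumption about how the Service expects its input):

```shell
curl -s \
  -H "Content-Type: image/png" \
  --data-binary @./mnist_png/testing/9/1000.png \
  http://0.0.0.0:3000/predict
```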
If you see the following error when interacting with the server, it means the model's output tensor is on a CUDA device (most likely a GPU) and the Service is trying to convert it to a NumPy array directly. NumPy works on the CPU, so tensors from the GPU need to be moved to the CPU before you can convert them.
To solve this, modify the conversion logic in the `service.py` file, namely the line where the tensor is converted to a NumPy array: move the tensor to the CPU before converting it.
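A minimal illustration of the fix (the function and variable names here are hypothetical, not taken from the project):

```python
import torch

def to_numpy(output_tensor: torch.Tensor):
    # .cpu() is a no-op for tensors already on the CPU and copies
    # GPU tensors to host memory, so this works on both devices.
    return output_tensor.cpu().numpy()
```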
Send your request again and you should be able to see the expected output.
With the server up and running, you can use the `locustfile.py` file with Locust to simulate multiple users sending requests and test the performance of model serving.
Make sure the BentoML Service is running and start Locust in another terminal:
Visit the Locust web UI at http://0.0.0.0:8089. Set your desired number of users and spawn rate, then click Start swarming.
Monitor RPS (Requests Per Second) on the Charts tab to observe performance under loads.
Now that the server is actively processing requests sent by Locust, you can use Prometheus for further analysis.
Install Prometheus if you haven’t.
Start Prometheus with either the gRPC or HTTP configuration. I used HTTP in this demo.
Prometheus should now be scraping metrics from the BentoML Service. To visualize them, access the Prometheus web UI at http://localhost:9090 (the default Prometheus port). As mentioned earlier, BentoML automatically collects a number of metrics for all API Servers and Runners, such as `bentoml_runner_request_total`. In addition, BentoML also lets you define custom metrics.
I used the following PromQL expression to return the average request count per second over the last minute for the `/predict` endpoint of the Service.
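A plausible form of that expression, using BentoML's default request-counter metric (the `endpoint` label name is assumed from BentoML's default metric labels):

```promql
rate(bentoml_api_server_request_total{endpoint="/predict"}[1m])
```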
Here is the graph:
Also, use the following PromQL expression for the 95th percentile inference latency:
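Assuming the custom histogram is exported under the name `inference_duration`, the expression can take this form, aggregating the per-bucket rates before computing the quantile:

```promql
histogram_quantile(0.95, sum(rate(inference_duration_bucket[1m])) by (le))
```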
For a more advanced visualization, you can consider setting up Grafana and integrating it with Prometheus. Grafana provides a feature-rich platform to create dashboards that display metrics collected by Prometheus.
Make sure you have installed Grafana. The commands may be different depending on your system. See the Grafana documentation for details.
If your BentoML Service is running on port 3000 and you've started Grafana, which by default also runs on port 3000, you'll encounter a port conflict. To solve this, configure Grafana to run on a different port. Do the following:
Open the Grafana configuration file.
In this file, look for the `[server]` section, which contains a commented-out `http_port` setting. Change the port number to an available port such as `4000`, and remove the semicolon (`;`) at the beginning of the line to uncomment it.
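After the edit, the relevant part of the Grafana configuration looks like this:

```ini
[server]
http_port = 4000
```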
Save the file and restart the Grafana service for the change to take effect:
Access the Grafana web UI at
http://localhost:4000/ (use your own port). The default login information is
admin for both the username and password. You'll be prompted to change the password.
In the Grafana search box at the top, enter “Data sources” and add Prometheus as an available data source. In the HTTP section, set the URL to
http://localhost:9090 (or wherever your Prometheus instance is running). Save the configuration to ensure Grafana can connect to Prometheus.
Add a new Grafana dashboard based on your desired metric. I created the following dashboard using the
bentoml_api_server_request_total metric for your reference:
Once you are happy with the results, use the
bentofile.yaml file to build a Bento by running
bentoml build. You can then containerize it or deploy it to BentoCloud.
In the rapidly evolving world of machine learning and AI, having a reliable and efficient deployment tool is crucial. Equally important is the tool's ability to offer powerful observability, ensuring you stay informed about your application's health and performance. I hope that you find this tutorial helpful, especially in gaining insights from key model serving metrics. Happy coding!
To learn more about BentoML, check out the following resources: