November 16, 2023 • Written By Sherlock Xu
Embeddings, in the context of AI and machine learning, are a way to convert words, sentences, or images into numbers that a computer can understand and work with. They are useful because they capture the "meaning" behind words or images, which lets computers do things like recommend a product similar to the one you're looking at, translate languages, or find the photo you're searching for.
In this blog post, let’s see how we can use BentoML to build a sentence embedding service, deploy it on BentoCloud, and try the autoscaling feature of the serverless platform.
Do the following to set up your environment and get familiar with the project.
Clone the project directory to your local machine.
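Assuming the example project is hosted under the BentoML GitHub organization (check the original repository location if this differs), cloning looks like:

```shell
git clone https://github.com/bentoml/sentence-embedding-bento.git
cd sentence-embedding-bento
```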
In the `sentence-embedding-bento` folder, inspect the following key files:
- `import_model.py`: Downloads the all-MiniLM-L6-v2 model and its tokenizer and saves them to the BentoML Model Store. It is a sentence-transformers model used to generate sentence embeddings. You can swap in another model based on your needs.
- `requirements.txt`: Required dependencies for this project.
- `service.py`: Creates a BentoML Service using the custom Runner defined in `embedding_runnable.py`. It defines the API server, including the endpoints, the input and output formats, and how input data is processed to produce the output.
- `embedding_runnable.py`: Creates the `SentenceEmbeddingRunnable` class, which executes the embedding process and can run on either CPU or GPU. Its `encode` method tokenizes input sentences and computes their embeddings using the model, followed by mean pooling to generate a single embedding vector per sentence; its `mean_pooling` method aggregates token embeddings into sentence embeddings by calculating the weighted average of token embeddings, taking the attention mask into account.
- `bentofile.yaml`/`bentofile-gpu.yaml`: The configurations used to build the entire project into the standardized distribution format of the BentoML ecosystem, known as a Bento.
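The mean-pooling step described above can be sketched with NumPy for a single sentence. The real Runner works on PyTorch tensors and batches, so treat this as an illustration of the idea, not the project's code:

```python
import numpy as np

def mean_pooling(token_embeddings, attention_mask):
    """Average token embeddings into one sentence vector, ignoring padding.

    token_embeddings: (num_tokens, dim) array of per-token vectors
    attention_mask:   (num_tokens,) array of 1s for real tokens, 0s for padding
    """
    mask = attention_mask[:, None].astype(float)    # (num_tokens, 1), broadcasts over dim
    summed = (token_embeddings * mask).sum(axis=0)  # sum only the real tokens
    count = np.clip(mask.sum(), 1e-9, None)         # number of real tokens, guard div-by-zero
    return summed / count
```

Padding tokens get a mask value of 0, so they contribute nothing to the sum and are excluded from the count, which is exactly the "weighted average" behavior the Runner relies on.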
Install the required dependencies.
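With a virtual environment activated, the dependencies can be installed from the provided requirements file:

```shell
pip install -r requirements.txt
```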
Before you build the Bento for the project, it is always a good practice to test it locally.
Download both the model and the tokenizer.
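The download script can be run directly; it saves both the model and the tokenizer into the local BentoML Model Store:

```shell
python import_model.py
```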
Verify that the download is successful:
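`bentoml models list` prints the models in the local Model Store, so the newly saved model should appear in its output:

```shell
bentoml models list
```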
Start the BentoML Service.
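A sketch of starting the server, assuming `service.py` exposes its Service object as `svc` (adjust the target if the project names it differently):

```shell
bentoml serve service:svc
```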
The server should be active at http://0.0.0.0:3000. You can interact with it through the Swagger UI, or send a request from the command line.
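For example, a command-line request might look like the following; the `/encode` endpoint name and the plain JSON-list input format are assumptions based on typical BentoML text services, so check the Swagger UI for the exact schema:

```shell
curl -X POST http://0.0.0.0:3000/encode \
  -H 'Content-Type: application/json' \
  -d '["BentoML makes model serving easy", "I love sentence embeddings"]'
```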
The expected output should be two arrays of numerical vectors, which represent the embeddings of the input sentences. As I mentioned at the beginning of this article, you can use these embeddings for various NLP tasks. One common use case is to measure how semantically similar two sentences are. For example, you can calculate the cosine similarity between vectors by running the following script.
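As a sketch of that script, the similarity computation itself only needs NumPy; the two toy vectors below stand in for embeddings returned by the service:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors standing in for two sentence embeddings from the service
emb1 = [0.2, 0.1, 0.4]
emb2 = [0.3, 0.2, 0.1]
print(cosine_similarity(emb1, emb2))
```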
The cosine similarity is the cosine of the angle between the two vectors, which lies between -1 and 1. If it is close to 1, the angle between the vectors is small and the sentences are similar to each other; if it is close to 0, the vectors are nearly orthogonal and the sentences are neither particularly similar nor dissimilar; if it is close to -1, the sentences are dissimilar.
Once you are happy with the performance of the model, package it and all the associated files and dependencies into a Bento and push it to BentoCloud.
Run `bentoml build` under the project directory to build the Bento.
Note: If you are running the project on GPU devices, build the Bento with the `bentofile-gpu.yaml` file instead by running `bentoml build -f bentofile-gpu.yaml`.
Make sure you have already logged in to BentoCloud, then push the Bento to the serverless platform. This way, you can better deploy the sentence embedding service in production with enhanced features like automatic scaling and observability.
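Pushing uses the Bento tag produced by the build step; assuming the tag from this project, the command is:

```shell
bentoml push sentence-embedding-svc:latest
```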
Note: If you don't have access to BentoCloud, you can also run `bentoml containerize sentence-embedding-svc:latest` to create a Bento Docker image, and then deploy it to any Docker-compatible environment.
On the BentoCloud console, you can find the uploaded Bento on the Bentos page in the sentence-embedding-svc Bento repository. Each Bento repository contains a set of Bentos of different versions for the same ML service.
With the Bento pushed to BentoCloud, you can start to deploy it.
Navigate to the Deployments page and click Create.
Select On-Demand Function, which is useful for scenarios with loose latency requirements and sparse traffic.
Specify the required fields for the Deployment. This application does not require heavy resources, so I selected cpu.medium for both the API Server and Runner Pods. In addition, I set the minimum number of replicas allowed for scaling to 0, so there shouldn’t be any active Pods when there is no traffic.
You can directly access the exposed URL to interact with the application. Alternatively, use the same script but this time add a loop to create some traffic for monitoring. For example:
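Such a loop can be sketched with only the standard library; the `ENDPOINT` URL below is a placeholder for your actual Deployment URL, and the `/encode` path is an assumption to be checked against your service's API:

```python
import json
import time
import urllib.request

# Hypothetical URL; replace with the Deployment URL shown on the BentoCloud console
ENDPOINT = "https://sentence-embedding.example.bentoml.app/encode"

def build_request(sentences):
    """JSON-encode a batch of sentences for the embedding endpoint."""
    data = json.dumps(sentences).encode("utf-8")
    return urllib.request.Request(
        ENDPOINT, data=data, headers={"Content-Type": "application/json"}
    )

def send_traffic(n=100, delay=0.5):
    """Send n requests, pausing between them, to generate observable traffic."""
    for i in range(n):
        with urllib.request.urlopen(build_request(["hello", "world"])) as resp:
            print(i, resp.status)
        time.sleep(delay)
```

Calling `send_traffic()` after filling in the real URL produces a steady stream of requests, which is enough to watch replicas scale up from zero.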
After you run the above script to send requests, you may not see the result immediately as the application needs some time to scale both the API Server and the Runner. On the Overview tab, you should be able to see at least one active replica of the API Server and the Runner when you get the response.
View the related metrics on the Monitoring tab.
We've seen how BentoML simplifies the journey of a sentence embedding service from development to deployment. With its smooth integration with BentoCloud, you can easily scale and monitor your AI application. This empowers developers to deploy NLP models efficiently, bringing AI applications closer to their full potential. In the next blog post, I will demonstrate how to build and deploy an image embedding application with BentoML.
To learn more about BentoML and its ecosystem tools, check out the following resources: