October 5, 2023 • Written By Sherlock Xu
This is Part 2 of our comprehensive blog series on OpenLLM. The initial installment delves into the challenges of LLM deployment and the OpenLLM fundamentals with a simple quickstart. This second post dives deeper by providing a practical example of deploying an LLM in production using OpenLLM and BentoCloud.
BentoCloud is a serverless platform specifically designed for deploying AI applications in production with great scalability and flexibility. I have already explained why BentoCloud stands out as the platform of choice for heavyweights like Llama 2 in my article Deploying Llama 2 7B on BentoCloud, so I will skip the recap and go straight to the heart of the matter: deploying Llama 2 13B with OpenLLM and BentoCloud.
Make sure you meet the following prerequisites.
As mentioned in the previous blog post, Bentos are the unified distribution unit in BentoML. To deploy the model on BentoCloud, you need to package it into a Bento and push it to BentoCloud.
Log in to Hugging Face. Since OpenLLM sources the Llama 2 model from Hugging Face, logging in is essential.
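A minimal sketch of the Hugging Face login step (this assumes you have a Hugging Face access token with permission to download the gated Llama 2 weights; the `huggingface-cli` tool ships with the `huggingface_hub` package):

```shell
# Install the Hugging Face Hub CLI if it isn't available yet
pip install -U huggingface_hub

# Log in interactively; you will be prompted for your access token
huggingface-cli login
```

Alternatively, you can set the `HUGGING_FACE_HUB_TOKEN` environment variable in non-interactive environments such as CI.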
Log in to BentoCloud. This requires you to have a Developer API token, which allows you to access BentoCloud and manage different cloud resources. Upon generating this token, a pop-up should appear, displaying the following login command.
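The login command shown in the pop-up looks roughly like the following (the token and endpoint values here are placeholders; copy the exact command BentoCloud displays for your account):

```shell
# Authenticate the BentoML CLI against your BentoCloud account
bentoml cloud login --api-token <your-api-token> --endpoint <your-bentocloud-endpoint>
```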
Note: After you log in, you can create and manage Bento Deployments directly with commands like `bentoml deployment create/update`. See the BentoML documentation to learn more.
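If you prefer the CLI route, a quick way to discover the exact flags your BentoML version supports (the subcommands below exist, but their options vary across releases):

```shell
# List the Deployment subcommands and their options before scripting anything
bentoml deployment --help
bentoml deployment create --help
```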
Build a Bento for the Llama 2 13B model and upload it directly to BentoCloud by adding the `--push` option. For demonstration purposes, I use `meta-llama/Llama-2-13b-chat-hf` as an example, but feel free to choose any other Llama 2 variant.
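The build-and-push step can be sketched as follows (the model ID is the one used in this post; the exact command syntax may differ across OpenLLM versions, so check `openllm build --help` if it errors):

```shell
# Build a Bento for Llama 2 13B and push it to BentoCloud in one step
openllm build llama --model-id meta-llama/Llama-2-13b-chat-hf --push
```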
If the model isn't already stored locally, the above command fetches it from Hugging Face. Once downloaded, you can verify it with `bentoml models list`.
When the entire process is done, you can find the Bento on the Bento Repositories page. On BentoCloud, each Bento Repository represents a group of Bentos with the same name but different versions. Navigate to the Bento details page for a closer look.
Now that the Bento is on BentoCloud, let's deploy it.
Go to the Deployments page and click Create. On BentoCloud, you have two Deployment options - Online Service and On-Demand Function. For this example, choose the latter, ideal for large inference requests and situations where immediate responses aren't critical.
Set up your Bento Deployment on one of the three tabs.
Select the Advanced tab and specify the required fields. Pay attention to the following fields:
- `cpu.xlarge` for API Servers
- `runners."llm-llama-runner".workers_per_resource=0.25` in the configuration field
For more information about properties on this page, see Deployment creation and update information.
Click Submit. When the Deployment is ready, both the API Server and Runner Pods should be active.
With the Llama 2 13B application up and running, it's time to access it through the exposed URL.
On the Overview tab of its details page, click the link under URL. If you do not set any access control policy, you should be able to access the link directly.
In the Service APIs section, select the `generate` API and click Try it out. Enter your prompt and click Execute. You can find the answer in the Responses section. Alternatively, use the `curl` command, also presented within the Responses section, to send your queries.
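The curl request looks roughly like this (the URL is a placeholder for your own Deployment's endpoint, and the payload shape here is an assumption based on OpenLLM's `generate` API at the time of writing; the Try it out panel shows the exact command for your Deployment):

```shell
# Send a prompt to the generate endpoint of the deployed LLM
curl -X POST "https://<your-deployment-url>/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is the capital of France?", "llm_config": {"max_new_tokens": 128}}'
```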
On the Monitoring tab, you can view different metrics of the workloads:
As we wrap up this post, it's essential to note that this is only the tip of the iceberg when it comes to the potential of OpenLLM. I encourage you to experiment with more models and share your experience with us! Stay tuned for the next post, in which I will talk more about OpenLLM’s integrations with other tools.
Happy coding ⌨️!
To learn more about BentoML, OpenLLM, and other ecosystem tools, check out the following resources: