August 31, 2023 • Written By Sherlock Xu
Llama 2, developed by Meta, is a series of pretrained and fine-tuned generative text models ranging from 7 billion to a staggering 70 billion parameters. These models have outperformed many of their open-source counterparts on external benchmarks covering reasoning, coding, proficiency, and knowledge. Since its launch, many customized Llama 2 variants have also emerged to support task-specific use cases.
For those looking to deploy Llama 2 and its customized variants, OpenLLM can be a helpful open-source platform, since it supports running inference with any open-source large language model (LLM). With adequate resources, such as memory and GPUs, launching a Llama 2 model locally via OpenLLM is straightforward. Moreover, as OpenLLM provides first-class support for BentoML, you can use the unified AI application framework to package the model, create a Docker image, and ship it to production.
However, the real challenge surfaces when attempting to scale AI applications embedded with LLMs like Llama 2. You may need to plan your resources on cloud platforms wisely and even delve into Kubernetes intricacies.
To address this, you can choose BentoCloud as an end-to-end solution for deploying and scaling production-ready AI applications. BentoCloud is designed to streamline AI application development and expedite the delivery lifecycle. Its integration with both OpenLLM and BentoML reduces deploying Llama 2 models on the cloud to a simple command and a handful of clicks. In addition, it allows for flexible autoscaling (with scale-to-zero support) based on workload traffic, so you only pay for what you use. The best part? You’re liberated from the complexities of managing the underlying infrastructure, while retaining comprehensive insights into your application’s observability metrics.
In this blog post, I will guide you step by step through deploying the Llama 2 7B model using BentoCloud.
Make sure you meet the following prerequisites.
First, log in to BentoCloud. This requires you to have a Developer API token, which allows you to access BentoCloud and manage different cloud resources. See the BentoCloud documentation to learn more.
After you log in, run the following command to build a Bento with any of the Llama 2 variants and push it to BentoCloud. This example uses `meta-llama/Llama-2-7b-chat-hf` for demonstration (run `openllm models` to see all the supported models). The `--backend=vllm` option enables vLLM optimizations for higher throughput and lower latency during inference. The `--push` option pushes the resulting Bento to BentoCloud directly, which may take some time depending on your network conditions.
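If you script this step (for example in CI), the command can be assembled from the pieces described above. The following is a minimal sketch: the model ID and the `--backend=vllm` and `--push` options come from this post, while passing the model ID as a positional argument and the `build_command` helper itself are assumptions made for illustration. Check `openllm build --help` for the exact syntax of your OpenLLM version.

```python
import shlex

# Model used throughout this post.
MODEL_ID = "meta-llama/Llama-2-7b-chat-hf"


def build_command(model_id=MODEL_ID, backend="vllm", push=True):
    """Assemble the `openllm build` invocation as an argument list.

    Hypothetical helper: the positional model ID is an assumption,
    so verify it against `openllm build --help` before relying on it.
    """
    args = ["openllm", "build", model_id, f"--backend={backend}"]
    if push:
        # --push uploads the resulting Bento to BentoCloud after the build.
        args.append("--push")
    return args


# Print the command so it can be reviewed or pasted into a terminal.
print(shlex.join(build_command()))
```

To run the build from Python rather than just printing it, you could pass the list returned by `build_command()` to `subprocess.run(..., check=True)`.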
Note: When running the above command, make sure you use the same context as the one for BentoCloud login. The context can be specified via
The `openllm build` command builds a Bento with the specified model. If the model has not been registered in the local BentoML Model Store, OpenLLM first downloads it automatically and then builds the Bento. Run `bentoml list` to view the Bento.
After the Bento has been uploaded to BentoCloud, you can find it on the Bento Repositories page. Following is the details page of the Bento.
With the Bento pushed to BentoCloud, you can start to deploy it.
Go to the Deployments page and click Create. BentoCloud offers two Deployment options: Online Service and On-Demand Function. For this example, select the latter, which suits scenarios with loose latency requirements and large inference requests.
You can then set up the Bento Deployment in one of the following three ways.
Select the Advanced tab and specify the required fields (marked with asterisks). Pay attention to the following fields:
bentoml list locally.
`cpu.medium` for API Servers and `gpu.a10g.xlarge` for Runners. The Llama 2 7B model weights occupy approximately 14 GB of GPU memory. Since `gpu.a10g.xlarge` comes with 24 GB of GPU memory, this allocation fits the model weights comfortably and leaves headroom for inference.
The `meta-llama/Llama-2-7b-chat-hf` model is gated, so you need to obtain access approval and provide your Hugging Face token via an environment variable. For both the API Server and Runner, set `HUGGING_FACE_HUB_TOKEN` as the key, with your Hugging Face token (beginning with `hf_`) as its value. If you are using an open Llama 2 compatible model, you do not need to set this environment variable.
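The GPU sizing above can be checked with quick arithmetic. This sketch assumes the weights are stored in 16-bit precision (2 bytes per parameter), which is what yields the roughly 14 GB figure, and uses decimal gigabytes. The `weights_gb` helper is only for illustration.

```python
def weights_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed for model weights, in decimal GB.

    bytes_per_param=2 assumes fp16/bf16 weights (an assumption, not
    something the deployment page reports).
    """
    return num_params * bytes_per_param / 1e9


LLAMA_2_7B_PARAMS = 7e9   # 7 billion parameters
A10G_MEMORY_GB = 24       # GPU memory of gpu.a10g.xlarge, per this post

weights = weights_gb(LLAMA_2_7B_PARAMS)
headroom = A10G_MEMORY_GB - weights

print(f"weights: {weights:.0f} GB, headroom: {headroom:.0f} GB")
```

The remaining headroom is what the Runner has available for activations and the key-value cache during inference, so longer prompts and larger batches eat into it.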
For other fields, you can use the default values or customize them as needed. For more information about properties on this page, see Deployment creation and update information.
When you are done, click Submit. The deployment may take some time. When it is ready, both the API Server and Runner Pods should be active.
With the Llama 2 application ready, you can access it through the URL exposed by BentoCloud.
On the Overview tab of its details page, click the link under URL. If you do not set any access control policy (i.e. select Public for Endpoint Access Type), you should be able to access the link directly. The Swagger UI looks like the following:
In the Service APIs section, select the `generate` API and click Try it out. Enter your prompt, configure other parameters as needed, and click Execute. You can find the answer in the Responses section. Instead of using the Swagger UI, you can also send requests with `curl`; the corresponding command is displayed in the Responses section as well.
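If you would rather call the endpoint from code, the same request can be sent with the Python standard library. This is a sketch under stated assumptions: the `/v1/generate` path and the `prompt` / `llm_config` payload shape are assumed here and may differ in your OpenLLM version, so copy the exact path and body from the `curl` command shown in the Swagger UI. The helper names and the placeholder URL are hypothetical.

```python
import json
import urllib.request


def build_generate_request(base_url, prompt, max_new_tokens=128):
    """Build a POST request for the generate API.

    Assumed (not confirmed by this post): the endpoint path and the
    JSON field names. Adjust them to match your Swagger UI.
    """
    payload = {
        "prompt": prompt,
        "llm_config": {"max_new_tokens": max_new_tokens},
    }
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(base_url, prompt, timeout=120):
    """Send the request and return the decoded JSON response."""
    request = build_generate_request(base_url, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

For example, `generate("https://<your-deployment-url>", "What is a Bento?")` would return the parsed response, where the URL is the one shown on the Overview tab. This works as-is only if Endpoint Access Type is Public; a protected endpoint needs whatever credentials your access policy requires.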
On the Monitoring tab, you can view different metrics of the workloads:
Deploying sophisticated models like Llama 2 can often seem like a daunting task. However, as I’ve explained in this article, BentoCloud can significantly simplify this process. Whether you’re a seasoned developer or just starting out in the AI landscape, using BentoCloud allows you to focus on what truly matters: building and refining your AI applications. As the AI realm continues to evolve, tools that enhance efficiency and reduce complexities will be paramount. I hope this guide has provided you with a clear way to harness the power of Llama 2 using BentoCloud, and I encourage you to explore further and experiment with your own AI projects.
Happy coding ⌨️, and until next time!
To learn more about BentoML, OpenLLM, and other ecosystem tools, check out the following resources: