September 21, 2023 • Written By Sherlock Xu
Note: Some of the content in this blog post may be outdated. To learn about the latest features of OpenLLM and its usage, see the OpenLLM readme.
When ChatGPT took the world by storm, it illuminated the vast potential of large language models (LLMs) and their profound implications for businesses spanning a variety of industries. During the same period, the AI industry witnessed rapid transformations, with much of this dynamism fueled by the ChatGPT frenzy. However, as companies race to seize the ensuing opportunities, they realize that the deployment of LLMs in production has become a complex yet essential task. From creating responsive chatbots to building intuitive search systems, the foundation lies in effectively and efficiently harnessing the power of LLMs. Enter OpenLLM, an open-source solution in the BentoML ecosystem designed specifically to solve the challenges in deploying LLMs in production.
This is the first installment of our blog series, OpenLLM in Action. Whether you're a seasoned application developer, an AI enthusiast, or someone new to the domain, this series aims to guide you through the intricacies of deploying LLMs in production with OpenLLM. By the end of this comprehensive guide, you should have a deep understanding of OpenLLM's capabilities and know how to leverage its features to your advantage.
Throughout this series, we'll embark on a journey that begins with the rudiments of OpenLLM and gradually levels up to mastering its advanced features. We'll cover these topics with concrete examples and hands-on practice, explaining how to build production-ready AI applications with state-of-the-art machine learning models.
In this Part 1 blog post, we will look at OpenLLM from a high-level perspective. First, let's briefly review the challenges of deploying LLMs in production.
Deploying LLMs like Llama 2 is no small feat. While they do bring about revolutionary capabilities, utilizing them effectively in real-world applications presents several challenges.
Deploying state-of-the-art LLMs requires significant computational resources. Training, fine-tuning, and even running inference with these models can result in hefty cloud bills. This financial burden becomes a limiting factor, especially for startups and smaller organizations that may not have large budgets for AI projects.
Real-time applications require prompt responses. However, given the sheer size of LLMs, model inference carries inherent latency. This latency can degrade the user experience, especially in applications like chatbots or real-time content generation where users expect instantaneous feedback.
Because LLMs learn from vast amounts of data, organizations can be exposed to significant security risks involving sensitive data. Ensuring that outputs respect user privacy and maintaining ethical standards become paramount. There's also the challenge of deploying models in a way that complies with data protection regulations.
As user bases grow, the demands on your LLM-powered services will increase. Ensuring that the deployment architecture can handle such growth without compromising speed or accuracy is a pressing concern.
The rapid evolution of the AI industry means tools and frameworks are continually emerging. Integrating LLMs with this plethora of tools can be daunting. Ensuring compatibility, staying updated with the latest tools, and leveraging them for optimal deployment represent a continuous challenge.
To address these challenges, we need a solution that simplifies the process, and this is where OpenLLM shines.
OpenLLM is a one-stop solution tailored to solve the major challenges of LLM deployment. Specifically, it provides the following features:
- A /v1/generate_stream endpoint for streaming responses from LLMs.
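For example, you can consume the streaming endpoint with curl once a server is running. The sketch below is illustrative; the exact payload fields accepted by the endpoint depend on your OpenLLM version, so treat the request body as an assumption:

```bash
# Stream tokens from a running OpenLLM server (-N disables output buffering).
# The "prompt" field is an assumed payload shape, not a documented contract.
curl -N -X POST http://localhost:3000/v1/generate_stream \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "Explain LLM deployment in one sentence."}'
```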
As an important component in the BentoML ecosystem, OpenLLM follows a similar workflow as BentoML for shipping models to production. Here is the deployment workflow from a high-level perspective:
1. Prepare a model. You can download one in advance with the openllm import command. Even without a ready-to-use model locally, you can still use OpenLLM to start an LLM server directly, since it automatically downloads the model for you. This behavior applies to common commands in the OpenLLM workflow like openllm start and openllm build.
2. Start an LLM server with openllm start and test inference runs. The platform offers great flexibility: whether you're looking to fine-tune the model or integrate it with tools like BentoML, LangChain, or Transformers Agent, you can easily define the serving logic.

Now, let's explore a quickstart guide to get your hands on OpenLLM following the above workflow.
To get started with OpenLLM, ensure you have Python 3.8 (or a later version) and pip installed. You can then install OpenLLM with this simple command:
pip install openllm
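A quick way to confirm the installation succeeded is to print the CLI help:

```bash
# If the installation worked, this lists the available OpenLLM commands
openllm -h
```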
To quickly start an LLM server, use openllm start and specify a model ID. For example, to start a facebook/opt-2.7b model, run:
openllm start facebook/opt-2.7b
OpenLLM downloads the model to the BentoML Model Store automatically if it is not available locally. To see all OpenLLM-supported LLMs, run openllm models; to view the models already downloaded, run bentoml models list.
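Both listing commands come straight from the OpenLLM and BentoML CLIs:

```bash
# List the LLMs that OpenLLM supports out of the box
openllm models

# List the models stored in your local BentoML Model Store
bentoml models list
```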
The LLM server starts at http://0.0.0.0:3000/ by default. You can visit the web UI to interact with the /v1/generate API or send a request using curl. Alternatively, use OpenLLM's built-in Python client:
import openllm

# Connect to the local OpenLLM server and send a synchronous query
client = openllm.client.HTTPClient('http://localhost:3000')
result = client.query('What is the age of Earth?')
print(result)
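For the curl route mentioned above, a request might look like the following. This is a sketch; the JSON fields accepted by /v1/generate vary across OpenLLM versions, so treat the payload shape as an assumption:

```bash
# Send a prompt to the generate endpoint (the "prompt" field is assumed)
curl -X POST http://localhost:3000/v1/generate \
  -H 'Content-Type: application/json' \
  -d '{"prompt": "What is the age of Earth?"}'
```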
As mentioned above, when testing model inference, you may want to fine-tune the model or add custom code to integrate it with other tools (for example, by defining a BentoML Service file). These are more advanced use cases of OpenLLM, so I will explain them in more detail in subsequent blog posts. For a quick start, you can build the model directly by running openllm build:
openllm build facebook/opt-2.7b
This creates a Bento in your local Bento Store with a specific tag. To containerize it, run:
bentoml containerize BENTO_TAG
This generates an OCI-compatible Docker image that can be deployed anywhere. For best scalability and reliability of your LLM service in production, you can deploy it via BentoCloud.
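As an illustration, you could then run the image locally with Docker before deploying it. The image tag below is a placeholder; use the tag printed by bentoml containerize, and drop --gpus all when running on CPU:

```bash
# Run the containerized LLM server and expose the default port 3000.
# "opt-2.7b-service:latest" is an illustrative tag, not a real artifact name.
docker run --rm --gpus all -p 3000:3000 opt-2.7b-service:latest
```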
Our exploration of OpenLLM is far from over. In the upcoming blog post, our focus will pivot to the synergy of OpenLLM and BentoCloud, especially in the context of deploying massive LLMs like Llama 2 13B and 70B. Stay tuned and join us on this exciting journey in the next installment!
To learn more about BentoML, OpenLLM, and other ecosystem tools, check out the following resources: