OpenLLM in Action Part 1: Understanding the Basics of OpenLLM

September 21, 2023 • Written By Sherlock Xu

⚠️ Outdated content

Note: Some of the content in this blog post may not be applicable. To learn more about the latest features of OpenLLM and its usage, see the OpenLLM readme.

When ChatGPT took the world by storm, it illuminated the vast potential of large language models (LLMs) and their profound implications for businesses spanning a variety of industries. During the same period, the AI industry witnessed rapid transformations, with much of this dynamism fueled by the ChatGPT frenzy. However, as companies race to seize the ensuing opportunities, they realize that the deployment of LLMs in production has become a complex yet essential task. From creating responsive chatbots to building intuitive search systems, the foundation lies in effectively and efficiently harnessing the power of LLMs. Enter OpenLLM, an open-source solution in the BentoML ecosystem designed specifically to solve the challenges in deploying LLMs in production.

This is the first installment of our blog series, OpenLLM in Action. Whether you’re a seasoned application developer, an AI enthusiast, or someone new to the domain, this series aims to guide you through the intricacies of deploying LLMs in production with OpenLLM. By the end of the series, you should have a solid understanding of OpenLLM’s capabilities and know how to use its features to your advantage.

Throughout this series, we’ll embark on a journey that begins with the rudiments of OpenLLM and gradually levels up to mastering its advanced features. Here’s a glimpse of our roadmap:

  • The basics of OpenLLM
  • Deploying LLMs like Llama 2 13B and 70B in production on BentoCloud
  • OpenLLM’s integration with tools like BentoML, LangChain, and Transformers Agents
  • Key features like quantization, embeddings, streaming, and fine-tuning
  • Best practices, tips, and FAQs

These topics will be covered across the blog series with concrete examples and hands-on practices, explaining how to build production-ready AI applications with state-of-the-art machine learning models.

In this Part 1 blog post, we will look at OpenLLM from a high-level perspective. First, let’s briefly review the challenges facing us when deploying LLMs in production.

Challenges in LLM deployment

Deploying LLMs like Llama 2 is no small feat. While they do bring about revolutionary capabilities, utilizing them effectively in real-world applications presents several challenges.


Deploying state-of-the-art LLMs involves significant computational resources. Training, fine-tuning, and even inferencing with these models can result in hefty cloud bills. The financial burden becomes a limiting factor, especially for startups or smaller entities that may not have large budgets for AI projects.
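
To make the cost concern concrete, here is a rough back-of-the-envelope sketch of how much memory a model's weights alone occupy at different numeric precisions. The helper name and the precision table are illustrative and not part of OpenLLM. The estimate ignores activations, the KV cache, and framework overhead, so real requirements are higher.

```python
# Rough estimate of the memory needed just to hold model weights.
# Illustrative helper, not an OpenLLM API; ignores activations, KV cache, and overhead.

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params: float, precision: str = "fp16") -> float:
    """Return the approximate weight size in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for name, params in [("13B model", 13e9), ("70B model", 70e9)]:
        for precision in ("fp16", "int8", "int4"):
            size = weight_memory_gb(params, precision)
            print(f"{name} @ {precision}: ~{size:.1f} GB")
```

Under these assumptions, 16-bit weights take about 26 GB for a 13B-parameter model and about 140 GB for a 70B-parameter model, which is why serving such models typically calls for high-memory or multiple GPUs.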


Real-time applications require prompt responses. However, with the sheer size of LLMs, there's inherent latency from model inferencing. This latency can impede the user experience, especially in applications like chatbots or real-time content generation where users expect instantaneous feedback.

Data privacy

With LLMs learning from vast amounts of data, organizations can be exposed to significant security risks in terms of sensitive data. Ensuring that outputs respect user privacy and maintaining ethical considerations becomes paramount. There’s also the challenge of deploying models in a way that adheres to data protection regulations.


As user bases grow, the demands on your LLM-powered services will increase. Ensuring that the deployment architecture can handle such growth without compromising speed or accuracy is a pressing concern.


The rapid evolution of the AI industry means tools and frameworks are continually emerging. Integrating LLMs with a plethora of tools can be daunting. Ensuring compatibility, staying updated with the latest tools, and leveraging them for optimal deployment also represents a continuous challenge.

To address these challenges, it’s essential to introduce a solution that simplifies the deployment process, and this is where OpenLLM shines.


OpenLLM is a one-stop solution tailored to solve the major challenges of LLM deployment. Specifically, it provides the following features:

  • Diverse model support: OpenLLM extends its compatibility to a wide array of state-of-the-art LLMs, such as Llama 2, StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. The platform is built to handle a diverse range of open-source models and runtimes.
  • Flexible APIs: OpenLLM eliminates the complexities of model interaction by offering a RESTful API and gRPC support. Whether you prefer a Web UI, CLI, Python/JavaScript clients, or any other HTTP client, OpenLLM has got you covered.
  • Ecosystem integration: OpenLLM seamlessly integrates with LangChain, BentoML, and Hugging Face, enabling developers to create bespoke AI applications by composing LLMs with other models and services.
  • Effortless deployment: With OpenLLM, deployment is no longer a daunting task. You can use it to create Docker images for your LLM server or deploy it as serverless endpoints via BentoCloud. This serverless cloud platform helps you manage the underlying infrastructure, scales based on traffic, and guarantees cost-efficiency.
  • Bring your own LLM: OpenLLM doesn’t confine you. You can fine-tune any LLM to align with your requirements. With support for LoRA layers, you can enhance model accuracy and performance for specialized tasks.
  • Quantization: OpenLLM allows you to run inference with lower computational and memory costs through quantization techniques like bitsandbytes and GPTQ.
  • Streaming: OpenLLM supports token streaming through server-sent events (SSE). You can use the /v1/generate_stream endpoint for streaming responses from LLMs.
  • Continuous batching: You can maximize throughput with OpenLLM’s support for continuous batching through vLLM.
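
To illustrate the streaming feature listed above, the following sketch shows how a client might consume a server-sent event stream such as the one exposed at /v1/generate_stream. It implements only simplified, generic SSE parsing (lines beginning with `data:`, events terminated by a blank line). The helper name `iter_sse_data` and the sample payloads are hypothetical, and the exact payload format that OpenLLM emits is not shown here.

```python
from typing import Iterable, Iterator, List


def iter_sse_data(lines: Iterable[str]) -> Iterator[str]:
    """Yield the data payload of each server-sent event in a stream of text lines.

    Simplified SSE handling: "data:" lines accumulate, a blank line ends the event.
    """
    buffer: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("data:"):
            value = line[len("data:"):]
            # SSE strips a single leading space after the colon, if present.
            buffer.append(value[1:] if value.startswith(" ") else value)
        elif line == "" and buffer:
            # A blank line dispatches the event; multiple data lines join with "\n".
            yield "\n".join(buffer)
            buffer = []
        # Other fields (event:, id:, retry:) and comment lines are ignored here.
    # An event not terminated by a blank line is dropped, as the SSE format specifies.


if __name__ == "__main__":
    # Hypothetical stream contents; a real client would iterate over HTTP response lines.
    sample = ["data: The", "", "data:  Earth", "", "data:  is", ""]
    print("".join(iter_sse_data(sample)))
```

In a real client, `lines` would come from the decoded lines of the streaming HTTP response, and each payload would be handled as soon as it arrives rather than collected at the end.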


As an important component of the BentoML ecosystem, OpenLLM follows a workflow similar to BentoML’s for shipping models to production. The following is the deployment workflow from a high-level perspective.

  1. Prepare a model. Using OpenLLM to initiate an LLM service starts with either downloading an ML model from Hugging Face or integrating your model using the openllm import command. Even without a ready-to-use model locally, you can still use OpenLLM to start an LLM server directly, since it automatically downloads the model for you. This behavior is applicable to common commands used in the OpenLLM workflow like openllm start and openllm build.
  2. Try and test. You can easily spin up an LLM server with openllm start and test inference runs. The platform offers great flexibility: whether you’re looking to fine-tune the model or integrate it with tools like BentoML, LangChain, or Transformers Agents, you can easily define the serving logic.
  3. Build a Bento. Once you are satisfied with the model, package the model and all the necessary dependencies into a single unit known as a Bento. This artifact is BentoML’s standard distribution unit, streamlining deployment, sharing, and containerization.
  4. Deploy the Bento. You can containerize the Bento with Docker and deploy it to any Docker-compatible environment. Alternatively, push the Bento to BentoCloud, which runs AI applications on Kubernetes and scales responsively with incoming traffic. By offloading the complexities of infrastructure and Kubernetes management, you’re free to concentrate solely on model refinement and development.

Now, let’s walk through a quickstart guide to get hands-on with OpenLLM, following the above workflow.


To get started with OpenLLM, ensure you have Python 3.8 (or a later version) and pip installed. You can then install OpenLLM with this simple command:

pip install openllm

To quickly start an LLM server, use openllm start LLM_NAME and specify a model ID. For example, to start a facebook/opt-2.7b model, run:

openllm start facebook/opt-2.7b

OpenLLM downloads the model to the BentoML Model Store automatically if it is not available locally. To see the LLMs that OpenLLM supports, run openllm models. To view downloaded models, run bentoml models list.

By default, the LLM server starts at http://localhost:3000. You can visit the web UI to interact with the /v1/generate API or send a request using curl. Alternatively, use OpenLLM’s built-in Python client:

import openllm

# Connect to the locally running OpenLLM server
client = openllm.client.HTTPClient('http://localhost:3000')
# Send a prompt and print the generated response
result = client.query('What is the age of Earth?')
print(result)
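
Because the server exposes a plain HTTP API, any HTTP client can call it. As a minimal sketch using only the Python standard library, the snippet below builds and sends a POST request to the /v1/generate endpoint mentioned above. The JSON body shape (a `prompt` field plus an `llm_config` object with `max_new_tokens`) is an assumption based on OpenLLM releases from around the time of writing and may differ in your version, so check the schema shown in the server's web UI. The helper names are hypothetical.

```python
import json
import urllib.request


def build_generate_request(
    prompt: str, base_url: str = "http://localhost:3000"
) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/generate endpoint."""
    # Assumed request schema; verify it against your server's web UI.
    body = {"prompt": prompt, "llm_config": {"max_new_tokens": 128}}
    return urllib.request.Request(
        url=f"{base_url}/v1/generate",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, base_url: str = "http://localhost:3000") -> dict:
    """Send the request and return the decoded JSON response.

    Requires an OpenLLM server to be running at base_url.
    """
    request = build_generate_request(prompt, base_url)
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read())
```

With a server running locally, calling `generate("What is the age of Earth?")` would return the parsed JSON response. Separating request construction from sending makes it possible to inspect the request without a live server.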

As mentioned above, when testing model inference, you may want to fine-tune the model or add custom code to integrate it with other tools (for example, by defining a BentoML Service file). These are more advanced use cases of OpenLLM, so I will explain them in more detail in subsequent blog posts. For a quick start, you can build the model directly by running openllm build.

openllm build facebook/opt-2.7b

This creates a Bento in your local Bento Store with a specific tag. To containerize it, run:

bentoml containerize BENTO_TAG

This generates an OCI-compatible Docker image that can be deployed anywhere. For best scalability and reliability of your LLM service in production, you can deploy it via BentoCloud.


Our exploration of OpenLLM is far from over. In the upcoming blog post, our focus will pivot to the synergy of OpenLLM and BentoCloud, especially in the context of deploying monumental LLMs like Llama 2 13B and 70B. Stay tuned and join us on this exciting journey in the next installment!

More on BentoML and OpenLLM

To learn more about BentoML, OpenLLM, and other ecosystem tools, check out the following resources: