Announcing OpenLLM: An Open-Source Platform for Running Large Language Models in Production

June 21, 2023 • Written By Sherlock Xu

We are thrilled to announce the open-source release of OpenLLM under the Apache 2.0 license! OpenLLM is an open platform designed to streamline the deployment and operation of large language models (LLMs) in production. With OpenLLM, you can run inference with any open-source LLM, deploy models to the cloud or on-premises, and build powerful AI applications. It supports a wide range of open-source LLMs and offers flexible APIs with first-class support for BentoML and LangChain.


LLMs are deep learning models that have been trained on extensive text data, enabling them to understand and generate new content. While LLMs like GPT-4 from OpenAI and PaLM 2 from Google have shown promising results, organizations may be hesitant to adopt the technology due to several limitations.

  • Security concerns: Relying on commercial LLM providers can expose sensitive data, such as personally identifiable information and corporate secrets, to significant security risks.
  • Fine-tuning requirements: Each model may perform differently and may only be suitable for certain types of tasks. Large enterprises often require their models to perform specific tasks based on their own datasets and cater to their specific use cases. Currently, however, there is no easy way to fine-tune these foundational models to meet their needs.
  • High costs: Running inference on foundation models like GPT-4 can be extremely costly. For example, OpenAI charges based on the number of tokens processed, which becomes especially expensive when dealing with large volumes of text.

In the wake of the ChatGPT frenzy, open-source LLMs such as Dolly and Flan-T5 have emerged, providing more flexibility as organizations can deploy them locally and run smaller models that are fine-tuned for their specific use cases.

At BentoML, our goal is to bridge the gap between training ML models and deploying them in production. We believe this process should be facilitated with tools that prioritize ease-of-use, flexibility, openness, and transparency. As such, we are open-sourcing OpenLLM to empower software engineers to better fine-tune, serve, and deploy their models to production. In addition to deploying an LLM into an API endpoint, OpenLLM enables you to build applications on top of it. It integrates seamlessly with BentoML and LangChain, allowing you to compose or chain LLM inference with other AI models, such as StableDiffusion, Whisper, or any custom models, and build LangChain applications with OpenLLM and BentoML.

Key features

OpenLLM offers a rich set of features:

  • SOTA LLM support: Natively supports a wide range of open-source LLMs and model runtimes, such as StableLM, Falcon, Dolly, Flan-T5, ChatGLM, and StarCoder. 
  • Self-hosted: Run OpenLLM on your own GPU server.
  • Flexible APIs: Serve LLMs over a RESTful API or gRPC with one command and make queries through a Web UI, CLI commands, Python/JavaScript clients, or any HTTP client.
  • Freedom to build: Provides first-class support for BentoML and LangChain, enabling you to easily create your own AI applications by combining LLMs with other models and services.
  • Streamlined deployment: Automatically generate Docker images for your LLM server, or deploy models as serverless endpoints through BentoCloud. BentoCloud automatically provisions GPU resources and supports autoscaling to handle traffic fluctuations (including scaling to zero), and you only pay for the resources you use.
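The "Freedom to build" point above can be sketched as a tiny composition helper. The wrapper below is illustrative and not part of the OpenLLM API; in practice, the `llm_query` argument would be something like the `query` method of `openllm.client.HTTPClient("http://localhost:3000")` shown later in this post, and the stand-in `fake_llm` only exists so the sketch runs without a live server.

```python
# Illustrative sketch: chain an LLM call with post-processing steps behind
# one function. Not part of the OpenLLM API.

def build_pipeline(llm_query, *steps):
    """Compose an LLM query function with any number of post-processing steps."""
    def run(prompt):
        result = llm_query(prompt)
        for step in steps:
            result = step(result)
        return result
    return run

# Stand-in for a live client so the sketch runs without a server.
def fake_llm(prompt):
    return f"  Answer to: {prompt}  "

pipeline = build_pipeline(fake_llm, str.strip, str.upper)
print(pipeline("What are LLMs?"))  # ANSWER TO: WHAT ARE LLMS?
```

With a running server, you would swap `fake_llm` for the client's `query` method; LangChain offers a richer version of the same chaining idea.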

Getting started

To use OpenLLM, you need to have Python 3.8 or later and pip installed on your machine. We highly recommend using a virtual environment to prevent package conflicts.

1. Install OpenLLM with pip.

pip install openllm

2. Once the installation is complete, you can view the supported open-source LLMs with the following command.

openllm models -o porcelain

The example output is as follows. By default, OpenLLM doesn't include the dependencies to run all models. You may need to install model-specific dependencies to run them. See Supported Models for details.

flan-t5
dolly-v2
chatglm
starcoder
falcon
stablelm

3. You can easily start a model as a REST server with OpenLLM. The following command uses dolly-v2 as an example.

openllm start dolly-v2

To serve a specific variant of a model, provide the --model-id option as follows:

openllm start dolly-v2 --model-id databricks/dolly-v2-7b
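Loading model weights can take a while after `openllm start` launches. A small readiness check, written here with only the Python standard library, can poll the server before you send queries (this helper is illustrative, not part of OpenLLM; the port matches the default endpoint at http://localhost:3000 used below):

```python
# Illustrative helper: poll a URL until the server responds or we time out.
import time
import urllib.error
import urllib.request

def wait_until_ready(url, timeout=60.0, interval=1.0):
    """Return True once `url` responds, or False after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5):
                return True
        except (urllib.error.URLError, OSError):
            time.sleep(interval)
    return False

# Example: wait_until_ready("http://localhost:3000", timeout=120)
```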

4. OpenLLM provides a built-in Python client. You can interact with the model by creating a client and sending a query to the endpoint at http://localhost:3000 as follows. The server also exposes a Web UI at the same address for interaction and experimentation.

import openllm

client = openllm.client.HTTPClient('http://localhost:3000')
client.query('What are large language models?')

5. Alternatively, use the openllm query command to query the model from a separate terminal:

export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'What are large language models?'

Expected output:

Processing query: What are large language models?
Responses: Large language models (LLMs) are artificial intelligence (AI) systems that can parse natural language and provide responses similar to human responses. These systems can be trained with vast amounts of data in order to produce human-like responses to natural language.
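Because the server speaks plain HTTP, any HTTP client works as well, as noted in the features list. The sketch below builds a request with only the standard library; the `/v1/generate` path and the `{"prompt": ...}` payload shape are assumptions for illustration, so check the server's Web UI for the exact schema your version exposes.

```python
# Illustrative sketch: query the server with the standard library instead of
# the built-in client. Endpoint path and payload shape are assumptions.
import json
import urllib.request

def build_generate_request(base_url, prompt):
    """Build a POST request carrying the prompt as a JSON body."""
    payload = json.dumps({"prompt": prompt}).encode("utf-8")
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/generate",  # assumed endpoint path
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request("http://localhost:3000", "What are large language models?")
print(req.full_url)  # http://localhost:3000/v1/generate
# With a running server: urllib.request.urlopen(req).read()
```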

6. After you fine-tune the model, you can build a bento based on it. A bento is a deployable artifact containing all the application information, including the model, code, and dependencies.

openllm build dolly-v2

7. You can then containerize your model and deploy it to BentoCloud or your own Kubernetes cluster. For more information, see the BentoML documentation.

What's next?

We seek to empower every organization to compete and succeed with AI applications, and the release of OpenLLM marks an important milestone in this endeavor. As we work to further expand the BentoML ecosystem, we will continue improving OpenLLM in terms of quantization, performance, and fine-tuning capabilities, and we welcome contributions of all kinds to the project. Check out the OpenLLM repository to start your OpenLLM journey, and stay tuned for more announcements about OpenLLM and BentoML.

About BentoML

BentoML is the platform for AI developers to build, ship, and scale AI applications. Headquartered in San Francisco, BentoML’s open-source products power mission-critical AI applications at thousands of organizations around the globe. Our serverless cloud platform brings developer velocity and cost-efficiency to enterprise AI use cases. BentoML is on a mission to empower every organization to compete and succeed with AI. Visit our website to learn more.