OpenLLM in Action Part 3: Streamlining AI Application Development with Integrations

December 7, 2023 • Written By Sherlock Xu

In the dynamic world of AI, an open-source tool's success isn't just about its standalone capabilities; it's also about how well it integrates with the broader ecosystem of evolving AI technologies. This is especially important for newcomers in the industry. Developers, when considering a new tool, often weigh its compatibility and integration potential with their existing systems.

One of OpenLLM's design philosophies is a commitment to meeting the diverse needs of LLM application developers. We understand that they work with a variety of frameworks and tools, and seamless integration can significantly streamline their workflow and boost their efficiency. For this reason, OpenLLM isn't designed to be yet another standalone project; it's a versatile building block designed to integrate with a wide range of popular LLM development frameworks.

In this third installment of the OpenLLM in Action series, I'll introduce the integrations supported by OpenLLM, with a quickstart code example for each. These integrations illustrate OpenLLM's flexibility and its capability to enhance and enrich LLM application development.

Note: Before you try the code examples in the following sections, I suggest you set up a separate virtual environment for each integration. This helps you manage dependencies and avoid potential conflicts.
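As a rough illustration of the note above, Python's standard-library venv module can create one isolated environment per integration. This is only a sketch: the function name create_envs and the environment names below are hypothetical choices for this example, not anything required by OpenLLM.

```python
import venv
from pathlib import Path

# Hypothetical environment names, one per integration covered in this post.
INTEGRATIONS = ("bentoml", "llamaindex", "openai", "langchain", "transformers-agents")


def create_envs(base_dir, names=INTEGRATIONS, with_pip=True):
    """Create one virtual environment per name under base_dir and return their paths."""
    base = Path(base_dir)
    paths = {}
    for name in names:
        env_path = base / name
        # with_pip=True bootstraps pip so dependencies can be installed afterwards.
        venv.create(env_path, with_pip=with_pip)
        paths[name] = env_path
    return paths
```

Calling create_envs("envs") would create envs/bentoml, envs/llamaindex, and so on; activate the relevant environment before running the pip install command for that integration.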


BentoML

As an important component of the BentoML ecosystem, OpenLLM fits naturally into the BentoML workflow. When defining a BentoML Service, you can create a Runner object from an LLM instance created through openllm.LLM. Below is a simple example of creating a BentoML Service with OpenLLM, using the facebook/opt-2.7b model for text generation.

import bentoml
import openllm

# Initialize an LLM instance
llm = openllm.LLM('facebook/opt-2.7b')

# Create a BentoML Service with the LLM Runner
svc = bentoml.Service(name='llm-opt-service', runners=[llm.runner])

# Define an API endpoint for the Service (plain text in, plain text out)
@svc.api(input=bentoml.io.Text(), output=bentoml.io.Text())
async def prompt(input_text: str) -> str:
    generation = await llm.generate(input_text)
    return generation.outputs[0].text

To serve the LLM Service locally, save the above script to a file whose module name is service (to match the service:svc target below) and run:

bentoml serve service:svc

This command starts the server and exposes the prompt API endpoint for external interactions. You can send a request using the following curl command:

curl -X 'POST' \
  '' \
  -H 'accept: text/plain' \
  -H 'Content-Type: text/plain' \
  -d 'What is the weight of the Earth?'

Example output:

The mass of the Earth is 6.6 million tonnes, which is about equal to 6.6 million times the mass of the moon. The average mass of the Earth is about 50,000 tonnes, making up about one fifth of the mass of the Solar System. The density of the Earth is about 50 kg/m3, which is slightly lower than the density of water at about 50 kg/m3. To find out the weight of the Earth we can use the mass of the Moon and the mass of the planet itself. We know how much each object weighs and we can find out how much the Earth’s mass is by multiplying those weights together. To find out the mass of the Moon we can use the following formula which takes into account the density of the Earth, the mass of the Moon and the mass of the Earth’s core: The mass of the Moon is about 1.1 x 1019 tonnes. Multiplying that by its density of 50 kg/m3 we can find out that the Moon has a mass of 1.1 x 1019 tonnes. To find out how much the Earth’s mass is we can multiply both the Moon and the Earth’s
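The curl request above can also be sent from Python. The sketch below uses only the standard library and assumes the Service listens at http://localhost:3000 with the endpoint exposed at /prompt. Both are assumptions made for illustration, so substitute the address that bentoml serve reports. The helper names build_prompt_request and send_prompt are hypothetical.

```python
import urllib.request

# Assumed address of the local Service; replace with the one printed by `bentoml serve`.
ASSUMED_URL = "http://localhost:3000/prompt"


def build_prompt_request(text, url=ASSUMED_URL):
    """Build a POST request carrying plain text, mirroring the curl headers."""
    return urllib.request.Request(
        url,
        data=text.encode("utf-8"),
        headers={"accept": "text/plain", "Content-Type": "text/plain"},
        method="POST",
    )


def send_prompt(text, url=ASSUMED_URL, timeout=300):
    """Send the prompt and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_prompt_request(text, url), timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

With the server running, send_prompt('What is the weight of the Earth?') should return the generated text as a string. Building the request separately from sending it makes it easy to inspect the headers and body first.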

After validating the Service, you can package it into a Bento for easy distribution. This standard unit encapsulates the Service, its configuration, and dependencies, making it ready for production deployment on BentoCloud or for Docker containerization. The packaging and deployment process follows the standard BentoML workflow.

For better performance, you can choose a more powerful language model and change the backend component (for example, to vLLM) in the Service code. OpenLLM also lets you add streaming support for a better user experience. For more information, check out the quickstart "Deploy a large language model with OpenLLM and BentoML".


LlamaIndex

LlamaIndex, formerly known as GPT Index, is a data framework specifically designed for use with LLMs like GPT-4. It offers a wide range of functionalities for connecting custom data sources to LLMs for various applications. The LlamaIndex-OpenLLM integration allows you to either use a local LLM or connect to a remote LLM server.

First, install the dependencies (I recommend you use vLLM as the backend component for increased throughput and lower latency):

pip install "openllm[vllm]" llama-index-llms-openllm

Start a server, which will be used later to handle requests. The model HuggingFaceH4/zephyr-7b-alpha used in the following command is a fine-tuned version of mistralai/Mistral-7B-v0.1, trained on a mix of public and synthetic datasets using Direct Preference Optimization (DPO).

openllm start HuggingFaceH4/zephyr-7b-alpha --backend vllm

The server is now active at http://localhost:3000. Use OpenLLMAPI to specify the server that will process your requests. Below is an example demonstrating a standard completion request and a streaming completion request using the server.

from llama_index.llms.openllm import OpenLLM, OpenLLMAPI

# Create a remote LLM instance and set your server address
remote_llm = OpenLLMAPI(address="http://localhost:3000")

# Make a standard completion request
completion_response = remote_llm.complete("Of course I still love you, and")
print(completion_response)

# Streaming completion request
print("Streaming completion response:")
for it in remote_llm.stream_complete("Of course I still love you, and", max_new_tokens=128):
    print(it, end="", flush=True)

Alternatively, use OpenLLM for local model inference without starting a server separately.

local_llm = OpenLLM("HuggingFaceH4/zephyr-7b-alpha")

Here is the output I received:

I would do anything for you. You're the best thing that ever happened to me, and I'm so grateful to have you in my life. I just need some time to figure things out, and to figure out what I really want. I promise that I will always be here for you, no matter what. I love you, and I want to make this work, but I need your patience and understanding. I'm sorry for any pain or hurt that I've caused, and I'm willing to do whatever it takes to make things right.

The integration also supports other useful APIs such as chat, stream_chat, achat, and astream_chat. For more information, see the integration pull request and the LlamaIndex documentation.

OpenAI compatible endpoints

This integration allows you to use OpenLLM as a direct replacement for OpenAI's API, especially useful for those familiar with or already using OpenAI's endpoints. It helps you easily switch to using OpenLLM in your existing applications with minimal changes to your code.

First, install required packages:

pip install "openllm[vllm]" openai

Launch an OpenLLM server with a specific model. Let's try Llama 2 this time. Again, I recommend using vLLM as the backend for better performance.

openllm start meta-llama/Llama-2-7b-chat-hf --backend vllm

Now, create a script like the one below to set up an OpenAI client that connects to your local OpenLLM server. It lists the available models and uses the first one for a completion request. Note that base_url points to the OpenLLM endpoint, which is read from the OPENLLM_ENDPOINT environment variable. If this variable is not set, it defaults to http://localhost:3000.

import os
import openai

# Configure the OpenAI client for OpenLLM
client = openai.OpenAI(
    base_url=os.getenv('OPENLLM_ENDPOINT', 'http://localhost:3000') + '/v1',
    api_key='na',
)

# List the available models and use the first one
models = client.models.list()
print('Models:', models.model_dump_json(indent=2))
model = models.data[0].id

# Completion API
stream = str(os.getenv('STREAM', False)).upper() in ['TRUE', '1', 'YES', 'Y', 'ON']
completions = client.completions.create(
    prompt='Write a self introduction for a software engineer role.',
    model=model,
    max_tokens=1024,
    stream=stream,
)

# Print the completion result
print(f'Completion result (stream={stream}):')
if stream:
    for chunk in completions:
        text = chunk.choices[0].text
        if text:
            print(text, flush=True, end='')
else:
    print(completions)

To run the script with streaming enabled, save it to a file and pass that file to python with STREAM set:

STREAM=True python <your_script>.py
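The script reads two environment variables: OPENLLM_ENDPOINT for the server address and STREAM for toggling streaming. The sketch below isolates that logic in two small helpers so the behavior is easy to see. The helper names resolve_base_url and stream_enabled are hypothetical and not part of OpenLLM or the OpenAI client.

```python
import os

# Values of STREAM that the script treats as "on" (compared case-insensitively).
TRUTHY = {"TRUE", "1", "YES", "Y", "ON"}


def resolve_base_url(env=os.environ):
    """Return the OpenAI-compatible base URL, defaulting to the local server."""
    return env.get("OPENLLM_ENDPOINT", "http://localhost:3000") + "/v1"


def stream_enabled(env=os.environ):
    """Interpret STREAM the same way the script does."""
    return str(env.get("STREAM", False)).upper() in TRUTHY
```

With these semantics, STREAM=yes or STREAM=1 enables streaming, while leaving STREAM unset or setting it to any other value (such as false or 0) disables it.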

Here is the output I received:

Models: {
  "data": [
    {
      "id": "meta-llama--Llama-2-7b-chat-hf",
      "created": 2723,
      "object": "model",
      "owned_by": "na"
    }
  ],
  "object": "list"
}
Completion result (stream=True):
Here are some tips to help you write one:

A self-introduction for a software engineer role should highlight your technical skills, experience, and personal qualities that make you a great fit for the position. Here are some tips to help you write one:

1. Highlight your technical skills: Make a list of your technical skills and experience in programming languages, frameworks, and software development tools. Be specific and include the versions you are proficient in.
2. Share your experience: Highlight your experience in software development, including the types of projects you have worked on, your role in the development process, and any notable accomplishments.
3. Show your passion for software development: Share your passion for software development and why you enjoy working in this field. Explain how you stay current with the latest trends and technologies in software engineering.
4. Include your educational background: If you have a degree in computer science or a related field, include it in your introduction. You can also mention any relevant certifications or training you have received.
5. Keep it concise: Aim for a self-introduction that is around 250-500 words. Keep it concise and focused on the most relevant information.

Here is an example of a self-introduction for a software engineer role:

"I am a highly motivated software engineer with over 5 years of experience in developing innovative software solutions. I hold a degree in Computer Science from XYZ University and have honed my skills in languages such as Java, Python, and C++, as well as frameworks like Spring and Django. In my current role at ABC Company, I have led the development of several successful projects, including a mobile app for tracking personal finances and a web application for managing customer relationships.

I am passionate about staying current with the latest trends and technologies in software engineering, and I regularly attend industry conferences and participate in online forums to stay informed. I am also committed to mentoring junior developers and sharing my knowledge and experience with the broader software engineering community. In my free time, I enjoy creating software tutorials and sharing them on my blog. I believe that by sharing my knowledge and experience, I can help others learn and grow in the field of software engineering. I am excited to bring my skills and experience to a new role and continue to contribute to the growth and success of a dynamic and innovative company."

Remember to customize your self-introduction to fit the specific role and organization you are applying for, and to highlight your unique strengths and qualifications. Good luck with your job search!
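The model ID passed to the Completion API comes from the first entry in the model list. As a small sketch, the same lookup can be reproduced on the JSON printed above using only the standard library. The function name first_model_id is hypothetical.

```python
import json

# The model list exactly as printed in the output above.
MODELS_JSON = """
{
  "data": [
    {
      "id": "meta-llama--Llama-2-7b-chat-hf",
      "created": 2723,
      "object": "model",
      "owned_by": "na"
    }
  ],
  "object": "list"
}
"""


def first_model_id(payload):
    """Return the ID of the first model in an OpenAI-style model list."""
    models = json.loads(payload)["data"]
    if not models:
        raise ValueError("the server reported no models")
    return models[0]["id"]


print(first_model_id(MODELS_JSON))  # prints: meta-llama--Llama-2-7b-chat-hf
```

Picking the first entry works here because the server was started with a single model; the explicit empty-list check is an addition of this sketch so that a misconfigured server fails with a clear message.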

This OpenAI integration offers a flexible and convenient way for developers to leverage the power of LLMs using a familiar interface, with the added benefits of self-hosting and customization that OpenLLM provides.


LangChain

LangChain is an open-source framework for developing applications powered by language models. Its OpenLLM wrapper lets you create an OpenLLM instance that either loads an LLM in-process or connects to a remote OpenLLM server.

Install LangChain:

pip install langchain

To run inference with a local model loaded in-process:

from langchain.llms import OpenLLM

# Create an OpenLLM instance with a local model
llm = OpenLLM(model_name="dolly-v2", model_id='databricks/dolly-v2-7b')
res = llm("What are Large Language models?")
print(res)

To run inference with a remote OpenLLM server:

from langchain.llms import OpenLLM

# Connect to a remote OpenLLM server
llm = OpenLLM(server_url='', server_type='http')
llm('What are Large Language models?')

For a more comprehensive understanding, see my previous blog post Building A Production-Ready LangChain Application with BentoML and OpenLLM. This step-by-step guide walks you through the development process, from a simple script to a sophisticated application capable of exposing an API endpoint for broader external interactions.

Transformers Agents

Hugging Face Transformers Agents provides agents that act as intelligent interpreters: they understand natural language requests and perform a variety of tasks. These agents leverage different models to interact with a set of defined tools, enabling them to execute functions like text classification, image generation, and language translation, among others.

Install Transformers Agents:

pip install "transformers[agents]" "openllm[vllm]"

Start a server. This time, let's try StarCoder, an LLM designed specifically for coding tasks. Note that you need to accept the related conditions on Hugging Face to access its files and content.

openllm start bigcode/starcoder --backend vllm

The integration of OpenLLM with Transformers Agents is quite straightforward. With the OpenLLM server running, you can define an agent to interact with it like this:

import transformers

# URL of the running OpenLLM server
agent = transformers.HfAgent('http://localhost:3000/hf/agent')

agent.run('Is the following `text` positive or negative?', text="I don't like the answer the model provides.")

After running the code, the agent will process the request and provide a response like this:

==Explanation from the agent==
I will use the following tool: `text_classifier` to classify the text.

==Code generated by the agent==
print(f"The text is {text_classifier(text, labels=['positive', 'negative'])}.")

==Result==
The text is negative.

It's important to note that Transformers Agents is still in an experimental phase. This means that the API and the results it returns can change at any time. The underlying models and APIs used by the agents are subject to updates, which might affect the consistency of the results.


The integrations we've explored in this article – BentoML, LlamaIndex, OpenAI API, LangChain, and Transformers Agents – are just a glimpse into the expansive potential of OpenLLM. As the AI community continues to innovate and grow, OpenLLM's list of integrations is also expected to expand, bringing more capabilities and possibilities.

While this series has covered some ground so far, you might still be curious about other topics, like a model’s performance with OpenLLM, particularly in terms of throughput and latency. In the next article, we'll focus on these aspects, offering a detailed benchmark analysis to provide a deeper understanding of what OpenLLM has to offer for model performance.

More on OpenLLM and BentoML

To learn more about OpenLLM and BentoML, check out the following resources: