January 3, 2024 • Written By Sherlock Xu
Over the past year, Large Language Models (LLMs) like GPT-4 have not only transformed how we interact with machines but also redefined the possibilities within natural language processing (NLP). A notable trend in this evolution is the rising popularity of open-source LLMs like Llama 2, Falcon, OPT, and Yi. Many prefer them over their commercial counterparts for their accessibility, data security and privacy, customization potential, lower cost, and freedom from vendor lock-in. Among the tools gaining traction in the LLM space are OpenLLM and LlamaIndex, two powerful platforms that, when combined, unlock new use cases for building AI-driven applications.
OpenLLM is an open-source platform for deploying and operating any open-source LLMs in production. Its flexibility and ease of use make it an ideal choice for AI application developers seeking to harness the power of LLMs. You can easily fine-tune, serve, deploy, and monitor LLMs in a wide range of creative and practical applications.
LlamaIndex provides a comprehensive framework for managing and retrieving private and domain-specific data. It acts as a bridge between the extensive knowledge of LLMs and the unique, contextual data needs of specific applications.
OpenLLM’s support for a diverse range of open-source LLMs and LlamaIndex’s ability to seamlessly integrate custom data sources give developers in both communities great room for customization. This combination lets them create AI solutions that are both highly intelligent and properly tailored to specific data contexts, a capability that is essential for query-response systems.
In this blog post, I will explain how you can leverage the combined strengths of OpenLLM and LlamaIndex to build an intelligent query-response system. This system can understand, process, and respond to queries by tapping into a custom corpus.
The first step is to create a virtual environment on your machine, which helps prevent conflicts with other Python projects you might be working on. Let’s call it `llamaindex-openllm` and activate it.
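A typical way to do this with Python’s built-in `venv` module (the commands below assume a Unix-like shell):

```bash
python -m venv llamaindex-openllm
source llamaindex-openllm/bin/activate
```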
Install the required packages. This command installs OpenLLM with the optional `vllm` component (I will explain it later).
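Something like the following should work. Note that `llama-index` is also needed for the later steps, and the local embedding model used there relies on `sentence-transformers`; exact package extras may vary across versions:

```bash
pip install "openllm[vllm]" llama-index sentence-transformers
```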
For handling requests, you need an LLM server. Here, I use the following command to start a Llama 2 7B server locally at http://localhost:3000. Feel free to choose any model that fits your needs. If you already have a remote LLM server, you can skip this step.
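A sketch of the start command, assuming OpenLLM 0.4.x and access to the gated Llama 2 weights on Hugging Face (the exact model ID here is an assumption):

```bash
openllm start llama --model-id meta-llama/Llama-2-7b-chat-hf
```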
OpenLLM automatically selects the most suitable runtime implementation for the model. For models with vLLM support, OpenLLM uses vLLM by default; otherwise, it falls back to PyTorch. vLLM is a high-throughput, memory-efficient inference and serving engine for LLMs. According to this report, you can achieve up to 23x LLM inference throughput while reducing P50 latency using vLLM.
Note: To use the vLLM backend, you need a GPU with the Ampere architecture or newer and CUDA version 11.8. This demo uses a machine with an Ampere A100 80G GPU. If your machine has a compatible GPU, you can also choose vLLM. Otherwise, simply install the standard OpenLLM package (`pip install openllm`) in the previous step.
Before building a query-response system, let’s get familiar with the integration of OpenLLM and LlamaIndex and use it to create a simple completion service.
The integration offers two APIs for interacting with LLMs:
OpenLLM: This can be used to run an LLM locally, directly from your code, without needing to start a separate server with commands like `openllm start`. Here’s how you can use it:
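A minimal sketch based on the integration’s documented usage (the Zephyr model ID is only an illustrative choice):

```python
from llama_index.llms import OpenLLM

# Loads the model in-process; no separate `openllm start` is needed
llm = OpenLLM("HuggingFaceH4/zephyr-7b-alpha")
```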
OpenLLMAPI: This can be used to interact with a server hosted elsewhere, like the Llama 2 7B server I started previously.
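Connecting to the running server is a one-liner; the address below matches the local server started earlier:

```python
from llama_index.llms import OpenLLMAPI

# Talks to an existing OpenLLM server over HTTP
remote_llm = OpenLLMAPI(address="http://localhost:3000")
```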
Let’s try the `complete` endpoint and see if the Llama 2 7B model is able to tell what OpenLLM is by completing the sentence “OpenLLM is an open source tool for”.
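A sketch of the full script, reusing the remote client from above:

```python
from llama_index.llms import OpenLLMAPI

remote_llm = OpenLLMAPI(address="http://localhost:3000")

# Ask the model to continue the sentence
completion = remote_llm.complete("OpenLLM is an open source tool for")
print(completion)
```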
Run this script to see what the model produces. In my test, the model couldn’t correctly explain OpenLLM and hallucinated instead 🤣. Nevertheless, the code works: the server returns a response for the request. This is a good start as we proceed with building our system.
The initial version revealed a key limitation: the model’s lack of specific knowledge about OpenLLM. One solution is to feed the model domain-specific information, allowing it to answer topic-specific queries accurately. This is where LlamaIndex comes into play, enabling you to build a local knowledge base with pertinent information. Specifically, you create a directory (for example, `data`) and build an index over all the documents in that folder.
Create the folder and import the GitHub README file of OpenLLM into it:
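One way to do this, pulling the README straight from the OpenLLM repository:

```bash
mkdir data
cd data
wget https://raw.githubusercontent.com/bentoml/OpenLLM/main/README.md
```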
Go back to the previous directory and create a script called `starter.py` like the following:
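Here is a minimal sketch of the script, based on the LlamaIndex 0.9.x API current at the time of writing. Setting `embed_model="local"` runs a local Hugging Face embedding model (which is why `sentence-transformers` is needed); without it, LlamaIndex defaults to OpenAI embeddings, which require an API key:

```python
from llama_index import VectorStoreIndex, SimpleDirectoryReader, ServiceContext
from llama_index.llms import OpenLLMAPI

# Use the OpenLLM server started earlier instead of the default OpenAI LLM
llm = OpenLLMAPI(address="http://localhost:3000")

# Local embeddings keep the whole pipeline self-hosted
service_context = ServiceContext.from_defaults(llm=llm, embed_model="local")

# Load every document under ./data and build a vector index over it
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents, service_context=service_context)

# Answer questions grounded in the indexed README
query_engine = index.as_query_engine()
response = query_engine.query("What is OpenLLM?")
print(response)
```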
To improve response quality, I recommend defining a `SentenceSplitter`, which gives you finer control over how input documents are chunked.
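A sketch of wiring a splitter into the service context (the chunk size and overlap values are illustrative, not tuned recommendations):

```python
from llama_index.text_splitter import SentenceSplitter

service_context = ServiceContext.from_defaults(
    llm=llm,
    embed_model="local",
    # Split documents at sentence boundaries into ~1024-token chunks
    text_splitter=SentenceSplitter(chunk_size=1024, chunk_overlap=20),
)
```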
In addition, you can set `streaming=True` to stream the response:
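A sketch of the streaming variant, replacing the last three lines of `starter.py`:

```python
# Enable token-by-token streaming from the query engine
query_engine = index.as_query_engine(streaming=True)
streaming_response = query_engine.query("What is OpenLLM?")

# Print tokens as they arrive instead of waiting for the full answer
streaming_response.print_response_stream()
```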
Your directory structure should now look like this (virtual environment omitted):
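```
.
├── data
│   └── README.md
└── starter.py
```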
Run `python starter.py` to test the query-response system. The output should be consistent with the content of the OpenLLM README. Here is the response I received:
OpenLLM is an open-source platform for deploying and managing large language models (LLMs) in a variety of environments, including on-premises, cloud, and edge devices. It provides a comprehensive suite of tools and features for fine-tuning, serving, deploying, and monitoring LLMs, simplifying the end-to-end deployment workflow for LLMs.
The exploration in this article underscores the importance of customizing AI tools to fit specific needs. By using OpenLLM for flexible LLM deployment and LlamaIndex for data management, I have demonstrated how to create an AI-powered system that not only understands and processes queries but also delivers responses grounded in a unique knowledge base. I hope this blog post has inspired you to explore more capabilities and use cases of OpenLLM and LlamaIndex. Happy coding! ⌨️
To learn more about OpenLLM and BentoML, check out the following resources: