January 14, 2025 • Written By BentoML and Red Hat
TL;DR
vLLM is a high-throughput and efficient inference engine for running large language models (LLMs). In this post, we will explore an annotated history of language models, describe the current state of structured decoding in vLLM as well as the recent integration with XGrammar, and share our tentative roadmap for future improvements.
We also invite readers to approach this blog post from a philosophical perspective; in the process, we posit that structured decoding represents a fundamental shift in how we think about LLM outputs. It also plays an important role in building complex agentic systems.
In 1950, Alan Turing proposed that a high-speed digital computer, programmed with rules, could exhibit emergent behaviour of intelligence (Turing, 1950). This led to two main approaches in AI development:
In summary:
LLMs excel at the following heuristic: given a blob of text, the model will generate a contiguous piece of text that it predicts as the most probable tokens. For example, if you give it a Wikipedia article, the model should produce text consistent with the remainder of said article.
These models work well under one assumption: the input prompt must be coherent and well-structured around the problem the user wants to solve. In other words, LLMs can be unpredictable when you need output in specific formats. Think of asking a model to generate JSON: without guidance, it might produce valid text that breaks the JSON specification [5].
This is where structured decoding comes in. It enables LLMs to generate outputs that follow a desired structure while preserving the non-deterministic nature of the system.
Companies like OpenAI have recognised this need, implementing features like JSON mode to constrain [6] the output format. If you have built with these functionalities before (such as agentic workflows, function calling, or coding assistants), chances are you are already using structured decoding under the hood.
Guided decoding is to LLMs what validation is to APIs: it acts as a guarantee that what comes out matches what you expect. Guided decoding ensures structural integrity, allowing developers to integrate LLMs into their applications with ease!
For BentoML users, this is useful for both OpenLLM and BentoVLLM, where structured decoding offers improved decoding performance. It gives users building compound AI systems more fine-grained control over the LLM components within their stack.
In simple terms, structured decoding gives LLMs a "template" to follow. Users provide a schema that "influences" the model's output, ensuring compliance with the desired structure:
From a technical perspective, an inference engine can modify the probability distribution of next tokens by applying a bias (often via logit masks) over the vocabulary for any given schema. To apply these biases, Outlines proposed guided generation via a finite-state machine (FSM) [7] for a given schema (Willard & Louf, 2023). This allows us to track the current state during decoding and filter out invalid tokens by applying a logit bias to the output.
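To make the mechanism concrete, here is a minimal, illustrative sketch (not vLLM's or Outlines' actual implementation; the states and token ids are made up) of how an FSM-derived mask can be applied to the logits before sampling:

```python
import math

# Toy transition table for a hypothetical schema-derived FSM:
# (current_state, token_id) -> next_state. Real tables are compiled from a
# JSON schema or regex (Willard & Louf, 2023); the ids here are made up.
FSM_TRANSITIONS = {
    (0, 11): 1,  # pretend token 11 is '{'
    (1, 42): 2,  # pretend token 42 is '"name"'
    (2, 7): 3,   # pretend token 7 is ':'
}

def apply_fsm_mask(logits: list[float], state: int) -> list[float]:
    """Bias the next-token distribution: tokens with no valid transition
    from the current FSM state get a logit of -inf and can never be sampled."""
    allowed = {tok for (s, tok) in FSM_TRANSITIONS if s == state}
    return [
        logit if token_id in allowed else -math.inf
        for token_id, logit in enumerate(logits)
    ]

# A toy vocabulary of 50 tokens with uniform logits.
masked = apply_fsm_mask([0.0] * 50, state=0)
print(sum(x != -math.inf for x in masked))  # -> 1 (only '{' survives in state 0)
```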
In vLLM, you can use this by passing a JSON schema to the sampling params (either through the Python SDK or HTTP requests).
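For example, with the offline Python SDK it looks roughly like the following (a minimal sketch; the exact parameter names may vary between vLLM versions, and the model choice here is purely illustrative):

```python
from vllm import LLM, SamplingParams
from vllm.sampling_params import GuidedDecodingParams

# Any JSON schema describing the desired output shape.
json_schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}, "age": {"type": "integer"}},
    "required": ["name", "age"],
}

llm = LLM(model="Qwen/Qwen2.5-1.5B-Instruct")  # illustrative model choice
sampling_params = SamplingParams(
    max_tokens=128,
    guided_decoding=GuidedDecodingParams(json=json_schema),
)
outputs = llm.generate(
    ["Generate a JSON object describing a user named Alice, age 30."],
    sampling_params,
)
print(outputs[0].outputs[0].text)
```

When using the OpenAI-compatible server, the same schema can typically be supplied in the request body (for example via the guided_json extra parameter in current vLLM releases).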
In some cases, it can even improve the native decoding performance for LLMs!
There are a few limitations with vLLM's current support of the Outlines backend:

1. Slow decoding: the FSM has to be constructed at the token level, meaning it can only transition the state one token per step. Therefore, it can only decode one token at a time, resulting in slow decoding.
2. Batch processing bottlenecks: vLLM's implementation relies heavily on the logit processor [8], which sits on the critical path of the sampling process [9]. In batching use cases, compiling the FSM per request and computing the mask synchronously means that all requests in a given batch are blocked, resulting in high time-to-first-token (TTFT) and lower throughput. We found FSM compilation to be a relatively expensive task, making it a significant contributor to the increased TTFT.
3. Performance issues with CFG mode: with the Outlines integration, while JSON mode is relatively fast, CFG mode runs significantly slower and can occasionally crash the engine.
4. Limited advanced feature support: techniques like jump-forward decoding are currently not possible with the logit-processor approach, since they require prefilling a set of $k$ next tokens, whereas a logit processor can only deal with the next token (see the sketch after this list).
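As an illustration of why this matters, here is a simplified sketch of the jump-forward idea (ours, not vLLM or XGrammar code): whenever the grammar admits exactly one valid next token, that token can be appended directly without a model forward pass, something a per-step logit processor cannot do.

```python
# Toy grammar table: state -> {token_id: next_state}. In a real system this
# would come from the compiled FSM/PDA for the schema; the ids are made up.
GRAMMAR = {
    0: {10: 1},          # only '{' is valid at the start
    1: {20: 2},          # only '"name":' is valid next
    2: {30: 3, 31: 3},   # several tokens are valid: the model must choose
}

def jump_forward(state: int, generated: list[int]) -> int:
    """Append tokens without running the model while the grammar leaves
    exactly one valid choice (fixed keys, punctuation, braces, ...)."""
    while state in GRAMMAR and len(GRAMMAR[state]) == 1:
        token_id, next_state = next(iter(GRAMMAR[state].items()))
        generated.append(token_id)   # no forward pass needed for this token
        state = next_state
    return state                     # resume normal (masked) sampling here

generated: list[int] = []
state = jump_forward(0, generated)
print(generated, state)  # -> [10, 20] 2: two tokens emitted "for free"
```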
XGrammar introduces a new technique for batched constrained decoding via pushdown automata (PDA). You can think of a PDA as a "collection of FSMs, where each FSM represents a context-free grammar (CFG)." One significant advantage of a PDA is its recursive nature, allowing it to execute multiple state transitions. XGrammar also includes additional optimisations (for those who are interested) to reduce grammar compilation overhead.
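To see why the stack matters, here is a tiny, self-contained illustration (ours, not XGrammar code): a finite-state machine with a fixed number of states cannot track arbitrarily deep nesting, whereas a pushdown automaton handles it naturally by pushing and popping a stack.

```python
def accepts_nested_objects(tokens: list[str]) -> bool:
    """A toy pushdown-automaton-style check: accept sequences of '{' and '}'
    that are properly nested to any depth, which a finite-state machine with
    a fixed number of states cannot express in general."""
    stack: list[str] = []
    for tok in tokens:
        if tok == "{":
            stack.append(tok)        # push on opening brace
        elif tok == "}":
            if not stack:
                return False         # unmatched closing brace
            stack.pop()              # pop the matching opener
        else:
            return False             # reject anything outside the toy alphabet
    return not stack                 # accept only if every brace was matched

print(accepts_nested_objects(list("{{}{}}")))  # True
print(accepts_nested_objects(list("{{}")))     # False
```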
This advancement addresses limitation (1) by moving grammar compilation out of Python and into C, utilising pthreads. Additionally, XGrammar lays the groundwork for addressing limitation (4) in future releases. Below are performance comparisons between the XGrammar and Outlines backends:
In vLLM’s v0 architecture, we've implemented XGrammar as a logit processor, optimizing it with caching for tokenizer data. While the performance improvements are encouraging, we believe there's still significant room for optimization.
There are still a few usability concerns in the XGrammar v0 integration before it reaches feature parity with all use cases:
vLLM now has basic support for XGrammar by default. In cases where we know XGrammar is insufficient to serve the request, we fall back to Outlines.
Note that vLLM also includes support for lm-format-enforcer. However, from our testing we found that in some long-context test cases, lm-format-enforcer fails to enforce correct outputs, and it is not up to par with Outlines in terms of either performance or feature parity.
With the release of v1 on the horizon, we're working on a tentative plan for structured decoding:
- Moving guided decoding towards the scheduler level
- Allowing bit-mask calculation in one process instead of on each GPU worker
- Providing a good baseline for speculative decoding and tool use
If you have any more suggestions, we are more than happy to take them into consideration. Consider joining the vLLM Slack via #feat-structured-output.
We want to thank the vLLM team, XGrammar team, Aaron Pham (BentoML), Michael Goin (Red Hat), Chendi Xue (Intel), and Russell Bryant (Red Hat) for their valuable feedback and collaboration on bringing XGrammar to vLLM and the continuous effort to improve structured decoding in vLLM.
[1]: Allen Newell and Herbert Simon’s work at RAND initially showed that computers can simulate important aspects of intelligence.
Another notable application was found in the medical domain (Haugeland, 1997). MYCIN, developed at Stanford University in the 1970s, diagnosed and recommended treatments for blood infections (Shortliffe, 1974). MYCIN’s developers recognized the importance of justifying recommendations, implementing what were known as “rule traces” to explain the system’s reasoning in human-understandable terms.
[2]: In the 1990s, IBM released a sequence of complex statistical models trained to perform machine translation tasks (Statistical Machine Translation, n.d.) (see also: this lecture from Cornell).
In 2001, a Bag-of-Words (BoW) variant model trained on 0.3B tokens was considered SOTA at the time (Mikolov et al., 2013). These earlier works proved to the research community that statistical modelling triumphs over its symbolic counterpart for language processing, given that it can capture the general patterns of large corpora of text.
[3]: In 2017, the landmark paper "Attention Is All You Need" introduced the Transformer architecture (Vaswani et al., 2023) for neural machine translation tasks, which is based on the attention mechanism first proposed by Bahdanau et al. (2016).
OpenAI then introduced the scaling laws for neural language models (Kaplan et al., 2020), which set off the race towards building these systems based on foundational language models.
[4]: Prior to attention-based transformers, seq-to-seq models used RNNs, given their ability to handle longer context lengths and better memory. However, they are more susceptible to vanishing/exploding gradients compared to feed-forward networks, and thus LSTMs (Hochreiter & Schmidhuber, 1997) were proposed to solve this problem. Yet one of the main problems with LSTMs is that they tend to have poor memory recall for data they have seen many steps ago.
The attention paper addresses this problem by encoding additional positional data into the inputs. The paper also proposed an encoder-decoder architecture for translation tasks; however, most text-generation models nowadays are decoder-only, given their superior performance on zero-shot tasks.
One of the many reasons why attention-based transformers work better than LSTMs is that transformers are very scalable and hardware-aware (you can't just arbitrarily add more LSTM blocks and hope for better long-term retention). For more information, please refer back to the original paper.
[5]: One might argue that we can reliably achieve this through few-shot prompting, i.e. "Give me a JSON that yields the address of users. Example output can be ...". However, there is no guarantee that the generated output is valid JSON, because these models are probabilistic systems that "sample" the next token based on the distribution of data they were trained on. One might also argue that one should use models specifically fine-tuned for JSON outputs in such cases. However, fine-tuning often requires extensive training and a lot more labour to curate data, monitor progress, and perform evaluation, which is a huge resource cost that not everyone can afford.
[6]: Note that the phrases "structured decoding", "constrained decoding", and "guided decoding" are used interchangeably; they all refer to the same mechanism of "using a format for the model to structurally sample outputs."
[7]: We define a finite-state machine as a tuple $(Q, \Sigma, \delta, q_0, F)$, where the characters comprising the strings in $\mathcal{V}$ are drawn from $\Sigma$, i.e. $\mathcal{V} \in \mathcal{P}(\Sigma)$, and:
- $Q$ is a finite set of states
- $\Sigma$ is a finite alphabet
- $\delta: Q \times \Sigma \to Q$ is the transition function
- $q_0 \in Q$ is the start state
- $F \subseteq Q$ is the set of accept states.
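As a small illustrative example (ours, not from the paper): an FSM that accepts exactly the strings true and false has $\Sigma = \{\mathrm{t, r, u, e, f, a, l, s}\}$, one state per accepted prefix (e.g. $\delta(q_0, \mathrm{t}) = q_{\mathrm{t}}$, $\delta(q_{\mathrm{t}}, \mathrm{r}) = q_{\mathrm{tr}}$, and so on along both spellings), and $F$ containing the two states reached after the final e of each word. During decoding, any character (or token) whose transition is undefined from the current state is masked out.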
Intuitively, we can describe any structured format (JSON, YAML, etc.) using a context-free grammar due to its expressive nature.
[8]: See this blog post from HuggingFace on using logit processors to control the generation process.
[9]: The following is excerpted from vllm-project/vllm#5329:
Check out the following resources to learn more: