Static, dynamic and continuous batching

GPUs are designed for highly parallel computation workloads, capable of performing trillions or even quadrillions of floating-point operations per second (FLOPs). Nevertheless, LLMs often fail to fully utilize these GPUs because much of the chip's memory bandwidth is spent loading model parameters.

Batching helps mitigate this bottleneck. In production, your service might be flooded with multiple requests arriving at the same time. Instead of processing each request individually, batching them together allows you to use the same loaded model parameters across multiple requests, thus dramatically improving throughput.

Use the simulator below to understand different batching strategies at a high level.

Batching Strategy Simulator

See how each strategy schedules requests and uses GPU capacity

batch size = 3Collects exactly 3 requests before starting. Every request in a batch is held until the slowest one finishes.

1Batch 1 (t=7): R1, R2, R3 — waits until the 3rd request arrives; all held until R2 finishes at t=15

2Batch 2 (t=15): R4, R5, R6 — all held until R4 finishes at t=20

t = 0.0

Waiting for batch

Processing

Done, held for batch

✓Result returned

Static

Dynamic

Continuous

Processing starts when

Batch is full

Batch full or timeout

Request arrives (if capacity is available)

Short requests held?

Yes, until slowest finishes

No, completes immediately

GPU idle between batches

Yes

Latency

High

Medium

Low

Throughput

Low–medium

Medium

High

Static batching

The simplest form of batching is static batching. Here, the server waits until a fixed number of requests arrive and then processes them together as a single batch.

While static batching is easy to implement, it has notable downsides.

The first request in a batch is forced to wait for the last one, adding unnecessary delay. Picture a printer that won’t start printing until you’ve queued up a set number of documents, regardless of how long it takes for the last document to arrive.
Not all requests in a batch are created equal. In LLM inference, some requests may generate very short responses, while others could involve lengthy, step-by-step reasoning. Since all requests in the batch must wait until the slowest one finishes, this can lead to wasted compute resources and increased latency.

Dynamic batching

To address the issues in static batching, many systems use dynamic batching. This approach still collects incoming requests into batches, but it doesn’t insist on a fixed batch size. Instead, it sets a time window and processes whatever requests have arrived in that time frame. If the batch reaches its size limit sooner, it launches immediately. This is like a bus that leaves on a strict schedule or whenever it’s full, whichever happens first.

Dynamic batching helps balance throughput and latency. It ensures that early requests aren’t delayed indefinitely by later ones. However, because some batches might not be completely full when launched, it doesn’t always achieve maximum GPU efficiency. Another drawback is that, like static batching, the longest request in a batch still dictates when the batch finishes; short requests have to wait unnecessarily.

Continuous batching

For LLM inference, output sequences vary widely in length. Some users might ask simple questions, while others request detailed explanations. Static and dynamic batching force the short requests to wait for the longest one. This leaves GPU resources unsaturated.

Continuous batching, also known as in-flight batching, addresses these inefficiencies. Continuous batching doesn’t force the entire batch to complete before returning results. Instead, it lets each sequence in a batch finish independently and immediately replaces it with a new one. This is like an assembly line where, as soon as one item is finished (no matter how long it takes), a new item is added to keep the line running at full capacity.

Generating seven sequences with continuous batching. On the first iteration (left), each sequence generates a token (blue) from its prompt (yellow). Over time (right), sequences complete at different iterations by emitting an end-of-sequence token (red), at which point new sequences are inserted. Image source

This technique uses iteration-level scheduling, meaning the batch composition changes dynamically at each decoding iteration. As soon as a sequence in the batch finishes generating tokens, the server inserts a new request in its place. This maximizes GPU occupancy and keeps compute resources busy by avoiding idle time that would otherwise be spent waiting for the slowest sequence in a batch to finish.

Major inference frameworks such as vLLM, SGLang, TensorRT-LLM (in-flight batching) and LMDeploy (persistent batching) all support continuous batching or similar mechanisms. For memory management in long or mixed-length batches, see PagedAttention.

Chunked prefill

Continuous batching introduces a scheduling conflict when a new request arrives while other requests are decoding. Processing the entire prompt of a new request in one prefill iteration minimizes the Time to First Token (TTFT), but a long prefill can delay the next token for every active decode request. In a streaming application, users may see the output pause while another user's prompt is processed.

Chunked prefill splits a prompt into smaller token ranges and schedules them across multiple iterations. Each chunk extends the KV cache of the request, and later chunks attend to the prompt tokens processed earlier. Because the attention computation is unchanged (every token still attends to all earlier tokens through the KV cache), the chunked prefill is mathematically equivalent to processing the prompt in one pass, and the first token is emitted only after the final chunk.

Chunked Prefill Scheduler

A and B are decoding when C arrives with a long prompt. See how the scheduler fits C's prefill onto the same GPU timeline.

C's promptChunk size

PrefillDecodeStalled

Request A

Prefill

Decode

Stalleddecode paused

Decode

…

Request B

Prefill

Decode

Stalleddecode paused

Decode

…

Request C

Prefill32K tokens

Decode

…

Time →

The whole prompt of request C runs in one iteration. A and B can't emit a token until it finishes.

Decode stall for A & B1 long iteration

Prefill rounds for C1

Largest prefill block32K tokens

The scheduler can combine one prefill chunk with decode tokens from active requests in the same batch. SARATHI calls this decode-maximal batching: the prefill chunk supplies enough parallel work to saturate the compute capacity of the GPU, while decode tokens piggyback on the same model execution at little extra cost. This prevents a long prompt from monopolizing one iteration and can also reduce pipeline bubbles under pipeline parallelism.

The benefit of chunked prefill is that even when a request has a long prompt, the prefill computation is divided into smaller chunks. Instead of waiting for one long prefill iteration to finish, active decode requests can continue generating tokens between prefill chunks. This reduces Inter-Token Latency (ITL) and makes streamed responses smoother.

The trade-off is that chunking introduces additional scheduling and attention overhead because the prompt is processed through multiple smaller prefill steps instead of one large step. As a result, TTFT may increase, especially when smaller chunk sizes are used.

Most inference frameworks let you tune the chunk size. Note that:

Smaller chunks give the scheduler more opportunities to run decodes, reducing ITL spikes for active requests.
Larger chunks process the new prompt more efficiently and usually improve the TTFT, but active decodes may wait longer between tokens.
Chunks that are too small can lower GPU utilization and add attention overhead because later chunks must reread KV cache entries created by earlier chunks.

There is no universal best value. The right size depends on the model, GPU, prompt-length distribution, concurrency, and latency targets. It should be treated as a workload-specific scheduling parameter rather than a universal constant.

FAQs

What are padding tokens in LLM batching?

Text sequences naturally have different lengths after tokenization. For example:

Sentence A: "Hello world"        # [15496, 995] - length 2
Sentence B: "How are you today?" # [2437, 389, 345, 1909] - length 4

You cannot directly stack these sequences into one dense rectangular tensor because their lengths differ. Padding solves this by adding placeholder tokens to shorter sequences so every request in the batch has the same tensor length. Common padding token forms include PAD, <pad>, or token ID 0, though the exact ID depends on the tokenizer.

An attention mask tells the model which positions are padding, but padding can still waste compute and memory. If a batch contains sequences with lengths like this:

[1024, 1000, 50, 20]

The shorter sequences may be padded to length 1024. That keeps the tensor shape regular for GPU kernels, but the 50-token and 20-token requests now carry a lot of unused positions.

What are ragged tensors in LLM inference?

Ragged tensors represent variable-length sequences without padding them all to one fixed length. Instead, the runtime stores the real tokens plus metadata such as sequence lengths, offsets, or KV cache block locations. This helps inference engines handle mixed-length prompts and continuous batching more efficiently, especially when paired with paged KV cache layouts such as PagedAttention.

Static batching​

Dynamic batching​

Continuous batching​

Chunked prefill​

FAQs​

What are padding tokens in LLM batching?​

What are ragged tensors in LLM inference?​

Additional resources​

Static batching

Dynamic batching

Continuous batching

Chunked prefill

FAQs

What are padding tokens in LLM batching?

What are ragged tensors in LLM inference?

Additional resources