Foundations
LLM inference is where models meet the real world. It powers everything from instant chat replies to code generation, and directly impacts latency, cost, and user experience. Understanding how inference works is the first step toward building smarter, faster, and more reliable AI applications.
📄️ What is LLM inference?
LLM inference is the process of using a trained language model to generate responses or predictions based on prompts.
📄️ Training vs. inference
LLM training builds the model; LLM inference applies it to new inputs to generate outputs in real time.
📄️ How does LLM inference work?
Learn how LLM inference works, from tokenization to prefill and decode stages, with tips on performance, KV caching, and optimization strategies.
📄️ Where is LLM inference run?
Learn the differences between CPUs, GPUs, and TPUs, and where you can run LLM inference on each.
📄️ Key metrics for LLM inference
Measure key metrics like latency and throughput to optimize LLM inference performance.
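To make the latency and throughput metrics above concrete, here is a minimal Python sketch that times a streaming generation loop. The `generate_tokens` stub is a hypothetical placeholder for a real streaming inference call; the measurement logic (time to first token, tokens per second) is the part that carries over to whichever client you actually use.

```python
import time


def generate_tokens(prompt: str):
    """Hypothetical stand-in for a streaming LLM endpoint.

    Replace this with your inference client's streaming call;
    here it just yields placeholder tokens with a small delay.
    """
    for token in ["Hello", ",", " world", "!"]:
        time.sleep(0.05)  # simulate per-token decode latency
        yield token


def measure_inference(prompt: str):
    start = time.perf_counter()
    first_token_time = None
    num_tokens = 0

    for _ in generate_tokens(prompt):
        if first_token_time is None:
            # Time to first token (TTFT): dominated by the prefill stage.
            first_token_time = time.perf_counter() - start
        num_tokens += 1

    total = time.perf_counter() - start
    # Throughput here counts generated tokens over the full request time.
    throughput = num_tokens / total if total > 0 else 0.0
    return first_token_time, throughput


ttft, tps = measure_inference("Explain KV caching in one sentence.")
print(f"Time to first token: {ttft:.3f} s")
print(f"Throughput: {tps:.1f} tokens/s")
```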