April 11, 2025 • Written By Fog Dong and Sherlock Xu
You launched a Llama 3.1 8B container on Kubernetes, but it took 10 minutes to start. Why? The container infrastructure was simply struggling under GenAI workloads.
A fast cold start is critical for ensuring your deployment can react quickly to traffic changes without large delays. By bringing up instances on-demand, you can scale dynamically and avoid over-provisioning compute capacity. This responsiveness reduces costs while maintaining a high level of service.
Recognizing this, we completely redesigned our LLM cold start strategy, from pulling images to loading model weights into GPU memory. Ultimately, we developed a solution that cut cold start time down to under 30 seconds, with true on-demand model loading.
In this blog post, we will share the story of how we did it. First, let's look at some fundamental questions.
Containers used for serving GenAI workloads are generally much larger than simple web app containers. A Llama 3.1 8B container, for example, typically bundles the model weights together with the inference framework, CUDA and other GPU libraries, and the Python runtime and OS layers. This brings the total size to 20.2 GB, and in many cases even higher as models grow. For comparison, a python3.10-slim container may be only around 154 MB.
A container image is made up of multiple layers, each typically stored as a compressed tarball (e.g., `gzip` or `zstd`) in a registry (e.g., Docker Hub). Additionally, the image includes a JSON manifest listing its layers, base image, and configuration details.
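To see this layout for yourself, you can query a registry's manifest API directly. The sketch below is a minimal Python example against the public Docker Hub API, using an anonymous pull token; the `python:3.10-slim` image is just an illustrative choice.

```python
import requests

IMAGE = "library/python"
TAG = "3.10-slim"

# Docker Hub requires an anonymous pull token even for public images.
token = requests.get(
    "https://auth.docker.io/token",
    params={"service": "registry.docker.io", "scope": f"repository:{IMAGE}:pull"},
).json()["token"]

resp = requests.get(
    f"https://registry-1.docker.io/v2/{IMAGE}/manifests/{TAG}",
    headers={
        "Authorization": f"Bearer {token}",
        # Ask for a single-platform manifest; multi-arch images otherwise return an index.
        "Accept": "application/vnd.docker.distribution.manifest.v2+json",
    },
)
manifest = resp.json()

layers = manifest.get("layers", [])
for layer in layers:
    print(f'{layer["mediaType"]}  {layer["size"] / 1e6:.1f} MB')
print(f'{len(layers)} compressed layers, {sum(l["size"] for l in layers) / 1e9:.2f} GB total')
```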
This layered architecture, while efficient for version control and sharing common components, poses challenges when scaled to accommodate large AI models. Let's break down exactly what happens when an LLM container starts up on Kubernetes.
For Llama 3.1 8B, the cold start timeline may look like this:
- Image pull: 5–6 min
- Layer extraction: 3–4 min
- Container configuration & start: ~2 min
- Total: ~11 min
It's important to note that the delay is not caused by computation or inference, but simply by preparing the container to start.
Now that we've covered the basics, let's dive deeper into the major bottlenecks that slow down LLM container startup.
Container registries and runtimes were originally designed for small images (e.g., web applications), so they can struggle with LLM inference workloads. Image layers are compressed with `gzip` or `zstd`, and a single 50 GB model might require 5–8 minutes just for decompression on modern CPUs (see the rough benchmark sketch below).

Most container storage drivers aren't suitable for handling large GenAI model files (e.g., multi-gigabyte `.safetensors` files).

Finally, LLMs introduce their own challenges in how model files are transferred and loaded during deployment.
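To get a feel for the decompression cost, here is a rough, single-threaded benchmark sketch. The synthetic payload compresses far better than real model weights, so treat the extrapolated figure as a ballpark only.

```python
import gzip
import time

# Build ~200 MB of semi-repetitive data, compress it, then time decompression.
raw = (b"safetensors blocks of semi-repetitive weight bytes " * 1024) * 4096
compressed = gzip.compress(raw, compresslevel=6)

start = time.perf_counter()
gzip.decompress(compressed)
elapsed = time.perf_counter() - start

throughput = len(raw) / elapsed / 1e6  # MB of uncompressed output per second
print(f"decompression throughput: {throughput:.0f} MB/s")
print(f"estimated time for a 50 GB layer: {50_000 / throughput / 60:.1f} min")
```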
These limitations combined can greatly prolong the startup process, even on high-bandwidth networks. Beyond delays, they create cascading problems: network saturation, disk I/O spikes, and increased infrastructure costs (due to over-provisioning resources to manage these inefficiencies).
To accelerate LLM container startup, we must fundamentally reconsider how container images are pulled and model weights are loaded.
The first challenge was slow image download from container registries. Given their limitations mentioned above, we explored an alternative: pulling container images directly from object storage systems, such as Google Cloud Storage (GCS) and Amazon S3.
For the Llama 3.1 8B container, object storage proved dramatically faster as the image pull time dropped to around 10 seconds in our tests.
| Method | Speed (depending on machine/network) | Time |
| --- | --- | --- |
| Cloud Registry Pull (GAR/ECR) | 60 MB/s | ~350s |
| Internal Registry Pull (Harbor) | 120 MB/s | ~170s |
| Direct GCS/S3 Download | 2 GB/s or higher | ~10s |
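As a sanity check, these times are roughly what you get by dividing the 20.2 GB image by each sustained throughput:

```python
# Back-of-the-envelope check: time ≈ image size / sustained throughput.
IMAGE_SIZE_MB = 20_200  # 20.2 GB Llama 3.1 8B container

for method, mb_per_s in [
    ("Cloud registry pull (GAR/ECR)", 60),
    ("Internal registry pull (Harbor)", 120),
    ("Direct GCS/S3 download", 2_000),
]:
    print(f"{method:<32} ~{IMAGE_SIZE_MB / mb_per_s:.0f}s")
```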
We attribute the speed improvements largely to parallel, chunked downloads: object storage serves many concurrent `Range` requests, so downloads can use the full available network bandwidth (a minimal sketch of this pattern follows below).

Despite the faster download speeds, we noticed the extraction step significantly slowed things down, often cutting the overall effective throughput in half. Disk I/O during extraction remained a critical bottleneck.
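Here is a minimal sketch of that parallel, chunked download pattern. The object URL is a hypothetical presigned S3/GCS link, and a real implementation would write each chunk to its target offset rather than concatenating everything in memory.

```python
import concurrent.futures as futures

import requests

# Hypothetical presigned URL for one image layer in object storage.
OBJECT_URL = "https://storage.googleapis.com/my-bucket/llama-3-1-8b/layer.tar"
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB per Range request

def fetch_range(start: int, end: int) -> bytes:
    resp = requests.get(OBJECT_URL, headers={"Range": f"bytes={start}-{end}"}, timeout=60)
    resp.raise_for_status()
    return resp.content

# Get the object size with a HEAD request, then split it into fixed-size ranges.
size = int(requests.head(OBJECT_URL, timeout=60).headers["Content-Length"])
ranges = [(s, min(s + CHUNK_SIZE, size) - 1) for s in range(0, size, CHUNK_SIZE)]

# Download all ranges concurrently to saturate the available bandwidth.
with futures.ThreadPoolExecutor(max_workers=16) as pool:
    chunks = list(pool.map(lambda r: fetch_range(*r), ranges))

data = b"".join(chunks)  # simplification: a real downloader writes chunks to their offsets
```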
That made the extraction step our next optimization target.
Our breakthrough came when we realized we could completely bypass extraction by rethinking how containers access their filesystem. This is where FUSE (Filesystem in Userspace) changes the game.
FUSE allows non-privileged users to create and mount custom filesystems without kernel-level code or root permissions. FUSE-based tools like `stargz-snapshotter` allow containers to access image data on demand, without extracting all layers upfront. This avoids CPU-bound extraction and excessive disk I/O.
With FUSE, we refined our workflow around on-demand access to model data. Rather than extracting every layer to disk, we keep image layers in an uncompressed, seekable-tar format in object storage. Containers treat them as a seekable, lazily loaded file database: files appear locally available even though the underlying data may still reside remotely, and containers stream individual file chunks or data blocks from remote storage only when they actually need them.
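To make the idea concrete, the sketch below treats an uncompressed tar in object storage as a random-access file store: it walks the 512-byte tar headers with `Range` requests to build a name-to-offset index, then fetches only the bytes of the file it needs. The URL and file path are hypothetical, and production formats such as eStargz ship a precomputed table of contents with the layer instead of rebuilding the index over the network.

```python
import math

import requests

# Hypothetical presigned URL for an uncompressed, seekable-tar image layer.
TAR_URL = "https://storage.googleapis.com/my-bucket/llama-3-1-8b/layer.tar"
BLOCK = 512  # tar archives are organized in 512-byte blocks

def read_range(start: int, length: int) -> bytes:
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    resp = requests.get(TAR_URL, headers=headers, timeout=60)
    resp.raise_for_status()
    return resp.content

def build_index() -> dict[str, tuple[int, int]]:
    """Walk tar headers remotely, mapping file name -> (data offset, size)."""
    index, offset = {}, 0
    while True:
        header = read_range(offset, BLOCK)
        if header.count(b"\0") == BLOCK:  # an all-zero block marks the end of the archive
            return index
        name = header[0:100].split(b"\0", 1)[0].decode()
        size = int(header[124:136].split(b"\0", 1)[0] or b"0", 8)
        index[name] = (offset + BLOCK, size)
        # Skip over the file data (padded to a multiple of 512 bytes).
        offset += BLOCK + math.ceil(size / BLOCK) * BLOCK

index = build_index()
offset, size = index["model/model-00001-of-00004.safetensors"]  # hypothetical path
first_chunk = read_range(offset, 1024 * 1024)  # stream only the bytes we need
```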
Using the seekable-tar format as the foundation, we separated the model weights from the container image and enabled direct loading into GPU memory.
Without optimization, model weights are downloaded, written to disk, and then loaded into GPU memory. This process is slow because it is both sequential and I/O-intensive.
We streamlined this step by introducing zero-copy stream-based model loading. This means model files can be streamed directly from remote storage into GPU memory without intermediate disk reads and writes. With these optimizations, we reduced the total cold start time of Llama 3.1 8B to less than 30 seconds.
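As a simplified illustration of stream-based loading (not our exact implementation), the sketch below reads a `.safetensors` header over HTTP, then pulls each tensor's byte range and copies it into GPU memory without ever touching local disk. The URL is hypothetical, each chunk is still staged briefly in host RAM, and a CUDA device is assumed.

```python
import json
import struct

import requests
import torch

# Hypothetical presigned URL for a .safetensors file stored in object storage.
WEIGHTS_URL = "https://storage.googleapis.com/my-bucket/llama-3-1-8b/model.safetensors"

def read_range(start: int, length: int) -> bytes:
    headers = {"Range": f"bytes={start}-{start + length - 1}"}
    resp = requests.get(WEIGHTS_URL, headers=headers, timeout=120)
    resp.raise_for_status()
    return resp.content

DTYPES = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}

# safetensors layout: an 8-byte little-endian header length, a JSON header describing
# every tensor (dtype, shape, byte offsets), then the raw tensor data.
header_len = struct.unpack("<Q", read_range(0, 8))[0]
header = json.loads(read_range(8, header_len))
data_base = 8 + header_len

def load_tensor(name: str, device: str = "cuda") -> torch.Tensor:
    meta = header[name]
    start, end = meta["data_offsets"]
    raw = bytearray(read_range(data_base + start, end - start))
    cpu = torch.frombuffer(raw, dtype=DTYPES[meta["dtype"]]).reshape(meta["shape"])
    return cpu.to(device, non_blocking=True)  # copy the chunk into GPU memory

# Load every tensor straight from remote storage; nothing is written to disk.
weights = {name: load_tensor(name) for name in header if name != "__metadata__"}
```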
Cutting LLM container cold starts is not only a technical optimization, but also a strategic advantage for your products. For businesses deploying LLMs, faster cold starts mean lower infrastructure costs, higher engineering flexibility, and improved user experiences.
By sharing our experience, we hope to help others facing similar challenges in deploying large-scale AI workloads. We believe these techniques can benefit the broader AI engineering community, supporting more scalable and cost-effective LLM deployments on Kubernetes.
Check out the following resources to learn more: