Multi-Model vLLM Serving: GPU Memory Management on RunPod L40S
5 min read
Run multiple vLLM instances on a single GPU with precise memory allocation.
The Setup
Two models, one GPU, zero conflicts:
# Model 1: Embedding (lightweight)
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192
# Model 2: Guard/Generation (heavy)
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
| Model | Port | GPU Memory | Context | Use Case |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 8010 | 15% (~7.2GB) | 8K | Embeddings, RAG |
| Qwen3Guard-Gen-8B | 8000 | 60% (~28.8GB) | 32K | Content moderation, generation |
Total GPU allocation: 75% (36GB of 48GB)
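With both commands running, each vLLM instance exposes its own OpenAI-compatible API on its own port. A minimal sanity check (a sketch, assuming both servers run on the same host as shown above):

import httpx

# Each vLLM instance serves the OpenAI-compatible API on its own port.
for port in (8010, 8000):
    models = httpx.get(f"http://localhost:{port}/v1/models").json()
    print(port, [m["id"] for m in models["data"]])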
Why L40S?
| Spec | NVIDIA L40S |
|---|---|
| VRAM | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| FP16 Tensor Performance | 362 TFLOPS (with sparsity) |
| Architecture | Ada Lovelace |
| Use Case | Inference workloads |
The L40S is optimized for inference rather than training. Its 48GB of VRAM enables multi-model deployments that would otherwise require multiple smaller GPUs.
GPU Memory Breakdown
┌─────────────────────────────────────────────────────────────┐
│ L40S 48GB VRAM │
├─────────────────────────────────────────────────────────────┤
│ Qwen3-Embedding-0.6B │ Qwen3Guard-Gen-8B │ Reserved │
│ 15% (7.2GB) │ 60% (28.8GB) │ 25% (12GB) │
├────────────────────────┼─────────────────────┼──────────────┤
│ Model: ~1.2GB │ Model: ~16GB │ System │
│ KV Cache: ~6GB │ KV Cache: ~12.8GB │ overhead │
│ Context: 8192 │ Context: 32768 │ & buffer │
└────────────────────────┴─────────────────────┴──────────────┘
Memory Calculation
Model weights (FP16):
- 0.6B params × 2 bytes ≈ 1.2GB
- 8B params × 2 bytes ≈ 16GB
KV cache (total across all active sequences):
KV cache = num_layers × 2 (K and V) × hidden_size × 2 bytes × batch_size × seq_len
For models with grouped-query attention such as Qwen3, hidden_size is replaced by num_kv_heads × head_dim, which shrinks the cache considerably.
The --gpu-memory-utilization flag caps the fraction of GPU memory a vLLM instance may use: the model weights are loaded first, and whatever remains under the cap is pre-allocated as KV cache.
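To sanity-check an allocation before launching, the arithmetic above can be scripted. A rough sketch (the layer count and KV dimension are illustrative placeholders, not the published Qwen3 values; read the real ones from the model's config.json):

def estimate_vram_gb(
    params_billion: float,
    num_layers: int,
    kv_dim: int,               # num_kv_heads * head_dim (hidden_size without GQA)
    seq_len: int,
    batch_size: int,
    bytes_per_param: int = 2,  # FP16
) -> float:
    """Rough FP16 weights + KV cache footprint in GB."""
    weights = params_billion * 1e9 * bytes_per_param
    kv_per_token = num_layers * 2 * kv_dim * bytes_per_param  # K and V
    kv_cache = kv_per_token * seq_len * batch_size
    return (weights + kv_cache) / 1e9

# Illustrative numbers only -- not the actual Qwen3 configuration.
print(f"{estimate_vram_gb(8.0, 36, 1024, 32768, 2):.1f} GB")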
Key vLLM Flags
| Flag | Purpose | Trade-off |
|---|---|---|
| --gpu-memory-utilization | % of GPU memory to use | Higher = more throughput, risk of OOM |
| --max-model-len | Max context length | Higher = more memory per request |
| --host 0.0.0.0 | Bind to all interfaces | Required for external access |
| --port | API port | Separate ports for multi-model |
Additional Optimization Flags
# For production workloads
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-chunked-prefill \
--disable-log-requests
| Flag | Effect |
|---|---|
| --max-num-seqs | Max concurrent sequences |
| --enable-chunked-prefill | Better memory efficiency for long prompts |
| --disable-log-requests | Reduce I/O overhead in production |
RunPod Deployment
Pod Configuration
# runpod.yaml
gpu: NVIDIA L40S
gpu_count: 1
volume_size: 50 # GB for model cache
container_image: vllm/vllm-openai:latest
Startup Script
#!/bin/bash
# start-models.sh
# Start embedding model in background
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192 &
# Wait for embedding model to load
sleep 30
# Start guard model
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
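The sleep 30 above is a crude heuristic. A more robust approach is to poll the first server's /health endpoint before launching the second model; a sketch in Python (the vLLM OpenAI server returns 200 from /health once it is ready to serve):

import time
import httpx

def wait_until_healthy(url: str = "http://localhost:8010/health", timeout: float = 300.0) -> None:
    """Poll the vLLM health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(url, timeout=5.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass  # server not accepting connections yet
        time.sleep(2.0)
    raise TimeoutError(f"{url} not healthy after {timeout:.0f}s")

wait_until_healthy()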
Using systemd (recommended)
# /etc/systemd/system/vllm-embedding.service
[Unit]
Description=vLLM Embedding Service
After=network.target
[Service]
ExecStart=/usr/bin/vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 --port 8010 \
--gpu-memory-utilization 0.15 --max-model-len 8192
Restart=always
Environment="CUDA_VISIBLE_DEVICES=0"
[Install]
WantedBy=multi-user.target
API Usage
Embedding Endpoint
import httpx
async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Call the embedding server on port 8010 and return one vector per input text."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8010/v1/embeddings",
            json={
                "model": "Qwen/Qwen3-Embedding-0.6B",
                "input": texts,
            },
        )
        return [e["embedding"] for e in response.json()["data"]]
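A quick one-off call to verify the setup and the embedding dimensionality (assuming the server on port 8010 is up):

import asyncio

vectors = asyncio.run(get_embeddings(["hello world"]))
print(len(vectors), len(vectors[0]))  # 1 vector; dimensionality depends on the model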
Guard/Generation Endpoint
async def generate(prompt: str, max_tokens: int = 512) -> str:
    """Call the guard/generation server on port 8000 and return the completion text."""
    # Generation can take longer than httpx's 5s default timeout.
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:8000/v1/completions",
            json={
                "model": "Qwen/Qwen3Guard-Gen-8B",
                "prompt": prompt,
                "max_tokens": max_tokens,
            },
        )
        return response.json()["choices"][0]["text"]
Combined RAG Pipeline
import asyncio
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

async def rag_query(query: str, documents: list[str]) -> str:
    # 1. Embed the query and the documents (port 8010)
    query_emb, doc_embs = await asyncio.gather(
        get_embeddings([query]),
        get_embeddings(documents),
    )
    # 2. Rank documents by similarity to the query
    similarities = [cosine_similarity(query_emb[0], d) for d in doc_embs]
    top_docs = sorted(zip(documents, similarities), key=lambda x: -x[1])[:3]
    # 3. Generate an answer with the top documents as context (port 8000)
    context = "\n".join(d[0] for d in top_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return await generate(prompt)
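A minimal invocation of the pipeline, assuming both servers from the setup above are running (the documents are placeholder strings):

if __name__ == "__main__":
    docs = [
        "vLLM manages KV-cache memory with PagedAttention.",
        "The NVIDIA L40S has 48GB of GDDR6 VRAM.",
        "RunPod bills GPU pods by the hour.",
    ]
    print(asyncio.run(rag_query("How much VRAM does the L40S have?", docs)))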
Monitoring GPU Usage
nvidia-smi
# Watch GPU utilization
watch -n 1 nvidia-smi
# Output example:
# +-----------------------------------------------------------------------------+
# | Processes: |
# | GPU GI CI PID Type Process name GPU Memory |
# | ID ID Usage |
# |=============================================================================|
# | 0 N/A N/A 12345 C ...vllm.entrypoints.openai 7168MiB |
# | 0 N/A N/A 12346 C ...vllm.entrypoints.openai 28672MiB |
# +-----------------------------------------------------------------------------+
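The same numbers are available programmatically through NVML, which is handy for alerting. A sketch using the pynvml bindings (pip install nvidia-ml-py):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"VRAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
print(f"GPU utilization: {util.gpu}%")
pynvml.nvmlShutdown()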
vLLM Metrics
# Prometheus metrics endpoint
curl http://localhost:8000/metrics | grep vllm
# Key metrics:
# vllm:num_requests_running
# vllm:num_requests_waiting
# vllm:gpu_cache_usage_perc
# vllm:cpu_cache_usage_perc
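The same endpoint can be scraped from code; a short sketch that keeps only the vllm-prefixed series from the Prometheus text output:

import httpx

metrics = httpx.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):  # skips the "# HELP" / "# TYPE" comment lines
        print(line)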
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| OOM on startup | Memory sum > 100% | Reduce --gpu-memory-utilization |
| OOM during inference | KV cache overflow | Lower --max-model-len |
| Slow first request | Model loading | Pre-warm with a dummy request (sketch below) |
| Port conflict | Both on same port | Use different --port values |
| CUDA error | Wrong GPU index | Set CUDA_VISIBLE_DEVICES=0 |
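The pre-warm fix from the table can be a single throwaway request to each server right after startup, so the first real request does not pay the warm-up cost. A sketch using the endpoints configured above:

import httpx

# One tiny request per server; generous timeouts because the first call is the slow one.
httpx.post(
    "http://localhost:8010/v1/embeddings",
    json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["warm-up"]},
    timeout=120.0,
)
httpx.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen3Guard-Gen-8B", "prompt": "warm-up", "max_tokens": 1},
    timeout=120.0,
)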
Memory Planning Formula
Total allocation = model1_util + model2_util + overhead
Safe configuration:
- 2 models: keep sum under 80%
- 3 models: keep sum under 75%
- Always leave 10-15% for system overhead
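A small helper to enforce that rule before launching anything (a sketch; the 15% headroom mirrors the guidance above):

def check_allocation(utilizations: list[float], headroom: float = 0.15) -> None:
    """Fail fast if the combined --gpu-memory-utilization values leave too little headroom."""
    total = sum(utilizations)
    if total + headroom > 1.0:
        raise ValueError(f"{total:.0%} allocated leaves less than {headroom:.0%} headroom")
    print(f"OK: {total:.0%} allocated, {1 - total:.0%} free")

check_allocation([0.15, 0.60])  # the two-model setup from this post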
Alternative Configurations
High Throughput (Embedding-Heavy)
# More memory for embedding, batch processing
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.25 \
--max-model-len 8192 \
--max-num-seqs 512
vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.50 \
--max-model-len 16384
Long Context (Generation-Heavy)
# Minimal embedding, max context for generation
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.10 \
--max-model-len 4096
vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.70 \
--max-model-len 65536
Cost Analysis (RunPod)
| GPU | VRAM | Price/hr | Models Supported |
|---|---|---|---|
| RTX 4090 | 24GB | ~$0.44 | Single model only |
| L40S | 48GB | ~$0.99 | 2-3 models |
| A100 80GB | 80GB | ~$1.89 | 3-4 models |
L40S sweet spot: run two production models for ~$0.99/hr, versus ~$0.88/hr for two RTX 4090s plus the operational complexity of coordinating separate pods.
Summary
| Configuration | Value |
|---|---|
| GPU | NVIDIA L40S (48GB) |
| Embedding Model | Qwen3-Embedding-0.6B @ 15% |
| Guard Model | Qwen3Guard-Gen-8B @ 60% |
| Reserved | 25% for overhead |
| Total Cost | ~$0.99/hr on RunPod |
Multi-model serving on a single GPU eliminates network latency between services and reduces infrastructure complexity. The key is precise memory allocation with --gpu-memory-utilization.
