
Multi-Model vLLM Serving: GPU Memory Management on RunPod L40S

fr4nk · Software Engineer, Hugging Face · 5 min read

Run multiple vLLM instances on a single GPU with precise memory allocation.

The Setup

Two models, one GPU, zero conflicts:

# Model 1: Embedding (lightweight)
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192

# Model 2: Guard/Generation (heavy)
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
| Model | Port | GPU Memory | Context | Use Case |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 8010 | 15% (~7.2GB) | 8K | Embeddings, RAG |
| Qwen3Guard-Gen-8B | 8000 | 60% (~28.8GB) | 32K | Content moderation, generation |

Total GPU allocation: 75% (36GB of 48GB)


Why L40S?

| Spec | NVIDIA L40S |
|---|---|
| VRAM | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| FP16 Performance | 362 TFLOPS |
| Architecture | Ada Lovelace |
| Use Case | Inference workloads |

The L40S is optimized for inference rather than training, and its 48GB of VRAM enables multi-model deployments that would otherwise require multiple smaller GPUs.


GPU Memory Breakdown

┌─────────────────────────────────────────────────────────────┐
│ L40S 48GB VRAM │
├─────────────────────────────────────────────────────────────┤
│ Qwen3-Embedding-0.6B │ Qwen3Guard-Gen-8B │ Reserved │
│ 15% (7.2GB) │ 60% (28.8GB) │ 25% (12GB) │
├────────────────────────┼─────────────────────┼──────────────┤
│ Model: ~1.2GB │ Model: ~16GB │ System │
│ KV Cache: ~6GB │ KV Cache: ~12.8GB │ overhead │
│ Context: 8192 │ Context: 32768 │ & buffer │
└────────────────────────┴─────────────────────┴──────────────┘

Memory Calculation

Model weights (FP16):

  • 0.6B params × 2 bytes = ~1.2GB
  • 8B params × 2 bytes = ~16GB

KV Cache (FP16, total across a batch):

KV cache = num_layers × 2 × hidden_size × 2 bytes × batch_size × seq_len

The --gpu-memory-utilization flag reserves memory for both model weights and KV cache.
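
To sanity-check these allocations, the formula above can be turned into a quick estimate. This is a minimal sketch: the layer count and hidden size below are illustrative placeholders rather than values from the Qwen3 model cards, and models that use grouped-query attention need considerably less KV cache than this upper bound.

# Rough FP16 memory estimate: weights plus KV cache, using the formula above.

def weights_gb(num_params_b: float, bytes_per_param: int = 2) -> float:
    # FP16 weights: parameter count x 2 bytes
    return num_params_b * 1e9 * bytes_per_param / 1e9

def kv_cache_gb(num_layers: int, hidden_size: int, seq_len: int,
                batch_size: int, bytes_per_elem: int = 2) -> float:
    # num_layers x 2 (K and V) x hidden_size x 2 bytes x batch_size x seq_len
    return num_layers * 2 * hidden_size * bytes_per_elem * batch_size * seq_len / 1e9

# Hypothetical 8B model: 32 layers, hidden size 4096, one sequence at 32K context
total = weights_gb(8) + kv_cache_gb(num_layers=32, hidden_size=4096,
                                    seq_len=32768, batch_size=1)
print(f"~{total:.1f} GB for weights plus one full-context sequence")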


Key vLLM Flags

| Flag | Purpose | Trade-off |
|---|---|---|
| --gpu-memory-utilization | Fraction of GPU memory to use | Higher = more throughput, higher OOM risk |
| --max-model-len | Max context length | Higher = more memory per request |
| --host 0.0.0.0 | Bind to all interfaces | Required for external access |
| --port | API port | Separate ports for multi-model serving |

Additional Optimization Flags

# For production workloads
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-chunked-prefill \
--disable-log-requests
| Flag | Effect |
|---|---|
| --max-num-seqs | Max concurrent sequences |
| --enable-chunked-prefill | Better memory efficiency for long prompts |
| --disable-log-requests | Reduce I/O overhead in production |

RunPod Deployment

Pod Configuration

# runpod.yaml
gpu: NVIDIA L40S
gpu_count: 1
volume_size: 50 # GB for model cache
container_image: vllm/vllm-openai:latest

Startup Script

#!/bin/bash
# start-models.sh

# Start embedding model in background
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192 &

# Wait for embedding model to load
sleep 30

# Start guard model
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
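
The sleep 30 is a rough heuristic; load time depends on disk and network speed. A more robust approach is to poll the first server before launching the second. The sketch below is a minimal example that assumes the OpenAI-compatible server's /health endpoint; adjust the path if your vLLM version exposes readiness differently.

# wait_ready.py - poll a vLLM server until it responds, instead of sleeping a fixed time
import sys
import time

import httpx

def wait_ready(base_url: str, timeout_s: int = 300) -> None:
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            # /health is assumed here; it should return 200 once the model is loaded
            if httpx.get(f"{base_url}/health", timeout=5).status_code == 200:
                print(f"{base_url} is ready")
                return
        except httpx.HTTPError:
            pass  # server not accepting connections yet
        time.sleep(5)
    sys.exit(f"timed out waiting for {base_url}")

if __name__ == "__main__":
    wait_ready("http://localhost:8010")  # embedding model

In start-models.sh, running python wait_ready.py can then replace the fixed sleep before the guard model is started.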
Each model can also be managed as a systemd service so it restarts automatically on failure:

# /etc/systemd/system/vllm-embedding.service
[Unit]
Description=vLLM Embedding Service
After=network.target

[Service]
ExecStart=/usr/bin/vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 --port 8010 \
--gpu-memory-utilization 0.15 --max-model-len 8192
Restart=always
Environment="CUDA_VISIBLE_DEVICES=0"

[Install]
WantedBy=multi-user.target

API Usage

Embedding Endpoint

import httpx

async def get_embeddings(texts: list[str]) -> list[list[float]]:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8010/v1/embeddings",
            json={
                "model": "Qwen/Qwen3-Embedding-0.6B",
                "input": texts
            }
        )
        return [e["embedding"] for e in response.json()["data"]]

Guard/Generation Endpoint

async def generate(prompt: str, max_tokens: int = 512) -> str:
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8000/v1/completions",
            json={
                "model": "Qwen/Qwen3Guard-Gen-8B",
                "prompt": prompt,
                "max_tokens": max_tokens
            }
        )
        return response.json()["choices"][0]["text"]

Combined RAG Pipeline

import asyncio
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # Helper used below, defined here so the snippet is self-contained
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

async def rag_query(query: str, documents: list[str]) -> str:
    # 1. Embed query and documents (port 8010)
    query_emb, doc_embs = await asyncio.gather(
        get_embeddings([query]),
        get_embeddings(documents)
    )

    # 2. Find most similar documents
    similarities = [cosine_similarity(query_emb[0], d) for d in doc_embs]
    top_docs = sorted(zip(documents, similarities), key=lambda x: -x[1])[:3]

    # 3. Generate answer with context (port 8000)
    context = "\n".join([d[0] for d in top_docs])
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

    return await generate(prompt)
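
A minimal end-to-end call, with placeholder documents standing in for a real corpus:

import asyncio

docs = [
    "vLLM can serve multiple models on one GPU via --gpu-memory-utilization.",
    "The NVIDIA L40S has 48GB of GDDR6 VRAM.",
]

answer = asyncio.run(rag_query("How much VRAM does the L40S have?", docs))
print(answer)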

Monitoring GPU Usage

nvidia-smi

# Watch GPU utilization
watch -n 1 nvidia-smi

# Output example:
# +-----------------------------------------------------------------------------+
# | Processes:                                                                   |
# |  GPU   GI   CI        PID   Type   Process name                   GPU Memory |
# |        ID   ID                                                    Usage      |
# |==============================================================================|
# |    0   N/A  N/A     12345      C   ...vllm.entrypoints.openai        7168MiB |
# |    0   N/A  N/A     12346      C   ...vllm.entrypoints.openai       28672MiB |
# +-----------------------------------------------------------------------------+

vLLM Metrics

# Prometheus metrics endpoint
curl http://localhost:8000/metrics | grep vllm

# Key metrics:
# vllm:num_requests_running
# vllm:num_requests_waiting
# vllm:gpu_cache_usage_perc
# vllm:cpu_cache_usage_perc
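
For a quick check without a Prometheus server, the same endpoint can be scraped directly. A minimal sketch, assuming the metric names listed above match your vLLM version:

# Report KV cache usage for both models from their /metrics endpoints
import httpx

for name, port in [("embedding", 8010), ("guard", 8000)]:
    text = httpx.get(f"http://localhost:{port}/metrics", timeout=5).text
    for line in text.splitlines():
        # vllm:gpu_cache_usage_perc reports the fraction of KV cache blocks in use
        if line.startswith("vllm:gpu_cache_usage_perc"):
            print(f"{name} (port {port}): KV cache usage {line.split()[-1]}")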

Troubleshooting

| Issue | Cause | Fix |
|---|---|---|
| OOM on startup | Memory sum > 100% | Reduce --gpu-memory-utilization |
| OOM during inference | KV cache overflow | Lower --max-model-len |
| Slow first request | Model loading | Pre-warm with dummy request |
| Port conflict | Both on same port | Use different --port values |
| CUDA error | Wrong GPU index | Set CUDA_VISIBLE_DEVICES=0 |

Memory Planning Formula

Total allocation = model1_util + model2_util + overhead

Safe configuration:
- 2 models: keep sum under 80%
- 3 models: keep sum under 75%
- Always leave 10-15% for system overhead
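
These guidelines are easy to encode as a pre-deployment check; the thresholds below simply mirror the list above.

# Validate planned --gpu-memory-utilization values before launching anything
def check_allocation(utilizations: list[float]) -> None:
    total = sum(utilizations)
    # Guideline: keep the sum under 80% for 2 models, under 75% for 3 or more
    limit = 0.80 if len(utilizations) <= 2 else 0.75
    status = "OK" if total <= limit else "RISK OF OOM"
    print(f"{len(utilizations)} models, total {total:.0%} (limit {limit:.0%}): {status}")

check_allocation([0.15, 0.60])        # the setup in this post
check_allocation([0.25, 0.50, 0.20])  # a hypothetical 3-model split (too high)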

Alternative Configurations

High Throughput (Embedding-Heavy)

# More memory for embedding, batch processing
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.25 \
--max-model-len 8192 \
--max-num-seqs 512

vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.50 \
--max-model-len 16384

Long Context (Generation-Heavy)

# Minimal embedding, max context for generation
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.10 \
--max-model-len 4096

vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.70 \
--max-model-len 65536

Cost Analysis (RunPod)

| GPU | VRAM | Price/hr | Models Supported |
|---|---|---|---|
| RTX 4090 | 24GB | ~$0.44 | Single model only |
| L40S | 48GB | ~$0.99 | 2-3 models |
| A100 80GB | 80GB | ~$1.89 | 3-4 models |

L40S sweet spot: run two production models for ~$0.99/hr, versus ~$0.88/hr for 2× RTX 4090 plus the added complexity of a multi-GPU setup.


Summary

| Configuration | Value |
|---|---|
| GPU | NVIDIA L40S (48GB) |
| Embedding Model | Qwen3-Embedding-0.6B @ 15% |
| Guard Model | Qwen3Guard-Gen-8B @ 60% |
| Reserved | 25% for overhead |
| Total Cost | ~$0.99/hr on RunPod |

Multi-model serving on a single GPU eliminates network latency between services and reduces infrastructure complexity. The key is precise memory allocation with --gpu-memory-utilization.