Multi-Model vLLM Serving: GPU Memory Management on RunPod L40S
5 min read
Run multiple vLLM instances on a single GPU with precise memory allocation.
The Setup
Two models, one GPU, zero conflicts:
# Model 1: Embedding (lightweight)
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192
# Model 2: Guard/Generation (heavy)
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
| Model | Port | GPU Memory | Context | Use Case |
|---|---|---|---|---|
| Qwen3-Embedding-0.6B | 8010 | 15% (~7.2GB) | 8K | Embeddings, RAG |
| Qwen3Guard-Gen-8B | 8000 | 60% (~28.8GB) | 32K | Content moderation, generation |
Total GPU allocation: 75% (36GB of 48GB)
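With both commands running, each vLLM instance exposes its own OpenAI-compatible API on its own port. A minimal sanity check (a sketch, assuming both servers run on the same host as shown above):

import httpx

# Each vLLM instance serves the OpenAI-compatible API on its own port.
for port in (8010, 8000):
    models = httpx.get(f"http://localhost:{port}/v1/models").json()
    print(port, [m["id"] for m in models["data"]])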
Why L40S?
| Spec | NVIDIA L40S |
|---|---|
| VRAM | 48GB GDDR6 |
| Memory Bandwidth | 864 GB/s |
| FP16 Tensor Performance | 362 TFLOPS (with sparsity) |
| Architecture | Ada Lovelace |
| Use Case | Inference workloads |
The L40S is optimized for inference rather than training. Its 48GB of VRAM enables multi-model deployments that would otherwise require multiple smaller GPUs.
GPU Memory Breakdown
┌─────────────────────────────────────────────────────────────┐
│ L40S 48GB VRAM │
├─────────────────────────────────────────────────────────────┤
│ Qwen3-Embedding-0.6B │ Qwen3Guard-Gen-8B │ Reserved │
│ 15% (7.2GB) │ 60% (28.8GB) │ 25% (12GB) │
├────────────────────────┼─────────────────────┼──────────────┤
│ Model: ~1.2GB │ Model: ~16GB │ System │
│ KV Cache: ~6GB │ KV Cache: ~12.8GB │ overhead │
│ Context: 8192 │ Context: 32768 │ & buffer │
└────────────────────────┴─────────────────────┴──────────────┘
Memory Calculation
Model weights (FP16):
- 0.6B params × 2 bytes ≈ 1.2GB
- 8B params × 2 bytes ≈ 16GB
KV cache (total across all active sequences):
KV cache = num_layers × 2 (K and V) × hidden_size × 2 bytes × batch_size × seq_len
For models with grouped-query attention such as Qwen3, hidden_size is replaced by num_kv_heads × head_dim, which shrinks the cache considerably.
The --gpu-memory-utilization flag caps the fraction of GPU memory a vLLM instance may use: the model weights are loaded first, and whatever remains under the cap is pre-allocated as KV cache.
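To sanity-check an allocation before launching, the arithmetic above can be scripted. A rough sketch (the layer count and KV dimension are illustrative placeholders, not the published Qwen3 values; read the real ones from the model's config.json):

def estimate_vram_gb(
    params_billion: float,
    num_layers: int,
    kv_dim: int,               # num_kv_heads * head_dim (hidden_size without GQA)
    seq_len: int,
    batch_size: int,
    bytes_per_param: int = 2,  # FP16
) -> float:
    """Rough FP16 weights + KV cache footprint in GB."""
    weights = params_billion * 1e9 * bytes_per_param
    kv_per_token = num_layers * 2 * kv_dim * bytes_per_param  # K and V
    kv_cache = kv_per_token * seq_len * batch_size
    return (weights + kv_cache) / 1e9

# Illustrative numbers only -- not the actual Qwen3 configuration.
print(f"{estimate_vram_gb(8.0, 36, 1024, 32768, 2):.1f} GB")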
Key vLLM Flags
| Flag | Purpose | Trade-off |
|---|---|---|
| --gpu-memory-utilization | % of GPU memory to use | Higher = more throughput, risk of OOM |
| --max-model-len | Max context length | Higher = more memory per request |
| --host 0.0.0.0 | Bind to all interfaces | Required for external access |
| --port | API port | Separate ports for multi-model |
Additional Optimization Flags
# For production workloads
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768 \
--max-num-seqs 256 \
--enable-chunked-prefill \
--disable-log-requests
| Flag | Effect |
|---|---|
| --max-num-seqs | Max concurrent sequences |
| --enable-chunked-prefill | Better memory efficiency for long prompts |
| --disable-log-requests | Reduce I/O overhead in production |
RunPod Deployment
Pod Configuration
# runpod.yaml
gpu: NVIDIA L40S
gpu_count: 1
volume_size: 50 # GB for model cache
container_image: vllm/vllm-openai:latest
Startup Script
#!/bin/bash
# start-models.sh
# Start embedding model in background
vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 \
--port 8010 \
--gpu-memory-utilization 0.15 \
--max-model-len 8192 &
# Wait for embedding model to load
sleep 30
# Start guard model
vllm serve Qwen/Qwen3Guard-Gen-8B \
--host 0.0.0.0 \
--port 8000 \
--gpu-memory-utilization 0.60 \
--max-model-len 32768
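The sleep 30 above is a crude heuristic. A more robust approach is to poll the first server's /health endpoint before launching the second model; a sketch in Python (the vLLM OpenAI server returns 200 from /health once it is ready to serve):

import time
import httpx

def wait_until_healthy(url: str = "http://localhost:8010/health", timeout: float = 300.0) -> None:
    """Poll the vLLM health endpoint until it answers 200 or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            if httpx.get(url, timeout=5.0).status_code == 200:
                return
        except httpx.HTTPError:
            pass  # server not accepting connections yet
        time.sleep(2.0)
    raise TimeoutError(f"{url} not healthy after {timeout:.0f}s")

wait_until_healthy()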
Using systemd (recommended)
# /etc/systemd/system/vllm-embedding.service
[Unit]
Description=vLLM Embedding Service
After=network.target
[Service]
ExecStart=/usr/bin/vllm serve Qwen/Qwen3-Embedding-0.6B \
--host 0.0.0.0 --port 8010 \
--gpu-memory-utilization 0.15 --max-model-len 8192
Restart=always
Environment="CUDA_VISIBLE_DEVICES=0"
[Install]
WantedBy=multi-user.target
API Usage
Embedding Endpoint
import httpx
async def get_embeddings(texts: list[str]) -> list[list[float]]:
    """Call the embedding server on port 8010 and return one vector per input text."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://localhost:8010/v1/embeddings",
            json={
                "model": "Qwen/Qwen3-Embedding-0.6B",
                "input": texts,
            },
        )
        return [e["embedding"] for e in response.json()["data"]]
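A quick one-off call to verify the setup and the embedding dimensionality (assuming the server on port 8010 is up):

import asyncio

vectors = asyncio.run(get_embeddings(["hello world"]))
print(len(vectors), len(vectors[0]))  # 1 vector; dimensionality depends on the model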
Guard/Generation Endpoint
async def generate(prompt: str, max_tokens: int = 512) -> str:
    """Call the guard/generation server on port 8000 and return the completion text."""
    # Generation can take longer than httpx's 5s default timeout.
    async with httpx.AsyncClient(timeout=60.0) as client:
        response = await client.post(
            "http://localhost:8000/v1/completions",
            json={
                "model": "Qwen/Qwen3Guard-Gen-8B",
                "prompt": prompt,
                "max_tokens": max_tokens,
            },
        )
        return response.json()["choices"][0]["text"]
Combined RAG Pipeline
import asyncio
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

async def rag_query(query: str, documents: list[str]) -> str:
    # 1. Embed the query and the documents (port 8010)
    query_emb, doc_embs = await asyncio.gather(
        get_embeddings([query]),
        get_embeddings(documents),
    )
    # 2. Rank documents by similarity to the query
    similarities = [cosine_similarity(query_emb[0], d) for d in doc_embs]
    top_docs = sorted(zip(documents, similarities), key=lambda x: -x[1])[:3]
    # 3. Generate an answer with the top documents as context (port 8000)
    context = "\n".join(d[0] for d in top_docs)
    prompt = f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    return await generate(prompt)
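A minimal invocation of the pipeline, assuming both servers from the setup above are running (the documents are placeholder strings):

if __name__ == "__main__":
    docs = [
        "vLLM manages KV-cache memory with PagedAttention.",
        "The NVIDIA L40S has 48GB of GDDR6 VRAM.",
        "RunPod bills GPU pods by the hour.",
    ]
    print(asyncio.run(rag_query("How much VRAM does the L40S have?", docs)))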
Monitoring GPU Usage
nvidia-smi
# Watch GPU utilization
watch -n 1 nvidia-smi
# Output example:
# +-----------------------------------------------------------------------------+
# | Processes: |
# | GPU GI CI PID Type Process name GPU Memory |
# | ID ID Usage |
# |=============================================================================|
# | 0 N/A N/A 12345 C ...vllm.entrypoints.openai 7168MiB |
# | 0 N/A N/A 12346 C ...vllm.entrypoints.openai 28672MiB |
# +-----------------------------------------------------------------------------+
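The same numbers are available programmatically through NVML, which is handy for alerting. A sketch using the pynvml bindings (pip install nvidia-ml-py):

import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)          # GPU 0
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
util = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"VRAM: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB")
print(f"GPU utilization: {util.gpu}%")
pynvml.nvmlShutdown()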
vLLM Metrics
# Prometheus metrics endpoint
curl http://localhost:8000/metrics | grep vllm
# Key metrics:
# vllm:num_requests_running
# vllm:num_requests_waiting
# vllm:gpu_cache_usage_perc
# vllm:cpu_cache_usage_perc
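The same endpoint can be scraped from code; a short sketch that keeps only the vllm-prefixed series from the Prometheus text output:

import httpx

metrics = httpx.get("http://localhost:8000/metrics").text
for line in metrics.splitlines():
    if line.startswith("vllm:"):  # skips the "# HELP" / "# TYPE" comment lines
        print(line)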
Troubleshooting
| Issue | Cause | Fix |
|---|---|---|
| OOM on startup | Memory sum > 100% | Reduce --gpu-memory-utilization |
| OOM during inference | KV cache overflow | Lower --max-model-len |
| Slow first request | Model loading | Pre-warm with a dummy request (sketch below) |
| Port conflict | Both on same port | Use different --port values |
| CUDA error | Wrong GPU index | Set CUDA_VISIBLE_DEVICES=0 |
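The pre-warm fix from the table can be a single throwaway request to each server right after startup, so the first real request does not pay the warm-up cost. A sketch using the endpoints configured above:

import httpx

# One tiny request per server; generous timeouts because the first call is the slow one.
httpx.post(
    "http://localhost:8010/v1/embeddings",
    json={"model": "Qwen/Qwen3-Embedding-0.6B", "input": ["warm-up"]},
    timeout=120.0,
)
httpx.post(
    "http://localhost:8000/v1/completions",
    json={"model": "Qwen/Qwen3Guard-Gen-8B", "prompt": "warm-up", "max_tokens": 1},
    timeout=120.0,
)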
Memory Planning Formula
Total allocation = model1_util + model2_util + overhead
Safe configuration:
- 2 models: keep sum under 80%
- 3 models: keep sum under 75%
- Always leave 10-15% for system overhead
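A small helper to enforce that rule before launching anything (a sketch; the 15% headroom mirrors the guidance above):

def check_allocation(utilizations: list[float], headroom: float = 0.15) -> None:
    """Fail fast if the combined --gpu-memory-utilization values leave too little headroom."""
    total = sum(utilizations)
    if total + headroom > 1.0:
        raise ValueError(f"{total:.0%} allocated leaves less than {headroom:.0%} headroom")
    print(f"OK: {total:.0%} allocated, {1 - total:.0%} free")

check_allocation([0.15, 0.60])  # the two-model setup from this post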
Alternative Configurations
High Throughput (Embedding-Heavy)
# More memory for embedding, batch processing
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.25 \
--max-model-len 8192 \
--max-num-seqs 512
vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.50 \
--max-model-len 16384
Long Context (Generation-Heavy)
# Minimal embedding, max context for generation
vllm serve Qwen/Qwen3-Embedding-0.6B \
--gpu-memory-utilization 0.10 \
--max-model-len 4096
vllm serve Qwen/Qwen3Guard-Gen-8B \
--gpu-memory-utilization 0.70 \
--max-model-len 65536
Cost Analysis (RunPod)
| GPU | VRAM | Price/hr | Models Supported |
|---|---|---|---|
| RTX 4090 | 24GB | ~$0.44 | Single model only |
| L40S | 48GB | ~$0.99 | 2-3 models |
| A100 80GB | 80GB | ~$1.89 | 3-4 models |
L40S sweet spot: run two production models for ~$0.99/hr, versus ~$0.88/hr for two RTX 4090s plus the operational complexity of coordinating separate pods.
Summary
| Configuration | Value |
|---|---|
| GPU | NVIDIA L40S (48GB) |
| Embedding Model | Qwen3-Embedding-0.6B @ 15% |
| Guard Model | Qwen3Guard-Gen-8B @ 60% |
| Reserved | 25% for overhead |
| Total Cost | ~$0.99/hr on RunPod |
Multi-model serving on a single GPU eliminates network latency between services and reduces infrastructure complexity. The key is precise memory allocation with --gpu-memory-utilization.
