vLLM Prefill-Decode Disaggregation (PD Disagg) Setup Guide¶
Source https://github.com/vllm-project/vllm/blob/main/examples/online_serving/pd-disagg-readme.md.
This guide explains how to run vLLM in a Prefill-Decode Disaggregated configuration for DeepSeek-R1-0528-NVFP4 or Kimi-K2-Thinking-NVFP4 on GB200, where prefill and decode workloads run on separate nodes/instances, connected via a router.
Configuration Summary¶
| Component | TP Size | Nodes | GPUs | Port |
|---|---|---|---|---|
| Prefill Instance 0 | 8 | 2 (master + worker) | 8 | 8087 |
| Prefill Instance 1 | 8 | 2 (master + worker) | 8 | 8087 |
| Decode Instance | 8 | 2 (master + worker) | 8 | 8087 |
| Router | - | 1 | 0 | 8123 |
Total: 7 nodes, 24 GPUs (4 GPUs/node)
Environment Variables¶
System Environment¶
export NVIDIA_GDRCOPY=1
export NVSHMEM_IB_ENABLE_IBGDA=1
export NVSHMEM_BOOTSTRAP_UID_SOCK_IFNAME={NETWORK_INTERFACE}
export UCX_IB_ROCE_REACHABILITY_MODE=local_subnet
export VLLM_SKIP_P2P_CHECK=1
export GLOO_SOCKET_IFNAME={NETWORK_INTERFACE}
export NCCL_SOCKET_IFNAME={NETWORK_INTERFACE}
export NCCL_CUMEM_ENABLE=1
export NCCL_MNNVL_ENABLE=1
export NCCL_NVLS_ENABLE=1
export NCCL_TIMEOUT=1800
export TORCH_NCCL_HEARTBEAT_TIMEOUT_SEC=1800
export HF_HOME={HF_CACHE_DIR}
vLLM Environment¶
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_USE_TRTLLM_RAGGED_DEEPSEEK_PREFILL=1
export VLLM_USE_NCCL_SYMM_MEM=1
PD Disaggregation Environment (Required for all prefill/decode nodes)¶
export VLLM_NIXL_SIDE_CHANNEL_HOST=$(hostname -i) # or {NODE_IP}
export VLLM_NIXL_SIDE_CHANNEL_PORT=5600
export VLLM_NIXL_ABORT_REQUEST_TIMEOUT=300
Decode-Specific Environment (Optional, for performance tuning)¶
# Enable multi-stream for shared experts (beneficial for smaller batches)
export VLLM_SHARED_EXPERTS_STREAM_TOKEN_THRESHOLD=8192
vLLM Serve Commands¶
Common Arguments (Shared by all instances)¶
COMMON_ARGS="
--model {MODEL_PATH}
--kv-cache-dtype fp8
--tensor-parallel-size 1
--pipeline-parallel-size 1
--enable-expert-parallel
--data-parallel-rpc-port 13345
--max-model-len 4096
--data-parallel-size-local 4
--disable-uvicorn-access-log
--no-enable-prefix-caching
--port 8087
--trust_remote_code
--no-enable-chunked-prefill
--all2all-backend allgather_reducescatter
--data-parallel-hybrid-lb
--compilation_config.custom_ops+=+quant_fp8,+rms_norm,+rotary_embedding
--compilation_config.pass_config.fuse_attn_quant true
--compilation_config.pass_config.fuse_allreduce_rms true
--compilation_config.pass_config.eliminate_noops true
--async-scheduling
--kv-transfer-config '{\"kv_connector\":\"NixlConnector\",\"kv_role\":\"kv_both\",\"kv_load_failure_policy\":\"fail\"}'
"
Prefill Instance Arguments¶
PREFILL_ARGS="
$COMMON_ARGS
--swap-space 16
--max-num-seqs 8
--enforce-eager
--gpu-memory-utilization 0.9
--max-num-batched-tokens 16384
"
Decode Instance Arguments¶
DECODE_ARGS="
$COMMON_ARGS
--compilation-config '{\"cudagraph_mode\":\"FULL_DECODE_ONLY\"}'
--gpu-memory-utilization 0.9
--stream-interval 50
--max-num-seqs 512
--max-num-batched-tokens 4096
--max-cudagraph-capture-size 512
"
Starting the Instances¶
1. Prefill Master (Node 0)¶
# Run on {PREFILL_MASTER_0_HOSTNAME}
vllm serve {MODEL_PATH} \
$PREFILL_ARGS \
--data-parallel-address {PREFILL_MASTER_0_HOSTNAME} \
--data-parallel-size 8
2. Prefill Worker (Node 1, same instance as Master 0)¶
# Run on {PREFILL_WORKER_0_HOSTNAME}
vllm serve {MODEL_PATH} \
$PREFILL_ARGS \
--data-parallel-address {PREFILL_MASTER_0_HOSTNAME} \
--data-parallel-start-rank 4 \
--data-parallel-size 8
3. Additional Prefill Instance (Nodes 2-3)¶
Repeat steps 1-2 with {PREFILL_MASTER_1_HOSTNAME} and {PREFILL_WORKER_1_HOSTNAME}.
4. Decode Master¶
# Run on {DECODE_MASTER_HOSTNAME}
vllm serve {MODEL_PATH} \
$DECODE_ARGS \
--data-parallel-address {DECODE_MASTER_HOSTNAME} \
--data-parallel-size 8
5. Decode Worker¶
# Run on {DECODE_WORKER_HOSTNAME}
vllm serve {MODEL_PATH} \
$DECODE_ARGS \
--data-parallel-address {DECODE_MASTER_HOSTNAME} \
--data-parallel-start-rank 4 \
--data-parallel-size 8
Router Setup¶
The router distributes requests to prefill instances and routes KV cache transfers to decode instances.
Build & Run Router¶
cd /vllm-workspace/router # or your vllm router directory
RUST_LOG=warn cargo run --release -- \
--policy round_robin \
--vllm-pd-disaggregation \
--max-concurrent-requests 9216 \
--prefill http://{PREFILL_MASTER_0_HOSTNAME}:8087 \
--prefill http://{PREFILL_WORKER_0_HOSTNAME}:8087 \
--prefill http://{PREFILL_MASTER_1_HOSTNAME}:8087 \
--prefill http://{PREFILL_WORKER_1_HOSTNAME}:8087 \
--decode http://{DECODE_MASTER_HOSTNAME}:8087 \
--decode http://{DECODE_WORKER_HOSTNAME}:8087 \
--host 0.0.0.0 \
--port 8123 \
--intra-node-data-parallel-size 4
Key Router Options: - --vllm-pd-disaggregation: Enable prefill-decode disaggregation mode - --prefill: Prefill instance endpoints (can specify multiple) - --decode: Decode instance endpoints (can specify multiple) - --intra-node-data-parallel-size: Number of GPUs per node for hybrid load balancing
Health Checks¶
Wait for all instances to be ready before starting the router:
# Check prefill instances
curl -s http://{PREFILL_MASTER_0_HOSTNAME}:8087/health
curl -s http://{PREFILL_WORKER_0_HOSTNAME}:8087/health
curl -s http://{PREFILL_MASTER_1_HOSTNAME}:8087/health
curl -s http://{PREFILL_WORKER_1_HOSTNAME}:8087/health
# Check decode instance
curl -s http://{DECODE_MASTER_HOSTNAME}:8087/health
curl -s http://{DECODE_WORKER_HOSTNAME}:8087/health
# Check router
curl -s http://{ROUTER_HOSTNAME}:8123/health
Running Benchmarks¶
vllm bench serve \
--model {MODEL_PATH} \
--host {ROUTER_HOSTNAME} \
--port 8123 \
--dataset-name random \
--ignore-eos \
--num-prompts 5120 \
--max-concurrency 2048 \
--random-input-len 4096 \
--random-output-len 2048 \
--ready-check-timeout-sec 0 \
--trust_remote_code
Placeholder Reference¶
| Placeholder | Description |
|---|---|
{MODEL_PATH} | Path to the model (e.g., nvidia/DeepSeek-R1-0528-FP4-v2) |
{NETWORK_INTERFACE} | Network interface name (e.g., eth0, enP22p3s0f1np1) |
{HF_CACHE_DIR} | Hugging Face cache directory |
{PREFILL_MASTER_0_HOSTNAME} | Hostname/IP of prefill instance 0 master node |
{PREFILL_WORKER_0_HOSTNAME} | Hostname/IP of prefill instance 0 worker node |
{PREFILL_MASTER_1_HOSTNAME} | Hostname/IP of prefill instance 1 master node |
{PREFILL_WORKER_1_HOSTNAME} | Hostname/IP of prefill instance 1 worker node |
{DECODE_MASTER_HOSTNAME} | Hostname/IP of decode instance master node |
{DECODE_WORKER_HOSTNAME} | Hostname/IP of decode instance worker node |
{ROUTER_HOSTNAME} | Hostname/IP of router node |