How to Use vLLM for High-Performance LLM Serving
vLLM is a high-throughput, memory-efficient inference engine for large language models. Its PagedAttention algorithm manages KV-cache memory in small blocks rather than one large contiguous allocation per request, which the vLLM authors report delivers up to 24x higher throughput than HuggingFace Transformers. Running vLLM on your Breeze is ideal for production LLM workloads that require fast response times.
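PagedAttention's key idea is to allocate the KV cache in small fixed-size blocks on demand, like virtual-memory pages, instead of reserving a contiguous worst-case region per request. A toy sketch for intuition (not vLLM's actual implementation):

```python
# Toy sketch of PagedAttention-style KV-cache paging (illustrative only).
# A fixed pool of physical blocks is handed out on demand; each sequence keeps
# a block table mapping its logical token positions to physical block ids.
BLOCK_SIZE = 16  # tokens per block

class PagedKVCache:
    def __init__(self, num_blocks: int):
        self.free = list(range(num_blocks))  # physical blocks not yet in use
        self.tables = {}    # seq_id -> list of physical block ids
        self.lengths = {}   # seq_id -> tokens written so far

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % BLOCK_SIZE == 0:  # current block full (or this is the first token)
            if not self.free:
                raise MemoryError("KV cache exhausted")
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def free_sequence(self, seq_id: int) -> None:
        self.free.extend(self.tables.pop(seq_id, []))  # blocks return to the pool
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(num_blocks=64)
for _ in range(20):
    cache.append_token(seq_id=0)
print(len(cache.tables[0]))  # → 2: twenty tokens need only two 16-token blocks
cache.free_sequence(0)
```

Because blocks are claimed as tokens arrive and returned the moment a request finishes, many more concurrent sequences fit in the same VRAM than with worst-case pre-allocation.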
Prerequisites
- A Breeze instance with an NVIDIA GPU (16+ GB VRAM recommended)
- CUDA 11.8 or later installed
- Python 3.9 or later
- At least 40 GB of disk space for model weights
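The Python and disk-space prerequisites can be sanity-checked with a few lines (GPU and CUDA are easiest to verify with nvidia-smi from the shell):

```python
import shutil
import sys

# Check the Python and disk-space prerequisites listed above.
assert sys.version_info >= (3, 9), "vLLM requires Python 3.9+"
free_gb = shutil.disk_usage("/").free / 1e9
print(f"Python {sys.version_info.major}.{sys.version_info.minor}, {free_gb:.0f} GB free on /")
```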
Installing vLLM
python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install vllm
Starting the OpenAI-Compatible Server
Launch vLLM with a Hugging Face model:
python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 \
--port 8000 \
--max-model-len 4096 \
--tensor-parallel-size 1
vLLM downloads the model automatically from Hugging Face on first launch and caches it locally. Note that the Llama model repositories are gated, so export a HUGGING_FACE_HUB_TOKEN with access to the repository before launching.
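To see why --max-model-len saves memory, here is a back-of-envelope KV-cache estimate, assuming Llama-3-8B-style dimensions (32 layers, 8 KV heads, head dimension 128, fp16; check the model's config.json for the real values):

```python
# Back-of-envelope KV-cache size per sequence (assumed Llama-3-8B-style dims).
layers, kv_heads, head_dim = 32, 8, 128   # assumptions, not read from the model
bytes_per_elem = 2                        # fp16
max_model_len = 4096

per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem  # K and V
per_seq_mib = per_token * max_model_len / 2**20
print(f"{per_token} bytes/token, {per_seq_mib:.0f} MiB per full-length sequence")
# → 131072 bytes/token, 512 MiB per full-length sequence
```

Lowering --max-model-len shrinks this per-sequence ceiling proportionally, which is why it is a useful knob when the server fails to allocate memory.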
Making API Requests
vLLM serves an OpenAI-compatible API, so you can use any OpenAI SDK or curl:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3-8B-Instruct",
"messages": [{"role": "user", "content": "Write a haiku about cloud servers."}],
"temperature": 0.7,
"max_tokens": 100
}'
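The server replies in the standard OpenAI chat-completions schema. Extracting the generated text, shown here against an illustrative payload (the field layout is the real schema; the values are made up):

```python
import json

# An illustrative response body in the OpenAI chat-completions shape
# (field names follow the real schema; the values here are invented).
raw = """
{
  "id": "chatcmpl-123",
  "object": "chat.completion",
  "model": "meta-llama/Meta-Llama-3-8B-Instruct",
  "choices": [
    {"index": 0,
     "message": {"role": "assistant", "content": "Servers hum in rows..."},
     "finish_reason": "stop"}
  ],
  "usage": {"prompt_tokens": 16, "completion_tokens": 11, "total_tokens": 27}
}
"""

data = json.loads(raw)
text = data["choices"][0]["message"]["content"]
usage = data["usage"]["total_tokens"]
print(text, f"({usage} tokens)")
```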
Using the Python Client
from openai import OpenAI

# Point the SDK at the local vLLM server; vLLM does not check the API key by default.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are the benefits of self-hosted AI?"}],
    stream=True,  # yield tokens as they are generated
)

for chunk in response:
    # Skip chunks that carry no text
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
Key Configuration Options
- --tensor-parallel-size — number of GPUs for tensor parallelism (set to your GPU count)
- --max-model-len — maximum sequence length; reduce it to save GPU memory
- --gpu-memory-utilization — fraction of GPU memory to use (default 0.9)
- --quantization — use awq or gptq for quantized models that fit in less VRAM
- --enforce-eager — disables CUDA graphs, useful for debugging
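These flags compose. For example, a two-GPU launch with a tighter memory budget might look like this (the values are illustrative; adjust to your hardware):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-8B-Instruct \
  --tensor-parallel-size 2 \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192 \
  --host 0.0.0.0 --port 8000
```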
Serving Quantized Models
Run quantized models to serve larger LLMs on smaller GPUs:
python -m vllm.entrypoints.openai.api_server \
--model TheBloke/Llama-2-13B-chat-AWQ \
--quantization awq \
--host 0.0.0.0 --port 8000
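The memory saving is easy to estimate: 4-bit AWQ weights occupy roughly a quarter of their fp16 size (weights only; the KV cache and activations come on top):

```python
# Rough weight-memory comparison for a 13B-parameter model (weights only).
params = 13e9
fp16_gb = params * 2 / 1e9      # 2 bytes per parameter
awq4_gb = params * 0.5 / 1e9    # ~4 bits per parameter, ignoring quantization overhead
print(f"fp16: {fp16_gb:.0f} GB, 4-bit AWQ: {awq4_gb:.1f} GB")
# → fp16: 26 GB, 4-bit AWQ: 6.5 GB
```

That is the difference between needing a 40 GB-class GPU and fitting comfortably in 16 GB of VRAM.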
Benchmarking Performance
The vLLM source repository includes a benchmarking script, benchmarks/benchmark_serving.py, to measure throughput; it ships in the source tree rather than as an installable module. With the server from the earlier section still running, clone the repository and point the script at it:
git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
--backend openai \
--base-url http://localhost:8000 \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--num-prompts 100
This reports tokens-per-second throughput and latency percentiles, helping you right-size your Breeze instance for your workload.
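To interpret the report: throughput is generated tokens divided by wall-clock time, and the latency percentiles summarize the per-request distribution. A toy calculation with made-up numbers:

```python
import math
import statistics

# Hypothetical per-request latencies (seconds) and a fixed completion length;
# real numbers come from the benchmark report, these are invented.
latencies = [0.8, 0.9, 1.0, 1.1, 1.2, 1.4, 1.6, 2.0, 2.5, 3.1]
tokens_per_request = 100
wall_time = 5.0  # requests ran concurrently, so wall time < sum of latencies

throughput = len(latencies) * tokens_per_request / wall_time
lat = sorted(latencies)
p50 = statistics.median(lat)
p95 = lat[math.ceil(0.95 * len(lat)) - 1]  # nearest-rank percentile
print(f"{throughput:.0f} tok/s, p50={p50:.2f}s, p95={p95:.2f}s")
# → 200 tok/s, p50=1.30s, p95=3.10s
```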
Running as a Systemd Service
Create /etc/systemd/system/vllm.service with the following unit:
[Unit]
Description=vLLM Inference Server
After=network.target
[Service]
User=deploy
ExecStart=/home/deploy/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
--model meta-llama/Meta-Llama-3-8B-Instruct \
--host 0.0.0.0 --port 8000
Restart=always
Environment=HUGGING_FACE_HUB_TOKEN=your_token_here
[Install]
WantedBy=multi-user.target
This ensures your vLLM server on your Breeze starts automatically and restarts on failure.
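With the unit file saved as /etc/systemd/system/vllm.service, reload systemd and enable the service:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
systemctl status vllm   # confirm the service is active
```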