
How to Use vLLM for High-Performance LLM Serving

By Admin · Mar 2, 2026 · Updated Apr 23, 2026

vLLM is a high-throughput, memory-efficient inference engine for large language models. It uses PagedAttention to manage GPU memory dynamically, achieving up to 24x higher throughput than naive serving implementations. Running vLLM on your Breeze instance is ideal for production LLM workloads that require fast response times.
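The core idea behind PagedAttention can be illustrated with a toy allocator (a simplified sketch, not vLLM's actual implementation): the KV cache is split into fixed-size blocks, and each sequence keeps a block table mapping logical token positions to physical blocks, so memory is claimed on demand rather than reserved up front for the maximum sequence length.

```python
BLOCK_SIZE = 16  # tokens per KV-cache block (vLLM's default block size)

class BlockAllocator:
    """Pool of physical KV-cache blocks."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))

    def allocate(self):
        if not self.free:
            raise MemoryError("out of KV-cache blocks")
        return self.free.pop()

    def release(self, block):
        self.free.append(block)

class Sequence:
    """One request's KV cache, tracked via a block table."""
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.num_tokens = 0

    def append_token(self):
        # A new physical block is allocated only when the current one is
        # full, so memory grows in BLOCK_SIZE-token steps instead of being
        # pre-reserved for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(self.allocator.allocate())
        self.num_tokens += 1

    def free(self):
        # Finished sequences return their blocks to the shared pool.
        for b in self.block_table:
            self.allocator.release(b)
        self.block_table.clear()

alloc = BlockAllocator(num_blocks=64)
seq = Sequence(alloc)
for _ in range(40):          # 40 tokens -> ceil(40/16) = 3 blocks
    seq.append_token()
print(len(seq.block_table))  # 3
```

Because blocks are returned to a shared pool as soon as a sequence finishes, many concurrent requests can share one GPU's KV-cache memory, which is where the throughput gains come from.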

Prerequisites

  • A Breeze instance with an NVIDIA GPU (16+ GB VRAM recommended)
  • CUDA 11.8 or later installed
  • Python 3.9 or later
  • At least 40 GB of disk space for model weights

Installing vLLM

python3 -m venv ~/vllm-env
source ~/vllm-env/bin/activate
pip install vllm

Starting the OpenAI-Compatible Server

Launch vLLM with a Hugging Face model:

python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --max-model-len 4096 \
  --tensor-parallel-size 1

vLLM downloads the model automatically from Hugging Face on first launch and caches it locally.
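Weights are cached under the standard Hugging Face cache directory (by default `~/.cache/huggingface`). Gated models such as Meta's Llama family also require a Hugging Face access token; if your model is gated, export it before launching (substitute your own token):

```shell
# Needed only for gated models (e.g. Meta's Llama family)
export HUGGING_FACE_HUB_TOKEN=your_token_here
```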

Making API Requests

vLLM serves an OpenAI-compatible API, so you can use any OpenAI SDK or curl:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-3-8B-Instruct",
    "messages": [{"role": "user", "content": "Write a haiku about cloud servers."}],
    "temperature": 0.7,
    "max_tokens": 100
  }'

Using the Python Client

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="meta-llama/Llama-3-8B-Instruct",
    messages=[{"role": "user", "content": "What are the benefits of self-hosted AI?"}],
    stream=True
)
for chunk in response:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")

Key Configuration Options

  • --tensor-parallel-size — number of GPUs to shard the model across (set to your GPU count)
  • --max-model-len — maximum sequence length; reduce it to save GPU memory
  • --gpu-memory-utilization — fraction of GPU memory vLLM may use (default 0.9)
  • --quantization — set to awq or gptq to serve quantized models that fit in less VRAM
  • --enforce-eager — disable CUDA graphs; useful for debugging, at some throughput cost
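For example, a memory-constrained launch on a single 16 GB GPU might combine several of these flags (illustrative values — tune them for your model and card):

```shell
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --max-model-len 2048 \
  --gpu-memory-utilization 0.85 \
  --port 8000
```

Lowering --max-model-len shrinks the per-request KV-cache budget, which is usually the first lever to pull when the server fails to start with an out-of-memory error.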

Serving Quantized Models

Quantized models let you serve larger LLMs on smaller GPUs:

python -m vllm.entrypoints.openai.api_server \
  --model TheBloke/Llama-2-13B-chat-AWQ \
  --quantization awq \
  --host 0.0.0.0 --port 8000

Benchmarking Performance

vLLM's source repository includes a benchmarking script, benchmarks/benchmark_serving.py, to measure serving throughput. With the server from the earlier section still running, clone the repository and point the script at it (flag names can vary between vLLM versions):

git clone https://github.com/vllm-project/vllm.git
python vllm/benchmarks/benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model meta-llama/Llama-3-8B-Instruct \
  --dataset-name random \
  --num-prompts 100

This reports tokens-per-second throughput and latency percentiles, helping you right-size your Breeze instance for your workload.
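The latency percentiles such a benchmark reports are simple order statistics over per-request latencies. A minimal sketch of a nearest-rank percentile (a hypothetical helper, not part of vLLM):

```python
def percentile(samples, p):
    """Nearest-rank percentile over a list of latency samples."""
    s = sorted(samples)
    # Clamp the rank into the valid index range.
    k = max(0, min(len(s) - 1, round(p / 100 * len(s)) - 1))
    return s[k]

# Example: per-request latencies in seconds
latencies = [0.8, 1.2, 0.9, 3.5, 1.1, 1.0, 0.95, 2.2, 1.3, 1.05]
for p in (50, 95, 99):
    print(f"p{p}: {percentile(latencies, p):.2f}s")
# prints p50: 1.05s, p95: 3.50s, p99: 3.50s
```

Note how a single slow outlier (3.5 s) dominates p95 and p99 — this is why tail latency, not the mean, should drive how you size your instance.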

Running as a Systemd Service

Create /etc/systemd/system/vllm.service with the following contents:

[Unit]
Description=vLLM Inference Server
After=network.target

[Service]
User=deploy
ExecStart=/home/deploy/vllm-env/bin/python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3-8B-Instruct \
  --host 0.0.0.0 --port 8000
Restart=always
Environment=HUGGING_FACE_HUB_TOKEN=your_token_here

[Install]
WantedBy=multi-user.target

Once the service is enabled, your vLLM server starts automatically at boot on your Breeze instance and restarts on failure.
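Assuming the unit file is saved as /etc/systemd/system/vllm.service, reload systemd and enable the service:

```shell
sudo systemctl daemon-reload
sudo systemctl enable --now vllm
journalctl -u vllm -f   # follow the server logs
```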
