What is Mixtral?
Mixtral is a sparse Mixture-of-Experts (MoE) model from Mistral AI. Each layer contains 8 expert feed-forward networks, and a learned router activates only 2 of them per token, so of the model's ~46.7B total parameters only ~13B are used in each forward pass. Mistral AI reports that Mixtral 8x7B matches or outperforms GPT-3.5 on most standard benchmarks while keeping inference cost close to that of a 13B dense model.
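The routing idea can be sketched in a few lines. This is a toy illustration of top-2 gating, not Mixtral's actual implementation: a router scores the 8 experts for each token, keeps the top 2, and mixes their outputs with softmax-renormalized weights.

```python
# Toy sketch of Mixtral-style top-2 expert routing (illustrative only).
import math

NUM_EXPERTS = 8   # Mixtral 8x7B has 8 expert FFNs per layer
TOP_K = 2         # only 2 experts run per token

def route(router_logits):
    """Return the chosen expert indices and their mixing weights."""
    top = sorted(range(NUM_EXPERTS),
                 key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    # Softmax over just the selected logits renormalizes the gate weights.
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# Hypothetical router scores for one token:
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2]
experts, weights = route(logits)
print(experts)                  # indices of the two highest-scoring experts
print(round(sum(weights), 6))   # -> 1.0
```

Because only the selected experts' FFNs execute, per-token compute scales with the ~13B active parameters, not the full 46.7B.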
Running with Ollama
ollama pull mixtral:8x7b
ollama run mixtral:8x7b "Explain quantum computing in simple terms"
# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral:8x7b",
  "prompt": "Write a Python function to sort a list",
  "stream": false
}'
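The same call is easy to make from Python. A minimal sketch, assuming the Ollama server above is running on localhost:11434 (the `requests` usage is commented out so the snippet stands alone):

```python
# Build the same /api/generate request body as the curl example above.
import json

payload = {
    "model": "mixtral:8x7b",
    "prompt": "Write a Python function to sort a list",
    "stream": False,  # False -> one JSON response instead of a token stream
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Ollama server and `requests`):
# import requests
# resp = requests.post("http://localhost:11434/api/generate", data=body)
# print(resp.json()["response"])
```

With `"stream": true` (the default), Ollama instead returns one JSON object per generated token.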
Running with vLLM (Higher Performance)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --dtype auto \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
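This server speaks the OpenAI-compatible API at http://localhost:8000/v1, so any OpenAI client works against it. A sketch of the chat-completions request body (`max_tokens` is an assumed value for illustration):

```python
# Request body for the vLLM server's OpenAI-compatible
# /v1/chat/completions endpoint; send with any HTTP client
# or the official `openai` package pointed at the base URL.
import json

request = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "max_tokens": 256,  # assumed limit for illustration
}
print(json.dumps(request, indent=2))
```

Note that `--tensor-parallel-size 2` shards the model across two GPUs; set it to 1 for a single sufficiently large GPU.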
Hardware Requirements
- Mixtral 8x7B: 24GB+ VRAM (4-bit quantized) or ~94GB (16-bit full precision)
- With 4-bit quantization: runs on 2x 16GB GPUs or 1x 48GB GPU
- CPU inference: possible but slow (32GB+ RAM)
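A quick back-of-envelope check of these numbers, assuming ~46.7B total parameters (all experts must be resident in memory even though only ~13B are active per token) and ~4.5 bits per weight for Q4_K_M:

```python
# Rough weight-memory estimate for Mixtral 8x7B at different precisions.
TOTAL_PARAMS = 46.7e9  # all 8 experts count toward memory, not just the active 2

def weight_gb(bits_per_param):
    """Approximate weight storage in GB (ignores KV cache and activations)."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # 16-bit weights -> ~93 GB, hence the ~94GB figure
q4 = weight_gb(4.5)    # Q4_K_M averages ~4.5 bits/weight -> ~26 GB
print(round(fp16, 1), round(q4, 1))
```

The ~26GB quantized footprint is why 2x 16GB GPUs (32GB total) or a single 48GB card works; KV cache and activation memory come on top of the weights.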
Quantization for Smaller GPUs
# Ollama automatically uses quantized versions
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M # 4-bit quantized
# With llama.cpp (newer builds name this binary llama-cli instead of main)
./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -p "Your prompt" -n 512 -t 8   # -n: max tokens to generate, -t: CPU threads