What is Mixtral?
Mixtral is a sparse Mixture-of-Experts (MoE) model from Mistral AI. Each layer contains 8 expert feed-forward networks, and a learned router activates only 2 of them per token, so of the model's ~46.7B total parameters only ~13B are used in each forward pass. Mistral AI reports that Mixtral 8x7B matches or outperforms GPT-3.5 on most standard benchmarks while keeping inference cost close to that of a 13B dense model.
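The routing idea can be sketched in a few lines. This is a toy illustration of top-2 gating, not Mixtral's actual implementation: a router scores the 8 experts for each token, keeps the top 2, and mixes their outputs with softmax-renormalized weights.

```python
# Toy sketch of Mixtral-style top-2 expert routing (illustrative only).
import math

NUM_EXPERTS = 8   # Mixtral 8x7B has 8 expert FFNs per layer
TOP_K = 2         # only 2 experts run per token

def route(router_logits):
    """Return the chosen expert indices and their mixing weights."""
    top = sorted(range(NUM_EXPERTS),
                 key=lambda i: router_logits[i], reverse=True)[:TOP_K]
    # Softmax over just the selected logits renormalizes the gate weights.
    exps = [math.exp(router_logits[i]) for i in top]
    total = sum(exps)
    return top, [e / total for e in exps]

# Hypothetical router scores for one token:
logits = [0.1, 2.0, -1.0, 0.5, 1.5, 0.0, -0.5, 0.2]
experts, weights = route(logits)
print(experts)                  # indices of the two highest-scoring experts
print(round(sum(weights), 6))   # -> 1.0
```

Because only the selected experts' FFNs execute, per-token compute scales with the ~13B active parameters, not the full 46.7B.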
Running with Ollama
ollama pull mixtral:8x7b
ollama run mixtral:8x7b "Explain quantum computing in simple terms"
# API usage
curl http://localhost:11434/api/generate -d '{
  "model": "mixtral:8x7b",
  "prompt": "Write a Python function to sort a list",
  "stream": false
}'
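The same call is easy to make from Python. A minimal sketch, assuming the Ollama server above is running on localhost:11434 (the `requests` usage is commented out so the snippet stands alone):

```python
# Build the same /api/generate request body as the curl example above.
import json

payload = {
    "model": "mixtral:8x7b",
    "prompt": "Write a Python function to sort a list",
    "stream": False,  # False -> one JSON response instead of a token stream
}
body = json.dumps(payload)
print(body)

# To actually send it (requires a running Ollama server and `requests`):
# import requests
# resp = requests.post("http://localhost:11434/api/generate", data=body)
# print(resp.json()["response"])
```

With `"stream": true` (the default), Ollama instead returns one JSON object per generated token.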
Running with vLLM (Higher Performance)
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model mistralai/Mixtral-8x7B-Instruct-v0.1 \
  --dtype auto \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --port 8000
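This server speaks the OpenAI-compatible API at http://localhost:8000/v1, so any OpenAI client works against it. A sketch of the chat-completions request body (`max_tokens` is an assumed value for illustration):

```python
# Request body for the vLLM server's OpenAI-compatible
# /v1/chat/completions endpoint; send with any HTTP client
# or the official `openai` package pointed at the base URL.
import json

request = {
    "model": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "messages": [
        {"role": "user", "content": "Explain quantum computing in simple terms"}
    ],
    "max_tokens": 256,  # assumed limit for illustration
}
print(json.dumps(request, indent=2))
```

Note that `--tensor-parallel-size 2` shards the model across two GPUs; set it to 1 for a single sufficiently large GPU.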
Hardware Requirements
- Mixtral 8x7B: 24GB+ VRAM (4-bit quantized) or ~94GB (16-bit full precision)
- With 4-bit quantization: runs on 2x 16GB GPUs or 1x 48GB GPU
- CPU inference: possible but slow (32GB+ RAM)
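A quick back-of-envelope check of these numbers, assuming ~46.7B total parameters (all experts must be resident in memory even though only ~13B are active per token) and ~4.5 bits per weight for Q4_K_M:

```python
# Rough weight-memory estimate for Mixtral 8x7B at different precisions.
TOTAL_PARAMS = 46.7e9  # all 8 experts count toward memory, not just the active 2

def weight_gb(bits_per_param):
    """Approximate weight storage in GB (ignores KV cache and activations)."""
    return TOTAL_PARAMS * bits_per_param / 8 / 1e9

fp16 = weight_gb(16)   # 16-bit weights -> ~93 GB, hence the ~94GB figure
q4 = weight_gb(4.5)    # Q4_K_M averages ~4.5 bits/weight -> ~26 GB
print(round(fp16, 1), round(q4, 1))
```

The ~26GB quantized footprint is why 2x 16GB GPUs (32GB total) or a single 48GB card works; KV cache and activation memory come on top of the weights.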
Quantization for Smaller GPUs
# Ollama automatically uses quantized versions
ollama pull mixtral:8x7b-instruct-v0.1-q4_K_M # 4-bit quantized
# With llama.cpp (newer builds name this binary llama-cli instead of main)
./main -m mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf \
  -p "Your prompt" -n 512 -t 8   # -n: max tokens to generate, -t: CPU threads