How to Deploy a Hugging Face Model on Your Breeze
Hugging Face hosts hundreds of thousands of pre-trained models for natural language processing, computer vision, and audio tasks. Deploying one on your own Breeze gives you a private, scalable inference endpoint with no per-request API fees and no data leaving your infrastructure.
Prerequisites
- A Breeze instance with at least 4 GB of RAM (more for larger models)
- Python 3.9 or later
- A Hugging Face account and access token (for gated models)
Installing the Transformers Library
python3 -m venv ~/hf-deploy
source ~/hf-deploy/bin/activate
pip install transformers torch accelerate sentencepiece fastapi uvicorn
Loading and Testing a Model
The pipeline API is the simplest way to load and run a model:
from transformers import pipeline
# Text classification (pin the default model explicitly to avoid the
# "no model was supplied" warning)
classifier = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
result = classifier("I love running AI models on my own server!")
print(result)
# Text generation (max_new_tokens counts only generated tokens,
# unlike the deprecated max_length, which includes the prompt)
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
response = generator("What is the meaning of life?", max_new_tokens=100)
print(response[0]["generated_text"])
Deploying with Text Generation Inference (TGI)
For production deployments, use Hugging Face’s Text Generation Inference (TGI) server. The Mistral model below is gated, so pass a Hugging Face token, and a GPU is strongly recommended for a 7B model:
docker run -d --name tgi \
--gpus all \
--shm-size 1g \
-e HF_TOKEN=$HF_TOKEN \
-p 8080:80 \
-v /data/hf-models:/data \
ghcr.io/huggingface/text-generation-inference:latest \
--model-id mistralai/Mistral-7B-Instruct-v0.2 \
--max-input-length 4096 \
--max-total-tokens 8192
TGI provides optimized inference with continuous batching, quantization support, and an OpenAI-compatible API.
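As a sketch of what a client request looks like, the JSON body below follows the shape of TGI's /generate endpoint; the prompt and parameter values are illustrative, and the host and port match the docker run mapping above:

```python
import json

# Request body in the shape TGI's /generate endpoint expects;
# the prompt and parameter values are only examples.
payload = {
    "inputs": "Write one sentence about self-hosting.",
    "parameters": {"max_new_tokens": 64, "temperature": 0.7},
}
body = json.dumps(payload)

# Send it with any HTTP client once the container is up, e.g.:
#   curl http://localhost:8080/generate \
#       -H 'Content-Type: application/json' \
#       -d "$body"
print(body)
```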
Building a Custom API with FastAPI
Wrap any Hugging Face model in a production API:
from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel
app = FastAPI(title="HF Model API")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
class TextInput(BaseModel):
    text: str
    max_length: int = 150

@app.post("/summarize")
async def summarize(input: TextInput):
    result = summarizer(input.text, max_length=input.max_length, min_length=30)
    return {"summary": result[0]["summary_text"]}

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "bart-large-cnn"}
Run with uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2. Note that each worker process loads its own copy of the model, so two workers roughly double memory usage.
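To sanity-check the request/response contract of the endpoint above without a running server, the JSON shapes look like this (the summary string is a stand-in, not real model output):

```python
import json

# Request body for the /summarize endpoint defined above;
# the text and max_length values are only examples.
request_body = json.dumps({"text": "Long article text goes here.", "max_length": 80})

# Shape of the JSON the endpoint returns (sample value, not real model output)
sample_response = '{"summary": "A short summary."}'
summary = json.loads(sample_response)["summary"]
print(summary)
```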
Using Specific Model Types
Hugging Face supports many task types. Here are common examples:
- Named Entity Recognition: pipeline("ner", model="dslim/bert-base-NER")
- Question Answering: pipeline("question-answering", model="deepset/roberta-base-squad2")
- Translation: pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
- Image Classification: pipeline("image-classification", model="google/vit-base-patch16-224")
- Speech Recognition: pipeline("automatic-speech-recognition", model="openai/whisper-small")
Optimizing for Production
Reduce memory usage and improve speed with these techniques:
# Use half-precision (FP16) to halve memory usage
import torch
model = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2", torch_dtype=torch.float16)
# Enable BetterTransformer for faster inference (requires: pip install optimum)
from optimum.bettertransformer import BetterTransformer
model.model = BetterTransformer.transform(model.model)
# Use ONNX Runtime for CPU inference (requires: pip install optimum[onnxruntime])
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
Setting Up Model Caching
Configure the Hugging Face cache directory to a persistent volume on your Breeze:
export HF_HOME=/data/huggingface
Setting HF_HOME relocates the hub cache as well (the older TRANSFORMERS_CACHE variable is deprecated in recent transformers releases). Add this to your systemd service environment so models stay cached across restarts instead of being re-downloaded.
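As a sketch, a systemd drop-in carrying the cache location might look like the following; the unit name and ExecStart path are hypothetical, so adjust them to your service:

```ini
# /etc/systemd/system/hf-api.service (hypothetical unit name and paths)
[Service]
Environment=HF_HOME=/data/huggingface
ExecStart=/root/hf-deploy/bin/uvicorn main:app --host 0.0.0.0 --port 8000
```

After editing, reload systemd with systemctl daemon-reload and restart the service.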