How to Deploy a Hugging Face Model on Your Breeze

By Admin · Mar 2, 2026 · Updated Apr 24, 2026 · 3 min read

Hugging Face hosts thousands of pre-trained models for natural language processing, computer vision, and audio tasks. Deploying these models on your own Breeze gives you a private, scalable inference endpoint without per-request API fees or data leaving your infrastructure.

Prerequisites

  • A Breeze instance with at least 4 GB of RAM (more for larger models)
  • Python 3.9 or later
  • A Hugging Face account and access token (for gated models)

Installing the Transformers Library

python3 -m venv ~/hf-deploy
source ~/hf-deploy/bin/activate
pip install transformers torch accelerate sentencepiece fastapi uvicorn

Loading and Testing a Model

The pipeline API is the simplest way to load and run a model:

from transformers import pipeline

# Text classification
classifier = pipeline("sentiment-analysis")
result = classifier("I love running AI models on my own server!")
print(result)

# Text generation (max_new_tokens counts only generated tokens,
# unlike max_length, which also includes the prompt)
generator = pipeline("text-generation", model="microsoft/DialoGPT-medium")
response = generator("What is the meaning of life?", max_new_tokens=100)
print(response[0]["generated_text"])

Deploying with Text Generation Inference (TGI)

For production deployments, use Hugging Face’s Text Generation Inference server:

docker run -d --name tgi \
  --gpus all \
  --shm-size 1g \
  -p 8080:80 \
  -v /data/hf-models:/data \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id mistralai/Mistral-7B-Instruct-v0.2 \
  --max-input-length 4096 \
  --max-total-tokens 8192

TGI provides optimized inference with continuous batching, quantization support, and an OpenAI-compatible API.
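
Once the container is up, you can call TGI's native /generate endpoint over HTTP. A minimal stdlib-only client sketch, assuming the server is reachable at localhost:8080 as mapped above (the helper name build_generate_request and the parameter values are illustrative):

```python
import json
from urllib import request

def build_generate_request(prompt, max_new_tokens=200,
                           url="http://localhost:8080/generate"):
    """Build an HTTP POST request for TGI's /generate endpoint."""
    payload = {
        "inputs": prompt,
        "parameters": {"max_new_tokens": max_new_tokens, "temperature": 0.7},
    }
    return request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = build_generate_request("Explain containers in one sentence.")
# with request.urlopen(req) as resp:           # uncomment once TGI is running
#     print(json.loads(resp.read())["generated_text"])
```

The response body is JSON with a generated_text field; swap the URL for /v1/chat/completions if you prefer the OpenAI-compatible route.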

Building a Custom API with FastAPI

Wrap any Hugging Face model in a production API:

from fastapi import FastAPI
from transformers import pipeline
from pydantic import BaseModel

app = FastAPI(title="HF Model API")
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

class TextInput(BaseModel):
    text: str
    max_length: int = 150

@app.post("/summarize")
async def summarize(input: TextInput):
    result = summarizer(input.text, max_length=input.max_length, min_length=30)
    return {"summary": result[0]["summary_text"]}

@app.get("/health")
async def health():
    return {"status": "healthy", "model": "bart-large-cnn"}

Save the file as main.py, then start the server:

uvicorn main:app --host 0.0.0.0 --port 8000 --workers 2

Note that each uvicorn worker loads its own copy of the model, so memory usage scales with the worker count.
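
A client then POSTs JSON matching the TextInput schema. A stdlib-only sketch, assuming the API is running on localhost:8000 as above (the helper name summarize_request is illustrative):

```python
import json
from urllib import request

API_URL = "http://localhost:8000"  # assumption: uvicorn running as shown above

def summarize_request(text, max_length=150):
    """Build a POST request matching the TextInput schema of /summarize."""
    body = json.dumps({"text": text, "max_length": max_length}).encode("utf-8")
    return request.Request(
        f"{API_URL}/summarize",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = summarize_request("Long article text to condense goes here.")
# with request.urlopen(req) as resp:          # uncomment with the server running
#     print(json.loads(resp.read())["summary"])
```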

Using Specific Model Types

Hugging Face supports many task types. Here are common examples:

  • Named Entity Recognition: pipeline("ner", model="dslim/bert-base-NER")
  • Question Answering: pipeline("question-answering", model="deepset/roberta-base-squad2")
  • Translation: pipeline("translation_en_to_fr", model="Helsinki-NLP/opus-mt-en-fr")
  • Image Classification: pipeline("image-classification", model="google/vit-base-patch16-224")
  • Speech Recognition: pipeline("automatic-speech-recognition", model="openai/whisper-small")
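
Every entry above follows the same task-plus-model pattern, so if you serve several tasks from one deployment it can be convenient to keep them in a lookup table. A small sketch (the table name TASK_MODELS and short keys are illustrative; model IDs are taken from the list above):

```python
# Map short names to (pipeline task, model id) pairs -- IDs from the list above.
TASK_MODELS = {
    "ner": ("ner", "dslim/bert-base-NER"),
    "qa": ("question-answering", "deepset/roberta-base-squad2"),
    "translate": ("translation_en_to_fr", "Helsinki-NLP/opus-mt-en-fr"),
    "image": ("image-classification", "google/vit-base-patch16-224"),
    "asr": ("automatic-speech-recognition", "openai/whisper-small"),
}

def pipeline_args(name):
    """Return the kwargs to pass to transformers.pipeline() for a short name."""
    task, model_id = TASK_MODELS[name]
    return {"task": task, "model": model_id}

# e.g. pipeline(**pipeline_args("qa")) builds the question-answering pipeline
```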

Optimizing for Production

Reduce memory usage and improve speed with these techniques:

# Use half-precision (FP16) to halve memory usage
model = pipeline("text-generation", model="mistralai/Mistral-7B-Instruct-v0.2", torch_dtype="float16")

# Enable BetterTransformer for faster inference (requires: pip install optimum)
from optimum.bettertransformer import BetterTransformer
model.model = BetterTransformer.transform(model.model)

# Use ONNX Runtime for CPU inference (requires: pip install optimum[onnxruntime])
from optimum.onnxruntime import ORTModelForSequenceClassification
model = ORTModelForSequenceClassification.from_pretrained("distilbert-base-uncased-finetuned-sst-2-english", export=True)
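
As a back-of-the-envelope check of why FP16 halves memory: weight memory is simply parameters times bytes per parameter, so a 7B-parameter model needs roughly 26 GiB of weights in FP32 but only ~13 GiB in FP16 (activations and the KV cache come on top of this):

```python
def weight_memory_gib(n_params, bytes_per_param):
    """Approximate model weight memory in GiB."""
    return n_params * bytes_per_param / (1024 ** 3)

params_7b = 7_000_000_000
fp32 = weight_memory_gib(params_7b, 4)   # float32: 4 bytes per parameter
fp16 = weight_memory_gib(params_7b, 2)   # float16: 2 bytes per parameter
print(f"FP32: {fp32:.1f} GiB, FP16: {fp16:.1f} GiB")
# prints: FP32: 26.1 GiB, FP16: 13.0 GiB
```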

Setting Up Model Caching

Configure the Hugging Face cache directory to a persistent volume on your Breeze:

export HF_HOME=/data/huggingface
# TRANSFORMERS_CACHE is deprecated in recent transformers releases (HF_HOME alone
# is enough there); keep it for compatibility with older versions:
export TRANSFORMERS_CACHE=/data/huggingface/hub

Add these to your systemd service environment to ensure models are cached across restarts and not re-downloaded.
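
A systemd unit for the FastAPI server from earlier might look like the following sketch; the unit name hf-api.service, the WorkingDirectory, and the venv path are illustrative and should be adjusted to your layout:

```ini
# /etc/systemd/system/hf-api.service -- illustrative; adjust paths to your setup
[Unit]
Description=Hugging Face model API
After=network.target

[Service]
Environment=HF_HOME=/data/huggingface
Environment=TRANSFORMERS_CACHE=/data/huggingface/hub
WorkingDirectory=/opt/hf-api
ExecStart=/opt/hf-api/hf-deploy/bin/uvicorn main:app --host 0.0.0.0 --port 8000
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Reload systemd (systemctl daemon-reload) and enable the service so it starts on boot.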
