How to Run Whisper Speech-to-Text on Your Breeze
Whisper is an open-source automatic speech recognition (ASR) model that can transcribe audio in dozens of languages with remarkable accuracy. Running Whisper on your own Breeze keeps your audio data private and eliminates per-minute transcription costs.
Prerequisites
- A Breeze instance with at least 4 GB of RAM (8 GB recommended for larger models)
- Python 3.9 or later
- FFmpeg installed for audio processing
Installing FFmpeg and Dependencies
FFmpeg is required for handling various audio formats:
sudo apt update
sudo apt install -y ffmpeg python3 python3-pip python3-venv
Installing Whisper
Create a virtual environment and install the Whisper package:
python3 -m venv ~/whisper-env
source ~/whisper-env/bin/activate
pip install openai-whisper
For GPU acceleration on NVIDIA hardware, install a CUDA-enabled build of PyTorch (pytorch.org lists the matching pip command for your CUDA version). Whisper runs on PyTorch and will use the GPU automatically when one is available; there is no separate GPU variant of the openai-whisper package.
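To confirm which device Whisper will actually run on, you can ask PyTorch directly. A minimal check using standard PyTorch APIs that degrades gracefully if PyTorch is not installed:

```python
def device_status() -> str:
    """Return which device Whisper would run on: "cuda", "cpu",
    or "pytorch-missing" if PyTorch is not installed."""
    try:
        import torch
    except ImportError:
        return "pytorch-missing"
    return "cuda" if torch.cuda.is_available() else "cpu"

print(device_status())
```

If this prints "cpu" on a machine with an NVIDIA GPU, the installed PyTorch build was likely compiled without CUDA support.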
Transcribing Audio Files
Use the command-line interface to transcribe an audio file:
whisper audio.mp3 --model medium --language en --output_format txt
Available model sizes and their approximate RAM requirements:
- tiny — ~1 GB RAM, fastest but least accurate
- base — ~1 GB RAM, good for clear speech
- small — ~2 GB RAM, solid accuracy
- medium — ~5 GB RAM, recommended balance
- large — ~10 GB RAM, best accuracy
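The table above can be folded into a small helper that picks the largest model fitting your instance's free memory. The thresholds mirror the approximate figures listed; the helper itself is illustrative, not part of Whisper:

```python
# Approximate RAM requirements (GB) from the table above, largest first
MODEL_RAM_GB = [("large", 10), ("medium", 5), ("small", 2), ("base", 1), ("tiny", 1)]

def pick_model(available_gb: float) -> str:
    """Return the largest Whisper model that fits in the given RAM."""
    for name, needed in MODEL_RAM_GB:
        if available_gb >= needed:
            return name
    return "tiny"  # fall back to the smallest model

print(pick_model(8))  # prints "medium" on an 8 GB instance
```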
Using Whisper in Python
For programmatic use, import Whisper directly:
import whisper
model = whisper.load_model("medium")
result = model.transcribe("meeting_recording.mp3")
print(result["text"])
The result dictionary also contains segments with timestamps for each phrase, useful for generating subtitles.
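Those segments make subtitle generation straightforward. As an illustration, here is a stdlib-only sketch that renders segments (dicts with start, end, and text keys, matching what model.transcribe returns) as SRT:

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the HH:MM:SS,mmm timestamp SRT expects."""
    ms = int(round(seconds * 1000))
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1_000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def segments_to_srt(segments) -> str:
    """Render Whisper-style segments as an SRT subtitle document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n{srt_timestamp(seg['start'])} --> {srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)

demo = [{"start": 0.0, "end": 2.5, "text": " Hello there."}]
print(segments_to_srt(demo))
```

In practice the CLI's --output_format srt does this for you; the sketch is useful when you need custom formatting on top of the Python API.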
Building a Transcription API
Wrap Whisper in a FastAPI application for on-demand transcription:
from fastapi import FastAPI, UploadFile
import whisper, tempfile, os

app = FastAPI()
model = whisper.load_model("medium")  # loaded once at startup, shared across requests

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    # Write the upload to a temporary file so FFmpeg can read it from disk
    # (the suffix is cosmetic; FFmpeg detects the actual format from the content)
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as tmp:
        tmp.write(await file.read())
        tmp_path = tmp.name
    try:
        result = model.transcribe(tmp_path)
    finally:
        os.unlink(tmp_path)  # clean up even if transcription fails
    return {"text": result["text"], "segments": result["segments"]}

Serve the app with an ASGI server, e.g. uvicorn main:app --host 0.0.0.0 --port 8000.
Batch Processing Multiple Files
Process an entire directory of audio files:
#!/bin/bash
for file in /data/audio/*.mp3; do
    echo "Transcribing: $file"
    whisper "$file" --model medium --output_dir /data/transcripts/ --output_format srt
done
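On large directories it helps to make the loop resumable, skipping files that already have a transcript. A sketch of that idea (the needs_transcription helper is our own, not part of Whisper):

```shell
#!/bin/bash
# Succeeds when the audio file ($1) has no matching .srt in the transcript dir ($2)
needs_transcription() {
    local base
    base=$(basename "$1" .mp3)
    [ ! -f "$2/${base}.srt" ]
}

for file in /data/audio/*.mp3; do
    [ -e "$file" ] || continue   # glob matched nothing
    if needs_transcription "$file" /data/transcripts; then
        whisper "$file" --model medium --output_dir /data/transcripts/ --output_format srt
    else
        echo "Skipping (already transcribed): $file"
    fi
done
```

Re-running the script after an interruption then only processes files that still lack transcripts.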
Using Faster-Whisper for Better Performance
The faster-whisper library uses CTranslate2 for significantly faster inference with lower memory usage:
pip install faster-whisper
from faster_whisper import WhisperModel

model = WhisperModel("medium", compute_type="int8")
segments, info = model.transcribe("audio.mp3")
for segment in segments:
    print(f"[{segment.start:.2f}s - {segment.end:.2f}s] {segment.text}")
The faster-whisper project reports speedups of up to around four times over the reference implementation at the same accuracy, with int8 quantization reducing memory use further, making it a good fit for Breeze instances without a GPU.
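Because faster-whisper yields fine-grained segments, a common post-processing step is merging adjacent ones into longer subtitle cues. A stdlib-only sketch, treating each segment as a (start, end, text) tuple matching the attributes printed above (the gap and length thresholds are arbitrary choices for illustration):

```python
def merge_segments(segments, max_gap=0.5, max_len=6.0):
    """Merge consecutive (start, end, text) segments into longer cues,
    joining neighbors separated by at most max_gap seconds as long as
    the merged cue stays under max_len seconds."""
    cues = []
    for start, end, text in segments:
        if cues:
            c_start, c_end, c_text = cues[-1]
            if start - c_end <= max_gap and end - c_start <= max_len:
                cues[-1] = (c_start, end, c_text + " " + text.strip())
                continue
        cues.append((start, end, text.strip()))
    return cues

demo = [(0.0, 1.2, "Hello"), (1.4, 2.0, "there."), (9.0, 10.0, "New topic.")]
print(merge_segments(demo))  # the first two segments merge; the third stays separate
```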