How to Deploy TensorFlow Serving for Model Inference
TensorFlow Serving is a production-grade serving system for deploying machine learning models. It provides a high-performance gRPC and REST interface for model inference, making it ideal for serving TensorFlow and Keras models at scale on your Breeze.
Prerequisites
- A Breeze instance with at least 4 GB of RAM
- Docker installed (recommended method)
- A trained TensorFlow or Keras model saved in SavedModel format
Preparing Your Model
Export your TensorFlow model in the SavedModel format with proper versioning:
import tensorflow as tf
model = tf.keras.models.load_model("my_model.h5")
export_path = "/models/my_model/1"
tf.saved_model.save(model, export_path)
print(f"Model exported to {export_path}")
The version number directory (/1) matters: TensorFlow Serving treats each numeric subdirectory under the model's base path as a model version and by default serves the highest-numbered one, reloading automatically when a new version directory appears.
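Because serving always picks the highest-numbered directory, an export script only needs to compute the next version number. A minimal sketch of that logic (next_version_path is a hypothetical helper, not part of TensorFlow):

```python
import os

def next_version_path(base_dir):
    """Return the directory for the next model version under base_dir.

    TensorFlow Serving treats each numeric subdirectory as a version
    and serves the highest one, so exporting to max + 1 triggers a
    zero-downtime reload.
    """
    existing = []
    if os.path.isdir(base_dir):
        existing = [int(d) for d in os.listdir(base_dir) if d.isdigit()]
    return os.path.join(base_dir, str(max(existing, default=0) + 1))
```

Passing the result to tf.saved_model.save in place of the hard-coded "/models/my_model/1" makes repeated exports pick up consecutive version numbers.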
Installing TensorFlow Serving with Docker
Pull the official TensorFlow Serving Docker image:
docker pull tensorflow/serving:latest
Start the serving container with your model mounted:
docker run -d --name tf-serving \
  -p 8501:8501 -p 8500:8500 \
  -v /models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
Port 8501 serves the REST API and port 8500 serves the gRPC API.
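Before sending traffic, you can confirm the model loaded by querying the REST status endpoint, GET /v1/models/&lt;name&gt;. A minimal standard-library sketch (the helper names are my own, not a TensorFlow API):

```python
import json
import urllib.request

def model_status_url(model, host="localhost", port=8501):
    """REST endpoint that reports a model's load state."""
    return f"http://{host}:{port}/v1/models/{model}"

def model_state(model, host="localhost", port=8501):
    """Return the state ('AVAILABLE' once loaded) of the newest version."""
    with urllib.request.urlopen(model_status_url(model, host, port)) as resp:
        status = json.loads(resp.read())
    # model_version_status is a list with one entry per loaded version
    return status["model_version_status"][0]["state"]
```

A state of AVAILABLE means the model is ready to serve predictions.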
Making Prediction Requests
Send a prediction request to the REST endpoint:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
The response contains the model’s predictions in JSON format.
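The same request can be made from Python with only the standard library; a minimal sketch (predict and build_predict_body are illustrative helper names):

```python
import json
import urllib.request

def build_predict_body(instances):
    """Serialize a list of input rows into the REST :predict payload."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances, model="my_model", host="localhost", port=8501):
    """POST to the :predict endpoint and return the predictions list."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    req = urllib.request.Request(
        url,
        data=build_predict_body(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]
```

Calling predict([[1.0, 2.0, 3.0, 4.0]]) mirrors the curl example above.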
Serving Multiple Models
Create a model configuration file at /models/models.config:
model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_platform: "tensorflow"
  }
  config {
    name: "regressor"
    base_path: "/models/regressor"
    model_platform: "tensorflow"
  }
}
Start TensorFlow Serving with the config file:
docker run -d --name tf-serving \
  -p 8501:8501 \
  -v /models:/models \
  tensorflow/serving \
  --model_config_file=/models/models.config
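If you serve many models, the config file can be generated rather than hand-written. A minimal sketch that renders the same proto-text format (make_model_config is a hypothetical helper):

```python
def make_model_config(models):
    """Render a model_config_list block from a {name: base_path} mapping.

    Output matches TensorFlow Serving's proto text format; every model
    here is assumed to use the "tensorflow" platform.
    """
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"
```

Writing the result to /models/models.config reproduces the file shown above.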
Monitoring with Prometheus
TensorFlow Serving can expose Prometheus metrics at /monitoring/prometheus/metrics, but only when started with a monitoring config. Create a file such as /models/monitoring.config containing:
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
and pass --monitoring_config_file=/models/monitoring.config when starting the container. Then configure Prometheus to scrape the endpoint:
scrape_configs:
  - job_name: 'tf-serving'
    metrics_path: '/monitoring/prometheus/metrics'
    static_configs:
      - targets: ['localhost:8501']
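To spot-check the endpoint without a full Prometheus deployment, you can parse the text exposition format by hand. A simplified sketch that handles plain "name value" lines and skips comments (parse_metrics is illustrative; it ignores timestamps and histogram structure):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Only simple "name value" samples are handled; comment lines
    (starting with '#') and unparseable lines are skipped.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # not a plain sample line; ignore
    return out
```

Fetching the endpoint with curl and piping through a script like this is a quick way to confirm metrics are being recorded.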
Enabling Model Batching
For higher throughput, enable request batching by creating a batching_parameters.txt file:
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
Pass it to TensorFlow Serving with --enable_batching=true --batching_parameters_file=/models/batching_parameters.txt. This groups incoming requests together and processes them in parallel for better GPU utilization on your Breeze.
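Server-side batching groups concurrent requests up to max_batch_size. On the client side, a very large instances list can be pre-chunked so each REST request stays within one server batch; a sketch (chunk_instances is a hypothetical helper, and 32 mirrors the max_batch_size above):

```python
def chunk_instances(instances, max_batch_size=32):
    """Split a list of input rows into payload-sized chunks.

    Each chunk fits within the server's max_batch_size, so no single
    REST request exceeds one batch.
    """
    return [
        instances[i:i + max_batch_size]
        for i in range(0, len(instances), max_batch_size)
    ]
```

Each chunk can then be sent as its own :predict request, letting the server interleave them with other clients' traffic.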