How to Deploy TensorFlow Serving for Model Inference
TensorFlow Serving is a production-grade serving system for deploying machine learning models. It provides a high-performance gRPC and REST interface for model inference, making it ideal for serving TensorFlow and Keras models at scale on your Breeze.
Prerequisites
- A Breeze instance with at least 4 GB of RAM
- Docker installed (recommended method)
- A trained TensorFlow or Keras model saved in SavedModel format
Preparing Your Model
Export your TensorFlow model in the SavedModel format with proper versioning:
import tensorflow as tf
model = tf.keras.models.load_model("my_model.h5")
export_path = "/models/my_model/1"
tf.saved_model.save(model, export_path)
print(f"Model exported to {export_path}")
The version number directory (/1) matters: TensorFlow Serving treats each numeric subdirectory under the model's base path as a model version and by default serves the highest-numbered one, reloading automatically when a new version directory appears.
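Because serving always picks the highest-numbered directory, an export script only needs to compute the next version number. A minimal sketch of that logic (next_version_path is a hypothetical helper, not part of TensorFlow):

```python
import os

def next_version_path(base_dir):
    """Return the directory for the next model version under base_dir.

    TensorFlow Serving treats each numeric subdirectory as a version
    and serves the highest one, so exporting to max + 1 triggers a
    zero-downtime reload.
    """
    existing = []
    if os.path.isdir(base_dir):
        existing = [int(d) for d in os.listdir(base_dir) if d.isdigit()]
    return os.path.join(base_dir, str(max(existing, default=0) + 1))
```

Passing the result to tf.saved_model.save in place of the hard-coded "/models/my_model/1" makes repeated exports pick up consecutive version numbers.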
Installing TensorFlow Serving with Docker
Pull the official TensorFlow Serving Docker image:
docker pull tensorflow/serving:latest
Start the serving container with your model mounted:
docker run -d --name tf-serving \
  -p 8501:8501 -p 8500:8500 \
  -v /models/my_model:/models/my_model \
  -e MODEL_NAME=my_model \
  tensorflow/serving
Port 8501 serves the REST API and port 8500 serves the gRPC API.
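Before sending traffic, you can confirm the model loaded by querying the REST status endpoint, GET /v1/models/&lt;name&gt;. A minimal standard-library sketch (the helper names are my own, not a TensorFlow API):

```python
import json
import urllib.request

def model_status_url(model, host="localhost", port=8501):
    """REST endpoint that reports a model's load state."""
    return f"http://{host}:{port}/v1/models/{model}"

def model_state(model, host="localhost", port=8501):
    """Return the state ('AVAILABLE' once loaded) of the newest version."""
    with urllib.request.urlopen(model_status_url(model, host, port)) as resp:
        status = json.loads(resp.read())
    # model_version_status is a list with one entry per loaded version
    return status["model_version_status"][0]["state"]
```

A state of AVAILABLE means the model is ready to serve predictions.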
Making Prediction Requests
Send a prediction request to the REST endpoint:
curl -X POST http://localhost:8501/v1/models/my_model:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[1.0, 2.0, 3.0, 4.0]]}'
The response contains the model’s predictions in JSON format.
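The same request can be made from Python with only the standard library; a minimal sketch (predict and build_predict_body are illustrative helper names):

```python
import json
import urllib.request

def build_predict_body(instances):
    """Serialize a list of input rows into the REST :predict payload."""
    return json.dumps({"instances": instances}).encode("utf-8")

def predict(instances, model="my_model", host="localhost", port=8501):
    """POST to the :predict endpoint and return the predictions list."""
    url = f"http://{host}:{port}/v1/models/{model}:predict"
    req = urllib.request.Request(
        url,
        data=build_predict_body(instances),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["predictions"]
```

Calling predict([[1.0, 2.0, 3.0, 4.0]]) mirrors the curl example above.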
Serving Multiple Models
Create a model configuration file at /models/models.config:
model_config_list {
  config {
    name: "classifier"
    base_path: "/models/classifier"
    model_platform: "tensorflow"
  }
  config {
    name: "regressor"
    base_path: "/models/regressor"
    model_platform: "tensorflow"
  }
}
Start TensorFlow Serving with the config file:
docker run -d --name tf-serving \
  -p 8501:8501 \
  -v /models:/models \
  tensorflow/serving \
  --model_config_file=/models/models.config
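If you serve many models, the config file can be generated rather than hand-written. A minimal sketch that renders the same proto-text format (make_model_config is a hypothetical helper):

```python
def make_model_config(models):
    """Render a model_config_list block from a {name: base_path} mapping.

    Output matches TensorFlow Serving's proto text format; every model
    here is assumed to use the "tensorflow" platform.
    """
    entries = []
    for name, base_path in models.items():
        entries.append(
            "  config {\n"
            f'    name: "{name}"\n'
            f'    base_path: "{base_path}"\n'
            '    model_platform: "tensorflow"\n'
            "  }"
        )
    return "model_config_list {\n" + "\n".join(entries) + "\n}\n"
```

Writing the result to /models/models.config reproduces the file shown above.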
Monitoring with Prometheus
TensorFlow Serving can expose Prometheus metrics at /monitoring/prometheus/metrics, but only when started with a monitoring config. Create a file such as /models/monitoring.config containing:
prometheus_config {
  enable: true
  path: "/monitoring/prometheus/metrics"
}
and pass --monitoring_config_file=/models/monitoring.config when starting the container. Then configure Prometheus to scrape the endpoint:
scrape_configs:
  - job_name: 'tf-serving'
    metrics_path: '/monitoring/prometheus/metrics'
    static_configs:
      - targets: ['localhost:8501']
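To spot-check the endpoint without a full Prometheus deployment, you can parse the text exposition format by hand. A simplified sketch that handles plain "name value" lines and skips comments (parse_metrics is illustrative; it ignores timestamps and histogram structure):

```python
def parse_metrics(text):
    """Parse Prometheus text exposition into {metric_name: value}.

    Only simple "name value" samples are handled; comment lines
    (starting with '#') and unparseable lines are skipped.
    """
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            pass  # not a plain sample line; ignore
    return out
```

Fetching the endpoint with curl and piping through a script like this is a quick way to confirm metrics are being recorded.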
Enabling Model Batching
For higher throughput, enable request batching by creating a batching_parameters.txt file:
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
max_enqueued_batches { value: 100 }
num_batch_threads { value: 4 }
Pass it to TensorFlow Serving with --enable_batching=true --batching_parameters_file=/models/batching_parameters.txt. This groups incoming requests together and processes them in parallel for better GPU utilization on your Breeze.
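Server-side batching groups concurrent requests up to max_batch_size. On the client side, a very large instances list can be pre-chunked so each REST request stays within one server batch; a sketch (chunk_instances is a hypothetical helper, and 32 mirrors the max_batch_size above):

```python
def chunk_instances(instances, max_batch_size=32):
    """Split a list of input rows into payload-sized chunks.

    Each chunk fits within the server's max_batch_size, so no single
    REST request exceeds one batch.
    """
    return [
        instances[i:i + max_batch_size]
        for i in range(0, len(instances), max_batch_size)
    ]
```

Each chunk can then be sent as its own :predict request, letting the server interleave them with other clients' traffic.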