TensorFlow: Trigger explicit GC to free C-allocated tensor memory #5394

Signed-off-by: Michael Mayer <michael@photoprism.app>
This commit is contained in:
Michael Mayer 2025-12-23 12:06:26 +01:00
parent 898f6bc69b
commit 28eb11d468
11 changed files with 175 additions and 9 deletions

View file

@ -0,0 +1,31 @@
## PhotoPrism — Classification Package
**Last Updated:** December 23, 2025
### Overview
`internal/ai/classify` wraps PhotoPrism's TensorFlow-based image classification (labels). It loads SavedModel classifiers (Nasnet by default), prepares inputs, runs inference, and maps output probabilities to label rules.
### How It Works
- **Model Loading** — The classifier loads a SavedModel under `assets/models/<name>` and resolves model tags and input/output ops (see `vision.yml` overrides for custom models).
- **Input Preparation** — JPEGs are decoded and resized/cropped to the model's expected input resolution.
- **Inference** — The model outputs probabilities; `Rules` apply thresholds and priority to produce final labels.
### Memory & Performance
TensorFlow tensors allocate C memory and are freed by Go GC finalizers. To keep RSS bounded during long runs, PhotoPrism periodically triggers garbage collection to return freed tensor memory to the OS. Tune with:
- `PHOTOPRISM_TF_GC_EVERY` (default **200**, `0` disables).
Lower values reduce peak RSS but increase GC overhead and can slow indexing.
### Troubleshooting Tips
- **Labels are empty:** Verify the model labels file and that `Rules` thresholds are not too strict.
- **Model load failures:** Ensure `saved_model.pb` and `variables/` exist under the configured model path.
- **Unexpected outputs:** Check `TensorFlow.Input/Output` settings in `vision.yml` for custom models.
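For custom models, the `vision.yml` overrides mentioned above might look like the following sketch. The key names and values here are illustrative assumptions, not the authoritative schema; see `internal/ai/vision/README.md` for the actual configuration format.

```yaml
# Hypothetical vision.yml fragment — keys and values are illustrative only.
Models:
  - Name: my-classifier
    TensorFlow:
      Input: "input_1"        # input op name in the SavedModel graph
      Output: "predictions"   # output op name producing probabilities
```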
### Related Docs
- [`internal/ai/vision/README.md`](../vision/README.md) — model registry and `vision.yml` configuration
- [`internal/ai/tensorflow/README.md`](../tensorflow/README.md) — TensorFlow helpers, GC behavior, and model loading

View file

@ -133,6 +133,8 @@ func (m *Model) Run(img []byte, confidenceThreshold int) (result Labels, err err
return nil, loadErr
}
defer tensorflow.MaybeCollectTensorMemory()
// Create input tensor from image.
tensor, err := m.createTensor(img)

View file

@ -1,6 +1,6 @@
## Face Detection and Embedding Guidelines
**Last Updated:** October 10, 2025
**Last Updated:** December 23, 2025
### Overview
@ -46,6 +46,10 @@ Runtime selection lives in `Config.FaceEngine()`; `auto` resolves to ONNX when t
### Embedding Handling
#### Memory Management
FaceNet embeddings are generated through TensorFlow bindings that allocate tensors in C memory. Those allocations are released by Go GC finalizers, so long-running indexing jobs can show steadily rising RSS even when the Go heap stays small. To keep memory bounded during extended face indexing runs, PhotoPrism now triggers periodic garbage collection and returns freed C-allocated tensor buffers to the OS. You can tune this behavior with `PHOTOPRISM_TF_GC_EVERY` (default **200**; set to `0` to disable). Lower values reduce peak RSS but increase GC overhead and can slow indexing, so keep the default unless memory pressure is severe.
#### Normalization
All embeddings, regardless of origin, are normalized to unit length (‖x‖₂=1):

View file

@ -129,6 +129,8 @@ func (m *Model) loadModel() error {
// Run returns the face embeddings for an image.
func (m *Model) Run(img image.Image) Embeddings {
defer tensorflow.MaybeCollectTensorMemory()
// Create input tensor from image.
tensor, err := imageToTensor(img, m.resolution)

View file

@ -0,0 +1,31 @@
## PhotoPrism — NSFW Package
**Last Updated:** December 23, 2025
### Overview
`internal/ai/nsfw` runs the built-in TensorFlow NSFW classifier to score images for drawing, hentai, neutral, porn, and sexy content. It is used during indexing and metadata workflows when the NSFW model is enabled.
### How It Works
- **Model Loading** — Loads the NSFW SavedModel from `assets/models/` and resolves input/output ops (inferred if missing).
- **Input Preparation** — JPEG images are decoded and transformed to the configured input resolution.
- **Inference & Output** — Produces five class probabilities mapped into a `Result` struct for downstream thresholds and UI badges.
### Memory & Performance
TensorFlow tensors allocate C memory and are freed by Go GC finalizers. To keep RSS bounded during long runs, PhotoPrism periodically triggers garbage collection to return freed tensor memory to the OS. Tune with:
- `PHOTOPRISM_TF_GC_EVERY` (default **200**, `0` disables).
Lower values reduce peak RSS but increase GC overhead and can slow indexing.
### Troubleshooting Tips
- **Model fails to load:** Verify `saved_model.pb` and `variables/` exist under the model path.
- **Unexpected scores:** Confirm the input resolution matches the model and that logits are handled correctly.
- **High memory usage:** Adjust `PHOTOPRISM_TF_GC_EVERY` or reduce concurrent indexing load.
### Related Docs
- [`internal/ai/vision/README.md`](../vision/README.md) — model registry and run scheduling
- [`internal/ai/tensorflow/README.md`](../tensorflow/README.md) — TensorFlow helpers, GC behavior, and model loading

View file

@ -75,6 +75,8 @@ func (m *Model) Run(img []byte) (result Result, err error) {
return result, loadErr
}
defer tensorflow.MaybeCollectTensorMemory()
// Create input tensor from image.
input, err := tensorflow.ImageTransform(
img, fs.ImageJpeg, m.meta.Input.Resolution())

View file

@ -0,0 +1,41 @@
## PhotoPrism — TensorFlow Package
**Last Updated:** December 23, 2025
### Overview
`internal/ai/tensorflow` provides the shared TensorFlow helpers used by PhotoPrism's built-in AI features (labels, NSFW, and FaceNet embeddings). It wraps SavedModel loading, input/output discovery, image tensor preparation, and label handling so higher-level packages can focus on domain logic.
### Key Components
- **Model Loading**`SavedModel`, `GetModelTagsInfo`, and `GetInputAndOutputFromSavedModel` discover and load SavedModel graphs with appropriate tags.
- **Input Preparation**`Image`, `ImageTransform`, and `ImageTensorBuilder` convert JPEG images to tensors with the configured resolution, color order, and resize strategy.
- **Output Handling**`AddSoftmax` can insert a softmax op when a model exports logits.
- **Labels**`LoadLabels` loads label lists for classification models.
### Model Loading Notes
- Built-in models live under `assets/models/` and are accessed via helpers in `internal/ai/vision` and `internal/ai/classify`.
- When a model lacks explicit tags or signatures, the helpers attempt to infer the input/output operations; the logs indicate when this inference was used.
- Classification models may emit logits; if `ModelInfo.Output.Logits` is true, a softmax op is injected at load time.
### Memory & Garbage Collection
TensorFlow tensors are allocated in C memory and freed by Go GC finalizers in the TensorFlow bindings. Long-running inference can therefore show increasing RSS even when the Go heap is small. PhotoPrism periodically triggers garbage collection to return freed C-allocated tensor buffers to the OS. Control this behavior with:
- `PHOTOPRISM_TF_GC_EVERY` (default **200**, `0` disables).
Lower values reduce peak RSS but increase GC overhead and can slow indexing.
### Troubleshooting Tips
- **Model fails to load:** Verify the SavedModel path, tags, and that `saved_model.pb` plus `variables/` exist under `assets/models/<name>`.
- **Input/output mismatch:** Check logs for inferred inputs/outputs and confirm `vision.yml` overrides (name, resolution, and `TensorFlow.Input/Output`).
- **Unexpected probabilities:** Ensure logits are handled correctly and labels match output indices.
- **High memory usage:** Confirm `PHOTOPRISM_TF_GC_EVERY` is set appropriately; model weights remain resident for the life of the process by design.
### Related Docs
- [`internal/ai/vision/README.md`](../vision/README.md) — model registry, `vision.yml` configuration, and run scheduling
- [`internal/ai/face/README.md`](../face/README.md) — FaceNet embeddings and face-specific tuning
- [`internal/ai/classify/README.md`](../classify/README.md) — classification workflow using TensorFlow helpers
- [`internal/ai/nsfw/README.md`](../nsfw/README.md) — NSFW model usage and result mapping

View file

@ -0,0 +1,43 @@
package tensorflow
import (
"os"
"runtime/debug"
"strconv"
"strings"
"sync/atomic"
)
const gcEveryDefault uint64 = 200
var (
gcEvery = gcEveryDefault
gcCounter uint64
)
func init() {
if v := strings.TrimSpace(os.Getenv("PHOTOPRISM_TF_GC_EVERY")); v != "" {
if strings.HasPrefix(v, "-") {
gcEvery = 0
return
}
if n, err := strconv.ParseUint(v, 10, 64); err == nil {
gcEvery = n
}
}
}
// MaybeCollectTensorMemory triggers a garbage collection and returns freed
// C-allocated tensor memory to the OS once every gcEvery calls; a gcEvery
// of 0 (set via PHOTOPRISM_TF_GC_EVERY) disables collection entirely.
func MaybeCollectTensorMemory() {
if gcEvery == 0 {
return
}
if atomic.AddUint64(&gcCounter, 1)%gcEvery != 0 {
return
}
debug.FreeOSMemory()
}

View file

@ -1,12 +1,12 @@
## PhotoPrism — Vision Package
**Last Updated:** December 10, 2025
**Last Updated:** December 23, 2025
### Overview
`internal/ai/vision` provides the shared model registry, request builders, and parsers that power PhotoPrism's caption, label, face, NSFW, and future generate workflows. It reads `vision.yml`, normalizes models, and dispatches calls to one of three engines:
- **TensorFlow (builtin)** — default Nasnet / NSFW / Facenet models, no remote service required.
- **TensorFlow (builtin)** — default Nasnet / NSFW / Facenet models, no remote service required. Long-running TensorFlow inference can accumulate C-allocated tensor memory until GC finalizers run, so PhotoPrism periodically triggers garbage collection to return that memory to the OS; tune with `PHOTOPRISM_TF_GC_EVERY` (default **200**, `0` disables). Lower values reduce peak RSS but increase GC overhead and can slow indexing, so keep the default unless memory pressure is severe.
- **Ollama** — local or proxied multimodal LLMs. See [`ollama/README.md`](ollama/README.md) for tuning and schema details. The engine defaults to `${OLLAMA_BASE_URL:-http://ollama:11434}/api/generate`, trimming any trailing slash on the base URL; set `OLLAMA_BASE_URL=https://ollama.com` to opt into cloud defaults.
- **OpenAI** — cloud Responses API. See [`openai/README.md`](openai/README.md) for prompts, schema variants, and header requirements.
@ -199,6 +199,10 @@ Models:
- **Ollama**: private, GPU/CPU-hosted multimodal LLMs; best for richer captions/labels without cloud traffic.
- **OpenAI**: highest quality reasoning and multimodal support; requires API key and network access.
### Model Unload on Idle
PhotoPrism currently keeps TensorFlow models resident for the lifetime of the process to avoid repeated load costs. A future “model unload on idle” mode would track last-use timestamps and close the TensorFlow session/graph after a configurable idle period, releasing the model's memory footprint back to the OS. The trade-off is higher latency and CPU overhead when a model is used again, plus extra I/O to reload weights. This may be attractive for low-frequency or memory-constrained deployments but would slow continuous indexing jobs, so it is not enabled today.
### Related Docs
- Ollama specifics: [`internal/ai/vision/ollama/README.md`](ollama/README.md)

View file

@ -17,10 +17,13 @@ func NewApiRequestOllama(images Files, fileScheme scheme.Type) (*ApiRequest, err
for i := range images {
switch fileScheme {
case scheme.Data, scheme.Base64:
file, err := os.Open(images[i])
if err != nil {
return nil, fmt.Errorf("%s (create data url)", err)
}
imagesData[i] = media.DataBase64(file)
if err := file.Close(); err != nil {
return nil, fmt.Errorf("%s (close data url)", err)
}
default:
return nil, fmt.Errorf("unsupported file scheme %s", clean.Log(fileScheme))

View file

@ -132,10 +132,13 @@ func NewApiRequestImages(images Files, fileScheme scheme.Type) (*ApiRequest, err
imageUrls[i] = fmt.Sprintf("%s/%s", DownloadUrl, fileUuid)
}
case scheme.Data:
file, err := os.Open(images[i])
if err != nil {
return nil, fmt.Errorf("%s (create data url)", err)
}
imageUrls[i] = media.DataUrl(file)
if err := file.Close(); err != nil {
return nil, fmt.Errorf("%s (close data url)", err)
}
default:
return nil, fmt.Errorf("unsupported file scheme %s", clean.Log(fileScheme))