PhotoPrism — Ollama Engine Integration

Last Updated: December 10, 2025

Overview

This package provides PhotoPrism's native adapter for Ollama-compatible multimodal models. It lets Caption, Labels, and future Generate workflows call locally hosted models without changing worker logic, reusing the shared API client (internal/ai/vision/api_client.go) and result types (LabelResult, CaptionResult). Requests stay inside your infrastructure, rely on base64 thumbnails, and honor the same ACL, timeout, and logging hooks as the default TensorFlow engines. The adapter resolves ${OLLAMA_BASE_URL}/api/generate, trimming trailing slashes and defaulting to http://ollama:11434; set OLLAMA_BASE_URL=https://ollama.com to opt into cloud defaults.
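
As a rough sketch of that resolution logic (hypothetical helper name, not the actual code in this package), the behavior amounts to:

package main

import (
	"fmt"
	"os"
	"strings"
)

// generateEndpoint mirrors the documented rules: trim trailing slashes
// from OLLAMA_BASE_URL and fall back to the local default when unset.
func generateEndpoint() string {
	base := strings.TrimRight(os.Getenv("OLLAMA_BASE_URL"), "/")
	if base == "" {
		base = "http://ollama:11434" // documented default
	}
	return base + "/api/generate"
}

func main() {
	fmt.Println(generateEndpoint()) // e.g. http://ollama:11434/api/generate
}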

Context & Constraints

  • Engine defaults live in internal/ai/vision/ollama and are applied whenever a model sets Engine: ollama. Aliases map to ApiFormatOllama, scheme.Base64, and a default 720px thumbnail. Cloud defaults are only selected when OLLAMA_BASE_URL equals https://ollama.com.
  • Responses may arrive as newline-delimited JSON chunks (see the example after this list). decodeOllamaResponse keeps the most recent chunk, while parseOllamaLabels re-parses plain JSON strings found in the response field.
  • Structured JSON is optional for captions but enforced for labels when Format: json (default for label models targeting the Ollama engine).
  • The adapter never overwrites TensorFlow defaults. If an Ollama call fails, downstream code still has Nasnet, NSFW, and Face models available.
  • Workers assume a single-image payload per request. Run photoprism vision run to validate multi-image prompts before changing that invariant.
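
For reference, a streaming /api/generate reply arrives as newline-delimited JSON, with the final chunk marked done: true; the payload below is illustrative, not captured output:

{"model":"qwen2.5vl:7b","created_at":"2025-12-10T10:52:26Z","response":"","done":false}
{"model":"qwen2.5vl:7b","created_at":"2025-12-10T10:52:27Z","response":"{\"labels\":[{\"name\":\"cat\",\"confidence\":0.92}]}","done":true}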

Goals

  • Let operators opt into local, private LLMs for captions and labels via vision.yml.
  • Provide safe defaults (prompts, schema, sampling) so most deployments only need to specify Name, Engine, and Service.Uri.
  • Surface reproducible logs, metrics, and CLI commands that make it easy to compare Ollama output against TensorFlow/OpenAI engines.

Non-Goals

  • Managing Ollama itself (model downloads, GPU scheduling, or authentication). Use the Compose profiles provided in the repository.
  • Adding new HTTP endpoints or bypassing the existing photoprism vision CLI.
  • Replacing TensorFlow workers—Ollama engines are additive and opt-in.

Architecture & Request Flow

  1. Model Selection — Config.Model(ModelType) returns the top-most enabled entry. When Engine: ollama, ApplyEngineDefaults() fills in the request/response format, base64 file scheme, and a 720px resolution unless overridden.
  2. Request Build — ollamaBuilder.Build wraps thumbnails with NewApiRequestOllama, which encodes them as base64 strings. Model.GetModel() resolves the exact Ollama tag (gemma3:4b, qwen2.5vl:7b, etc.); see the example payload after this list.
  3. Transport — PerformApiRequest issues a single HTTP POST (default timeout: 10 minutes). Authentication is optional; provide Service.Key if you proxy through an API gateway.
  4. Parsing — ollamaParser.Parse converts payloads into ApiResponse. It normalizes confidences (LabelConfidenceDefault = 0.5 when missing), copies NSFW scores, and canonicalizes label names via normalizeLabelResult.
  5. Persistence — entity.SrcOllama is stamped on labels/captions so UI badges and audits reflect the new source.
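
For orientation, the body posted to /api/generate follows the public Ollama API and has roughly this shape (all values illustrative):

{
  "model": "gemma3:4b",
  "prompt": "Describe this image in one short sentence.",
  "images": ["<base64-encoded 720px thumbnail>"],
  "format": "json",
  "options": { "temperature": 0.1, "top_p": 0.9, "stop": ["\n\n"] }
}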

Prompt, Schema, & Options Guidance

  • System Prompts
    • Labels: LabelSystem enforces single-word nouns. Set System to override; assign LabelSystemSimple when you need descriptive phrases.
    • Captions: no system prompt by default; rely on user prompt or set one explicitly for stylistic needs.
  • User Prompts
    • Captions use CaptionPrompt, which requests one sentence in active voice.
    • Labels default to LabelPromptDefault; when DetectNSFWLabels is true, the adapter swaps in LabelPromptNSFW.
    • For stricter noun enforcement, set Prompt to LabelPromptStrict.
  • Schemas
    • Labels rely on schema.LabelsJson(nsfw) (simple JSON template). Setting Format: json auto-attaches a reminder (model.SchemaInstructions()).
    • Override via Schema (inline YAML) or SchemaFile. PHOTOPRISM_VISION_LABEL_SCHEMA_FILE always wins if present (see the example after this list).
  • Options
    • Labels: default Temperature equals DefaultTemperature (0.1 unless configured), TopP=0.9, Stop=["\n\n"].
    • Captions: only Temperature is set; other parameters inherit global defaults.
    • Custom Options merge with engine defaults. Leave ForceJson=true for labels so PhotoPrism can reject malformed payloads early.
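
Putting those knobs together, a hypothetical label model that overrides the defaults described above might look like this (prompt wording and schema path are placeholders):

Models:
  - Type: labels
    Name: qwen2.5vl:7b
    Engine: ollama
    Format: json
    System: "Respond with single-word nouns only."             # replaces LabelSystem
    Prompt: "List up to ten objects visible in the photo."     # replaces LabelPromptDefault
    SchemaFile: /photoprism/storage/config/label-schema.json   # placeholder path
    Options:
      Temperature: 0.1
      TopP: 0.9
      Stop: ["\n\n"]
      ForceJson: true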

Supported Ollama Vision Models

| Model (Ollama Tag) | Size & Footprint | Strengths | JSON & Language Notes | When To Use |
| --- | --- | --- | --- | --- |
| gemma3:4b / 12b / 27b | 4B/12B/27B parameters, ~3.3GB → 17GB downloads, 128K context | Multimodal text+image reasoning with SigLIP encoder, handles OCR/long documents, supports tool/function calling | Emits structured JSON reliably; >140 languages with strong default English output | High-quality captions + multilingual labels when you have ≥12GB VRAM (4B works on 8GB with Q4_K_M) |
| qwen2.5vl:7b | 8.29B params (Q4_K_M), ≈6GB download, 125K context | Excellent at charts, GUI grounding, DocVQA, multi-image reasoning, agentic tool use | JSON mode tuned for schema compliance; supports 20+ languages with strong Chinese/English parity | Label extraction for mixed-language archives or UI/diagram analysis |
| qwen3-vl:2b / 4b / 8b | Dense 2B/4B/8B tiers (~3GB, ~3.5GB, ~6GB downloads) with native 256K context extendable to 1M; fits a single 12–24GB GPU or high-end CPUs (2B) | Spatial + video reasoning upgrades (Interleaved-MRoPE, DeepStack), 32-language OCR, GUI/agent control, long-document ingest | Emits JSON reliably when prompts specify a schema; multilingual captions/labels, with Thinking variants boosting STEM reasoning | General-purpose captions/labels when you need long-context doc/video support without cloud APIs; 2B for CPU/edge, 4B as the balanced default, 8B when accuracy outweighs latency |
| llama3.2-vision:11b | 11B params, ~7.8GB download, requires ≥8GB VRAM; 90B variant needs ≥64GB | Strong general reasoning, captioning, and OCR; supported by Meta ecosystem tooling | Vision tasks officially supported in English; text-only tasks cover eight major languages | Keeping captions consistent with Meta-compatible prompts, or when teams already standardize on Llama 3.x |
| minicpm-v:8b-2.6 | 8B params, ~5.5GB download, 32K context | Optimized for edge GPUs, high OCR accuracy, multi-image/video support, low token count (≈640 tokens for 1.8MP) | Multilingual (EN/ZH/DE/FR/IT/KR); emits concise JSON but may need stricter stop sequences | Memory-constrained deployments that still require NSFW/OCR-aware label output |

Tip: pull models inside the dev container with docker compose --profile ollama up -d and then docker compose exec ollama ollama pull gemma3:4b. Keep the profile stopped when you do not need extra GPU/CPU load.

Qwen3-VL models stream their JSON payload via the thinking field. PhotoPrism v2025.11+ captures this automatically; if you run older builds, upgrade before enabling these models or responses will appear empty.
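
Illustratively, such a chunk could look as follows (values made up; the label JSON rides in thinking rather than response):

{"model":"qwen3-vl:4b","thinking":"{\"labels\":[{\"name\":\"dog\",\"confidence\":0.88}]}","response":"","done":true}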

Configuration

Environment Variables

  • PHOTOPRISM_VISION_LABEL_SCHEMA_FILE — Absolute path to a JSON snippet that overrides the default label schema (applies to every Ollama label model).
  • PHOTOPRISM_VISION_YAML — Custom vision.yml path. Keep it synced in Git if you automate deployments.
  • OLLAMA_HOST, OLLAMA_MODELS, OLLAMA_MAX_QUEUE, OLLAMA_NUM_PARALLEL, etc. — Provided in compose*.yaml to tune the Ollama daemon. Adjust OLLAMA_KEEP_ALIVE if you want models to stay loaded between worker batches (see the Compose sketch after this list).
  • OLLAMA_API_KEY / OLLAMA_API_KEY_FILE — Default bearer token picked up when Service.Key is empty; useful for hosted Ollama services (e.g., Ollama Cloud).
  • OLLAMA_BASE_URL — Base URL for the Ollama API; defaults to http://ollama:11434, trailing slashes are trimmed. Set to https://ollama.com to enable cloud defaults.
  • PHOTOPRISM_LOG_LEVEL=trace — Enables verbose request/response previews (truncated to avoid leaking images). Use temporarily when debugging parsing issues.
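
Wiring-wise, these variables typically land in the Compose environment; a minimal sketch, assuming the service names used elsewhere in this README and a made-up secret path:

services:
  photoprism:
    environment:
      OLLAMA_BASE_URL: "http://ollama:11434"
      OLLAMA_API_KEY_FILE: "/run/secrets/ollama_api_key"  # assumed secret mount
  ollama:
    environment:
      OLLAMA_KEEP_ALIVE: "10m"  # keep models loaded between worker batches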

vision.yml Example

Models:
  - Type: labels
    Name: qwen2.5vl:7b
    Engine: ollama
    Run: newly-indexed
    Resolution: 720
    Format: json
    Options:
      Temperature: 0.05
      Stop: ["\n\n"]
      ForceJson: true
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate
      RequestFormat: ollama
      ResponseFormat: ollama
      FileScheme: base64

  - Type: caption
    Name: gemma3:4b
    Engine: ollama
    Disabled: false
    Options:
      Temperature: 0.2
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate

Guidelines:

  • Place new entries after the default TensorFlow models so they take precedence while Nasnet/NSFW remain as fallbacks.
  • Always specify the exact Ollama tag (model:version) so upgrades are deliberate.
  • Keep option flags before positional arguments in CLI snippets (photoprism vision run -m labels --count 1).
  • If you proxy requests (e.g., through Traefik), set Service.Key to Bearer <token> and configure the proxy to inject/validate it.
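
For that proxy case, a hypothetical service block could look like this (the endpoint URL is a placeholder; the proxy validates the token):

    Service:
      Uri: https://ollama.example.com/api/generate
      Key: "Bearer <token>"
      RequestFormat: ollama
      ResponseFormat: ollama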

Operational Checklist

  • Scheduling — Use Run: newly-indexed for incremental runs, Run: manual for ad-hoc CLI calls, or Run: on-schedule when paired with the scheduler. Leave Run: auto if you want the worker to decide based on other model states.
  • Timeouts & Retries — Default timeout is 10 minutes (ServiceTimeout). Ollama streaming responses usually complete well within that limit; if you need stricter SLAs, wrap photoprism vision run in a job runner and retry failed batches manually.
  • Fallbacks — Keep Nasnet configured even when Ollama labels are primary. labels.go stops at the first successful engine, so duplicates are avoided.
  • Security — When exposing Ollama beyond localhost, terminate TLS at Traefik and enable API keys. Never return full JSON payloads in logs; rely on trace mode only for debugging and sanitize before sharing.
  • Model Storage — Bind-mount ./storage/services/ollama:/root/.ollama (see Compose) so pulled models survive container restarts. Run docker compose exec ollama ollama list during deployments to verify availability.
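
A minimal Compose fragment for that bind mount (matching the path above):

services:
  ollama:
    volumes:
      - "./storage/services/ollama:/root/.ollama"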

Observability & Testing

  • CLI Smoke Tests
    • Captions: photoprism vision run -m caption --count 5 --force.
    • Labels: photoprism vision run -m labels --count 5 --force.
    • After each run, check photoprism vision ls for source=ollama.
  • Unit Tests
    • go test ./internal/ai/vision/ollama ./internal/ai/vision -run Ollama -count=1 covers transport parsing and model defaults.
    • Add fixtures under internal/ai/vision/testdata when capturing new response shapes; keep files small and anonymized.
  • Logging
    • Set PHOTOPRISM_LOG_LEVEL=debug to watch summary lines (“processed labels/caption via ollama”).
    • Use log.Trace sparingly; it prints truncated JSON blobs for troubleshooting.
  • Metrics
    • /api/v1/metrics exposes counts per label source; scrape after a batch to compare throughput with TensorFlow/OpenAI runs.
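
For a quick before/after comparison, something like the following works, assuming PhotoPrism listens on its default port 2342:

curl -s http://localhost:2342/api/v1/metrics | grep -i label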

Code Map

  • internal/ai/vision/ollama/*.go — Engine defaults, schema helpers, transport structs.
  • internal/ai/vision/engine_ollama.go — Builder/parser glue plus label/caption normalization.
  • internal/ai/vision/api_ollama.go — Base64 payload builder.
  • internal/ai/vision/api_client.go — Streaming decoder shared among engines.
  • internal/ai/vision/models.go — Default caption model definition (gemma3).
  • compose*.yaml — Ollama service profile, Traefik labels, and persistent volume wiring.
  • frontend/src/common/util.js — Maps src="ollama" to the correct badge; keep it updated when adding new source strings.

Next Steps

  • Add formal schema validation (JSON Schema or JTD) so malformed label responses fail fast before normalization.
  • Support multiple thumbnails per request once core workflows confirm the API contract (requires worker + UI changes).
  • Emit per-model latency and success metrics from the vision worker to simplify tuning when several Ollama engines run side-by-side.
  • Mirror any loader changes into PhotoPrism Plus/Pro templates to keep splash + browser checks consistent after enabling external engines.