# PhotoPrism — Ollama Engine Integration

*Last Updated: December 10, 2025*
## Overview

This package provides PhotoPrism's native adapter for Ollama-compatible multimodal models. It lets Caption, Labels, and future Generate workflows call locally hosted models without changing worker logic, reusing the shared API client (`internal/ai/vision/api_client.go`) and result types (`LabelResult`, `CaptionResult`). Requests stay inside your infrastructure, rely on base64 thumbnails, and honor the same ACL, timeout, and logging hooks as the default TensorFlow engines. The adapter resolves `${OLLAMA_BASE_URL}/api/generate`, trimming trailing slashes and defaulting to `http://ollama:11434`; set `OLLAMA_BASE_URL=https://ollama.com` to opt into cloud defaults.
## Context & Constraints

- Engine defaults live in `internal/ai/vision/ollama` and are applied whenever a model sets `Engine: ollama`. Aliases map to `ApiFormatOllama`, `scheme.Base64`, and a default 720 px thumbnail. Cloud defaults are only selected when `OLLAMA_BASE_URL` equals `https://ollama.com`.
- Responses may arrive as newline-delimited JSON chunks. `decodeOllamaResponse` keeps the most recent chunk, while `parseOllamaLabels` replays plain JSON strings found in `response` (see the sketch after this list).
- Structured JSON is optional for captions but enforced for labels when `Format: json` is set (the default for label models targeting the Ollama engine).
- The adapter never overwrites TensorFlow defaults. If an Ollama call fails, downstream code still has the Nasnet, NSFW, and Face models available.
- Workers assume a single-image payload per request. Run `photoprism vision run` to validate multi-image prompts before changing that invariant.
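To make the chunk handling concrete, here is a minimal sketch of reducing a newline-delimited stream to its final chunk. This is not the actual `decodeOllamaResponse` implementation; the `ollamaChunk` type and its field set are assumptions modeled on the public `/api/generate` response shape.

```go
package main

import (
	"bufio"
	"encoding/json"
	"fmt"
	"strings"
)

// ollamaChunk models the subset of an /api/generate chunk relevant here.
// Field names follow the public Ollama API; the adapter uses its own types.
type ollamaChunk struct {
	Response string `json:"response"`
	Thinking string `json:"thinking"`
	Done     bool   `json:"done"`
}

// lastChunk scans newline-delimited JSON and returns the most recent chunk
// that decodes cleanly, mirroring the "keep the last chunk" behavior above.
func lastChunk(body string) (ollamaChunk, error) {
	var last ollamaChunk
	found := false
	sc := bufio.NewScanner(strings.NewReader(body))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue
		}
		var c ollamaChunk
		if err := json.Unmarshal([]byte(line), &c); err != nil {
			continue // skip partial or malformed chunks
		}
		last, found = c, true
	}
	if !found {
		return last, fmt.Errorf("no decodable chunks in response")
	}
	return last, sc.Err()
}

func main() {
	body := `{"response":"partial","done":false}` + "\n" +
		`{"response":"{\"labels\":[]}","done":true}`
	c, err := lastChunk(body)
	fmt.Println(c.Response, c.Done, err) // {"labels":[]} true <nil>
}
```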
## Goals

- Let operators opt into local, private LLMs for captions and labels via `vision.yml`.
- Provide safe defaults (prompts, schema, sampling) so most deployments only need to specify `Name`, `Engine`, and `Service.Uri`.
- Surface reproducible logs, metrics, and CLI commands that make it easy to compare Ollama output against the TensorFlow/OpenAI engines.
## Non-Goals

- Managing Ollama itself (model downloads, GPU scheduling, or authentication). Use the Compose profiles provided in the repository.
- Adding new HTTP endpoints or bypassing the existing `photoprism vision` CLI.
- Replacing TensorFlow workers; Ollama engines are additive and opt-in.
## Architecture & Request Flow

1. **Model Selection:** `Config.Model(ModelType)` returns the top-most enabled entry. When `Engine: ollama`, `ApplyEngineDefaults()` fills in the request/response format, the base64 file scheme, and a 720 px resolution unless overridden.
2. **Request Build:** `ollamaBuilder.Build` wraps thumbnails with `NewApiRequestOllama`, which encodes them as base64 strings. `Model.GetModel()` resolves the exact Ollama tag (`gemma3:4b`, `qwen2.5vl:7b`, etc.). A payload sketch follows this list.
3. **Transport:** `PerformApiRequest` uses a single HTTP POST (default timeout: 10 minutes). Authentication is optional; provide `Service.Key` if you proxy through an API gateway.
4. **Parsing:** `ollamaParser.Parse` converts payloads into `ApiResponse`. It normalizes confidences (`LabelConfidenceDefault = 0.5` when missing), copies NSFW scores, and canonicalizes label names via `normalizeLabelResult`.
5. **Persistence:** `entity.SrcOllama` is stamped on labels and captions so UI badges and audits reflect the new source.
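For reference, here is a minimal sketch of the JSON body such a request produces. Field names match the public Ollama `/api/generate` API; the `generateRequest` type is illustrative rather than the adapter's internal builder, and the model, prompt, and file name are placeholders.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"os"
)

// generateRequest mirrors the public /api/generate body that
// NewApiRequestOllama assembles internally.
type generateRequest struct {
	Model   string         `json:"model"`
	System  string         `json:"system,omitempty"`
	Prompt  string         `json:"prompt"`
	Images  []string       `json:"images,omitempty"` // base64-encoded thumbnails
	Format  string         `json:"format,omitempty"` // "json" requests structured output
	Stream  bool           `json:"stream"`
	Options map[string]any `json:"options,omitempty"`
}

func main() {
	thumb, err := os.ReadFile("thumbnail.jpg") // a 720 px thumbnail by default
	if err != nil {
		panic(err)
	}
	req := generateRequest{
		Model:  "qwen2.5vl:7b",
		Prompt: "List the main subjects in this photo.",
		Images: []string{base64.StdEncoding.EncodeToString(thumb)},
		Format: "json",
		Stream: false,
		Options: map[string]any{
			"temperature": 0.1,
			"top_p":       0.9,
			"stop":        []string{"\n\n"},
		},
	}
	body, _ := json.MarshalIndent(req, "", "  ")
	fmt.Println(string(body)) // POST this to ${OLLAMA_BASE_URL}/api/generate
}
```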
## Prompt, Schema, & Options Guidance

- **System Prompts**
  - Labels: `LabelSystem` enforces single-word nouns. Set `System` to override; assign `LabelSystemSimple` when you need descriptive phrases.
  - Captions: no system prompt by default; rely on the user prompt or set one explicitly for stylistic needs.
- **User Prompts**
  - Captions use `CaptionPrompt`, which requests one sentence in active voice.
  - Labels default to `LabelPromptDefault`; when `DetectNSFWLabels` is true, the adapter swaps in `LabelPromptNSFW`.
  - For stricter noun enforcement, set `Prompt` to `LabelPromptStrict`.
- **Schemas**
  - Labels rely on `schema.LabelsJson(nsfw)`, a simple JSON template. Setting `Format: json` auto-attaches a reminder (`model.SchemaInstructions()`).
  - Override via `Schema` (inline YAML) or `SchemaFile`. `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE` always wins if present.
- **Options**
  - Labels: the default `Temperature` equals `DefaultTemperature` (0.1 unless configured), `TopP=0.9`, `Stop=["\n\n"]`.
  - Captions: only `Temperature` is set; other parameters inherit global defaults.
  - Custom `Options` merge with engine defaults. Leave `ForceJson=true` for labels so PhotoPrism can reject malformed payloads early (a parsing sketch follows this list).
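To show what the parsing and defaulting amount to, here is a minimal sketch, assuming a `{"labels": [...]}` payload as produced by the default schema. The `labelResult` type stands in for the shared `LabelResult`, and the real `normalizeLabelResult` also canonicalizes names against PhotoPrism's label taxonomy, which is omitted here.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// labelResult is an illustrative stand-in for the shared LabelResult type;
// the actual field set lives in internal/ai/vision.
type labelResult struct {
	Name       string  `json:"name"`
	Confidence float64 `json:"confidence"`
}

const labelConfidenceDefault = 0.5 // applied when the model omits a score

// parseLabels sketches the normalization step: trimmed, lowercased names
// and a default confidence for entries the model returned without one.
func parseLabels(payload string) ([]labelResult, error) {
	var out struct {
		Labels []labelResult `json:"labels"`
	}
	if err := json.Unmarshal([]byte(payload), &out); err != nil {
		return nil, err
	}
	labels := out.Labels[:0] // filter in place
	for _, l := range out.Labels {
		l.Name = strings.ToLower(strings.TrimSpace(l.Name))
		if l.Name == "" {
			continue // drop empty entries
		}
		if l.Confidence <= 0 {
			l.Confidence = labelConfidenceDefault
		}
		labels = append(labels, l)
	}
	return labels, nil
}

func main() {
	labels, err := parseLabels(`{"labels":[{"name":" Beach "},{"name":"sunset","confidence":0.92}]}`)
	fmt.Println(labels, err) // [{beach 0.5} {sunset 0.92}] <nil>
}
```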
## Supported Ollama Vision Models

| Model (Ollama Tag) | Size & Footprint | Strengths | JSON & Language Notes | When To Use |
|---|---|---|---|---|
| `gemma3:4b` / `12b` / `27b` | 4B/12B/27B parameters, ~3.3 GB → 17 GB downloads, 128 K context | Multimodal text+image reasoning with SigLIP encoder, handles OCR/long documents, supports tool/function calling | Emits structured JSON reliably; >140 languages with strong default English output | High-quality captions + multilingual labels when you have ≥12 GB VRAM (4B works on 8 GB with Q4_K_M) |
| `qwen2.5vl:7b` | 8.29 B params (Q4_K_M), ≈6 GB download, 125 K context | Excellent charts, GUI grounding, DocVQA, multi-image reasoning, agentic tool use | JSON mode tuned for schema compliance; supports 20+ languages with strong Chinese/English parity | Label extraction for mixed-language archives or UI/diagram analysis |
| `qwen3-vl:2b` / `4b` / `8b` | Dense 2B/4B/8B tiers (~3 GB, ~3.5 GB, ~6 GB downloads) with native 256 K context extendable to 1 M; fits single 12–24 GB GPUs or high-end CPUs (2B) | Spatial + video reasoning upgrades (Interleaved-MRoPE, DeepStack), 32-language OCR, GUI/agent control, long-document ingest | Emits JSON reliably when prompts specify a schema; multilingual captions/labels with Thinking variants boosting STEM reasoning | General-purpose captions/labels when you need long-context doc/video support without cloud APIs; 2B for CPU/edge, 4B as balanced default, 8B when accuracy outweighs latency |
| `llama3.2-vision:11b` | 11 B params, ~7.8 GB download, requires ≥8 GB VRAM; the 90 B variant needs ≥64 GB | Strong general reasoning, captioning, OCR; supported by Meta ecosystem tooling | Vision tasks officially supported in English; text-only tasks cover eight major languages | Keeping captions consistent with Meta-compatible prompts, or when teams already standardize on Llama 3.x |
| `minicpm-v:8b-2.6` | 8 B params, ~5.5 GB download, 32 K context | Optimized for edge GPUs, high OCR accuracy, multi-image/video support, low token count (≈640 tokens for 1.8 MP) | Multilingual (EN/ZH/DE/FR/IT/KR); emits concise JSON but may need stricter stop sequences | Memory-constrained deployments that still require NSFW/OCR-aware label output |
> **Tip:** Pull models inside the dev container with `docker compose --profile ollama up -d` and then `docker compose exec ollama ollama pull gemma3:4b`. Keep the profile stopped when you do not need the extra GPU/CPU load.
> **Note:** Qwen3-VL models stream their JSON payload via the `thinking` field. PhotoPrism v2025.11+ captures this automatically; if you run older builds, upgrade before enabling these models or responses will appear empty.
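A minimal sketch of that fallback, assuming the public response field names (the adapter's actual handling lives in `internal/ai/vision`):

```go
package main

import (
	"fmt"
	"strings"
)

// chunk holds the two fields that matter for this fallback; the names
// follow the public Ollama API response shape.
type chunk struct {
	Response string `json:"response"`
	Thinking string `json:"thinking"`
}

// payloadText prefers the regular response but falls back to the thinking
// field, which Qwen3-VL models may use to carry the JSON payload.
func payloadText(c chunk) string {
	if s := strings.TrimSpace(c.Response); s != "" {
		return s
	}
	return strings.TrimSpace(c.Thinking)
}

func main() {
	fmt.Println(payloadText(chunk{Thinking: `{"labels":[]}`})) // {"labels":[]}
}
```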
## Configuration

### Environment Variables

- `PHOTOPRISM_VISION_LABEL_SCHEMA_FILE`: Absolute path to a JSON snippet that overrides the default label schema (applies to every Ollama label model).
- `PHOTOPRISM_VISION_YAML`: Custom `vision.yml` path. Keep it synced in Git if you automate deployments.
- `OLLAMA_HOST`, `OLLAMA_MODELS`, `OLLAMA_MAX_QUEUE`, `OLLAMA_NUM_PARALLEL`, etc.: Provided in `compose*.yaml` to tune the Ollama daemon. Adjust `OLLAMA_KEEP_ALIVE` if you want models to stay loaded between worker batches.
- `OLLAMA_API_KEY` / `OLLAMA_API_KEY_FILE`: Default bearer token picked up when `Service.Key` is empty; useful for hosted Ollama services (e.g., Ollama Cloud).
- `OLLAMA_BASE_URL`: Base URL for the Ollama API; defaults to `http://ollama:11434`, and trailing slashes are trimmed. Set to `https://ollama.com` to enable cloud defaults.
- `PHOTOPRISM_LOG_LEVEL=trace`: Enables verbose request/response previews (truncated to avoid leaking images). Use temporarily when debugging parsing issues.
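The base-URL handling can be summarized in a few lines; this standalone sketch mirrors the documented trim-and-default behavior, while the actual resolution lives in `internal/ai/vision/ollama`.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// endpoint resolves ${OLLAMA_BASE_URL}/api/generate as described above:
// trim trailing slashes and fall back to the in-cluster default.
func endpoint() string {
	base := strings.TrimRight(os.Getenv("OLLAMA_BASE_URL"), "/")
	if base == "" {
		base = "http://ollama:11434"
	}
	return base + "/api/generate"
}

func main() {
	fmt.Println(endpoint()) // http://ollama:11434/api/generate when unset
}
```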
### vision.yml Example

```yaml
Models:
  - Type: labels
    Name: qwen2.5vl:7b
    Engine: ollama
    Run: newly-indexed
    Resolution: 720
    Format: json
    Options:
      Temperature: 0.05
      Stop: ["\n\n"]
      ForceJson: true
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate
      RequestFormat: ollama
      ResponseFormat: ollama
      FileScheme: base64
  - Type: caption
    Name: gemma3:4b
    Engine: ollama
    Disabled: false
    Options:
      Temperature: 0.2
    Service:
      Uri: ${OLLAMA_BASE_URL}/api/generate
```
Guidelines:

- Place new entries after the default TensorFlow models so they take precedence while Nasnet/NSFW remain as fallbacks.
- Always specify the exact Ollama tag (`model:version`) so upgrades are deliberate.
- Keep option flags before positional arguments in CLI snippets (`photoprism vision run -m labels --count 1`).
- If you proxy requests (e.g., through Traefik), set `Service.Key` to `Bearer <token>` and configure the proxy to inject/validate it (a request sketch follows this list).
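The following sketch shows how a configured `Service.Key` can be attached as an `Authorization` header on the way out. The real client in `api_client.go` handles this alongside timeouts and logging; the URL and helper name here are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"time"
)

// postWithKey attaches a pre-formatted Service.Key as the Authorization
// header; PhotoPrism's actual transport lives in api_client.go.
func postWithKey(url, key string, body []byte) (*http.Response, error) {
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if key != "" {
		req.Header.Set("Authorization", key) // e.g. "Bearer <token>"
	}
	client := &http.Client{Timeout: 10 * time.Minute} // matches the default ServiceTimeout
	return client.Do(req)
}

func main() {
	resp, err := postWithKey("http://ollama:11434/api/generate", "", []byte(`{}`))
	if err != nil {
		fmt.Println(err)
		return
	}
	defer resp.Body.Close()
	fmt.Println(resp.Status)
}
```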
## Operational Checklist

- **Scheduling:** Use `Run: newly-indexed` for incremental runs, `Run: manual` for ad-hoc CLI calls, or `Run: on-schedule` when paired with the scheduler. Leave `Run: auto` if you want the worker to decide based on other model states.
- **Timeouts & Retries:** The default timeout is 10 minutes (`ServiceTimeout`). Ollama streaming responses complete faster in practice; if you need stricter SLAs, wrap `photoprism vision run` in a job runner and retry failed batches manually (see the retry sketch after this list).
- **Fallbacks:** Keep Nasnet configured even when Ollama labels are primary. `labels.go` stops at the first successful engine, so duplicates are avoided.
- **Security:** When exposing Ollama beyond localhost, terminate TLS at Traefik and enable API keys. Never return full JSON payloads in logs; rely on trace mode only for debugging, and sanitize before sharing.
- **Model Storage:** Bind-mount `./storage/services/ollama:/root/.ollama` (see Compose) so pulled models survive container restarts. Run `docker compose exec ollama ollama list` during deployments to verify availability.
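A minimal sketch of such a job-runner wrapper, assuming a fixed backoff and attempt count (both illustrative); it shells out to the real CLI rather than modeling PhotoPrism internals.

```go
package main

import (
	"fmt"
	"os/exec"
	"time"
)

// runWithRetry invokes the photoprism CLI up to `attempts` times with a
// fixed backoff between failures, as the checklist above suggests.
func runWithRetry(attempts int, backoff time.Duration, args ...string) error {
	var err error
	for i := 0; i < attempts; i++ {
		cmd := exec.Command("photoprism", args...)
		if out, runErr := cmd.CombinedOutput(); runErr == nil {
			fmt.Printf("attempt %d ok: %d bytes of output\n", i+1, len(out))
			return nil
		} else {
			err = fmt.Errorf("attempt %d: %w", i+1, runErr)
			time.Sleep(backoff)
		}
	}
	return err
}

func main() {
	if err := runWithRetry(3, 30*time.Second, "vision", "run", "-m", "labels", "--count", "1"); err != nil {
		fmt.Println("all attempts failed:", err)
	}
}
```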
## Observability & Testing

- **CLI Smoke Tests**
  - Captions: `photoprism vision run -m caption --count 5 --force`
  - Labels: `photoprism vision run -m labels --count 5 --force`
  - After each run, check `photoprism vision ls` for `source=ollama`.
- **Unit Tests**
  - `go test ./internal/ai/vision/ollama ./internal/ai/vision -run Ollama -count=1` covers transport parsing and model defaults.
  - Add fixtures under `internal/ai/vision/testdata` when capturing new response shapes; keep files small and anonymized (a fixture-driven test sketch follows this section).
- **Logging**
  - Set `PHOTOPRISM_LOG_LEVEL=debug` to watch summary lines ("processed labels/caption via ollama").
  - Use `log.Trace` sparingly; it prints truncated JSON blobs for troubleshooting.
- **Metrics**
  - `/api/v1/metrics` exposes counts per label source; scrape after a batch to compare throughput with TensorFlow/OpenAI runs.
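As a starting point for fixture-driven coverage, here is a table-driven test sketch. It assumes the `parseLabels` sketch from the options section sits in the same package; the fixture names and expected counts are hypothetical.

```go
package main

import (
	"os"
	"path/filepath"
	"testing"
)

// TestOllamaFixtures replays anonymized response fixtures through the
// parseLabels sketch shown earlier. Fixture names are illustrative, not
// files that ship with the repository.
func TestOllamaFixtures(t *testing.T) {
	cases := []struct {
		fixture string
		want    int // expected label count
	}{
		{"ollama_labels_plain.json", 3},
		{"ollama_labels_nsfw.json", 4},
	}
	for _, tc := range cases {
		t.Run(tc.fixture, func(t *testing.T) {
			data, err := os.ReadFile(filepath.Join("testdata", tc.fixture))
			if err != nil {
				t.Fatalf("read fixture: %v", err)
			}
			labels, err := parseLabels(string(data))
			if err != nil {
				t.Fatalf("parse: %v", err)
			}
			if len(labels) != tc.want {
				t.Errorf("got %d labels, want %d", len(labels), tc.want)
			}
		})
	}
}
```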
## Code Map

- `internal/ai/vision/ollama/*.go`: Engine defaults, schema helpers, transport structs.
- `internal/ai/vision/engine_ollama.go`: Builder/parser glue plus label/caption normalization.
- `internal/ai/vision/api_ollama.go`: Base64 payload builder.
- `internal/ai/vision/api_client.go`: Streaming decoder shared among engines.
- `internal/ai/vision/models.go`: Default caption model definition (gemma3).
- `compose*.yaml`: Ollama service profile, Traefik labels, and persistent volume wiring.
- `frontend/src/common/util.js`: Maps `src="ollama"` to the correct badge; keep it updated when adding new source strings.
## Next Steps
- Add formal schema validation (JSON Schema or JTD) so malformed label responses fail fast before normalization.
- Support multiple thumbnails per request once core workflows confirm the API contract (requires worker + UI changes).
- Emit per-model latency and success metrics from the vision worker to simplify tuning when several Ollama engines run side-by-side.
- Mirror any loader changes into PhotoPrism Plus/Pro templates to keep splash + browser checks consistent after enabling external engines.