docs(sync): expand hybrid manifest architecture with detailed workflows

- Add complete type definitions (HybridManifest, SnapshotFile, etc.)
- Add ASCII flow diagrams for upload, download, and snapshot workflows
- Document edge cases and failure modes:
  - Concurrent uploads with ETag/rev handling
  - Manifest corruption recovery
  - Missing snapshot handling
  - Schema version mismatch
  - Large pending operations batching
- Add configuration constants reference
- Add implementation plan with 4 phases
- Add file reference for remote storage layout
This commit is contained in:
Johannes Millan 2025-12-03 19:12:24 +01:00
parent 7362eb13dd
commit 05ecf23b69

View file

@ -2,6 +2,7 @@
**Status:** Proposal / Planned
**Context:** Optimizing WebDAV/Dropbox sync for the Operation Log architecture.
**Related:** [Operation Log Architecture](./operation-log-architecture.md)
---
@ -21,6 +22,8 @@ The current `OperationLogSyncService` fallback for file-based providers (WebDAV,
- **File Proliferation:** Rapidly creates thousands of small files, degrading WebDAV directory listing performance.
- **Latency:** On slow connections (standard WebDAV), this makes sync feel sluggish.
---
## 2. Proposed Solution: Hybrid Manifest
Instead of treating the manifest solely as an _index_ of files, we treat it as a **buffer** for recent operations.
@ -31,100 +34,578 @@ Instead of treating the manifest solely as an _index_ of files, we treat it as a
- **Lazy Flush:** New operation files (`ops_*.json`) are only created when the manifest buffer fills up.
- **Snapshots:** A "base state" file allows us to delete old operation files and clear the manifest history.
### 2.2. New Data Structures
### 2.2. Data Structures
**Updated Manifest:**
```typescript
interface HybridManifest {
version: number; // e.g., 2
version: 2;
// The baseline state (snapshot). If present, clients load this first.
lastSnapshot?: {
fileName: string; // e.g. "snapshots/state_v1_170123.json"
serverSeq: number; // The max sequence number included in this snapshot
timestamp: number;
};
lastSnapshot?: SnapshotReference;
// Ops stored directly in the manifest (The Buffer)
// Limit: ~50 ops or 100KB
embeddedOperations: Operation[];
// Limit: ~50 ops or 100KB payload size
embeddedOperations: EmbeddedOperation[];
// References to external operation files (The Overflow)
// Older ops that were flushed out of the buffer
operationFiles: string[];
operationFiles: OperationFileReference[];
// Merged vector clock from all embedded operations
// Used for quick conflict detection without parsing all ops
frontierClock: VectorClock;
// Last modification timestamp (for ETag-like cache invalidation)
lastModified: number;
}
interface SnapshotReference {
fileName: string; // e.g. "snapshots/snap_1701234567890.json"
schemaVersion: number; // Schema version of the snapshot
vectorClock: VectorClock; // Clock state at snapshot time
timestamp: number; // When snapshot was created
}
interface OperationFileReference {
fileName: string; // e.g. "ops/overflow_1701234567890.json"
opCount: number; // Number of operations in file (for progress estimation)
minSeq: number; // First operation's logical sequence in this file
maxSeq: number; // Last operation's logical sequence
}
// Embedded operations are lightweight - full Operation minus redundant fields
interface EmbeddedOperation {
id: string;
actionType: string;
opType: OpType;
entityType: EntityType;
entityId?: string;
entityIds?: string[];
payload: unknown;
clientId: string;
vectorClock: VectorClock;
timestamp: number;
schemaVersion: number;
}
```
**Snapshot File Format:**
```typescript
interface SnapshotFile {
version: 1;
schemaVersion: number; // App schema version
vectorClock: VectorClock; // Merged clock at snapshot time
timestamp: number;
data: AppDataComplete; // Full application state
checksum?: string; // Optional SHA-256 for integrity verification
}
```
---
## 3. Workflows
### 3.1. Upload (Write Path)
When a client has local pending operations to sync:
1. **Lock & Read:** Acquire remote lock (if applicable) and download `manifest.json`.
2. **Evaluate Buffer:**
- Check size of `manifest.embeddedOperations`.
- Check size of `pendingOps`.
3. **Strategy Selection:**
- **Scenario A (Standard):** If `manifest.embedded + pending < THRESHOLD`:
- Append pending ops to `manifest.embeddedOperations`.
- **Result:** 1 Write (Manifest). 0 New files.
- **Scenario B (Overflow):** If buffer is full:
- Move _existing_ `embeddedOperations` into a new external file (e.g., `ops/overflow_TIMESTAMP.json`).
- Add that new filename to `manifest.operationFiles`.
- Place _new_ `pendingOps` into the now-empty `manifest.embeddedOperations`.
- **Result:** 1 Upload (Overflow) + 1 Write (Manifest).
4. **Write:** Upload updated `manifest.json`.
```
┌─────────────────────────────────────────────────────────────────┐
│ Upload Flow │
└─────────────────────────────────────────────────────────────────┘
┌───────────────────────────────┐
│ 1. Download manifest.json │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 2. Detect remote changes │
│ (compare frontierClock) │
└───────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
Remote has new ops? No remote changes
│ │
▼ │
Download & apply first ◄───────┘
┌───────────────────────────────┐
│ 3. Check buffer capacity │
│ embedded.length + pending │
└───────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
< BUFFER_LIMIT (50) >= BUFFER_LIMIT
│ │
▼ ▼
Append to embedded Flush embedded to file
│ + add pending to empty buffer
│ │
└───────────────┬───────────────┘
┌───────────────────────────────┐
│ 4. Check snapshot trigger │
│ (operationFiles > 50 OR │
│ total ops > 5000) │
└───────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
Trigger snapshot No snapshot needed
│ │
└───────────────┬───────────────┘
┌───────────────────────────────┐
│ 5. Upload manifest.json │
└───────────────────────────────┘
```
**Detailed Steps:**
1. **Download Manifest:** Fetch `manifest.json` (or create empty v2 manifest if not found).
2. **Detect Remote Changes:**
- Compare `manifest.frontierClock` with local `lastSyncedClock`.
- If remote has unseen changes → download and apply before uploading (prevents lost updates).
3. **Evaluate Buffer:**
- `BUFFER_LIMIT = 50` operations (configurable)
- `BUFFER_SIZE_LIMIT = 100KB` payload size (prevents manifest bloat)
4. **Strategy Selection:**
- **Scenario A (Append):** If `embedded.length + pending.length < BUFFER_LIMIT`:
- Append `pendingOps` to `manifest.embeddedOperations`.
- Update `manifest.frontierClock` with merged clocks.
- **Result:** 1 Write (manifest). Fast path.
- **Scenario B (Overflow):** If buffer would exceed limit:
- Upload `manifest.embeddedOperations` to new file `ops/overflow_TIMESTAMP.json`.
- Add file reference to `manifest.operationFiles`.
- Place `pendingOps` into now-empty `manifest.embeddedOperations`.
- **Result:** 1 Upload (overflow file) + 1 Write (manifest).
5. **Upload Manifest:** Write updated `manifest.json`.
### 3.2. Download (Read Path)
When a client checks for updates:
1. **Read Manifest:** Download `manifest.json`.
2. **Check Snapshot:**
- If `manifest.lastSnapshot` is newer than local data, download and apply the snapshot file first.
3. **Process Files:**
- Download and apply any files in `manifest.operationFiles` that haven't been seen yet.
4. **Process Embedded:**
- Apply operations found in `manifest.embeddedOperations`.
```
┌─────────────────────────────────────────────────────────────────┐
│ Download Flow │
└─────────────────────────────────────────────────────────────────┘
┌───────────────────────────────┐
│ 1. Download manifest.json │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 2. Quick-check: any changes? │
│ Compare frontierClock │
└───────────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
No changes (clocks equal) Changes detected
│ │
▼ ▼
Done ┌────────────────────────┐
│ 3. Need snapshot? │
│ (local behind snapshot)│
└────────────────────────┘
┌───────────────┴───────────────┐
▼ ▼
Download snapshot Skip to ops
+ apply as base │
│ │
└───────────────┬───────────────┘
┌────────────────────────┐
│ 4. Download new op │
│ files (filter seen) │
└────────────────────────┘
┌────────────────────────┐
│ 5. Apply embedded ops │
│ (filter by op.id) │
└────────────────────────┘
┌────────────────────────┐
│ 6. Update local │
│ lastSyncedClock │
└────────────────────────┘
```
**Detailed Steps:**
1. **Download Manifest:** Fetch `manifest.json`.
2. **Quick-Check Changes:**
- Compare `manifest.frontierClock` against local `lastSyncedClock`.
- If clocks are equal → no changes, done.
3. **Check Snapshot Needed:**
- If local state is older than `manifest.lastSnapshot.vectorClock` → download snapshot first.
- Apply snapshot as base state (replaces local state).
4. **Download Operation Files:**
- Filter `manifest.operationFiles` to only files with `maxSeq > localLastAppliedSeq`.
- Download and parse each file.
- Collect all operations.
5. **Apply Embedded Operations:**
- Filter `manifest.embeddedOperations` by `op.id` (skip already-applied).
- Add to collected operations.
6. **Apply All Operations:**
- Sort by `vectorClock` (causal order).
- Detect conflicts using existing `detectConflicts()` logic.
- Apply non-conflicting ops; present conflicts to user.
7. **Update Tracking:**
- Set `localLastSyncedClock = manifest.frontierClock`.
---
## 4. Snapshotting (Compaction)
To prevent the "chain" of operation files from growing forever, any client can trigger a snapshot.
To prevent unbounded growth of operation files, any client can trigger a snapshot.
**Trigger:**
### 4.1. Triggers
- Total external `operationFiles` count > 50.
- OR Total distinct operations in history > 5000.
| Condition | Threshold | Rationale |
| ------------------------------- | --------- | -------------------------------------- |
| External `operationFiles` count | > 50 | Prevent WebDAV directory bloat |
| Total operations since snapshot | > 5000 | Bound replay time for fresh installs |
| Time since last snapshot | > 7 days | Ensure periodic cleanup |
| Manifest size | > 500KB | Prevent manifest from becoming too big |
**Process:**
### 4.2. Process
1. **Download Everything:** Ensure the client has the full, consistent state.
2. **Generate Snapshot:** Serialize the current `AppDataComplete` to a file (e.g., `snapshots/snap_SEQ_TIMESTAMP.json`).
3. **Upload Snapshot:** Upload the new snapshot file.
4. **Update Manifest:**
- Set `lastSnapshot` to the new file.
- **Clear** `operationFiles` (delete the old JSON files from the server to save space).
- **Clear** `embeddedOperations`.
5. **Cleanup:** (Async) Delete the obsolete snapshot files and old operation files from the server.
```
┌─────────────────────────────────────────────────────────────────┐
│ Snapshot Flow │
└─────────────────────────────────────────────────────────────────┘
┌───────────────────────────────┐
│ 1. Ensure full sync complete │
│ (no pending local/remote) │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 2. Read current state from │
│ NgRx (authoritative) │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 3. Generate snapshot file │
│ + compute checksum │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 4. Upload snapshot file │
│ (atomic, verify success) │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 5. Update manifest │
│ - Set lastSnapshot │
│ - Clear operationFiles │
│ - Clear embeddedOperations │
│ - Reset frontierClock │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 6. Upload manifest │
└───────────────────────────────┘
┌───────────────────────────────┐
│ 7. Cleanup (async, best- │
│ effort): delete old files │
└───────────────────────────────┘
```
## 5. Advantages
### 4.3. Snapshot Atomicity
| Metric | Old Approach | New Approach |
| :---------------------- | :----------------------------------- | :----------------------------------------------------- |
| **Requests per Sync** | 3 (Upload Op + Read Man + Write Man) | **2** (Read Man + Write Man) |
| **Files on Server** | 1 per sync (Unlimited growth) | **Bounded** (1 Manifest + ~0-50 Op Files + 1 Snapshot) |
| **Fresh Install Speed** | Slow (Replay thousands of JSONs) | **Fast** (Download 1 Snapshot + recent delta) |
| **Conflict Handling** | Same (Vector Clocks) | Same (Vector Clocks) |
**Problem:** If the client crashes between uploading snapshot and updating manifest, other clients won't see the new snapshot.
## 6. Implementation Plan
**Solution:** Snapshot files are immutable and safe to leave orphaned. The manifest is the source of truth. Cleanup is best-effort.
1. **Modify `OperationLogSyncService`:**
- Update `_loadRemoteManifest` to handle v2 format.
- Refactor `_uploadPendingOpsViaFiles` to implement the buffer/overflow logic.
2. **Add Snapshot Logic:**
- Create `OperationLogSnapshotService` to handle generating and hydrating from the large snapshot files.
- Add simple heuristic to `sync.service` to decide when to snapshot.
3. **Migration:**
- When a v2 client sees a v1 manifest, it should automatically "upgrade" it (move existing files to `operationFiles` list and add version tag).
**Invariant:** Never delete the current `lastSnapshot` file until a new snapshot is confirmed.
---
## 5. Conflict Handling
The hybrid manifest doesn't change conflict detection - it still uses vector clocks. However, the `frontierClock` in the manifest enables **early conflict detection**.
### 5.1. Early Conflict Detection
Before downloading all operations, compare clocks:
```typescript
const comparison = compareVectorClocks(localFrontierClock, manifest.frontierClock);
switch (comparison) {
case VectorClockComparison.LESS_THAN:
// Remote is ahead - safe to download
break;
case VectorClockComparison.GREATER_THAN:
// Local is ahead - upload our changes
break;
case VectorClockComparison.CONCURRENT:
// Potential conflicts - download ops for detailed analysis
break;
case VectorClockComparison.EQUAL:
// No changes - skip download
break;
}
```
### 5.2. Conflict Resolution
When conflicts are detected at the operation level, the existing `ConflictResolutionService` handles them. The hybrid manifest doesn't change this flow.
---
## 6. Edge Cases & Failure Modes
### 6.1. Concurrent Uploads (Race Condition)
**Scenario:** Two clients download the manifest simultaneously, both append ops, both upload.
**Problem:** Second upload overwrites first client's operations.
**Solution:** Use provider-specific mechanisms:
| Provider | Mechanism |
| ----------- | ------------------------------------------- |
| **Dropbox** | Use `update` mode with `rev` parameter |
| **WebDAV** | Use `If-Match` header with ETag |
| **Local** | File locking (already implemented in PFAPI) |
**Implementation:**
```typescript
interface HybridManifest {
// ... existing fields
// Optimistic concurrency control
etag?: string; // Server-assigned revision (Dropbox rev, WebDAV ETag)
}
async uploadManifest(manifest: HybridManifest, expectedEtag?: string): Promise<void> {
// If expectedEtag provided, use conditional upload
// On conflict (412 Precondition Failed), re-download and retry
}
```
### 6.2. Manifest Corruption
**Scenario:** Manifest JSON is invalid (partial write, encoding issue).
**Recovery Strategy:**
1. Attempt to parse manifest.
2. On parse failure, check for backup manifest (`manifest.json.bak`).
3. If no backup, reconstruct from operation files using `listFiles()`.
4. If reconstruction fails, fall back to snapshot-only state.
```typescript
async loadManifestWithRecovery(): Promise<HybridManifest> {
try {
return await this._loadRemoteManifest();
} catch (parseError) {
PFLog.warn('Manifest corrupted, attempting recovery...');
// Try backup
try {
return await this._loadBackupManifest();
} catch {
// Reconstruct from files
return await this._reconstructManifestFromFiles();
}
}
}
```
### 6.3. Snapshot File Missing
**Scenario:** Manifest references a snapshot that doesn't exist on the server.
**Recovery Strategy:**
1. Log error and notify user.
2. Fall back to replaying all available operation files.
3. If operation files also reference missing ops, show data loss warning.
### 6.4. Schema Version Mismatch
**Scenario:** Snapshot was created with schema version 3, but local app is version 2.
**Handling:**
- If `snapshot.schemaVersion > CURRENT_SCHEMA_VERSION + MAX_VERSION_SKIP`:
- Reject snapshot, prompt user to update app.
- If `snapshot.schemaVersion > CURRENT_SCHEMA_VERSION`:
- Load with warning (some fields may be stripped by Typia).
- If `snapshot.schemaVersion < CURRENT_SCHEMA_VERSION`:
- Run migrations on loaded state.
### 6.5. Large Pending Operations
**Scenario:** User was offline for a week, has 500 pending operations.
**Handling:**
- Don't try to embed all 500 in manifest.
- Batch into multiple overflow files (100 ops each).
- Upload files first, then update manifest once.
```typescript
const BATCH_SIZE = 100;
const chunks = chunkArray(pendingOps, BATCH_SIZE);
for (const chunk of chunks) {
await this._uploadOverflowFile(chunk);
}
// Single manifest update at the end
await this._uploadManifest(manifest);
```
---
## 7. Advantages Summary
| Metric | Current (v1) | Hybrid Manifest (v2) |
| :---------------------- | :----------------------------------- | :---------------------------------------------------- |
| **Requests per Sync** | 3 (Upload Op + Read Man + Write Man) | **1-2** (Read Man, optional Write) |
| **Files on Server** | Unbounded growth | **Bounded** (1 Manifest + 0-50 Op Files + 1 Snapshot) |
| **Fresh Install Speed** | O(n) - replay all ops | **O(1)** - load snapshot + small delta |
| **Conflict Detection** | Must parse all ops | **Quick check** via frontierClock |
| **Bandwidth per Sync** | ~2KB (op file) + manifest overhead | **~1KB** (manifest only for small changes) |
| **Offline Resilience** | Good | **Same** (operations buffered locally) |
---
## 8. Implementation Plan
### Phase 1: Core Infrastructure
1. **Update Types** (`operation.types.ts`):
- Add `HybridManifest`, `SnapshotReference`, `OperationFileReference` interfaces.
- Keep backward compatibility with existing `OperationLogManifest`.
2. **Manifest Handling** (`operation-log-sync.service.ts`):
- Update `_loadRemoteManifest()` to detect version and parse accordingly.
- Add `_migrateV1ToV2Manifest()` for automatic upgrade.
- Implement buffer/overflow logic in `_uploadPendingOpsViaFiles()`.
3. **Add FrontierClock Tracking**:
- Merge vector clocks when adding embedded operations.
- Store `lastSyncedFrontierClock` locally for quick-check.
### Phase 2: Snapshot Support
4. **Create `HybridSnapshotService`**:
- `generateSnapshot()`: Serialize current state + compute checksum.
- `uploadSnapshot()`: Upload with retry logic.
- `loadSnapshot()`: Download + validate + apply.
5. **Integrate Snapshot Triggers**:
- Check conditions after each upload.
- Add manual "Force Snapshot" option in settings for debugging.
### Phase 3: Robustness
6. **Optimistic Concurrency**:
- Implement ETag/rev-based conditional uploads.
- Add retry-on-conflict logic.
7. **Recovery Logic**:
- Manifest corruption recovery.
- Missing file handling.
- Schema migration for snapshots.
### Phase 4: Testing & Migration
8. **Add Tests**:
- Unit tests for buffer overflow logic.
- Integration tests for multi-client scenarios.
- Stress tests for large operation counts.
9. **Migration Path**:
- v1 clients continue to work (read v2 manifest, ignore new fields).
- v2 clients auto-upgrade v1 manifests on first write.
---
## 9. Configuration Constants
```typescript
// Buffer limits
const EMBEDDED_OP_LIMIT = 50; // Max operations in manifest buffer
const EMBEDDED_SIZE_LIMIT_KB = 100; // Max payload size in KB
// Snapshot triggers
const SNAPSHOT_FILE_THRESHOLD = 50; // Trigger when operationFiles exceeds this
const SNAPSHOT_OP_THRESHOLD = 5000; // Trigger when total ops exceed this
const SNAPSHOT_AGE_DAYS = 7; // Trigger if no snapshot in N days
// Batching
const UPLOAD_BATCH_SIZE = 100; // Ops per overflow file
// Retry
const MAX_UPLOAD_RETRIES = 3;
const RETRY_DELAY_MS = 1000;
```
---
## 10. Open Questions
1. **Encryption:** Should snapshots be encrypted differently than operation files? (Same encryption is simpler)
2. **Compression:** Should we gzip the snapshot file? (Trade-off: smaller size vs. no partial reads)
3. **Checksum Verification:** Is SHA-256 overkill for snapshot integrity? (Consider CRC32 for speed)
4. **Clock Drift:** How to handle clients with significantly wrong system clocks? (Vector clocks help, but timestamps in snapshot could confuse users)
---
## 11. File Reference
```
Remote Storage Layout (v2):
├── manifest.json # HybridManifest (buffer + references)
├── ops/
│ ├── overflow_170123.json # Flushed operations (batches of 100)
│ └── overflow_170456.json
└── snapshots/
└── snap_170789.json # Full state snapshot
```
```
Code Files:
src/app/core/persistence/operation-log/
├── operation.types.ts # Add HybridManifest types
├── operation-log-sync.service.ts # Buffer/overflow logic
├── hybrid-snapshot.service.ts # NEW: Snapshot generation/loading
└── manifest-recovery.service.ts # NEW: Corruption recovery
```