
Agent / team prompt: complete Milestone 1 (Sover / P2POS family vault)

Status (in-repo), already shipped:

  • Replication: failed → pending requeue after a cooldown (P2POS_REP_FAILED_RETRY_AFTER_SECS), covered by the Rust unit test requeue_stale_failed_replication_jobs_after_peer_outage.
  • docker-compose.operator-infra.yml + docker-compose.family-nodes.yml.
  • docs/MILESTONE_1_RUNBOOK.md and scripts/verify-webrtc-e2e.sh.
  • Same-origin API default documented in vault-web.

Further polish (strict CORS, HA operator, optional WebRTC Playwright job) can still be filed.

Use this document as the single briefing for an implementer or coding agent. Canonical product definition: docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5 (Milestone 1) and §12 (reachability). Do not expand scope into Milestone 2 (Android, QR enrollment polish) unless explicitly asked.


Role

You are implementing and hardening Milestone 1: operator-hosted signaling + STUN/TURN; user runs two nodes (e.g. Docker) and the web app; ICE selects paths automatically; redundancy and catch-up across two nodes; bootstrap trends toward a single user entry.

Match existing repo style (Rust workspace, p2pos-node, p2pos-net, family-vault, apps/vault-web, docker/e2e). Prefer small, reviewable PRs. Every behavior change should have automated tests where practical.


Personas (must remain true after your work)

  • Operator: hosts signaling / seeder + STUN/TURN (e.g. on a VPS). Has no access to family keys or plaintext photos.
  • User: runs nodes (PC/Docker today); uses vault-web (and future apps) on top of Sover. Must not manually pick ICE paths or maintain IP lists for WebRTC; ICE nominates candidate pairs after STUN/TURN supply candidates.

Functional goals (Milestone 1)

  1. Two nodes + web app
     • The user can run two p2pos-node instances (Docker or bare metal), register each other as peers, and use vault-web to create albums, upload encrypted photos, and see replication status.
     • Node ↔ node replication uses WebRTC data channels when ICE succeeds, with HTTP fallback for large blobs or failures (already Phase G; preserve and fix bugs only).

  2. Browser ↔ node
     • Support the vault API over WebRTC (data channel tunnel) for the paths that can already use VITE_VAULT_WEBRTC=1, with HTTP still available for bootstrap (auth) and selected metadata routes, as today.
     • GET /v1/nodes must expose browser-reachable signaling and ICE hints (the P2POS_BROWSER_* pattern already in p2pos-node; extend it if a split deploy needs more fields).
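As a rough sketch of the hint surface, the response could be assembled from operator-supplied environment variables. This is a minimal illustration only: the struct fields and the specific P2POS_BROWSER_* variable names below are assumptions following the pattern named above; the real field set lives in p2pos-node.

```rust
use std::collections::HashMap;

/// Browser-facing reachability hints to serve from GET /v1/nodes.
/// Field names and env-var names are illustrative stand-ins following
/// the P2POS_BROWSER_* pattern; the real names live in p2pos-node.
#[derive(Debug, PartialEq)]
struct BrowserHints {
    signal_url: Option<String>, // e.g. wss://operator.example/signal
    stun_url: Option<String>,   // e.g. stun:operator.example:3478
    turn_url: Option<String>,   // e.g. turn:operator.example:3478
}

/// Build the hints from an environment lookup. Taking a closure (rather
/// than reading std::env directly) keeps the function testable without
/// mutating real process environment variables.
fn browser_hints(get: impl Fn(&str) -> Option<String>) -> BrowserHints {
    BrowserHints {
        signal_url: get("P2POS_BROWSER_SIGNAL_URL"),
        stun_url: get("P2POS_BROWSER_STUN_URL"),
        turn_url: get("P2POS_BROWSER_TURN_URL"),
    }
}

fn main() {
    // Simulated operator env: only the signaling URL is configured.
    let env = HashMap::from([("P2POS_BROWSER_SIGNAL_URL", "wss://operator.example/signal")]);
    let hints = browser_hints(|k| env.get(k).map(|v| v.to_string()));
    println!("signal: {:?}, turn set: {}", hints.signal_url, hints.turn_url.is_some());
}
```

A split deploy would extend the same struct with whatever extra fields the browser needs, rather than inventing a second endpoint.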

  3. Operator vs user split (documentation + optional compose)
     • Ship a clear runbook: the minimal operator stack (signal + coturn, env vars, ports, TLS notes) vs the family stack (nodes + UI only, pointing at operator URLs).
     • Prefer one compose file pair or profiles so docker compose can start only the infra or only nodes+UI without copy-paste errors.

  4. Bootstrap: single entry (product direction)
     • Reduce reliance on “type the right IP”: document one URL pattern (e.g. the UI behind nginx on a known host, or a future DNS name).
     • If you add code: e.g. a relative API base ("") so the SPA talks to the same origin that served the static files; optional VITE_* only for dev.
     • Full QR enrollment is Milestone 2; do not block M1 on it, but do document the gap.

  5. Redundancy and catch-up
     • One node stopped: the other node still serves data already replicated; document that the user keeps one bookmark/origin when possible.
     • Replication jobs: if a peer is down, jobs should remain retryable and eventually succeed when the peer returns. Close the gap where jobs flip to failed after a fixed attempt cap (rep_worker.rs / vault_db.rs) with no automatic re-queue; e.g. reset failed → pending on a timer, on peer heartbeat, or via an admin API. Pick the smallest design that is testable.
     • Add an automated test (Rust integration or E2E) proving: peer down → jobs pending/failed per policy → peer up → replication completes without manual DB edits.
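The timer-based option above can be sketched as a single requeue pass. This is an in-memory illustration only: the real policy operates on the vault_db job table from rep_worker.rs, and the types and cooldown constant here are assumptions (the cooldown stands in for P2POS_REP_FAILED_RETRY_AFTER_SECS).

```rust
use std::time::{Duration, SystemTime};

/// Replication job states, mirroring the pending/failed split described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Pending,
    Failed,
    Ok,
}

struct RepJob {
    state: JobState,
    failed_at: Option<SystemTime>,
}

/// Reset Failed jobs back to Pending once the cooldown has elapsed, so a
/// returning peer can pick them up without manual DB edits.
/// Returns the number of jobs requeued.
fn requeue_stale_failures(jobs: &mut [RepJob], cooldown: Duration, now: SystemTime) -> usize {
    let mut requeued = 0;
    for job in jobs.iter_mut() {
        if job.state == JobState::Failed {
            let stale = job
                .failed_at
                .map(|t| now.duration_since(t).map(|d| d >= cooldown).unwrap_or(false))
                .unwrap_or(true); // no failure timestamp recorded: treat as stale
            if stale {
                job.state = JobState::Pending;
                job.failed_at = None;
                requeued += 1;
            }
        }
    }
    requeued
}

fn main() {
    let now = SystemTime::now();
    let cooldown = Duration::from_secs(300); // stand-in for P2POS_REP_FAILED_RETRY_AFTER_SECS
    let mut jobs = vec![
        RepJob { state: JobState::Failed, failed_at: Some(now - Duration::from_secs(600)) },
        RepJob { state: JobState::Failed, failed_at: Some(now - Duration::from_secs(10)) },
        RepJob { state: JobState::Ok, failed_at: None },
    ];
    // Only the first job is past the cooldown, so only it flips to Pending.
    println!("requeued={}", requeue_stale_failures(&mut jobs, cooldown, now));
}
```

Running this pass on each worker tick (or on peer heartbeat) keeps the policy in one place and makes the Rust test a matter of calling it with controlled timestamps.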

  6. Security / ops (minimal for M1 “technical family”)
     • Full production hardening is not needed in one go, but address any change that makes localhost-only demos accidentally unsafe (e.g. CORS, session binding) if you touch those layers; see the architecture's Security audit section.
     • Document required secrets (P2POS_REPLICATE_PSK, TURN creds) for both operator and family deploys.
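Beyond documenting the secrets, a startup check that fails fast when one is missing keeps misconfigured deploys from silently running open. A minimal sketch, assuming a lookup closure: P2POS_REPLICATE_PSK is the real variable named above, while the TURN credential name used in the example is a hypothetical placeholder.

```rust
/// Fetch a required shared secret, rejecting missing or blank values
/// so a misconfigured deploy fails at startup rather than at first use.
fn require_secret(get: impl Fn(&str) -> Option<String>, name: &str) -> Result<String, String> {
    match get(name) {
        Some(v) if !v.trim().is_empty() => Ok(v),
        _ => Err(format!("missing required secret: {name} (set it in the deploy env)")),
    }
}

fn main() {
    // Simulated deploy env: only the replication PSK is set.
    let env = |k: &str| match k {
        "P2POS_REPLICATE_PSK" => Some("dev-only-psk".to_string()),
        _ => None,
    };
    println!("{:?}", require_secret(env, "P2POS_REPLICATE_PSK"));
    println!("{:?}", require_secret(env, "TURN_PASSWORD")); // placeholder TURN credential name
}
```

The same check serves both stacks: the operator stack needs the TURN credentials, the family stack needs the PSK.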

Automated testing (required deliverables)

1. Keep existing CI green

  • cargo test --workspace
  • apps/vault-web: npm test + npm run build
  • docker/e2e: current Playwright job
    docker compose -f docker-compose.yml -f docker-compose.ci.yml --profile e2e up --build --abort-on-container-exit playwright
    

2. Extend or add tests (minimum bar)

  • Replication catch-up: a Rust or dockerized test that simulates a failed/pending queue, brings the peer back, and asserts blob_peer_status / job state reaches ok.
  • E2E optional second profile: if CI stability allows, add a non-default workflow job or compose profile that runs Playwright (or a shorter smoke test) with the WebRTC vault enabled (VITE_VAULT_WEBRTC=1); otherwise document why it stays manual and add a script scripts/verify-webrtc-e2e.sh that fails on regression.
  • Operator/family split: a lightweight smoke script that starts only coturn+signal, then nodes with env pointing at it, and curls health plus one authenticated flow; or documentation-only if fully covered by existing compose + new doc steps (explicitly state which).
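The catch-up test should assert a specific state sequence regardless of how it is wired up. The sketch below shows only that shape with in-memory stand-ins; `Peer`, `run_pass`, and the `Status` enum are hypothetical names for the real peer process and the blob_peer_status / job state in vault_db.

```rust
/// Job/blob replication status, as the test should observe it.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Failed,
    Ok,
}

/// Stand-in for the second p2pos-node: either reachable or down.
struct Peer {
    up: bool,
}

/// One replication worker pass: succeed if the peer is reachable,
/// otherwise mark the job Failed (attempt cap already exhausted).
fn run_pass(status: &mut Status, peer: &Peer) {
    *status = match (*status, peer.up) {
        (Status::Ok, _) => Status::Ok,
        (_, true) => Status::Ok,
        (_, false) => Status::Failed,
    };
}

fn main() {
    let mut peer = Peer { up: false };
    let mut status = Status::Pending;

    run_pass(&mut status, &peer);
    assert_eq!(status, Status::Failed); // peer down: job fails per policy

    // Peer returns; the requeue policy flips Failed back to Pending...
    peer.up = true;
    status = Status::Pending;

    run_pass(&mut status, &peer);
    assert_eq!(status, Status::Ok); // ...and replication completes, no manual DB edits
    println!("catch-up ok");
}
```

The real test replaces the stand-ins with a stopped/restarted container or process and polls the actual status endpoint, but the pass/fail criteria are exactly these three assertions.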

3. Playwright / UI

  • Reuse docker/e2e/playwright/tests/ui-behind-nat.spec.ts patterns (data-testid, replication status polling).
  • New flows: add data-testid hooks in vault-web instead of brittle CSS selectors.

Suggested implementation order

  1. Replication retry / failed-job policy + Rust tests (unblocks honest “catch-up” story).
  2. Docs: operator vs family runbook + link from README.md / docker/e2e/README.md.
  3. Bootstrap UX: same-origin API base where possible; env template for split deploy.
  4. CI: replication integration test; optional WebRTC E2E job or script.
  5. Polish: metrics/logging for ICE selected pair (optional, behind feature flag)—nice for demos, not blocking.

Acceptance criteria (Definition of Done)

  • §2.5 Milestone 1 behaviors are true for a technical user following your updated docs (two nodes, UI, signal, STUN/TURN, replication, WebRTC path optional but documented).
  • ICE path selection remains automatic; docs do not tell users to pick LAN vs TURN IPs for WebRTC.
  • Catch-up: documented + tested behavior when a peer was unavailable and later returns.
  • CI passes (cargo test, npm test, Playwright docker-e2e); new tests are deterministic.
  • Milestone 2 items (Android, QR product polish) are out of scope unless filed separately.

References (read before coding)

  • docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md — §2.5, §12, Phase G, security audit
  • docker/e2e/docker-compose.yml, docker-compose.ci.yml, docs/E2E_DOCKER_INFRA.md
  • crates/p2pos-node/src/rep_worker.rs, vault_db.rs — replication queue
  • apps/vault-web/src/api/vaultRtc.ts, client.ts — WebRTC vault transport
  • .github/workflows/ci.yml

One-line prompt (for chat agents)

Implement Milestone 1 per docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5: split operator vs family deploy docs, fix replication catch-up when a peer restarts (no stuck permanent failures without policy), keep ICE automatic, improve single-entry bootstrap where trivial, and add automated tests (Rust and/or Playwright) so CI proves catch-up and existing E2E stays green.