
Agent / team prompt: complete Milestone 1 (Sover / P2POS family vault)

Status (in-repo), already shipped:

  • Replication: failed → pending requeue after a cooldown (P2POS_REP_FAILED_RETRY_AFTER_SECS), covered by the Rust unit test requeue_stale_failed_replication_jobs_after_peer_outage.
  • docker-compose.operator-infra.yml + docker-compose.family-nodes.yml.
  • docs/MILESTONE_1_RUNBOOK.md and scripts/verify-webrtc-e2e.sh.
  • Same-origin API default documented in vault-web.

Further polish (strict CORS, HA operator, optional WebRTC Playwright job) can still be filed.

Use this document as the single briefing for an implementer or coding agent. Canonical product definition: docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5 (Milestone 1) and §12 (reachability). Do not expand scope into Milestone 2 (Android, QR enrollment polish) unless explicitly asked.


Role

You are implementing and hardening Milestone 1: operator-hosted signaling + STUN/TURN; user runs two nodes (e.g. Docker) and the web app; ICE selects paths automatically; redundancy and catch-up across two nodes; bootstrap trends toward a single user entry.

Match existing repo style (Rust workspace, p2pos-node, p2pos-net, family-vault, apps/vault-web, docker/e2e). Prefer small, reviewable PRs. Every behavior change should have automated tests where practical.


Personas (must remain true after your work)

  • Operator: hosts signaling / seeder + STUN/TURN (e.g. on a VPS). Has no access to family keys or plaintext photos.
  • User: runs nodes (PC/Docker today); uses vault-web (and future apps) on top of Sover. Must not manually pick ICE paths or maintain IP lists for WebRTC; ICE nominates candidate pairs after STUN/TURN supply candidates.

Functional goals (Milestone 1)

  1. Two nodes + web app
     • The user can run two p2pos-node instances (Docker or bare metal), register each other as peers, and use vault-web to create albums, upload encrypted photos, and see replication status.
     • Node ↔ node replication uses WebRTC data channels when ICE succeeds, with HTTP fallback for large blobs or failures (already Phase G; preserve and fix bugs only).

  2. Browser ↔ node
     • Support the vault API over WebRTC (data channel tunnel) for the paths that can already use VITE_VAULT_WEBRTC=1, with HTTP still available for bootstrap (auth) and selected metadata routes, as today.
     • GET /v1/nodes must expose browser-reachable signaling and ICE hints (the P2POS_BROWSER_* pattern already in p2pos-node; extend it if a split deploy needs more fields).
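As a rough sketch of the hint surface, the response could be assembled from operator-supplied environment variables. This is a minimal illustration only: the struct fields and the specific P2POS_BROWSER_* variable names below are assumptions following the pattern named above; the real field set lives in p2pos-node.

```rust
use std::collections::HashMap;

/// Browser-facing reachability hints to serve from GET /v1/nodes.
/// Field names and env-var names are illustrative stand-ins following
/// the P2POS_BROWSER_* pattern; the real names live in p2pos-node.
#[derive(Debug, PartialEq)]
struct BrowserHints {
    signal_url: Option<String>, // e.g. wss://operator.example/signal
    stun_url: Option<String>,   // e.g. stun:operator.example:3478
    turn_url: Option<String>,   // e.g. turn:operator.example:3478
}

/// Build the hints from an environment lookup. Taking a closure (rather
/// than reading std::env directly) keeps the function testable without
/// mutating real process environment variables.
fn browser_hints(get: impl Fn(&str) -> Option<String>) -> BrowserHints {
    BrowserHints {
        signal_url: get("P2POS_BROWSER_SIGNAL_URL"),
        stun_url: get("P2POS_BROWSER_STUN_URL"),
        turn_url: get("P2POS_BROWSER_TURN_URL"),
    }
}

fn main() {
    // Simulated operator env: only the signaling URL is configured.
    let env = HashMap::from([("P2POS_BROWSER_SIGNAL_URL", "wss://operator.example/signal")]);
    let hints = browser_hints(|k| env.get(k).map(|v| v.to_string()));
    println!("signal: {:?}, turn set: {}", hints.signal_url, hints.turn_url.is_some());
}
```

A split deploy would extend the same struct with whatever extra fields the browser needs, rather than inventing a second endpoint.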

  3. Operator vs user split (documentation + optional compose)
     • Ship a clear runbook: the minimal operator stack (signal + coturn, env vars, ports, TLS notes) vs the family stack (nodes + UI only, pointing at operator URLs).
     • Prefer one compose file pair or profiles so docker compose can start only the infra or only nodes+UI without copy-paste errors.

  4. Bootstrap: single entry (product direction)
     • Reduce reliance on “type the right IP”: document one URL pattern (e.g. the UI behind nginx on a known host, or a future DNS name).
     • If you add code: e.g. a relative API base ("") so the SPA talks to the same origin that served the static files; optional VITE_* only for dev.
     • Full QR enrollment is Milestone 2; do not block M1 on it, but do document the gap.

  5. Redundancy and catch-up
     • One node stopped: the other node still serves data already replicated; document that the user keeps one bookmark/origin when possible.
     • Replication jobs: if a peer is down, jobs should remain retryable and eventually succeed when the peer returns. Close the gap where jobs flip to failed after a fixed attempt cap (rep_worker.rs / vault_db.rs) with no automatic re-queue; e.g. reset failed → pending on a timer, on peer heartbeat, or via an admin API. Pick the smallest design that is testable.
     • Add an automated test (Rust integration or E2E) proving: peer down → jobs pending/failed per policy → peer up → replication completes without manual DB edits.
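The timer-based option above can be sketched as a single requeue pass. This is an in-memory illustration only: the real policy operates on the vault_db job table from rep_worker.rs, and the types and cooldown constant here are assumptions (the cooldown stands in for P2POS_REP_FAILED_RETRY_AFTER_SECS).

```rust
use std::time::{Duration, SystemTime};

/// Replication job states, mirroring the pending/failed split described above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Pending,
    Failed,
    Ok,
}

struct RepJob {
    state: JobState,
    failed_at: Option<SystemTime>,
}

/// Reset Failed jobs back to Pending once the cooldown has elapsed, so a
/// returning peer can pick them up without manual DB edits.
/// Returns the number of jobs requeued.
fn requeue_stale_failures(jobs: &mut [RepJob], cooldown: Duration, now: SystemTime) -> usize {
    let mut requeued = 0;
    for job in jobs.iter_mut() {
        if job.state == JobState::Failed {
            let stale = job
                .failed_at
                .map(|t| now.duration_since(t).map(|d| d >= cooldown).unwrap_or(false))
                .unwrap_or(true); // no failure timestamp recorded: treat as stale
            if stale {
                job.state = JobState::Pending;
                job.failed_at = None;
                requeued += 1;
            }
        }
    }
    requeued
}

fn main() {
    let now = SystemTime::now();
    let cooldown = Duration::from_secs(300); // stand-in for P2POS_REP_FAILED_RETRY_AFTER_SECS
    let mut jobs = vec![
        RepJob { state: JobState::Failed, failed_at: Some(now - Duration::from_secs(600)) },
        RepJob { state: JobState::Failed, failed_at: Some(now - Duration::from_secs(10)) },
        RepJob { state: JobState::Ok, failed_at: None },
    ];
    // Only the first job is past the cooldown, so only it flips to Pending.
    println!("requeued={}", requeue_stale_failures(&mut jobs, cooldown, now));
}
```

Running this pass on each worker tick (or on peer heartbeat) keeps the policy in one place and makes the Rust test a matter of calling it with controlled timestamps.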

  6. Security / ops (minimal for M1 “technical family”)
     • Full production hardening is not needed in one go, but address any change that makes localhost-only demos accidentally unsafe (e.g. CORS, session binding) if you touch those layers; see the architecture's Security audit section.
     • Document required secrets (P2POS_REPLICATE_PSK, TURN creds) for both operator and family deploys.
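Beyond documenting the secrets, a startup check that fails fast when one is missing keeps misconfigured deploys from silently running open. A minimal sketch, assuming a lookup closure: P2POS_REPLICATE_PSK is the real variable named above, while the TURN credential name used in the example is a hypothetical placeholder.

```rust
/// Fetch a required shared secret, rejecting missing or blank values
/// so a misconfigured deploy fails at startup rather than at first use.
fn require_secret(get: impl Fn(&str) -> Option<String>, name: &str) -> Result<String, String> {
    match get(name) {
        Some(v) if !v.trim().is_empty() => Ok(v),
        _ => Err(format!("missing required secret: {name} (set it in the deploy env)")),
    }
}

fn main() {
    // Simulated deploy env: only the replication PSK is set.
    let env = |k: &str| match k {
        "P2POS_REPLICATE_PSK" => Some("dev-only-psk".to_string()),
        _ => None,
    };
    println!("{:?}", require_secret(env, "P2POS_REPLICATE_PSK"));
    println!("{:?}", require_secret(env, "TURN_PASSWORD")); // placeholder TURN credential name
}
```

The same check serves both stacks: the operator stack needs the TURN credentials, the family stack needs the PSK.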

Automated testing (required deliverables)

1. Keep existing CI green

  • cargo test --workspace
  • apps/vault-web: npm test + npm run build
  • docker/e2e: current Playwright job
    docker compose -f docker-compose.yml -f docker-compose.ci.yml --profile e2e up --build --abort-on-container-exit playwright
    

2. Extend or add tests (minimum bar)

  • Replication catch-up: a Rust or dockerized test that simulates a failed/pending queue, brings the peer back, and asserts blob_peer_status / job state reaches ok.
  • E2E optional second profile: if CI stability allows, add a non-default workflow job or compose profile that runs Playwright (or a shorter smoke test) with the WebRTC vault enabled (VITE_VAULT_WEBRTC=1); otherwise document why it stays manual and add a script scripts/verify-webrtc-e2e.sh that fails on regression.
  • Operator/family split: a lightweight smoke script that starts only coturn+signal, then nodes with env pointing at it, and curls health plus one authenticated flow; or documentation-only if fully covered by existing compose + new doc steps (explicitly state which).
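The catch-up test should assert a specific state sequence regardless of how it is wired up. The sketch below shows only that shape with in-memory stand-ins; `Peer`, `run_pass`, and the `Status` enum are hypothetical names for the real peer process and the blob_peer_status / job state in vault_db.

```rust
/// Job/blob replication status, as the test should observe it.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Status {
    Pending,
    Failed,
    Ok,
}

/// Stand-in for the second p2pos-node: either reachable or down.
struct Peer {
    up: bool,
}

/// One replication worker pass: succeed if the peer is reachable,
/// otherwise mark the job Failed (attempt cap already exhausted).
fn run_pass(status: &mut Status, peer: &Peer) {
    *status = match (*status, peer.up) {
        (Status::Ok, _) => Status::Ok,
        (_, true) => Status::Ok,
        (_, false) => Status::Failed,
    };
}

fn main() {
    let mut peer = Peer { up: false };
    let mut status = Status::Pending;

    run_pass(&mut status, &peer);
    assert_eq!(status, Status::Failed); // peer down: job fails per policy

    // Peer returns; the requeue policy flips Failed back to Pending...
    peer.up = true;
    status = Status::Pending;

    run_pass(&mut status, &peer);
    assert_eq!(status, Status::Ok); // ...and replication completes, no manual DB edits
    println!("catch-up ok");
}
```

The real test replaces the stand-ins with a stopped/restarted container or process and polls the actual status endpoint, but the pass/fail criteria are exactly these three assertions.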

3. Playwright / UI

  • Reuse docker/e2e/playwright/tests/ui-behind-nat.spec.ts patterns (data-testid, replication status polling).
  • New flows: add data-testid hooks in vault-web instead of brittle CSS selectors.

Suggested implementation order

  1. Replication retry / failed-job policy + Rust tests (unblocks honest “catch-up” story).
  2. Docs: operator vs family runbook + link from README.md / docker/e2e/README.md.
  3. Bootstrap UX: same-origin API base where possible; env template for split deploy.
  4. CI: replication integration test; optional WebRTC E2E job or script.
  5. Polish: metrics/logging for ICE selected pair (optional, behind feature flag)—nice for demos, not blocking.

Acceptance criteria (Definition of Done)

  • §2.5 Milestone 1 behaviors are true for a technical user following your updated docs (two nodes, UI, signal, STUN/TURN, replication, WebRTC path optional but documented).
  • ICE path selection remains automatic; docs do not tell users to pick LAN vs TURN IPs for WebRTC.
  • Catch-up: documented + tested behavior when a peer was unavailable and later returns.
  • CI passes (cargo test, npm test, Playwright docker-e2e); new tests are deterministic.
  • Milestone 2 items (Android, QR product polish) are out of scope unless filed separately.

References (read before coding)

  • docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md — §2.5, §12, Phase G, security audit
  • docker/e2e/docker-compose.yml, docker-compose.ci.yml, docs/E2E_DOCKER_INFRA.md
  • crates/p2pos-node/src/rep_worker.rs, vault_db.rs — replication queue
  • apps/vault-web/src/api/vaultRtc.ts, client.ts — WebRTC vault transport
  • .github/workflows/ci.yml

One-line prompt (for chat agents)

Implement Milestone 1 per docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5: split operator vs family deploy docs, fix replication catch-up when a peer restarts (no stuck permanent failures without policy), keep ICE automatic, improve single-entry bootstrap where trivial, and add automated tests (Rust and/or Playwright) so CI proves catch-up and existing E2E stays green.