Agent / team prompt: complete Milestone 1 (Sover / P2POS family vault)¶
Status (in-repo), already shipped: replication `failed` → `pending` requeue after a cooldown (`P2POS_REP_FAILED_RETRY_AFTER_SECS`), Rust unit test `requeue_stale_failed_replication_jobs_after_peer_outage`, `docker-compose.operator-infra.yml` + `docker-compose.family-nodes.yml`, `docs/MILESTONE_1_RUNBOOK.md`, and `scripts/verify-webrtc-e2e.sh`. The same-origin API default is documented in vault-web. Further polish (strict CORS, HA operator, optional WebRTC Playwright job) can still be filed.
Use this document as the single briefing for an implementer or coding agent. Canonical product definition: docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5 (Milestone 1) and §12 (reachability). Do not expand scope into Milestone 2 (Android, QR enrollment polish) unless explicitly asked.
Role¶
You are implementing and hardening Milestone 1: operator-hosted signaling + STUN/TURN; user runs two nodes (e.g. Docker) and the web app; ICE selects paths automatically; redundancy and catch-up across two nodes; bootstrap trends toward a single user entry.
Match existing repo style (Rust workspace, p2pos-node, p2pos-net, family-vault, apps/vault-web, docker/e2e). Prefer small, reviewable PRs. Every behavior change should have automated tests where practical.
Personas (must remain true after your work)¶
| Persona | Responsibility |
|---|---|
| Operator | Hosts signaling / seeder + STUN/TURN (e.g. VPS). No access to family keys or plaintext photos. |
| User | Runs nodes (PC/Docker today); uses vault-web (and future apps) on top of Sover. Must not manually pick ICE paths or maintain IP lists for WebRTC—ICE nominates pairs after STUN/TURN supply candidates. |
Functional goals (Milestone 1)¶
- Two nodes + web app
  - User can run two `p2pos-node` instances (Docker or bare metal), register each other as peers, and use vault-web to create albums, upload encrypted photos, and see replication status.
  - Node ↔ node replication uses WebRTC data channels when ICE succeeds, with HTTP fallback for large blobs or failures (already Phase G; preserve and fix bugs only).
- Browser ↔ node
  - Support the vault API over WebRTC (data channel tunnel) for the paths that today can use `VITE_VAULT_WEBRTC=1`, with HTTP still available for bootstrap (auth) and selected metadata routes, as today.
  - `GET /v1/nodes` must expose browser-reachable signaling and ICE hints (the `P2POS_BROWSER_*` pattern already in `p2pos-node`; extend it if the split deploy needs more fields).
- Operator vs user split (documentation + optional compose)
  - Ship a clear runbook: minimal operator stack (signal + coturn, env vars, ports, TLS notes) vs family stack (nodes + UI only, pointing at operator URLs).
  - Prefer one compose file pair or profiles so `docker compose` can start only infra, or only nodes+UI, without copy-paste errors.
- Bootstrap: single entry (product direction)
  - Reduce reliance on “type the right IP”: document one URL pattern (e.g. UI behind nginx on a known host, or a future DNS name).
  - If you add code: e.g. a relative API base (`""`) so the SPA talks to the same origin that served the static files; optional `VITE_*` overrides only for dev.
  - Full QR enrollment is Milestone 2; do not block M1 on it, but do document the gap.
- Redundancy and catch-up
  - One node stopped: the other node still serves data already replicated; document that the user keeps one bookmark/origin when possible.
  - Replication jobs: if a peer is down, jobs should remain retryable and eventually succeed when the peer returns. Close the gap where jobs flip to `failed` after a fixed attempt cap (`rep_worker.rs` / `vault_db.rs`) with no automatic re-queue; e.g. reset `failed` → `pending` on a timer, on peer heartbeat, or via an admin API, and pick the smallest design that is testable.
  - Add an automated test (Rust integration or E2E) proving: peer down → jobs pending/failed per policy → peer up → replication completes without manual DB edits.
- Security / ops (minimal for M1 “technical family”)
  - Full production hardening is not required in one go, but if you touch those layers, address any change that makes localhost-only demos accidentally unsafe (e.g. CORS, session binding); see the architecture Security audit section.
  - Document the required secrets (`P2POS_REPLICATE_PSK`, TURN creds) for operator + family deploys.
Automated testing (required deliverables)¶
1. Keep existing CI green¶
- `cargo test --workspace`
- `apps/vault-web`: `npm test` + `npm run build`
- `docker/e2e`: current Playwright job

```shell
docker compose -f docker-compose.yml -f docker-compose.ci.yml --profile e2e up --build --abort-on-container-exit playwright
```
2. Extend or add tests (minimum bar)¶
| Area | What to add |
|---|---|
| Replication catch-up | Rust test or dockerized test: simulate failed/pending queue, bring peer back, assert blob_peer_status / job state reaches ok. |
| E2E optional second profile | If CI stability allows: add a non-default workflow job or compose profile that runs Playwright (or a shorter smoke test) with WebRTC vault enabled (VITE_VAULT_WEBRTC=1), or document why it stays manual and add a script scripts/verify-webrtc-e2e.sh that fails on regression. |
| Operator/family split | Lightweight smoke: script that starts only coturn+signal, then nodes with env pointing at it, curl health + one authenticated flow—or documentation-only if fully covered by existing compose + new doc steps (explicitly state which). |
3. Playwright / UI¶
- Reuse `docker/e2e/playwright/tests/ui-behind-nat.spec.ts` patterns (`data-testid` selectors, replication status polling).
- For new flows, add `data-testid` hooks in vault-web instead of brittle CSS selectors.
Suggested implementation order¶
1. Replication retry / failed-job policy + Rust tests (unblocks an honest “catch-up” story).
2. Docs: operator vs family runbook, linked from `README.md` / `docker/e2e/README.md`.
3. Bootstrap UX: same-origin API base where possible; env template for the split deploy.
4. CI: replication integration test; optional WebRTC E2E job or script.
5. Polish: metrics/logging for the ICE-selected pair (optional, behind a feature flag); nice for demos, not blocking.
Acceptance criteria (Definition of Done)¶
- §2.5 Milestone 1 behaviors are true for a technical user following your updated docs (two nodes, UI, signal, STUN/TURN, replication, WebRTC path optional but documented).
- ICE path selection remains automatic; docs do not tell users to pick LAN vs TURN IPs for WebRTC.
- Catch-up: documented + tested behavior when a peer was unavailable and later returns.
- CI passes (`cargo test`, `npm test`, Playwright docker-e2e); new tests are deterministic.
- Milestone 2 items (Android, QR product polish) remain out of scope unless filed separately.
References (read before coding)¶
- `docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md`: §2.5, §12, Phase G, security audit
- `docker/e2e/docker-compose.yml`, `docker-compose.ci.yml`, `docs/E2E_DOCKER_INFRA.md`
- `crates/p2pos-node/src/rep_worker.rs`, `vault_db.rs`: replication queue
- `apps/vault-web/src/api/vaultRtc.ts`, `client.ts`: WebRTC vault transport
- `.github/workflows/ci.yml`
One-line prompt (for chat agents)¶
Implement Milestone 1 per docs/P2POS_SOVEREIGN_FAMILY_VAULT_ARCHITECTURE.md §2.5: split operator vs family deploy docs, fix replication catch-up when a peer restarts (no stuck permanent failures without policy), keep ICE automatic, improve single-entry bootstrap where trivial, and add automated tests (Rust and/or Playwright) so CI proves catch-up and existing E2E stays green.