sq/todos/SQ-023-node-recovery.md
2026-02-26 21:52:50 +01:00

SQ-023: Node Recovery / Catch-Up

Status: [ ] TODO
Blocked by: SQ-021, SQ-018
Priority: Medium

Description

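When a node that was offline rejoins the cluster, it catches up on the data it missed: it fetches missing segments from peers, and falls back to S3 for segments the peers have already trimmed.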

Files to Create/Modify

  • crates/sq-cluster/src/recovery.rs - on-join catch-up logic
  • crates/sq-server/src/grpc/cluster.rs - FetchSegment RPC impl
  • crates/sq-cluster/src/replication.rs - integrate recovery into join flow

Recovery Flow

  1. Rejoining node contacts peers via seed list
  2. For each topic-partition, compare local latest offset with peers
  3. If peer has newer data: fetch missing segments via FetchSegment RPC
  4. If the peer has already trimmed the needed segments: fetch them from S3
  5. Replay fetched segments into local WAL
  6. Mark node as "caught up" and start accepting writes
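
The peer-versus-S3 decision in steps 2 to 4 can be sketched as a pure function over offsets. This is a minimal illustration, not the planned implementation: `CatchUpPlan`, `plan_catch_up`, and the half-open offset ranges are hypothetical names and conventions chosen for the example, and it assumes every trimmed offset is available in S3.

```rust
use std::ops::Range;

/// Hypothetical result of comparing local state with one peer
/// for a single topic-partition.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum CatchUpPlan {
    /// Local log already covers everything the peer has.
    UpToDate,
    /// Peer still retains every missing offset: use the FetchSegment RPC.
    FromPeer(Range<u64>),
    /// Peer has trimmed the oldest missing offsets: take those from S3,
    /// then fetch the remainder from the peer.
    FromS3ThenPeer { s3: Range<u64>, peer: Range<u64> },
}

/// `local_next` is the next offset the rejoining node would write
/// (0 for an empty log). `peer_retained` is the half-open range of
/// offsets the peer still holds locally (assumption: offsets below
/// `peer_retained.start` were trimmed and live in S3).
pub fn plan_catch_up(local_next: u64, peer_retained: Range<u64>) -> CatchUpPlan {
    if local_next >= peer_retained.end {
        return CatchUpPlan::UpToDate;
    }
    if local_next >= peer_retained.start {
        return CatchUpPlan::FromPeer(local_next..peer_retained.end);
    }
    CatchUpPlan::FromS3ThenPeer {
        s3: local_next..peer_retained.start,
        peer: peer_retained.start..peer_retained.end,
    }
}
```

Keeping the decision free of I/O makes the "joins late" and "peer has trimmed" acceptance cases testable without a running cluster; the actual fetching and WAL replay would sit on top of it.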

Acceptance Criteria

  • Node joins late: fetches missing data from peer, all messages readable
  • Node catches up from S3 when peer has trimmed that segment
  • Recovery doesn't block existing cluster operations
  • Recovery progress is logged
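
For the progress-logging criterion, one possible shape is a small per-partition progress record that renders a log line. The struct, its fields, and the message format below are assumptions made for illustration; the ticket does not specify them.

```rust
use std::fmt;

/// Hypothetical per-partition recovery progress record.
pub struct RecoveryProgress {
    pub topic: String,
    pub partition: u32,
    pub segments_done: u64,
    pub segments_total: u64,
}

impl RecoveryProgress {
    /// Integer percentage of segments replayed; an empty plan counts as done.
    pub fn percent(&self) -> u64 {
        if self.segments_total == 0 {
            100
        } else {
            self.segments_done * 100 / self.segments_total
        }
    }

    /// True once every planned segment has been replayed.
    pub fn is_caught_up(&self) -> bool {
        self.segments_done >= self.segments_total
    }
}

impl fmt::Display for RecoveryProgress {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "recovery {}-{}: {}/{} segments ({}%)",
            self.topic,
            self.partition,
            self.segments_done,
            self.segments_total,
            self.percent()
        )
    }
}
```

The `Display` output can be passed to whatever logging facility the crates already use, and `is_caught_up` gives a single place to decide when the node may be marked "caught up".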