SQ-023: Node Recovery / Catch-Up
Status: [ ] TODO
Blocked by: SQ-021, SQ-018
Priority: Medium
Description
A node that was offline catches up from peers or S3 when it rejoins.
Files to Create/Modify
- crates/sq-cluster/src/recovery.rs - on-join catch-up logic
- crates/sq-server/src/grpc/cluster.rs - FetchSegment RPC impl
- crates/sq-cluster/src/replication.rs - integrate recovery into join flow
Recovery Flow
- Rejoining node contacts peers via seed list
- For each topic-partition, compare local latest offset with peers
- If peer has newer data: fetch missing segments via FetchSegment RPC
- If the peer has already trimmed the missing segments: fetch them from S3
- Replay fetched segments into local WAL
- Mark node as "caught up" and start accepting writes
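The source-selection part of this flow (peer vs. S3) could be sketched as a pure planning function. This is only an illustration under stated assumptions: `PeerState`, `FetchStep`, `FetchSource`, and `plan_catch_up` are hypothetical names, not existing items in sq-cluster, and the sketch works on offset ranges rather than on real segment boundaries.

```rust
/// Where a missing range should be fetched from.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FetchSource {
    Peer,
    S3,
}

/// Hypothetical summary of what a peer holds for one topic-partition.
#[derive(Debug, Clone, Copy)]
pub struct PeerState {
    /// Oldest offset the peer still holds locally; anything below was trimmed.
    pub earliest_offset: u64,
    /// One past the peer's latest offset.
    pub next_offset: u64,
}

/// A contiguous offset range [start, end) and the source to fetch it from.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FetchStep {
    pub source: FetchSource,
    pub start: u64,
    pub end: u64,
}

/// Decide what a rejoining node has to fetch for one topic-partition.
/// `local_next` is one past the node's latest local offset.
pub fn plan_catch_up(local_next: u64, peer: PeerState) -> Vec<FetchStep> {
    let mut steps = Vec::new();

    // Peer has nothing newer: already caught up for this partition.
    if peer.next_offset <= local_next {
        return steps;
    }

    // First offset the peer can still serve, clamped to the peer's range.
    let peer_start = local_next
        .max(peer.earliest_offset)
        .min(peer.next_offset);

    // The gap below the peer's retained data was trimmed: use S3.
    if local_next < peer_start {
        steps.push(FetchStep {
            source: FetchSource::S3,
            start: local_next,
            end: peer_start,
        });
    }

    // Whatever the peer still holds comes via the FetchSegment RPC.
    if peer_start < peer.next_offset {
        steps.push(FetchStep {
            source: FetchSource::Peer,
            start: peer_start,
            end: peer.next_offset,
        });
    }

    steps
}
```

Keeping the decision separate from the network and S3 calls would let the peer-vs-S3 choice be unit tested without a running cluster; the actual fetch and WAL replay would then consume the returned steps in order.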
Acceptance Criteria
- Node joins late: fetches missing data from peer, all messages readable
- Node catches up from S3 when peer has trimmed that segment
- Recovery doesn't block existing cluster operations
- Recovery progress is logged
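The progress-logging criterion could be backed by a small shared counter that the recovery task updates and other tasks read without locking. A minimal sketch, assuming a hypothetical `RecoveryProgress` type and a made-up log line format; the real implementation would presumably emit the line through whatever logging facility the project uses.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical lock-free progress tracker for one recovery run.
pub struct RecoveryProgress {
    total_segments: u64,
    replayed: AtomicU64,
}

impl RecoveryProgress {
    pub fn new(total_segments: u64) -> Self {
        Self {
            total_segments,
            replayed: AtomicU64::new(0),
        }
    }

    /// Record one replayed segment and return the line to log.
    pub fn record_segment(&self) -> String {
        // fetch_add returns the previous value, so add 1 for the new count.
        let done = self.replayed.fetch_add(1, Ordering::Relaxed) + 1;
        format!("recovery: {}/{} segments replayed", done, self.total_segments)
    }

    /// True once every planned segment has been replayed.
    pub fn is_caught_up(&self) -> bool {
        self.replayed.load(Ordering::Relaxed) >= self.total_segments
    }
}
```

Because the counter is atomic, the tracker can be shared (for example behind an `Arc`) between the background recovery task and a status endpoint without holding a lock.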