31
todos/SQ-023-node-recovery.md
Normal file
@@ -0,0 +1,31 @@
# SQ-023: Node Recovery / Catch-Up
**Status:** `[ ] TODO`
**Blocked by:** SQ-021, SQ-018
**Priority:** Medium
## Description
When a node rejoins the cluster after being offline, it catches up on the data it missed by fetching it from peers, or from S3 if the peers have already trimmed it.
## Files to Create/Modify
- `crates/sq-cluster/src/recovery.rs` - on-join catch-up logic
- `crates/sq-server/src/grpc/cluster.rs` - FetchSegment RPC impl
- `crates/sq-cluster/src/replication.rs` - integrate recovery into join flow
## Recovery Flow
1. Rejoining node contacts peers via seed list
2. For each topic-partition, compare local latest offset with peers
3. If peer has newer data: fetch missing segments via FetchSegment RPC
4. If the peer has already trimmed the needed segments: fetch them from S3
5. Replay fetched segments into local WAL
6. Mark node as "caught up" and start accepting writes
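
Steps 2 to 4 above could be sketched roughly as follows. This is only an illustration, not the planned implementation: `FetchSource`, `FetchStep`, `PeerOffsets`, and `plan_catch_up` are hypothetical names, and the sketch assumes offsets are dense `u64` values and that a peer can report both its earliest retained and its latest offset for a topic-partition. Mapping offset ranges onto actual segments is left out.

```rust
/// Where a missing offset range would be fetched from (hypothetical type).
#[derive(Debug, Clone, PartialEq, Eq)]
enum FetchSource {
    /// Fetch from a peer, e.g. via the FetchSegment RPC.
    Peer,
    /// Fetch from S3 because the peer has already trimmed this range.
    S3,
}

/// One unit of catch-up work: an inclusive offset range and its source.
#[derive(Debug, Clone, PartialEq, Eq)]
struct FetchStep {
    source: FetchSource,
    start: u64,
    end: u64,
}

/// Offsets a peer reports for one topic-partition (assumed shape).
/// Assumes `earliest <= latest`, i.e. the peer holds at least one message.
#[derive(Debug, Clone, Copy)]
struct PeerOffsets {
    earliest: u64,
    latest: u64,
}

/// Compare the local latest offset with the peer's offsets and decide what
/// to fetch. `local_latest` is `None` when the node has no local data.
fn plan_catch_up(local_latest: Option<u64>, peer: PeerOffsets) -> Vec<FetchStep> {
    // First offset the local node is missing.
    let next = local_latest.map_or(0, |o| o + 1);

    // Peer has nothing newer: no catch-up needed.
    if next > peer.latest {
        return Vec::new();
    }

    let mut steps = Vec::new();

    // The part the peer has trimmed must come from S3.
    if next < peer.earliest {
        steps.push(FetchStep {
            source: FetchSource::S3,
            start: next,
            end: peer.earliest - 1,
        });
    }

    // The remainder is still held by the peer.
    steps.push(FetchStep {
        source: FetchSource::Peer,
        start: next.max(peer.earliest),
        end: peer.latest,
    });

    steps
}
```

With a peer retaining offsets 100 to 250, a node whose latest local offset is 40 would get two steps under this sketch: offsets 41 to 99 from S3, then 100 to 250 from the peer. Fetching the older range first keeps the replay into the local WAL in offset order.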
## Acceptance Criteria
- [ ] Node joins late: fetches missing data from peer, all messages readable
- [ ] Node catches up from S3 when peer has trimmed that segment
- [ ] Recovery doesn't block existing cluster operations
- [ ] Recovery progress is logged
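
For the progress-logging criterion, one possible shape is a small per-partition tracker whose `Display` output can be handed to whatever logger the project uses. The `RecoveryProgress` type, its fields, and the message format below are assumptions made for illustration; nothing here is prescribed by this ticket.

```rust
use std::fmt;

/// Tracks how many segments have been fetched for one topic-partition
/// (hypothetical type, for illustration only).
struct RecoveryProgress {
    topic: String,
    partition: u32,
    fetched: u64,
    total: u64,
}

impl RecoveryProgress {
    fn new(topic: &str, partition: u32, total: u64) -> Self {
        Self { topic: topic.to_string(), partition, fetched: 0, total }
    }

    /// Record `n` newly replayed segments, never exceeding `total`.
    fn record(&mut self, n: u64) {
        self.fetched = self.fetched.saturating_add(n).min(self.total);
    }

    /// True once every missing segment has been replayed.
    fn is_caught_up(&self) -> bool {
        self.fetched >= self.total
    }
}

impl fmt::Display for RecoveryProgress {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // Treat "nothing to fetch" as 100% to avoid dividing by zero.
        let percent = if self.total == 0 { 100 } else { self.fetched * 100 / self.total };
        write!(
            f,
            "recovery {}/{}: {}/{} segments ({}%)",
            self.topic, self.partition, self.fetched, self.total, percent
        )
    }
}
```

A recovery loop could call `record` after each replayed segment and log the formatted value, then use `is_caught_up` as the signal for marking the node as "caught up".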