Evidence-Based Failover Targets

When proposing performance/SLA goals (e.g., failover completion time), require an evidence-based rationale and a repeatable way to validate them—especially for worst-case behavior.


Apply this by documenting:

  • Target definition: what exactly “10s failover” includes/excludes (detection, consensus/propagation, promotion, client-observable recovery).
  • Worst-case assumptions: network latency bounds, timeouts, quorum/heartbeat intervals, failure model, and any upper bounds.
  • Measurement basis: links to benchmarks/profiles or at least a structured summary of real-world observations (cloud regions/AZs, instance/network reliability stats).
  • Acceptance tests: automated tests (or staging runs) that exercise the failure mode(s) producing the worst case and assert the target.
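The acceptance-test bullet above can be sketched as a failure-injection harness. The sketch below is hypothetical: the component bounds, target, and run count are illustrative assumptions, and a real harness would measure actual failovers in staging rather than simulate them.

```python
import random

# Illustrative worst-case bounds per failover phase (assumptions, not real data).
HEARTBEAT_TIMEOUT_S = 3.0    # detection bound
PROPAGATION_BOUND_S = 4.0    # consensus + topology-update bound
PROMOTION_BOUND_S = 2.0      # replica-promotion bound
TARGET_WORST_CASE_S = 10.0   # documented SLA target

def simulated_failover_duration(rng: random.Random) -> float:
    """One simulated failover: detection + propagation + promotion,
    each drawn up to its documented worst-case bound."""
    detection = rng.uniform(0.5, HEARTBEAT_TIMEOUT_S)
    propagation = rng.uniform(0.2, PROPAGATION_BOUND_S)
    promotion = rng.uniform(0.1, PROMOTION_BOUND_S)
    return detection + propagation + promotion

def run_acceptance_test(runs: int = 1000, seed: int = 42) -> dict:
    """Run many injected-failure trials and assert the worst case
    stays within the documented target."""
    rng = random.Random(seed)
    samples = sorted(simulated_failover_duration(rng) for _ in range(runs))
    p99 = samples[int(0.99 * len(samples)) - 1]
    worst = samples[-1]
    assert worst <= TARGET_WORST_CASE_S, f"worst case {worst:.2f}s exceeds target"
    return {"p99": p99, "worst": worst}
```

The point of the structure is that the test asserts against the same decomposition (detection, propagation, promotion) that the target definition documents, so a regression in any phase is attributable.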

Example standard text developers can follow:

Failover target: <=10s (worst case); <=5s (certain configurations).
Definition: time from heartbeat-timeout detection until topology and role updates have propagated to all impacted nodes, including client-observable readiness.
Assumptions: cross-AZ RTT p99=..., heartbeat interval=..., coordinator tick=..., quorum=...
Evidence: benchmark/staging results in regions X/Y with injected failures; include traces/profiles.
Regression guard: CI/staging test fails if p99/worst-case exceeds target.
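The regression-guard line could look like the following sketch: a CI step that ingests measured failover durations from a staging run and fails the build on regression. The target numbers and the function name are assumptions for illustration.

```python
# Hypothetical CI regression guard: compare measured failover samples
# (in seconds) from a staging run against documented targets and fail
# the build if either the p99 or the worst case regresses.
TARGET_P99_S = 5.0
TARGET_WORST_CASE_S = 10.0

def check_failover_samples(samples_s: list) -> None:
    """Raise AssertionError if the sample set violates the targets."""
    if not samples_s:
        raise ValueError("no failover samples collected")
    ordered = sorted(samples_s)
    p99 = ordered[max(0, int(0.99 * len(ordered)) - 1)]
    worst = ordered[-1]
    if p99 > TARGET_P99_S:
        raise AssertionError(f"p99 {p99:.2f}s exceeds {TARGET_P99_S}s target")
    if worst > TARGET_WORST_CASE_S:
        raise AssertionError(
            f"worst case {worst:.2f}s exceeds {TARGET_WORST_CASE_S}s target"
        )

# Usage: a passing staging run leaves the build green.
check_failover_samples([3.1, 4.2, 4.8, 3.9, 4.5])
```

Raising (rather than logging) is deliberate: a silent warning is exactly how latency guarantees erode over time.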

This keeps performance claims from resting on anecdote alone, and it prevents future changes from silently breaking latency guarantees.
