For any control-plane action that depends on acknowledgement (epochs/status updates, joins/removes, metadata propagation), the codebase must explicitly define end-to-end error handling: timeout, retry/escalation, cascade effects, and the operator override path.

Apply this standard by ensuring each such operation documents/implements:

Non-ack contract: what the system does when a target node (or FC/TD hop) never acknowledges (stop progress, keep serving in a defined mode, or mark entities unavailable).
Timeout and retry strategy: exact thresholds and backoff; whether retries continue indefinitely or transition to an escalated state.
Cascade/feedback clarity: whether non-ack at the node level affects TD/other components, and what errors are propagated upward.
Operator override: if an escape hatch exists, define a safe default (refuse or limit impact) and a forced path with prominent warnings and justification.
Troubleshooting path: a dedicated “everything is broken”/unwedging procedure, including what signals/metrics/errors to look for.
Recovery-mode correctness: distinguish automatic recovery from cases where a rejoin is required because persistence/durability expectations aren't met; avoid wording that implies guarantees the system doesn't actually provide.
Admin guardrails for unsafe actions: when removal/failover can be destructive (e.g., removing a primary), require safer sequencing (failover first, then remove) or ensure force flags are clearly constrained; optionally support client-safe mitigation like connection draining/rotation.
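The operator-override and admin-guardrail items can be illustrated with a minimal Python sketch. It assumes a hypothetical Node type and remove_node function (neither comes from any real codebase): the safe default refuses to remove a primary, and the forced path requires a written justification and logs a prominent warning.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("admin")


class UnsafeRemovalError(Exception):
    """Raised when the safe default refuses a destructive removal."""


@dataclass
class Node:
    # Hypothetical node record, used only for this illustration.
    name: str
    is_primary: bool


def remove_node(node: Node, force: bool = False, justification: str = "") -> str:
    """Remove a node; removing a primary requires force plus a justification."""
    if not node.is_primary:
        return "removed"
    if not force:
        # Safe default: refuse and point at the safer sequencing.
        raise UnsafeRemovalError(
            f"{node.name} is primary; fail over first, then remove"
        )
    if not justification.strip():
        raise UnsafeRemovalError(
            "forced removal of a primary requires a justification"
        )
    # Forced path: allowed, but loudly recorded for later review.
    log.warning("FORCED removal of primary %s: %s", node.name, justification)
    return "removed-forced"
```

Calling remove_node(Node("n1", True)) raises UnsafeRemovalError, while passing force=True together with a non-empty justification returns "removed-forced". Whether the forced path should also drain or rotate client connections first is left open here.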

Example (pseudocode pattern for non-ack handling):

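function applyTopologyChange(change, targetNode, basedEpoch):
  reqId = newReqId()
  deadline = now() + ACK_TIMEOUT
  attempts = 0

  while now() < deadline and attempts < MAX_RETRIES:
    // Resend on every attempt; reusing reqId lets the target deduplicate
    sendRequest(targetNode, change, basedEpoch, reqId)
    // PER_ATTEMPT_TIMEOUT is an assumed constant; never wait past the overall deadline
    ack = waitForAck(reqId, min(PER_ATTEMPT_TIMEOUT, deadline - now()))
    if ack != null and ack.status == ACKED and ack.epoch == basedEpoch:
      return OK
    attempts++

  // Escalation (defined behavior)
  markEntity(targetNode, DEGRADED, reason="no ack", epoch=basedEpoch)
  propagateUpstreamIfNeeded(targetNode, reason="no ack")
  logError("Operation not acknowledged; escalation to operator required", reqId)

  if operatorOverrideEnabled and operatorConfirmed:
    // Forced path: warn prominently, then apply without acknowledgement
    logWarning("Applying change without acknowledgement (operator override)", reqId)
    return applyForced(change)
  else:
    return ERROR("non-ack timeout; operation refused by contract")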

This turns ambiguous failure modes into deterministic, testable behavior and ensures operators have a clear, warned path to recover when the normal acknowledgement-based workflow cannot complete.
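To make the "deterministic, testable" claim concrete, here is a minimal runnable Python sketch of bounded retries with exponential backoff and a defined escalation state. All names (apply_with_ack, Ack, Result, the send and sleep callables) and the default thresholds are assumptions for illustration only. Injecting send and sleep lets a test drive the non-ack path without real time passing.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Ack:
    epoch: int  # epoch the target node says it acknowledged


@dataclass
class Result:
    ok: bool
    state: str     # "APPLIED", "DEGRADED", or "FORCED"
    attempts: int


def apply_with_ack(
    send: Callable[[int], Optional[Ack]],  # returns None when no ack arrives
    based_epoch: int,
    max_retries: int = 3,
    base_backoff_s: float = 0.1,
    max_backoff_s: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
    force: bool = False,
) -> Result:
    """Send until acknowledged for based_epoch, else escalate deterministically."""
    attempts = 0
    while attempts < max_retries:
        ack = send(based_epoch)
        attempts += 1
        # An ack for a different epoch is treated the same as no ack.
        if ack is not None and ack.epoch == based_epoch:
            return Result(True, "APPLIED", attempts)
        if attempts < max_retries:
            # Exponential backoff, capped at max_backoff_s.
            sleep(min(max_backoff_s, base_backoff_s * 2 ** (attempts - 1)))
    if force:
        # Operator override: the caller has explicitly accepted the risk.
        return Result(True, "FORCED", attempts)
    # Retries stop here; the operation is refused and marked degraded.
    return Result(False, "DEGRADED", attempts)
```

With a send that never acknowledges and sleep replaced by a list's append method, the call returns a DEGRADED result after exactly three attempts, and the recorded sleeps show the backoff schedule. That gives a test two concrete things to assert on: the final state and the retry timing.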
