Prompt
Adopt a single error-handling standard across LLM/provider layers:
1) Retry only transient failures (including connection jitter) with bounded, classified backoff
- Centralize retry behavior in shared utilities (e.g., @retry / retry_or_fallback) using error classification.
- Treat connection/network/DNS and rate-limit/server errors as retryable.
- Ensure the last attempt returns/raises a clear max-retry outcome (so callers never “fall through” without an error).
- Don’t silently support unsupported patterns: fail fast for generator/streaming functions (retrying mid-stream is unsafe).
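The rules above can be sketched as a single classified-retry decorator. This is a minimal illustration, not the project's actual utility: the names `is_retryable` and `RETRYABLE_MARKERS`, the string-based classification, and the backoff parameters are all assumptions; the points it demonstrates (retry only transient errors, bounded backoff, a clear max-retry raise, fail fast on generator functions) come from the bullets.

```python
import inspect
import random
import time
from functools import wraps

# Assumed classification: markers for network/DNS/rate-limit/server errors.
RETRYABLE_MARKERS = ("connection", "timeout", "dns", "429", "rate limit", "502", "503")

def is_retryable(exc: Exception) -> bool:
    """Classify an error as transient (network/DNS/rate-limit/server)."""
    msg = str(exc).lower()
    return any(marker in msg for marker in RETRYABLE_MARKERS)

def retry(func=None, *, max_attempts=3, base_delay=0.5):
    """Retry transient failures with bounded exponential backoff + jitter.

    Fails fast at decoration time for generator functions, since
    re-invoking a partially consumed stream is unsafe.
    """
    def decorate(f):
        if inspect.isgeneratorfunction(f) or inspect.isasyncgenfunction(f):
            raise TypeError(f"@retry does not support generator function {f.__name__}")

        @wraps(f)
        def wrapper(*args, **kwargs):
            last_exc = None
            for attempt in range(max_attempts):
                try:
                    return f(*args, **kwargs)
                except Exception as exc:
                    if not is_retryable(exc):
                        raise  # non-transient: surface immediately, no retry
                    last_exc = exc
                    if attempt < max_attempts - 1:
                        time.sleep(base_delay * 2 ** attempt + random.uniform(0, 0.1))
            # Last attempt exhausted: raise a clear max-retry outcome so
            # callers never fall through without an error.
            raise RuntimeError(f"max retries ({max_attempts}) exceeded") from last_exc
        return wrapper

    return decorate(func) if func is not None else decorate
```

Classification by exception message is a stand-in; a real implementation would match on exception types or provider status codes.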
2) Keep return/output contracts stable between success and failure
- Never change output types on exception paths (e.g., if success returns a (text, token_count) tuple, failure must return the same shape or be normalized before setting outputs).
3) Standardize failure encoding and checking at boundaries
- If providers encode failure in return values (e.g., prefixed strings), require callers to use the shared prefix constant and an is_error_result-style helper instead of ad-hoc string checks.
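A minimal sketch of that boundary contract, assuming a string-prefix failure encoding; the constant name, its value, and the `handle` caller are illustrative (the source only mandates a shared prefix constant and an is_error_result-style helper):

```python
# Shared constant: every provider that encodes failure in a return
# value must use this prefix (illustrative value).
ERROR_PREFIX = "**ERROR**: "

def is_error_result(result) -> bool:
    """Boundary check: detect provider-encoded failures without ad-hoc string matching."""
    return isinstance(result, str) and result.startswith(ERROR_PREFIX)

# Caller side: check via the helper, never via hand-rolled substrings.
def handle(result):
    if is_error_result(result):
        return None  # or route to fallback/logging
    return result
```

Centralizing the prefix and the check means a change to the failure encoding touches one module, not every call site.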
Example (type-stable normalization before setting output):
try:
    transcription = seq2txt_mdl.transcription(tmp_path)
    # success may be tuple(text, token_count)
    txt = transcription[0] if isinstance(transcription, tuple) else transcription
except Exception as e:
    logging.warning(f"Transcription failed: {e}")
    txt = ""
self.set_output("text", txt)
Example (standard retry rule for transient polling blips):
@retry
def _describe_task_status(self, req):
    return self.client.DescribeTaskStatus(req)

while retries < max_retries:
    resp = self._describe_task_status(req)  # transient 429/5xx/DNS blips survive
Result: fewer hidden failure modes, consistent caller behavior, and reliable recovery for transient LLM/API issues.