When optimizing for speed, make the fast path both correct for the actual hardware and lightweight in implementation.

1) Keep kernel support documentation accurate and capability-aware

Example (pattern):

import torch

# Supported FA3 archs reflect what kernels are compiled for:
# sm90 (Hopper), sm89 (Ada), sm80/sm86 (Ampere).
# sm100 (Blackwell): fall back to SDPA until FA3 is recompiled.
_FA3_CAPABILITIES = {(9, 0), (8, 9), (8, 0), (8, 6)}

def _load_flash_attention_3():
    if not torch.cuda.is_available():
        return None
    if torch.cuda.get_device_capability() not in _FA3_CAPABILITIES:
        return None  # unsupported arch (e.g. sm100): caller falls back to SDPA
    # ... import and return the compiled FA3 kernels here ...
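The same capability-gating idea can be sketched without any framework dependency; the capability set and backend names below are illustrative assumptions, not FA3's actual API:

```python
# Compute capabilities the FA3 kernels were compiled for (illustrative;
# mirror whatever your actual build supports).
FA3_CAPABILITIES = {(9, 0), (8, 9), (8, 0), (8, 6)}

def select_attention_backend(capability):
    """Pick a backend name for a (major, minor) compute capability tuple."""
    if capability in FA3_CAPABILITIES:
        return "fa3"
    # Unknown or newer arch (e.g. sm100 / Blackwell): take the portable path.
    return "sdpa"
```

Keeping the supported set explicit means a new architecture degrades gracefully to the fallback instead of crashing on a kernel compiled for different hardware.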

2) Avoid unnecessary overhead in parsing/utility code

Example (pattern):

# checkpoint_files contains only filenames like model_123456.pt
# Convert to int before max(): a lexicographic max over strings would
# wrongly rank "999" above "123456".
last_step = max(int(f.split("_")[-1].split(".")[0]) for f in checkpoint_files)
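Under the stated filename assumption, the pattern looks like this (filenames are hypothetical):

```python
# Hypothetical directory listing; only step numbers differ.
checkpoint_files = ["model_000100.pt", "model_999.pt", "model_123456.pt"]

# Parse each step number to int before comparing, so 123456 beats 999.
last_step = max(int(f.split("_")[-1].split(".")[0]) for f in checkpoint_files)
print(last_step)  # 123456
```

A single generator expression avoids building intermediate lists or compiling a regex for what is a trivial split.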

Result: faster execution with fewer CPU-side bottlenecks, and fewer correctness surprises caused by stale/incorrect hardware assumptions.