When optimizing for speed, make the fast path both correct for the actual hardware and lightweight in implementation.
1) Keep kernel support documentation accurate and capability-aware
Example (pattern):
# Supported FA3 archs reflect what kernels are compiled for
# sm90 (Hopper), sm89 (Ada), sm80/sm86 (Ampere)
# sm100 (Blackwell): fallback to SDPA until FA3 is recompiled
import torch

def _load_flash_attention_3():
    if not torch.cuda.is_available():
        return None
    # ... select FA3 vs fallback based on device capability ...
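The capability check above can be sketched without touching CUDA at all, which also makes it testable on CPU-only machines. This is a minimal sketch, assuming the device capability arrives as a `(major, minor)` tuple (as `torch.cuda.get_device_capability` returns); the supported-arch set mirrors the comment above, and the function name is hypothetical.

```python
# Compiled FA3 kernel coverage, per the comment above:
# sm90 (Hopper), sm89 (Ada), sm80/sm86 (Ampere). sm100 (Blackwell) falls back.
SUPPORTED_FA3_CAPS = {(9, 0), (8, 9), (8, 0), (8, 6)}

def select_attention_backend(capability):
    """Return 'fa3' when compiled kernels cover this device, else 'sdpa'."""
    return "fa3" if tuple(capability) in SUPPORTED_FA3_CAPS else "sdpa"
```

Keeping the supported set as data (rather than scattered `if` branches) makes it obvious what to update when FA3 is recompiled for a new arch.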
2) Avoid unnecessary overhead in parsing/utility code
For fixed filename patterns like model_<step>.pt, prefer simple split/string ops over regex (or os.path.basename) when you already operate on plain filenames.
Example (pattern):
# checkpoint_files contains only filenames like model_123456.pt
last_step = max(int(f.split("_")[-1].split(".")[0]) for f in checkpoint_files)
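A self-contained sketch of the same step extraction, with hypothetical filenames for illustration. Note the `int(...)` must sit inside the `max(...)`: a lexicographic max over the raw strings would rank "99" above "100".

```python
def last_checkpoint_step(checkpoint_files):
    """Extract the largest step number from flat filenames like 'model_123456.pt'."""
    # Plain string ops: take the part after the last '_', drop the extension.
    # Comparing numerically avoids the lexicographic-ordering pitfall.
    return max(int(f.split("_")[-1].split(".")[0]) for f in checkpoint_files)

files = ["model_99.pt", "model_100.pt", "model_1200.pt"]
print(last_checkpoint_step(files))  # → 1200
```

For a one-off scan over a few thousand filenames the regex-vs-split difference is small in absolute terms, but in hot paths (e.g. polling a checkpoint directory every step) the simpler code avoids regex-engine overhead entirely.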
Result: faster execution with fewer CPU-side bottlenecks, and fewer correctness surprises caused by stale/incorrect hardware assumptions.