When working on LLM inference engines, ensure that (1) determinism tests actually exercise the intended invariants, (2) optional GPU-accelerated kernels are loaded only when the hardware supports them, with a safe fallback otherwise, and (3) KV-cache shape assumptions are documented accurately.
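For point (1), the test below pins the invariant precisely: under greedy decoding (temperature 0.0), the seed must not influence the output, so runs with different seeds have to match exactly. Engine, MockModel, and ByteTokenizer are the names used in the snippet as given.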
prompt = [261, 72, 101, 108, 108, 111]  # 72..111 spell "Hello" in ASCII; 261 is likely a special (e.g. BOS) token
engine = Engine(MockModel(), ByteTokenizer())
r1, _ = engine.generate_batch(prompt, temperature=0.0, max_tokens=5, seed=1)
r2, _ = engine.generate_batch(prompt, temperature=0.0, max_tokens=5, seed=42)
r3, _ = engine.generate_batch(prompt, temperature=0.0, max_tokens=5, seed=123)
assert r1 == r2 == r3  # greedy decoding must be seed-independent
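The assertion above only covers the greedy path. A complementary invariant worth testing is that sampling (temperature > 0) is reproducible for a fixed seed. A minimal sketch, reusing the same engine and assuming generate_batch seeds its sampler from the seed argument:

s1, _ = engine.generate_batch(prompt, temperature=1.0, max_tokens=5, seed=7)
s2, _ = engine.generate_batch(prompt, temperature=1.0, max_tokens=5, seed=7)
assert s1 == s2  # same seed, same sampling path -> identical output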
# Point (2): load the FA3 kernel only on supported hardware, with a safe fallback.
import torch

flash_attn = None  # no fused kernel by default; callers use the standard attention path
if torch.cuda.is_available():
    if torch.cuda.get_device_capability()[0] >= 9:  # Hopper (SM90) or newer
        from kernels import get_kernel  # assuming the Hugging Face `kernels` package
        flash_attn = get_kernel('varunneal/flash-attention-3').flash_attn_interface
    else:
        try:
            import flash_attn_interface as flash_attn  # locally installed fallback, if present
        except ImportError:
            pass  # flash_attn stays None
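For point (3), shape assumptions stay accurate longest when they live next to an assertion rather than only in a docstring. A minimal sketch, assuming a hypothetical per-layer cache laid out as [2, batch, num_kv_heads, max_seq_len, head_dim]; layouts vary by engine, so treat the shape itself as an illustration:

import torch

def check_kv_cache(cache: torch.Tensor, num_kv_heads: int, head_dim: int) -> None:
    # Hypothetical layout: [2, batch, num_kv_heads, max_seq_len, head_dim],
    # where dim 0 indexes (key, value). Adjust to the engine's real layout.
    assert cache.dim() == 5, f"expected 5D KV cache, got shape {tuple(cache.shape)}"
    assert cache.shape[0] == 2, "dim 0 must index (key, value)"
    assert cache.shape[2] == num_kv_heads, "KV heads, not query heads (matters under GQA)"
    assert cache.shape[4] == head_dim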