When optimizing hot kernels (SIMD/vectorized paths), remove avoidable overhead and improve locality:

1) Unroll small fixed-size tails (e.g., a remainder of 4 elements) instead of looping over them

2) Use packed/contiguous intermediate buffers in vector code

3) Keep ISA guards and branching structure “compiler-friendly”: check the ISA once outside the hot loop, and keep data-dependent branches out of the loop body

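4) For reductions, avoid patterns that serialize the work, such as a single accumulator that carries a dependency through every iteration or a horizontal add inside the loop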
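
Rules 1 and 4 can be sketched together in portable C++. This is a minimal illustration, not a tuned kernel: the function name `sum_f32`, the block width of 8, and the choice of four accumulators are assumptions made for the example. Whether the compiler actually emits SIMD instructions for it depends on the target and flags.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (name and widths are illustrative): sums n floats.
float sum_f32(const float* x, std::size_t n) {
    assert(x != nullptr || n == 0);

    // Four independent accumulators: no single add has to wait on the
    // result of the previous iteration's add to the same variable.
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;

    std::size_t i = 0;
    // Main body: 8 elements per iteration.
    for (; i + 8 <= n; i += 8) {
        a0 += x[i + 0] + x[i + 4];
        a1 += x[i + 1] + x[i + 5];
        a2 += x[i + 2] + x[i + 6];
        a3 += x[i + 3] + x[i + 7];
    }

    // Fixed-size tail of 4: straight-line code, no loop counter or
    // per-element branch.
    if (n - i >= 4) {
        a0 += x[i + 0];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
        i += 4;
    }

    // Last 0..3 elements: one switch with fallthrough instead of a loop.
    float tail = 0.0f;
    switch (n - i) {
        case 3: tail += x[i + 2]; [[fallthrough]];
        case 2: tail += x[i + 1]; [[fallthrough]];
        case 1: tail += x[i + 0]; [[fallthrough]];
        default: break;
    }

    // Combine the partial sums once, outside the loop.
    return ((a0 + a1) + (a2 + a3)) + tail;
}
```

Splitting the sum across accumulators changes the order of floating-point additions, so the result can differ in the last bits from a strictly left-to-right sum. That is usually acceptable in a hot kernel, but it is worth checking against the accuracy requirements of the caller.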

These rules target the bottlenecks discussed earlier: loop/branch overhead on tiny remainders, poor memory locality for intermediates, and avoidable SIMD/ISA/reduction inefficiencies.
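
For rule 2, the sketch below shows one way a packed intermediate buffer might look. Everything here is hypothetical and chosen only for illustration: the `Sample` layout, the function name `weighted_spread`, and the caller-provided `scratch` vector. The strided struct is read once, the intermediate products are written to a contiguous array, and the two later passes then run over unit-stride memory.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical array-of-structs input: the fields of interest are
// interleaved with data the kernel does not need.
struct Sample {
    float value;
    float weight;
    int   id;
};

// Returns the sum of squared deviations of value * weight from their mean.
// `scratch` is a reusable, caller-owned buffer so the hot path does not
// have to allocate on every call once it has grown to size.
float weighted_spread(const std::vector<Sample>& in, std::vector<float>& scratch) {
    const std::size_t n = in.size();
    if (n == 0) return 0.0f;

    scratch.resize(n);
    float* tmp = scratch.data();

    // Pass 1: strided reads, contiguous writes of the intermediate.
    for (std::size_t i = 0; i < n; ++i) {
        tmp[i] = in[i].value * in[i].weight;
    }

    // Pass 2: unit-stride read of the packed intermediate.
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += tmp[i];
    }
    const float mean = sum / static_cast<float>(n);

    // Pass 3: the same packed buffer is reused, again with unit stride.
    float spread = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        const float d = tmp[i] - mean;
        spread += d * d;
    }
    return spread;
}
```

The packing pass is extra work, so this layout tends to pay off when the intermediate is read more than once, as in passes 2 and 3 here. For a single-pass computation, fusing the stages and skipping the buffer may be the better choice.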