When optimizing for CPU/GPU speed, don’t accidentally remove fast paths via build/toolchain settings, and don’t add expensive memory-access patterns in the hottest kernels.
Apply this standard:
1) Don’t globally disable ISA features or runtime CPU dispatch from generic toolchain files (see the dispatch sketch after this list)
2) In SIMD kernels, avoid unnecessary gather/indirection for contiguous data
3) Keep SIMD tail handling simple and width-driven; avoid nn/offset-tail bookkeeping
4) Don’t assume #pragma omp simd will outperform hand-written intrinsics in hot paths (see the contrast sketch after the tail-loop example)
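For item 1, a minimal sketch of what runtime CPU dispatch looks like; the kernel names here are hypothetical and `__builtin_cpu_supports` is the GCC/Clang builtin, shown only as an assumption about how such dispatch is typically implemented. A toolchain file that globally strips the AVX2 flags (or the project's dispatch option) makes the fast branch unreachable:
#include <cstdio>

// Hypothetical kernel variants; in a real project each variant lives in its
// own translation unit compiled with the matching flags (e.g. -mavx2).
static void add_one_scalar(const float* src, float* dst, int n)
{
    for (int i = 0; i < n; i++)
        dst[i] = src[i] + 1.f;
}

static void add_one_avx2(const float* src, float* dst, int n)
{
    // Placeholder body; a real variant would use AVX2 intrinsics.
    add_one_scalar(src, dst, n);
}

static void add_one(const float* src, float* dst, int n)
{
    // Runtime dispatch: pick the widest implementation this CPU supports.
    // If the AVX2 variant is never built, this branch silently disappears.
#if defined(__GNUC__) || defined(__clang__)
    if (__builtin_cpu_supports("avx2"))
    {
        add_one_avx2(src, dst, n);
        return;
    }
#endif
    add_one_scalar(src, dst, n);
}

int main()
{
    const float src[4] = { 1.f, 2.f, 3.f, 4.f };
    float dst[4];
    add_one(src, dst, 4);
    printf("%f %f %f %f\n", dst[0], dst[1], dst[2], dst[3]);
    return 0;
}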
Example (contiguous access vs gather):
// Bad: gather when srcptr is contiguous and offsets are sequential
// __m256 v00_val = mask_gather_ps256(srcptr, v00_offset, mask);
// Good: use vector load directly when indices are contiguous
// (illustrative; exact mask logic depends on bounds)
__m256 v00_val = _mm256_loadu_ps(srcptr + base_index);
// apply mask/lerp/fma as needed
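A fuller, self-contained sketch of the same idea, with illustrative names; the commented-out gather uses `_mm256_i32gather_ps`, the standard AVX2 gather intrinsic, purely for contrast:
#include <immintrin.h>

// Copy one contiguous row (illustrative). When the eight indices are known
// to be x, x+1, ..., x+7, a plain unaligned load replaces a gather.
void copy_row(const float* srcptr, float* dstptr, int w)
{
    int x = 0;
#if __AVX__
    for (; x + 7 < w; x += 8)
    {
        // Slow for contiguous data: gather with sequential indices
        // __m256i idx = _mm256_setr_epi32(x, x + 1, x + 2, x + 3, x + 4, x + 5, x + 6, x + 7);
        // __m256 v = _mm256_i32gather_ps(srcptr, idx, sizeof(float));

        // Fast: a plain unaligned load covers the same eight elements
        __m256 v = _mm256_loadu_ps(srcptr + x);
        _mm256_storeu_ps(dstptr + x, v);
    }
#endif
    for (; x < w; x++)
        dstptr[x] = srcptr[x];
}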
Example (width-driven tail loop instead of nn bookkeeping):
int x = 0;
#if __SSE2__
#if __AVX__
for (; x + 7 < grid_size; x += 8) {
    // AVX body
}
#endif
for (; x + 3 < grid_size; x += 4) {
    // SSE body
}
#endif
for (; x < grid_size; x++) {
    // scalar remainder
}
Net effect: you preserve the compiler's and runtime's ability to select the fastest supported implementation, and you avoid the common “hidden” slowdowns caused by gather/indirection and overly complex SIMD tails.