When optimizing hot kernels (SIMD/vectorized paths), remove avoidable overhead and improve locality:
1) Unroll small fixed-size tails (e.g., remainder=4)
// Instead of: for (int i=0;i<4;i++){ if(i%2)... }
// Do a straight-line unroll:
{
outptr[0] = ptr0[0];
outptr[1] = ptr1[0];
outptr[2] = ptr0[1];
outptr[3] = ptr1[1];
ptr0 += 2;
ptr1 += 2;
outptr += 4;
}
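As a rough, self-contained sketch of the same idea, the two helpers below write identical output for a fixed remainder of 4. The function names are hypothetical, and the loop body is an assumption about what the elided "if(i%2)..." branch does (alternate between the two source pointers).

```cpp
// Loop form: the parity test is evaluated on every iteration.
// ASSUMPTION: this is what the elided "if(i%2)..." loop computes.
inline void interleave4_loop(const float* ptr0, const float* ptr1, float* outptr)
{
    for (int i = 0; i < 4; i++)
    {
        if (i % 2)
            outptr[i] = ptr1[i / 2];
        else
            outptr[i] = ptr0[i / 2];
    }
}

// Straight-line form: no loop counter and no branch, same result.
inline void interleave4_unrolled(const float* ptr0, const float* ptr1, float* outptr)
{
    outptr[0] = ptr0[0];
    outptr[1] = ptr1[0];
    outptr[2] = ptr0[1];
    outptr[3] = ptr1[1];
}
```

An optimizing compiler may unroll the loop form on its own, so it is worth checking the generated assembly before and after the change.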
2) Use packed/contiguous intermediate buffers in vector code
// Instead of separate small-dim Mats that increase pointer indirections,
// create a single contiguous layout:
offset_blob.create(outw, outh, elemsize * 4, 4, opt.workspace_allocator);
value_blob.create(outw, outh, elemsize * 2, 2, opt.workspace_allocator);
// (Or merge offset+value into one buffer when feasible.)
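The merged variant could look roughly like the sketch below. It is illustrative only: PackedWorkspace is a hypothetical type built on std::vector rather than the Mat API, and the layout (4 offset floats followed by 2 value floats per output pixel) is an assumption modelled on the pack sizes in the calls above.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical merged workspace: one allocation, and per output pixel
// 4 offset floats immediately followed by 2 value floats.
struct PackedWorkspace
{
    enum { kOffsets = 4, kValues = 2, kStride = kOffsets + kValues };

    int w;
    int h;
    std::vector<float> data;

    PackedWorkspace(int outw, int outh)
        : w(outw), h(outh), data((size_t)outw * outh * kStride, 0.f)
    {
    }

    // Pointer to the 4 offsets of pixel (x, y).
    float* offsets(int x, int y)
    {
        return data.data() + ((size_t)y * w + x) * kStride;
    }

    // Pointer to the 2 values of pixel (x, y), directly after its offsets.
    float* values(int x, int y)
    {
        return offsets(x, y) + kOffsets;
    }
};
```

With this layout, everything a pixel needs sits in 24 consecutive bytes, so the inner loop advances a single pointer instead of tracking two separate buffers.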
3) Keep ISA guards and branching structure “compiler-friendly”
// Avoid redundant guards (__AVX2__ implies __AVX__):
// in "#if defined(__AVX__) && defined(__AVX2__)" the __AVX__ test adds nothing.
// Avoid placing elempack==1 scalar handling inside SIMD-only macros in a way that blocks optimization; keep elempack branching clear at the appropriate scope.
4) For reductions, avoid inefficient reduction patterns
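The text does not say which reduction patterns are meant. One common case, assumed here, is doing a horizontal add inside the loop instead of once at the end. The sketch below uses a hypothetical helper, sum_floats, that accumulates lane-wise and reduces after the loop. It is guarded by a single macro because it uses only AVX intrinsics.

```cpp
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper: sums n floats starting at p.
inline float sum_floats(const float* p, int n)
{
    float sum = 0.f;
    int i = 0;
#if defined(__AVX__)
    // Vertical accumulation only: no horizontal add inside the loop.
    __m256 acc = _mm256_setzero_ps();
    for (; i + 7 < n; i += 8)
        acc = _mm256_add_ps(acc, _mm256_loadu_ps(p + i));

    // Single horizontal reduction after the loop: 8 -> 4 -> 2 -> 1 lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s = _mm_add_ps(lo, hi);
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_shuffle_ps(s, s, 0x55));
    sum = _mm_cvtss_f32(s);
#endif
    // Scalar tail; this is also the whole computation when AVX is unavailable.
    for (; i < n; i++)
        sum += p[i];
    return sum;
}
```

The vector path adds elements in a different order than a plain scalar loop, so results for general floating-point input can differ in the last bits.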
These rules target the same bottlenecks discussed: loop/branch overhead on tiny remainders, poor memory locality for intermediates, and avoidable SIMD/ISA/reduction inefficiencies.