Unrolled SIMD Locality

When optimizing hot kernels (SIMD/vectorized paths), remove avoidable overhead and improve locality: 1) Unroll small fixed-size tails (e.g., remainder=4)

copy reviewer prompt

Prompt

Reviewer Prompt

When optimizing hot kernels (SIMD/vectorized paths), remove avoidable overhead and improve locality:

1) Unroll small fixed-size tails (e.g., remainder=4)

  • Replace remainder loops with straight-line stores/loads to eliminate loop counters and per-iteration branches.
  • Example (tail of 4 elements with alternating sources):
    // Instead of: for (int i=0;i<4;i++){ if(i%2)... }
    // Do a straight-line unroll:
    {
      outptr[0] = ptr0[0];
      outptr[1] = ptr1[0];
      outptr[2] = ptr0[1];
      outptr[3] = ptr1[1];
      ptr0 += 2;
      ptr1 += 2;
      outptr += 4;
    }
    

2) Use packed/contiguous intermediate buffers in vector code

  • Prefer layouts that store multiple small fields (e.g., 4 lanes of coeffs/offsets) contiguously to reduce pointer chasing, TLB misses, and cache fragmentation.
  • Example style:
    // Instead of separate small-dim Mats that increase pointer indirections,
    // create a single contiguous layout:
    offset_blob.create(outw, outh, elemsize * 4, 4, opt.workspace_allocator);
    value_blob.create(outw, outh, elemsize * 2, 2, opt.workspace_allocator);
    // (Or merge offset+value into one buffer when feasible.)
    

3) Keep ISA guards and branching structure “compiler-friendly”

  • Use correct feature checks (don’t assume __AVX2__ implies __AVX__): #if defined(__AVX__) && defined(__AVX2__).
  • Avoid placing elempack==1 scalar handling inside SIMD-only macros in a way that blocks optimization; keep elempack branching clear at the appropriate scope.

4) For reductions, avoid inefficient reduction patterns

  • Don’t perform costly “reduce” operations inside tight loops; restructure so reduction is done efficiently (e.g., accumulate vectors, then reduce once).

These rules target the same bottlenecks discussed: loop/branch overhead on tiny remainders, poor memory locality for intermediates, and avoidable SIMD/ISA/reduction inefficiencies.

Source discussions