When implementing performance-critical algorithms (e.g., pooling, math kernels, SIMD/ISA code), preserve the algorithm’s true numeric semantics and parameterization:

1) Match rounding/truncation semantics with the ISA’s exact instruction

Prefer the native intrinsic that implements the intended operation (e.g., truncation toward zero) rather than approximating via integer conversion that may differ.

Example (aarch64):

static inline float32x4_t trunc_ps(const float32x4_t& x)
{
#if __aarch64__
  return vrndq_f32(x); // trunc toward zero
#else
  // fallback, but verify it matches trunc-toward-zero semantics
  int32x4_t xi = vcvtq_s32_f32(x);
  return vcvtq_f32_s32(xi);
#endif
}

2) Don’t “constant-fold” parameters that are mathematically location-dependent

If kernel extents (or other algorithm parameters) vary per output position, compute them per position in the same execution space (shader/CPU) or pass the per-position values explicitly.
For adaptive pooling: kernel_w/kernel_h generally depend on gx/gy (output location), so computing a single constant kernel on CPU and reusing it is incorrect. Keep the dynamic computation consistent with the reference definitions.

3) Add a quick correctness checklist during review

Numeric ops: confirm rounding direction (floor/ceil/trunc) and tie behavior where relevant.
Parameter definition: confirm which variables are constant vs per-location/per-channel.
Kernel wiring: verify parameter order (e.g., bias vs kernel) to avoid silent swaps that still compile.

Applying this standard prevents cross-architecture drift (wrong rounding) and algorithm drift (wrong kernel sizes), which are frequent root causes of subtle accuracy regressions in algorithmic code paths.

Add Repository

Private Repository