
Draft: Use masked load/stores in vectorized assignment loops

Reference issue

Fixes #1777

What does this implement/fix?

Currently, for vectorized dense assignment loops, e.g. A = B.cwiseAbs2(), we use packet ops for the eligible portions of the array and then switch to scalar code for the tail end. In some situations, we also use scalar code for the first few elements if the start of the array is not aligned, as is the case for A.segment(begin,len) = B.segment(begin,len).cwiseAbs2(). This has two consequences: 1) for smaller arrays the scalar code can account for an appreciable portion, or even the entirety, of the computational cost, and 2) the scalar code uses different implementations of the math functions, which raises legitimate concerns about numerical consistency and reproducibility.
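For context, here is a self-contained sketch of the loop structure being described, written with plain AVX intrinsics for float rather than Eigen's actual evaluator code (the function name and the alignment arithmetic are illustrative only):

#include <immintrin.h>
#include <cstddef>
#include <cstdint>

// Illustrative only: the structure of a vectorized assignment y[i] = x[i]*x[i].
// A scalar head runs until the destination is 32-byte aligned, a packet body handles
// 8 floats at a time, and a scalar tail handles the remainder. The head and tail are
// the parts this MR replaces with masked packet ops.
void abs2(const float* x, float* y, std::ptrdiff_t n) {
  std::ptrdiff_t head = static_cast<std::ptrdiff_t>(
      (32 - reinterpret_cast<std::uintptr_t>(y) % 32) / sizeof(float) % 8);
  if (head > n) head = n;
  const std::ptrdiff_t body_end = head + (n - head) / 8 * 8;
  for (std::ptrdiff_t i = 0; i < head; ++i) y[i] = x[i] * x[i];      // scalar head (alignment)
  for (std::ptrdiff_t i = head; i < body_end; i += 8) {              // packet body
    __m256 v = _mm256_loadu_ps(x + i);
    _mm256_store_ps(y + i, _mm256_mul_ps(v, v));
  }
  for (std::ptrdiff_t i = body_end; i < n; ++i) y[i] = x[i] * x[i];  // scalar tail
}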

This MR uses masked loads/stores to replace the scalar portions of the vectorized assignment evaluators. In these situations, we always use a contiguous portion of the packet. For this reason, I opted to use the pload_partial and pstore_partial API instead of the overloaded pload/pstore, which take a generic mask and are a bit awkward here. In the AVX implementation of pload_partial and pstore_partial, the compiler is able to optimize the usual case where offset == 0, so I think the overhead of using masked loads/stores is minimized. Here is the AVX2 mask-generation function:

template <typename Scalar>
EIGEN_STRONG_INLINE __m256i avx2_256_partial_mask(const Index n, const Index offset) {
  // if offset == 0 (the most common case), the compiler will eliminate much of this function
  static constexpr int Size = sizeof(Scalar);
  const __m256i cst_lin = _mm256_setr_epi32(0 / Size, 4 / Size, 8 / Size, 12 / Size, 16 / Size, 20 / Size, 24 / Size, 28 / Size);
  __m256i off = _mm256_set1_epi32(static_cast<int>(offset));
  __m256i off_n = _mm256_set1_epi32(static_cast<int>(offset + n));
  __m256i off_gt_lin = _mm256_cmpgt_epi32(off, cst_lin);         // offset > i
  __m256i off_n_gt_lin = _mm256_cmpgt_epi32(off_n, cst_lin);     // offset + n > i
  __m256i mask = _mm256_andnot_si256(off_gt_lin, off_n_gt_lin);  // offset + n > i && !(offset > i)
  return mask;
}

The general case where offset is not known at compile time generates this assembly:

        vmovd   xmm0, esi
        vpbroadcastd    ymm0, xmm0
        add     edi, esi
        vmovd   xmm1, edi
        vpbroadcastd    ymm1, xmm1
        vmovdqa ymm2, ymmword ptr [rip + .LCPI0_0] # ymm2 = [1,2,3,4,5,6,7,8]
        vpcmpgtd        ymm0, ymm2, ymm0
        vpcmpgtd        ymm1, ymm1, ymmword ptr [rip + .LCPI0_1]
        vpand   ymm0, ymm1, ymm0
        ret

This is weird: the compiler sees fit to materialize two separate constant vectors, (0,1,2,3,4,5,6,7) and (1,2,3,4,5,6,7,8), instead of one.

However, when offset == 0 is known at compile time, only one comparison is used:

        vmovd   xmm0, edi
        vpbroadcastd    ymm0, xmm0
        vpcmpgtd        ymm0, ymm0, ymmword ptr [rip + .LCPI1_0]
        ret
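To show how the mask is consumed, here is a rough sketch of a partial load/store for float built on the mask function above, using the AVX maskload/maskstore intrinsics. Eigen's actual pload_partial/pstore_partial implementations and signatures may differ; the point is only that the head/tail elements become a single masked load plus a single masked store instead of a scalar loop.

// Sketch only -- builds on avx2_256_partial_mask above and assumes an Index typedef
// (e.g. std::ptrdiff_t). Lane i of the packet maps to from[i - offset], so the pointer
// is rebased by -offset; masked-out lanes are neither read nor written.
__m256 pload_partial_sketch(const float* from, Index n, Index offset) {
  const __m256i mask = avx2_256_partial_mask<float>(n, offset);
  return _mm256_maskload_ps(from - offset, mask);
}

void pstore_partial_sketch(float* to, __m256 x, Index n, Index offset) {
  const __m256i mask = avx2_256_partial_mask<float>(n, offset);
  _mm256_maskstore_ps(to - offset, mask, x);
}

With offset == 0 the mask computation collapses to the single-comparison codegen shown above, so the common tail case costs one compare, one masked load, and one masked store.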

Performance-wise, the biggest impact may be that awkward but common small fixed sizes like Vector3d will be fully vectorized. If native masked loads/stores are not available, this could lead to a performance degradation, as the generic pload_partial/pstore_partial fallback looks slow. We'll have to measure that case and maybe use some SFINAE to resolve it.
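One possible shape for that dispatch, sketched with a made-up trait (has_fast_partial_packet) and a made-up kernel hook (assignPartialPacket) rather than Eigen's real internals, and using a trait plus partial specialization rather than SFINAE proper:

#include <cstddef>
#include <type_traits>

using Index = std::ptrdiff_t;

// Hypothetical trait: does this packet type have a fast native partial load/store?
// The real thing would presumably key off the packet traits; default to false here.
template <typename Packet>
struct has_fast_partial_packet : std::false_type {};

// Masked path: one partial packet op covers the remaining n < PacketSize elements.
template <typename Packet, bool UsePartial = has_fast_partial_packet<Packet>::value>
struct dense_assignment_tail {
  template <typename Kernel>
  static void run(Kernel& kernel, Index start, Index end) {
    kernel.template assignPartialPacket<Packet>(start, end - start);  // hypothetical hook
  }
};

// Fallback: keep the current behavior, a scalar loop over the tail.
template <typename Packet>
struct dense_assignment_tail<Packet, false> {
  template <typename Kernel>
  static void run(Kernel& kernel, Index start, Index end) {
    for (Index i = start; i < end; ++i) kernel.assignCoeff(i);
  }
};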

Additional information

