Draft: Use masked load/stores in vectorized assignment loops
Reference issue
Fixes #1777
What does this implement/fix?
Currently, for vectorized dense assignment loops such as `A = B.cwiseAbs2()`, we use packet ops for the eligible portion of the array and then switch to scalar code for the tail end. In some situations we also use scalar code for the first few elements, when the start of the array is not aligned, as in `A.segment(begin,len) = B.segment(begin,len).cwiseAbs2()`. This has two consequences: 1) for smaller arrays the scalar code can comprise an appreciable portion, or the entirety, of the computational cost, and 2) the scalar code uses different implementations of math functions, which raises legitimate concerns about numerical consistency and reproducibility.
This MR uses masked loads/stores to replace the scalar portions of the vectorized assignment evaluators. In these situations we always use a contiguous portion of the packet, so I opted for the `pload_partial` and `pstore_partial` API instead of the overloaded `pload`/`pstore`, which takes a generic mask and is a bit awkward. In the AVX implementation of `pload_partial` and `pstore_partial`, the compiler is able to optimize the usual case where `offset == 0`, so I think the overhead of using masked loads/stores is minimized. Here is a code snippet of the AVX2 mask function:
```cpp
template <typename Scalar>
EIGEN_STRONG_INLINE __m256i avx2_256_partial_mask(const Index n, const Index offset) {
  // if offset == 0 (the most common case), the compiler will eliminate much of this function
  static constexpr int Size = sizeof(Scalar);
  // byte offsets 0,4,...,28 divided by sizeof(Scalar) yield the lane indices
  const __m256i cst_lin =
      _mm256_setr_epi32(0 / Size, 4 / Size, 8 / Size, 12 / Size, 16 / Size, 20 / Size, 24 / Size, 28 / Size);
  __m256i off = _mm256_set1_epi32(static_cast<int>(offset));
  __m256i off_n = _mm256_set1_epi32(static_cast<int>(offset + n));
  __m256i off_gt_lin = _mm256_cmpgt_epi32(off, cst_lin);        // offset > i
  __m256i off_n_gt_lin = _mm256_cmpgt_epi32(off_n, cst_lin);    // offset + n > i
  __m256i mask = _mm256_andnot_si256(off_gt_lin, off_n_gt_lin); // offset + n > i && !(offset > i)
  return mask;
}
```
The general case, where `offset` is not known at compile time, generates this assembly:
```asm
vmovd xmm0, esi
vpbroadcastd ymm0, xmm0
add edi, esi
vmovd xmm1, edi
vpbroadcastd ymm1, xmm1
vmovdqa ymm2, ymmword ptr [rip + .LCPI0_0] # ymm2 = [1,2,3,4,5,6,7,8]
vpcmpgtd ymm0, ymm2, ymm0
vpcmpgtd ymm1, ymm1, ymmword ptr [rip + .LCPI0_1]
vpand ymm0, ymm1, ymm0
ret
```
This is a bit odd: the compiler sees fit to materialize two constants, (0,1,2,3,4,5,6,7) and (1,2,3,4,5,6,7,8), presumably because it rewrites `!(offset > i)` as `i + 1 > offset` so that both comparisons map directly onto `vpcmpgtd`. However, when `offset = 0` is known at compile time, only one comparison is used:
```asm
vmovd xmm0, edi
vpbroadcastd ymm0, xmm0
vpcmpgtd ymm0, ymm0, ymmword ptr [rip + .LCPI1_0]
ret
```
Performance-wise, the biggest impact may be that awkward but common small fixed sizes like `Vector3d` will be fully vectorized. If masked loads/stores are not available on the target, this could lead to a performance degradation, as the generic `pload_partial`/`pstore_partial` implementation looks slow. We'll have to measure that case and maybe use some SFINAE to resolve it.