Draft: Use masked load/stores in vectorized assignment loops
Reference issue
Fixes #1777
What does this implement/fix?
Currently, for vectorized dense assignment loops, e.g. A = B.cwiseAbs2(), we use packet ops for the eligible portions of the array and then switch to scalar code for the tail end. In some situations, we also use scalar code for the first few elements if the start of the array is not aligned, as is the case for A.segment(begin,len) = B.segment(begin,len).cwiseAbs2(). This has two consequences: 1) for smaller arrays the scalar code can account for an appreciable portion, or even the entirety, of the computational cost, and 2) the scalar code uses different implementations of math functions, which raises some legitimate concerns regarding numerical consistency and reproducibility.
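To make the pattern concrete, here is a minimal hand-written sketch of what such a loop currently amounts to for 8-wide float packets (illustrative only, not Eigen's evaluator code):

#include <immintrin.h>

// Hypothetical stand-in for dst = src.cwiseAbs2() on real floats.
void abs2_assign(float* dst, const float* src, long size) {
  long i = 0;
  for (; i + 8 <= size; i += 8) {           // packet ops for the eligible bulk
    __m256 p = _mm256_loadu_ps(src + i);
    _mm256_storeu_ps(dst + i, _mm256_mul_ps(p, p));
  }
  for (; i < size; ++i) {                   // scalar tail: a separate code path with
    dst[i] = src[i] * src[i];               // potentially different math implementations
  }
}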
This MR uses masked loads/stores to replace the scalar portions of the vectorized assignment evaluators. In these situations, we always use a contiguous portion of the packet. For this reason, I opted to use the pload_partial and pstore_partial API instead of the overloaded pload/pstore, which uses a generic mask and is a bit awkward. In the AVX implementation of pload_partial and pstore_partial, the compiler is able to optimize the usual case where offset == 0. As such, I think the overhead of using masked load/stores is minimized. Here is a code snippet of the AVX2 mask function.
template <typename Scalar>
EIGEN_STRONG_INLINE __m256i avx2_256_partial_mask(const Index n, const Index offset) {
// if offset == 0 (the most common case), the compiler will eliminate much of this function
static constexpr int Size = sizeof(Scalar);
const __m256i cst_lin = _mm256_setr_epi32(0 / Size, 4 / Size, 8 / Size, 12 / Size, 16 / Size, 20 / Size, 24 / Size, 28 / Size);
__m256i off = _mm256_set1_epi32(static_cast<int>(offset));
__m256i off_n = _mm256_set1_epi32(static_cast<int>(offset + n));
__m256i off_gt_lin = _mm256_cmpgt_epi32(off, cst_lin); // offset > i
__m256i off_n_gt_lin = _mm256_cmpgt_epi32(off_n, cst_lin); // offset + n > i
__m256i mask = _mm256_andnot_si256(off_gt_lin, off_n_gt_lin); // offset + n > i && !(offset > i)
return mask;
}
The general case where offset is not known at compile time generates this assembly:
vmovd xmm0, esi
vpbroadcastd ymm0, xmm0
add edi, esi
vmovd xmm1, edi
vpbroadcastd ymm1, xmm1
vmovdqa ymm2, ymmword ptr [rip + .LCPI0_0] # ymm2 = [1,2,3,4,5,6,7,8]
vpcmpgtd ymm0, ymm2, ymm0
vpcmpgtd ymm1, ymm1, ymmword ptr [rip + .LCPI0_1]
vpand ymm0, ymm1, ymm0
ret
This is a bit odd: the compiler sees fit to materialize two constants, (0,1,2,3,4,5,6,7) and (1,2,3,4,5,6,7,8), presumably because it rewrites !(offset > i) as i + 1 > offset so that it can emit a plain vpand instead of vpandn, at the cost of the second constant.
However, when offset = 0 is known at compile time, only one comparison is used.
vmovd xmm0, edi
vpbroadcastd ymm0, xmm0
vpcmpgtd ymm0, ymm0, ymmword ptr [rip + .LCPI1_0]
ret
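For context, here is roughly how such a mask can be consumed on AVX2. This is a minimal sketch assuming float packets, the straightforward lane mapping packet[i] = from[i] for i in [offset, offset + n) (which may differ from the actual pload_partial/pstore_partial semantics; see the offset discussion at the end of this thread), and the Index/EIGEN_STRONG_INLINE definitions used in the snippet above. It is not the MR's literal code, and the wrapper names are made up.

// Hypothetical wrappers around the mask helper shown above.
EIGEN_STRONG_INLINE __m256 pload_partial_avx2_sketch(const float* from, Index n, Index offset) {
  const __m256i mask = avx2_256_partial_mask<float>(n, offset);
  return _mm256_maskload_ps(from, mask);  // unmasked lanes read as zero and cannot fault
}

EIGEN_STRONG_INLINE void pstore_partial_avx2_sketch(float* to, const __m256& from, Index n, Index offset) {
  const __m256i mask = avx2_256_partial_mask<float>(n, offset);
  _mm256_maskstore_ps(to, mask, from);    // only lanes [offset, offset+n) touch memory
}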
Performance-wise, the biggest impact may be that awkward but common small fixed sizes like Vector3d will be fully vectorized. If masked loads/stores are not available, this could lead to a performance degradation, as the generic pload_partial/pstore_partial looks slow. We'll have to measure that case and maybe use some SFINAE to resolve that.
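For comparison, a generic (non-masked) partial load essentially amounts to a scalar copy through an aligned stack buffer followed by a full packet load, roughly along these lines (an illustrative sketch for 8-wide float packets, not Eigen's actual generic implementation):

#include <immintrin.h>

// Hypothetical fallback: copy element by element, then do one full aligned load.
static inline __m256 generic_pload_partial_sketch(const float* from, int n, int offset) {
  alignas(32) float buf[8] = {0.0f};
  for (int i = offset; i < offset + n; ++i)  // scalar loop: the part that looks slow
    buf[i] = from[i - offset];
  return _mm256_load_ps(buf);
}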
Additional information
Activity
requested review from @cantonios and @rmlarsen1
Not with this MR. If find_best_packet chooses the half packet, then the partial_load/store applies to that packet choice. If we do change/replace find_best_packet, this MR should be compatible with that change.
Edited by Charles Schlosser
added 1 commit
- 0e65c19f - blindly implement partial packets in products
added 1 commit
- 4a4be0b3 - blindly implement partial packets in products
added 1 commit
- d5feebd5 - feed monkies more bananas to type random code more quickly
added 1 commit
- e324ffe6 - correctly call etor_product_partial_packet_impl::run
added 1 commit
- 36c9cabb - fix avx mask logic for scalars where sizeof != 4
added 1 commit
- 26668153 - account for generic masked load/stores in unalignedcount test
added 1 commit
- 22f8932a - specify all ops in unaligned_dense_assignment_loop unaligned
added 1 commit
- 6f32ee07 - reinterpret int64_t* to long long* and same with unsigned; specify alined...
added 9 commits
- 6f32ee07...969c31ee - 3 commits from branch libeigen:master
- 206523ec - fix two more avx pload typos
- 29c847d9 - tweak linear vectorized traversal logic
- 214bfbf5 - more partial packet vectorization logic
- bff83dd8 - fix srcPartialPacket logic
- df169e6d - fix reverse partial packet logic
- 4570684a - Merge branch 'master' of https://gitlab.com/libeigen/eigen into partial_packet
added 1 commit
- 5a129ac7 - alignas(alignof(Packet)) -- say that 5 times fast
added 1 commit
- a6757437 - confirming that sse pload causes segfault in partial packet access
added 1 commit
- e96cc2e6 - check size before calling partial packets in assignment loop
added 1 commit
- 821a4205 - initialize partial packet partial redux with partial packet
added 1 commit
- d5927f6f - enforce n > 0 for partial redux partial packet
added 1 commit
- e5870f9a - fix generic mask load/stores, non-avx2 store
@rmlarsen1 can you run this in your TF performance unit test? There are no AVX512 instructions, but that may be useful to determine the effect of the generic partial packet code. thanks!
added 3 commits
- c6db610b - 1 commit from branch libeigen:master
- 2ef2aa61 - delete avx specialization of packet16b load/store
- 661eeab4 - Merge branch 'master' of https://gitlab.com/libeigen/eigen into partial_packet
added 1 commit
- d9972510 - make partial packet product evaluators more succinct
added 6 commits
- d9972510...211c5dfc - 4 commits from branch libeigen:master
- a6ee77f4 - Merge branch 'master' of https://gitlab.com/libeigen/eigen into partial_packet
- 2fc390bd - resolve conflicts
added 5 commits
- 37ba365d...bc57b926 - 4 commits from branch libeigen:master
- 3374ccc5 - Merge branch 'master' of https://gitlab.com/libeigen/eigen into partial_packet
What was the original intent behind the offset parameter in pload/store_partial? I interpreted it as
for(int i = offset; i < offset + n; i++) to[i] = from[i];
Currently, we do
for(int i = offset; i < offset + n; i++) to[i] = from[i-offset];
The current interpretation requires a swizzle once the elements are loaded into a packet (see the sketch after this comment). Altivec requires the swizzle (a shift) no matter what interpretation is used.
Do you have a specific application in mind for the offset?
Edited by Charles Schlosser
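To make the swizzle concrete, here is a hedged AVX2 float sketch of a partial load under the current interpretation (to[i] = from[i-offset]). It reuses avx2_256_partial_mask from the description above, the function name is made up, and the lane shift via _mm256_permutevar8x32_ps is the swizzle in question:

// Hypothetical, not the MR's code: load from[0..n) and place it in lanes
// [offset, offset+n) of the packet, as the current semantics require.
EIGEN_STRONG_INLINE __m256 pload_partial_current_semantics_sketch(const float* from, Index n, Index offset) {
  // load the n source values into lanes 0..n-1 (mask built with offset = 0)
  __m256 raw = _mm256_maskload_ps(from, avx2_256_partial_mask<float>(n, 0));
  // swizzle: shift the loaded values up by offset lanes so that lane i holds from[i - offset]
  const __m256i lin = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
  const __m256i idx = _mm256_sub_epi32(lin, _mm256_set1_epi32(static_cast<int>(offset)));
  return _mm256_permutevar8x32_ps(raw, idx);  // lanes outside [offset, offset+n) are don't-care
}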
mentioned in issue #1777