Fixes #2714: contractions fail with SEGV when computing non linear scalars such as autodiff
Reference issue
What does this implement/fix?
This MR is intended to fix a bug in Tensor contractions when using non-linear scalars or scalars that need initialization/finalization. The issue affects contractions running on DefaultDevice and ThreadPoolDevice, causing segmentation faults and memory leaks.
It turns out that the current contraction code uses a raw memory chunk to allocate the operands, correctly deallocating this block after the computation. However, with exotic scalars such as Eigen::AutoDiffScalar, each element is effectively a graph holding pointers to other memory locations, so the elements need to be initialized before and finalized after the contraction computation.
The raw memory approach is still relevant for performance: it provides aligned memory that allows vectorization.
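As a rough standalone illustration (not code from this MR, and using std::malloc only to mimic the raw-chunk approach), the following shows why writing into unconstructed storage is undefined for such scalars and how placement new plus an explicit destructor call avoids it:

#include <unsupported/Eigen/AutoDiff>
#include <cstdlib>
#include <new>

using AD = Eigen::AutoDiffScalar<Eigen::VectorXd>;

int main() {
  void* raw = std::malloc(sizeof(AD));  // raw chunk, standing in for the contraction's block memory
  AD* a = static_cast<AD*>(raw);
  // *a = AD(1.0);    // undefined behavior: operator= uses the unconstructed derivatives vector
  new (raw) AD(1.0);  // placement new runs the constructor inside the pre-allocated storage
  a->~AD();           // explicit destructor call; the raw chunk itself is freed separately
  std::free(raw);
  return 0;
}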
This MR fixes this issue by performing two additional steps:
- initializing the scalars by explicitly calling their constructors
- finalizing the scalars by explicitly invoking their destructors
Note that these two steps must not allocate or deallocate the memory themselves: the device has already allocated the block, so we cannot allocate it again, and we cannot free it ourselves either, since the device will deallocate it later using the correct alignment offset.
As a solution, the fix uses placement new and explicit destructor calls. These two actions are implemented in the TensorContractionKernel struct:
template <typename ResScalar, typename LhsScalar, typename RhsScalar,
          typename StorageIndex, typename OutputMapper, typename LhsMapper,
          typename RhsMapper>
struct TensorContractionKernel {
  // ...

  EIGEN_DEVICE_FUNC void initialize_block(BlockMemHandle block) {
    // ...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block, RhsBlock* rhs_block) {
    BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
    initialize_block(result);
    return result;
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocateSlices(
      Device& d, const StorageIndex num_lhs, const StorageIndex num_rhs,
      const StorageIndex num_slices, std::vector<LhsBlock>* lhs_blocks,
      std::vector<RhsBlock>* rhs_blocks) {
    BlockMemHandle result = BlockMemAllocator::allocateSlices(
        d, bm, bk, bn, num_lhs, num_rhs, num_slices, lhs_blocks, rhs_blocks);
    initialize_block(result);
    return result;
  }

  EIGEN_DEVICE_FUNC void finalize_block(BlockMemHandle block) {
    // ...
  }

  template <typename Device>
  EIGEN_DEVICE_FUNC void deallocate(Device& d, BlockMemHandle handle) {
    finalize_block(handle);
    BlockMemAllocator::deallocate(d, handle);
  }
};
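The bodies of initialize_block and finalize_block are elided above. The following is only a hedged sketch of the idea, written as free functions: the element counts and the lhs_bytes offset are assumptions about the block layout, not the MR's actual member code.

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>
#include <new>

template <typename LhsScalar, typename RhsScalar, typename StorageIndex>
void initialize_block_sketch(void* block, StorageIndex lhs_count, StorageIndex rhs_count,
                             std::size_t lhs_bytes) {
  char* mem = static_cast<char*>(block);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (StorageIndex i = 0; i < lhs_count; ++i) new (lhs + i) LhsScalar();  // construct in place
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);  // lhs_bytes: assumed offset of the RHS region
    for (StorageIndex i = 0; i < rhs_count; ++i) new (rhs + i) RhsScalar();
  }
}

template <typename LhsScalar, typename RhsScalar, typename StorageIndex>
void finalize_block_sketch(void* block, StorageIndex lhs_count, StorageIndex rhs_count,
                           std::size_t lhs_bytes) {
  char* mem = static_cast<char*>(block);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (StorageIndex i = 0; i < lhs_count; ++i) lhs[i].~LhsScalar();  // destroy, but do not free
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
    for (StorageIndex i = 0; i < rhs_count; ++i) rhs[i].~RhsScalar();
  }
}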
Tests
This MR includes two new tests: test_scalar_initialization (in cxx11_tensor_contraction.cpp) and test_multithread_contraction_with_scalar_initialization (in cxx11_tensor_thread_pool.cpp). Both tests use a simple custom scalar type, InitializableScalar.
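The exact definition of InitializableScalar lives in the test files; purely as a hedged illustration, a scalar of this kind is roughly a type that owns heap memory (so skipping its constructor or destructor crashes or leaks) and that Eigen's generic NumTraits already reports as requiring initialization:

#include <Eigen/Core>

// Illustrative stand-in (not the actual test type): owns heap storage, so it only works
// if its constructors and destructor really run.
struct InitializableScalarSketch {
  InitializableScalarSketch() : value(new float(0.f)) {}
  InitializableScalarSketch(float v) : value(new float(v)) {}
  InitializableScalarSketch(const InitializableScalarSketch& o) : value(new float(*o.value)) {}
  InitializableScalarSketch& operator=(const InitializableScalarSketch& o) { *value = *o.value; return *this; }
  ~InitializableScalarSketch() { delete value; }
  InitializableScalarSketch operator*(const InitializableScalarSketch& o) const { return InitializableScalarSketch(*value * *o.value); }
  InitializableScalarSketch operator+(const InitializableScalarSketch& o) const { return InitializableScalarSketch(*value + *o.value); }
  InitializableScalarSketch& operator+=(const InitializableScalarSketch& o) { *value += *o.value; return *this; }
  float* value;
};

// For a class type like this, Eigen's generic NumTraits already yields
// NumTraits<InitializableScalarSketch>::RequireInitialization == 1, which is the flag the fix checks.
int main() {
  InitializableScalarSketch a(1.f), b(2.f);
  InitializableScalarSketch c = a * b + a;  // 1*2 + 1 == 3
  return 0;
}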
Additional information
- Contractions of plain scalars such as float or double are not affected by the initialization/finalization steps, because the fix checks NumTraits<LhsScalar>::RequireInitialization.
- Both before and with this MR, contractions of exotic scalars do not compile on GPU devices.
Merge request reports
Activity
- Resolved by Luiz doleron
Very cool. Perhaps we should test with this?
https://gitlab.com/libeigen/eigen/-/blob/master/test/AnnoyingScalar.h
added 1 commit
- 7dcedd4c - fix multi-blocks/multi-slices + annoying tests
added 6 commits
- 8f858a4e - 1 commit from branch libeigen:master
- abf3f02f - init & destroy scalars when RequireInitialization
- 0408f0d3 - added test for contraction on default device
- 93077496 - test contraction+requiredinitialization+threadpool
- 2dd9385e - fix multi-blocks/multi-slices + annoying tests
- 93217340 - fix commentary
added 7 commits
- 6b20ef53 - init & destroy scalars when RequireInitialization
- 5fb7889e - added test for contraction on default device
- e6b7d44b - test contraction+requiredinitialization+threadpool
- 7dcedd4c - fix multi-blocks/multi-slices + annoying tests
- 63a3f7fe - fix commentary
- e86ae6fc - init/finalizing scalars on shard threading
- 0c16e596 - Merge branch 'fix-contraction-autodiff' of...
added 1 commit
- 1e5eac48 - Example of Tensor + Autodiff to train MLP network
Added an example of training an MLP network using only Eigen::Tensor and Eigen::AutoDiff:

template <typename T, typename Device>
T loop(const Device &device, const Eigen::Tensor<T, 2> &_TRUE, const Eigen::Tensor<T, 2> &_X,
       Eigen::Tensor<T, 2> &_W0, Eigen::Tensor<T, 2> &_W1, T learning_rate) {
  // converting tensors to autodiff
  Eigen::Tensor<AutoDiff_T, 2> TRUE = convert(_TRUE);
  Eigen::Tensor<AutoDiff_T, 2> X = convert(_X);
  Eigen::Tensor<AutoDiff_T, 2> W0 = convert(_W0, _W0.size() + _W1.size());
  Eigen::Tensor<AutoDiff_T, 2> W1 = convert(_W1, _W0.size() + _W1.size(), W0.size());

  // forward pass
  const Eigen::array<Eigen::IndexPair<int>, 1> contract_dims = {Eigen::IndexPair<int>(1, 0)};

  // Hidden layer
  Eigen::Tensor<AutoDiff_T, 2> Z0(X.dimension(0), W0.dimension(1));
  Z0.device(device) = X.contract(W0, contract_dims);
  Eigen::Tensor<AutoDiff_T, 2> Y0 = ReLU(Z0);

  // Output Layer
  Eigen::Tensor<AutoDiff_T, 2> Z1(Y0.dimension(0), W1.dimension(1));
  Z1 = Y0.contract(W1, contract_dims);
  auto Y1 = softmax(Z1);

  AutoDiff_T LOSS = categorical_cross_entropy(TRUE, Y1);

  // backward pass
  auto gradients = unpack_gradients(LOSS, W0, W1);
  auto grad0 = std::get<0>(gradients);
  auto grad1 = std::get<1>(gradients);

  // update pass
  _W0 = _W0 - grad0 * grad0.constant(learning_rate);
  _W1 = _W1 - grad1 * grad1.constant(learning_rate);

  T result = LOSS.value();
  return result;
}
Running the big cross compilation test https://gitlab.com/libeigen/eigen_ci_cross_testing/-/pipelines/1001885726 to see if there are any annoying compiler warnings or regressions. The fixes appear to be sensible and the code is easy to understand. @cantonios owns the tensor portions of Eigen, so I'll defer to his review.
Diff context:

  template <typename Device>
  EIGEN_DEVICE_FUNC BlockMemHandle allocate(Device& d, LhsBlock* lhs_block,
                                            RhsBlock* rhs_block) {
-   return BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   BlockMemHandle result = BlockMemAllocator::allocate(d, bm, bk, bn, lhs_block, rhs_block);
+   initialize_block(result);

Hi @cantonios, thank you for the feedback. This MR was intended to fix contractions on default and TP devices. In my tests, contractions of AutoDiffScalar on GPU devices don't compile. Maybe fixing these contractions on GPU requires more sophisticated work.
For the case of TP/default devices, the suggestion here was to run the initialization/finalization in TensorContractionKernel because this looks like a good balance between type specialization and cohesion. It turns out that making the device initialize/finalize by itself would require the device to know the operand layout. One way to achieve that without introducing high coupling between the device type and the operation type is to pass the operand layout as an optional parameter to device.allocate() or device.deallocate().

If I'm not missing something, my suggestion would be to initialize/finalize the scalars in TensorContractionKernel (or perhaps in TensorContractionBlockMemAllocator?). Once the contractions are running on TP/default devices, we can move on to tackle the issues on GPU devices. What do you think?

We can't just hack it so that CPU contractions work. The system still needs to be designed consistently. Otherwise we'll need to redesign the whole thing in order to get the GPU (or other custom devices like SYCL) to work.
The purpose of having a Device is to handle these device-specific differences. The device itself shouldn't need to know the layout - only the number of elements and the type. We probably need something like

template <typename T> T* allocate(size_t count);
template <typename T> void deallocate(T* ptr);

that on CPU (including the thread pool) will just call new/delete if the type requires initialization - something similar to what we already do in Memory.h. You would then need to modify BlockMemAllocator to use this - that's the thing that controls the layout.

changed this line in version 8 of the diff
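As a rough sketch of that suggestion (the names and signatures here are illustrative, not the interface that was merged), the conditional construction/destruction already exists in Eigen's Memory.h helpers, so a CPU device could simply forward to them:

#include <Eigen/Core>
#include <cstddef>

struct CpuDeviceSketch {
  template <typename T>
  T* allocate(std::size_t count) const {
    // allocates aligned memory and default-constructs the elements only when
    // NumTraits<T>::RequireInitialization is set
    return Eigen::internal::conditional_aligned_new_auto<T, true>(count);
  }
  template <typename T>
  void deallocate(T* ptr, std::size_t count) const {
    // note: the element count is needed here so the right number of destructors run;
    // this point comes up again later in the thread
    Eigen::internal::conditional_aligned_delete_auto<T, true>(ptr, count);
  }
};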
Hi, as asked, I implemented the initialization of scalars on the device:

struct DefaultDevice {
  // ...
  template <typename T>
  void* allocate_elements(size_t num_elements) const {
    const size_t element_size = sizeof(T);
    size_t num_bytes = num_elements * element_size;       // calculating size
    size_t align = numext::maxi(EIGEN_MAX_ALIGN_BYTES, 1);
    num_bytes = divup<Index>(num_bytes, align) * align;    // aligning
    void* result = allocate(num_bytes);                    // allocating
    if (NumTraits<T>::RequireInitialization) {             // initializing when necessary
      char* mem_pos = reinterpret_cast<char*>(result);
      for (size_t i = 0; i < num_elements; ++i) {
        new (mem_pos) T();
        mem_pos += element_size;
      }
    }
    return result;
  }
  // ...
};
This code also passes the tests:
struct TensorContractionBlockMemAllocator {
  template <typename Device>
  static BlockMemHandle allocate(...) {
    BlockSizes sz = ComputeLhsRhsBlockSizes(bm, bk, bn);  // calculates alignment, block and memory sizes
    char* block_mem = static_cast<char*>(
        d.template allocate_elements<LhsScalar>(sz.lhs_count + sz.rhs_count));  // allocates memory using only LhsScalar as the type
    *lhs_block = static_cast<LhsScalar*>(static_cast<void*>(block_mem));
    *rhs_block = static_cast<RhsScalar*>(static_cast<void*>(block_mem + sz.lhs_size));  // sz.lhs_size is computed by the host
    return block_mem;
  }
  // ...
};
Note that:
- Since we are initializing all the data as a single type, the memory is initialized using only LhsScalar.
- On the host side, ComputeLhsRhsBlockSizes computes the memory size and alignment, which are device-dependent concerns in my understanding.
Because of the previous two issues, I implemented a secondary approach:
struct TensorContractionBlockMemAllocator {
  template <typename Device>
  EIGEN_DEVICE_FUNC static BlockMemHandle allocate(Device& d, const Index bm, const Index bk,
                                                   const Index bn, LhsScalar** lhs_block,
                                                   RhsScalar** rhs_block) {
    std::vector<void*> blocks;
    Index left_block_count = bm * bk;
    Index right_block_count = bn * bk;
    char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(
        left_block_count, right_block_count, blocks));
    *lhs_block = static_cast<LhsScalar*>(blocks[0]);
    *rhs_block = static_cast<RhsScalar*>(blocks[1]);
    return block_mem;
  }
};
This last version solves the two previous issues:
- It uses both LhsScalar and RhsScalar to allocate/initialize the data.
- The device does all the calculations of memory size and alignment (a hedged device-side sketch follows below).
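Purely as an illustration of this division of responsibilities (this is not the MR's code; the helper name and the offset math are assumptions), a device-side allocate_blocks for the simple two-block case could look like:

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>
#include <new>
#include <vector>

template <typename LhsScalar, typename RhsScalar>
void* allocate_blocks_sketch(const Eigen::DefaultDevice& d, std::size_t lhs_count,
                             std::size_t rhs_count, std::vector<void*>& blocks) {
  const std::size_t align = Eigen::numext::maxi<std::size_t>(EIGEN_MAX_ALIGN_BYTES, 1);
  // the device, not the caller, rounds each typed region up to the alignment boundary
  const std::size_t lhs_bytes = Eigen::divup(lhs_count * sizeof(LhsScalar), align) * align;
  const std::size_t rhs_bytes = Eigen::divup(rhs_count * sizeof(RhsScalar), align) * align;
  char* mem = static_cast<char*>(d.allocate(lhs_bytes + rhs_bytes));

  LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
  RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization)
    for (std::size_t i = 0; i < lhs_count; ++i) new (lhs + i) LhsScalar();
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization)
    for (std::size_t i = 0; i < rhs_count; ++i) new (rhs + i) RhsScalar();

  blocks.push_back(lhs);  // the caller only receives ready-to-use typed pointers
  blocks.push_back(rhs);
  return mem;
}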
We have a slightly more complex call for the ThreadPool use case. In the multithreaded contraction, a single memory block is sharded into several parts so that multiple threads can compute their jobs without blocking each other. The following call covers this scenario:
template <typename LhsScalar, typename RhsScalar>
struct TensorContractionBlockMemAllocator {
  // ...
  template <typename Device>
  static BlockMemHandle allocateSlices(Device& d, const Index bm, const Index bk, const Index bn,
                                       const Index num_lhs, const Index num_rhs,
                                       const Index num_slices,
                                       std::vector<LhsScalar*>* lhs_blocks,
                                       std::vector<RhsScalar*>* rhs_blocks) {
    std::vector<void*> blocks;
    char* block_mem = static_cast<char*>(d.template allocate_blocks<LhsScalar, RhsScalar>(
        bm * bk, bn * bk, blocks, num_lhs, num_rhs, num_slices));
    Index blocks_index = 0;
    for (Index slice = 0; slice < num_slices; slice++) {
      if (num_lhs > 0) lhs_blocks[slice].resize(num_lhs);
      for (Index m = 0; m < num_lhs; m++) {
        void* block_memory = blocks[blocks_index++];
        lhs_blocks[slice][m] = static_cast<LhsScalar*>(block_memory);
      }
      if (num_rhs > 0) rhs_blocks[slice].resize(num_rhs);
      for (Index n = 0; n < num_rhs; n++) {
        void* block_memory = blocks[blocks_index++];
        rhs_blocks[slice][n] = static_cast<RhsScalar*>(block_memory);
      }
    }
    return block_mem;
  }
};
Note again that the memory positions in blocks are computed by the device. The deallocation side uses the same ideas:
~EvalParallelContext() {
  kernel_.deallocate(device_, packed_mem_, nm0_, nn0_, std::min<Index>(nk_, P - 1));
  if (parallelize_by_sharding_dim_only_) {
    const int num_worker_threads = device_.numThreadsInPool();
    if (shard_by_col_) {
      Index num_blocks = num_worker_threads * gn_;
      kernel_.deallocate(device_, thread_local_pre_alocated_mem_, 0, num_blocks, 1);
    } else {
      Index num_blocks = num_worker_threads * gm_;
      kernel_.deallocate(device_, thread_local_pre_alocated_mem_, num_blocks, 0, 1);
    }
  }
}
and
template <typename LhsScalar, typename RhsScalar>
struct TensorContractionBlockMemAllocator {
  // ...
  template <typename Device>
  EIGEN_DEVICE_FUNC static void deallocate(Device& d, BlockMemHandle handle, const Index bm,
                                           const Index bk, const Index bn,
                                           const Index num_lhs = 1, const Index num_rhs = 1,
                                           const Index num_slices = 1) {
    d.template deallocate_blocks<LhsScalar, RhsScalar>(handle, bm * bk, bn * bk, num_lhs, num_rhs,
                                                       num_slices);
  }
};
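For symmetry, an equally hypothetical sketch of a matching device-side deallocate_blocks (again assuming the simple two-block layout and the same offset math as the allocation sketch above): run the destructors for each typed region when required, then free the single underlying allocation.

#include <unsupported/Eigen/CXX11/Tensor>
#include <cstddef>

template <typename LhsScalar, typename RhsScalar>
void deallocate_blocks_sketch(const Eigen::DefaultDevice& d, void* handle,
                              std::size_t lhs_count, std::size_t rhs_count) {
  const std::size_t align = Eigen::numext::maxi<std::size_t>(EIGEN_MAX_ALIGN_BYTES, 1);
  const std::size_t lhs_bytes = Eigen::divup(lhs_count * sizeof(LhsScalar), align) * align;
  char* mem = static_cast<char*>(handle);
  if (Eigen::NumTraits<LhsScalar>::RequireInitialization) {
    LhsScalar* lhs = reinterpret_cast<LhsScalar*>(mem);
    for (std::size_t i = 0; i < lhs_count; ++i) lhs[i].~LhsScalar();  // destroy, do not free
  }
  if (Eigen::NumTraits<RhsScalar>::RequireInitialization) {
    RhsScalar* rhs = reinterpret_cast<RhsScalar*>(mem + lhs_bytes);
    for (std::size_t i = 0; i < rhs_count; ++i) rhs[i].~RhsScalar();
  }
  d.deallocate(handle);  // the device frees the whole block with the correct alignment offset
}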
I included the two alternatives in the last commit. One can control which one to use by defining __USING_SINGLE_TYPE_CONTRACTIONS__. Let me know if one of the alternatives works; if so, I will remove the other alternative and the __USING_SINGLE_TYPE_CONTRACTIONS__ flag.

PS: I have included a new test for the case of ThreadPool using a custom allocator.
Edited by Luiz doleron
Hi @cantonios, is there anything I can do to help move this review forward? Based on your last feedback, I provided two code alternatives. Is either of them good? PS: sorry for asking, but this subject is very important to my current work!
- Resolved by Luiz doleron
I think we also need the size of the array for the deallocation so we can call the correct number of destructors. That's why I made a custom device with the ptr and sizes of each array. A generic fix requires a lot of work.
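A hedged sketch (not code from this MR, and ignoring over-alignment concerns for brevity) of the idea mentioned above - a custom allocator that records the element count per allocation so it can run the right number of destructors on deallocation:

#include <cstddef>
#include <new>
#include <unordered_map>

template <typename T>
class CountingAllocatorSketch {
 public:
  T* allocate(std::size_t count) {
    void* mem = ::operator new(count * sizeof(T));
    T* ptr = static_cast<T*>(mem);
    for (std::size_t i = 0; i < count; ++i) new (ptr + i) T();  // construct each element
    counts_[mem] = count;                                       // remember the array size
    return ptr;
  }
  void deallocate(T* ptr) {
    const std::size_t count = counts_.at(ptr);
    for (std::size_t i = 0; i < count; ++i) ptr[i].~T();        // run the right number of destructors
    counts_.erase(ptr);
    ::operator delete(ptr);
  }
 private:
  std::unordered_map<void*, std::size_t> counts_;  // ptr -> number of constructed elements
};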
added 1 commit
- 73946bcd - commit assumes type(LhsScalar) == type(RhsScalar)
added 1 commit
- eb799e85 - moving initialization/finalizaation code to device